-
Notifications
You must be signed in to change notification settings - Fork 1.1k
perf: reduce DB calls to GetWorkspaceByAgentID via caching workspace info
#20662
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: Callum Styan <callumstyan@gmail.com>
GetWorkspaceByAgentID for agent metadata updates Signed-off-by: Callum Styan <callumstyan@gmail.com>
GetWorkspaceByAgentID via caching workspace information in Agent APIGetWorkspaceByAgentID via caching workspace information in Agent API
coderd/agentapi/api.go
Outdated
| AutostartSchedule sql.NullString | ||
| DormantAt sql.NullTime |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These can change in flight. Unsure if we need these to be up to date with any change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we actually need the AutostartSchedule for the statsAPI, and while it in theory can change after the workspace has started it seems like it will change infrequently enough that the 5m refresh time is okay here
| if err != nil { | ||
| a.opts.Log.Warn(ctx, "failed to refresh cached workspace fields", slog.Error(err)) | ||
| return | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This error should probably nuke the cache? If the workspace can not be fetched, this could be a permission issue for example. Then the cache is stale with the last authorized called.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess it depends on whether we want to always nuke it, or only nuke for permissions issues? There could be transient DB connection problems that could cause an error here, but those are likely to impact the follow up calls in the Stats/Metadata APIs anyways.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just for security reasons, it feels safer to nuke it. If there is transient db issues, things are going to fail in other places too. Having a stale workspace is the least of our issues.
Is it easy to nuke it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
easy, nukes the cache now via call to Clear()
coderd/database/dbauthz/dbauthz.go
Outdated
| act, ok := ActorFromContext(ctx) | ||
| if !ok { | ||
| return ErrNoActor | ||
| } | ||
| err := q.auth.Authorize(ctx, act, policy.ActionUpdate, rbacObj) | ||
| if err == nil { | ||
| return q.db.UpdateWorkspaceAgentMetadata(ctx, arg) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| act, ok := ActorFromContext(ctx) | |
| if !ok { | |
| return ErrNoActor | |
| } | |
| err := q.auth.Authorize(ctx, act, policy.ActionUpdate, rbacObj) | |
| if err == nil { | |
| return q.db.UpdateWorkspaceAgentMetadata(ctx, arg) | |
| } | |
| err := q.authorizeContext(ctx, policy.ActionUpdate, rbacObj) | |
| if err == nil { | |
| return q.db.UpdateWorkspaceAgentMetadata(ctx, arg) | |
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we don't need the actor here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The q.authorizeContext handles the actor for you. See
coder/coderd/database/dbauthz/dbauthz.go
Lines 138 to 150 in afe37c9
| // authorizeContext is a helper function to authorize an action on an object. | |
| func (q *querier) authorizeContext(ctx context.Context, action policy.Action, object rbac.Objecter) error { | |
| act, ok := ActorFromContext(ctx) | |
| if !ok { | |
| return ErrNoActor | |
| } | |
| err := q.auth.Authorize(ctx, act, action, object.RBACObject()) | |
| if err != nil { | |
| return logNotAuthorizedError(ctx, q.log, err) | |
| } | |
| return nil | |
| } |
Signed-off-by: Callum Styan <callumstyan@gmail.com>
and invalidate the cached workspace Signed-off-by: Callum Styan <callumstyan@gmail.com>
Signed-off-by: Callum Styan <callumstyan@gmail.com>
Signed-off-by: Callum Styan <callumstyan@gmail.com>
context Signed-off-by: Callum Styan <callumstyan@gmail.com>
but also not log at error level Signed-off-by: Callum Styan <callumstyan@gmail.com>
Signed-off-by: Callum Styan <callumstyan@gmail.com>
3ea6865 to
b8d3604
Compare
auto cache in the API constructor if the workspace is a prebuild Signed-off-by: Callum Styan <callumstyan@gmail.com>
Signed-off-by: Callum Styan <callumstyan@gmail.com>
Signed-off-by: Callum Styan <callumstyan@gmail.com>
Signed-off-by: Callum Styan <callumstyan@gmail.com>
Signed-off-by: Callum Styan <callumstyan@gmail.com>
struct, add clear + update functions and very basic tests Signed-off-by: Callum Styan <callumstyan@gmail.com>
GetWorkspaceByAgentID via caching workspace information in Agent APIGetWorkspaceByAgentID via caching workspace information in Agent API
Signed-off-by: Callum Styan <callumstyan@gmail.com>
GetWorkspaceByAgentID via caching workspace information in Agent APIGetWorkspaceByAgentID via caching workspace info
| // Inject RBAC object into context for dbauthz fast path, avoid having to | ||
| // call GetWorkspaceByAgentID on every metadata update. | ||
| rbacCtx, err := dbauthz.WithWorkspaceRBAC(ctx, a.Workspace.AsDatabaseWorkspace().RBACObject()) | ||
| if err != nil { | ||
| // Don't use error level here; that will fail the call but in reality an invalid object here is okay | ||
| // since we will then just fallback to the actual GetWorkspaceByAgentID call in dbauthz. | ||
| a.Log.Warn(ctx, "cached RBAC Workspace context wrapping was invalid", slog.Error(err)) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this empty for prebuilds, so WithWorkspaceRBAC will always throw a log.Warn?
We should not log a warning when the right behavior is occurring (that being a prebuild workspace is not being cached)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this now only logs if the workspace was cached and the rbac object was still somehow invalid
coderd/agentapi/stats.go
Outdated
| return nil, xerrors.Errorf("get workspace by agent ID %q: %w", workspaceAgent.ID, err) | ||
|
|
||
| // If cache is empty (prebuild or invalid), fall back to DB | ||
| ws := a.Workspace.AsDatabaseWorkspace() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if AsDatabaseWorkspace returned a *database.Workspace. Or database.Workspace, ok?
Knowing to check workspace.ID is not ideal.
coderd/database/dbauthz/dbauthz.go
Outdated
| // NOTE: The cached RBAC object is refreshed every 5 minutes. After a prebuild | ||
| // claim, there may be up to a 5-minute window where this uses stale owner_id. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The owner is no longer stale because prebuilds are not cached.
coderd/rbac/object.go
Outdated
| func (z Object) IsEmpty() bool { | ||
| return z.Owner == "" || z.Owner == uuid.Nil.String() || z.OrgID == "" || z.OrgID == uuid.Nil.String() | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not correct for all objects. Some objects have no owner, some have no org. Some have neither (like deployment config)
coderd/database/dbauthz/dbauthz.go
Outdated
| if rbacObj, ok := WorkspaceRBACFromContext(ctx); ok { | ||
| // Verify the RBAC object is valid (has required fields). | ||
| if !rbacObj.IsEmpty() && rbacObj.Type == rbac.ResourceWorkspace.Type { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WorkspaceRBACFromContext should never return an invalid RBACObject imo. The WithWorkspaceRBAC protects against these errors like the wrong type.
So we can trust it and not guard against things if ok == true
Signed-off-by: Callum Styan <callumstyan@gmail.com>
Signed-off-by: Callum Styan <callumstyan@gmail.com>
valid/present checks, plus ensure proper fallback (less noisy logging) when a cached workspace is not found/valid in dbauthz Signed-off-by: Callum Styan <callumstyan@gmail.com>
55ee6d3 to
a2ee19d
Compare
object types there Signed-off-by: Callum Styan <callumstyan@gmail.com>
Signed-off-by: Callum Styan <callumstyan@gmail.com>
coderd/agentapi/api.go
Outdated
| } | ||
|
|
||
| if ws.IsPrebuild() { | ||
| a.opts.Log.Debug(ctx, "workspace is a prebuild, not caching in AgentAPI") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably don't need this log, your call as the logs are infrequent
| a.opts.Log.Debug(ctx, "refreshed cached workspace fields", | ||
| slog.F("workspace_id", ws.ID), | ||
| slog.F("owner_id", ws.OwnerID), | ||
| slog.F("name", ws.Name)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we only log if the update yields a different workspace then what we had?
Not a huge deal as the logs are infrequent. Just trying to keep the logs down.
| ctx, | ||
| a.now(), | ||
| workspace, | ||
| ws, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is unfortunate we are passing a possibly incomplete workspace to this function. The function does not know some workspace fields might not be present.
Can we add a concrete type, maybe to database/modelmethods.go of this subset type of a workspace? Then pass around this subset instead of database.Workspace. It can implement RBACObject like:
func (w WorkspaceIdentity) RBACObject() rbac.Object {
return Workspace{
// .. Copy fields
}.RBACObject()
}So the rbac stuff is kept in check, and the ReportAgentStats does not have the possibility of using unset fields.
Clear() becomes a bit easier,
cws.WorkspaceIdentity = database.WorkspaceIdentity{}workspace info. can't misuse database.Workspace and access fields which in theory may not be set Signed-off-by: Callum Styan <callumstyan@gmail.com>
Emyrk
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The WorkspaceIdentity does add the type safety. 👍
Signed-off-by: Callum Styan <callumstyan@gmail.com>
test that fields are updated via the 5m update Signed-off-by: Callum Styan <callumstyan@gmail.com>
Signed-off-by: Callum Styan <callumstyan@gmail.com>
This query is still relatively expensive because of its call volume, at least 1 million times per day over the last week. After #20430 the two main remaining call paths are via the Workspace Agents Stats and Metadata APIs.
Disclaimer: Blink helped write tests, identify usage of
database.Workspacefields in downstream functions, identify existing pattern for and write the ticker function, and write comments.This PR adds caching of the Workspace information on the AgentAPI struct for usage in both child APIs, Notably, for the Metadata API we need to also implement a "fast path" for the section that currently calls
GetWorkspaceByAgentIDsince it's nested within a dbauthz call. We do this by attaching the relevant RBAC object from the Workspace object to the request context, then checking inUpdateWorkspaceAgentMetadataif we have the RBAC object in the context and if it is correct for authorization purposes. If we do not have the object or it is invalid we fall back to still callingGetWorkspaceByAgentID.Note that while the workspace agent API struct is constructed at startup and (IIUC) is persistent for the lifetime of the agent/workspace, and makes a call to
GetWorkspaceByAgentIDduring this process, we also need to handle updating of some of the cached fields since the entries within the database can change as part of the workspace prebuild process. We could potentially due this via subscribing, via pubsub, to updates about prebuild workspace claims. However, for now this has simply been solved with a 5m ticker to make a call toGetWorkspaceByAgentID.This should mean we on average call
GetWorkspaceByAgentIDonce every 5 minutes per agent, as opposed to multiple times per-minute per agent for the Stats API and again for the Metadata API. Even in the worst case scenario of a workspace prebuild being claimed at t=N and the next ticker not occurring until t=N+5minutes we should have at most 5 minutes of the current behaviour for each agent in a workspace.NOTE: there is a remaining issue here; caching of the workspace for stats as it is currently implemented is actually invalid, at least for the ~5 minute period between when a prebuild is claimed and the next update of the ticker the cached Owner information will be incorrect and we'll update some incorrect metrics + skip the entire
if !workspace.IsPrebuild()block.We can either:
GetWorkspaceByAgentIDfor every stats updateGetWorkspaceByAgentIDif the workspace is a prebuild, and the first time it is not we could grab the lock and update the cached workspace informationGetWorkspaceByAgentID