Skip to content

Conversation

@cstyan
Copy link
Contributor

@cstyan cstyan commented Nov 4, 2025

This query is still relatively expensive because of its call volume, at least 1 million times per day over the last week. After #20430 the two main remaining call paths are via the Workspace Agents Stats and Metadata APIs.

Disclaimer: Blink helped write tests, identify usage of database.Workspace fields in downstream functions, identify existing pattern for and write the ticker function, and write comments.

This PR adds caching of the Workspace information on the AgentAPI struct for usage in both child APIs, Notably, for the Metadata API we need to also implement a "fast path" for the section that currently calls GetWorkspaceByAgentID since it's nested within a dbauthz call. We do this by attaching the relevant RBAC object from the Workspace object to the request context, then checking in UpdateWorkspaceAgentMetadata if we have the RBAC object in the context and if it is correct for authorization purposes. If we do not have the object or it is invalid we fall back to still calling GetWorkspaceByAgentID.

Note that while the workspace agent API struct is constructed at startup and (IIUC) is persistent for the lifetime of the agent/workspace, and makes a call to GetWorkspaceByAgentID during this process, we also need to handle updating of some of the cached fields since the entries within the database can change as part of the workspace prebuild process. We could potentially due this via subscribing, via pubsub, to updates about prebuild workspace claims. However, for now this has simply been solved with a 5m ticker to make a call to GetWorkspaceByAgentID.

This should mean we on average call GetWorkspaceByAgentID once every 5 minutes per agent, as opposed to multiple times per-minute per agent for the Stats API and again for the Metadata API. Even in the worst case scenario of a workspace prebuild being claimed at t=N and the next ticker not occurring until t=N+5minutes we should have at most 5 minutes of the current behaviour for each agent in a workspace.

NOTE: there is a remaining issue here; caching of the workspace for stats as it is currently implemented is actually invalid, at least for the ~5 minute period between when a prebuild is claimed and the next update of the ticker the cached Owner information will be incorrect and we'll update some incorrect metrics + skip the entire if !workspace.IsPrebuild() block.

We can either:

  1. revert this optimization here and still just always call GetWorkspaceByAgentID for every stats update
  2. similar to option 1, but only always call GetWorkspaceByAgentID if the workspace is a prebuild, and the first time it is not we could grab the lock and update the cached workspace information
  3. do the right thing for both stats and metadata, and hook into pubsub to get the prebuild claim update ASAP while still avoiding extra calls to GetWorkspaceByAgentID

Signed-off-by: Callum Styan <callumstyan@gmail.com>
GetWorkspaceByAgentID for agent metadata updates

Signed-off-by: Callum Styan <callumstyan@gmail.com>
@cstyan cstyan changed the title perf: significantly reduce DB calls to GetWorkspaceByAgentID via caching workspace information in Agent API WIP perf: significantly reduce DB calls to GetWorkspaceByAgentID via caching workspace information in Agent API Nov 4, 2025
@cstyan cstyan requested a review from Emyrk November 4, 2025 00:07
@github-actions github-actions bot added the stale This issue is like stale bread. label Nov 12, 2025
Comment on lines 66 to 67
AutostartSchedule sql.NullString
DormantAt sql.NullTime
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These can change in flight. Unsure if we need these to be up to date with any change

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we actually need the AutostartSchedule for the statsAPI, and while it in theory can change after the workspace has started it seems like it will change infrequently enough that the 5m refresh time is okay here

Comment on lines 343 to 346
if err != nil {
a.opts.Log.Warn(ctx, "failed to refresh cached workspace fields", slog.Error(err))
return
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This error should probably nuke the cache? If the workspace can not be fetched, this could be a permission issue for example. Then the cache is stale with the last authorized called.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it depends on whether we want to always nuke it, or only nuke for permissions issues? There could be transient DB connection problems that could cause an error here, but those are likely to impact the follow up calls in the Stats/Metadata APIs anyways.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for security reasons, it feels safer to nuke it. If there is transient db issues, things are going to fail in other places too. Having a stale workspace is the least of our issues.

Is it easy to nuke it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

easy, nukes the cache now via call to Clear()

Comment on lines 5536 to 5543
act, ok := ActorFromContext(ctx)
if !ok {
return ErrNoActor
}
err := q.auth.Authorize(ctx, act, policy.ActionUpdate, rbacObj)
if err == nil {
return q.db.UpdateWorkspaceAgentMetadata(ctx, arg)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
act, ok := ActorFromContext(ctx)
if !ok {
return ErrNoActor
}
err := q.auth.Authorize(ctx, act, policy.ActionUpdate, rbacObj)
if err == nil {
return q.db.UpdateWorkspaceAgentMetadata(ctx, arg)
}
err := q.authorizeContext(ctx, policy.ActionUpdate, rbacObj)
if err == nil {
return q.db.UpdateWorkspaceAgentMetadata(ctx, arg)
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't need the actor here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The q.authorizeContext handles the actor for you. See

// authorizeContext is a helper function to authorize an action on an object.
func (q *querier) authorizeContext(ctx context.Context, action policy.Action, object rbac.Objecter) error {
act, ok := ActorFromContext(ctx)
if !ok {
return ErrNoActor
}
err := q.auth.Authorize(ctx, act, action, object.RBACObject())
if err != nil {
return logNotAuthorizedError(ctx, q.log, err)
}
return nil
}

Signed-off-by: Callum Styan <callumstyan@gmail.com>
@github-actions github-actions bot removed the stale This issue is like stale bread. label Nov 15, 2025
and invalidate the cached workspace

Signed-off-by: Callum Styan <callumstyan@gmail.com>
Signed-off-by: Callum Styan <callumstyan@gmail.com>
Signed-off-by: Callum Styan <callumstyan@gmail.com>
context

Signed-off-by: Callum Styan <callumstyan@gmail.com>
Signed-off-by: Callum Styan <callumstyan@gmail.com>
but also not log at error level

Signed-off-by: Callum Styan <callumstyan@gmail.com>
Signed-off-by: Callum Styan <callumstyan@gmail.com>
@cstyan cstyan force-pushed the callum/workspace-agent-db-call-pt2 branch from 3ea6865 to b8d3604 Compare November 18, 2025 00:45
auto cache in the API constructor if the workspace is a prebuild

Signed-off-by: Callum Styan <callumstyan@gmail.com>
Signed-off-by: Callum Styan <callumstyan@gmail.com>
Signed-off-by: Callum Styan <callumstyan@gmail.com>
Signed-off-by: Callum Styan <callumstyan@gmail.com>
Signed-off-by: Callum Styan <callumstyan@gmail.com>
struct, add clear + update functions and very basic tests

Signed-off-by: Callum Styan <callumstyan@gmail.com>
@cstyan cstyan changed the title WIP perf: significantly reduce DB calls to GetWorkspaceByAgentID via caching workspace information in Agent API perf: significantly reduce DB calls to GetWorkspaceByAgentID via caching workspace information in Agent API Nov 18, 2025
Signed-off-by: Callum Styan <callumstyan@gmail.com>
@cstyan cstyan changed the title perf: significantly reduce DB calls to GetWorkspaceByAgentID via caching workspace information in Agent API perf: reduce DB calls to GetWorkspaceByAgentID via caching workspace info Nov 18, 2025
Comment on lines 112 to 119
// Inject RBAC object into context for dbauthz fast path, avoid having to
// call GetWorkspaceByAgentID on every metadata update.
rbacCtx, err := dbauthz.WithWorkspaceRBAC(ctx, a.Workspace.AsDatabaseWorkspace().RBACObject())
if err != nil {
// Don't use error level here; that will fail the call but in reality an invalid object here is okay
// since we will then just fallback to the actual GetWorkspaceByAgentID call in dbauthz.
a.Log.Warn(ctx, "cached RBAC Workspace context wrapping was invalid", slog.Error(err))
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this empty for prebuilds, so WithWorkspaceRBAC will always throw a log.Warn?

We should not log a warning when the right behavior is occurring (that being a prebuild workspace is not being cached)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this now only logs if the workspace was cached and the rbac object was still somehow invalid

return nil, xerrors.Errorf("get workspace by agent ID %q: %w", workspaceAgent.ID, err)

// If cache is empty (prebuild or invalid), fall back to DB
ws := a.Workspace.AsDatabaseWorkspace()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if AsDatabaseWorkspace returned a *database.Workspace. Or database.Workspace, ok?

Knowing to check workspace.ID is not ideal.

Comment on lines 5514 to 5515
// NOTE: The cached RBAC object is refreshed every 5 minutes. After a prebuild
// claim, there may be up to a 5-minute window where this uses stale owner_id.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The owner is no longer stale because prebuilds are not cached.

Comment on lines 241 to 243
func (z Object) IsEmpty() bool {
return z.Owner == "" || z.Owner == uuid.Nil.String() || z.OrgID == "" || z.OrgID == uuid.Nil.String()
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not correct for all objects. Some objects have no owner, some have no org. Some have neither (like deployment config)

Comment on lines 5516 to 5518
if rbacObj, ok := WorkspaceRBACFromContext(ctx); ok {
// Verify the RBAC object is valid (has required fields).
if !rbacObj.IsEmpty() && rbacObj.Type == rbac.ResourceWorkspace.Type {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WorkspaceRBACFromContext should never return an invalid RBACObject imo. The WithWorkspaceRBAC protects against these errors like the wrong type.

So we can trust it and not guard against things if ok == true

Signed-off-by: Callum Styan <callumstyan@gmail.com>
Signed-off-by: Callum Styan <callumstyan@gmail.com>
valid/present checks, plus ensure proper fallback (less noisy logging)
when a cached workspace is not found/valid in dbauthz

Signed-off-by: Callum Styan <callumstyan@gmail.com>
@cstyan cstyan force-pushed the callum/workspace-agent-db-call-pt2 branch from 55ee6d3 to a2ee19d Compare November 21, 2025 18:29
object types there

Signed-off-by: Callum Styan <callumstyan@gmail.com>
Signed-off-by: Callum Styan <callumstyan@gmail.com>
}

if ws.IsPrebuild() {
a.opts.Log.Debug(ctx, "workspace is a prebuild, not caching in AgentAPI")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably don't need this log, your call as the logs are infrequent

Comment on lines +293 to +296
a.opts.Log.Debug(ctx, "refreshed cached workspace fields",
slog.F("workspace_id", ws.ID),
slog.F("owner_id", ws.OwnerID),
slog.F("name", ws.Name))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we only log if the update yields a different workspace then what we had?

Not a huge deal as the logs are infrequent. Just trying to keep the logs down.

ctx,
a.now(),
workspace,
ws,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is unfortunate we are passing a possibly incomplete workspace to this function. The function does not know some workspace fields might not be present.

Can we add a concrete type, maybe to database/modelmethods.go of this subset type of a workspace? Then pass around this subset instead of database.Workspace. It can implement RBACObject like:

func (w WorkspaceIdentity) RBACObject() rbac.Object {
	return Workspace{
		// .. Copy fields
	}.RBACObject()
}

So the rbac stuff is kept in check, and the ReportAgentStats does not have the possibility of using unset fields.

Clear() becomes a bit easier,

cws.WorkspaceIdentity = database.WorkspaceIdentity{}

workspace info. can't misuse database.Workspace and access fields which
in theory may not be set

Signed-off-by: Callum Styan <callumstyan@gmail.com>
Copy link
Member

@Emyrk Emyrk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The WorkspaceIdentity does add the type safety. 👍

Signed-off-by: Callum Styan <callumstyan@gmail.com>
test that fields are updated via the 5m update

Signed-off-by: Callum Styan <callumstyan@gmail.com>
Signed-off-by: Callum Styan <callumstyan@gmail.com>
@cstyan cstyan merged commit b0e8384 into main Nov 25, 2025
30 checks passed
@cstyan cstyan deleted the callum/workspace-agent-db-call-pt2 branch November 25, 2025 22:45
@github-actions github-actions bot locked and limited conversation to collaborators Nov 25, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants