
Conversation

@cstyan cstyan commented Oct 23, 2025

This should resolve coder/internal#728 by refactoring the ResourceMonitorAPI struct so that it only queries the database for the resource monitors once for memory and once for volumes, then uses the monitors stored on the API struct from that point on. This should eliminate the vast majority of calls to GetWorkspaceByAgentID and FetchVolumesResourceMonitorsUpdatedAfter/FetchMemoryResourceMonitorsUpdatedAfter (millions of calls per week).
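
For illustration, a minimal sketch of the shape of this change, assuming hypothetical type and field names rather than the actual coderd code: the API struct holds the monitors it fetched once and reuses them for the lifetime of the agent's connection.

package resourcemonitor

import "time"

// Illustrative types only; the real monitor rows come from the database package.
type MemoryMonitor struct {
	Enabled   bool
	Threshold int32
	CreatedAt time.Time
}

type VolumeMonitor struct {
	Path      string
	Enabled   bool
	Threshold int32
	CreatedAt time.Time
}

// ResourceMonitorAPI caches the monitor definitions after the first fetch, so
// later monitoring RPCs no longer need GetWorkspaceByAgentID or the
// Fetch*ResourceMonitors* queries on every call.
type ResourceMonitorAPI struct {
	agentID string

	memoryMonitor  MemoryMonitor
	volumeMonitors []VolumeMonitor
}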

Tests pass, and I ran an instance of coder in a workspace with a template that adds resource monitoring at a 10s interval. Note that this is the default Docker container, so there are other sources of GetWorkspaceByAgentID DB queries, and that this workspace had been running for ~15 minutes when I gathered this data.

Two snapshots, roughly 30s apart, of the ResourceMonitor query counts:

coder@callum-coder-2:~/coder$ curl localhost:19090/metrics | grep ResourceMonitor | grep count
coderd_db_query_latencies_seconds_count{query="FetchMemoryResourceMonitorsByAgentID"} 2
coderd_db_query_latencies_seconds_count{query="FetchMemoryResourceMonitorsUpdatedAfter"} 2
coderd_db_query_latencies_seconds_count{query="FetchVolumesResourceMonitorsByAgentID"} 2
coderd_db_query_latencies_seconds_count{query="FetchVolumesResourceMonitorsUpdatedAfter"} 2
coderd_db_query_latencies_seconds_count{query="UpdateMemoryResourceMonitor"} 155
coderd_db_query_latencies_seconds_count{query="UpdateVolumeResourceMonitor"} 155
coder@callum-coder-2:~/coder$ curl localhost:19090/metrics | grep ResourceMonitor | grep count
coderd_db_query_latencies_seconds_count{query="FetchMemoryResourceMonitorsByAgentID"} 2
coderd_db_query_latencies_seconds_count{query="FetchMemoryResourceMonitorsUpdatedAfter"} 2
coderd_db_query_latencies_seconds_count{query="FetchVolumesResourceMonitorsByAgentID"} 2
coderd_db_query_latencies_seconds_count{query="FetchVolumesResourceMonitorsUpdatedAfter"} 2
coderd_db_query_latencies_seconds_count{query="UpdateMemoryResourceMonitor"} 158
coderd_db_query_latencies_seconds_count{query="UpdateVolumeResourceMonitor"} 158

And two snapshots roughly 1m apart of the GetWorkspaceByAgentID counts; the majority of these are from the workspace metadata stats updates:

coder@callum-coder-2:~/coder$ curl localhost:19090/metrics | grep GetWorkspaceByAgentID | grep count
coderd_db_query_latencies_seconds_count{query="GetWorkspaceByAgentID"} 876
coder@callum-coder-2:~/coder$ curl localhost:19090/metrics | grep GetWorkspaceByAgentID | grep count
coderd_db_query_latencies_seconds_count{query="GetWorkspaceByAgentID"} 918

(The queries above are for fetching workspaces and workspace agent monitor definitions.)

Signed-off-by: Callum Styan <callumstyan@gmail.com>
@cstyan cstyan requested a review from spikecurtis October 23, 2025 15:53
}

a.monitorsLock.RLock()
defer a.monitorsLock.RUnlock()
Contributor

I think we could simplify this locking by just moving it to PushResourcesMonitoringUsage and enforcing that calls to this function are not concurrent. It's meant to be sequential, and that's how the agent uses it.
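
For illustration, a rough sketch of that suggestion, using hypothetical names (pushAPI, monitorMemory) rather than the real coderd code: the lock is taken once for the whole RPC, and could be dropped entirely if calls are guaranteed to be sequential.

package resourcemonitor

import (
	"context"
	"sync"
)

// pushAPI is a hypothetical stand-in for the real API struct; only the pieces
// needed to show the locking shape are included.
type pushAPI struct {
	monitorsLock  sync.Mutex
	memoryEnabled bool
}

// PushResourcesMonitoringUsage locks once for the whole RPC instead of each
// helper taking a read lock. If the agent is guaranteed to call this
// sequentially, the lock could be dropped altogether.
func (a *pushAPI) PushResourcesMonitoringUsage(ctx context.Context, memoryUsedBytes int64) error {
	a.monitorsLock.Lock()
	defer a.monitorsLock.Unlock()

	return a.monitorMemory(ctx, memoryUsedBytes)
}

func (a *pushAPI) monitorMemory(_ context.Context, memoryUsedBytes int64) error {
	if !a.memoryEnabled {
		return nil // the caller already holds the lock; no RLock needed here
	}
	// ... evaluate thresholds and record the datapoint ...
	_ = memoryUsedBytes
	return nil
}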

return xerrors.Errorf("fetch memory resource monitor: %w", err)
}
if err == nil {
a.memoryMonitor = &memMon
Contributor

Why is this field a pointer if the fetch doesn't return a pointer? It seems fine as a value, which would save an allocation and some GC work.

Contributor Author

I was using the nil check as a way to check for existence/proper instantiation, but we can alternatively use CreatedAt.IsZero to check.
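
A small sketch of the value-typed alternative, with assumed names:

package resourcemonitor

import "time"

// MemoryMonitor stands in for the row type returned by the fetch query; the
// real type has more fields.
type MemoryMonitor struct {
	Enabled   bool
	CreatedAt time.Time
}

type api struct {
	memoryMonitor MemoryMonitor // stored by value instead of *MemoryMonitor
}

// haveMemoryMonitor replaces the old nil-pointer check: the zero value has a
// zero CreatedAt, so IsZero tells us whether a monitor was ever fetched.
func (a *api) haveMemoryMonitor() bool {
	return !a.memoryMonitor.CreatedAt.IsZero()
}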

// Load memory monitor once
var memoryErr error
a.memOnce.Do(func() {
memoryErr = a.fetchMemoryMonitor(ctx)
Contributor

This doesn't really work, because you're passing a closure to Do() that captures memoryErr, which is a local variable in this function. So only the first call to this Once can possibly capture the error; every subsequent call gets its own memoryErr left unchanged (nil), even if the initial fetch failed.

What needs to happen here depends on the error handling strategy.

One option would be to fetch the initial value of these monitors at the time the *ResourceMonitoringAPI itself is instantiated (that is, when the agent connects to the RPC service). If we fail to fetch the monitors, we error out the connection and assume the agent will reconnect. That's simple, and probably OK, given that we expect DB errors to be rare.

It has the nominal drawback that it tears the whole connection down when the agent could instead just retry the resource monitoring RPCs, but at present the agent doesn't do that: if anything fails on the agentapi, it tends to tear everything down and start again. So being more sophisticated, i.e. hitting the DB when we get RPC calls and allowing retries on errors, would be an improvement that our only client doesn't take advantage of.
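
A standalone illustration of the capture problem: the error is only ever written into the first caller's local variable, so later callers never see it.

package main

import (
	"errors"
	"fmt"
	"sync"
)

var once sync.Once

// load mimics the pattern under review: the error is a local variable
// captured by the closure passed to once.Do.
func load() error {
	var memoryErr error
	once.Do(func() {
		memoryErr = errors.New("db unavailable") // the first (and only) execution fails
	})
	// Only the first caller's memoryErr is ever assigned; every later call
	// gets a fresh local that stays nil, so the failure is never resurfaced.
	return memoryErr
}

func main() {
	fmt.Println(load()) // db unavailable
	fmt.Println(load()) // <nil> -- the earlier failure is invisible
}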

Contributor Author

Yep, that's just a miss on my part during some refactoring of the sync.Once functions. I was trying to avoid the refactor that would be required to return an error and have the client retry properly, while also not exiting completely on a transient failure. With your added context, though, it sounds like we're fine with the teardown/early exit since that's what we're already doing everywhere else for similar paths in the agent code.

We can explore the refactor later if we decide it would be useful 👍
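
A hedged sketch of the instantiation-time approach being agreed on here; the constructor, the store interface, and the error handling are assumptions, not the merged code.

package resourcemonitor

import (
	"context"
	"fmt"
	"time"
)

type MemoryMonitor struct {
	Enabled   bool
	CreatedAt time.Time
}

// store is a hypothetical slice of the database interface, shaped after the
// FetchMemoryResourceMonitorsByAgentID query seen in the metrics above.
type store interface {
	FetchMemoryResourceMonitorsByAgentID(ctx context.Context, agentID string) (MemoryMonitor, error)
}

type ResourceMonitorAPI struct {
	agentID       string
	memoryMonitor MemoryMonitor
}

// NewResourceMonitorAPI fetches the monitor definitions once, when the agent
// connects to the RPC service. A failure here errors out the connection and
// the agent reconnects, instead of retrying inside individual RPCs. (The real
// code would also need to treat "no rows" as "no monitor configured" rather
// than a hard failure.)
func NewResourceMonitorAPI(ctx context.Context, db store, agentID string) (*ResourceMonitorAPI, error) {
	memMon, err := db.FetchMemoryResourceMonitorsByAgentID(ctx, agentID)
	if err != nil {
		return nil, fmt.Errorf("fetch memory resource monitor: %w", err)
	}
	return &ResourceMonitorAPI{
		agentID:       agentID,
		memoryMonitor: memMon,
	}, nil
}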

Signed-off-by: Callum Styan <callumstyan@gmail.com>
@cstyan cstyan requested a review from spikecurtis October 28, 2025 06:12
Contributor

@spikecurtis spikecurtis left a comment

A small suggestion inline, but I don't need to review again.

}

if !monitor.Enabled {
if !a.memoryMonitor.Enabled || a.memoryMonitor.CreatedAt.IsZero() {
Contributor

The zero value of a.memoryMonitor.Enabled is false, so I think the first check is sufficient here.

Contributor Author

👍 Yep, one check is sufficient (either works here); I'll change this back to just checking Enabled.

I'd added CreatedAt.IsZero() here since we had to change the if memoryErr != nil { check in GetResourcesMonitoringConfiguration, and there we need to use CreatedAt.IsZero to indicate "no configuration".
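
For context, a hedged sketch of how the IsZero check distinguishes "no configuration" from "configured but disabled" in GetResourcesMonitoringConfiguration; the response and monitor types here are assumptions, not the real agent proto.

package resourcemonitor

import "time"

// Hypothetical types; the real response lives in the agent proto and the real
// monitor row comes from the database package.
type memoryMonitor struct {
	Enabled   bool
	Threshold int32
	CreatedAt time.Time
}

type MemoryConfig struct {
	Enabled   bool
	Threshold int32
}

type Configuration struct {
	Memory *MemoryConfig // nil means "no memory monitor configured"
}

type api struct {
	memoryMonitor memoryMonitor
}

// GetResourcesMonitoringConfiguration uses CreatedAt.IsZero, not Enabled, to
// decide whether any configuration exists: a zero CreatedAt means the fetch
// found no row, while Enabled=false is a present-but-disabled monitor.
func (a *api) GetResourcesMonitoringConfiguration() *Configuration {
	cfg := &Configuration{}
	if !a.memoryMonitor.CreatedAt.IsZero() {
		cfg.Memory = &MemoryConfig{
			Enabled:   a.memoryMonitor.Enabled,
			Threshold: a.memoryMonitor.Threshold,
		}
	}
	return cfg
}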

Signed-off-by: Callum Styan <callumstyan@gmail.com>
@cstyan cstyan merged commit 45c43d4 into main Oct 28, 2025
44 of 46 checks passed
@cstyan cstyan deleted the callum/workspace-agent-call-volume branch October 28, 2025 20:38
@github-actions github-actions bot locked and limited conversation to collaborators Oct 28, 2025

Development

Successfully merging this pull request may close these issues.

bug: GetWorkspaceByAgentID is called over a million times a day on dogfood
