Skip to content

fix(coderd/prometheusmetrics): filter deleted wsbuilds to reduce db load #19197

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

mafredri
Copy link
Member

@mafredri mafredri commented Aug 6, 2025

This change removes the GetLatestWorkspaceBuilds query which includes
all workspaces for all time (including deleted). This allows us to also
stop using GetProvisionerJobsByIDs for said builds as the job status
is included in GetWorkspaces called separately.

Fixes coder/internal#717

require.NoError(t, err)
for range 1 + n {
insertDeleted(t, db, u, org)
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: This fails in the old implementation, but I thought it best to formalize the assumption. Considering the old implementation did not test for this, it seems like validation for the new approach.

@mafredri mafredri changed the title fix(coderd/prometheusmetrics): no deleted wsbuilds to reduce db load fix(coderd/prometheusmetrics): filter deleted wsbuilds to reduce db load Aug 6, 2025
This change removes the `GetLatestWorkspaceBuilds` query which includes
all workspaces for all time (including deleted). This allows us to also
stop using `GetProvisionerJobsByIDs` for said builds as the job status
is included in `GetWorkspaces` called separately.

Fixes coder/internal#717
@mafredri mafredri force-pushed the mafredri/fix-prommetrics-db-load branch from 068dbd6 to 7b1fc44 Compare August 6, 2025 10:31
@spikecurtis
Copy link
Contributor

Soft deleted workspaces have been around forever, as has this metric. I wonder if @kylecarbs can reach back that far in his memory and recall whether it was a deliberate decision to incluce builds from deleted workspaces in the coderd_api_workspace_latest_build_total metric, or just an oversight.

The description of the metric talks about "current" number of workspace builds by status, which makes me think it was just an oversight. Since we track the latest build per workspace, not just all builds, the metric isn't useful for tracking an overall rate of succeeding / failing builds. My take is that the most obvious interpretation is that it's for non-deleted workspaces and this change is fine.

@mafredri mafredri marked this pull request as ready for review August 6, 2025 11:23
@mafredri
Copy link
Member Author

mafredri commented Aug 6, 2025

That is the conclusion I also came to when I read the descriptions, which is why I went with this approach.

If need be, we can include deleted builds with a specialized query that selects the least amount of data needed. Previously we selected all data only to use a single column which is why optimizations were not possible.

Copy link
Member

@johnstcn johnstcn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change makes sense to me.

Copy link
Contributor

@dannykopping dannykopping left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems reasonable 👍

Please ensure the semantics of this change are documented in the release notes for the next release.
We could clarify the definition of this metric too, which may help operators:

https://coder.com/docs/admin/integrations/prometheus#available-metrics

Currently it's quite brief: "The latest workspace builds with a status."

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

bug: GetLatestWorkspaceBuilds creates lots of DB load
4 participants