
Conversation

@cstyan
Contributor

@cstyan cstyan commented Nov 11, 2025

NOTE: this is still WIP as it introduces in-memory caching of some data, and there is currently no cleanup of that cache. To be decided: do we want to keep the cache at all, based on the benchmark results below, and if so, do we want LRU or TTL caching? I would lean towards basic TTL caching with cleanup via a goroutine unless there's an argument for more strictly bounding our memory usage here. We're likely already over-allocating at the moment; with this PR, memory usage should be more stable for a given set of live agents.

Again, poking around in our profiling data, this seemed like a quick win. It will be more impactful in larger organizations and with larger numbers of agents per workspace.

The MetricsAggregator Run function and the asPrometheus function, which was previously attached to the annotatedMetric type, are the source of ~11% of our memory allocations and ~7.5% of total memory allocated (24h period). This path itself is only ~1.6% of our CPU time, but it should also have a knock-on effect on garbage collection, which is about ~7.6% of CPU time.

[Four profiling screenshots from 2025-11-11 showing the allocation and CPU shares described above.]

The optimizations are relatively straightforward (a rough sketch in Go follows the list):

  • preallocate the string slice for the final label set to the correct length, based on the lengths of the base labels and extra labels
  • move asPrometheus from annotatedMetric to MetricsAggregator so we can cache metric description structs (this should be a big win, since agents likely push the same metrics repeatedly over time)
  • use a stack-allocated slice of a 'reasonable size' (16 in this case) in asPrometheus to avoid heap allocation, most of the time, for smaller label sets (in terms of # of label pairs)
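
For a rough idea of what these changes look like together, here's a minimal sketch in Go. The type, field, and helper names (label, metricsAggregator, the '|'-joined cache key, the gauge-only conversion) are simplified assumptions for illustration, not the exact code in coderd/prometheusmetrics:

package sketch

import (
	"strings"

	"github.com/prometheus/client_golang/prometheus"
)

// label and metricsAggregator are simplified stand-ins for the real types in
// coderd/prometheusmetrics; the shapes and names here are assumptions.
type label struct{ Name, Value string }

type metricsAggregator struct {
	// Keyed on metric name + label *names* (not values), so the cache grows
	// with unique metric families, not with the number of connected agents.
	descCache map[string]*prometheus.Desc
}

func newMetricsAggregator() *metricsAggregator {
	return &metricsAggregator{descCache: make(map[string]*prometheus.Desc)}
}

func (ma *metricsAggregator) asPrometheus(name string, value float64, baseLabels, extraLabels []label) (prometheus.Metric, error) {
	// (1) Preallocate the label-value slice to its exact final length instead
	// of letting append grow it repeatedly.
	labelValues := make([]string, 0, len(baseLabels)+len(extraLabels))

	// (3) Fixed-size backing array for label names: for label sets of <= 16
	// pairs this is intended to avoid a heap allocation (subject to escape
	// analysis); larger sets simply fall back to reallocation.
	var nameBuf [16]string
	labelNames := nameBuf[:0]

	// Build the cache key from the metric name and label names only.
	var key strings.Builder
	key.WriteString(name)
	for _, l := range baseLabels {
		key.WriteByte('|')
		key.WriteString(l.Name)
		labelNames = append(labelNames, l.Name)
		labelValues = append(labelValues, l.Value)
	}
	for _, l := range extraLabels {
		key.WriteByte('|')
		key.WriteString(l.Name)
		labelNames = append(labelNames, l.Name)
		labelValues = append(labelValues, l.Value)
	}

	// (2) Cache the *prometheus.Desc per metric family; agents push the same
	// metric names and label names over and over, so after the first update
	// this is almost always a hit.
	k := key.String()
	desc, ok := ma.descCache[k]
	if !ok {
		// Copy out of the stack buffer before handing the names to NewDesc,
		// which retains its argument slice.
		names := append([]string(nil), labelNames...)
		desc = prometheus.NewDesc(name, "", names, nil)
		ma.descCache[k] = desc
	}

	// Gauge-only for brevity; the real code handles the metric's actual type.
	return prometheus.NewConstMetric(desc, prometheus.GaugeValue, value, labelValues...)
}

Note that the cache key is built from label names only, which is why label value cardinality (i.e. more agents) doesn't grow the cache.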

Benchmark Results:

goos: linux
goarch: amd64
pkg: github.com/coder/coder/v2/coderd/prometheusmetrics
cpu: AMD EPYC 9254 24-Core Processor                
                               │ /workspace/bench/baseline.txt │  /workspace/bench/opt_stackbuf.txt  │  /workspace/bench/opt_nocache.txt   │
                               │            sec/op             │    sec/op     vs base               │    sec/op     vs base               │
_asPrometheus/base4_extra0-48                     2.128µ ± ∞ ¹   1.384µ ± ∞ ¹  -34.96% (p=0.008 n=5)   2.660µ ± ∞ ¹  +25.00% (p=0.016 n=5)
_asPrometheus/base4_extra2-48                     3.246µ ± ∞ ¹   1.842µ ± ∞ ¹  -43.25% (p=0.008 n=5)   3.492µ ± ∞ ¹        ~ (p=0.151 n=5)
_asPrometheus/base4_extra5-48                     5.423µ ± ∞ ¹   2.596µ ± ∞ ¹  -52.13% (p=0.008 n=5)   5.560µ ± ∞ ¹        ~ (p=0.421 n=5)
_asPrometheus/base4_extra10-48                    7.168µ ± ∞ ¹   3.747µ ± ∞ ¹  -47.73% (p=0.008 n=5)   8.239µ ± ∞ ¹  +14.94% (p=0.008 n=5)
_asPrometheus/base2_extra5-48                     3.636µ ± ∞ ¹   2.073µ ± ∞ ¹  -42.99% (p=0.008 n=5)   3.766µ ± ∞ ¹        ~ (p=0.310 n=5)
geomean                                           3.962µ         2.199µ        -44.50%                 4.375µ        +10.42%
¹ need >= 6 samples for confidence interval at level 0.95

                               │ /workspace/bench/baseline.txt │  /workspace/bench/opt_stackbuf.txt   │   /workspace/bench/opt_nocache.txt   │
                               │             B/op              │     B/op       vs base               │     B/op       vs base               │
_asPrometheus/base4_extra0-48                     1184.0 ± ∞ ¹     816.0 ± ∞ ¹  -31.08% (p=0.008 n=5)    1008.0 ± ∞ ¹  -14.86% (p=0.008 n=5)
_asPrometheus/base4_extra2-48                     1496.0 ± ∞ ¹     912.0 ± ∞ ¹  -39.04% (p=0.008 n=5)    1288.0 ± ∞ ¹  -13.90% (p=0.008 n=5)
_asPrometheus/base4_extra5-48                    2.375Ki ± ∞ ¹   1.219Ki ± ∞ ¹  -48.68% (p=0.008 n=5)   2.125Ki ± ∞ ¹  -10.53% (p=0.008 n=5)
_asPrometheus/base4_extra10-48                   3.125Ki ± ∞ ¹   1.766Ki ± ∞ ¹  -43.50% (p=0.008 n=5)   2.797Ki ± ∞ ¹  -10.50% (p=0.008 n=5)
_asPrometheus/base2_extra5-48                    1.539Ki ± ∞ ¹   1.000Ki ± ∞ ¹  -35.03% (p=0.008 n=5)   1.383Ki ± ∞ ¹  -10.15% (p=0.008 n=5)
geomean                                          1.808Ki         1.088Ki        -39.79%                 1.590Ki        -12.01%
¹ need >= 6 samples for confidence interval at level 0.95

                               │ /workspace/bench/baseline.txt │ /workspace/bench/opt_stackbuf.txt  │  /workspace/bench/opt_nocache.txt  │
                               │           allocs/op           │  allocs/op   vs base               │  allocs/op   vs base               │
_asPrometheus/base4_extra0-48                      33.00 ± ∞ ¹   20.00 ± ∞ ¹  -39.39% (p=0.008 n=5)   29.00 ± ∞ ¹  -12.12% (p=0.008 n=5)
_asPrometheus/base4_extra2-48                      41.00 ± ∞ ¹   25.00 ± ∞ ¹  -39.02% (p=0.008 n=5)   37.00 ± ∞ ¹   -9.76% (p=0.008 n=5)
_asPrometheus/base4_extra5-48                      56.00 ± ∞ ¹   34.00 ± ∞ ¹  -39.29% (p=0.008 n=5)   52.00 ± ∞ ¹   -7.14% (p=0.008 n=5)
_asPrometheus/base4_extra10-48                     76.00 ± ∞ ¹   49.00 ± ∞ ¹  -35.53% (p=0.008 n=5)   72.00 ± ∞ ¹   -5.26% (p=0.008 n=5)
_asPrometheus/base2_extra5-48                      44.00 ± ∞ ¹   28.00 ± ∞ ¹  -36.36% (p=0.008 n=5)   41.00 ± ∞ ¹   -6.82% (p=0.008 n=5)
geomean                                            47.95         29.76        -37.94%                 43.99         -8.25%
¹ need >= 6 samples for confidence interval at level 0.95

}
}

var sinkMetric prometheus.Metric
Contributor

Does this need to be a package global?

}

// asPrometheus returns the annotatedMetric as a prometheus.Metric, it preallocates/fills by index, uses the aggregators
//
Contributor

Spurious newline?

b.WriteString(ln)
}
for _, l := range extraLabels {
b.WriteByte("|"[0])
Contributor

Could use a character literal here and line 385

Suggested change
b.WriteByte("|"[0])
b.WriteByte('|')

cleanupHistogram prometheus.Histogram
aggregateByLabels []string
// per-aggregator cache of descriptors
descCache map[string]*prometheus.Desc
Contributor

I'm just now learning how metrics are collected. To your question about cache strategy, I see there's already some timer based cleanup done by MetricAggregator. Might be easy to piggy-back on that for TTL.

I wonder if we can simplify/eliminate the map key and additional eviction by somehow storing a *prometheus.Desc alongside the corresponding annotatedMetric in the store map value. Then I think we could just call prometheus.NewDesc() and replace the stored descriptor when needed (which seems like it would be a fairly narrow criteria based on how hashKey() uniquely identifies an annotatedMetric).

Contributor Author

I obviously had some horse blinders on when writing this, hey? I didn't even see that this cleanup routine for the metrics themselves existed. We can definitely piggy-back on that 👍 However, we can't fully reuse the store, since that is a cache of individual unique metrics (per-agent, for example), while the description caching is per metric family (a metric name plus a generic label set, shared across agents): we cache and look up the description based on metric label names, but not values.
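
For example (illustrative metric and label names): two agents emitting coderd_agent_up{agent_name="a1", workspace="w1"} and coderd_agent_up{agent_name="a2", workspace="w2"} are two distinct entries in store, but both resolve to a single descCache entry keyed on the metric name plus the label names agent_name and workspace.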

@cstyan cstyan changed the title WIP perf: improve performance of metricsAggregator path by reducing memory allocations perf: improve performance of metricsAggregator path by reducing memory allocations Nov 20, 2025
}
key := cacheKeyForDesc(name, baseLabelNames, extraLabels)
if d, ok := ma.descCache[key]; ok {
d.lastUsed = time.Now()
Contributor

Does this update the timestamp in the map value? Since d is not a pointer I think this updates the timestamp in a copied entry.

Contributor Author

Yep, you're right 👍 I added a test to ensure the timestamp value is actually written, and changed the line(s) just after this to reassign the updated d back to the map key.
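
For reference, a self-contained illustration of the copy-then-write-back behavior being discussed here (generic Go, not the PR's actual types):

package main

import (
	"fmt"
	"time"
)

// entry mirrors the shape of a hypothetical descCache value with a lastUsed field.
type entry struct{ lastUsed time.Time }

func main() {
	cache := map[string]entry{"k": {}}

	// Indexing a value-typed map yields a copy; mutating the copy does not
	// touch the map.
	d := cache["k"]
	d.lastUsed = time.Now()
	fmt.Println(cache["k"].lastUsed.IsZero()) // true: the map still holds the zero time

	// The fix: assign the updated copy back to the key.
	cache["k"] = d
	fmt.Println(cache["k"].lastUsed.IsZero()) // false
}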


// then
require.Eventually(t, func() bool {
return len(ma.descCache) == 0
Contributor

I'm pretty sure there's a data race here. We may be reading len(ma.descCache) here while the loop in MetricsAggregator.Run() is writing to the map.

Contributor Author

It really would only have been updated the once, when we first write the metric, but yes, the race detector was complaining here.

I've refactored the test so it doesn't need to call Run, which avoids the race.
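
As a generic illustration of why the detector fires (not the actual test code), reading a map while another goroutine writes to it is a data race unless both sides are synchronized:

package main

import (
	"sync"
	"time"
)

func main() {
	var (
		mu    sync.Mutex
		cache = map[string]time.Time{}
	)

	// Writer, standing in for the MetricsAggregator.Run loop updating descCache.
	go func() {
		for {
			mu.Lock()
			cache["metric"] = time.Now()
			mu.Unlock()
			time.Sleep(time.Millisecond)
		}
	}()

	// Reader, standing in for len(ma.descCache) inside require.Eventually.
	// Remove the mu.Lock/Unlock pairs on either side and "go test -race"
	// reports a concurrent map read/write; the PR instead restructures the
	// test so the Run goroutine isn't needed at all.
	for i := 0; i < 100; i++ {
		mu.Lock()
		_ = len(cache)
		mu.Unlock()
		time.Sleep(time.Millisecond)
	}
}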

Member

@Emyrk Emyrk left a comment

Smart 👍

I am not aware of the frequency of these allocations, but I do imagine the quantity per metric collection is high, since the number of agent metrics is probably growing.

If I am understanding this right, we hold those metric descriptions in memory. And this is across all agents, so the cache does not grow if more agents connect to the coderd.

The cache deletes metrics not seen in the last 2 minutes. In practice though, unless agent versions change, all metrics should be appearing every 2 minutes, so this cache should realistically not purge anything, right?

Keep the cache cleaning; just checking my understanding.

@cstyan
Contributor Author

cstyan commented Nov 24, 2025

Smart 👍

I am not aware of the frequency of these allocations, but I do imagine the quantity per metric collect is high since the number of agent metrics is probably growing.

For each coderd instance, it receives a stats/metrics update on updateCh from each agent every 30s (by default). So the number of calls per minute is roughly 2 * (# of agents sending to that coderd) * (# of stats/metrics in a given agent's update).
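
(Illustrative numbers only: 500 agents each pushing ~20 stats/metrics per update works out to roughly 2 * 500 * 20 = 20,000 calls per minute on that coderd instance.)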

If I am understanding this right, we hold those metric descriptions in memory. And this is across all agents, so the cache does not grow if more agents connect to the coderd.

Correct, the existing ma.store grows/shrinks based on the # of active agents (and the # of metrics each agent is emitting). This new descCache only shrinks/grows based on the total # of unique metric name/label name combos across all the agents the aggregator is receiving from. Label value cardinality (so more running agents) does not have an impact on that descCache size.

The cache deletes metrics not seen in the last 2 minutes. In practice though, unless agent versions change, all metrics should be appearing every 2minutes. So this cache should realistically not purge anything right?

Yeah, store should only really clean up when agents come and go; descCache in theory would only clean up if we stop emitting a given stat/metric entirely, across all agents.
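
For reference, a rough sketch of what that TTL purge could look like when piggy-backed on the existing cleanup tick. The entry shape, function name, and threshold are assumptions (the discussion above mentions roughly 2 minutes), and the real code may differ:

package sketch

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// descCacheEntry extends the earlier sketch's cache value with a last-used
// timestamp; a hypothetical shape, not necessarily the PR's exact struct.
type descCacheEntry struct {
	desc     *prometheus.Desc
	lastUsed time.Time
}

// purgeStaleDescs drops descriptor-cache entries not used within ttl. In
// practice entries only disappear once a metric stops being emitted by every
// agent, so this mostly bounds worst-case growth rather than doing routine
// eviction.
func purgeStaleDescs(cache map[string]descCacheEntry, now time.Time, ttl time.Duration) {
	for key, e := range cache {
		if now.Sub(e.lastUsed) > ttl {
			delete(cache, key)
		}
	}
}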

@cstyan cstyan merged commit 658e8c3 into main Nov 24, 2025
30 checks passed
@cstyan cstyan deleted the callum/asPrometheus-perf branch November 24, 2025 23:45
@github-actions github-actions bot locked and limited conversation to collaborators Nov 24, 2025
