
Conversation

@cstyan
Contributor

@cstyan cstyan commented Nov 11, 2025

NOTE: this is still WIP as it introduces in-memory caching of some data, and there is currently no cleanup of that cache. To be decided: do we want to keep the cache at all, based on the benchmark results below, and if so, do we want LRU or TTL caching? I would lean towards basic TTL caching with cleanup via a goroutine unless there's an argument for more strictly bounding our memory usage here. We're likely already over-allocating at the moment; with this PR, memory usage should be more stable for a given set of live agents.

Again, poking around in our profiling data, this seemed like a quick win. It will be more impactful in larger organizations and with larger numbers of agents per workspace.

The MetricsAggregator Run function and the asPrometheus function, which was previously attached to the annotatedMetric type, are the source of ~11% of our memory allocations and ~7.5% of total memory allocated (24h period). This path itself is only ~1.6% of our CPU time, but it should also have a knock-on effect on garbage collection, which is about ~7.6% of CPU time.

[Four profiling screenshots from 2025-11-11 showing the allocation and CPU shares described above.]

The optimizations are relatively straightforward (a rough sketch in Go follows the list):

  • preallocate the string slice for the final label set to the correct length, based on the lengths of the base labels and extra labels
  • move asPrometheus from annotatedMetric to MetricsAggregator so we can cache metric description structs (this should be a big win, since agents likely push the same metrics repeatedly over time)
  • use a stack-allocated slice of a 'reasonable size' (16 in this case) in asPrometheus to avoid heap allocation, most of the time, for smaller label sets (in terms of # of label pairs)
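
For a rough idea of what these changes look like together, here's a minimal sketch in Go. The type, field, and helper names (label, metricsAggregator, the '|'-joined cache key, the gauge-only conversion) are simplified assumptions for illustration, not the exact code in coderd/prometheusmetrics:

package sketch

import (
	"strings"

	"github.com/prometheus/client_golang/prometheus"
)

// label and metricsAggregator are simplified stand-ins for the real types in
// coderd/prometheusmetrics; the shapes and names here are assumptions.
type label struct{ Name, Value string }

type metricsAggregator struct {
	// Keyed on metric name + label *names* (not values), so the cache grows
	// with unique metric families, not with the number of connected agents.
	descCache map[string]*prometheus.Desc
}

func newMetricsAggregator() *metricsAggregator {
	return &metricsAggregator{descCache: make(map[string]*prometheus.Desc)}
}

func (ma *metricsAggregator) asPrometheus(name string, value float64, baseLabels, extraLabels []label) (prometheus.Metric, error) {
	// (1) Preallocate the label-value slice to its exact final length instead
	// of letting append grow it repeatedly.
	labelValues := make([]string, 0, len(baseLabels)+len(extraLabels))

	// (3) Fixed-size backing array for label names: for label sets of <= 16
	// pairs this is intended to avoid a heap allocation (subject to escape
	// analysis); larger sets simply fall back to reallocation.
	var nameBuf [16]string
	labelNames := nameBuf[:0]

	// Build the cache key from the metric name and label names only.
	var key strings.Builder
	key.WriteString(name)
	for _, l := range baseLabels {
		key.WriteByte('|')
		key.WriteString(l.Name)
		labelNames = append(labelNames, l.Name)
		labelValues = append(labelValues, l.Value)
	}
	for _, l := range extraLabels {
		key.WriteByte('|')
		key.WriteString(l.Name)
		labelNames = append(labelNames, l.Name)
		labelValues = append(labelValues, l.Value)
	}

	// (2) Cache the *prometheus.Desc per metric family; agents push the same
	// metric names and label names over and over, so after the first update
	// this is almost always a hit.
	k := key.String()
	desc, ok := ma.descCache[k]
	if !ok {
		// Copy out of the stack buffer before handing the names to NewDesc,
		// which retains its argument slice.
		names := append([]string(nil), labelNames...)
		desc = prometheus.NewDesc(name, "", names, nil)
		ma.descCache[k] = desc
	}

	// Gauge-only for brevity; the real code handles the metric's actual type.
	return prometheus.NewConstMetric(desc, prometheus.GaugeValue, value, labelValues...)
}

Note that the cache key is built from label names only, which is why label value cardinality (i.e. more agents) doesn't grow the cache.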

Benchmark Results:

goos: linux
goarch: amd64
pkg: github.com/coder/coder/v2/coderd/prometheusmetrics
cpu: AMD EPYC 9254 24-Core Processor                
                               │ /workspace/bench/baseline.txt │  /workspace/bench/opt_stackbuf.txt  │  /workspace/bench/opt_nocache.txt   │
                               │            sec/op             │    sec/op     vs base               │    sec/op     vs base               │
_asPrometheus/base4_extra0-48                     2.128µ ± ∞ ¹   1.384µ ± ∞ ¹  -34.96% (p=0.008 n=5)   2.660µ ± ∞ ¹  +25.00% (p=0.016 n=5)
_asPrometheus/base4_extra2-48                     3.246µ ± ∞ ¹   1.842µ ± ∞ ¹  -43.25% (p=0.008 n=5)   3.492µ ± ∞ ¹        ~ (p=0.151 n=5)
_asPrometheus/base4_extra5-48                     5.423µ ± ∞ ¹   2.596µ ± ∞ ¹  -52.13% (p=0.008 n=5)   5.560µ ± ∞ ¹        ~ (p=0.421 n=5)
_asPrometheus/base4_extra10-48                    7.168µ ± ∞ ¹   3.747µ ± ∞ ¹  -47.73% (p=0.008 n=5)   8.239µ ± ∞ ¹  +14.94% (p=0.008 n=5)
_asPrometheus/base2_extra5-48                     3.636µ ± ∞ ¹   2.073µ ± ∞ ¹  -42.99% (p=0.008 n=5)   3.766µ ± ∞ ¹        ~ (p=0.310 n=5)
geomean                                           3.962µ         2.199µ        -44.50%                 4.375µ        +10.42%
¹ need >= 6 samples for confidence interval at level 0.95

                               │ /workspace/bench/baseline.txt │  /workspace/bench/opt_stackbuf.txt   │   /workspace/bench/opt_nocache.txt   │
                               │             B/op              │     B/op       vs base               │     B/op       vs base               │
_asPrometheus/base4_extra0-48                     1184.0 ± ∞ ¹     816.0 ± ∞ ¹  -31.08% (p=0.008 n=5)    1008.0 ± ∞ ¹  -14.86% (p=0.008 n=5)
_asPrometheus/base4_extra2-48                     1496.0 ± ∞ ¹     912.0 ± ∞ ¹  -39.04% (p=0.008 n=5)    1288.0 ± ∞ ¹  -13.90% (p=0.008 n=5)
_asPrometheus/base4_extra5-48                    2.375Ki ± ∞ ¹   1.219Ki ± ∞ ¹  -48.68% (p=0.008 n=5)   2.125Ki ± ∞ ¹  -10.53% (p=0.008 n=5)
_asPrometheus/base4_extra10-48                   3.125Ki ± ∞ ¹   1.766Ki ± ∞ ¹  -43.50% (p=0.008 n=5)   2.797Ki ± ∞ ¹  -10.50% (p=0.008 n=5)
_asPrometheus/base2_extra5-48                    1.539Ki ± ∞ ¹   1.000Ki ± ∞ ¹  -35.03% (p=0.008 n=5)   1.383Ki ± ∞ ¹  -10.15% (p=0.008 n=5)
geomean                                          1.808Ki         1.088Ki        -39.79%                 1.590Ki        -12.01%
¹ need >= 6 samples for confidence interval at level 0.95

                               │ /workspace/bench/baseline.txt │ /workspace/bench/opt_stackbuf.txt  │  /workspace/bench/opt_nocache.txt  │
                               │           allocs/op           │  allocs/op   vs base               │  allocs/op   vs base               │
_asPrometheus/base4_extra0-48                      33.00 ± ∞ ¹   20.00 ± ∞ ¹  -39.39% (p=0.008 n=5)   29.00 ± ∞ ¹  -12.12% (p=0.008 n=5)
_asPrometheus/base4_extra2-48                      41.00 ± ∞ ¹   25.00 ± ∞ ¹  -39.02% (p=0.008 n=5)   37.00 ± ∞ ¹   -9.76% (p=0.008 n=5)
_asPrometheus/base4_extra5-48                      56.00 ± ∞ ¹   34.00 ± ∞ ¹  -39.29% (p=0.008 n=5)   52.00 ± ∞ ¹   -7.14% (p=0.008 n=5)
_asPrometheus/base4_extra10-48                     76.00 ± ∞ ¹   49.00 ± ∞ ¹  -35.53% (p=0.008 n=5)   72.00 ± ∞ ¹   -5.26% (p=0.008 n=5)
_asPrometheus/base2_extra5-48                      44.00 ± ∞ ¹   28.00 ± ∞ ¹  -36.36% (p=0.008 n=5)   41.00 ± ∞ ¹   -6.82% (p=0.008 n=5)
geomean                                            47.95         29.76        -37.94%                 43.99         -8.25%
¹ need >= 6 samples for confidence interval at level 0.95

}
}

var sinkMetric prometheus.Metric
Contributor

Does this need to be a package global?

}

// asPrometheus returns the annotatedMetric as a prometheus.Metric, it preallocates/fills by index, uses the aggregators
//
Contributor

Spurious newline?

b.WriteString(ln)
}
for _, l := range extraLabels {
b.WriteByte("|"[0])
Contributor

Could use a character literal here and line 385

Suggested change
b.WriteByte("|"[0])
b.WriteByte('|')

cleanupHistogram prometheus.Histogram
aggregateByLabels []string
// per-aggregator cache of descriptors
descCache map[string]*prometheus.Desc
Contributor

I'm just now learning how metrics are collected. To your question about cache strategy, I see there's already some timer based cleanup done by MetricAggregator. Might be easy to piggy-back on that for TTL.

I wonder if we can simplify/eliminate the map key and additional eviction by somehow storing a *prometheus.Desc alongside the corresponding annotatedMetric in the store map value. Then I think we could just call prometheus.NewDesc() and replace the stored descriptor when needed (which seems like it would be a fairly narrow criteria based on how hashKey() uniquely identifies an annotatedMetric).

Contributor Author

I obviously had some horse blinders on when writing this, hey? I didn't even see that this cleanup routine for the metrics themselves existed. We can definitely piggy-back on that 👍 However, we can't fully reuse the store, since that is a cache of individual unique metrics (per-agent, for example), while the description caching is per metric family (a metric name plus a generic label set, shared across agents): we cache and look up the description based on metric label names, but not values.
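
For example (illustrative metric and label names): two agents emitting coderd_agent_up{agent_name="a1", workspace="w1"} and coderd_agent_up{agent_name="a2", workspace="w2"} are two distinct entries in store, but both resolve to a single descCache entry keyed on the metric name plus the label names agent_name and workspace.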

@cstyan cstyan changed the title WIP perf: improve performance of metricsAggregator path by reducing memory allocations perf: improve performance of metricsAggregator path by reducing memory allocations Nov 20, 2025
}
key := cacheKeyForDesc(name, baseLabelNames, extraLabels)
if d, ok := ma.descCache[key]; ok {
d.lastUsed = time.Now()
Contributor

Does this update the timestamp in the map value? Since d is not a pointer I think this updates the timestamp in a copied entry.

Contributor Author

Yep, you're right 👍 I added a test to ensure the timestamp value is actually written, and changed the line(s) just after this to reassign the updated d back to the map key.
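
For reference, a self-contained illustration of the copy-then-write-back behavior being discussed here (generic Go, not the PR's actual types):

package main

import (
	"fmt"
	"time"
)

// entry mirrors the shape of a hypothetical descCache value with a lastUsed field.
type entry struct{ lastUsed time.Time }

func main() {
	cache := map[string]entry{"k": {}}

	// Indexing a value-typed map yields a copy; mutating the copy does not
	// touch the map.
	d := cache["k"]
	d.lastUsed = time.Now()
	fmt.Println(cache["k"].lastUsed.IsZero()) // true: the map still holds the zero time

	// The fix: assign the updated copy back to the key.
	cache["k"] = d
	fmt.Println(cache["k"].lastUsed.IsZero()) // false
}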


// then
require.Eventually(t, func() bool {
return len(ma.descCache) == 0
Contributor

I'm pretty sure there's a data race here. We may be reading len(ma.descCache) here while the loop in MetricsAggregator.Run() is writing to the map.

Contributor Author

It really would only have been updated the once, when we first write the metric, but yes, the race detector was complaining here.

I've refactored the test so it doesn't need to call Run, which avoids the race.
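
As a generic illustration of why the detector fires (not the actual test code), reading a map while another goroutine writes to it is a data race unless both sides are synchronized:

package main

import (
	"sync"
	"time"
)

func main() {
	var (
		mu    sync.Mutex
		cache = map[string]time.Time{}
	)

	// Writer, standing in for the MetricsAggregator.Run loop updating descCache.
	go func() {
		for {
			mu.Lock()
			cache["metric"] = time.Now()
			mu.Unlock()
			time.Sleep(time.Millisecond)
		}
	}()

	// Reader, standing in for len(ma.descCache) inside require.Eventually.
	// Remove the mu.Lock/Unlock pairs on either side and "go test -race"
	// reports a concurrent map read/write; the PR instead restructures the
	// test so the Run goroutine isn't needed at all.
	for i := 0; i < 100; i++ {
		mu.Lock()
		_ = len(cache)
		mu.Unlock()
		time.Sleep(time.Millisecond)
	}
}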

Member

@Emyrk Emyrk left a comment

Smart 👍

I am not aware of the frequency of these allocations, but I do imagine the quantity per metric collection is high, since the number of agent metrics is probably growing.

If I am understanding this right, we hold those metric descriptions in memory. And this is across all agents, so the cache does not grow if more agents connect to the coderd.

The cache deletes metrics not seen in the last 2 minutes. In practice though, unless agent versions change, all metrics should be appearing every 2 minutes, so this cache should realistically not purge anything, right?

Keep the cache cleaning; just checking my understanding.

@cstyan
Contributor Author

cstyan commented Nov 24, 2025

Smart 👍

I am not aware of the frequency of these allocations, but I do imagine the quantity per metric collect is high since the number of agent metrics is probably growing.

For each coderd instance, it receives a stats/metrics update on updateCh from each agent every 30s (by default). So the number of calls per minute is roughly 2 * (# of agents sending to that coderd) * (# of stats/metrics in a given agent's update).
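
(Illustrative numbers only: 500 agents each pushing ~20 stats/metrics per update works out to roughly 2 * 500 * 20 = 20,000 calls per minute on that coderd instance.)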

If I am understanding this right, we hold those metric descriptions in memory. And this is across all agents, so the cache does not grow if more agents connect to the coderd.

Correct, the existing ma.store grows/shrinks based on the # of active agents (and the # of metrics each agent is emitting). This new descCache only shrinks/grows based on the total # of unique metric name/label name combos across all the agents the aggregator is receiving from. Label value cardinality (so more running agents) does not have an impact on that descCache size.

The cache deletes metrics not seen in the last 2 minutes. In practice though, unless agent versions change, all metrics should be appearing every 2minutes. So this cache should realistically not purge anything right?

Yeah, store should only really clean up when agents come and go; descCache in theory would only clean up if we stop emitting a given stat/metric entirely, across all agents.
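
For reference, a rough sketch of what that TTL purge could look like when piggy-backed on the existing cleanup tick. The entry shape, function name, and threshold are assumptions (the discussion above mentions roughly 2 minutes), and the real code may differ:

package sketch

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// descCacheEntry extends the earlier sketch's cache value with a last-used
// timestamp; a hypothetical shape, not necessarily the PR's exact struct.
type descCacheEntry struct {
	desc     *prometheus.Desc
	lastUsed time.Time
}

// purgeStaleDescs drops descriptor-cache entries not used within ttl. In
// practice entries only disappear once a metric stops being emitted by every
// agent, so this mostly bounds worst-case growth rather than doing routine
// eviction.
func purgeStaleDescs(cache map[string]descCacheEntry, now time.Time, ttl time.Duration) {
	for key, e := range cache {
		if now.Sub(e.lastUsed) > ttl {
			delete(cache, key)
		}
	}
}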

@cstyan cstyan merged commit 658e8c3 into main Nov 24, 2025
30 checks passed
@cstyan cstyan deleted the callum/asPrometheus-perf branch November 24, 2025 23:45
@github-actions github-actions bot locked and limited conversation to collaborators Nov 24, 2025
