Conversation


@cstyan cstyan commented Dec 10, 2025

Trying out tasks in dogfood for this one.

I'm not in love with the metrics pattern here, nor the interface/RenderCacheImpl name the model chose, but it's nicer than everything being a pointer imo.

In this PR we're introducing a 'render cache' for rendered terraform templates. Currently the prebuilds portion of the system, via the path that calls out to dynamic parameters resolution and a terraform parser to then finally build a workspace, is responsible for a significant portion of our CPU usage and memory allocation internally. About 25% and 50% respectively.

The render cache will take the result of the terraform parsing and dynamic parameter resolution, and cache it for a given template version ID/owner ID/hash of parameters. The owner ID should always be consistent for prebuilds, and I believe both the parameters hash and template version ID (which is a UUID, so it's globally unique, meaning we don't need the template name or ID in the key) will change if a template is updated.

We can then include this cache in the prebuilds reconciler to avoid having to do the param resolution/terraform parsing the majority of the time.

At the moment the cache cleanup is pretty basic; not hooking into template version publishing, just running a periodic check for cache entries that haven't been used in the last hour and removing them.
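As a rough sketch of the cache key described above (the field names and hashing scheme here are assumptions for illustration, not the PR's actual code), the parameter hash needs to be order-independent so the same set of parameters always maps to the same entry:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// cacheKey sketches the composite key: template version ID (a UUID, so
// globally unique), owner ID (constant for prebuilds), and a digest of
// the resolved parameters.
type cacheKey struct {
	templateVersionID string
	ownerID           string
	parameterHash     string
}

// hashParameters produces an order-independent digest of parameter
// name/value pairs by sorting names before hashing, so map iteration
// order cannot change the result.
func hashParameters(params map[string]string) string {
	names := make([]string, 0, len(params))
	for name := range params {
		names = append(names, name)
	}
	sort.Strings(names)

	h := sha256.New()
	for _, name := range names {
		h.Write([]byte(name))
		h.Write([]byte{0}) // separator so "ab"+"c" != "a"+"bc"
		h.Write([]byte(params[name]))
		h.Write([]byte{0})
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	a := hashParameters(map[string]string{"region": "us-east", "cpu": "4"})
	b := hashParameters(map[string]string{"cpu": "4", "region": "us-east"})
	fmt.Println(a == b) // insertion order does not matter
}
```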

Callum Styan and others added 6 commits December 9, 2025 21:47
This commit introduces an in-memory cache for preview.Preview results
to avoid expensive Terraform file parsing on repeated renders with
identical inputs.

Key components:

1. RenderCache: Thread-safe cache keyed by (templateVersionID, ownerID,
   parameterHash) that stores preview.Output results

2. Integration with dynamicRenderer.Render():
   - Check cache before calling preview.Preview()
   - Store successful renders in cache
   - Return cached results on cache hit

3. Comprehensive tests:
   - Basic cache operations (get/put)
   - Cache key separation (different versions/owners/params)
   - Parameter hash consistency (order-independent)
   - Prebuild scenario simulation

The cache enables significant resource savings for prebuilds where
multiple instances share the same template version and parameters.
All prebuilds use database.PrebuildsSystemUserID, ensuring perfect
cache sharing across instances of the same preset.

Expected impact: ~80-90% cache hit rate for steady-state prebuild
reconciliation cycles, reducing CPU/memory from Terraform parsing.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
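A minimal sketch of the thread-safe get/put cache this commit describes (type and field names are assumptions; the real cache stores preview.Output keyed by the triple above):

```go
package main

import (
	"fmt"
	"sync"
)

// renderOutput stands in for preview.Output, which this sketch does not
// depend on.
type renderOutput struct{ html string }

type key struct{ versionID, ownerID, paramHash string }

// renderCache is a minimal thread-safe cache: check before rendering,
// store after a successful render.
type renderCache struct {
	mu      sync.Mutex
	entries map[key]renderOutput
}

func newRenderCache() *renderCache {
	return &renderCache{entries: make(map[key]renderOutput)}
}

func (c *renderCache) get(k key) (renderOutput, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	out, ok := c.entries[k]
	return out, ok
}

func (c *renderCache) put(k key, out renderOutput) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[k] = out
}

func main() {
	c := newRenderCache()
	k := key{"tv-1", "prebuilds-user", "abc123"}
	if _, ok := c.get(k); !ok {
		// Cache miss: do the expensive render, then store the result.
		c.put(k, renderOutput{html: "rendered"})
	}
	out, ok := c.get(k)
	fmt.Println(ok, out.html)
}
```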
This commit integrates the render cache into the prebuilds
reconciliation flow, enabling cache sharing across all prebuild
operations.

Changes:

1. StoreReconciler:
   - Add renderCache field (*dynamicparameters.RenderCache)
   - Initialize cache in NewStoreReconciler()
   - Pass cache to builder in provision() method

2. wsbuilder.Builder:
   - Add renderCache field
   - Add RenderCache(cache) builder method
   - Pass cache to dynamicparameters.Prepare() when available

How it works:
- Single RenderCache instance shared across all prebuilds
- Each workspace build passes the cache through the builder chain
- Cache keyed by (templateVersionID, ownerID, parameterHash)
- All prebuilds use PrebuildsSystemUserID, maximizing cache hits

Expected behavior for a preset with 5 instances:
- First instance: cache miss → full render → cache stored
- Instances 2-5: cache hit → instant return
- Result: ~80% reduction in render operations per cycle

This addresses the resource cost identified in profiling where the
prebuilds path consumed 25% CPU and 50% memory allocations, primarily
from repeated Terraform file parsing.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
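The builder plumbing described above might look roughly like this pared-down sketch (wsbuilder.Builder's actual fields and methods differ; only the render-cache chaining is shown):

```go
package main

import "fmt"

// RenderCache stands in for the cache type passed through the builder.
type RenderCache interface{ name() string }

type realCache struct{}

func (realCache) name() string { return "real" }

// Builder sketches the value-receiver builder style: each method returns
// a copy with one more field set.
type Builder struct {
	renderCache RenderCache
}

// RenderCache attaches the shared cache so provision() can pass it down
// to parameter resolution.
func (b Builder) RenderCache(c RenderCache) Builder {
	b.renderCache = c
	return b
}

func main() {
	// One cache instance would be shared across all prebuild builds.
	b := Builder{}.RenderCache(realCache{})
	fmt.Println(b.renderCache.name())
}
```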
Adds cache hits, misses, and size metrics to track render cache
performance and enable monitoring of the cache's effectiveness
in reducing expensive terraform parsing operations.

Metrics added:
- coderd_prebuilds_render_cache_hits_total: Counter for cache hits
- coderd_prebuilds_render_cache_misses_total: Counter for cache misses
- coderd_prebuilds_render_cache_size_entries: Gauge for current cache size

The metrics are optional and only created when a Prometheus registerer
is provided to the reconciler.
Implements automatic expiration and cleanup of cache entries:
- 1 hour TTL for cached entries
- Periodic cleanup every 15 minutes to remove expired entries
- Expired entries are treated as cache misses on get()
- Cache properly closed when reconciler stops to clean up goroutine

Uses quartz.Clock instead of time package for testability, allowing
tests to advance the clock without time.Sleep().

This prevents unbounded cache growth while maintaining high hit rates
for frequently rendered template versions.
Updates the cache entry timestamp on every successful cache hit,
implementing LRU-like behavior where frequently accessed entries
stay in the cache longer.

This ensures that actively used template versions remain cached
even if they were initially added long ago, while rarely accessed
entries expire and get cleaned up.

Adds TestRenderCache_TimestampRefreshOnHit to verify that:
- Cache hits refresh the entry timestamp
- Refreshed entries remain valid beyond their original TTL
- Entries eventually expire after no access for the full TTL period
Changes renderCache from a pointer to an interface to eliminate nil checks:

- Created RenderCache interface with get/put methods
- Renamed concrete type to RenderCacheImpl
- Added noopRenderCache for default no-op behavior
- Removed all nil checks in Render() method

This simplifies the code by ensuring renderCache is always present,
either as a real cache with metrics or as a no-op implementation.
The nil checks were inconsistent with other non-lazy-loaded fields
and added unnecessary complexity at each usage site.
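The interface-plus-noop pattern described here can be sketched like this (method signatures are assumptions based on the commit message):

```go
package main

import "fmt"

// output stands in for preview.Output.
type output struct{ html string }

// RenderCache mirrors the interface the commit describes: a real
// implementation with metrics, or a no-op default.
type RenderCache interface {
	get(key string) (output, bool)
	put(key string, out output)
	Close()
}

// noopRenderCache satisfies the interface but never stores anything, so
// call sites need no nil checks when caching is disabled.
type noopRenderCache struct{}

func (noopRenderCache) get(string) (output, bool) { return output{}, false }
func (noopRenderCache) put(string, output)        {}
func (noopRenderCache) Close()                    {}

// render always consults the cache; with the noop implementation this
// degrades gracefully to rendering every time.
func render(cache RenderCache, key string) output {
	if out, ok := cache.get(key); ok {
		return out
	}
	out := output{html: "rendered " + key} // expensive render stands in here
	cache.put(key, out)
	return out
}

func main() {
	var cache RenderCache = noopRenderCache{}
	out := render(cache, "tv-1")
	fmt.Println(out.html)
	cache.Close() // safe: no-op, no nil check needed
}
```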
@cstyan cstyan requested a review from Emyrk as a code owner December 10, 2025 01:38

github-actions bot commented Dec 10, 2025


Thank you for your submission, we really appreciate it. Like many open-source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution. You can sign the CLA by just posting a Pull Request Comment same as the below format.


I have read the CLA Document and I hereby sign the CLA


1 out of 2 committers have signed the CLA.
✅ [cstyan](https://github.com/cstyan)
@callum Styan
Callum Styan seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You can retrigger this bot by commenting recheck in this Pull Request. Posted by the CLA Assistant Lite bot.

cstyan and others added 6 commits December 10, 2025 18:24
Signed-off-by: Callum Styan <callumstyan@gmail.com>
Signed-off-by: Callum Styan <callumstyan@gmail.com>
Previously, tests creating StoreReconciler instances would leak cleanup
goroutines because they never called Stop() to close the render cache.
This caused goleak test failures.

Add t.Cleanup() calls to all tests creating reconcilers to ensure the
render cache cleanup goroutines are properly stopped.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Add missing context import to metricscollector_test.go
- Add cleanup calls for all remaining NewStoreReconciler instances
  in reconcile_test.go, including those in goroutines

This completes the fix for goroutine leaks in render cache tests.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
The renderCache field in StoreReconciler was using the concrete type
*RenderCacheImpl instead of the RenderCache interface, preventing
proper cleanup of cache goroutines.

Changes:
- Add Close() method to RenderCache interface
- Implement Close() as no-op in noopRenderCache
- Update StoreReconciler.renderCache to use interface type
- Update wsbuilder.Builder.renderCache to use interface type
- Remove unnecessary nil check in reconciler Stop() method

This ensures all render cache cleanup goroutines are properly stopped
when reconcilers are stopped, fixing goleak test failures.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
The reconciler's Stop() method was returning early if the reconciler
was never started (running == false), which meant the render cache
cleanup goroutine was never stopped.

This happened in tests that create reconcilers but never call Run() on
them. When Stop() was called in t.Cleanup(), it would skip closing the
render cache, causing goroutine leaks.

Fix: Move renderCache.Close() to execute before the running check, so
the cleanup goroutine is always stopped regardless of whether the
reconciler was ever started.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
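The ordering fix in this commit boils down to closing the cache before the early return, which can be sketched as (field names are assumptions):

```go
package main

import "fmt"

// closer stands in for the render cache's Close(), which stops its
// cleanup goroutine.
type closer struct{ closed bool }

func (c *closer) Close() { c.closed = true }

// reconciler sketches the Stop() ordering: the cache is closed even when
// Run() was never called.
type reconciler struct {
	renderCache *closer
	running     bool
}

func (r *reconciler) Stop() {
	// Always stop the cache's cleanup goroutine first; returning early
	// before this line is what caused the goleak failures in tests that
	// construct a reconciler without running it.
	r.renderCache.Close()
	if !r.running {
		return
	}
	// ... stop the reconciliation loop ...
}

func main() {
	r := &reconciler{renderCache: &closer{}, running: false}
	r.Stop()
	fmt.Println(r.renderCache.closed)
}
```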
@cstyan cstyan force-pushed the feat/prebuild-render-cache branch from 9b0c42b to 4f9b23a Compare December 11, 2025 19:37
This test confirms that the cache correctly handles the prebuild flow where
ResolveParameters() calls Render() twice per build. With identical preset
parameters across multiple prebuilds, we get the expected metrics:
- First prebuild: 2 misses (one for empty params, one for preset params)
- Subsequent prebuilds: 2 hits each (both Render calls hit the same cache entries)

The user's actual deployment shows more cache misses and entries than expected,
indicating that parameter values differ between prebuild instances. This suggests
parameters are being modified or injected somewhere in the prebuild creation flow.
Comment on lines 263 to 265
if cached, ok := r.data.renderCache.get(r.data.templateVersionID, ownerID, values); ok {
	return cached, nil
}
Member


The terraform values have already been loaded into memory at this point, which I imagine is a good chunk of the allocation.

I guess it would help to see where the allocations/time is actually being spent.

Member


TIL it's trivy

Comment on lines +58 to +69
func newCache(clock quartz.Clock, ttl time.Duration, cacheHits, cacheMisses prometheus.Counter, cacheSize prometheus.Gauge) *RenderCacheImpl {
	c := &RenderCacheImpl{
		entries:     make(map[cacheKey]*cacheEntry),
		clock:       clock,
		cacheHits:   cacheHits,
		cacheMisses: cacheMisses,
		cacheSize:   cacheSize,
		ttl:         ttl,
		stopCh:      make(chan struct{}),
		doneCh:      make(chan struct{}),
	}

Member


The pattern we use in other places is to keep the metrics on the RenderCacheImpl, then call cache.Register(prometheusRegistry) from outside. Or just pass the prometheus registry in to New:

NewRenderCache(registry *prometheus.Registry)

If the registry is nil, like in tests, just don't attach the metrics.
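The suggested pattern can be sketched like this; to keep the example self-contained, a hypothetical `registry` type stands in for *prometheus.Registry and the metrics are plain fields rather than prometheus counters:

```go
package main

import "fmt"

// registry stands in for *prometheus.Registry; in real code the cache
// would register prometheus counters/gauges here.
type registry struct{ registered []string }

func (r *registry) MustRegister(names ...string) {
	r.registered = append(r.registered, names...)
}

// RenderCacheImpl owns its metrics as fields rather than having the
// caller construct and pass each metric in.
type RenderCacheImpl struct {
	hits, misses int64
}

// NewRenderCache attaches metrics only when a registry is provided, so
// tests can pass nil and skip metrics entirely.
func NewRenderCache(reg *registry) *RenderCacheImpl {
	c := &RenderCacheImpl{}
	if reg != nil {
		reg.MustRegister(
			"render_cache_hits_total",
			"render_cache_misses_total",
		)
	}
	return c
}

func main() {
	reg := &registry{}
	_ = NewRenderCache(reg)
	fmt.Println(len(reg.registered))

	_ = NewRenderCache(nil) // nil registry, as in tests: no metrics attached
}
```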

The first render in ResolveParameters() is purely for introspection to
identify ephemeral parameters. The second render validates the final
values after ephemeral processing.

Previously, both renders were cached, causing unnecessary cache entries.
For prebuilds with constant parameters, this doubled cache size and
created entries that were never reused.

This optimization:
- Adds RenderWithoutCache() method to Renderer interface
- First render (introspection) skips cache operations
- Second render (final validation) still uses cache
- Reduces cache size by 50% for prebuild scenarios
- Improves cache hit ratio (was 4 misses/11 hits, now 1 miss/2+ hits)

For 3 prebuilds with identical parameters:
- Before: 2 misses, 4 hits, 2 entries (introspection + final)
- After: 1 miss, 2 hits, 1 entry (final values only)
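The two-method split described above can be sketched as follows (interface and method names follow the commit message; the counters are illustrative):

```go
package main

import "fmt"

type output struct{ html string }

// Renderer mirrors the split: the introspection render bypasses the
// cache, the final render uses it.
type Renderer interface {
	Render(key string) output
	RenderWithoutCache(key string) output
}

type cachedRenderer struct {
	cache   map[string]output
	renders int // counts actual (uncached) render work
}

func (r *cachedRenderer) doRender(key string) output {
	r.renders++
	return output{html: "rendered " + key}
}

func (r *cachedRenderer) Render(key string) output {
	if out, ok := r.cache[key]; ok {
		return out
	}
	out := r.doRender(key)
	r.cache[key] = out
	return out
}

// RenderWithoutCache neither reads nor writes the cache, so the
// introspection pass no longer creates single-use entries.
func (r *cachedRenderer) RenderWithoutCache(key string) output {
	return r.doRender(key)
}

func main() {
	r := &cachedRenderer{cache: map[string]output{}}
	for i := 0; i < 3; i++ { // three prebuilds with identical parameters
		r.RenderWithoutCache("") // first render: introspection, skips cache
		r.Render("preset-hash")  // second render: final values, cached
	}
	// One cache entry total; only the final-values render is shared.
	fmt.Println(len(r.cache), r.renders)
}
```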

cstyan commented Dec 11, 2025

Closing; after further investigation we discovered that the underlying issue is actually:
a) no backoff if a previous prebuild has failed, which is compounded by
b) not validating that a template is valid for a given set of prebuild and preset definitions, which means that we can endlessly retry (on the configured interval) building the # of prebuilds defined by a preset even though they will never succeed

@cstyan cstyan closed this Dec 11, 2025
@github-actions github-actions bot locked and limited conversation to collaborators Dec 11, 2025