-
Notifications
You must be signed in to change notification settings - Fork 1.1k
perf: introduce a terraform template/dynamic parameter render cache on prebuilds path #21201
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This commit introduces an in-memory cache for preview.Preview results to avoid expensive Terraform file parsing on repeated renders with identical inputs. Key components: 1. RenderCache: Thread-safe cache keyed by (templateVersionID, ownerID, parameterHash) that stores preview.Output results 2. Integration with dynamicRenderer.Render(): - Check cache before calling preview.Preview() - Store successful renders in cache - Return cached results on cache hit 3. Comprehensive tests: - Basic cache operations (get/put) - Cache key separation (different versions/owners/params) - Parameter hash consistency (order-independent) - Prebuild scenario simulation The cache enables significant resource savings for prebuilds where multiple instances share the same template version and parameters. All prebuilds use database.PrebuildsSystemUserID, ensuring perfect cache sharing across instances of the same preset. Expected impact: ~80-90% cache hit rate for steady-state prebuild reconciliation cycles, reducing CPU/memory from Terraform parsing. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
This commit integrates the render cache into the prebuilds reconciliation flow, enabling cache sharing across all prebuild operations. Changes: 1. StoreReconciler: - Add renderCache field (*dynamicparameters.RenderCache) - Initialize cache in NewStoreReconciler() - Pass cache to builder in provision() method 2. wsbuilder.Builder: - Add renderCache field - Add RenderCache(cache) builder method - Pass cache to dynamicparameters.Prepare() when available How it works: - Single RenderCache instance shared across all prebuilds - Each workspace build passes the cache through the builder chain - Cache keyed by (templateVersionID, ownerID, parameterHash) - All prebuilds use PrebuildsSystemUserID, maximizing cache hits Expected behavior for a preset with 5 instances: - First instance: cache miss → full render → cache stored - Instances 2-5: cache hit → instant return - Result: ~80% reduction in render operations per cycle This addresses the resource cost identified in profiling where the prebuilds path consumed 25% CPU and 50% memory allocations, primarily from repeated Terraform file parsing. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Adds cache hits, misses, and size metrics to track render cache performance and enable monitoring of the cache's effectiveness in reducing expensive terraform parsing operations. Metrics added: - coderd_prebuilds_render_cache_hits_total: Counter for cache hits - coderd_prebuilds_render_cache_misses_total: Counter for cache misses - coderd_prebuilds_render_cache_size_entries: Gauge for current cache size The metrics are optional and only created when a Prometheus registerer is provided to the reconciler.
Implements automatic expiration and cleanup of cache entries: - 1 hour TTL for cached entries - Periodic cleanup every 15 minutes to remove expired entries - Expired entries are treated as cache misses on get() - Cache properly closed when reconciler stops to clean up goroutine Uses quartz.Clock instead of time package for testability, allowing tests to advance the clock without time.Sleep(). This prevents unbounded cache growth while maintaining high hit rates for frequently rendered template versions.
Updates the cache entry timestamp on every successful cache hit, implementing LRU-like behavior where frequently accessed entries stay in the cache longer. This ensures that actively used template versions remain cached even if they were initially added long ago, while rarely accessed entries expire and get cleaned up. Adds TestRenderCache_TimestampRefreshOnHit to verify that: - Cache hits refresh the entry timestamp - Refreshed entries remain valid beyond their original TTL - Entries eventually expire after no access for the full TTL period
Changes renderCache from a pointer to an interface to eliminate nil checks: - Created RenderCache interface with get/put methods - Renamed concrete type to RenderCacheImpl - Added noopRenderCache for default no-op behavior - Removed all nil checks in Render() method This simplifies the code by ensuring renderCache is always present, either as a real cache with metrics or as a no-op implementation. The nil checks were inconsistent with other non-lazy-loaded fields and added unnecessary complexity at each usage site.
|
I have read the CLA Document and I hereby sign the CLA 1 out of 2 committers have signed the CLA. |
Signed-off-by: Callum Styan <callumstyan@gmail.com>
Signed-off-by: Callum Styan <callumstyan@gmail.com>
Previously, tests creating StoreReconciler instances would leak cleanup goroutines because they never called Stop() to close the render cache. This caused goleak test failures. Add t.Cleanup() calls to all tests creating reconcilers to ensure the render cache cleanup goroutines are properly stopped. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Add missing context import to metricscollector_test.go - Add cleanup calls for all remaining NewStoreReconciler instances in reconcile_test.go, including those in goroutines This completes the fix for goroutine leaks in render cache tests. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
The renderCache field in StoreReconciler was using the concrete type *RenderCacheImpl instead of the RenderCache interface, preventing proper cleanup of cache goroutines. Changes: - Add Close() method to RenderCache interface - Implement Close() as no-op in noopRenderCache - Update StoreReconciler.renderCache to use interface type - Update wsbuilder.Builder.renderCache to use interface type - Remove unnecessary nil check in reconciler Stop() method This ensures all render cache cleanup goroutines are properly stopped when reconcilers are stopped, fixing goleak test failures. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
The reconciler's Stop() method was returning early if the reconciler was never started (running == false), which meant the render cache cleanup goroutine was never stopped. This happened in tests that create reconcilers but never call Run() on them. When Stop() was called in t.Cleanup(), it would skip closing the render cache, causing goroutine leaks. Fix: Move renderCache.Close() to execute before the running check, so the cleanup goroutine is always stopped regardless of whether the reconciler was ever started. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
9b0c42b to
4f9b23a
Compare
This test confirms that the cache correctly handles the prebuild flow where ResolveParameters() calls Render() twice per build. With identical preset parameters across multiple prebuilds, we get the expected metrics: - First prebuild: 2 misses (one for empty params, one for preset params) - Subsequent prebuilds: 2 hits each (both Render calls hit the same cache entries) The user's actual deployment shows more cache misses and entries than expected, indicating that parameter values differ between prebuild instances. This suggests parameters are being modified or injected somewhere in the prebuild creation flow.
coderd/dynamicparameters/render.go
Outdated
| if cached, ok := r.data.renderCache.get(r.data.templateVersionID, ownerID, values); ok { | ||
| return cached, nil | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The terraform values have already been loaded into memory at this point, which I imagine is a good chunk of the allocation.
I guess it would help to see where the allocations/time is actually being spent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TIL it's trivy
| func newCache(clock quartz.Clock, ttl time.Duration, cacheHits, cacheMisses prometheus.Counter, cacheSize prometheus.Gauge) *RenderCacheImpl { | ||
| c := &RenderCacheImpl{ | ||
| entries: make(map[cacheKey]*cacheEntry), | ||
| clock: clock, | ||
| cacheHits: cacheHits, | ||
| cacheMisses: cacheMisses, | ||
| cacheSize: cacheSize, | ||
| ttl: ttl, | ||
| stopCh: make(chan struct{}), | ||
| doneCh: make(chan struct{}), | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The pattern we use other places is we keep the metrics on the RenderCacheImpl. Then call cache.Register(prometheusregistry) from outside. Or just pass in the prometheus registry to New.
NewRenderCache(registry *prometheus.Registry)
If the registry is nil, like in tests, just don't attach the metrics.
The first render in ResolveParameters() is purely for introspection to identify ephemeral parameters. The second render validates the final values after ephemeral processing. Previously, both renders were cached, causing unnecessary cache entries. For prebuilds with constant parameters, this doubled cache size and created entries that were never reused. This optimization: - Adds RenderWithoutCache() method to Renderer interface - First render (introspection) skips cache operations - Second render (final validation) still uses cache - Reduces cache size by 50% for prebuild scenarios - Improves cache hit ratio (was 4 misses/11 hits, now 1 miss/2+ hits) For 3 prebuilds with identical parameters: - Before: 2 misses, 4 hits, 2 entries (introspection + final) - After: 1 miss, 2 hits, 1 entry (final values only)
|
closing, after further investigation we discovered that the underlying issue is actually: |
Trying out tasks in dogfood for this one.
I'm not in love with the metrics pattern here, nor the interface/
RenderCacheImplname the model chose, but it's nicer than everything being a pointer imo.In this PR we're introducing a 'render cache' for rendered terraform templates. Currently the prebuilds portion of the system, via the path that calls out to dynamic parameters resolution and a terraform parser to then finally build a workspace, is responsible for a significant portion of our CPU usage and memory allocation internally. About 25% and 50% respectively.
The render cache will take the result of the terraform parsing and dynamic parameter resolution, and cache it for a given template version ID/ownerID/hash of parameters. The owner ID should always be consistent for prebuilds, and I believe both the parameters hash and template verison ID (which is a UUID, so it's globally unique meaning we don't need the template name or ID in the key) will change if a template is updated.
We can then include this cache in the prebuilds reconciler to avoid having to do the param resolution/terraform parsing the majority of the time.
At the moment the cache cleanup is pretty basic; not hooking into template version publishing, just running a periodic check for cache entries that haven't been used in the last hour and removing them.