Conversation


@cstyan cstyan commented Dec 10, 2025

Trying out tasks in dogfood for this one.

I'm not in love with the metrics pattern here, nor the interface/RenderCacheImpl name the model chose, but it's nicer than everything being a pointer imo.

In this PR we're introducing a 'render cache' for rendered terraform templates. Currently the prebuilds portion of the system, via the path that calls out to dynamic parameters resolution and a terraform parser to then finally build a workspace, is responsible for a significant portion of our CPU usage and memory allocation internally. About 25% and 50% respectively.

The render cache will take the result of the terraform parsing and dynamic parameter resolution, and cache it for a given template version ID/owner ID/hash of parameters. The owner ID should always be consistent for prebuilds, and I believe both the parameters hash and template version ID (which is a UUID, so it's globally unique, meaning we don't need the template name or ID in the key) will change if a template is updated.

We can then include this cache in the prebuilds reconciler to avoid having to do the param resolution/terraform parsing the majority of the time.

At the moment the cache cleanup is pretty basic; not hooking into template version publishing, just running a periodic check for cache entries that haven't been used in the last hour and removing them.
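As a rough sketch of the cache key described above (the field names and hashing scheme here are assumptions for illustration, not the PR's actual code), the parameter hash needs to be order-independent so the same set of parameters always maps to the same entry:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// cacheKey sketches the composite key: template version ID (a UUID, so
// globally unique), owner ID (constant for prebuilds), and a digest of
// the resolved parameters.
type cacheKey struct {
	templateVersionID string
	ownerID           string
	parameterHash     string
}

// hashParameters produces an order-independent digest of parameter
// name/value pairs by sorting names before hashing, so map iteration
// order cannot change the result.
func hashParameters(params map[string]string) string {
	names := make([]string, 0, len(params))
	for name := range params {
		names = append(names, name)
	}
	sort.Strings(names)

	h := sha256.New()
	for _, name := range names {
		h.Write([]byte(name))
		h.Write([]byte{0}) // separator so "ab"+"c" != "a"+"bc"
		h.Write([]byte(params[name]))
		h.Write([]byte{0})
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	a := hashParameters(map[string]string{"region": "us-east", "cpu": "4"})
	b := hashParameters(map[string]string{"cpu": "4", "region": "us-east"})
	fmt.Println(a == b) // insertion order does not matter
}
```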

Callum Styan and others added 6 commits December 9, 2025 21:47
This commit introduces an in-memory cache for preview.Preview results
to avoid expensive Terraform file parsing on repeated renders with
identical inputs.

Key components:

1. RenderCache: Thread-safe cache keyed by (templateVersionID, ownerID,
   parameterHash) that stores preview.Output results

2. Integration with dynamicRenderer.Render():
   - Check cache before calling preview.Preview()
   - Store successful renders in cache
   - Return cached results on cache hit

3. Comprehensive tests:
   - Basic cache operations (get/put)
   - Cache key separation (different versions/owners/params)
   - Parameter hash consistency (order-independent)
   - Prebuild scenario simulation

The cache enables significant resource savings for prebuilds where
multiple instances share the same template version and parameters.
All prebuilds use database.PrebuildsSystemUserID, ensuring perfect
cache sharing across instances of the same preset.

Expected impact: ~80-90% cache hit rate for steady-state prebuild
reconciliation cycles, reducing CPU/memory from Terraform parsing.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
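A minimal sketch of the thread-safe get/put cache this commit describes (type and field names are assumptions; the real cache stores preview.Output keyed by the triple above):

```go
package main

import (
	"fmt"
	"sync"
)

// renderOutput stands in for preview.Output, which this sketch does not
// depend on.
type renderOutput struct{ html string }

type key struct{ versionID, ownerID, paramHash string }

// renderCache is a minimal thread-safe cache: check before rendering,
// store after a successful render.
type renderCache struct {
	mu      sync.Mutex
	entries map[key]renderOutput
}

func newRenderCache() *renderCache {
	return &renderCache{entries: make(map[key]renderOutput)}
}

func (c *renderCache) get(k key) (renderOutput, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	out, ok := c.entries[k]
	return out, ok
}

func (c *renderCache) put(k key, out renderOutput) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[k] = out
}

func main() {
	c := newRenderCache()
	k := key{"tv-1", "prebuilds-user", "abc123"}
	if _, ok := c.get(k); !ok {
		// Cache miss: do the expensive render, then store the result.
		c.put(k, renderOutput{html: "rendered"})
	}
	out, ok := c.get(k)
	fmt.Println(ok, out.html)
}
```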
This commit integrates the render cache into the prebuilds
reconciliation flow, enabling cache sharing across all prebuild
operations.

Changes:

1. StoreReconciler:
   - Add renderCache field (*dynamicparameters.RenderCache)
   - Initialize cache in NewStoreReconciler()
   - Pass cache to builder in provision() method

2. wsbuilder.Builder:
   - Add renderCache field
   - Add RenderCache(cache) builder method
   - Pass cache to dynamicparameters.Prepare() when available

How it works:
- Single RenderCache instance shared across all prebuilds
- Each workspace build passes the cache through the builder chain
- Cache keyed by (templateVersionID, ownerID, parameterHash)
- All prebuilds use PrebuildsSystemUserID, maximizing cache hits

Expected behavior for a preset with 5 instances:
- First instance: cache miss → full render → cache stored
- Instances 2-5: cache hit → instant return
- Result: ~80% reduction in render operations per cycle

This addresses the resource cost identified in profiling where the
prebuilds path consumed 25% CPU and 50% memory allocations, primarily
from repeated Terraform file parsing.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
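The builder plumbing described above might look roughly like this pared-down sketch (wsbuilder.Builder's actual fields and methods differ; only the render-cache chaining is shown):

```go
package main

import "fmt"

// RenderCache stands in for the cache type passed through the builder.
type RenderCache interface{ name() string }

type realCache struct{}

func (realCache) name() string { return "real" }

// Builder sketches the value-receiver builder style: each method returns
// a copy with one more field set.
type Builder struct {
	renderCache RenderCache
}

// RenderCache attaches the shared cache so provision() can pass it down
// to parameter resolution.
func (b Builder) RenderCache(c RenderCache) Builder {
	b.renderCache = c
	return b
}

func main() {
	// One cache instance would be shared across all prebuild builds.
	b := Builder{}.RenderCache(realCache{})
	fmt.Println(b.renderCache.name())
}
```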
Adds cache hits, misses, and size metrics to track render cache
performance and enable monitoring of the cache's effectiveness
in reducing expensive terraform parsing operations.

Metrics added:
- coderd_prebuilds_render_cache_hits_total: Counter for cache hits
- coderd_prebuilds_render_cache_misses_total: Counter for cache misses
- coderd_prebuilds_render_cache_size_entries: Gauge for current cache size

The metrics are optional and only created when a Prometheus registerer
is provided to the reconciler.
Implements automatic expiration and cleanup of cache entries:
- 1 hour TTL for cached entries
- Periodic cleanup every 15 minutes to remove expired entries
- Expired entries are treated as cache misses on get()
- Cache properly closed when reconciler stops to clean up goroutine

Uses quartz.Clock instead of time package for testability, allowing
tests to advance the clock without time.Sleep().

This prevents unbounded cache growth while maintaining high hit rates
for frequently rendered template versions.
Updates the cache entry timestamp on every successful cache hit,
implementing LRU-like behavior where frequently accessed entries
stay in the cache longer.

This ensures that actively used template versions remain cached
even if they were initially added long ago, while rarely accessed
entries expire and get cleaned up.

Adds TestRenderCache_TimestampRefreshOnHit to verify that:
- Cache hits refresh the entry timestamp
- Refreshed entries remain valid beyond their original TTL
- Entries eventually expire after no access for the full TTL period
Changes renderCache from a pointer to an interface to eliminate nil checks:

- Created RenderCache interface with get/put methods
- Renamed concrete type to RenderCacheImpl
- Added noopRenderCache for default no-op behavior
- Removed all nil checks in Render() method

This simplifies the code by ensuring renderCache is always present,
either as a real cache with metrics or as a no-op implementation.
The nil checks were inconsistent with other non-lazy-loaded fields
and added unnecessary complexity at each usage site.
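The interface-plus-noop pattern described here can be sketched like this (method signatures are assumptions based on the commit message):

```go
package main

import "fmt"

// output stands in for preview.Output.
type output struct{ html string }

// RenderCache mirrors the interface the commit describes: a real
// implementation with metrics, or a no-op default.
type RenderCache interface {
	get(key string) (output, bool)
	put(key string, out output)
	Close()
}

// noopRenderCache satisfies the interface but never stores anything, so
// call sites need no nil checks when caching is disabled.
type noopRenderCache struct{}

func (noopRenderCache) get(string) (output, bool) { return output{}, false }
func (noopRenderCache) put(string, output)        {}
func (noopRenderCache) Close()                    {}

// render always consults the cache; with the noop implementation this
// degrades gracefully to rendering every time.
func render(cache RenderCache, key string) output {
	if out, ok := cache.get(key); ok {
		return out
	}
	out := output{html: "rendered " + key} // expensive render stands in here
	cache.put(key, out)
	return out
}

func main() {
	var cache RenderCache = noopRenderCache{}
	out := render(cache, "tv-1")
	fmt.Println(out.html)
	cache.Close() // safe: no-op, no nil check needed
}
```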
@cstyan cstyan requested a review from Emyrk as a code owner December 10, 2025 01:38

github-actions bot commented Dec 10, 2025


Thank you for your submission, we really appreciate it. Like many open-source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution. You can sign the CLA by just posting a Pull Request Comment same as the below format.


I have read the CLA Document and I hereby sign the CLA


1 out of 2 committers have signed the CLA.
✅ [cstyan](https://github.com/cstyan)
@callum Styan
Callum Styan seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You can retrigger this bot by commenting recheck in this Pull Request. Posted by the CLA Assistant Lite bot.

cstyan and others added 6 commits December 10, 2025 18:24
Signed-off-by: Callum Styan <callumstyan@gmail.com>
Signed-off-by: Callum Styan <callumstyan@gmail.com>
Previously, tests creating StoreReconciler instances would leak cleanup
goroutines because they never called Stop() to close the render cache.
This caused goleak test failures.

Add t.Cleanup() calls to all tests creating reconcilers to ensure the
render cache cleanup goroutines are properly stopped.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Add missing context import to metricscollector_test.go
- Add cleanup calls for all remaining NewStoreReconciler instances
  in reconcile_test.go, including those in goroutines

This completes the fix for goroutine leaks in render cache tests.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
The renderCache field in StoreReconciler was using the concrete type
*RenderCacheImpl instead of the RenderCache interface, preventing
proper cleanup of cache goroutines.

Changes:
- Add Close() method to RenderCache interface
- Implement Close() as no-op in noopRenderCache
- Update StoreReconciler.renderCache to use interface type
- Update wsbuilder.Builder.renderCache to use interface type
- Remove unnecessary nil check in reconciler Stop() method

This ensures all render cache cleanup goroutines are properly stopped
when reconcilers are stopped, fixing goleak test failures.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
The reconciler's Stop() method was returning early if the reconciler
was never started (running == false), which meant the render cache
cleanup goroutine was never stopped.

This happened in tests that create reconcilers but never call Run() on
them. When Stop() was called in t.Cleanup(), it would skip closing the
render cache, causing goroutine leaks.

Fix: Move renderCache.Close() to execute before the running check, so
the cleanup goroutine is always stopped regardless of whether the
reconciler was ever started.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
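The ordering fix in this commit boils down to closing the cache before the early return, which can be sketched as (field names are assumptions):

```go
package main

import "fmt"

// closer stands in for the render cache's Close(), which stops its
// cleanup goroutine.
type closer struct{ closed bool }

func (c *closer) Close() { c.closed = true }

// reconciler sketches the Stop() ordering: the cache is closed even when
// Run() was never called.
type reconciler struct {
	renderCache *closer
	running     bool
}

func (r *reconciler) Stop() {
	// Always stop the cache's cleanup goroutine first; returning early
	// before this line is what caused the goleak failures in tests that
	// construct a reconciler without running it.
	r.renderCache.Close()
	if !r.running {
		return
	}
	// ... stop the reconciliation loop ...
}

func main() {
	r := &reconciler{renderCache: &closer{}, running: false}
	r.Stop()
	fmt.Println(r.renderCache.closed)
}
```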
@cstyan cstyan force-pushed the feat/prebuild-render-cache branch from 9b0c42b to 4f9b23a Compare December 11, 2025 19:37
This test confirms that the cache correctly handles the prebuild flow where
ResolveParameters() calls Render() twice per build. With identical preset
parameters across multiple prebuilds, we get the expected metrics:
- First prebuild: 2 misses (one for empty params, one for preset params)
- Subsequent prebuilds: 2 hits each (both Render calls hit the same cache entries)

The user's actual deployment shows more cache misses and entries than expected,
indicating that parameter values differ between prebuild instances. This suggests
parameters are being modified or injected somewhere in the prebuild creation flow.
Comment on lines 263 to 265
if cached, ok := r.data.renderCache.get(r.data.templateVersionID, ownerID, values); ok {
	return cached, nil
}
Member


The terraform values have already been loaded into memory at this point, which I imagine is a good chunk of the allocation.

I guess it would help to see where the allocations/time is actually being spent.

Member


TIL it's trivy

Comment on lines +58 to +69
func newCache(clock quartz.Clock, ttl time.Duration, cacheHits, cacheMisses prometheus.Counter, cacheSize prometheus.Gauge) *RenderCacheImpl {
	c := &RenderCacheImpl{
		entries:     make(map[cacheKey]*cacheEntry),
		clock:       clock,
		cacheHits:   cacheHits,
		cacheMisses: cacheMisses,
		cacheSize:   cacheSize,
		ttl:         ttl,
		stopCh:      make(chan struct{}),
		doneCh:      make(chan struct{}),
	}

Member


The pattern we use in other places is to keep the metrics on the RenderCacheImpl, then call cache.Register(prometheusRegistry) from outside. Or just pass the prometheus registry in to New:

NewRenderCache(registry *prometheus.Registry)

If the registry is nil, like in tests, just don't attach the metrics.
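The suggested pattern can be sketched like this; to keep the example self-contained, a hypothetical `registry` type stands in for *prometheus.Registry and the metrics are plain fields rather than prometheus counters:

```go
package main

import "fmt"

// registry stands in for *prometheus.Registry; in real code the cache
// would register prometheus counters/gauges here.
type registry struct{ registered []string }

func (r *registry) MustRegister(names ...string) {
	r.registered = append(r.registered, names...)
}

// RenderCacheImpl owns its metrics as fields rather than having the
// caller construct and pass each metric in.
type RenderCacheImpl struct {
	hits, misses int64
}

// NewRenderCache attaches metrics only when a registry is provided, so
// tests can pass nil and skip metrics entirely.
func NewRenderCache(reg *registry) *RenderCacheImpl {
	c := &RenderCacheImpl{}
	if reg != nil {
		reg.MustRegister(
			"render_cache_hits_total",
			"render_cache_misses_total",
		)
	}
	return c
}

func main() {
	reg := &registry{}
	_ = NewRenderCache(reg)
	fmt.Println(len(reg.registered))

	_ = NewRenderCache(nil) // nil registry, as in tests: no metrics attached
}
```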

The first render in ResolveParameters() is purely for introspection to
identify ephemeral parameters. The second render validates the final
values after ephemeral processing.

Previously, both renders were cached, causing unnecessary cache entries.
For prebuilds with constant parameters, this doubled cache size and
created entries that were never reused.

This optimization:
- Adds RenderWithoutCache() method to Renderer interface
- First render (introspection) skips cache operations
- Second render (final validation) still uses cache
- Reduces cache size by 50% for prebuild scenarios
- Improves cache hit ratio (was 4 misses/11 hits, now 1 miss/2+ hits)

For 3 prebuilds with identical parameters:
- Before: 2 misses, 4 hits, 2 entries (introspection + final)
- After: 1 miss, 2 hits, 1 entry (final values only)
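The two-method split described above can be sketched as follows (interface and method names follow the commit message; the counters are illustrative):

```go
package main

import "fmt"

type output struct{ html string }

// Renderer mirrors the split: the introspection render bypasses the
// cache, the final render uses it.
type Renderer interface {
	Render(key string) output
	RenderWithoutCache(key string) output
}

type cachedRenderer struct {
	cache   map[string]output
	renders int // counts actual (uncached) render work
}

func (r *cachedRenderer) doRender(key string) output {
	r.renders++
	return output{html: "rendered " + key}
}

func (r *cachedRenderer) Render(key string) output {
	if out, ok := r.cache[key]; ok {
		return out
	}
	out := r.doRender(key)
	r.cache[key] = out
	return out
}

// RenderWithoutCache neither reads nor writes the cache, so the
// introspection pass no longer creates single-use entries.
func (r *cachedRenderer) RenderWithoutCache(key string) output {
	return r.doRender(key)
}

func main() {
	r := &cachedRenderer{cache: map[string]output{}}
	for i := 0; i < 3; i++ { // three prebuilds with identical parameters
		r.RenderWithoutCache("") // first render: introspection, skips cache
		r.Render("preset-hash")  // second render: final values, cached
	}
	// One cache entry total; only the final-values render is shared.
	fmt.Println(len(r.cache), r.renders)
}
```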

cstyan commented Dec 11, 2025

Closing; after further investigation we discovered that the underlying issue is actually:
a) no backoff if a previous prebuild has failed, which is compounded by
b) not validating that a template is valid for a given set of prebuild and preset definitions, which means that we can endlessly retry (on the configured interval) building the # of prebuilds defined by a preset even though they will never succeed

@cstyan cstyan closed this Dec 11, 2025
@github-actions github-actions bot locked and limited conversation to collaborators Dec 11, 2025