@kacpersaw commented on Dec 8, 2025

Summary

This adds configurable overload protection to the AI Bridge daemon to prevent the server from being overwhelmed during periods of high load.

Partially addresses coder/internal#1153 (rate limits and concurrency control; circuit breakers are deferred to a follow-up).

New Configuration Options

  • --aibridge-max-concurrency (CODER_AIBRIDGE_MAX_CONCURRENCY): Maximum number of concurrent AI Bridge requests. Set to 0 to disable (unlimited). Default: 0
  • --aibridge-rate-limit (CODER_AIBRIDGE_RATE_LIMIT): Maximum number of AI Bridge requests per second. Set to 0 to disable rate limiting. Default: 0

Behavior

When limits are exceeded:

  • Concurrency limit: Returns HTTP 503 Service Unavailable with message "AI Bridge is currently at capacity. Please try again later."
  • Rate limit: Returns HTTP 429 Too Many Requests with Retry-After header.

Both protections are optional and disabled by default (0 values).

Implementation

The overload protection is implemented as reusable middleware in coderd/httpmw/ratelimit.go:

  1. RateLimitByAuthToken: Per-user rate limiting that uses APITokenFromRequest to extract the authentication token, with a fallback to the X-Api-Key header for AI provider compatibility (e.g., Anthropic). Falls back to IP-based rate limiting if no token is present. Includes a Retry-After header for backpressure signaling.
  2. ConcurrencyLimit: Uses an atomic counter to track in-flight requests and reject when at capacity.
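A minimal sketch of the counter-based concurrency limiter described in point 2 (assuming the httpmw package layout; the body is illustrative, not the exact implementation):

package httpmw

import (
	"net/http"
	"sync/atomic"
)

// ConcurrencyLimit rejects requests with 503 once more than maxConcurrent
// requests are in flight. A maxConcurrent of 0 disables the limit.
func ConcurrencyLimit(maxConcurrent int64) func(http.Handler) http.Handler {
	var current atomic.Int64
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			if maxConcurrent <= 0 {
				next.ServeHTTP(w, r)
				return
			}
			c := current.Add(1)    // new value after incrementing
			defer current.Add(-1)  // always release the slot, even on panic
			if c > maxConcurrent { // strictly greater: exactly maxConcurrent requests may run
				http.Error(w, "AI Bridge is currently at capacity. Please try again later.", http.StatusServiceUnavailable)
				return
			}
			next.ServeHTTP(w, r)
		})
	}
}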

The middleware is applied in enterprise/coderd/aibridge.go via r.Group in the following order:

  1. Concurrency check (faster rejection for load shedding)
  2. Rate limit check
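A hypothetical sketch of that wiring (the constructor signatures, option variable names, and handler name here are assumptions for illustration):

// maxConcurrency and rateLimit would come from the new deployment options.
r.Group(func(r chi.Router) {
	r.Use(
		httpmw.ConcurrencyLimit(maxConcurrency), // cheap in-flight check first, so excess load is shed early
		httpmw.RateLimitByAuthToken(rateLimit),  // then per-caller rate limiting, with Retry-After on 429
	)
	r.Handle("/*", bridgeHandler) // placeholder for the AI Bridge proxy handler
})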

Note: Rate limiting currently applies to all AI Bridge requests, including pass-through requests. Ideally only actual interceptions should count, but this would require changes in the aibridge library.
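For completeness, a minimal sketch of the per-token rate limiter from point 1 above. The keying and the 429/Retry-After behavior follow the description; the internals (golang.org/x/time/rate, an in-memory limiter map) are assumptions, and a real implementation would also evict stale entries:

package httpmw

import (
	"net/http"
	"sync"

	"golang.org/x/time/rate"
)

// RateLimitByAuthToken allows up to requestsPerSecond requests per caller.
// Callers are keyed by API token (with an X-Api-Key fallback for providers
// such as Anthropic), or by remote address when no token is present.
// A limit of 0 disables rate limiting.
func RateLimitByAuthToken(requestsPerSecond int) func(http.Handler) http.Handler {
	var (
		mu       sync.Mutex
		limiters = map[string]*rate.Limiter{}
	)
	limiterFor := func(key string) *rate.Limiter {
		mu.Lock()
		defer mu.Unlock()
		l, ok := limiters[key]
		if !ok {
			l = rate.NewLimiter(rate.Limit(requestsPerSecond), requestsPerSecond)
			limiters[key] = l
		}
		return l
	}
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			if requestsPerSecond <= 0 {
				next.ServeHTTP(w, r)
				return
			}
			key := APITokenFromRequest(r) // standard Coder auth methods
			if key == "" {
				key = r.Header.Get("X-Api-Key") // Anthropic-style clients
			}
			if key == "" {
				key = r.RemoteAddr // unauthenticated: fall back to IP
			}
			if !limiterFor(key).Allow() {
				w.Header().Set("Retry-After", "1") // hint to retry after ~1 second
				http.Error(w, "You've been rate limited. Please try again later.", http.StatusTooManyRequests)
				return
			}
			next.ServeHTTP(w, r)
		})
	}
}

In this sketch the burst size simply equals the per-second limit; the actual middleware may tune that differently.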

Testing

Added comprehensive tests for:

  • Rate limiting by auth token (Bearer token, X-Api-Key, no token fallback to IP)
  • Different tokens not rate limited against each other
  • Disabled when limit is zero
  • Retry-After header is set on 429 responses
  • Concurrency limiting (allows within limit, rejects over limit, disabled when zero)

@kacpersaw changed the title from "feat(aibridged): add overload protection with rate limiting and concurrency control" to "feat(aibridge): add overload protection with rate limiting and concurrency control" on Dec 8, 2025
@kacpersaw changed the title from "feat(aibridge): add overload protection with rate limiting and concurrency control" to "feat(coderd): add overload protection with rate limiting and concurrency control" on Dec 8, 2025
@dannykopping left a comment

I'm not a fan of IP-based rate-limiting. It's brittle and lacks the specificity we need.

Limits need to apply per user. We could do that by using the API key in the request.

Also, a nit, your description says:

Fixes coder/internal#1153

But it only addresses the first two points AFAICS.

@kacpersaw marked this pull request as draft on December 9, 2025 at 08:26
@pawbana commented on Dec 9, 2025

Please update the PR description before merging (e.g., remove --aibridge-rate-window).

@dannykopping left a comment

Overall LGTM. Tested it against my deployment and it works as expected.

Can you please add an integration test (one per mechanism) just so we can be assured that this works with the whole API stack?

@dannykopping commented:

@kacpersaw I also think it's worth documenting how/why this differs from the normal API rate-limiting.

@kacpersaw (author) commented:

@dannykopping Regarding your question about > vs >= on line 132:

Using > is correct for "max concurrent" semantics. Here's why:

c := current.Add(1)  // Returns NEW value after incrementing
defer current.Add(-1)

if c > maxConcurrent {  // Reject if over limit

With maxConcurrent = 2:

  • Request 1: c = 1, check 1 > 2 → false → allowed (1 concurrent)
  • Request 2: c = 2, check 2 > 2 → false → allowed (2 concurrent)
  • Request 3: c = 3, check 3 > 2 → true → rejected

If we used >=:

  • Request 1: c = 1, check 1 >= 2 → false → allowed
  • Request 2: c = 2, check 2 >= 2 → true → rejected ✗ (we want 2 concurrent allowed!)

The test TestConcurrencyLimit/LimitsConcurrentRequests validates this behavior: it sets maxConcurrency = 2, starts 2 requests that block in their handlers, then verifies that the 3rd request gets a 503.
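In condensed form, that test has roughly this shape (using the illustrative ConcurrencyLimit signature from the sketch above; the real test also applies a testutil.WaitShort timeout, omitted here):

package httpmw

import (
	"net/http"
	"net/http/httptest"
	"sync"
	"testing"
)

func TestConcurrencyLimit_LimitsConcurrentRequests(t *testing.T) {
	release := make(chan struct{})
	var inHandler sync.WaitGroup
	inHandler.Add(2)

	handler := ConcurrencyLimit(2)(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		inHandler.Done()
		<-release // hold the slot until the test has issued the extra request
		w.WriteHeader(http.StatusOK)
	}))
	srv := httptest.NewServer(handler)
	defer srv.Close()

	// Occupy both slots with requests that block inside the handler.
	for i := 0; i < 2; i++ {
		go func() {
			resp, err := http.Get(srv.URL)
			if err == nil {
				resp.Body.Close()
			}
		}()
	}
	inHandler.Wait() // both requests are now in flight

	// The third concurrent request must be shed with a 503.
	resp, err := http.Get(srv.URL)
	if err != nil {
		t.Fatal(err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusServiceUnavailable {
		t.Fatalf("expected 503, got %d", resp.StatusCode)
	}
	close(release) // let the two blocked requests finish
}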

@dannykopping self-requested a review on December 9, 2025 at 12:29
@kacpersaw marked this pull request as ready for review on December 9, 2025 at 13:20
@kacpersaw (author) commented:

Regarding the ExtractAuthToken duplication comment:

We can't import enterprise/aibridged from coderd/httpmw due to the AGPL licensing restriction (AGPL code cannot import enterprise packages, as enforced by scripts/check_enterprise_imports.sh).

The current approach uses APITokenFromRequest from coderd/httpmw/apikey.go (which handles Bearer tokens, cookies, and standard Coder auth methods) with a small fallback for X-Api-Key header for AI provider compatibility (Anthropic uses this header).

The X-Api-Key fallback is only 3 lines, so the duplication is minimal and unavoidable given the package boundaries.
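Concretely, the duplicated fragment under discussion is roughly this shape (a sketch; the helper name is hypothetical, and it assumes APITokenFromRequest returns an empty string when no token is found):

package httpmw

import "net/http"

// authTokenKey is a hypothetical name for the rate-limit key extraction.
func authTokenKey(r *http.Request) string {
	if token := APITokenFromRequest(r); token != "" {
		return token // Bearer token, cookie, or other standard Coder auth methods
	}
	// AI providers such as Anthropic send the key in X-Api-Key instead.
	return r.Header.Get("X-Api-Key")
}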

@dannykopping replied, quoting the above:

Regarding the ExtractAuthToken duplication comment:

We can't import enterprise/aibridged from coderd/httpmw due to the AGPL licensing restriction (AGPL code cannot import enterprise packages, as enforced by scripts/check_enterprise_imports.sh).

The current approach uses APITokenFromRequest from coderd/httpmw/apikey.go (which handles Bearer tokens, cookies, and standard Coder auth methods) with a small fallback for X-Api-Key header for AI provider compatibility (Anthropic uses this header).

The X-Api-Key fallback is only 3 lines, so the duplication is minimal and unavoidable given the package boundaries.

We should extract that logic out, then. The duplication is not the problem; it's the lack of a single source of truth.

@dannykopping left a comment

LGTM once ExtractAPIKeyFromHeader is named/moved appropriately.

Commits

This adds configurable overload protection to the AI Bridge daemon to prevent
the server from being overwhelmed during periods of high load.

New configuration options:
- CODER_AIBRIDGE_MAX_CONCURRENCY: Maximum number of concurrent AI Bridge requests (0 to disable)
- CODER_AIBRIDGE_RATE_LIMIT: Maximum number of requests per rate window (0 to disable)
- CODER_AIBRIDGE_RATE_WINDOW: Duration of the rate limiting window (default: 1m)

When limits are exceeded:
- Concurrency limit: Returns HTTP 503 Service Unavailable
- Rate limit: Returns HTTP 429 Too Many Requests

The overload protection is implemented as middleware in coderd/httpmw/ratelimit.go
and applied to the aibridged HTTP handler:
- RateLimitByIP: Rate limiting keyed by IP address for endpoints without API key context
- ConcurrencyLimit: Limits concurrent requests using an atomic counter

Both protections are optional and disabled by default (0 values).

Fixes coder/internal#1153

Address review feedback from @dannykopping:

1. Changed RateLimitByIP to RateLimitByAuthToken which extracts the
   authentication token from the Authorization header (Bearer token) or
   X-Api-Key header. Falls back to IP-based rate limiting if no token
   is present.

2. Added Retry-After header to 429 responses for better backpressure
   signaling.

3. Added comprehensive tests for the new auth token-based rate limiting
   including tests for Bearer token, X-Api-Key, different tokens, and
   fallback to IP.

Changes:
1. Remove --aibridge-rate-window config option, hardcode to 1 minute for simpler
   configuration (rate limit is now 'requests per minute').

2. Use APITokenFromRequest instead of custom extractAuthToken function, with
   fallback to X-Api-Key header for AI provider compatibility (Anthropic).

3. Simplify rate limit error message to 'You've been rate limited. Please try
   again later.' (Retry-After header provides timing info).

4. Add TODO comment noting that rate limiting currently applies to all AI Bridge
   requests including pass-through. Ideally only interceptions should count,
   but this requires changes in the aibridge library.

5. Consolidate rate limit tests with table-driven approach and use time.Hour
   instead of time.Second for more reliable test behavior.

1. Move rate limiters to r.Group middleware instead of manually wrapping
   in HandleFunc (aibridge.go).

2. Improve concurrency test synchronization by using sync.WaitGroup as a
   barrier to ensure all requests are in the handler before testing rejection.
   Also use close(releaseHandler) instead of sending to each goroutine.

Per @pawbana's feedback, per-second is easier to reason about than per-minute.

1. Add 'per replica' to description for both max concurrency and rate limit
   to clarify these limits are replica-local.

2. Use consistent '(unlimited)' wording for disabled state in rate limit
   description.

3. Remove TODO comment about pass-through requests - they should be rate
   limited since they still consume resources.

4. Add timeout to concurrency test using testutil.WaitShort.

Add detailed documentation explaining the key differences:
- Extracts token from headers vs context (for internal auth handling)
- No bypass header support for Owners
- No per-endpoint keying (limit applies across all endpoints)
- Includes Retry-After header for backpressure

Creates a new non-enterprise coderd/aibridge package for AI Bridge
utilities. This properly encapsulates the auth token extraction function
that's specific to AI provider authentication patterns (Bearer token and
X-Api-Key headers).

- Add coderd/aibridge/aibridge.go with ExtractAuthToken function
- Update coderd/httpmw/ratelimit.go to use aibridge.ExtractAuthToken
- Update enterprise/aibridged to use aibridge.ExtractAuthToken
- Remove ExtractAPIKeyFromHeader from coderd/httpmw/apikey.go

Adds two integration tests that verify the overload protection middleware
works correctly with the full API stack:

- TestAIBridgeRateLimiting: Verifies requests are rate limited (429) after
  exceeding the configured limit, with Retry-After header set.
- TestAIBridgeConcurrencyLimiting: Verifies concurrent requests beyond the
  limit are rejected (503) while allowing requests up to the limit.
@kacpersaw force-pushed the kacpersaw/aibridge-overload-protection branch from 7aed14b to 1317128 on December 11, 2025 at 10:51
@pawbana self-requested a review on December 11, 2025 at 11:06
@kacpersaw merged commit 6f86f67 into main on Dec 11, 2025 (54 of 56 checks passed)
@kacpersaw deleted the kacpersaw/aibridge-overload-protection branch on December 11, 2025 at 15:38
The github-actions bot locked and limited the conversation to collaborators on Dec 11, 2025