Skip to content

Conversation

@mafredri
Copy link
Member

@mafredri mafredri commented Dec 5, 2025

It's common to create a context early in a test body, then do setup work
unrelated to that context. By the time the context is actually used, it
may have already timed out. This was detected as test failures in #21091.

The new Context function returns a context that resets its timeout when
accessed from new lines in the test file. The timeout does not begin until
the context is first used (lazy initialization).

This is useful for integration tests that pass contexts through various
subsystems, where each subsystem should get a fresh timeout window.

Key behaviors:

  • Timer starts on first Done, Deadline, or Err call
  • Value does not trigger initialization
  • Each unique line in a _test.go file resets the timeout window
  • Same-line access (e.g., in loops) does not reset
  • Expired contexts cannot be resurrected

Limitations:

  • Wrapping with a child context (e.g., context.WithCancel) prevents
    reset since the childs methods shadow the parent
  • Storing the Done channel circumvents resets

The original fixed-timeout behavior is available via ContextFixed.


Arguably, it might suffice to simplify this implementation to only
implement the start timer on first use functionality. But given our
massively parallel test-suite, I'm biased towards enabling the reset as
well.

@mafredri mafredri force-pushed the mafredri/lazy-timeout-context branch 2 times, most recently from 9896675 to e6e4582 Compare December 5, 2025 13:00
@mafredri mafredri marked this pull request as ready for review December 8, 2025 14:13
@mafredri mafredri force-pushed the mafredri/lazy-timeout-context branch 3 times, most recently from 195362e to c9238e2 Compare December 8, 2025 15:59
It's common to create a context early in a test body, then do setup work
unrelated to that context. By the time the context is actually used, it
may have already timed out. This was detected as test failures in #21091.

The new Context() function returns a context that resets its timeout when
accessed from new lines in the test file. The timeout does not begin until
the context is first used (lazy initialization).

This is useful for integration tests that pass contexts through many
subsystems, where each subsystem should get a fresh timeout window.

Key behaviors:
- Timer starts on first Done(), Deadline(), or Err() call
- Value() does not trigger initialization (used for tracing/logging)
- Each unique line in a _test.go file gets a fresh timeout window
- Same-line access (e.g., in loops) does not reset
- Expired contexts cannot be resurrected

Limitations:
- Wrapping with a child context (e.g., context.WithCancel) prevents resets
  since the child's methods don't call through to the parent
- Storing the Done() channel prevents resets on subsequent accesses

The original fixed-timeout behavior is available via ContextFixed().
@mafredri mafredri force-pushed the mafredri/lazy-timeout-context branch from c9238e2 to 166e3b0 Compare December 8, 2025 16:02
@deansheather
Copy link
Member

We can't change the timeout for each individual call which is unfortunate. I know a lot of tests just use a single long context for the entire test, but some tests do have multiple contexts with different timeouts for different calls.

Also, the behavior of resetting on every new call means we can't set test-wide timeouts anymore (except with the -timeout flag on go test, which is intentionally very large to account for the entire test suite).

Maybe we need a different API to solve these things? Here's something off the top of my head:

func TestBlah(t *testing.T) {
  xctx := testutil.ContextProvider(t, testutil.WaitLong) // full test timeout, lazily creates a parent context when we hit the first call
  // ...
  err := MyFunction(xctx.New(testutil.WaitMedium), param) // makes a new "child" context from the parent
}



func testutil.ContextProvider(t *testing.T, testTimeout time.Duration) *testutil.ParentContext {
  return &testutil.ParentContext{
    testTimeout: testTimeout,
  }
}

type testutil.ParentContext struct {
  testTimeout time.Duration

  // todo: synchronization fields...
  ctx context.Context
}

func (c *testutil.ParentContext) New(timeout time.Duration) context.Context {
  c.ensureInitialized() // ensures c.ctx is set
  return context.WithTimeout(c.ctx, timeout)
}

// todo: Other methods as needed, e.g. NewWithCancel, NewWithDeadline

^ Which would just be a very light wrapper around the functions in the context package.

It's a shame that we have tests that initialize the context early followed by heavy non-context consuming startup cost. Does this mostly just stem from tests using coderdtest?

Copy link
Member

@deansheather deansheather left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment above oops

@mafredri
Copy link
Member Author

mafredri commented Dec 9, 2025

Thanks for taking a look and the discussion @deansheather, I didn't make it clear in the ticket but this is more of a design/proposal to spark discussion.

Also, the behavior of resetting on every new call means we can't set test-wide timeouts anymore (except with the -timeout flag on go test, which is intentionally very large to account for the entire test suite).

This is a valid concern. The reasoning is that in a massively parallel test suite, it's hard to reason about what length of timeout is actually valid, and rather move the thought process to "maximum time we want this test to sit idle".

Do keep in mind that call-site is tracked, so you do get an approximate timeout for the test, even though it's not exactly 15 seconds, perhaps more like 25 seconds.

Granted, this implementation is not a great way to shift the thinking from "max test time" to "max idle time".

Maybe we need a different API to solve these things? Here's something off the top of my head:

This is an interesting alternative, and one we could probably enforce/lint. (It's very hard to lint the referenced bug-case without also causing false positives, which sparked this PR.)

Taking some inspiration from you proposal, I could see an more ergonomic alternative that also functions as a drop-in.

package testutil

var _ context.Context = (*TestContext)(nil)

func Context(...) *TestContext { ... } // always lazy start (or option?)

type TestContext struct{}

func (tc *TestContext) Cancel() { ... }                  // we were missing this previously
func (tc *TestContext) New(timeout) *TestContext { ... } // make timeout optional?
func (tc *TestContext) Reset() *TestContext { ... }      // resets the timer

func (tc *TestContext) Join(ctx ...context.Context) *TestContext { ... } // sometimes you have two independent contexts, but can only pass one along
// more?

This would still allow the most basic use to be ctx := testutil.Context(t, testutil.WaitMedium) whilst using it as a normal context.

Copy link
Contributor

My opinion is that this is a lot of complexity for a relatively narrow benefit.

in a massively parallel test suite, it's hard to reason about what length of timeout is actually valid, and rather move the thought process to "maximum time we want this test to sit idle".

This trades one thing that is hard to reason about for something that is much harder. Speaking about the original lazy context, when you make a call in the test function and pass the context, it is very difficult to know whether it will trigger a reset. If the call synchronously "uses" the context, then it is reset, but if it spawns a goroutine that "uses" the context, then it is not.

Even if the resets are explicit, like in your second proposal, the question of what duration to use is still hard to reason about. Like, if your test has steps A, B, C, and D with resets in between, with the current, basic context you have to estimate the sum of the durations. With resets you have to estimate the max. Have we actually made life easier for test authors? Dubious, IMO.

@mafredri
Copy link
Member Author

@spikecurtis thanks for taking a look and for the feedback. You raise valid concerns and I honestly don’t disagree with you on any point.

It seems the only concept here that might hold some value is the delayed start, but is it worth the decrease in predictability? Probably not.

I briefly considered flipping the logic, error of context is not used within 1/2 or 2/3 of the timeout. But now we’re just breaking valid use-cases. Maybe something takes a context but doesn’t use it and takes long enough to process you hit the error. Rejected.

One more option comes to mind, swapping out the implementation via build tag allowing you to run a form of ”lint” on the test-suite, where we could enforce the error case. But again, having this code in the code base simply might not be worth catching the few weird flakes we run into.

This was a fun thought experiment, thanks for participating! Unless you guys have any better ideas, I’m going to close this one out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants