Skip to content

flake: TestAgent_Lifecycle/ShutdownTimeout #1179

@flake-investigator

Description

@flake-investigator

CI Run Link: https://github.com/coder/coder/actions/runs/20029668075
Failing Job: ci / test-go-pg (macos-latest)
Completed at: 2025-12-08T13:29:57Z (within minutes of Slack alert)
Run attempt: 1

Commit Info:

Root Cause Classification: Flaky test (timing-dependent lifecycle state sequence)

Failure Evidence (from macOS job logs):

--- FAIL: TestAgent_Lifecycle/ShutdownTimeout (4.36s)
    agent_test.go:1597:
        Error Trace: /Users/runner/work/coder/coder/agent/agent_test.go:1597
        Error:        Not equal:
                      expected: []codersdk.WorkspaceAgentLifecycle{"starting", "ready", "shutting_down", "shutdown_timeout"}
                      actual  : []codersdk.WorkspaceAgentLifecycle{"starting", "ready", "shutting_down", "shutting_down"}
        Test:         TestAgent_Lifecycle/ShutdownTimeout
...
2025-12-08 13:28:44.663 [warn]  agent: shutdown script(s) failed ...
        error= run agent script "00000000-0000-0000-0000-000000000000":
                  - script timed out

Relevant test code (agent/agent_test.go around the failing subtest):

  • Subtest expects sequence: starting -> ready -> shutting_down -> shutdown_timeout
  • The shutdown script intentionally times out (Timeout: 1ms; Script: "sleep 3").
  • Assertion: require.Equal(t, want, got[:len(want)]) after waiting for ShutdownTimeout to appear in lifecycle states.

Data Race Check: No WARNING: DATA RACE or race detected during execution of test found in logs.
Process Crash/OOM Check: No panic, OOM, or resource exhaustion indicators found.
Matrix Cancellation Artifact: Not applicable. Only macOS job failed; other matrix jobs were cancelled (run_attempt = 1).

Comprehensive Duplicate Search (coder/internal):

  • Query: "TestAgent_Lifecycle" -> Found flake: TestAgent_Lifecycle/ShutdownScriptOnce #576 (ShutdownScriptOnce) [closed]
  • Query: "ShutdownTimeout" -> No results
  • Query: "shutdown_timeout shutting_down" -> No results
  • Query: "InmemoryListener is already closed" -> Related to shutdown errors in other tests, but not this specific subtest
    Conclusion: No existing issue for this specific flake.

Precise Assignment Analysis:

  • Primary (git blame on specific function lines): Not available via automation here.
  • Secondary (recent test file contributors): agent/agent_test.go has multiple recent, meaningful changes by component owners, including Spike Curtis (e.g., commit 5807fe01e4c46aeb4de5680f1e793a2b8d20914b: coder/coder@5807fe0), as well as Danielle/Mathias in adjacent agent lifecycle/devcontainer areas.
  • Tertiary (component ownership): This is in the agent lifecycle area; assigning to Spike Curtis for triage.

Analysis:

  • The lifecycle state remained "shutting_down" and did not progress to "shutdown_timeout" within the expected window, even though the shutdown script timed out. This suggests a timing/ordering flake in lifecycle state reporting on macOS.

Reproduction Hints:

  • Re-run ci / test-go-pg (macos-latest) or locally run go test ./agent -run TestAgent_Lifecycle/ShutdownTimeout on macOS. Intermittent.

Next Steps:

  • Investigate lifecycle transition from ShuttingDown -> ShutdownTimeout on macOS.
  • Consider relaxing the equality assertion to tolerate transient duplicate states, or ensure the state machine reliably emits ShutdownTimeout before teardown.

Metadata

Metadata

Assignees

Labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions