Description
CI Run Link: https://github.com/coder/coder/actions/runs/20029668075
Failing Job: ci / test-go-pg (macos-latest)
Completed at: 2025-12-08T13:29:57Z (within minutes of Slack alert)
Run attempt: 1
Commit Info:
- SHA: 52243557a26320ec21be0cf4cbc6be2d2135ac68
- Author: Ehab Younes
- Link: coder/coder@5224355
Root Cause Classification: Flaky test (timing-dependent lifecycle state sequence)
Failure Evidence (from macOS job logs):
--- FAIL: TestAgent_Lifecycle/ShutdownTimeout (4.36s)
agent_test.go:1597:
Error Trace: /Users/runner/work/coder/coder/agent/agent_test.go:1597
Error: Not equal:
expected: []codersdk.WorkspaceAgentLifecycle{"starting", "ready", "shutting_down", "shutdown_timeout"}
actual : []codersdk.WorkspaceAgentLifecycle{"starting", "ready", "shutting_down", "shutting_down"}
Test: TestAgent_Lifecycle/ShutdownTimeout
...
2025-12-08 13:28:44.663 [warn] agent: shutdown script(s) failed ...
error= run agent script "00000000-0000-0000-0000-000000000000":
- script timed out
Relevant test code (agent/agent_test.go around the failing subtest):
- Subtest expects sequence: starting -> ready -> shutting_down -> shutdown_timeout
- The shutdown script intentionally times out (Timeout: 1ms; Script: "sleep 3").
- Assertion: require.Equal(t, want, got[:len(want)]) after waiting for ShutdownTimeout to appear in lifecycle states.
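A minimal stand-alone sketch of why this assertion shape is sensitive to duplicate reports. This is not the test's code: Lifecycle is a hypothetical stand-in for codersdk.WorkspaceAgentLifecycle, prefixMatches is a hypothetical helper that mimics the got[:len(want)] comparison, and the five-element got slice is an assumed example of what the agent might have reported (the log only shows the first four states).

```go
package main

import (
	"fmt"
	"slices"
)

// Lifecycle is a hypothetical stand-in for codersdk.WorkspaceAgentLifecycle.
type Lifecycle string

// prefixMatches mimics the shape of the assertion in the subtest:
// want is compared against the first len(want) reported states only.
func prefixMatches(want, got []Lifecycle) bool {
	if len(got) < len(want) {
		return false
	}
	return slices.Equal(want, got[:len(want)])
}

func main() {
	want := []Lifecycle{"starting", "ready", "shutting_down", "shutdown_timeout"}

	// Assumed report sequence: "shutting_down" is reported twice before
	// the final state arrives.
	got := []Lifecycle{"starting", "ready", "shutting_down", "shutting_down", "shutdown_timeout"}

	// The wait condition (final state has appeared) is satisfied...
	fmt.Println("final state seen:", slices.Contains(got, want[len(want)-1]))
	// ...but the prefix comparison still fails because of the duplicate.
	fmt.Println("prefix matches:", prefixMatches(want, got))
}
```

Under these assumptions, the wait for ShutdownTimeout can succeed while the prefix comparison still fails, which would match the expected/actual pair in the log above.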
Data Race Check: No WARNING: DATA RACE or race detected during execution of test found in logs.
Process Crash/OOM Check: No panic, OOM, or resource exhaustion indicators found.
Matrix Cancellation Artifact: Not applicable. Only macOS job failed; other matrix jobs were cancelled (run_attempt = 1).
Comprehensive Duplicate Search (coder/internal):
- Query: "TestAgent_Lifecycle" -> Found flake: TestAgent_Lifecycle/ShutdownScriptOnce #576 (ShutdownScriptOnce) [closed]
- Query: "ShutdownTimeout" -> No results
- Query: "shutdown_timeout shutting_down" -> No results
- Query: "InmemoryListener is already closed" -> Related to shutdown errors in other tests, but not this specific subtest
Conclusion: No existing issue for this specific flake.
Precise Assignment Analysis:
- Primary (git blame on specific function lines): Not available via automation here.
- Secondary (recent test file contributors): agent/agent_test.go has multiple recent, meaningful changes by component owners, including Spike Curtis (e.g., commit 5807fe01e4c46aeb4de5680f1e793a2b8d20914b: coder/coder@5807fe0), as well as Danielle/Mathias in adjacent agent lifecycle/devcontainer areas.
- Tertiary (component ownership): This is in the agent lifecycle area; assigning to Spike Curtis for triage.
Analysis:
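- The first four reported states contained "shutting_down" twice, so "shutdown_timeout" was not at the expected position, even though the shutdown script timed out as intended. Because the assertion compares only got[:len(want)] after ShutdownTimeout has appeared, a duplicated "shutting_down" report may be enough to fail the subtest even if the final state was eventually reported. This suggests a timing/ordering flake in lifecycle state reporting on macOS.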
Reproduction Hints:
- Re-run ci / test-go-pg (macos-latest), or locally run go test ./agent -run TestAgent_Lifecycle/ShutdownTimeout on macOS. The failure is intermittent.
Next Steps:
- Investigate lifecycle transition from ShuttingDown -> ShutdownTimeout on macOS.
- Consider relaxing the equality assertion to tolerate transient duplicate states, or ensure the state machine reliably emits ShutdownTimeout before teardown.
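One way the relaxed assertion could look, as a rough sketch only. collapseRepeats is a hypothetical helper that does not exist in coder/coder, and Lifecycle is a hypothetical stand-in for codersdk.WorkspaceAgentLifecycle. It relies on slices.Compact from the Go standard library (Go 1.21+), which removes consecutive duplicate elements.

```go
package main

import (
	"fmt"
	"slices"
)

// Lifecycle is a hypothetical stand-in for codersdk.WorkspaceAgentLifecycle.
type Lifecycle string

// collapseRepeats (hypothetical) returns a copy of states in which each run
// of identical consecutive states is reduced to a single entry. The input
// slice is left unmodified.
func collapseRepeats(states []Lifecycle) []Lifecycle {
	return slices.Compact(slices.Clone(states))
}

func main() {
	want := []Lifecycle{"starting", "ready", "shutting_down", "shutdown_timeout"}
	// Assumed report sequence with a transient duplicate.
	got := []Lifecycle{"starting", "ready", "shutting_down", "shutting_down", "shutdown_timeout"}

	compacted := collapseRepeats(got)
	ok := len(compacted) >= len(want) && slices.Equal(want, compacted[:len(want)])
	fmt.Println("matches after collapsing repeats:", ok)
}
```

The trade-off is that collapsing repeats would also hide a genuine double report of the same state, so this is only appropriate if duplicate reports are considered harmless; otherwise fixing the emission order in the agent is the safer route.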