flake: TestTasks/UpdateInput/TaskStatusError #1178

Closed

Opened by flake-investigator (bot)

CI Failure Details

CI Run Link: https://github.com/coder/coder/actions/runs/20016436160
Job: test-go-pg (macos-latest)
Timestamp: 2025-12-08T04:25Z (same minute as Slack alert)
Commit: 25400fedca9661de43031d6262cb47ee342da03a by Jake Howell

Failing Test

Package: github.com/coder/coder/v2/coderd
Test: TestTasks/UpdateInput/TaskStatusError
Location: coderd/aitasks_test.go:819 (macOS)

Error Evidence

=== FAIL: coderd TestTasks/UpdateInput/TaskStatusError (2.59s)
    aitasks_test.go:819:
        Error:      Received unexpected error:
                    PATCH http://127.0.0.1:54206/api/v2/workspacebuilds/9c22fa2d-a7ba-4f32-a94b-8edbb785588d/cancel?expect_status=: unexpected status code 400: Job has already completed!
        Test:       TestTasks/UpdateInput/TaskStatusError

More context: the logs show the build was created and reached succeeded quickly; the cancel attempt that followed returned 400 "Job has already completed!".

Root Cause Classification

Flaky Test (timing-dependent)
The test sets cancelTransition=true and then issues CancelWorkspaceBuild expecting to cancel the in-flight START transition. On macOS CI, the transition can complete before cancel executes, producing 400 "Job has already completed!".
Not infra, not a data race, not a process crash.

Duplicate Search (coder/internal)

Queried: "TestTasks/UpdateInput/TaskStatusError", "Job has already completed!", "aitasks_test.go", "TestTasks"
Found a related but different closed issue: flake: TestTasks/Logs/UpstreamError #1067
No existing issue for UpdateInput/TaskStatusError; this appears to be a new, different failure mode.

Precise Assignment Analysis (test blame)

The failing subtest lives under TestTasks -> UpdateInput -> "TaskStatusError" block.
History via commit diffs:
- Added the UpdateInput test block (including cancelTransition logic) in 82f525baf36a2341bc92c2f6b6a27cc565d28a08 (feat(coderd): add task prompt modification endpoint) — author: Danielle Maywood.
- Latest modifications to this block in b255827a5269f767c2dba476c7189ef6157ff574 (chore: promote tasks to stable from experimental) — also by Danielle Maywood.
Based on last modification of the failing test lines, ownership points to Danielle Maywood.

Suggested Fix Direction

Make the test resilient to rapid state transitions:
- After creating the transition, poll the build status and only attempt cancel if status is running/pending; otherwise assert completed and proceed.
- Alternatively, accept 400 "already completed" as a valid outcome for the TaskStatusError scenario or use expect_status to tolerate either canceled or completed.
- Introduce a brief synchronization (e.g., check that build has entered running) before calling CancelWorkspaceBuild to reduce races.
Keep the test using testutil.WaitLong but avoid relying on cancellation timing.
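
A minimal sketch of the first two suggestions combined, written against a self-contained fake rather than the real test harness. Everything named here (JobStatus, BuildAPI, ErrAlreadyCompleted, cancelIfInFlight, fakeBuild) is hypothetical and chosen for illustration; none of it is taken from coderd or codersdk. The idea is to attempt cancel only while the job is pending or running, and to treat an "already completed" response as a non-fatal outcome, since the job can still finish between the status check and the cancel call.

```go
package main

import (
	"errors"
	"fmt"
)

// JobStatus is a hypothetical stand-in for the provisioner job status.
type JobStatus string

const (
	StatusPending   JobStatus = "pending"
	StatusRunning   JobStatus = "running"
	StatusSucceeded JobStatus = "succeeded"
	StatusCanceled  JobStatus = "canceled"
)

// ErrAlreadyCompleted models the 400 "Job has already completed!" response.
var ErrAlreadyCompleted = errors.New("job has already completed")

// BuildAPI is a hypothetical interface over the two calls the test needs.
type BuildAPI interface {
	Status() JobStatus
	Cancel() error
}

// cancelIfInFlight attempts a cancel only while the job is pending or
// running. It reports whether the build was actually canceled, and treats
// "already completed" as a valid outcome rather than a test failure.
func cancelIfInFlight(api BuildAPI) (bool, error) {
	switch api.Status() {
	case StatusPending, StatusRunning:
		// Still in flight, so a cancel attempt is worthwhile.
	default:
		return false, nil
	}
	err := api.Cancel()
	if errors.Is(err, ErrAlreadyCompleted) {
		// The job finished between the status check and the cancel call.
		return false, nil
	}
	if err != nil {
		return false, err
	}
	return true, nil
}

// fakeBuild simulates a build for the demonstration below.
type fakeBuild struct {
	status JobStatus
	// completeBeforeCancel simulates the job finishing after the status
	// check but before the cancel request is processed.
	completeBeforeCancel bool
}

func (f *fakeBuild) Status() JobStatus { return f.status }

func (f *fakeBuild) Cancel() error {
	if f.completeBeforeCancel {
		f.status = StatusSucceeded
	}
	if f.status != StatusPending && f.status != StatusRunning {
		return ErrAlreadyCompleted
	}
	f.status = StatusCanceled
	return nil
}

func main() {
	builds := []*fakeBuild{
		{status: StatusRunning},                             // cancel wins
		{status: StatusRunning, completeBeforeCancel: true}, // job wins the race
		{status: StatusSucceeded},                           // already done
	}
	for _, b := range builds {
		canceled, err := cancelIfInFlight(b)
		fmt.Println(canceled, err, b.status)
	}
}
```

In the real test, the caller would branch on the returned boolean: assert the canceled state when it is true, and assert the completed state otherwise, so neither ordering fails the test.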

Reproduction Hints

On macOS runner: go test ./coderd -run 'TestTasks/UpdateInput/TaskStatusError' -count=20
This is timing-sensitive; reproduces intermittently when build completes before cancel.

Related Issues

flake: TestTasks/Logs/UpstreamError #1067 (flake in the same file, different subtest family)

Quality Checklist

Identified exact failing test and captured error output
Verified not a matrix cancellation artifact (run_attempt=1; only the Windows job was canceled, due to the macOS failure)
No race/panic/OOM signatures in logs
Searched coder/internal for duplicates with multiple queries
Assignment based on last modification of the failing test block

Metadata

Assignees

DanielleMaywood
