chore: abstract pg test logic and double runner sizes #21091

dannykopping · 2025-12-04T09:22:16Z

This PR does two things, both in service of helping to (hopefully!) speed up CI:

- abstracts the parallelism logic into a common action and has all PG-related jobs use it
- doubles runner sizes from 8->16 CPUs & 32->64GiB RAM* and concomitantly increases parallelism

I only focused on the PG-related jobs since they are generally slowest & most RAM-intensive.

* test-go-race-pg doubles from 16->32 CPUs & 64->128GiB RAM and likewise for the Windows runners; macOS runners have only one size

NOTE: don't treat the speed of the PG-related jobs in this PR's CI run as indicative. Tests run outside main may use the cache, so run times may look artificially low.

dannykopping · 2025-12-04T09:57:41Z

Hmm, I'm thinking the increased parallelism might be causing these two (newly-created) flakes:

coder/internal#1174
coder/internal#1173

There might be some contention / starvation occurring. Looking into it...

It's common to create a context early in a test body, then do setup work unrelated to that context. By the time the context is actually used, it may have already timed out. This was detected as test failures in #21091.

The new Context() function returns a context that resets its timeout when accessed from new lines in the test file. The timeout does not begin until the context is first used (lazy initialization). This is useful for integration tests that pass contexts through many subsystems, where each subsystem should get a fresh timeout window.

Key behaviors:
- Timer starts on first Done(), Deadline(), or Err() call
- Value() does not trigger initialization (used for tracing/logging)
- Each unique line in a _test.go file gets a fresh timeout window
- Same-line access (e.g., in loops) does not reset
- Expired contexts cannot be resurrected

Limitations:
- Wrapping with a child context (e.g., context.WithCancel) prevents resets since the child's methods don't call through to the parent
- Storing the Done() channel prevents resets on subsequent accesses

The original fixed-timeout behavior is available via ContextFixed().

dannykopping · 2025-12-05T13:06:44Z

Haven't identified the source of the contention yet, but at least #21121 will prevent these tests from flaking.

spikecurtis · 2025-12-10T09:24:02Z

.github/workflows/ci.yaml

+          postgres-version: "13"
+          # Our macOS runners have 8 cores.
+          test-parallelism-packages: "8"
+          test-parallelism-tests: "16"


Linux has 16 cores and 16x8 parallelism, but macOS has 8 cores and 8x16 parallelism. That seems wrong, since in both cases you can have 128 tests running concurrently.

dannykopping · 2025-12-11

We can't scale macOS any further, and for Linux I just naïvely doubled the package parallelism since we now have double the CPUs.

It was like this before:

elif [ "${RUNNER_OS}" == "macOS" ]; then
  # Our macOS runners have 8 cores. We set NUM_PARALLEL_TESTS to 16
  # because the tests complete faster and Postgres doesn't choke. It seems
  # that macOS's tmpfs is faster than the one on Windows.
  export TEST_NUM_PARALLEL_PACKAGES=8
  export TEST_NUM_PARALLEL_TESTS=16
  # Only the CLI and Agent are officially supported on macOS and the rest are too flaky
  export TEST_PACKAGES="./cli/... ./enterprise/cli/... ./agent/..."
elif [ "${RUNNER_OS}" == "Linux" ]; then
  # Our Linux runners have 8 cores.
  export TEST_NUM_PARALLEL_PACKAGES=8
  export TEST_NUM_PARALLEL_TESTS=8
fi

Are you suggesting I bump test-parallelism-tests to 16 for Linux as well? i.e. 256 parallelism.
That would be quadruple what we had before (8*8), whereas I was attempting to keep the resource-to-parallelism scaling linear.
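The arithmetic behind this exchange, as a small sketch. The `runnerConfig` type and its methods are hypothetical names used only for illustration; the core counts and parallelism values are the ones quoted in this thread, and the upper bound assumes every package runs its full quota of parallel tests at once.

```go
package main

import "fmt"

// runnerConfig is a hypothetical container for the two knobs discussed here.
type runnerConfig struct {
	name     string
	cores    int
	packages int // test-parallelism-packages
	tests    int // test-parallelism-tests
}

// maxConcurrent is the upper bound on tests running at the same time.
func (r runnerConfig) maxConcurrent() int { return r.packages * r.tests }

// perCore is that upper bound divided by the number of cores.
func (r runnerConfig) perCore() float64 {
	return float64(r.maxConcurrent()) / float64(r.cores)
}

func main() {
	configs := []runnerConfig{
		{"Linux, before", 8, 8, 8},
		{"Linux, packages doubled", 16, 16, 8},
		{"Linux, tests also at 16", 16, 16, 16},
		{"macOS, unchanged", 8, 8, 16},
	}
	for _, c := range configs {
		fmt.Printf("%-26s %3d concurrent, %4.1f per core\n",
			c.name, c.maxConcurrent(), c.perCore())
	}
}
```

Read this way, doubling only the package parallelism keeps Linux at 8 tests per core, while macOS sits at 16 per core, which is the inconsistency being pointed out.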

spikecurtis · 2025-12-11

If anything I'd cut the macOS parallelism. If you leave it as is, then maybe add a comment explaining that the numbers were kinda determined empirically, at the point where things don't break horribly.

dannykopping · 2025-12-11

https://github.com/coder/coder/blob/main/.github/workflows/ci.yaml#L467-L472

I haven't changed the parallelism (sorry, it's hard to track changes because of the reorganisation); if you're aware that it's the same, why do you want to cut parallelism?

According to this, it seems kinda OK?

spikecurtis · 2025-12-11

Ideally we'd have some theoretical consistency: a model of parallelism that maps to CPU cores.

Cutting macOS parallelism would align with that consistent model and be easier to reason about. I guess upping Linux parallelism would also be consistent, but I'm gun-shy about increasing things and potentially causing more flakes.

In the absence of consistency, we can just document what we've observed and 🤷

dannykopping · 2025-12-11

Cool, documented in 983515f 👍

> Ideally we'd have some theoretical consistency: a model of parallelism that maps to CPU cores.

Let's link up next week to try to reason through this and develop a heuristic that sets the parallelism automatically and consistently across all these different platforms & jobs?

mafredri

Looking forward to seeing speed improvements, nice work! I'm also interested in experimenting with PostgreSQL settings on Windows and Mac. Perhaps we can eliminate the ramdisk entirely, as it should only be making things slower given a well-configured PG.

mafredri · 2025-12-10T16:04:51Z

.github/workflows/ci.yaml

+        if: runner.os == 'macOS'
+        shell: bash
+        run: |
+          # Postgres runs faster on a ramdisk on macOS.


Have we verified this recently? I'd especially be interested in adjusting PostgreSQL settings to see if we can alleviate it rather than using ramdisk. We simply need to increase RAM retention for PG on macOS and it should be more efficient than placing both storage and cache in RAM.

Guessing this applies to Windows as well.

dannykopping · 2025-12-11

I'm not changing anything about this right now. We can follow this PR with some PG changes.

github-actions bot assigned dannykopping Dec 4, 2025

dannykopping changed the title ~~chore: abstract pg test logic, double runner sizes~~ chore: abstract pg test logic and double runner sizes Dec 4, 2025

dannykopping marked this pull request as ready for review December 4, 2025 09:40

dannykopping requested a review from jdomeracki-coder as a code owner December 4, 2025 09:40

dannykopping requested review from Emyrk and spikecurtis December 4, 2025 09:40

mafredri mentioned this pull request Dec 5, 2025

feat(testutil): add lazy timeout context with location-based reset #21120

Open

dannykopping added 3 commits December 5, 2025 15:15

chore: abstract pg test logic, increase runner sizes

e60da78

Signed-off-by: Danny Kopping <danny@coder.com>

chore: make lint

86246ba

Signed-off-by: Danny Kopping <danny@coder.com>

chore: bump windows

211e727

Signed-off-by: Danny Kopping <danny@coder.com>

dannykopping force-pushed the dk/fat-ci-bois branch from 4fd7e1b to 211e727 Compare December 5, 2025 13:16

spikecurtis reviewed Dec 10, 2025

Emyrk reviewed Dec 10, 2025

mafredri reviewed Dec 10, 2025

dannykopping added 6 commits December 11, 2025 06:38

Merge branch 'main' of github.com:/coder/coder into dk/fat-ci-bois

7e3702f

chore: fix nonsensical Windows comment & add more detail to other comments

4af1430

Signed-off-by: Danny Kopping <danny@coder.com>

chore: max 1 test per core

7040c95

Signed-off-by: Danny Kopping <danny@coder.com>

chore: bash improvements

6cd78b7

Signed-off-by: Danny Kopping <danny@coder.com>

chore: align nightly-gauntlet.yaml with ci.yaml, only run mac/windows tests on main

6c691c4

Signed-off-by: Danny Kopping <danny@coder.com>

chore: only run mac/windows jobs on main

f4d2a44

Signed-off-by: Danny Kopping <danny@coder.com>

dannykopping force-pushed the dk/fat-ci-bois branch from 5762ad6 to f4d2a44 Compare December 11, 2025 05:50

dannykopping requested review from mafredri and spikecurtis December 11, 2025 06:07

chore: restore mac/windows steps on PRs

9f29ad0

Signed-off-by: Danny Kopping <danny@coder.com>

spikecurtis approved these changes Dec 11, 2025

chore: document high macos parallelism

983515f

Signed-off-by: Danny Kopping <danny@coder.com>

dannykopping enabled auto-merge (squash) December 11, 2025 09:58

dannykopping merged commit 84b7a03 into main Dec 11, 2025
32 checks passed

dannykopping deleted the dk/fat-ci-bois branch December 11, 2025 10:12

github-actions bot locked and limited conversation to collaborators Dec 11, 2025

5 participants
