Skip to content

Conversation

@dannykopping
Copy link
Contributor

@dannykopping dannykopping commented Dec 4, 2025

This PR does two things, both in service of helping to (hopefully!) speed up CI:

  1. abstracts the parallelism logic into a common action and has all PG-related jobs use it
  2. doubles runner sizes from 8->16 CPUs & 32->64GiB RAM* and concomitantly increases parallelism

I only focused on the PG-related jobs since they are generally slowest & most RAM-intensive.

image

* test-go-race-pg doubles from 16->32 CPUs & 64->128GiB RAM and likewise for the Windows runners; MacOS runners have only one size

NOTE: don't use the speed of the PG-related jobs in this PR's CI run as indicative. Tests run outside main may use cache, so the speed may seem artificially low.

@dannykopping dannykopping changed the title chore: abstract pg test logic, double runner sizes chore: abstract pg test logic and double runner sizes Dec 4, 2025
@dannykopping dannykopping marked this pull request as ready for review December 4, 2025 09:40
@dannykopping
Copy link
Contributor Author

Hhmm I'm thinking the increased parallelism might be causing these two (newly-created) flakes:

coder/internal#1174
coder/internal#1173

There might be some contention / starvation occurring. Looking into it...

mafredri added a commit that referenced this pull request Dec 5, 2025
It's common to create a context early in a test body, then do setup work
unrelated to that context. By the time the context is actually used, it
may have already timed out. This was detected as test failures in #21091.

The new Context() function returns a context that resets its timeout when
accessed from new lines in the test file. The timeout does not begin until
the context is first used (lazy initialization).

This is useful for integration tests that pass contexts through many
subsystems, where each subsystem should get a fresh timeout window.

Key behaviors:
- Timer starts on first Done(), Deadline(), or Err() call
- Value() does not trigger initialization (used for tracing/logging)
- Each unique line in a _test.go file gets a fresh timeout window
- Same-line access (e.g., in loops) does not reset
- Expired contexts cannot be resurrected

Limitations:
- Wrapping with a child context (e.g., context.WithCancel) prevents resets
  since the child's methods don't call through to the parent
- Storing the Done() channel prevents resets on subsequent accesses

The original fixed-timeout behavior is available via ContextFixed().
mafredri added a commit that referenced this pull request Dec 5, 2025
It's common to create a context early in a test body, then do setup work
unrelated to that context. By the time the context is actually used, it
may have already timed out. This was detected as test failures in #21091.

The new Context() function returns a context that resets its timeout when
accessed from new lines in the test file. The timeout does not begin until
the context is first used (lazy initialization).

This is useful for integration tests that pass contexts through many
subsystems, where each subsystem should get a fresh timeout window.

Key behaviors:
- Timer starts on first Done(), Deadline(), or Err() call
- Value() does not trigger initialization (used for tracing/logging)
- Each unique line in a _test.go file gets a fresh timeout window
- Same-line access (e.g., in loops) does not reset
- Expired contexts cannot be resurrected

Limitations:
- Wrapping with a child context (e.g., context.WithCancel) prevents resets
  since the child's methods don't call through to the parent
- Storing the Done() channel prevents resets on subsequent accesses

The original fixed-timeout behavior is available via ContextFixed().
mafredri added a commit that referenced this pull request Dec 5, 2025
It's common to create a context early in a test body, then do setup work
unrelated to that context. By the time the context is actually used, it
may have already timed out. This was detected as test failures in #21091.

The new Context() function returns a context that resets its timeout when
accessed from new lines in the test file. The timeout does not begin until
the context is first used (lazy initialization).

This is useful for integration tests that pass contexts through many
subsystems, where each subsystem should get a fresh timeout window.

Key behaviors:
- Timer starts on first Done(), Deadline(), or Err() call
- Value() does not trigger initialization (used for tracing/logging)
- Each unique line in a _test.go file gets a fresh timeout window
- Same-line access (e.g., in loops) does not reset
- Expired contexts cannot be resurrected

Limitations:
- Wrapping with a child context (e.g., context.WithCancel) prevents resets
  since the child's methods don't call through to the parent
- Storing the Done() channel prevents resets on subsequent accesses

The original fixed-timeout behavior is available via ContextFixed().
@dannykopping
Copy link
Contributor Author

Haven't identified the source of the contention yet, but at least #21121 will prevent these tests from flaking.

Signed-off-by: Danny Kopping <danny@coder.com>
Signed-off-by: Danny Kopping <danny@coder.com>
Signed-off-by: Danny Kopping <danny@coder.com>
mafredri added a commit that referenced this pull request Dec 8, 2025
It's common to create a context early in a test body, then do setup work
unrelated to that context. By the time the context is actually used, it
may have already timed out. This was detected as test failures in #21091.

The new Context() function returns a context that resets its timeout when
accessed from new lines in the test file. The timeout does not begin until
the context is first used (lazy initialization).

This is useful for integration tests that pass contexts through many
subsystems, where each subsystem should get a fresh timeout window.

Key behaviors:
- Timer starts on first Done(), Deadline(), or Err() call
- Value() does not trigger initialization (used for tracing/logging)
- Each unique line in a _test.go file gets a fresh timeout window
- Same-line access (e.g., in loops) does not reset
- Expired contexts cannot be resurrected

Limitations:
- Wrapping with a child context (e.g., context.WithCancel) prevents resets
  since the child's methods don't call through to the parent
- Storing the Done() channel prevents resets on subsequent accesses

The original fixed-timeout behavior is available via ContextFixed().
mafredri added a commit that referenced this pull request Dec 8, 2025
It's common to create a context early in a test body, then do setup work
unrelated to that context. By the time the context is actually used, it
may have already timed out. This was detected as test failures in #21091.

The new Context() function returns a context that resets its timeout when
accessed from new lines in the test file. The timeout does not begin until
the context is first used (lazy initialization).

This is useful for integration tests that pass contexts through many
subsystems, where each subsystem should get a fresh timeout window.

Key behaviors:
- Timer starts on first Done(), Deadline(), or Err() call
- Value() does not trigger initialization (used for tracing/logging)
- Each unique line in a _test.go file gets a fresh timeout window
- Same-line access (e.g., in loops) does not reset
- Expired contexts cannot be resurrected

Limitations:
- Wrapping with a child context (e.g., context.WithCancel) prevents resets
  since the child's methods don't call through to the parent
- Storing the Done() channel prevents resets on subsequent accesses

The original fixed-timeout behavior is available via ContextFixed().
mafredri added a commit that referenced this pull request Dec 8, 2025
It's common to create a context early in a test body, then do setup work
unrelated to that context. By the time the context is actually used, it
may have already timed out. This was detected as test failures in #21091.

The new Context() function returns a context that resets its timeout when
accessed from new lines in the test file. The timeout does not begin until
the context is first used (lazy initialization).

This is useful for integration tests that pass contexts through many
subsystems, where each subsystem should get a fresh timeout window.

Key behaviors:
- Timer starts on first Done(), Deadline(), or Err() call
- Value() does not trigger initialization (used for tracing/logging)
- Each unique line in a _test.go file gets a fresh timeout window
- Same-line access (e.g., in loops) does not reset
- Expired contexts cannot be resurrected

Limitations:
- Wrapping with a child context (e.g., context.WithCancel) prevents resets
  since the child's methods don't call through to the parent
- Storing the Done() channel prevents resets on subsequent accesses

The original fixed-timeout behavior is available via ContextFixed().
mafredri added a commit that referenced this pull request Dec 8, 2025
It's common to create a context early in a test body, then do setup work
unrelated to that context. By the time the context is actually used, it
may have already timed out. This was detected as test failures in #21091.

The new Context() function returns a context that resets its timeout when
accessed from new lines in the test file. The timeout does not begin until
the context is first used (lazy initialization).

This is useful for integration tests that pass contexts through many
subsystems, where each subsystem should get a fresh timeout window.

Key behaviors:
- Timer starts on first Done(), Deadline(), or Err() call
- Value() does not trigger initialization (used for tracing/logging)
- Each unique line in a _test.go file gets a fresh timeout window
- Same-line access (e.g., in loops) does not reset
- Expired contexts cannot be resurrected

Limitations:
- Wrapping with a child context (e.g., context.WithCancel) prevents resets
  since the child's methods don't call through to the parent
- Storing the Done() channel prevents resets on subsequent accesses

The original fixed-timeout behavior is available via ContextFixed().
postgres-version: "13"
# Our macOS runners have 8 cores.
test-parallelism-packages: "8"
test-parallelism-tests: "16"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Linux has 16 cores and 16x8 parallelism, but macOS has 8 cores and 8x16 parallelism --- seems wrong, since in both cases you can have 128 tests running concurrently.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't scale MacOS any further, and for Linux I just naïvely doubled the package parallelism since we now have double the CPUs.

It was like this before:

          elif [ "${RUNNER_OS}" == "macOS" ]; then
            # Our macOS runners have 8 cores. We set NUM_PARALLEL_TESTS to 16
            # because the tests complete faster and Postgres doesn't choke. It seems
            # that macOS's tmpfs is faster than the one on Windows.
            export TEST_NUM_PARALLEL_PACKAGES=8
            export TEST_NUM_PARALLEL_TESTS=16
            # Only the CLI and Agent are officially supported on macOS and the rest are too flaky
            export TEST_PACKAGES="./cli/... ./enterprise/cli/... ./agent/..."
          elif [ "${RUNNER_OS}" == "Linux" ]; then
            # Our Linux runners have 8 cores.
            export TEST_NUM_PARALLEL_PACKAGES=8
            export TEST_NUM_PARALLEL_TESTS=8
          fi

Are you suggesting I bump test-parallelism-tests to 16 for Linux as well? i.e. 256 parallelism.
That would be quadruple what we had before (8*8), where I was attempting to keep the resources to parallelism scaling linear.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If anything I'd cut the macOS parallelism. If you leave it as is, then maybe a comment explaining that the numbers were kinda determined empirically where things don't break horribly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/coder/coder/blob/main/.github/workflows/ci.yaml#L467-L472

I haven't change the parallelism (sorry, it hard to track changes because of the reorganisation); if you're aware that it's the same, why do you want to cut parallelism?

Image

According to this, it seems kinda OK?

Copy link
Contributor

@spikecurtis spikecurtis Dec 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally we'd have some theoretical consistency --- a model of parallelism that maps to CPU cores.

Cutting macOS parallelism would align with that consistent model and be easier to reason about. I guess upping Linux parallelism would also be consistent, but I'm gun shy about increasing things and potentially causing more flakes.

In the absence of consistency, we can just document what we've observed and 🤷

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, documented in 983515f 👍

Ideally we'd have some theoretical consistency --- a model of parallelism that maps to CPU cores.

Let's link up next week to try reason through this and develop a heuristic which will set the parallelism automatically and consistently across all these different platforms & jobs?

Copy link
Member

@mafredri mafredri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking forward to seeing speed improvements, nice work! Also interested in experimenting with PostgreSQL settings on Windows and Mac, perhaps we can eliminate ramdisk entirely as it should only be making things slower given a well configured pg.

if: runner.os == 'macOS'
shell: bash
run: |
# Postgres runs faster on a ramdisk on macOS.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have we verified this recently? I'd especially be interested in adjusting PostgreSQL settings to see if we can alleviate it rather than using ramdisk. We simply need to increase RAM retention for PG on macOS and it should be more efficient than placing both storage and cache in RAM.

Guessing this applies to Windows as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not changing anything about this right now. We can follow this PR with some PG changes.

…ments

Signed-off-by: Danny Kopping <danny@coder.com>
Signed-off-by: Danny Kopping <danny@coder.com>
Signed-off-by: Danny Kopping <danny@coder.com>
… tests on main

Signed-off-by: Danny Kopping <danny@coder.com>
Signed-off-by: Danny Kopping <danny@coder.com>
Signed-off-by: Danny Kopping <danny@coder.com>
Signed-off-by: Danny Kopping <danny@coder.com>
@dannykopping dannykopping enabled auto-merge (squash) December 11, 2025 09:58
@dannykopping dannykopping merged commit 84b7a03 into main Dec 11, 2025
32 checks passed
@dannykopping dannykopping deleted the dk/fat-ci-bois branch December 11, 2025 10:12
@github-actions github-actions bot locked and limited conversation to collaborators Dec 11, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants