-
Notifications
You must be signed in to change notification settings - Fork 1.1k
chore: abstract pg test logic and double runner sizes #21091
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Hhmm I'm thinking the increased parallelism might be causing these two (newly-created) flakes: coder/internal#1174 There might be some contention / starvation occurring. Looking into it... |
It's common to create a context early in a test body, then do setup work unrelated to that context. By the time the context is actually used, it may have already timed out. This was detected as test failures in #21091. The new Context() function returns a context that resets its timeout when accessed from new lines in the test file. The timeout does not begin until the context is first used (lazy initialization). This is useful for integration tests that pass contexts through many subsystems, where each subsystem should get a fresh timeout window. Key behaviors: - Timer starts on first Done(), Deadline(), or Err() call - Value() does not trigger initialization (used for tracing/logging) - Each unique line in a _test.go file gets a fresh timeout window - Same-line access (e.g., in loops) does not reset - Expired contexts cannot be resurrected Limitations: - Wrapping with a child context (e.g., context.WithCancel) prevents resets since the child's methods don't call through to the parent - Storing the Done() channel prevents resets on subsequent accesses The original fixed-timeout behavior is available via ContextFixed().
It's common to create a context early in a test body, then do setup work unrelated to that context. By the time the context is actually used, it may have already timed out. This was detected as test failures in #21091. The new Context() function returns a context that resets its timeout when accessed from new lines in the test file. The timeout does not begin until the context is first used (lazy initialization). This is useful for integration tests that pass contexts through many subsystems, where each subsystem should get a fresh timeout window. Key behaviors: - Timer starts on first Done(), Deadline(), or Err() call - Value() does not trigger initialization (used for tracing/logging) - Each unique line in a _test.go file gets a fresh timeout window - Same-line access (e.g., in loops) does not reset - Expired contexts cannot be resurrected Limitations: - Wrapping with a child context (e.g., context.WithCancel) prevents resets since the child's methods don't call through to the parent - Storing the Done() channel prevents resets on subsequent accesses The original fixed-timeout behavior is available via ContextFixed().
It's common to create a context early in a test body, then do setup work unrelated to that context. By the time the context is actually used, it may have already timed out. This was detected as test failures in #21091. The new Context() function returns a context that resets its timeout when accessed from new lines in the test file. The timeout does not begin until the context is first used (lazy initialization). This is useful for integration tests that pass contexts through many subsystems, where each subsystem should get a fresh timeout window. Key behaviors: - Timer starts on first Done(), Deadline(), or Err() call - Value() does not trigger initialization (used for tracing/logging) - Each unique line in a _test.go file gets a fresh timeout window - Same-line access (e.g., in loops) does not reset - Expired contexts cannot be resurrected Limitations: - Wrapping with a child context (e.g., context.WithCancel) prevents resets since the child's methods don't call through to the parent - Storing the Done() channel prevents resets on subsequent accesses The original fixed-timeout behavior is available via ContextFixed().
|
Haven't identified the source of the contention yet, but at least #21121 will prevent these tests from flaking. |
Signed-off-by: Danny Kopping <danny@coder.com>
Signed-off-by: Danny Kopping <danny@coder.com>
Signed-off-by: Danny Kopping <danny@coder.com>
4fd7e1b to
211e727
Compare
It's common to create a context early in a test body, then do setup work unrelated to that context. By the time the context is actually used, it may have already timed out. This was detected as test failures in #21091. The new Context() function returns a context that resets its timeout when accessed from new lines in the test file. The timeout does not begin until the context is first used (lazy initialization). This is useful for integration tests that pass contexts through many subsystems, where each subsystem should get a fresh timeout window. Key behaviors: - Timer starts on first Done(), Deadline(), or Err() call - Value() does not trigger initialization (used for tracing/logging) - Each unique line in a _test.go file gets a fresh timeout window - Same-line access (e.g., in loops) does not reset - Expired contexts cannot be resurrected Limitations: - Wrapping with a child context (e.g., context.WithCancel) prevents resets since the child's methods don't call through to the parent - Storing the Done() channel prevents resets on subsequent accesses The original fixed-timeout behavior is available via ContextFixed().
It's common to create a context early in a test body, then do setup work unrelated to that context. By the time the context is actually used, it may have already timed out. This was detected as test failures in #21091. The new Context() function returns a context that resets its timeout when accessed from new lines in the test file. The timeout does not begin until the context is first used (lazy initialization). This is useful for integration tests that pass contexts through many subsystems, where each subsystem should get a fresh timeout window. Key behaviors: - Timer starts on first Done(), Deadline(), or Err() call - Value() does not trigger initialization (used for tracing/logging) - Each unique line in a _test.go file gets a fresh timeout window - Same-line access (e.g., in loops) does not reset - Expired contexts cannot be resurrected Limitations: - Wrapping with a child context (e.g., context.WithCancel) prevents resets since the child's methods don't call through to the parent - Storing the Done() channel prevents resets on subsequent accesses The original fixed-timeout behavior is available via ContextFixed().
It's common to create a context early in a test body, then do setup work unrelated to that context. By the time the context is actually used, it may have already timed out. This was detected as test failures in #21091. The new Context() function returns a context that resets its timeout when accessed from new lines in the test file. The timeout does not begin until the context is first used (lazy initialization). This is useful for integration tests that pass contexts through many subsystems, where each subsystem should get a fresh timeout window. Key behaviors: - Timer starts on first Done(), Deadline(), or Err() call - Value() does not trigger initialization (used for tracing/logging) - Each unique line in a _test.go file gets a fresh timeout window - Same-line access (e.g., in loops) does not reset - Expired contexts cannot be resurrected Limitations: - Wrapping with a child context (e.g., context.WithCancel) prevents resets since the child's methods don't call through to the parent - Storing the Done() channel prevents resets on subsequent accesses The original fixed-timeout behavior is available via ContextFixed().
It's common to create a context early in a test body, then do setup work unrelated to that context. By the time the context is actually used, it may have already timed out. This was detected as test failures in #21091. The new Context() function returns a context that resets its timeout when accessed from new lines in the test file. The timeout does not begin until the context is first used (lazy initialization). This is useful for integration tests that pass contexts through many subsystems, where each subsystem should get a fresh timeout window. Key behaviors: - Timer starts on first Done(), Deadline(), or Err() call - Value() does not trigger initialization (used for tracing/logging) - Each unique line in a _test.go file gets a fresh timeout window - Same-line access (e.g., in loops) does not reset - Expired contexts cannot be resurrected Limitations: - Wrapping with a child context (e.g., context.WithCancel) prevents resets since the child's methods don't call through to the parent - Storing the Done() channel prevents resets on subsequent accesses The original fixed-timeout behavior is available via ContextFixed().
| postgres-version: "13" | ||
| # Our macOS runners have 8 cores. | ||
| test-parallelism-packages: "8" | ||
| test-parallelism-tests: "16" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Linux has 16 cores and 16x8 parallelism, but macOS has 8 cores and 8x16 parallelism --- seems wrong, since in both cases you can have 128 tests running concurrently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can't scale MacOS any further, and for Linux I just naïvely doubled the package parallelism since we now have double the CPUs.
It was like this before:
elif [ "${RUNNER_OS}" == "macOS" ]; then
# Our macOS runners have 8 cores. We set NUM_PARALLEL_TESTS to 16
# because the tests complete faster and Postgres doesn't choke. It seems
# that macOS's tmpfs is faster than the one on Windows.
export TEST_NUM_PARALLEL_PACKAGES=8
export TEST_NUM_PARALLEL_TESTS=16
# Only the CLI and Agent are officially supported on macOS and the rest are too flaky
export TEST_PACKAGES="./cli/... ./enterprise/cli/... ./agent/..."
elif [ "${RUNNER_OS}" == "Linux" ]; then
# Our Linux runners have 8 cores.
export TEST_NUM_PARALLEL_PACKAGES=8
export TEST_NUM_PARALLEL_TESTS=8
fi
Are you suggesting I bump test-parallelism-tests to 16 for Linux as well? i.e. 256 parallelism.
That would be quadruple what we had before (8*8), where I was attempting to keep the resources to parallelism scaling linear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If anything I'd cut the macOS parallelism. If you leave it as is, then maybe a comment explaining that the numbers were kinda determined empirically where things don't break horribly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://github.com/coder/coder/blob/main/.github/workflows/ci.yaml#L467-L472
I haven't change the parallelism (sorry, it hard to track changes because of the reorganisation); if you're aware that it's the same, why do you want to cut parallelism?
According to this, it seems kinda OK?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally we'd have some theoretical consistency --- a model of parallelism that maps to CPU cores.
Cutting macOS parallelism would align with that consistent model and be easier to reason about. I guess upping Linux parallelism would also be consistent, but I'm gun shy about increasing things and potentially causing more flakes.
In the absence of consistency, we can just document what we've observed and 🤷
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, documented in 983515f 👍
Ideally we'd have some theoretical consistency --- a model of parallelism that maps to CPU cores.
Let's link up next week to try reason through this and develop a heuristic which will set the parallelism automatically and consistently across all these different platforms & jobs?
mafredri
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking forward to seeing speed improvements, nice work! Also interested in experimenting with PostgreSQL settings on Windows and Mac, perhaps we can eliminate ramdisk entirely as it should only be making things slower given a well configured pg.
| if: runner.os == 'macOS' | ||
| shell: bash | ||
| run: | | ||
| # Postgres runs faster on a ramdisk on macOS. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have we verified this recently? I'd especially be interested in adjusting PostgreSQL settings to see if we can alleviate it rather than using ramdisk. We simply need to increase RAM retention for PG on macOS and it should be more efficient than placing both storage and cache in RAM.
Guessing this applies to Windows as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not changing anything about this right now. We can follow this PR with some PG changes.
…ments Signed-off-by: Danny Kopping <danny@coder.com>
Signed-off-by: Danny Kopping <danny@coder.com>
Signed-off-by: Danny Kopping <danny@coder.com>
… tests on main Signed-off-by: Danny Kopping <danny@coder.com>
Signed-off-by: Danny Kopping <danny@coder.com>
5762ad6 to
f4d2a44
Compare
Signed-off-by: Danny Kopping <danny@coder.com>
Signed-off-by: Danny Kopping <danny@coder.com>
This PR does two things, both in service of helping to (hopefully!) speed up CI:
I only focused on the PG-related jobs since they are generally slowest & most RAM-intensive.
*
test-go-race-pgdoubles from 16->32 CPUs & 64->128GiB RAM and likewise for the Windows runners; MacOS runners have only one sizeNOTE: don't use the speed of the PG-related jobs in this PR's CI run as indicative. Tests run outside
mainmay use cache, so the speed may seem artificially low.