Skip to content

Conversation

@zedkipp
Copy link
Contributor

@zedkipp zedkipp commented Nov 17, 2025

Retries were previously added when starting embedded postgres to mitigate port allocation conflicts (we can't use an ephemeral port for tests). Retries alone seemingly did not fix the test flakes. A new failure mode appeared on the retries: timing out connecting to the database.

When a port discovery error occurrs, embedded-postgres does not create the database. If the data directory exists on the next attempt, embedded-postgres will assume the database has already been created. This seems to cause the timeout error. Wipe all state between retries to ensure attempts execute the same logic that creates the database.

#658

This isn't easily testable or reproducible. That said, I tested this on a Linux machine by removing the testing.Testing() call that enables the retries, hard-coding the initial postgres port and using nc to force a port conflict.

@Emyrk
Copy link
Member

Emyrk commented Nov 17, 2025

Could this accidentally delete real postgres data?

@zedkipp
Copy link
Contributor Author

zedkipp commented Nov 17, 2025

@Emyrk I don't think so because the retries are only attempted when running tests.

coder/cli/server.go

Lines 2148 to 2149 in a272843

_, err = cfg.PostgresPort().Read()
retryPortDiscovery := errors.Is(err, os.ErrNotExist) && testing.Testing()

@zedkipp zedkipp marked this pull request as ready for review November 17, 2025 20:56
@zedkipp zedkipp requested review from Emyrk and hugodutka November 17, 2025 23:46
@cstyan
Copy link
Contributor

cstyan commented Nov 20, 2025

To Steven's question/your answer, I still think it would be a good idea to leave a potential footgun comment on the assignment of the value for retryPortDiscovery to make it clear that if the assignment logic is changed to not include testing.Testing() then we should also update the conditional check on the new block you've added.

Alternatively we could preemptively guard against that case by checking for testing.Testing() on your block as well, or maybe slightly more involved but refactor so that the retry function is definable on the server struct and define the server within tests to have a retry function that does the filesystem cleaning.

I think the former (comment or add the additional check for testing.Testing()) is enough for now to see if that resolves the flakiness.

Retries were previously added when starting embedded postgres to mitigate port
allocation conflicts (we can't use an ephemeral port). Retries alone seemingly
did not fix the test flakes. A new failure mode appeared on the retries: timing
out connecting to the database.

When a port discovery error occurrs, embedded-postgres does not create the
database. If the data directory exists on the next attempt, embedded-postgres
will assume the database has already been created. This seems to cause the
timeout error. Wipe all state between retries to ensure attempts execute the
same logic that creates the database.
@zedkipp zedkipp force-pushed the zedkipp/testserver-retry-state branch from 9e8ff44 to 44948bc Compare November 20, 2025 23:54
@zedkipp
Copy link
Contributor Author

zedkipp commented Nov 20, 2025

Yeah, that's a really good point. I updated the comments to better document the intent and potential footgun to future folks that may change this code.

@zedkipp zedkipp merged commit b4cc982 into main Nov 21, 2025
30 checks passed
@zedkipp zedkipp deleted the zedkipp/testserver-retry-state branch November 21, 2025 15:55
@github-actions github-actions bot locked and limited conversation to collaborators Nov 21, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants