fix(enterprise): mark nodes from unhealthy coordinators as lost #13123

@coadler commented May 1, 2024

Fixes #13041

Instead of removing the mappings of unhealthy coordinators entirely,
mark them as lost. This prevents peers from vanishing from other
peers' view when a coordinator misses a heartbeat.
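To make the difference concrete, here is a minimal sketch of the two behaviors. It is not the code from this PR: `Mapping`, `Kind`, `dropUnhealthy`, and `markUnhealthyLost` are hypothetical names, and the real coordinator stores its mappings differently. It only illustrates "filter out" versus "keep but flag as lost".

```go
package main

import "fmt"

// Kind is a hypothetical stand-in for the kind of update sent to peers.
type Kind int

const (
	KindNode Kind = iota // peer is known and reachable
	KindLost             // peer's coordinator looks unhealthy; peer may return
)

// Mapping is a hypothetical record tying a peer to the coordinator that reported it.
type Mapping struct {
	Peer        string
	Coordinator string
	Kind        Kind
}

// dropUnhealthy models the old behavior: mappings from unhealthy
// coordinators are removed, so their peers simply disappear.
func dropUnhealthy(ms []Mapping, healthy map[string]bool) []Mapping {
	out := make([]Mapping, 0, len(ms))
	for _, m := range ms {
		if healthy[m.Coordinator] {
			out = append(out, m)
		}
	}
	return out
}

// markUnhealthyLost models the new behavior: every mapping is kept, but
// those from unhealthy coordinators are flagged as lost. The input slice
// is not modified because m is a copy of each element.
func markUnhealthyLost(ms []Mapping, healthy map[string]bool) []Mapping {
	out := make([]Mapping, 0, len(ms))
	for _, m := range ms {
		if !healthy[m.Coordinator] {
			m.Kind = KindLost
		}
		out = append(out, m)
	}
	return out
}

func main() {
	ms := []Mapping{
		{Peer: "agent", Coordinator: "coord-a", Kind: KindNode},
		{Peer: "client", Coordinator: "coord-b", Kind: KindNode},
	}
	// Assumption for the example: coord-b has missed its heartbeats.
	healthy := map[string]bool{"coord-a": true}

	fmt.Println("dropped:", dropUnhealthy(ms, healthy))
	fmt.Println("marked: ", markUnhealthyLost(ms, healthy))
}
```

With the second function, a consumer of the mappings still sees the peer and can forward a lost update to interested peers instead of silently forgetting it.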

@coadler force-pushed the colin/fix_enterprise_mark_nodes_from_unhealthy_coordinators_as_lost branch from 58801cf to a8ac205 May 1, 2024 23:07
@coadler requested a review from spikecurtis May 1, 2024 23:07
@coadler marked this pull request as ready for review May 1, 2024 23:14
@spikecurtis left a comment

This LGTM, but I'm relying on reading code to convince myself that it solves the high level use case of sending LOST updates to peers when a coordinator fails heartbeats.

In addition to the unit test on the heartbeats subcomponent, I'd like to see a test in a similar vein to TestPGCoordinatorSingle_MissedHeartbeats where we simulate a second coordinator entirely by DB calls, make it miss its heartbeats, and then verify that the first coordinator sends out a LOST update.

@coadler requested a review from spikecurtis May 2, 2024 22:44

@spikecurtis left a comment

LGTM! Minor suggestion inline, but I don't need to review again

defer coordinator.Close()

agent := test.NewPeer(ctx, t, coordinator, "agent")
defer agent.Close(ctx)

I think the agent on the real coordinator is superfluous to this test. Just need to test the agent update on the fake coordinator, then have it miss heartbeats.

@coadler replied:

Ah, good point.

@coadler enabled auto-merge (squash) May 3, 2024 19:07
@coadler merged commit 205c43d into main May 3, 2024
@coadler deleted the colin/fix_enterprise_mark_nodes_from_unhealthy_coordinators_as_lost branch May 3, 2024 19:07
@github-actions bot locked and limited conversation to collaborators May 3, 2024
Linked issue: Fix error in PGCoordinator handling of missed heartbeats