-
Notifications
You must be signed in to change notification settings - Fork 956
fix(enterprise): mark nodes from unhealthy coordinators as lost #13123
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(enterprise): mark nodes from unhealthy coordinators as lost #13123
Conversation
This stack of pull requests is managed by Graphite. Learn more about stacking. |
Instead of removing the mappings of unhealthy coordinators entirely, mark them as lost instead. This prevents peers from disappearing from other peers if a coordinator misses a heartbeat.
58801cf
to
a8ac205
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This LGTM, but I'm relying on reading code to convince myself that it solves the high level use case of sending LOST updates to peers when a coordinator fails heartbeats.
In addition to the unit test on the heartbeats subcomponent, I'd like to see a test in a similar vein to TestPGCoordinatorSingle_MissedHeartbeats
where we simulate a second coordinator entirely by DB calls, make it miss its heartbeats, and then verify that the first coordinator sends out a LOST update.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Minor suggestion inline, but I don't need to review again
enterprise/tailnet/pgcoord_test.go
Outdated
defer coordinator.Close() | ||
|
||
agent := test.NewPeer(ctx, t, coordinator, "agent") | ||
defer agent.Close(ctx) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the agent on the real coordinator is superfluous to this test. Just need to test the agent update on the fake coordinator, then have it miss heartbeats.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, good point.
Fixes #13041
Instead of removing the mappings of unhealthy coordinators entirely,
mark them as lost. This prevents peers from disappearing from
other peers if a coordinator misses a heartbeat.