-
Notifications
You must be signed in to change notification settings - Fork 608
fix(xds-server): clear snapshot on stream close #6618
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: Zachary Vacura <zvacura@digitalocean.com>
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #6618 +/- ##
==========================================
- Coverage 71.03% 71.02% -0.01%
==========================================
Files 225 225
Lines 39153 39173 +20
==========================================
+ Hits 27811 27823 +12
- Misses 9732 9743 +11
+ Partials 1610 1607 -3 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
thanks for flagging this edge case instead of clearing the snapshot in a cheaper alternative could be to hold a |
trying to understand when this realistically occurs - is this a race between pod restart and Gateway API config update ? |
We've seen this happen at random in our clusters. If there is a brief network interruption a proxy may get disconnected from envoy-gateway and then connect to a different instance (one that it was connected to previously). |
…shot Signed-off-by: Zachary Vacura <zvacura@digitalocean.com>
arkodg
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM thanks !
|
@Hackzzila - Thanks for this. I would really appreciate an E2E test like the ones that we have in the resilience suite, demonstrating that this works as expected after creating an disconnection with a network policy. If that's too much work, can you confirm that you manually tested the change in a similar way ? |
I did a test by connecting to envoy-gateway with grpcurl and sending requests manually. I connected and verified the correct config was being returned, then disconnected and restarted a deployment, and then reconnected and saw that the IPs were not updated. I then did this same test with the patch and verified that it returned the correct data. I did not test the delta variant, or multiple streams for the same node. If nothing major comes up at work I can investigate writing a test for this on Thursday/Friday 👍 |
jukie
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 on adding an E2E but non-blocking
|
Thanks! If you're unable to do so before merging could you create an issue tracking the E2E followup? |
* fix(xds-server): clear snapshot on stream close Signed-off-by: Zachary Vacura <zvacura@digitalocean.com> * check if there are other active connections before clearning the snapshot Signed-off-by: Zachary Vacura <zvacura@digitalocean.com>
* fix(xds-server): clear snapshot on stream close Signed-off-by: Zachary Vacura <zvacura@digitalocean.com> * check if there are other active connections before clearning the snapshot Signed-off-by: Zachary Vacura <zvacura@digitalocean.com> Signed-off-by: shawnh2 <shawnhxh@outlook.com>
* fix(xds-server): clear snapshot on stream close (#6618) * fix(xds-server): clear snapshot on stream close Signed-off-by: Zachary Vacura <zvacura@digitalocean.com> * check if there are other active connections before clearning the snapshot Signed-off-by: Zachary Vacura <zvacura@digitalocean.com> Signed-off-by: shawnh2 <shawnhxh@outlook.com> * bug: disable x-envoy-ratelimited by default (#7110) * bug: disable x-envoy-ratelimited by default * can be enabled with `enableEnvoyHeaders` in CTP Relates to #7034 Signed-off-by: Arko Dasgupta <arko@tetrate.io> * fix tests and release note Signed-off-by: Arko Dasgupta <arko@tetrate.io> * fix testdata Signed-off-by: Arko Dasgupta <arko@tetrate.io> --------- Signed-off-by: Arko Dasgupta <arko@tetrate.io> Signed-off-by: shawnh2 <shawnhxh@outlook.com> * fix(translator): Fix panic with request mirror + grpcroute (#6875) Signed-off-by: Andrew Moreland <andy@andymoreland.com> Signed-off-by: shawnh2 <shawnhxh@outlook.com> * fix: bug in overlap detection of cert SANs (#7234) Signed-off-by: shawnh2 <shawnhxh@outlook.com> * bump golang for crypto/x509 reggression (#7236) Signed-off-by: zirain <zirain2009@gmail.com> Signed-off-by: shawnh2 <shawnhxh@outlook.com> * fix gen-check Signed-off-by: shawnh2 <shawnhxh@outlook.com> --------- Signed-off-by: Zachary Vacura <zvacura@digitalocean.com> Signed-off-by: shawnh2 <shawnhxh@outlook.com> Signed-off-by: Arko Dasgupta <arko@tetrate.io> Signed-off-by: Andrew Moreland <andy@andymoreland.com> Signed-off-by: zirain <zirain2009@gmail.com> Co-authored-by: Zach Vacura <zach@hackzzila.com> Co-authored-by: Arko Dasgupta <arkodg@users.noreply.github.com> Co-authored-by: Andrew Moreland <andy@andymo.org> Co-authored-by: Rudrakh Panigrahi <rudrakh97@gmail.com> Co-authored-by: zirain <zirain2009@gmail.com>
* fix(xds-server): clear snapshot on stream close Signed-off-by: Zachary Vacura <zvacura@digitalocean.com> * check if there are other active connections before clearning the snapshot Signed-off-by: Zachary Vacura <zvacura@digitalocean.com>
* fix(xds-server): clear snapshot on stream close Signed-off-by: Zachary Vacura <zvacura@digitalocean.com> * check if there are other active connections before clearning the snapshot Signed-off-by: Zachary Vacura <zvacura@digitalocean.com> Signed-off-by: jukie <10012479+jukie@users.noreply.github.com>
* fix(xds-server): clear snapshot on stream close (#6618) * fix(xds-server): clear snapshot on stream close Signed-off-by: Zachary Vacura <zvacura@digitalocean.com> * check if there are other active connections before clearning the snapshot Signed-off-by: Zachary Vacura <zvacura@digitalocean.com> Signed-off-by: jukie <10012479+jukie@users.noreply.github.com> * fix: oidc authentication endpoint was overwritten by discovered value (#7460) fix: oid authentication endpoint was overriden by discovered value Signed-off-by: Huabing Zhao <zhaohuabing@gmail.com> Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com> Signed-off-by: jukie <10012479+jukie@users.noreply.github.com> * ci: add script to free disk space (#7534) * feat: free disk space Signed-off-by: Shreemaan Abhishek <shreemaanabhishek@apache.org> * lint Signed-off-by: Shreemaan Abhishek <shreemaanabhishek@apache.org> * cleanup Signed-off-by: Shreemaan Abhishek <shreemaanabhishek@apache.org> * make target and tools/hack Signed-off-by: Shreemaan Abhishek <shreemaanabhishek@apache.org> * lint Signed-off-by: Shreemaan Abhishek <shreemaanabhishek@apache.org> * modular action Signed-off-by: Shreemaan Abhishek <shreemaanabhishek@apache.org> --------- Signed-off-by: Shreemaan Abhishek <shreemaanabhishek@apache.org> Signed-off-by: jukie <10012479+jukie@users.noreply.github.com> * treat too many addresses as programmed (#7542) Signed-off-by: cong <q1875486458@gmail.com> Signed-off-by: jukie <10012479+jukie@users.noreply.github.com> * feat: reclaim space in release pipeline (#7587) Signed-off-by: Shreemaan Abhishek <shreemaanabhishek@apache.org> Signed-off-by: jukie <10012479+jukie@users.noreply.github.com> * chore: bump golang.org/x/crypto (#7588) * chore: bump golang.org/x/crypto Signed-off-by: zirain <zirain2009@gmail.com> * fix gen Signed-off-by: zirain <zirain2009@gmail.com> --------- Signed-off-by: zirain <zirain2009@gmail.com> Signed-off-by: jukie <10012479+jukie@users.noreply.github.com> * findOwningGateway should return controller based on linked GatewayClass (#7611) * fix: filter Gateway by controller in findOwningGateway Prevent cross-controller Gateway mutations by validating GatewayClass Signed-off-by: Sudipto Baral <sudiptobaral.me@gmail.com> Signed-off-by: jukie <10012479+jukie@users.noreply.github.com> * fix: use default when namespace is unset (#7612) * fix: use default when namespace is unset Signed-off-by: zirain <zirain2009@gmail.com> * fix Signed-off-by: zirain <zirain2009@gmail.com> * fix test Signed-off-by: zirain <zirain2009@gmail.com> --------- Signed-off-by: zirain <zirain2009@gmail.com> Signed-off-by: jukie <10012479+jukie@users.noreply.github.com> * fix: prevent skeleton route status entries for unmanaged GatewayClasses (#7536) * fix: prevent skeleton route status entries for unmanaged GatewayClasses When processing policies (EnvoyExtensionPolicy, SecurityPolicy), the translator was calling GetRouteParentContext for ALL parentRefs in a route, even those referencing gateways with different GatewayClasses not managed by this translator. GetRouteParentContext creates a skeleton RouteParentStatus entry with just the controllerName when called on a parentRef that hasn't been processed yet. Since all GatewayClass instances share the same controller name, these skeleton entries persisted in status without conditions. The fix checks if a parentRef context already exists before attempting to apply policy configuration to it. If the context doesn't exist, it means this parentRef wasn't processed by this translator and should be skipped. Signed-off-by: Raj Singh <raj@tailscale.com> * fix: also prevent skeleton entries in BackendTrafficPolicy processing The same issue exists in BackendTrafficPolicy route processing - calling GetRouteParentContext for all parentRefs creates skeleton status entries. Apply the same fix: check if parentRef context exists before adding to list. Signed-off-by: Raj Singh <raj@tailscale.com> --------- Signed-off-by: Raj Singh <raj@tailscale.com> Signed-off-by: jukie <10012479+jukie@users.noreply.github.com> * lint Signed-off-by: jukie <10012479+jukie@users.noreply.github.com> --------- Signed-off-by: Zachary Vacura <zvacura@digitalocean.com> Signed-off-by: jukie <10012479+jukie@users.noreply.github.com> Signed-off-by: Huabing Zhao <zhaohuabing@gmail.com> Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com> Signed-off-by: Shreemaan Abhishek <shreemaanabhishek@apache.org> Signed-off-by: cong <q1875486458@gmail.com> Signed-off-by: zirain <zirain2009@gmail.com> Signed-off-by: Sudipto Baral <sudiptobaral.me@gmail.com> Signed-off-by: Raj Singh <raj@tailscale.com> Co-authored-by: Zach Vacura <zach@hackzzila.com> Co-authored-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com> Co-authored-by: shreealt <shreemaanabhishek@apache.org> Co-authored-by: 聪 <q1875486458@gmail.com> Co-authored-by: zirain <zirain2009@gmail.com> Co-authored-by: Sudipto Baral <sudiptobaral.me@gmail.com> Co-authored-by: Raj Singh <raj@tailscale.com>
In the snapshotcache, only snapshots for nodes with active connections are updated.
gateway/internal/xds/cache/snapshotcache.go
Lines 92 to 95 in ceec666
As far as I can tell, these snapshots are never deleted. When a node disconnects and then reconnects it will receive the configuration from the time it disconnected. It will only get the latest configuration when there is an update while it is still connected.
This is a big problem when you are running multiple replicas of envoy-gateway, if a proxy flips between which instance it is connected to it can receive data that is several days out of date. We have seen proxy instances suddenly start using certificates that are expired and have already been rotated, and use pod IPs that do not exist anymore.
If we clear the snapshot for a node when it disconnects, it will get the latest snapshot when it reconnects (this does not happen if a node already has a cached snapshot).
gateway/internal/xds/cache/snapshotcache.go
Lines 208 to 210 in ceec666