Skip to content

Conversation

@Hackzzila
Copy link
Contributor

In the snapshotcache, only snapshots for nodes with active connections are updated.

for _, node := range s.getNodeIDs(irKey) {
s.log.Debugf("Generating a snapshot with Node %s", node)
if err = s.SetSnapshot(context.TODO(), node, snapshot); err != nil {

As far as I can tell, these snapshots are never deleted. When a node disconnects and then reconnects it will receive the configuration from the time it disconnected. It will only get the latest configuration when there is an update while it is still connected.

This is a big problem when you are running multiple replicas of envoy-gateway, if a proxy flips between which instance it is connected to it can receive data that is several days out of date. We have seen proxy instances suddenly start using certificates that are expired and have already been rotated, and use pod IPs that do not exist anymore.

If we clear the snapshot for a node when it disconnects, it will get the latest snapshot when it reconnects (this does not happen if a node already has a cached snapshot).

_, err := s.GetSnapshot(nodeID)
if err != nil {
err = s.SetSnapshot(context.TODO(), nodeID, s.lastSnapshot[cluster])

Signed-off-by: Zachary Vacura <zvacura@digitalocean.com>
@Hackzzila Hackzzila requested a review from a team as a code owner July 28, 2025 18:26
@codecov
Copy link

codecov bot commented Jul 28, 2025

Codecov Report

❌ Patch coverage is 0% with 21 lines in your changes missing coverage. Please review.
✅ Project coverage is 71.02%. Comparing base (ceec666) to head (cfb177c).
⚠️ Report is 483 commits behind head on main.

Files with missing lines Patch % Lines
internal/xds/cache/snapshotcache.go 0.00% 21 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6618      +/-   ##
==========================================
- Coverage   71.03%   71.02%   -0.01%     
==========================================
  Files         225      225              
  Lines       39153    39173      +20     
==========================================
+ Hits        27811    27823      +12     
- Misses       9732     9743      +11     
+ Partials     1610     1607       -3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@arkodg
Copy link
Contributor

arkodg commented Jul 28, 2025

thanks for flagging this edge case
although for ADS there should be 1 stream per node, we cannot rely on it

instead of clearing the snapshot in OnStreamClosed , we could compare snapshots using GetVersionMap in OnStreamRequest which executesGetSnapshot

a cheaper alternative could be to hold a nodeId frequency map of stream counts, and only clear snapshot when no streams are available, my preference would be this approach

@arkodg arkodg added this to the v1.5.0 Release milestone Jul 28, 2025
@arkodg
Copy link
Contributor

arkodg commented Jul 28, 2025

This is a big problem when you are running multiple replicas of envoy-gateway, if a proxy flips between which instance it is connected to it can receive data that is several days out of date. We have seen proxy instances suddenly start using certificates that are expired and have already been rotated, and use pod IPs that do not exist anymore.

trying to understand when this realistically occurs - is this a race between pod restart and Gateway API config update ?

@Hackzzila
Copy link
Contributor Author

trying to understand when this realistically occurs - is this a race between pod restart and Gateway API config update ?

We've seen this happen at random in our clusters. If there is a brief network interruption a proxy may get disconnected from envoy-gateway and then connect to a different instance (one that it was connected to previously).

…shot

Signed-off-by: Zachary Vacura <zvacura@digitalocean.com>
Copy link
Contributor

@arkodg arkodg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM thanks !

@arkodg arkodg requested review from a team July 29, 2025 19:35
@guydc
Copy link
Contributor

guydc commented Jul 29, 2025

@Hackzzila - Thanks for this. I would really appreciate an E2E test like the ones that we have in the resilience suite, demonstrating that this works as expected after creating an disconnection with a network policy.

If that's too much work, can you confirm that you manually tested the change in a similar way ?

@Hackzzila
Copy link
Contributor Author

@Hackzzila - Thanks for this. I would really appreciate an E2E test like the ones that we have in the resilience suite, demonstrating that this works as expected after creating an disconnection with a network policy.

If that's too much work, can you confirm that you manually tested the change in a similar way ?

I did a test by connecting to envoy-gateway with grpcurl and sending requests manually. I connected and verified the correct config was being returned, then disconnected and restarted a deployment, and then reconnected and saw that the IPs were not updated. I then did this same test with the patch and verified that it returned the correct data. I did not test the delta variant, or multiple streams for the same node.

If nothing major comes up at work I can investigate writing a test for this on Thursday/Friday 👍

Copy link
Contributor

@jukie jukie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on adding an E2E but non-blocking

@jukie
Copy link
Contributor

jukie commented Jul 30, 2025

Thanks! If you're unable to do so before merging could you create an issue tracking the E2E followup?

@arkodg arkodg merged commit 6e21739 into envoyproxy:main Jul 30, 2025
27 of 28 checks passed
shawnh2 pushed a commit to shawnh2/gateway that referenced this pull request Oct 15, 2025
* fix(xds-server): clear snapshot on stream close

Signed-off-by: Zachary Vacura <zvacura@digitalocean.com>

* check if there are other active connections before clearning the snapshot

Signed-off-by: Zachary Vacura <zvacura@digitalocean.com>
shawnh2 pushed a commit to shawnh2/gateway that referenced this pull request Oct 15, 2025
* fix(xds-server): clear snapshot on stream close

Signed-off-by: Zachary Vacura <zvacura@digitalocean.com>

* check if there are other active connections before clearning the snapshot

Signed-off-by: Zachary Vacura <zvacura@digitalocean.com>
Signed-off-by: shawnh2 <shawnhxh@outlook.com>
arkodg added a commit that referenced this pull request Oct 15, 2025
* fix(xds-server): clear snapshot on stream close (#6618)

* fix(xds-server): clear snapshot on stream close

Signed-off-by: Zachary Vacura <zvacura@digitalocean.com>

* check if there are other active connections before clearning the snapshot

Signed-off-by: Zachary Vacura <zvacura@digitalocean.com>
Signed-off-by: shawnh2 <shawnhxh@outlook.com>

* bug: disable x-envoy-ratelimited by default (#7110)

* bug: disable x-envoy-ratelimited by default

* can be enabled with `enableEnvoyHeaders` in CTP

Relates to #7034

Signed-off-by: Arko Dasgupta <arko@tetrate.io>

* fix tests and release note

Signed-off-by: Arko Dasgupta <arko@tetrate.io>

* fix testdata

Signed-off-by: Arko Dasgupta <arko@tetrate.io>

---------

Signed-off-by: Arko Dasgupta <arko@tetrate.io>
Signed-off-by: shawnh2 <shawnhxh@outlook.com>

* fix(translator): Fix panic with request mirror + grpcroute (#6875)

Signed-off-by: Andrew Moreland <andy@andymoreland.com>
Signed-off-by: shawnh2 <shawnhxh@outlook.com>

* fix: bug in overlap detection of cert SANs (#7234)

Signed-off-by: shawnh2 <shawnhxh@outlook.com>

* bump golang for crypto/x509 reggression (#7236)

Signed-off-by: zirain <zirain2009@gmail.com>
Signed-off-by: shawnh2 <shawnhxh@outlook.com>

* fix gen-check

Signed-off-by: shawnh2 <shawnhxh@outlook.com>

---------

Signed-off-by: Zachary Vacura <zvacura@digitalocean.com>
Signed-off-by: shawnh2 <shawnhxh@outlook.com>
Signed-off-by: Arko Dasgupta <arko@tetrate.io>
Signed-off-by: Andrew Moreland <andy@andymoreland.com>
Signed-off-by: zirain <zirain2009@gmail.com>
Co-authored-by: Zach Vacura <zach@hackzzila.com>
Co-authored-by: Arko Dasgupta <arkodg@users.noreply.github.com>
Co-authored-by: Andrew Moreland <andy@andymo.org>
Co-authored-by: Rudrakh Panigrahi <rudrakh97@gmail.com>
Co-authored-by: zirain <zirain2009@gmail.com>
jukie pushed a commit to jukie/gateway that referenced this pull request Dec 5, 2025
* fix(xds-server): clear snapshot on stream close

Signed-off-by: Zachary Vacura <zvacura@digitalocean.com>

* check if there are other active connections before clearning the snapshot

Signed-off-by: Zachary Vacura <zvacura@digitalocean.com>
jukie pushed a commit to jukie/gateway that referenced this pull request Dec 5, 2025
* fix(xds-server): clear snapshot on stream close

Signed-off-by: Zachary Vacura <zvacura@digitalocean.com>

* check if there are other active connections before clearning the snapshot

Signed-off-by: Zachary Vacura <zvacura@digitalocean.com>
Signed-off-by: jukie <10012479+jukie@users.noreply.github.com>
jukie added a commit that referenced this pull request Dec 5, 2025
* fix(xds-server): clear snapshot on stream close (#6618)

* fix(xds-server): clear snapshot on stream close

Signed-off-by: Zachary Vacura <zvacura@digitalocean.com>

* check if there are other active connections before clearning the snapshot

Signed-off-by: Zachary Vacura <zvacura@digitalocean.com>
Signed-off-by: jukie <10012479+jukie@users.noreply.github.com>

* fix: oidc authentication endpoint was overwritten by discovered value (#7460)

fix: oid authentication endpoint was overriden by discovered value

Signed-off-by: Huabing Zhao <zhaohuabing@gmail.com>
Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>
Signed-off-by: jukie <10012479+jukie@users.noreply.github.com>

* ci: add script to free disk space (#7534)

* feat: free disk space

Signed-off-by: Shreemaan Abhishek <shreemaanabhishek@apache.org>

* lint

Signed-off-by: Shreemaan Abhishek <shreemaanabhishek@apache.org>

* cleanup

Signed-off-by: Shreemaan Abhishek <shreemaanabhishek@apache.org>

* make target and tools/hack

Signed-off-by: Shreemaan Abhishek <shreemaanabhishek@apache.org>

* lint

Signed-off-by: Shreemaan Abhishek <shreemaanabhishek@apache.org>

* modular action

Signed-off-by: Shreemaan Abhishek <shreemaanabhishek@apache.org>

---------

Signed-off-by: Shreemaan Abhishek <shreemaanabhishek@apache.org>
Signed-off-by: jukie <10012479+jukie@users.noreply.github.com>

* treat too many addresses as programmed (#7542)

Signed-off-by: cong <q1875486458@gmail.com>
Signed-off-by: jukie <10012479+jukie@users.noreply.github.com>

* feat: reclaim space in release pipeline (#7587)

Signed-off-by: Shreemaan Abhishek <shreemaanabhishek@apache.org>
Signed-off-by: jukie <10012479+jukie@users.noreply.github.com>

* chore: bump golang.org/x/crypto (#7588)

* chore: bump golang.org/x/crypto

Signed-off-by: zirain <zirain2009@gmail.com>

* fix gen

Signed-off-by: zirain <zirain2009@gmail.com>

---------

Signed-off-by: zirain <zirain2009@gmail.com>
Signed-off-by: jukie <10012479+jukie@users.noreply.github.com>

* findOwningGateway should return controller based on linked GatewayClass (#7611)

* fix: filter Gateway by controller in findOwningGateway

Prevent cross-controller Gateway mutations by validating GatewayClass

Signed-off-by: Sudipto Baral <sudiptobaral.me@gmail.com>
Signed-off-by: jukie <10012479+jukie@users.noreply.github.com>

* fix: use default when namespace is unset (#7612)

* fix: use default when namespace is unset

Signed-off-by: zirain <zirain2009@gmail.com>

* fix

Signed-off-by: zirain <zirain2009@gmail.com>

* fix test

Signed-off-by: zirain <zirain2009@gmail.com>

---------

Signed-off-by: zirain <zirain2009@gmail.com>
Signed-off-by: jukie <10012479+jukie@users.noreply.github.com>

* fix: prevent skeleton route status entries for unmanaged GatewayClasses (#7536)

* fix: prevent skeleton route status entries for unmanaged GatewayClasses

When processing policies (EnvoyExtensionPolicy, SecurityPolicy), the translator
was calling GetRouteParentContext for ALL parentRefs in a route, even those
referencing gateways with different GatewayClasses not managed by this translator.

GetRouteParentContext creates a skeleton RouteParentStatus entry with just the
controllerName when called on a parentRef that hasn't been processed yet. Since
all GatewayClass instances share the same controller name, these skeleton entries
persisted in status without conditions.

The fix checks if a parentRef context already exists before attempting to apply
policy configuration to it. If the context doesn't exist, it means this parentRef
wasn't processed by this translator and should be skipped.

Signed-off-by: Raj Singh <raj@tailscale.com>

* fix: also prevent skeleton entries in BackendTrafficPolicy processing

The same issue exists in BackendTrafficPolicy route processing - calling
GetRouteParentContext for all parentRefs creates skeleton status entries.

Apply the same fix: check if parentRef context exists before adding to list.

Signed-off-by: Raj Singh <raj@tailscale.com>

---------

Signed-off-by: Raj Singh <raj@tailscale.com>
Signed-off-by: jukie <10012479+jukie@users.noreply.github.com>

* lint

Signed-off-by: jukie <10012479+jukie@users.noreply.github.com>

---------

Signed-off-by: Zachary Vacura <zvacura@digitalocean.com>
Signed-off-by: jukie <10012479+jukie@users.noreply.github.com>
Signed-off-by: Huabing Zhao <zhaohuabing@gmail.com>
Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>
Signed-off-by: Shreemaan Abhishek <shreemaanabhishek@apache.org>
Signed-off-by: cong <q1875486458@gmail.com>
Signed-off-by: zirain <zirain2009@gmail.com>
Signed-off-by: Sudipto Baral <sudiptobaral.me@gmail.com>
Signed-off-by: Raj Singh <raj@tailscale.com>
Co-authored-by: Zach Vacura <zach@hackzzila.com>
Co-authored-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>
Co-authored-by: shreealt <shreemaanabhishek@apache.org>
Co-authored-by: 聪 <q1875486458@gmail.com>
Co-authored-by: zirain <zirain2009@gmail.com>
Co-authored-by: Sudipto Baral <sudiptobaral.me@gmail.com>
Co-authored-by: Raj Singh <raj@tailscale.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants