Commit ce9e7ad

fix(agent): ignore EOF errors during shutdown (#21187)
fixes: coder/internal#1179

The problem in that flake is that dRPC doesn't consistently return `context.Canceled` if you make an RPC call and then cancel it: sometimes it returns EOF. Without this PR, if we get an EOF on one of the routines that uses the agentapi connection, we tear down the whole connection and reconnect to coderd, even if we are in the middle of a graceful shutdown.

What happened in the linked flake is that writing stats failed with EOF, which then caused us to reconnect and write the lifecycle "SHUTTING DOWN" twice.
1 parent b199eb1 commit ce9e7ad

1 file changed: +11 -6 lines changed

agent/agent.go

Lines changed: 11 additions & 6 deletions
@@ -2189,14 +2189,19 @@ func (a *apiConnRoutineManager) startTailnetAPI(
 	a.eg.Go(func() error {
 		logger.Debug(ctx, "starting tailnet routine")
 		err := f(ctx, a.tAPI)
-		if xerrors.Is(err, context.Canceled) && ctx.Err() != nil {
-			logger.Debug(ctx, "swallowing context canceled")
+		if (xerrors.Is(err, context.Canceled) ||
+			xerrors.Is(err, io.EOF)) &&
+			ctx.Err() != nil {
+			logger.Debug(ctx, "swallowing error because context is canceled", slog.Error(err))
 			// Don't propagate context canceled errors to the error group, because we don't want the
 			// graceful context being canceled to halt the work of routines with
-			// gracefulShutdownBehaviorRemain. Note that we check both that the error is
-			// context.Canceled and that *our* context is currently canceled, because when Coderd
-			// unilaterally closes the API connection (for example if the build is outdated), it can
-			// sometimes show up as context.Canceled in our RPC calls.
+			// gracefulShutdownBehaviorRemain. Unfortunately, the dRPC library closes the stream
+			// when context is canceled on an RPC, so canceling the context can also show up as
+			// io.EOF. Also, when Coderd unilaterally closes the API connection (for example if the
+			// build is outdated), it can sometimes show up as context.Canceled in our RPC calls.
+			// We can't reliably distinguish between a context cancelation and a legit EOF, so we
+			// also check that *our* context is currently canceled. If it is, we can safely ignore
+			// the error.
 			return nil
 		}
 		logger.Debug(ctx, "routine exited", slog.Error(err))
