Healthcheck/connection errors during deleting VM snapshot of PostgreSQL database #916

Open

Opened on Mar 12, 2025

Describe the bug
Please help me set up PgCat correctly. During a VM backup (VMware) of the database server to which the PgCat host holds active connections, a snapshot is created and then deleted once the backup completes. Deleting the snapshot is very I/O intensive, so the database may respond more slowly. Sometimes, right when the snapshot is deleted, the following errors appear in the PgCat log (for clarity, I removed the parameters listed in {}):

Terminating server Address because of: SocketError("Error flushing socket - Error: Os { code: 110, kind: TimedOut, message: "Connection timed out" }")
Failed health check on instance Address error: SocketError("Error flushing socket - Error: Os { code: 110, kind: TimedOut, message: "Connection timed out" }")
Server Address marked bad, reason: failed health check
Server connection terminated Address
Could not get connection from pool error: "AllServersDown"

I managed to fix this sequence of errors by increasing the tcp_user_timeout parameter (default: 10s).

Unfortunately, I still have a problem with another (slightly different) error scenario:

Health check timeout on instance Address error: Elapsed(())
Server Address marked bad, reason: failed health check
Server connection terminated Address
Could not get connection from pool error: "AllServersDown"

So far, no parameter adjustment has reliably helped. In particular, I have tried increasing connect_timeout, healthcheck_timeout, and healthcheck_delay.
Please help!! Thank you!

To Reproduce
Steps to reproduce the behavior:
Run a VM backup of the PostgreSQL host while the PgCat host has active connections to it.

Expected behavior
No errors. No terminated connections.

Additional context
OS: AlmaLinux 9
PgCat version: v1.2.0

Selected config parameters
pool_size = 160 (definitely a sufficient size)
min_pool_size = 2
connect_timeout = 60000
healthcheck_timeout = 60000
tcp_user_timeout = 60000
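
For readers trying to reproduce this, here is a rough sketch of where these values would sit in pgcat.toml. It assumes the usual layout, with timeouts under [general] and pool sizes under a per-user section. The pool name "mydb" is a placeholder, not taken from my real config, and unrelated required keys (hosts, credentials, shards) are left out. As far as I know, the timeout values are in milliseconds, so 60000 means 60 seconds.

```toml
# Sketch only: section placement is assumed from the standard pgcat.toml layout.
[general]
connect_timeout = 60000       # time allowed to establish a server connection
healthcheck_timeout = 60000   # time allowed for a health check to answer
tcp_user_timeout = 60000      # raised from the 10s default; this fixed the SocketError scenario
# healthcheck_delay was also increased during testing; the value tried is not listed above.

# "mydb" is a hypothetical pool name used only for illustration.
[pools.mydb.users.0]
pool_size = 160
min_pool_size = 2
```

Even with all three timeouts at 60 seconds, the Elapsed(()) health check failure still occurs.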
