Healthcheck/connection errors during deleting VM snapshot of PostgreSQL database #916

@mirakrejci2

Describe the bug
Please help me set up PgCat correctly. During a VM backup (VMware) of the database server to which the PgCat host has active connections, a snapshot is created and then deleted once the backup is complete. This is a very I/O-intensive operation, and as a result the database may respond more slowly. Sometimes, right when the snapshot is being deleted, the following errors appear in the PgCat log (for clarity, I removed the parameters listed in {}):

Terminating server Address because of: SocketError("Error flushing socket - Error: Os { code: 110, kind: TimedOut, message: "Connection timed out" }")
Failed health check on instance Address error: SocketError("Error flushing socket - Error: Os { code: 110, kind: TimedOut, message: "Connection timed out" }")
Server Address marked bad, reason: failed health check
Server connection terminated Address
Could not get connection from pool error: "AllServersDown"

I managed to fix this sequence of errors by increasing the tcp_user_timeout parameter (which has a default value of 10 s).

Unfortunately, I still have a problem with another (slightly different) error scenario:

Health check timeout on instance Address error: Elapsed(())
Server Address marked bad, reason: failed health check
Server connection terminated Address
Could not get connection from pool error: "AllServersDown"
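
To make it easier to tell the two scenarios apart when going through a long log, here is a minimal Python sketch that counts them. It is not part of PgCat: the patterns are copied from the excerpts above, and the category names (`socket_timeout`, `healthcheck_elapsed`, `all_servers_down`) and the function names are hypothetical, chosen only for this illustration.

```python
import re
from collections import Counter

# Patterns are taken from the log excerpts in this issue.
# The category names are made up for this sketch.
PATTERNS = {
    "socket_timeout": re.compile(r"Error flushing socket .*code: 110"),
    "healthcheck_elapsed": re.compile(r"Health check timeout on instance .*Elapsed"),
    "all_servers_down": re.compile(r"AllServersDown"),
}


def classify(line):
    """Return the first matching category name, or None if nothing matches."""
    for name, pattern in PATTERNS.items():
        if pattern.search(line):
            return name
    return None


def summarize(lines):
    """Count how many lines fall into each category."""
    counts = Counter()
    for line in lines:
        name = classify(line)
        if name is not None:
            counts[name] += 1
    return counts
```

Feeding the log lines into `summarize` (for example from an open file object) gives a per-category count, which shows whether the remaining failures are all of the `Elapsed(())` kind or whether socket timeouts still occur as well.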

So far, no parameter adjustment has helped reliably. In particular, I have tried increasing connect_timeout, healthcheck_timeout, and healthcheck_delay.
Please help!! Thank you!

To Reproduce
Steps to reproduce the behavior:
Run a VM backup of the PostgreSQL host while a PgCat host has active connections to it.

Expected behavior
No errors. No terminated connections.

Additional context
OS: AlmaLinux 9
PgCat version: v1.2.0

Selected config parameters
pool_size = 160 (definitely a sufficient size)
min_pool_size = 2
connect_timeout = 60000
healthcheck_timeout = 60000
tcp_user_timeout = 60000
