Description
Elasticsearch Version
8.15.2
Installed Plugins
No response
Java Version
OpenJDK 64-Bit Server VM (build 22.0.1+8-16)
OS Version
Linux 6.6.56+ x86_64
Problem Description
In #84375 a new readiness service was introduced that allows a simple tcp check. We've been using this service successfully for some time as a Kubernetes readiness probe endpoint.
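For context, this is roughly how we wire the readiness service into Kubernetes; the port matches what appears in the logs below, and the probe timings are our own choices rather than defaults:

# elasticsearch.yml: enable the TCP readiness listener
readiness.port: 9399

# Kubernetes container spec excerpt: plain TCP readiness probe against that port
readinessProbe:
  tcpSocket:
    port: 9399
  periodSeconds: 5
  failureThreshold: 3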
As of Elasticsearch 8.14+, the readiness check condition was modified in #106437 to also check for fileSettingsApplied rather than just masterElected.
Since this change, if you restore a snapshot with include_global_state: true, the readiness service shuts down and all nodes become unready. You will see the following log messages:
{"type": "ElasticSearch", "timestamp": "2025-02-03T15:30:06,500Z", "level": "INFO", "component": "o.e.r.ReadinessService", "cluster.name": "test-cluster", "node.name": "test-data-1, "message": "readiness change: masterElected=true" }
{"type": "ElasticSearch", "timestamp": "2025-02-03T15:30:06,502Z", "level": "INFO", "component": "o.e.r.ReadinessService", "cluster.name": "test-cluster", "node.name": "test-data-1, "message": "readiness service up and running on 0.0.0.0:9399" }
{"type": "ElasticSearch", "timestamp": "2025-02-03T15:32:29,127Z", "level": "INFO", "component": "o.e.r.ReadinessService", "cluster.name": "test-cluster", "node.name": "test-data-1", "message": "readiness change: fileSettingsApplied=false", "cluster.uuid": "REDACTED", "node.id": "REDACTED }
{"type": "ElasticSearch", "timestamp": "2025-02-03T15:32:29,128Z", "level": "INFO", "component": "o.e.r.ReadinessService", "cluster.name": "test-cluster", "node.name": "test-data-1" "message": "stopping readiness service on channel /0.0.0.0:9399", "cluster.uuid": "REDACTED", "node.id": "REDACTED" }
The cluster will NOT recover by itself until you manually force a master re-election by restarting the existing master. After that you will see:
{"type": "ElasticSearch", "timestamp": "2025-02-03T17:09:51,808Z", "level": "INFO", "component": "o.e.r.ReadinessService", "cluster.name": "test-cluster", "node.name": "test-data-1", "message": "readiness change: masterElected=false", "cluster.uuid": "REDACTED", "node.id": "REDACTED" }
{"type": "ElasticSearch", "timestamp": "2025-02-03T17:09:52,349Z", "level": "INFO", "component": "o.e.r.ReadinessService", "cluster.name": "test-cluster", "node.name": "test-data-1", "message": "readiness change: masterElected=true", "cluster.uuid": "REDACTED", "node.id": "REDACTED" }
The workaround appears to be to provide an empty operator settings.json file on each node's disk:
{
  "metadata": {
    "version": "1",
    "compatibility": "8.4.0"
  },
  "state": {
    "cluster_settings": {}
  }
}
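For completeness, this is how we place the file in our deployment. The path assumes the default config directory of the official Docker image; adjust it to $ES_PATH_CONF/operator/settings.json if your layout differs:

# create the operator directory and drop in the empty settings file before starting the node
mkdir -p /usr/share/elasticsearch/config/operator
cp settings.json /usr/share/elasticsearch/config/operator/settings.json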
If this file is present, the problem will not occur. I believe this may be due to the file's presence forcing a different code path in handleSnapshotRestore (see https://github.com/elastic/elasticsearch/pull/89321/files#diff-5a2c0417a5b4bdbcc6873fc853db7a8c531b47536bd41b84e20bb6c7b21ad6d9R160).
It is unexpected that a global snapshot restore breaks the readiness service. If you are using the readiness service as a Kubernetes readiness probe, your cluster may become entirely unreachable, causing downtime.
Steps to Reproduce
- Create a new ES 8.15.2 cluster.
- Take a snapshot.
- Restore the snapshot with include_global_state: true (an example request is shown after this list).
- At this point the readiness service shuts down on all nodes and will not come back until you re-elect a master (e.g. by forcefully restarting the master node).
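A minimal example of the restore request referenced above; the repository and snapshot names are placeholders:

POST _snapshot/my_repo/my_snapshot/_restore
{
  "include_global_state": true
}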
Logs (if relevant)
See issue description.