Description
Elasticsearch Version
8.15.2
Installed Plugins
No response
Java Version
OpenJDK 64-Bit Server VM (build 22.0.1+8-16)
OS Version
Linux 6.6.56+ x86_64
Problem Description
In #84375 a new readiness service was introduced that allows a simple tcp check. We've been using this service successfully for some time as a Kubernetes readiness probe endpoint.
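For context, this is roughly how we wire the readiness service into Kubernetes; the port matches what appears in the logs below, and the probe timings are our own choices rather than defaults:

# elasticsearch.yml: enable the TCP readiness listener
readiness.port: 9399

# Kubernetes container spec excerpt: plain TCP readiness probe against that port
readinessProbe:
  tcpSocket:
    port: 9399
  periodSeconds: 5
  failureThreshold: 3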
As of Elasticsearch 8.14+, the readiness check condition was modified in #106437 to also check for fileSettingsApplied rather than just masterElected.
Since this change, if you restore a snapshot with include_global_state: true, the readiness service shuts down and all nodes become unready. You will see the following log messages:
{"type": "ElasticSearch", "timestamp": "2025-02-03T15:30:06,500Z", "level": "INFO", "component": "o.e.r.ReadinessService", "cluster.name": "test-cluster", "node.name": "test-data-1, "message": "readiness change: masterElected=true" }
{"type": "ElasticSearch", "timestamp": "2025-02-03T15:30:06,502Z", "level": "INFO", "component": "o.e.r.ReadinessService", "cluster.name": "test-cluster", "node.name": "test-data-1, "message": "readiness service up and running on 0.0.0.0:9399" }
{"type": "ElasticSearch", "timestamp": "2025-02-03T15:32:29,127Z", "level": "INFO", "component": "o.e.r.ReadinessService", "cluster.name": "test-cluster", "node.name": "test-data-1", "message": "readiness change: fileSettingsApplied=false", "cluster.uuid": "REDACTED", "node.id": "REDACTED }
{"type": "ElasticSearch", "timestamp": "2025-02-03T15:32:29,128Z", "level": "INFO", "component": "o.e.r.ReadinessService", "cluster.name": "test-cluster", "node.name": "test-data-1" "message": "stopping readiness service on channel /0.0.0.0:9399", "cluster.uuid": "REDACTED", "node.id": "REDACTED" }
The cluster will NOT recover by itself until you manually force a master re-election by restarting the existing master. After that you will see:
{"type": "ElasticSearch", "timestamp": "2025-02-03T17:09:51,808Z", "level": "INFO", "component": "o.e.r.ReadinessService", "cluster.name": "test-cluster", "node.name": "test-data-1", "message": "readiness change: masterElected=false", "cluster.uuid": "REDACTED", "node.id": "REDACTED" }
{"type": "ElasticSearch", "timestamp": "2025-02-03T17:09:52,349Z", "level": "INFO", "component": "o.e.r.ReadinessService", "cluster.name": "test-cluster", "node.name": "test-data-1", "message": "readiness change: masterElected=true", "cluster.uuid": "REDACTED", "node.id": "REDACTED" }
The workaround appears to be to provide an empty operator settings.json file on each node's disk:
{
  "metadata": {
    "version": "1",
    "compatibility": "8.4.0"
  },
  "state": {
    "cluster_settings": {}
  }
}
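For completeness, this is how we place the file in our deployment. The path assumes the default config directory of the official Docker image; adjust it to $ES_PATH_CONF/operator/settings.json if your layout differs:

# create the operator directory and drop in the empty settings file before starting the node
mkdir -p /usr/share/elasticsearch/config/operator
cp settings.json /usr/share/elasticsearch/config/operator/settings.json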
If this file is present, the problem will not occur. I believe this may be due to the file's presence forcing a different code path in handleSnapshotRestore (see https://github.com/elastic/elasticsearch/pull/89321/files#diff-5a2c0417a5b4bdbcc6873fc853db7a8c531b47536bd41b84e20bb6c7b21ad6d9R160).
It is unexpected that a global snapshot restore breaks the readiness service. If you are using the readiness service as a Kubernetes readiness probe, your cluster may become entirely unreachable, causing downtime.
Steps to Reproduce
- Create a new ES 8.15.2 cluster.
- Take a snapshot.
- Restore the snapshot with include_global_state: true (an example request is shown after this list).
- At this point the readiness service shuts down on all nodes and will not come back until you re-elect a master (e.g. by forcefully restarting the master node).
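A minimal example of the restore request referenced above; the repository and snapshot names are placeholders:

POST _snapshot/my_repo/my_snapshot/_restore
{
  "include_global_state": true
}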
Logs (if relevant)
See issue description.