Description
Elasticsearch Version
5.6.16
Installed Plugins
No response
Java Version
bundled
OS Version
Linux cluster-1-master-2 4.14.326-245.539.amzn2.x86_64 #1 SMP Tue Sep 26 09:59:02 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Problem Description
We encountered an issue where Elasticsearch did not automatically rebalance shards after a hardware failure on one of our data nodes. This caused an extended period of degraded performance until manual intervention was performed.
The EC2 instance became unhealthy at 2024-10-28 20:40 UTC.
The EC2 instance became healthy again at 2024-10-28 21:10 UTC.
Elasticsearch rebalanced the shards only at 2024-10-28 21:10 UTC, roughly 30 minutes after the failure began and at the same time the instance recovered.
We would like to understand why shard rebalancing took this long. The master node logs are attached.
ES cluster settings
Cluster level settings.
cluster.name: fc-use1-00-conversation-cluster-1
Discovery settings.
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping.unicast.hosts: [XXX]
discovery.zen.ping.timeout: 15s
discovery.zen.fd.ping_interval: 10s
discovery.zen.fd.ping_timeout: 10s
discovery.zen.fd.ping_retries: 6
action.auto_create_index: true
indices.memory.index_buffer_size: 30%
indices.store.throttle.max_bytes_per_sec: 100mb
Node level settings.
node.data: false
node.master: true
node.name: cluster-1-master-1
cluster.routing.allocation.awareness.force.zone.values: zone-a,zone-b,zone-c,zone-d,zone-e,zone-f
cluster.routing.allocation.awareness.attributes: zone
http.enabled: true
Loopback interface
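For reference, here is a rough estimate of how long the master's fault detection should need to drop an unresponsive node with the `discovery.zen.fd.*` values above. This is only a sketch: the class and method names are hypothetical, and the two bounds rest on an assumption about retry scheduling (retries sent immediately after each timeout for the lower bound, retries spaced by a full `ping_interval` for the upper bound), not on a reading of the Elasticsearch source.

```python
from dataclasses import dataclass


@dataclass
class ZenFaultDetection:
    """Hypothetical holder for the discovery.zen.fd.* settings (seconds)."""
    ping_interval_s: float
    ping_timeout_s: float
    ping_retries: int

    def detection_window_s(self):
        """Return rough (low, high) bounds, in seconds, for declaring a node failed.

        ASSUMPTION: low  = every retry fires right after the previous timeout;
                    high = every retry also waits a full ping_interval first.
        """
        low = self.ping_retries * self.ping_timeout_s
        high = self.ping_retries * (self.ping_interval_s + self.ping_timeout_s)
        return low, high


# Values copied from the cluster settings in this report.
fd = ZenFaultDetection(ping_interval_s=10, ping_timeout_s=10, ping_retries=6)
low, high = fd.detection_window_s()

# Observed window from the timeline: 20:40 UTC to 21:10 UTC.
observed_s = 30 * 60

print(f"estimated detection window: {low}s to {high}s")
print(f"observed time before rebalancing: {observed_s}s")
```

Under either assumption the estimate is on the order of one to two minutes, well short of the 30 minutes observed, which is why the delay looks unexpected to us.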
Steps to Reproduce
- Simulate a hardware failure on one of the data nodes by stopping or disconnecting the instance.
- Observe the state of shard allocation and rebalancing.
- Notice that Elasticsearch does not immediately initiate shard rebalancing across available nodes.
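The observation step above could be scripted roughly as follows. This is a minimal sketch that polls the cluster health API (`GET /_cluster/health`) and prints the shard counters that matter here. The base URL, the polling interval, and the function names are assumptions for illustration and are not taken from our setup.

```python
import json
import time
import urllib.request

# Fields of the _cluster/health response that show allocation progress.
HEALTH_KEYS = (
    "status",
    "number_of_data_nodes",
    "relocating_shards",
    "initializing_shards",
    "unassigned_shards",
    "delayed_unassigned_shards",
)


def summarize_health(health):
    """Keep only the allocation-related fields; absent fields become None."""
    return {key: health.get(key) for key in HEALTH_KEYS}


def fetch_health(base_url):
    """Fetch and decode GET <base_url>/_cluster/health."""
    with urllib.request.urlopen(base_url + "/_cluster/health", timeout=10) as resp:
        return json.load(resp)


def watch(base_url, interval_s=10, rounds=6):
    """Print a timestamped health summary every interval_s seconds."""
    for _ in range(rounds):
        stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime())
        print(stamp, summarize_health(fetch_health(base_url)))
        time.sleep(interval_s)
```

Calling, for example, `watch("http://localhost:9200")` (a hypothetical address) right after stopping a data node should show `number_of_data_nodes` drop and `unassigned_shards` rise. In our incident the unassigned shards were not reallocated until the node came back.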
Logs (if relevant)
No response