Shards did not rebalance immediately after hardware failure on data node #116203

Closed as not planned
@Harif-Rahman

Description

Elasticsearch Version

5.6.16

Installed Plugins

No response

Java Version

bundled

OS Version

Linux cluster-1-master-2 4.14.326-245.539.amzn2.x86_64 #1 SMP Tue Sep 26 09:59:02 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Problem Description

We encountered an issue where Elasticsearch did not automatically rebalance shards after a hardware failure on one of our data nodes. This caused an extended period of degraded performance until manual intervention was performed.

The EC2 instance went unhealthy at 2024-10-28 20:40 UTC.
The EC2 instance became healthy again at 2024-10-28 21:10 UTC.

Elasticsearch rebalanced the shards only at 2024-10-28 21:10 UTC, 30 minutes after the instance went unhealthy.

We would like to know why rebalancing took so long. The master node logs are attached.

es_oct_28.log

ES cluster settings

Cluster level settings.

cluster.name: fc-use1-00-conversation-cluster-1

Discovery settings.

discovery.zen.minimum_master_nodes: 2
discovery.zen.ping.unicast.hosts: [XXX]

discovery.zen.ping.timeout: 15s

discovery.zen.fd.ping_interval: 10s
discovery.zen.fd.ping_timeout: 10s
discovery.zen.fd.ping_retries: 6
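
For a rough sense of how long the master would take to drop an unresponsive node under these settings, the fault-detection window can be estimated from the three `discovery.zen.fd.*` values above. This is a back-of-the-envelope sketch, not a description of Elasticsearch internals. It assumes that a ping goes out every `ping_interval`, that each attempt waits up to `ping_timeout`, and that failed attempts are retried back to back. The function name `fd_window_seconds` is hypothetical.

```python
from typing import Tuple


def fd_window_seconds(ping_interval: float,
                      ping_timeout: float,
                      ping_retries: int) -> Tuple[float, float]:
    """Estimate (lower, upper) seconds before a silent node is declared failed.

    Assumption (not taken from the Elasticsearch source): every attempt
    waits the full ping_timeout and retries follow immediately.
    """
    # Best case: the node dies just before a scheduled ping is sent.
    lower = ping_retries * ping_timeout
    # Worst case: the node dies just after answering a ping, so one full
    # ping_interval passes before the first failing attempt starts.
    upper = ping_interval + ping_retries * ping_timeout
    return lower, upper


if __name__ == "__main__":
    # Values from the settings above: 10s interval, 10s timeout, 6 retries.
    low, high = fd_window_seconds(10, 10, 6)
    print("estimated fault-detection window: {}s to {}s".format(low, high))
```

Under these assumptions the node would be removed from the cluster after roughly 60 to 70 seconds, far less than the 30 minutes observed, so fault detection alone would not seem to account for the delay.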

action.auto_create_index: true

indices.memory.index_buffer_size: 30%

indices.store.throttle.max_bytes_per_sec: 100mb

Node level settings.

node.data: false
node.master: true
node.name: cluster-1-master-1

cluster.routing.allocation.awareness.force.zone.values: zone-a,zone-b,zone-c,zone-d,zone-e,zone-f
cluster.routing.allocation.awareness.attributes: zone
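
Forced awareness may be relevant here, because it caps how many copies of a shard one zone may hold and leaves copies unassigned rather than exceeding that cap. The sketch below is a simplified model of that cap, not the allocation decider's actual code. It assumes the cap is the total number of copies divided by the number of forced zone values, rounded up. The names `max_copies_per_zone` and `can_allocate` are hypothetical.

```python
import math
from collections import Counter
from typing import List

# Zone values taken from cluster.routing.allocation.awareness.force.zone.values.
FORCED_ZONES = ["zone-a", "zone-b", "zone-c", "zone-d", "zone-e", "zone-f"]


def max_copies_per_zone(total_copies: int, forced_zones: List[str]) -> int:
    """Simplified cap: copies are spread evenly over all forced zone values."""
    return math.ceil(total_copies / len(forced_zones))


def can_allocate(target_zone: str,
                 assigned_zones: List[str],
                 total_copies: int,
                 forced_zones: List[str]) -> bool:
    """Would adding one more copy to target_zone stay within the cap?

    assigned_zones lists the zone of every copy that is already assigned.
    """
    limit = max_copies_per_zone(total_copies, forced_zones)
    return Counter(assigned_zones)[target_zone] + 1 <= limit


if __name__ == "__main__":
    # One primary plus one replica; the surviving copy lives in zone-a.
    print(can_allocate("zone-b", ["zone-a"], 2, FORCED_ZONES))  # another zone
    print(can_allocate("zone-a", ["zone-a"], 2, FORCED_ZONES))  # same zone
```

In this model a lost replica can be rebuilt only in a zone that does not already hold a copy, so whether recovery can start depends on which zones still have data nodes with capacity.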

http.enabled: true

Loopback interface

Steps to Reproduce

  • Simulate a hardware failure on one of the data nodes by stopping or disconnecting the instance.
  • Observe the state of shard allocation and rebalancing.
  • Notice that Elasticsearch does not immediately initiate shard rebalancing across available nodes.
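
One way to carry out the "observe" step above is to poll the cluster health API while the node is down and record the shard counters over time. The following is a minimal sketch using only the Python standard library. The base URL `http://localhost:9200`, the polling cadence, and the helper names `fetch_json`, `summarize_health`, and `watch` are assumptions for illustration.

```python
import json
import time
import urllib.request
from typing import Any, Callable, Dict, List

HEALTH_KEYS = ("status", "unassigned_shards", "delayed_unassigned_shards",
               "initializing_shards", "relocating_shards")


def fetch_json(base_url: str, path: str, timeout: float = 10.0) -> Dict[str, Any]:
    """GET base_url + path and decode the JSON response body."""
    url = base_url.rstrip("/") + path
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def summarize_health(health: Dict[str, Any]) -> str:
    """Render the shard-related fields of a /_cluster/health response."""
    return " ".join("{}={}".format(k, health.get(k, "?")) for k in HEALTH_KEYS)


def watch(base_url: str,
          fetch: Callable[[str, str], Dict[str, Any]] = fetch_json,
          polls: int = 30,
          interval: float = 10.0,
          sleep: Callable[[float], None] = time.sleep) -> List[str]:
    """Poll cluster health `polls` times and return one summary line per poll."""
    lines = []
    for _ in range(polls):
        line = summarize_health(fetch(base_url, "/_cluster/health"))
        print(time.strftime("%H:%M:%S"), line)
        lines.append(line)
        sleep(interval)
    return lines


if __name__ == "__main__":
    # Assumed address; point this at any node that serves HTTP.
    watch("http://localhost:9200")
```

If `unassigned_shards` stays above zero for the whole outage, `GET /_cluster/allocation/explain` should report why the first unassigned shard is not being allocated.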

Logs (if relevant)

No response

Labels: >bug, needs:triage (requires assignment of a team area label)