Description
Elasticsearch Version
8.14.1
Installed Plugins
n/a
Java Version
17
OS Version
Linux 5.15.167-112.165.amzn2.x86_64 #1 SMP Mon Sep 23 21:53:24 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Problem Description
I have a 3-node cluster in which Node-1 suffers a hardware failure. The other two nodes detect the disconnect but do not hold an election, and no failover occurs. Prometheus metrics from Node-2 and Node-3 indicate that both nodes still report GREEN cluster health and 3 nodes in the cluster.
Timeline:
- 11:44:00 Node-1 has a hardware failure and goes down. Up to this point it had been logging "ProcessClusterEventTimeoutException: failed to process cluster event" (see [1] for examples), and Node-2 and Node-3 also have ProcessClusterEventTimeoutException in their logs. Node-2 and Node-3 are already seeing ReceiveTimeoutTransportException when talking to Node-1.
- 11:57:47 Node-2 and Node-3 logs start to show "NodeNotConnectedException: [node-1][<NODE-1_IP>:9300] Node not connected". See [2] for the log.
- 12:59 Node-1 recovers, but Node-2 and Node-3 still do not hold an election.
- 13:01 Node-2 and Node-3 ES instances are restarted; Node-2 becomes master; cluster health goes from RED -> YELLOW -> GREEN. See [3] for the log.
- Node-2 and Node-3 only hold an election AFTER they are MANUALLY restarted, and only after we had waited more than 1 hour for them to fail over automatically, which they never did (see the checks sketched after this list).
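While the cluster was in this state, the two healthy nodes could also have been asked directly which node they still considered the elected master. A minimal sketch, assuming the HTTP interface is reachable on port 9200 (only the transport port 9300 appears in the logs; adjust host/port and any security options for your setup):

```sh
# Which node does Node-2 currently consider the elected master?
curl -s "http://<NODE-2_IP>:9200/_cat/master?v"

# Same question answered from Node-2's local copy of the cluster state,
# without routing the request through the (possibly unresponsive) master.
curl -s "http://<NODE-2_IP>:9200/_cluster/state/master_node?local=true"
```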
Even though no failover happened, the Prometheus exporter showed Node-2 and Node-3 reporting a cluster status of GREEN, with 3 nodes in the cluster, for the entire time Node-1 was down.
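For what it's worth, the GREEN/3-node reading could be cross-checked against the REST API rather than the exporter. Again a minimal sketch, under the same port-9200 assumption:

```sh
# Cluster health and node count as Node-2 sees them.
curl -s "http://<NODE-2_IP>:9200/_cluster/health?filter_path=status,number_of_nodes"

# Nodes Node-2 currently knows about; the elected master is marked with '*'.
curl -s "http://<NODE-2_IP>:9200/_cat/nodes?v&h=name,master,node.role"
```

By default the health request is answered by the elected master, so if the remaining nodes had genuinely lost their master this call would typically fail or time out rather than report GREEN.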
The settings cluster.fault_detection.leader_check.interval, cluster.fault_detection.leader_check.timeout, and cluster.fault_detection.leader_check.retry_count have not been modified, so they should be at their defaults of 1s, 10s, and 3 respectively.
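The effective values can also be confirmed through the cluster settings API with defaults included; a minimal sketch under the same assumptions:

```sh
# Print the fault-detection settings actually in effect, defaults included.
curl -s "http://<NODE-2_IP>:9200/_cluster/settings?include_defaults=true&flat_settings=true" \
  | grep fault_detection
```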
See also the community forum thread for this issue: https://discuss.elastic.co/t/node-fails-but-cluster-holds-no-election-and-no-failover-occurs/369391
Steps to Reproduce
n/a
Logs (if relevant)
[1]:
[2024-10-15T11:43:00,998][WARN ][rest.suppressed ] [node-1] path: /news/_settings, params: {master_timeout=30s, index=news, timeout=30s}, status: 503
org.elasticsearch.transport.RemoteTransportException: [node-2][<NODE-2_IP>:9300][indices:admin/settings/update]
Caused by: org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (update-settings [[news-8-0/JXXtmZtqSRm2GKNx1-2Gkw]]) within 30s
at org.elasticsearch.cluster.service.MasterService$TaskTimeoutHandler.doRun(MasterService.java:1460) ~[elasticsearch-8.14.1.jar:?]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:984) ~[elasticsearch-8.14.1.jar:?]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-8.14.1.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
at java.lang.Thread.run(Thread.java:840) ~[?:?]
[2024-10-15T11:43:56,870][WARN ][rest.suppressed ] [node-1] path: /designer-objects-ia/_settings, params: {master_timeout=30s, index=designer-objects-ia, timeout=30s}, status: 503
org.elasticsearch.transport.RemoteTransportException: [node-2][<NODE-2_IP>:9300][indices:admin/settings/update]
Caused by: org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (update-settings [[designer-objects-ia-8-6/jZvLf-kxTGSigmHZrFQI7A]]) within 30s
at org.elasticsearch.cluster.service.MasterService$TaskTimeoutHandler.doRun(MasterService.java:1460) ~[elasticsearch-8.14.1.jar:?]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:984) ~[elasticsearch-8.14.1.jar:?]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-8.14.1.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
at java.lang.Thread.run(Thread.java:840) ~[?:?]
[2]:
[2024-10-15T11:57:47,914][DEBUG][org.elasticsearch.action.support.nodes.TransportNodesAction] [node-2] failed to execute [cluster:monitor/nodes/stats] on node [{node-1}{UtQcAppJSkO-4BQc4b4avA}{xAjt4736RuaDjmi926JniA}{node-1}{node-1}{<NODE-1_IP>:9300}{cdfhilmrstw}{8.14.1}{7000099-8505000}{xpack.installed=true}]
org.elasticsearch.transport.NodeNotConnectedException: [node-1][<NODE-1_IP>:9300] Node not connected
at org.elasticsearch.transport.ClusterConnectionManager.getConnection(ClusterConnectionManager.java:283) ~[elasticsearch-8.14.1.jar:?]
at org.elasticsearch.transport.TransportService.getConnection(TransportService.java:876) ~[elasticsearch-8.14.1.jar:?]
at org.elasticsearch.transport.TransportService.getConnectionOrFail(TransportService.java:771) ~[elasticsearch-8.14.1.jar:?]
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:757) ~[elasticsearch-8.14.1.jar:?]
at org.elasticsearch.action.support.nodes.TransportNodesAction$1.sendItemRequest(TransportNodesAction.java:127) ~[elasticsearch-8.14.1.jar:?]
at org.elasticsearch.action.support.nodes.TransportNodesAction$1.sendItemRequest(TransportNodesAction.java:95) ~[elasticsearch-8.14.1.jar:?]
...
...
[2024-10-15T11:57:47,917][WARN ][org.elasticsearch.cluster.InternalClusterInfoService] [node-2] failed to retrieve stats for node [UtQcAppJSkO-4BQc4b4avA]
org.elasticsearch.transport.NodeNotConnectedException: [node-1][<NODE-1_IP>:9300] Node not connected
at org.elasticsearch.transport.ClusterConnectionManager.getConnection(ClusterConnectionManager.java:283) ~[elasticsearch-8.14.1.jar:?]
at org.elasticsearch.transport.TransportService.getConnection(TransportService.java:876) ~[elasticsearch-8.14.1.jar:?]
at org.elasticsearch.transport.TransportService.getConnectionOrFail(TransportService.java:771) ~[elasticsearch-8.14.1.jar:?]
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:757) ~[elasticsearch-8.14.1.jar:?]
at org.elasticsearch.action.support.nodes.TransportNodesAction$1.sendItemRequest(TransportNodesAction.java:127) ~[elasticsearch-8.14.1.jar:?]
at org.elasticsearch.action.support.nodes.TransportNodesAction$1.sendItemRequest(TransportNodesAction.java:95) ~[elasticsearch-8.14.1.jar:?]
[3]:
Log from Node-2, which becomes the new master:
[2024-10-15T13:01:33,271][INFO ][org.elasticsearch.cluster.service.MasterService] [node-2] elected-as-master ([3] nodes joined in term 5)[_FINISH_ELECTION_, {node-2}{tNTw-p6NSnSJxY968nPYZw}{VKtjkdECT12KNNSUWslwVw}{node-2}{node-2}{<NODE-2_IP>:9300}{cdfhilmrstw}{8.14.1}{7000099-8505000} completing election, {node-3}{6LpgSROnTNmGmzzlCfofZQ}{JPvvOvJzQW6V51A954agpw}{node-3}{node-3}{<NODE-3_IP>:9300}{cdfhilmrstw}{8.14.1}{7000099-8505000} completing election, {node-1}{kzk-51K-ThKg7uwQZ9as2g}{_BGkbFdMS5CFULwIZrHBAw}{node-1}{node-1}{<NODE-1_IP>:9300}{cdfhilmrstw}{8.14.1}{7000099-8505000} completing election], term: 5, version: 68136, delta: master node changed {previous [], current [{node-2}{tNTw-p6NSnSJxY968nPYZw}{VKtjkdECT12KNNSUWslwVw}{node-2}{node-2}{<NODE-2_IP>:9300}{cdfhilmrstw}{8.14.1}{7000099-8505000}]}, added {{node-1}{kzk-51K-ThKg7uwQZ9as2g}{_BGkbFdMS5CFULwIZrHBAw}{node-1}{node-1}{<NODE-1_IP>:9300}{cdfhilmrstw}{8.14.1}{7000099-8505000}, {node-3}{6LpgSROnTNmGmzzlCfofZQ}{JPvvOvJzQW6V51A954agpw}{node-3}{node-3}{<NODE-3_IP>:9300}{cdfhilmrstw}{8.14.1}{7000099-8505000}}
[2024-10-15T13:01:33,468][INFO ][org.elasticsearch.cluster.service.ClusterApplierService] [node-2] master node changed {previous [], current [{node-2}{tNTw-p6NSnSJxY968nPYZw}{VKtjkdECT12KNNSUWslwVw}{node-2}{node-2}{<NODE-2_IP>:9300}{cdfhilmrstw}{8.14.1}{7000099-8505000}]}, added {{node-1}{kzk-51K-ThKg7uwQZ9as2g}{_BGkbFdMS5CFULwIZrHBAw}{node-1}{node-1}{<NODE-1_IP>:9300}{cdfhilmrstw}{8.14.1}{7000099-8505000}, {node-3}{6LpgSROnTNmGmzzlCfofZQ}{JPvvOvJzQW6V51A954agpw}{node-3}{node-3}{<NODE-3_IP>:9300}{cdfhilmrstw}{8.14.1}{7000099-8505000}}, term: 5, version: 68136, reason: Publication{term=5, version=68136}
...
...
[2024-10-15T13:01:41,829][INFO ][org.elasticsearch.cluster.routing.allocation.AllocationService] [node-2] current.health="YELLOW" message="Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[user-activity-8-0][0]]])." previous.health="RED" reason="shards started [[user-activity-8-0][0]]"
[2024-10-15T13:01:41,864][WARN ][org.elasticsearch.cluster.routing.allocation.AllocationService] [node-2] [user-activity-8-0][0] marking unavailable shards as stale: [MsnZMCyIRnmjkxcnhEtO-A]
[2024-10-15T13:01:42,186][WARN ][org.elasticsearch.cluster.routing.allocation.AllocationService] [node-2] [user-activity-8-0][0] marking unavailable shards as stale: [9aeL3u_KQYOsqSSdgI6b1g]
[2024-10-15T13:01:44,994][INFO ][org.elasticsearch.cluster.routing.allocation.AllocationService] [node-2] current.health="GREEN" message="Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[user-activity-8-0][0]]])." previous.health="YELLOW" reason="shards started [[user-activity-8-0][0]]"