
NodeDoesNotExistOnMasterException Handling #3663


Closed
anandnalya opened this issue Sep 11, 2013 · 4 comments

@anandnalya

This is related to #2117

I'm able to reproduce a slight variation with the following steps:
Set discovery.zen.minimum_master_nodes = 2 and node.max_local_storage_nodes = 1. The nodes are deployed on NFS so that if a machine fails, an ES process can be started on a different machine using the same data path. Each machine has access to the data directories of all ES nodes.
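For reference, the setup above corresponds to an elasticsearch.yml roughly like this (the paths are the ones from this report; per-node values differ only in path.data):

```yaml
# Require 2 master-eligible nodes to be visible before electing a master
discovery.zen.minimum_master_nodes: 2

# Allow only one node process per data path
node.max_local_storage_nodes: 1

# Per-node data directory on the shared NFS mount (this is Node1's value)
path.data: /nfs/node1
```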

The initial deployment was:

  • ES1 - Node1:9300 (path.data=/nfs/node1)
  • ES2 - Node2:9300 (path.data=/nfs/node2)
  • ES3 - Node3:9300 (path.data=/nfs/node3)

Then I took the following steps:

  1. Killed the network on Node2, which resulted in a two-node cluster (ES1, ES3). The ES2 process was still running on Node2.
  2. Started a new ES process on Node1 with path.data=/nfs/node2. I assumed that, since node.max_local_storage_nodes = 1, it would not start because ES2 on Node2:9300 already held the lock on that path, but it started anyway. The cluster now looked like:
  • ES1 - Node1:9300 (path.data=/nfs/node1)
  • ES2 - Node1:9301 (path.data=/nfs/node2)
  • ES3 - Node3:9300 (path.data=/nfs/node3)

ES2 - Node2:9300 (path.data=/nfs/node2) was still running but was no longer part of the cluster.

  3. Brought the network on Node2 back up, which resulted in the following two clusters:

  • Cluster1:
    • ES1 - Node1:9300 (path.data=/nfs/node1)
    • ES2 - Node1:9301 (path.data=/nfs/node2)
    • ES3 - Node3:9300 (path.data=/nfs/node3)
  • Cluster2:
    • ES1 - Node1:9300 (path.data=/nfs/node1)
    • ES2 - Node2:9300 (path.data=/nfs/node2)

Now, Node1:9300 is participating in both clusters.

This happens because of the way NodeDoesNotExistOnMasterException is handled at https://github.com/elasticsearch/elasticsearch/blob/master/src/main/java/org/elasticsearch/discovery/zen/fd/MasterFaultDetection.java#L315

When Node2's network comes back up, ES2 sees that it is no longer part of the cluster and starts a master election, resulting in a new master.
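The split brain is easy to see with a simple quorum check: because Node1:9300 is counted on both sides of the partition, each side independently satisfies minimum_master_nodes = 2 and can elect its own master. A minimal sketch of that check (the node sets and the quorum function are illustrative, not Elasticsearch code):

```python
def has_quorum(visible_master_eligible_nodes, minimum_master_nodes):
    """A partition may elect a master only if it can see at least
    minimum_master_nodes master-eligible nodes (the Zen discovery rule)."""
    return len(visible_master_eligible_nodes) >= minimum_master_nodes

MINIMUM_MASTER_NODES = 2

# After step 3, Node1:9300 appears on both sides of the partition.
cluster1 = {"Node1:9300", "Node1:9301", "Node3:9300"}
cluster2 = {"Node1:9300", "Node2:9300"}

# Both sides pass the quorum check, so both elect a master: split brain.
print(has_quorum(cluster1, MINIMUM_MASTER_NODES))  # True
print(has_quorum(cluster2, MINIMUM_MASTER_NODES))  # True
```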

@anandnalya
Author

I'm using the following workaround for this: anandnalya@8aba868

@spinscale
Contributor

Hey,

a couple of notes:

  • node.max_local_storage_nodes only applies if you use the same installation directory. As you are using a different path.data for each process, this setting does not apply.
  • You should not use NFS at all when working with Elasticsearch. Your search and indexing performance will be dramatically slower, and native locking is not supported on NFS. You might want to read this thread for further information: http://lucene.472066.n3.nabble.com/Lucene-index-on-NFS-td4011301.html
  • You should especially make sure that data directories are not written to by two separate processes, whether those processes run on the same machine or on different machines via shared network storage.
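The locking caveat above can be sketched in a few lines: on a local filesystem, an exclusive flock on a lock file keeps a second would-be node out of a data path, but flock semantics over NFS are historically unreliable, which is how two processes on different machines can end up writing the same directory. (The node.lock file name and this mechanism are a simplification for illustration, not Elasticsearch's actual implementation.)

```python
import fcntl
import os
import tempfile

def try_lock_data_path(data_path):
    """Try to take an exclusive, non-blocking lock on <data_path>/node.lock.
    Returns the open lock file on success, or None if the lock is held."""
    lock_file = open(os.path.join(data_path, "node.lock"), "w")
    try:
        fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return lock_file
    except BlockingIOError:
        lock_file.close()
        return None

data_path = tempfile.mkdtemp()

first = try_lock_data_path(data_path)   # first "node": acquires the lock
second = try_lock_data_path(data_path)  # second "node" on the same path

print(first is not None)  # True: the lock was acquired
print(second is None)     # True on Linux: the second flock is denied
```

Over NFS, the second call might succeed anyway, which is exactly the failure mode being warned about.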

Hope this helps (and ensures you have a fast and reliable search engine up and running).

@anandnalya
Author

Hi,

I wanted to know whether there are any other complications of running ES over NFS apart from the Lucene gotchas.

Thanks,
Anand

@bleskes
Contributor

bleskes commented Jan 26, 2015

I think this should be solved by the improvements to Zen Discovery we made in 1.4. If this is not the case, please feel free to re-open.

@bleskes bleskes closed this as completed Jan 26, 2015