At the time of writing (Solr 7.3.0), SolrCloud is a reliable and stable distributed architecture for Apache Solr.
But it is not perfect and failures happen.
This lightning blog post will present some practical tips to follow when a specific shard of a collection is down with no leader and the situation is stuck.
The problem described here has been experienced with the following Solr versions :
Steps to solve the problem may involve manual interaction with the Zookeeper ensemble.
The following steps are extracted from an interesting thread on the Solr User mailing list and from practical experience in the field.
In particular, thanks to Jeff Wartes for the suggestions, which proved useful for me on a couple of occasions.
- All nodes for a Shard in a Collection are up and running
- There is no leader for the shard
- All the nodes are in a “Recovering” / “Recovery Failed” state
- Search is down and the situation persists after many minutes (> 5)
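A quick way to confirm these symptoms is the CLUSTERSTATUS action of the Collection API, whose response reports each replica's `state` and flags the shard leader (if any) with `"leader":"true"`. A minimal sketch, assuming a collection named `myCollection` and a Solr node on `localhost:8983` (both names are illustrative):

```shell
# Hypothetical host and collection names: adjust to your cluster.
SOLR_HOST="localhost:8983"
COLLECTION="myCollection"

# Build the CLUSTERSTATUS request; the JSON response lists every replica
# with its state (active, recovering, recovery_failed, down) and marks
# the current leader of each shard, if one exists.
URL="http://${SOLR_HOST}/solr/admin/collections?action=CLUSTERSTATUS&collection=${COLLECTION}"
echo "$URL"

# Run it against a live cluster:
# curl -s "$URL"
```

If no replica of the affected shard carries the leader flag while all of them report as live, you are in the scenario this post describes.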
A possible explanation for this problem is that the node-local version of the Zookeeper cluster state has diverged from the centralized Zookeeper cluster state.
One possible cause for the leader election to break is a Zookeeper failure: for example, losing >= 50% of the ensemble nodes, or losing connectivity among the ensemble nodes for a certain period of time (this is the scenario I experienced directly).
This failure, even if resolved later, can leave the Zookeeper file system corrupted, and some of the SolrCloud collections may remain in an inconsistent state.
It may be necessary to manually delete corrupted files from Zookeeper:
Let’s start from:
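The znode to inspect first is the per-shard election folder, where each live replica registers an ephemeral sequential entry. A sketch, assuming collection `myCollection`, `shard1`, and a Zookeeper node on `localhost:2181` (all names are illustrative):

```shell
# Hypothetical names: adjust collection, shard and Zookeeper address.
COLLECTION="myCollection"
SHARD="shard1"
ELECTION_ZNODE="/collections/${COLLECTION}/leader_elect/${SHARD}/election"
echo "$ELECTION_ZNODE"

# List its children with the Zookeeper client (bin/zkCli.sh ships with
# the Zookeeper distribution); each entry looks like
# <session>-core_nodeX-n_<sequence>:
# bin/zkCli.sh -server localhost:2181 ls "$ELECTION_ZNODE"
```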
A healthy SolrCloud cluster presents as many core_nodeX entries as the total number of replicas for the shard.
You don’t want duplicates or missing nodes here.
If you’re having trouble getting a sane election, you can try deleting the lowest-numbered entries (as well as any lower-numbered duplicates) and force the election again, possibly followed by restarting the node with that lowest-numbered entry.
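Deleting an election entry can be sketched as follows; the entry name below is purely hypothetical, since the real ephemeral node names come from the `ls` output of the election znode on your own cluster:

```shell
# Hypothetical names: take the real entry from your own "ls" output,
# never guess it. The numeric suffix is the election sequence number.
COLLECTION="myCollection"
SHARD="shard1"
ENTRY="/collections/${COLLECTION}/leader_elect/${SHARD}/election/98765432109876543-core_node1-n_0000000071"
echo "$ENTRY"

# Delete it with the Zookeeper client (double-check the path first;
# "delete" removes a single znode, it is not recursive):
# bin/zkCli.sh -server localhost:2181 delete "$ENTRY"
```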
Make sure that this folder exists and has the expected replica as a leader.
This folder can be informative too: it represents replicas that the *leader* thinks are out of sync, usually due to a failed update request.
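In a standard SolrCloud layout, these two znodes live under the collection node; a sketch of where to look, again assuming the illustrative names `myCollection` and `shard1`:

```shell
# Hypothetical names: adjust collection and shard to your cluster.
COLLECTION="myCollection"
SHARD="shard1"

# Per-shard leader registration: should exist and point at the replica
# you expect to be the leader.
LEADER_ZNODE="/collections/${COLLECTION}/leaders/${SHARD}"

# Replicas the leader has put into leader-initiated recovery.
LIR_ZNODE="/collections/${COLLECTION}/leader_initiated_recovery/${SHARD}"

echo "$LEADER_ZNODE"
echo "$LIR_ZNODE"

# Inspect both with the Zookeeper client:
# bin/zkCli.sh -server localhost:2181 get "$LEADER_ZNODE"
# bin/zkCli.sh -server localhost:2181 ls  "$LIR_ZNODE"
```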
After having completed the verification above, there are a couple of Collection API endpoints that may be useful:
Force Leader Election
Force Leader Rebalance
N.B. rebalancing all the leaders will affect all the shards
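These two endpoints correspond to the FORCELEADER and REBALANCELEADERS actions of the Collection API; a sketch, assuming collection `myCollection`, `shard1`, and a Solr node on `localhost:8983` (names are illustrative):

```shell
# Hypothetical host, collection and shard names: adjust to your cluster.
SOLR_HOST="localhost:8983"
COLLECTION="myCollection"
SHARD="shard1"

# FORCELEADER: force a leader election on a single shard that has live
# replicas but no leader.
FORCE_URL="http://${SOLR_HOST}/solr/admin/collections?action=FORCELEADER&collection=${COLLECTION}&shard=${SHARD}"

# REBALANCELEADERS: reassign leadership according to the preferredLeader
# replica property; remember that this touches every shard of the
# collection, not just the broken one.
REBALANCE_URL="http://${SOLR_HOST}/solr/admin/collections?action=REBALANCELEADERS&collection=${COLLECTION}"

echo "$FORCE_URL"
echo "$REBALANCE_URL"

# Run against a live cluster:
# curl -s "$FORCE_URL"
```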