At the time of writing (Solr 7.3.0), SolrCloud is a reliable and stable distributed architecture for Apache Solr.
But it is not perfect, and failures happen.
This lightning blog post presents some practical tips to follow when a specific shard of a collection is down with no leader and the situation appears stuck.
The problem described here has been experienced with the following Solr versions:
Steps to solve the problem may involve manual interaction with the ZooKeeper ensemble.
The following steps are extracted from an interesting thread on the Solr User mailing list and from practical experience in the field.
In particular, thanks to Jeff Wartes for the suggestions, which proved useful for me on a couple of occasions.
- All nodes for a Shard in a Collection are up and running
- There is no leader for the shard
- All the nodes are in a “Recovering” / “Recovery Failed” state
- Search is down and the situation persists for many minutes (> 5)
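A quick way to confirm these symptoms from the outside is the CLUSTERSTATUS endpoint of the Collections API; the host, collection and shard names below are placeholders to adapt to your deployment:

```
# Ask the cluster for the state of every replica of the shard:
# look for "state":"recovering" / "recovery_failed" and for the absence of "leader":"true"
curl "http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&collection=my_collection&shard=shard1&wt=json"
```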
A possible explanation for this problem is that the node-local version of the ZooKeeper cluster state has diverged from the centralized ZooKeeper cluster state.
One possible cause for the leader election to break is a ZooKeeper failure: for example, you lose 50% or more of the ensemble nodes, or the connectivity among the ensemble nodes, for a certain period of time (this is the scenario I experienced directly).
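As a quick sketch, assuming a three-node ensemble reachable as zk1/zk2/zk3 on port 2181 (placeholder hostnames), ZooKeeper's four-letter-word commands can tell you whether each member still sees a quorum:

```
# "Mode: leader" / "Mode: follower" means the node is part of a working quorum;
# "This ZooKeeper instance is not currently serving requests" means it is not.
# ( on ZooKeeper 3.5+ the srvr command must be whitelisted via 4lw.commands.whitelist )
echo srvr | nc zk1 2181
echo srvr | nc zk2 2181
echo srvr | nc zk3 2181
```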
This failure, even if it is resolved later, can corrupt the ZooKeeper file system.
Some of the SolrCloud collections may remain in an inconsistent state.
It may be necessary to manually delete corrupted znodes from ZooKeeper:
Let’s start from the leader election znodes of the shard: /collections/&lt;collection&gt;/leader_elect/shard&lt;N&gt;/election
A healthy SolrCloud cluster presents as many core_nodeX entries here as the total number of replicas for the shard.
You don’t want duplicates or missing nodes here.
If you’re having trouble getting a sane election, you can try deleting the lowest-numbered entries (as well as any lower-numbered duplicates) and force the election again, possibly followed by restarting the node with that lowest-numbered entry.
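A minimal sketch of that inspection with ZooKeeper's own zkCli.sh, assuming a collection called my_collection and shard1 (the znode names shown are purely illustrative):

```
# Open a CLI session against the ensemble ( host is a placeholder )
./bin/zkCli.sh -server zk1:2181

# Inside the zkCli prompt: list the election entries, one per replica is expected,
# named roughly like <zk_session_id>-core_nodeX-n_<sequence_number>
ls /collections/my_collection/leader_elect/shard1/election

# If a core_node appears more than once, delete the lowest-numbered duplicate
# ( the znode name here is illustrative, copy the real one from the ls output )
delete /collections/my_collection/leader_elect/shard1/election/98765432101234567-core_node1-n_0000000120
```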
Then check /collections/&lt;collection&gt;/leaders/shard&lt;N&gt;: make sure that this folder exists and that its leader znode points to the expected replica.
The /collections/&lt;collection&gt;/leader_initiated_recovery folder can be informative too: it lists the replicas that the *leader* thinks are out of sync, usually due to a failed update request.
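Still from the same zkCli.sh session, with the same placeholder collection and shard names:

```
# The leader folder should exist...
ls /collections/my_collection/leaders/shard1
# ...and ( on recent Solr versions ) its leader child znode should point to the replica you expect
get /collections/my_collection/leaders/shard1/leader

# Replicas that the leader has put into "leader initiated recovery", if any
ls /collections/my_collection/leader_initiated_recovery/shard1
```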
After having completed the verifications above, there are a couple of Collections API endpoints that may be useful:
- Force Leader Election ( FORCELEADER )
- Force Leader Rebalance ( REBALANCELEADERS )
N.B. rebalancing the leaders will affect all the shards of the collection.
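For reference, a sketch of the two calls, again with placeholder host, collection and shard names:

```
# Force a new leader election for the single shard that is stuck without a leader
curl "http://localhost:8983/solr/admin/collections?action=FORCELEADER&collection=my_collection&shard=shard1"

# Rebalance leadership across the whole collection ( affects every shard,
# and relies on the preferredLeader replica property )
curl "http://localhost:8983/solr/admin/collections?action=REBALANCELEADERS&collection=my_collection"
```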