At the time of writing (Solr 7.3.0), SolrCloud is a reliable and stable distributed architecture for Apache Solr.
But it is not perfect and failures happen.
This lightning blog post presents some practical tips to follow when a specific shard of a collection is down with no leader and the situation appears stuck.
The problem has been experienced with the following Solr versions:

  • 4.10.2
  • 5.4.0

Steps to solve the problem may involve manual interaction with the Zookeeper Ensemble[1].
The following steps are extracted from an interesting thread of the Solr User mailing list[2] and from practical experience in the field.
In particular, thanks to Jeff Wartes for the suggestions, which proved useful for me on a couple of occasions.

Problem

  • All nodes for a Shard in a Collection are up and running
  • There is no leader for the shard
  • All the nodes are in a “Recovering” / “Recovery Failed” state
  • Search is down and the situation persists after many minutes (> 5); see the quick check below
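
The CLUSTERSTATUS action of the Collection API reports the state of every replica and which one (if any) is currently the leader. A minimal check, assuming Solr listening on localhost:8983 and a collection named myCollection (both hypothetical, adjust to your cluster):

curl "http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&collection=myCollection&wt=json"

A healthy shard shows one replica with "leader":"true" and all replicas with "state":"active"; in the scenario above the leader flag is missing and the replicas report "recovering" or "recovery_failed".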

Solution

A possible explanation for this problem is that the node-local version of the cluster state has diverged from the centralized cluster state stored in Zookeeper.
One possible cause for the leader election to break is a Zookeeper failure: for example, losing >= 50% of the ensemble nodes, or losing connectivity among the ensemble nodes for a certain period of time (this is the scenario I experienced directly).
This failure, even if resolved later, can leave the Zookeeper file system in a corrupted state.
Some of the SolrCloud collections may remain in an inconsistent state.

It may be necessary to manually delete corrupted files from Zookeeper.
Let's start with:

collections/<collection>/leader_elect/shard<x>/election
A healthy SolrCloud cluster presents as many core_nodeX entries as the total number of replicas for the shard.
You don’t want duplicates or missing nodes here.
If you’re having trouble getting a sane election, you can try deleting the lowest-numbered entries (as well as any lower-numbered duplicates) and then force the election again, possibly followed by restarting the node with that lowest-numbered entry.
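
As a sketch of this manual cleanup, assuming a Zookeeper node reachable on localhost:2181, a collection named myCollection and shard1 as the broken shard (all hypothetical names), the inspection can be done with the standard Zookeeper command line client:

# connect to the ensemble
bin/zkCli.sh -server localhost:2181

# list the election entries for the shard: expect one per replica, no duplicates
ls /collections/myCollection/leader_elect/shard1/election

# delete the lowest-numbered entry (and any lower-numbered duplicate);
# the znode name below is only an example of the <session>-<core_node>-n_<sequence> format
delete /collections/myCollection/leader_elect/shard1/election/98589840915396622-core_node1-n_0000000000

After the cleanup, the FORCELEADER call shown below can be used to trigger the election again.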

collections/<collection>/leaders/shard<x>
Make sure that this folder exists and references the expected replica as the leader.
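
Still from the Zookeeper client, a quick verification could look like this (same hypothetical names as above):

get /collections/myCollection/leaders/shard1

The data of that node should reference the core and node_name of the replica you expect to be the leader.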

collections/<collection>/leader_initiated_recovery
This folder can be informative too: it lists the replicas that the *leader* thinks are out of sync, usually due to a failed update request.
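
Again from the Zookeeper client (hypothetical names):

ls /collections/myCollection/leader_initiated_recovery/shard1
get /collections/myCollection/leader_initiated_recovery/shard1/core_node2

Each child of the shard folder is a replica the leader has marked for recovery, and its data contains the recovery state (core_node2 here is just an example).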

After completing the verification above, there are a couple of Collection API[3] endpoints that may be useful:

Force Leader Election
/admin/collections?action=FORCELEADER&collection=<collectionName>&shard=<shardName>
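
For example, with curl and the same hypothetical host and names used above:

curl "http://localhost:8983/solr/admin/collections?action=FORCELEADER&collection=myCollection&shard=shard1"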

Force Leader Rebalance
/admin/collections?action=REBALANCELEADERS&collection=<collectionName>
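
For example:

curl "http://localhost:8983/solr/admin/collections?action=REBALANCELEADERS&collection=myCollection"

Note that REBALANCELEADERS reassigns leadership according to the preferredLeader replica property, so it only has an effect on replicas where that property has been set (e.g. via ADDREPLICAPROP).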

N.B. rebalancing the leaders will affect all the shards of the collection

 

[1] Apache Zookeeper Solr CLI

[2] Solr Mailing List Thread

[3] Solr Collection API

 
