Apache Solr Tips And Tricks
SolrCloud Leader Election Failing

At the time we speak ( Solr 7.3.0 ) SolrCloud is a reliable and stable distributed architecture for Apache Solr.
But it is not perfect and failures happen.
This lightning blog post will present some practical tips to follow when a specific shard of a collection is down with no leader and the situation is stuck.
The following problem has been experienced with the following Solr versions :

    • 4.10.2
    • 5.4.0

Steps to solve the problem may involve manual interaction with the Zookeeper Ensemble [1].
The following steps are extracted from an interesting thread of the Solr User mailing list [2] and practical experience in the field.
In particular, thanks to Jeff Wartes for the suggestions, which proved useful for me on a couple of occasions.

Problem

    • All nodes for a Shard in a Collection are up and running
    • There is no leader for the shard
    • All the nodes are in a “Recovering” / “Recovery Failed” state
    • Search is down and the situation persist after many minutes (> 5)

Solution

A possible explanation for this problem to occur is when the node-local version of the Zookeeper clusterstate has diverged from the centralized Zookeeper cluster state.
One possible cause for the leader election to break is a Zookeeper failure : for example you lose >=50% of the ensemble nodes or the connectivity among the ensemble nodes for a certain period of time ( this is the scenario I experimented directly)
This failure, even if resolved later, can bring a corruption to the Zookeeper file system.
Some of the SolrCloud collections may remain in a not consistent status.

It may be necessary to manually delete corrupted files from Zookeeper :
Let’s start from :

collections/<collection>/leader_elect/shard<x>/election
A healthy SolrCloud cluster presents as many core_nodeX as the total replicas for the shard.
You don’t want duplicates or missing nodes here.
If you’re having trouble getting a sane election, you can try deleting the lowest-numbered entries (as well as any lower-numbered duplicates) and try to foce the election again. Possibly followed by restarting the node with that lowest-numbered entry.

collections/<collection>/leader/shard<x>
Make sure that this folder exists and has the expected replica as a leader.

collections/<collection>/leader_initiated_recovery
This folder can be informative too, this represents replicas that the *leader* thinks are out of sync, usually due to a failed update request.

After having completed the verification above, there a couple of Collection API endpoints that may be useful :

Force Leader Election
/admin/collections?action=FORCELEADER&collection=<collectionName>&shard=<shardName>

Force Leader Rebalance
/admin/collections?action=REBALANCELEADERS&collection=collectionName

N.B. rebalancing all the leaders will affect all the shards

// our service

Shameless plug for our training and services!

Did I mention we do Apache Solr Beginner and Elasticsearhc Beginner training?
We also provide consulting on these topics, get in touch if you want to bring your search engine to the next level!

// STAY ALWAYS UP TO DATE

Subscribe to our newsletter

Did you like this post about SolrCloud Leader Election Failing? Don’t forget to subscribe to our Newsletter to stay always updated from the Information Retrieval world!

Author

Alessandro Benedetti

Alessandro Benedetti is the founder of Sease Ltd. Senior Search Software Engineer, his focus is on R&D in information retrieval, information extraction, natural language processing, and machine learning.

Leave a comment

Your email address will not be published.

This site uses Akismet to reduce spam. Learn how your comment data is processed.