
SolrCloud Leader Election Failing

At the time of writing (Solr 7.3.0), SolrCloud is a reliable and stable distributed architecture for Apache Solr.
But it is not perfect, and failures happen.
This lightning blog post presents some practical tips to follow when a specific shard of a collection is down with no leader and the situation is stuck.
The problem has been experienced with the following Solr versions:

- 4.10.2
- 5.4.0

Steps to solve the problem may involve manual interaction with the Zookeeper Ensemble [1].
The following steps are extracted from an interesting thread of the Solr User mailing list [2] and practical experience in the field.
In particular, thanks to Jeff Wartes for the suggestions, which proved useful for me on a couple of occasions.

The Problem

- All nodes for a Shard in a Collection are up and running
- There is no leader for the shard
- All the nodes are in a “Recovering” / “Recovery Failed” state
- Search is down and the situation persists after many minutes (> 5)
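
To spot this situation from a script, one option is the Collections API CLUSTERSTATUS action. The following is a minimal Python sketch, not a definitive tool: it assumes the JSON layout returned by CLUSTERSTATUS (cluster > collections > shards > replicas, with a "leader" flag and a "state" for each replica), and the function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def shards_without_leader(cluster_status, collection):
    """Return {shard: {replica: state}} for every shard with no active leader."""
    shards = cluster_status["cluster"]["collections"][collection]["shards"]
    stuck = {}
    for shard_name, shard in shards.items():
        replicas = shard.get("replicas", {})
        # Assumption: the leader replica carries "leader": "true" in the response.
        has_leader = any(
            replica.get("leader") == "true" and replica.get("state") == "active"
            for replica in replicas.values()
        )
        if not has_leader:
            stuck[shard_name] = {
                name: replica.get("state") for name, replica in replicas.items()
            }
    return stuck


def fetch_cluster_status(solr_base_url, collection):
    """Call CLUSTERSTATUS for one collection and return the parsed JSON."""
    query = urlencode(
        {"action": "CLUSTERSTATUS", "collection": collection, "wt": "json"}
    )
    with urlopen(f"{solr_base_url}/admin/collections?{query}") as response:
        return json.load(response)
```

For example, passing the result of fetch_cluster_status("http://localhost:8983/solr", "myCollection") to shards_without_leader would list the stuck shards with the state of each replica; the base URL and the collection name here are placeholders.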

The Solution

A possible explanation for this problem is that the node-local version of the Zookeeper cluster state has diverged from the centralized Zookeeper cluster state.
One possible cause for the leader election to break is a Zookeeper failure: for example, you lose >=50% of the ensemble nodes, or the connectivity among the ensemble nodes, for a certain period of time (this is the scenario I experienced directly).
This failure, even if resolved later, can leave corrupted data in the Zookeeper file system.
Some of the SolrCloud collections may remain in an inconsistent state.
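
The ">=50%" threshold follows from the fact that a Zookeeper ensemble needs a strict majority of its nodes to keep working. As a small illustration (the function name is hypothetical), the check can be sketched in Python as:

```python
def has_quorum(ensemble_size, nodes_up):
    """True when the reachable nodes form a strict majority of the ensemble."""
    return nodes_up > ensemble_size // 2
```

With a 3-node ensemble, losing one node keeps the quorum (2 of 3), while losing two breaks it. With a 4-node ensemble, losing two nodes (exactly 50%) is already enough to break it.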

It may be necessary to manually delete corrupted files from Zookeeper.
Let’s start from:

collections/<collection>/leader_elect/shard<x>/election
A healthy SolrCloud cluster presents as many core_nodeX entries as the total number of replicas for the shard.
You don’t want duplicates or missing nodes here.
If you’re having trouble getting a sane election, you can try deleting the lowest-numbered entries (as well as any lower-numbered duplicates) and try to force the election again, possibly followed by restarting the node with that lowest-numbered entry.
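
The check on the election folder can be sketched in Python. This is only an illustration under an assumption: that each child of the election folder is named <zkSessionId>-<coreNodeName>-n_<sequence>. Verify this against your own Zookeeper tree before relying on it. The function name is hypothetical, and the list of children is expected to come from whatever Zookeeper client you use (for example the output of ls in zkCli.sh).

```python
import re
from collections import defaultdict

# Assumed child name layout: <zkSessionId>-<coreNodeName>-n_<sequence>
ELECTION_NODE = re.compile(r"^(?P<session>\d+)-(?P<core_node>.+)-n_(?P<seq>\d+)$")


def summarize_election(children, expected_core_nodes):
    """Report duplicated, missing and lowest-numbered election entries."""
    by_core_node = defaultdict(list)
    for name in children:
        match = ELECTION_NODE.match(name)
        if match is None:
            continue  # name does not follow the assumed layout, skip it
        by_core_node[match["core_node"]].append((int(match["seq"]), name))

    # For a core node registered more than once, every entry except the
    # highest-numbered one is a lower-numbered duplicate.
    duplicates = {}
    for core_node, entries in by_core_node.items():
        if len(entries) > 1:
            entries.sort()
            duplicates[core_node] = [name for _, name in entries[:-1]]

    missing = sorted(set(expected_core_nodes) - set(by_core_node))
    all_entries = sorted(e for entries in by_core_node.values() for e in entries)
    lowest = all_entries[0][1] if all_entries else None
    return {"duplicates": duplicates, "missing": missing, "lowest": lowest}
```

The function only reports candidates. Deleting anything from Zookeeper remains a manual decision.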

collections/<collection>/leaders/shard<x>
Make sure that this folder exists and has the expected replica as a leader.

collections/<collection>/leader_initiated_recovery
This folder can be informative too: it represents replicas that the *leader* thinks are out of sync, usually due to a failed update request.

After having completed the verification above, there are a couple of Collections API endpoints that may be useful:

Force Leader Election
/admin/collections?action=FORCELEADER&collection=<collectionName>&shard=<shardName>

Force Leader Rebalance
/admin/collections?action=REBALANCELEADERS&collection=<collectionName>

N.B. Rebalancing the leaders will affect all the shards of the collection.
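
As a convenience, the two requests can be assembled with a few lines of Python. This is a minimal sketch: the function names, the base URL and the collection and shard names are placeholders. Only the action, collection and shard parameters come from the endpoints above.

```python
from urllib.parse import urlencode


def force_leader_url(solr_base_url, collection, shard):
    """Build the FORCELEADER request for a single shard."""
    query = urlencode(
        {"action": "FORCELEADER", "collection": collection, "shard": shard}
    )
    return f"{solr_base_url}/admin/collections?{query}"


def rebalance_leaders_url(solr_base_url, collection):
    """Build the REBALANCELEADERS request for a whole collection."""
    query = urlencode({"action": "REBALANCELEADERS", "collection": collection})
    return f"{solr_base_url}/admin/collections?{query}"
```

The resulting URL can then be opened with curl or urllib.request.urlopen. Building it through urlencode keeps collection or shard names with unusual characters correctly escaped.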

Alessandro Benedetti
