SolrCloud exceptions with Apache Zookeeper

At the time of writing (Solr 7.3.1), SolrCloud is a reliable and stable distributed architecture for Apache Solr.
But it is not perfect, and failures do happen.

Apache Zookeeper [1] is the system responsible for coordinating the nodes of the SolrCloud cluster.
It stores the shared collection configurations and holds the current view of the cluster status.
It is part of the brain of the cluster, a keeper that keeps the cluster healthy and functional.

It can answer questions such as:

• Who is the leader of this shard and collection?
• Is this node down?
• Is this node recovering?

The Solr nodes communicate with Zookeeper to find out which nodes to contact when running SolrCloud operations.

This lightning blog post presents some practical tips to follow when your client application encounters some classic exceptions involving SolrCloud and Apache Zookeeper.
Special thanks to the Apache Solr user mailing list contributors and the Apache Solr community: this post is an aggregation of recommendations from there and from the official code and documentation.

org.apache.solr.common.SolrException:
Could not load collection from ZK:

If you landed here with just that exception, I assume the stack trace also contains a cause such as:
“Caused by: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /collections/<collection name>/state.json”

Solr’s zkClientTimeout is used to set ZooKeeper’s sessionTimeout, and that is the value exceeded when a session expires.
When this kind of exception happens, something has gone VERY wrong in the Solr-Zookeeper communication: 30 seconds (the current default [2]) is a REALLY long time for applications that are trying to communicate.

Recommendation: review the various timeouts involved, and don’t set them too low!
For example, assign zkClientTimeout a value >= 30 seconds.

maxSessionTimeout (Zookeeper)
New in 3.3.0: the maximum session timeout in milliseconds that the server will allow the client to negotiate. Defaults to 20 times the tickTime.

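zkClientTimeout (Solr)
The session timeout, in milliseconds, that Solr requests when it connects to Zookeeper. Zookeeper will not grant more than maxSessionTimeout, so raising zkClientTimeout beyond that limit has no effect unless maxSessionTimeout is raised too.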
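
To make the interplay between the two settings concrete, here is a minimal Python sketch of how the effective session timeout is negotiated. It assumes the behaviour described in the Zookeeper administrator's guide: the server clamps the timeout requested by the client between minSessionTimeout (default 2 times the tickTime) and maxSessionTimeout (default 20 times the tickTime). The function name and the tickTime of 2000 ms (the value used in the sample zoo.cfg) are illustrative assumptions, not part of Solr or Zookeeper.

```python
def negotiated_session_timeout(requested_ms, tick_time_ms=2000,
                               min_session_timeout_ms=None,
                               max_session_timeout_ms=None):
    """Return the session timeout (ms) the Zookeeper server would grant.

    Hypothetical helper: it only mirrors the documented clamping rule.
    """
    # Documented defaults: 2 x tickTime and 20 x tickTime.
    if min_session_timeout_ms is None:
        min_session_timeout_ms = 2 * tick_time_ms
    if max_session_timeout_ms is None:
        max_session_timeout_ms = 20 * tick_time_ms
    # Clamp the client's request into [min, max].
    return max(min_session_timeout_ms,
               min(requested_ms, max_session_timeout_ms))


# zkClientTimeout=30000 fits within the default 40000 ms ceiling.
print(negotiated_session_timeout(30000))  # 30000
# zkClientTimeout=60000 is silently reduced to the ceiling.
print(negotiated_session_timeout(60000))  # 40000
```

In other words, with a tickTime of 2000 ms a zkClientTimeout above 40 seconds would probably not take effect until maxSessionTimeout is raised on the Zookeeper side as well.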

Once you have checked the timeouts, let’s explore some possible root causes.
A session expiry can be caused by:

1. Garbage collection on the Solr node or on Zookeeper – extreme GC pauses can happen when the heap is too small or VERY large.
2. Slow disk I/O.
3. Network latency.

Recommendations

• Set up a JVM profiler to closely monitor your Solr and Zookeeper nodes, paying particular attention to the garbage collection cycles and to memory usage in general: you don’t want Zookeeper to swap too much! (GCViewer [3] could be a nice tool for this.)
• Verify that the Zookeeper node has fast write access to the disk: Zookeeper needs fast writes and, ideally, a separate dedicated disk.
• Monitor your network and make sure the Solr nodes can talk effectively to the Zookeeper nodes.
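
As a rough way to check the network path from a Solr host to a Zookeeper node, the sketch below sends Zookeeper's "ruok" four-letter command and measures how long the "imok" reply takes. This is only an illustration: the function name, the default port 2181 and the 5 second socket timeout are assumptions, and newer Zookeeper releases restrict four-letter commands through the 4lw.commands.whitelist setting, so "ruok" may need to be enabled there first.

```python
import socket
import time


def zk_ruok(host, port=2181, timeout=5.0):
    """Send 'ruok' to a Zookeeper node.

    Returns (is_ok, elapsed_ms): is_ok is True when the node answered
    'imok', elapsed_ms is the connect + round-trip time in milliseconds.
    Hypothetical helper, not part of Solr or Zookeeper.
    """
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"ruok")
        reply = b""
        # The expected answer is exactly four bytes: b"imok".
        while len(reply) < 4:
            chunk = sock.recv(4 - len(reply))
            if not chunk:  # connection closed before a full reply
                break
            reply += chunk
    elapsed_ms = (time.monotonic() - start) * 1000.0
    return reply == b"imok", elapsed_ms
```

Running something like zk_ruok("zk1.example.com") periodically from each Solr node (the host name is a placeholder) and logging the elapsed time gives a crude latency baseline; a connection error or a reply time that approaches the session timeout would be worth investigating.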

If these suggestions do not solve your problem, you may be hitting a Solr bug.
One of them is [4], which unfortunately had not been fixed at the time of writing.

org.apache.solr.client.solrj.SolrServerException:
No live SolrServers available to handle this

From the official JavaDoc (org/apache/solr/client/solrj/impl/LBHttpSolrClient.java:369):

“Tries to query a live server from the list provided in Req. Servers in the dead pool are skipped.
If a request fails due to an IOException, the server is moved to the dead pool for a certain period of time, or until a test request on that server succeeds.

Servers are queried in the exact order given (except servers currently in the dead pool are skipped).
If no live servers from the provided list remain to be tried, a number of previously skipped dead servers will be tried.
Req.getNumDeadServersToTry() controls how many dead servers will be tried.

If no live servers are found a SolrServerException is thrown.”

What was the status of the cluster at the moment the exception happened?
Was any Solr server up and running according to Zookeeper?

The recommendation is to check the cluster state (the per-collection state.json, or the legacy clusterstate.json) at the moment the exception happens.
From the Solr admin UI you can open Cloud->Tree and verify which nodes are up and running.
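
The same check can be scripted once you have a copy of the collection's state.json and the list of entries under /live_nodes (for example, copied from the Cloud->Tree view). The sketch below assumes the usual state.json layout (collection, then "shards", then "replicas", each replica carrying "node_name" and "state"); the function name and the sample data are made up for illustration.

```python
import json


def unhealthy_replicas(state_json_text, live_nodes):
    """List replicas that are not 'active' or whose node is not live.

    Returns tuples of (collection, shard, replica, node_name, state).
    Hypothetical helper based on the assumed state.json layout.
    """
    state = json.loads(state_json_text)
    live = set(live_nodes)
    problems = []
    for collection, coll_state in state.items():
        for shard, shard_state in coll_state.get("shards", {}).items():
            for replica, info in shard_state.get("replicas", {}).items():
                node = info.get("node_name")
                replica_state = info.get("state")
                # A replica may still be marked active if its node died
                # abruptly, so the node is checked against live_nodes too.
                if replica_state != "active" or node not in live:
                    problems.append(
                        (collection, shard, replica, node, replica_state))
    return problems


# Made-up sample: one healthy replica, one replica that is down.
sample = """
{"books": {"shards": {"shard1": {"state": "active", "replicas": {
  "core_node1": {"node_name": "host1:8983_solr", "state": "active", "leader": "true"},
  "core_node2": {"node_name": "host2:8983_solr", "state": "down"}}}}}}
"""
print(unhealthy_replicas(sample, ["host1:8983_solr"]))
# [('books', 'shard1', 'core_node2', 'host2:8983_solr', 'down')]
```

If every replica of a shard shows up in this list, the client has no live server to send the request to, which is consistent with the exception above.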

It may well be related to a node failure (which can have any cause, including GC).
I’ve seen situations where it was caused by a specific query, and the real exception was hidden behind a “No live SolrServers…” client exception.
The Solr logs should help identify the underlying Solr problem, and JVM monitoring can rule out memory/GC problems.
Some people saw this with wildcard queries: every shard reported a “too many expansions…” type error, but the exception in the client response was “No live SolrServers…”.

org.apache.solr.common.SolrException:
Could not find a healthy node to handle the request

Pretty much the same considerations apply as for “No live SolrServers”.
This happens when the SolrJ-side load balancing is unable to find a live node in the cluster (based on the Zookeeper state).
It is thrown earlier than the previous exception, so the request doesn’t even reach the LBHttpSolrClient.

Alessandro Benedetti
