Apache Solr Distributed Facets

The Apache Solr distributed faceting feature was introduced back in 2008, with one of the first versions of Solr (1.3, according to this Jira issue[1]).
Until now, I always assumed it just worked, without diving too much into the details.
Nowadays distributed search and faceting are extremely popular; you can find them pretty much everywhere, in the legacy or SolrCloud form alike.
N.B. Although the mechanics are pretty much the same, JSON faceting revisits this approach with some changes, so we will focus here on legacy field faceting.

I think it’s time to get a better understanding of how it works:

Multiple Shard Requests

When dealing with distributed search and distributed aggregation calculations, you are going to see multiple requests going back and forth across the shards.
Each round of requests has a different focus and is meant to retrieve the different bits of information necessary to build the final response.
We are going to explore these rounds of requests, focusing only on the faceting side.
N.B. Some of these requests also carry results for the distributed search calculation; this is done to minimise the network traffic.

For the sake of this blog post, let’s simulate a simple sharded index, with whitespace tokenization on field1 and facet.field=field1:

Shard 1                                 Shard 2
Doc0 { "id":"1", "field1":"a b" }       Doc3 { "id":"4", "field1":"b c" }
Doc1 { "id":"2", "field1":"a" }         Doc4 { "id":"5", "field1":"b c" }
Doc2 { "id":"3", "field1":"b c" }       Doc5 { "id":"6", "field1":"c" }

Global Facets : b(4), c(4), a(2)

Shard 1 Local Facets : a(2), b(2), c(1)

Shard 2 Local Facets : c(3), b(2)
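As a side note, you can verify the local facets of a single shard by querying one of its cores with distrib=false; a minimal sketch, assuming a local instance and a core named collection1_shard1_replica1 (both illustrative):

curl 'http://localhost:8983/solr/collection1_shard1_replica1/select?q=*:*&rows=0&facet=true&facet.field=field1&distrib=false'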

Collection of Candidate Facet Field Values

The first round of requests is sent to each shard to identify the candidate top K global facet values.
To achieve this, each shard is asked to respond with its local top K+J facet values and counts.
The reason we ask each shard for more than K facet values is to get better term coverage, avoiding the loss of relevant facet values and minimising the refinement requests.
How many more we request from each shard is regulated by the “overrequest” facet parameters, a factor that gives more accurate facets at the cost of additional computation[2].
Let’s assume we configure facet.limit=2&facet.overrequest.count=0&facet.overrequest.ratio=1 to explain when refinement happens and how it works.
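For instance, the client request that triggers the whole process could look like the following sketch (assuming a local instance and a collection named collection1, both illustrative; only the facet parameters matter here):

curl 'http://localhost:8983/solr/collection1/select?q=*:*&rows=0&facet=true&facet.field=field1&facet.limit=2&facet.overrequest.count=0&facet.overrequest.ratio=1'

With this configuration each shard is asked for its local top facet.limit * facet.overrequest.ratio + facet.overrequest.count = 2 * 1 + 0 = 2 values: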

Shard 1 Returned Facets : a(2), b(2)

Shard 2 Returned Facets : c(3), b(2)

Global Merge of Collected Counts

The facet value counts collected from each shard are merged and the global top K most frequent values are calculated.
These facet field values are the first candidates to be the final ones.
In addition, other candidates are extracted from the terms below the top K, based on which shards did not return statistics for those values.
At this point we have a candidate set of values and we are ready to refine their counts where necessary, asking this information back from the shards that did not include it in the first round.
This happens by including the following facet parameter in the refinement requests:

facet.field={!terms=$<field>__terms}<field>&<field>__terms=<values>
e.g.
facet.field={!terms=$field1__terms}field1&field1__terms=term1,term2

N.B. This request specifically asks a Solr instance to return the facet counts just for the terms specified[3].

Top 2 candidates = b(4), c(3)
Additional candidates = a(2)

The reason a(2) is added to the potential candidates is that Shard 2 didn’t answer with a count for a: a potential missing count of just 1 would bring a to 3, tying with c, and could therefore change the top K. So it is worth a verification.

Shard 1 didn’t return any value for the candidate c facet.
So the following request is built and sent to it:
facet.field={!terms=$field1__terms}field1&field1__terms=c

Shard 2 didn’t return any value for the candidate a facet.
So the following request is built and sent to it:
facet.field={!terms=$field1__terms}field1&field1__terms=a
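On the wire, the refinement request sent to Shard 2 could look something like this sketch (host and core names are illustrative; distrib=false keeps the computation local to the shard, and curl needs -g to pass the curly braces of the local params untouched):

curl -g 'http://shard2host:8983/solr/collection1_shard2_replica1/select?q=*:*&rows=0&distrib=false&facet=true&facet.field={!terms=$field1__terms}field1&field1__terms=a'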

Final Counts Refinement

The refinement counts returned by each shard are used to finalise the global candidate facet value counts and to identify the final top K to be returned by the distributed request.
We are finally done!

Shard 1 Refinements Facets : c(1)

Shard 2 Refinements Facets : a(0)

Top K candidates updated: b(4), c(4), a(2)

Given facet.limit=2, the final global facets returned, with correct counts, are:
b(4), c(4)
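In the response to the client these counts appear in the standard facet_counts section of the Solr JSON response, as alternating value/count pairs; with our example data, something like:

"facet_counts": {
  "facet_fields": {
    "field1": [ "b", 4, "c", 4 ]
  }
}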

 

[1] https://issues.apache.org/jira/browse/SOLR-303

[2] https://lucene.apache.org/solr/guide/6_6/faceting.html#Faceting-Over-RequestParameters

[3] https://lucene.apache.org/solr/guide/7_5/faceting.html#limiting-facet-with-certain-terms

SolrCloud Leader Election Failing

At the time of writing (Solr 7.3.0), SolrCloud is a reliable and stable distributed architecture for Apache Solr.
But it is not perfect and failures happen.
This lightning blog post will present some practical tips to follow when a specific shard of a collection is down with no leader and the situation is stuck.
This problem has been experienced with the following Solr versions:

  • 4.10.2
  • 5.4.0

Steps to solve the problem may involve manual interaction with the Zookeeper Ensemble[1].
The following steps are extracted from an interesting thread on the Solr User mailing list[2] and from practical experience in the field.
In particular, thanks to Jeff Wartes for the suggestions, which proved useful to me on a couple of occasions.

Problem

  • All nodes for a Shard in a Collection are up and running
  • There is no leader for the shard
  • All the nodes are in a “Recovering” / “Recovery Failed” state
  • Search is down and the situation persists after many minutes (> 5)

Solution

A possible explanation for this problem is that the node-local version of the Zookeeper cluster state has diverged from the centralized Zookeeper cluster state.
One possible cause for the leader election to break is a Zookeeper failure: for example, you lose >= 50% of the ensemble nodes, or the connectivity among the ensemble nodes, for a certain period of time (this is the scenario I experienced directly).
Such a failure, even if resolved later, can leave the Zookeeper file system corrupted.
Some of the SolrCloud collections may remain in an inconsistent state.

It may be necessary to manually delete corrupted files from Zookeeper.
Let’s start from:

collections/<collection>/leader_elect/shard<x>/election
A healthy SolrCloud cluster presents as many core_nodeX entries as there are replicas for the shard.
You don’t want duplicates or missing nodes here.
If you’re having trouble getting a sane election, you can try deleting the lowest-numbered entries (as well as any lower-numbered duplicates) and force the election again, possibly followed by restarting the node with that lowest-numbered entry.
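For example, using the Zookeeper command line client (collection, shard and entry names are purely illustrative):

./zkCli.sh -server localhost:2181
# list the election entries for the problematic shard
ls /collections/myCollection/leader_elect/shard1/election
# delete the suspicious lowest-numbered entry (and any lower-numbered duplicate)
delete /collections/myCollection/leader_elect/shard1/election/<lowest-numbered-entry>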

collections/<collection>/leader/shard<x>
Make sure that this folder exists and has the expected replica as a leader.
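Again from the Zookeeper command line client (names are illustrative):

# the znode content should identify the expected leader replica
get /collections/myCollection/leader/shard1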

collections/<collection>/leader_initiated_recovery
This folder can be informative too; it represents replicas that the *leader* thinks are out of sync, usually due to a failed update request.

After having completed the verifications above, there are a couple of Collections API[3] endpoints that may be useful:

Force Leader Election
/admin/collections?action=FORCELEADER&collection=<collectionName>&shard=<shardName>

Force Leader Rebalance
/admin/collections?action=REBALANCELEADERS&collection=<collectionName>

N.B. Rebalancing the leaders will affect all the shards of the collection.
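For example, to force the leader election for a single problematic shard (host, collection and shard names are illustrative):

curl 'http://localhost:8983/solr/admin/collections?action=FORCELEADER&collection=myCollection&shard=shard1'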

 

[1] Apache Zookeeper Solr Cli

[2] Solr Mailing List Thread

[3] Solr Collection API