Distributed search is the foundation for Apache Solr Scalability :
It’s possible to distributed search across different Apache Solr nodes of the same collection ( both in a legacy or SolrCloud architecture), but it is also possible to distribute search across different collections in a SolrCloud cluster.
Aggregating results from different collections may be useful when you put in place different systems ( that were meant to be separate ) and you later realize that aggregating the results may be an additional useful use case.
This blog will focus on some tricky situations that can happen when running distributed search ( for configuration or details you can refer to the Solr wiki ).
Inverse Document Frequency affects the score.
This means that a document coming from a big collection can obtain a boost from IDF, in comparison to a similar document from a smaller collection.
This is because the maxDoc count is taken into account as corpus size, so even if a term has the same document frequency, IDF will be strongly affected by the collection size.
Distributed IDF partially solved the problem :
When distributing the search across different shards of the same collection, it works quite well.
But using the ExactStatCache and alternating single collection distribution and multi collection distribution in the same SolrCloud cluster will create some caching conflict.
Specifically if we first execute the inter collection query, the global stats cached will be the inter collection global stats, so if we then execute a single collection distributed search, the preview global stats will remain cached ( viceversa applies).
Real score and debug score is not aligned with the distributed IDF, this means that the debug query will not show the correct distributed IDF and correct scoring calculus for distributed searches.
Lucene/Solr score is not probabilistic or normalised.
For the same collection we can have completely different score scales just with different queries.
The situation becomes more complicated when we tune our relevancy adding multiplicative or additive boosts.
Different collections may imply completely different boosting logic that could cause the score of a collection to be on a completely different scale in comparison to another.
We need to be extra careful when tuning relevancy for searches across different collections and try to configure the distributed request handler in the most compatible way as possible.
It is important to carefully specify the request handler to be used when using distributed search.
The request will hit one collection in one node and then when distributing the same request handler will be called on the other collections across the other nodes.
If necessary it is possible to configure the aggregator request handler and local request handlers ( this may be useful if we want to use a different scoring formulas per collection, using local parameters) :
Aggregator Request Handler
It is executed on the first node receiving the request.
It will distribute the request and then aggregate the results.
It must describe parameters that are in common across all the collections involved in the search.
It is the one specified in the main request.
Local Request Handler
It is specified passing the parameter : shards.qt=
It is execute on each node that receive the distributed query AFTER the first one.
This can be used to use specific fields or parameter on a per collection basis.
A local request handler may use fields and search components that are local to the collection interested.
N.B. the use of local request handler may be useful in case you want to define local query parser rules, such as local edismax configuration to affect the score.
The unique key field must be the same across the different collections.
Furthermore the value should be unique across the different collections to guarantee a proper behaviour.
If we don’t comply with this rule, Solr will fail in aggregating the results and raise an exception.
Alessandro Benedetti is the founder of Sease Ltd.
Senior Search Software Engineer, his focus is on R&D in information retrieval, information extraction, natural language processing, and machine learning.