London Information Retrieval Meetup

The London Information Retrieval Meetup is approaching (19/02/2019) and we are excited to add more details about the speakers and talks!
The event is free and you are invited to register :

After Sambhav Kothari, software engineer at Bloomberg and Elia Porciani, R&D software engineer at Sease, our last speaker is Andrea Gazzarini, founder and software engineer at Sease :

Andrea Gazzarini

Andrea Gazzarini is a curious software engineer, mainly focused on the Java language and Search technologies.
With more than 15 years of experience in various software engineering areas, his adventure with the search domain began in 2010, when he met Apache Solr and later Elasticsearch… and it was love at first sight. 
Since then, he has been involved in many projects across different fields (bibliographic, e-government, e-commerce, geospatial).

In 2015 he wrote “Apache Solr Essentials”, a book about Solr, published by Packt Publishing.
He’s an opensource lover; he’s currently involved in several (too many!) projects, always thinking about a “big” idea that will change his (developer) life.

Introduction to Music Information Retrieval

Music Information Retrieval is about retrieving information from music entities.
This high-level definition relates to a complex discipline with many real-world applications.     
Being a former bass player, Andrea will describe a high-level overview about Music Information Retrieval and it will analyse from a musician perspective a set of challenges that the topic offers.
We will introduce the basic concepts of the music language, then passing through different kind of music representations we will end up describing some useful low level features that are used when dealing with music entities. 

Elia Porciani

Elia is a Software Engineer passionate about algorithms and data structures concerning search engines and efficiency.
He is currently involved in many research projects at CNR (National Research Council, Italy ) and for personal purpose.
Before joining Sease he worked in Intecs and List where he could experience different fields and levels of computer science, by handling low level programming problems such as embedded and networking up to high level trading algorithms.
He graduated with a dissertation about data compression and query performance on search engines.
He is active part of the information retrieval research community, attending international conferences such as SIGIR and ECIR.
His most recent pubblication is : FASTER BLOCKMAX WAND WITH VARIABLE-SIZED BLOCKS SIGIR 2017 Proceedings of the 40th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2017

Improving top-k retrieval algorithms using dynamic programming and longer skipping

Modern search engines has to keep up with the enormous growth in the number of documents and queries submitted by users. One of the problem to deal with is finding the best k relevant documents for a given query. This operation has to be fast and this is possible only by using specialised technologies.
Block max wand is one of the best known algorithm for solving this problem without any effectiveness degradation of its ranking.
After a brief introduction, in this talk I’m going to show a strategy introduced in “Faster BlockMax WAND with Variable-sized Blocks” (SIGIR 2017), that applied to BlockMaxWand data has made possible to speed up the algorithm execution by almost 2x.
Then, will be presented another optimisation of the BlockMaxWand algorithm (“Faster BlockMax WAND with Longer Skipping”, ECIR 2019) for reducing the time execution of short queries.

Sambhav Kothari

Sambhav is a software engineer at Bloomberg, working in the News Search Experience team.

Learning To Rank: Explained for Dinosaurs

Internet search has long evolved from days when you had to string up your query in just the right way to get the results you were looking for. Search has to be smart and natural, and people expect it to “just work” and read what’s on their minds.

On the other hand, anyone who has worked behind-the-scenes with a search engine knows exactly how hard it is to get the right result to show up at the right time. Countless hours are spent tuning the boosts before your user can find his favorite two-legged tiny-armed dinosaur on the front page.

When your data is constantly evolving, updating, it’s only realistic that so do your search engines. Search teams thus are on a constant pursuit to refine and improve the ranking and relevance of their search results. But, working smart is not the same as working hard. There are many techniques we can employ, that can help us dynamically improve and automate this process. One such technique is Learning to Rank.

Learning to Rank was initially proposed in academia around 20 years ago and almost all commercial web search-engines utilize it in some form or other. At Bloomberg, we decided that it was time for an open source search-engine to support Learning to Rank, so we spent more than a year designing and implementing it. The result of our efforts has been accepted by the Solr community and our Learning to Rank plugin is now available in Apache Solr.

This talk will serve as an introduction to the LTR(Learning-to-Rank) module in Solr. No prior knowledge about Learning to Rank is needed, but attendees will be expected to know the basics of Python, Solr, and machine learning techniques. We will be going step-by-step through the process of shipping a machine-learned ranking model in Solr, including:

  • how you can engineer features and build a training data-set as per your needs
  • how you can train ranking models using popular Python ML(machine learning) libraries like scikit-learn
  • how you can use the above-learned ranking-models in Solr

Get ready for an interactive session where we learn to rank!

Apache Solr Facets and ACL Filters Using Tag and Exclusion

What happens with facets aggregations on fields when documents in the results have been filtered by Access Control Lists ?
In such scenarios it is important to use the facet mincount parameter.
That specifies the minimum count in the result set for a facet value to appear in the response:

  • mincount=0, all the facet values present in the corpus are returned in the response. This includes the ones related to documents that have been filtered out by the ACLs(0 counts facets). This could cause some nasty side effect: such as a user seeing a facet value that he/she’s not supposed to see(because ACL filtered out that document from the result set).
  • mincount=1, only facet values matching at least one document in the result set are returned. This configuration is safe, users are going to see only facet values regulated by the ACL. They will effectively see only what they are supposed to see.

But what happens if you like to see 0 counting facet values, but preserving ACL?
This may help you in having a better understanding of the distribution of the values in the entire corpus, but ACL are still valid, so that users still see only possible values that they are supposed to see.
Tags and Exclusion comes handy in such case.

Faceting Tag And Exclusion

Tag and Exclusion is an extremely important feature for faceting in Apache Solr and you would not believe how many times it is misused or completely ignored, causing an erratic experience for the user.
Let’s see how it works:


You can tag a filter query using Solr local parameter syntax:


The same applies to the main query(with some caveats if you are using an explicit query parser) :

q={!tag=mainQuery}I am the main query

q={!edismax qf=text title tag=mainQuery}I am the main query

When assigning a tag we give Solr the possibility of identifying separately the various search clauses (such the main query or filter queries).
Effectively it is a way to assign an identifier to a search query or filter.

Excluding in Legacy Faceting

When applying filter queries, Solr is reducing the result space eliminating documents that don’t satisfy the additional filters added.
Let’s assume we want to count the values for a facet on the result set, ignoring the additional filtering that was added by a filter query.
Effectively can be equivalent to the concept of counting the facet values on a result set status that precedes the application of the filter that reduced the result set.
Apache Solr allows you to do that, without affecting the final results returned.

This is called exclusion and can be applied on a facet by facet basis.


This will calculate the ‘doctype’ field facet on the result set with the exclusion of the tagged filter (so for the matter of calculating such aggregation the “doctype:pdf” filter will not be applied and the counts will be calculated on an extended result set).
All other facets, aggregations and the result set itself will not be affected.

1.<Wanted Behaviour - applying tag and exclusion>
=== Document Type ===
[ ] Word (42)
[x] PDF (96)
[ ] Excel(11)
[ ] HTML (63)

This is especially useful for single valued fields:
when selecting a facet value and refreshing the search if you don’t apply tag and exclusion you will get just that value in the facets, defeating the refinement and exploration facet functionality for that field.

2.<Unwanted Behaviour - out of the box>
=== Document Type ===
[ ] Word (0)
[x] PDF (96)
[ ] Excel(0)
[ ] HTML (0)
3.<Unwanted Behaviour - mincount=1>
=== Document Type ===
[x] PDF (96)

As you see in 2. and 3. the facet become barely usable to further explore the results, this may bring the user experience to be fragmented with a lot of back and forth activity selecting and unselecting filters.

Excluding in Json Faceting

After the tagging of a filter, applying an exclusion with the json.facet approach is quite simple:

visibleValues: {
type: terms,
field: cat,
mincount: 1,
limit: 100,
domain: {
excludeTags: <tag>

When defining a json facet, applying exclusion is just adding the domain node with the excludeTags defined.

Tag and Exclusion to Preserve Acl Filtering in 0 counts


  • Users are subject to a set of ACL that limit their results visibility.
  • They would like to see also 0 count facets to have a better understanding of the result set and corpus.
  • You don’t want to invalidate the ACL control, so you don’t expect them to see sensible facet values.

Tagging the Main Query and Json Faceting

This is achievale with a combination of tagging and exclusion with Json faceting.
First of all, we want to tag the main query.
We assume the ACL control will be a filter query(and we recommend to apply ACL filtering with properly tuned filter queries).
Tagging the main query and excluding it from the facet calculation will allow us to get all the facet values in the ACL filtered corpus (the main query will be excluded but the ACL filter query will still be applied).

q={!edismax tag=mainQuery qf=name}query&fq=aclField:user1...
json.facet={visibleValues: {
type: terms,
field: cat,
mincount: 1,
limit: 100,
domain: {
excludeTags: mainQuery

We are almost there, this facet aggregation will give the counts of all facet values visible to the user in the original corpus(with ACL applied).
But what we want is to have the correct counts based on the current result set and all the visible 0 count facets.
To do that we can add a block to the Json faceting request:

q={!edismax tag=mainQuery qf=name}query&fq=aclField:user1...
resultSetCounts: {
type: terms,
field: category,
mincount: 1
visibleValues: {
type: terms,
field: category,
mincount: 1,
domain: {
excludeTags: mainQuery
  • resultSetCounts –  are the counts in the result set, including only NOT 0 counts facet values. This is the list of values the user has visibility on the current result set with correct counts.
  • visibleValues – are all the facet values in the result set the user should have visibility

Then, depending on the user experience we want to provide, we could use these blocks of information to properly render a final response.
For example we may want to show all visible values and associate with them a count from the resultSetCounts when available.

=== Document Type - Result Counts ===   
[ ] Word (10)
[ ] PDF (7)
[ ] Excel(5)
[ ] HTML (2)
=== Document Type - Visible Values ===
[ ] Word (100)
[ ] PDF (75)
[ ] Excel(54)
[ ] HTML (34)
[ ] Jpeg (31)
[ ] Mp4 (14)
 [ ] SecretDocType1 (0) -> not visible, mincount=1 in visibleValues
 [ ] SecretDocType2 (0) -> not visible, mincount=1 in visibleValues

=== Document Type - Final Result for users ===
[ ] Word (10) -> count is replaced with effective result count
[ ] PDF (7) -> count is replaced with effective result count
[ ] Excel(5) -> count is replaced with effective result count
[ ] HTML (2)-> count is replaced with effective result count
[ ] Jpeg (+31)
[ ] Mp4 (+14)

Bonus: What if I Defined the Query Parser in the Solrconfig.xml

This solution is still valid if you are using your query parser defined in the solrconfig.xml .
Extra care is needed to tag the main query.
You can achieve that using the local params in Solr request parameters:

<lst name="defaults">
<str name="q">{!type=edismax tag=mainQuery v=$qq}</str>
<str name="qq">*:*</str>

Query Time
.../solr/techproducts/browse?qq=ipod mini&fq=acl:user1&json.facet=...

Hope this helps when dealing with ACL or generic filter queries and faceting!

Apache Solr Distributed Facets

Apache Solr distributed faceting feature has been introduced back in 2008 with the first versions of Solr (1.3 according to this jira[1]) .
Until now, I always assumed it just worked, without diving too much into the details.
Nowadays distributed search and faceting are extremely popular, you can find them pretty much everywhere (in the legacy or SolrCloud form alike).
N.B. Although the mechanics are pretty much the same, Json faceting revisits this approach with some change, so we will now focus on legacy field faceting.

I think it’s time to get a better understanding of how it works:

Multiple Shard Requests

When dealing with distributed search and distributed aggregation calculations, you are going to see multiple requests going back and forth across the shards.
They have different focus and are meant to retrieve the different bits of information necessary to build the final response.
We are going to explore the different rounds of requests, focusing just for the faceting purpose.
N.B. Some of these requests are also carrying results for the distributed search calculation, this is used to minimise the network traffic.

For the sake of this blog let’s simulate a simple sharded index, white space tokenization on field1 and facet.field=field1

Shard 1 Shard 2
{  “id”:”1”,
“field1”:”a b”
{  “id”:”4”,
“field1”:”b c”
{  “id”:”2”,
{  “id”:”5”,
“field1”:”b c”
{  “id”:”3”,
“field1”:”b c”
{  “id”:”6”,

Global Facets : b(4), c(4), a(2)

Shard 1 Local Facets : a(2), b(2), c(1)

Shard 2 Local Facets : c(3), b(2)

Collection of Candidate Facet Field Values

The first round of requests is sent to each shard to identify the candidate top K global facet values.
To achieve this target each shard will be requested to respond with its local top K+J facet values and counts.
The reason we actually ask for more facets from each shard is to have a better term coverage, to avoid losing relevant facet values and to minimise the refinement requests.
How many more we request from each shard is regulated by the “overrequest” facet parameter, a factor that gives more accurate facets at the cost of additional computations[2].
Let’s assume we configure a facet.limit=2&facet.overrequest.count=0&facet.overrequest.ratio=1 to explain when refinement happens and how it works.

Shard 1 Returned Facets : a(2), b(2)

Shard 2 Returned Facets : c(3), b(2)

Global Merge of Collected Counts

The facet value counts collected from each shard are merged and the most occurring global top K is calculated.
These facet field values are the first candidates to be the final ones.
In addition to that, other candidates are extracted from the terms below the top K, based on the shards that didn’t return those values statistics.
At this point we have a candidate set of values and we are ready to refine their counts where necessary, asking back this information to the shards that didn’t include that in the first round.
This happens including the following specific facet parameter to the following refinement requests:


N.B. This request is specifically asking a Solr instance to return back the facet counts just for the terms specified[3]

Top 2 candidates = b(4), c(3)
Additional candidates = a(2)

The reason that a(2) is added to the potential candidates is because Shard 2 didn’t answer with a count for a, the potential missing count of 1 could bring a to the top K. So it is worth a verification.

Shard 1 didn’t return any value for the candidate c facet.
So the following request is built and sent to it:

Shard 2 didn’t return any value for the candidate a facet.
So the following request is built and sent to it:

Final Counts Refinement

The refinements counts returned by each shard can be used to finalise the global candidate facet values counts and to identify the final top K to be returned by the distributed request.
We are finally done!

Shard 1 Refinements Facets : c(1)

Shard 2 Refinements Facets : a(0)

Top K candidates updatedb(4), c(4), a(2)

GIven a facet.limit=2 the final global facets with correct results returned is :
b(4), c(4)





Synonyms and Stopwords: Vademecum

In this post we’ll cover two additional synonyms scenarios and we’ll try to summarise all previous tips in a coincise form. Following the approach of the previous posts [1] [2] [3], everything can be applied both to Apache Solr and Elasticsearch.


  • Synonyms and stopwords at query time: this is not just a “theoretical” constraint; imagine if you have to manage a deployment context belonging to the same customer with a lot of small / medium indexes: you cannot re-build from scratch everything each time a synonym or a stopword changes.
  • Synonyms, not hypernyms or hyponyms: or better, we aren’t talking about what a thesaurus calls broader, narrower or related terms. Although some of the things below could be also valid in those contexts, the broader or narrower scope introduced with hypernyms, hyponyms or related concepts can have some weird side-effect on the scoring phase.

Test data

Let’s start with the test data.

  • synonyms = [“out of warranty, oow”, “transfer phone number, port number”]
  • stopwords = [“of”, “my”]
  • query analyzer = [ “standard_tokenizer”, “lowercase filter”, “synonyms (graph) filter”, “stopwords filter”]

#1: How can I define Multi-terms Concepts?

If you want to manage a multi-terms concept as a whole, regardless it has synonyms or not, you can use the synonyms file. Here’s a couple of examples: the first is a concept with one synonym, the second one doesn’t have any synonym:

Multimedia Messaging Service,Multimedia Text Message,MMS
Apache Cassandra, Apache Cassandra

As you can see, when a concept doesn’t have any available synonym, we can just repeat it.

Solr users only: don’t forget the following things:

  • the request handler should use an edismax or lucene query parser, and the SplitOnWhiteSpace flag (sow) must be set to true
  • the field type which includes the synonyms graph filter must have the autoGeneratePhraseQueries set to true

You can read more here [1] about this approach.

Note: this will work until the Lucene SynonymMap uses a List/Array for collecting the synonyms associated with a given concept. When and if the implementation will switch to a Set-like approach, there’s a high chance this trick will stop working.

#2: What if the query contains multi-terms concepts with stopwords?

Imagine a query like this

q=my car is out of warranty. What can I do?

Well, with the configuration above the stopwords removal after the synonyms detection causes a weird effect on the generated query: the “what” term is wrongly added to the synonym phrase query: “out ? warranty what”.

While the issue affects the FilteringTokenFilter (the superclass of StopFilter) and therefore it has a wider scope, for this specific problem we proposed a solution [2], consisting of a specialised StopFilter which is aware about synonym tokens. The result is that terms which are part of a previously detected synonym are not removed, even if they are stopwords. The query analyzer of our field becomes something like this:

<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SynonymGraphFilterFactory" 
<filter class="io.sease.SynonymAwareStopFilterFactory" 

#3: What if the document contains multi-terms concepts with “intruder” stopwords?

We have a document like this:

  "id": 1,
  "title": "how do I transfer my phone number?"

and the query:

q=transfer phone number procedure

at query time, the synonym is correctly detected and phrase clauses are generated, but unfortunately it doesn’t match the document above because the intermediate “my” stopwords:

You can read here [3] the proposed solution for this scenario, which basically consists of a two-steps query plan: in the first, the detected synonyms generate phrase clauses, while in the second they are destructured in term clauses.

#4: What if the query contains multi-terms concepts with “intruder” stopwords?

And here we are in the opposite case. We have a document like this:

  "id": 1, 
  "title": "transfer phone number procedure" 

and the query:

q=how do I transfer my phone number?

As you can see, at query time the synonym is not detected because the “my” stopword between terms. While the document above could be still be part of the response of the generated query, here we are focusing on the missing synonym detection.

A possible solution is to double the synonym filter before and after the stopwords filter:

       class="solr.TextField" autoGeneratePhraseQueries="true">
       <analyzer type="index">
           <tokenizer class="solr.StandardTokenizerFactory"/>
           <filter class="solr.LowerCaseFilterFactory"/>
       <analyzer type="query">
           <tokenizer class="solr.StandardTokenizerFactory"/>
           <filter class="solr.LowerCaseFilterFactory"/>
           <filter class="solr.SynonymGraphFilterFactory" 
           <filter class="io.sease.SynonymAwareStopFilterFactory" 
           <filter class="solr.SynonymGraphFilterFactory" 

In the first iteration the synonym is not detected, then the StopFilter removes the “my” stopword so in the second iteration the synonym will be correctly recognized. Note the StopFilter is still the custom class we introduced in #2 because we want to cover also that scenario.

What is the drawback of this approach? This is something which worked in my specific case, but be aware that the SynonymGraphFilter documentation states this explicit warning:

NOTE: this cannot consume an incoming graph; results will be undefined.

#5 (UNSOLVED) What if the query contains multi-terms concepts more than one “intruder” stopwords?

This is the worst case, where we have a query like this:

q=out of my warranty

That is: we have a couple of terms which have been declared as stopwords, but the first (of) is potentially part of a synonym (out of warranty) while the second (my) isn’t.

We’re still working on this case so unfortunately there’s no a proposal here, if you got some idea or feedback, it is warmly welcome.

[1] Multi-terms concepts in Apache Solr / Elasticsearch
[2] SynonymAwareStopFilter

Still Synonyms + Stopwords?? Mamma mia!

The Context

Brief recap of where we arrived in the preceding article: we had the following synonyms and stopwords settings:

  • synonyms = {“out of warranty”,”oow”}
  • stopwords = {“of”}

Both of those filters were configured exclusively at query-time; the synonym filter first and then the stopwords filter.

Using the built-in StopFilter we had a synonym detection issue because the removal of the “of” term in the query string (e.g. “my device ran out of warranty“). For that reason, we introduced a custom StopFilter subclass which was aware about stopwords in synonyms.

The other scenario we are going to describe is a little bit different: let’s suppose we have the following data:

  • synonyms = {test code, tdd, testing}
  • stopwords = {my, your, how ,to, in}

Still here, we want to manage synonyms and stopwords only at query time.
We have this document indexed:

      "id": 1,
      "title": "Java programmer: do you want to test your code?"

And a query like this:

"how to test code in Java?"

The Problem: missing synonym match

The query parser matches the “test code” synonym in the query and produces a query like this:

(title:tdd title:testing PhraseQuery(title:"test code")) title:java

unfortunately there’s no match, because the document title contains an intruder: the “your” term between the “test” and “code”.

A Solution: invisible queries with and without synonym phrases

In the preceding article we’ve underlined the role of the autoGeneratePhraseQueries flag. It is the responsible of creating phrase clauses for all detected multi-terms synonyms. In case this flag is set to false (or even missing) the generated query won’t have any phrase, even if a multi-term synonym is detected.

While ordinarily this is not what you would expect, in this specific case it could be a valid alternative for dealing with such mismatching: a first request would require the “synonym phrasing” behaviour, a second one wouldn’t. The first query would be:

(title:tdd title:testing PhraseQuery(title:"test code")) title:java

After receiving an empty response, a second query will be sent, targeting another (similar) field related to a field type which has the autoGeneratePhraseQueries parameter will be set to false. That would generates the following query:

(title:testing title:tdd (+title:test +title:code)) title:java

and here we would get a match!

A couple of notes:

  • in the second try we are requiring the disjoint presence of those two terms (“test” and “code”) in whatever order, with whatever proximity, so the increased recall could produce some unexpected results. In case we are using the edismax query parser, a “pf” parameter would be helpful for moving up those results which adhere better to the entered query, in terms of proximity and terms order.
  • we could put the stop filter at index time, but that violates the precondition: we want a pure query-time management.

How to implement such search workflow? In Solr, we need a couple of fields, the first one is exactly the field + field type we described in the preceding article, the second is similar, the only difference is in the autoGeneratePhraseQueries parameter, which is set to false:

       class="solr.TextField" autoGeneratePhraseQueries="true">
       <analyzer type="index">
           <tokenizer class="solr.StandardTokenizerFactory"/>
           <filter class="solr.LowerCaseFilterFactory"/>
       <analyzer type="query">
           <tokenizer class="solr.StandardTokenizerFactory"/>
           <filter class="solr.LowerCaseFilterFactory"/>
           <filter class="solr.SynonymGraphFilterFactory" 
           <filter class="sc.SynonymAwareStopFilterFactory" 
       class="solr.TextField" autoGeneratePhraseQueries="false">
       <analyzer type="index">
           <tokenizer class="solr.StandardTokenizerFactory"/>
           <filter class="solr.LowerCaseFilterFactory"/>
       <analyzer type="query">
           <tokenizer class="solr.StandardTokenizerFactory"/>
           <filter class="solr.LowerCaseFilterFactory"/>
           <filter class="solr.SynonymGraphFilterFactory" 
           <filter class="sc.SynonymAwareStopFilterFactory" 

      type="text_with_synonyms_phrases .../>
      type="text_without_synonyms_phrases .../>

then, here is the minimal request handler:

<requestHandler name="/search" class="solr.SearchHandler" default="true">
       <lst name="defaults">
           <bool name="sow">false</bool>
           <str name="df">title_with_synonyms_phrases</str>
           <str name="defType">lucene</str> 

A client would send first a request like this:

/search?q=how to test code in Java

And, after receiving an empty response, it will send a second query:

/search?q=how to test code in Java&df=text_without_synonyms_phrases

Another option, which moves the search workflow on Solr side, is our CompositeRequestHandler [1], a Solr component which invokes in chain a set of RequestHandler instances: a first request handler, targeting the title_with_synonyms_phrases would be invoked and, in case of zero results, the same query will be sent to another request handler, which would target the title_without_synonyms_phrases.

Note for Elasticsearch users: you will find some difference in applying what is described above. Although the auto_generate_phrase_queries attribute is also present in Elasticsearch, it doesn’t have the same effect. What you’re looking for is an attribute which is not related with field types, it is a query attribute [2] [3]  and it is called auto_generate_synonyms_phrase_query.

[2] Match Query / Synonyms
[3] Query String Query / Synonyms

Synonyms + Stopwords?? OMG!

The Context

The scenario description is quite simple: we want to use synonyms and stopwords.

Following the path of our previous article, we will introduce an additional component in the analysis chain: a StopFilter, which, as the name suggests, removes a set of words from an incoming token stream.

We will use the following data through the examples:

  • synonyms = [“out of warranty”,”oow”]
  • stopwords = [“of”]

Token filters can be configured at index and/or query time. In this context we are focused on the query side: both synonyms and stopwords will be configured only in the query analyzer.

Working exclusively at query time has a great benefit: we can change things at runtime without any reindex need. At the same time, no stopwords filtering will be executed at index time so those terms will be uselessly part of the dictionary.

The Problem: synonyms followed by stopwords

We have the following analyzers:

  • index analyzer
    • standard-tokenizer
    • lowercase
  • query analyzer
    • standard-tokenizer
    • lowercase + synonyms + stopwords

Theoretically, in the query analyzer we would have two options: the stopwords filter could be defined before or after the synonym filter. However, the first way (before) doesn’t make so much sense, because terms that are stopwords and that are, at the same time, part of a synonym will be removed before the synonym detection. As consequence of that those synonym won’t be detected: in the example data, issuing a query like

q=out of warranty

the “of” term will be removed by the StopFilter, the subsequent filter would receive [“out”, “warranty”], which doesn’t match the configured synonym (“out of warranty”).

Elasticsearch users: Elasticsearch doesn’t allow this scenario at all; if you try to use the PUT Settings API with a chain defined as above (first stopwords then synonyms with some term intersection), it will throw an illegal argument exception saying “term: out of warranty analyzed to a token (warranty) with position increment != 1 (got: 2)” .

Apache Solr instead uses a lenient approach: no errors at index creation, but the problem remains (personally I prefer the Elasticsearch approach)

So the obvious choice is to postpone the stopwords management after the synonym filter. Unfortunately, here there’s an issue: the stopword(s) removal has some unwanted side-effect in the generated token graph and the query parser generates a wrong query because it consumes the token stream at the end of the chain.

Let’s imagine we have the following query:

q=tv went out of warranty something of

it will generate the following:

title:tv title:went (title:oow PhraseQuery(title:"out ? warranty something"))

As you can see, the synonym (out of warranty -> oow) is correctly detected but the stopwords filter removes all the “of” tokens, even if the first occurrence is part of a synonym. In the generated query you can see the sneaky effect: the “hole” created by the first “of” occurrence removal, produces the inclusion, in the phrase query, of the next available token in the stream (“something”, in the example).

In other words, the oow token synonym is marked with a positionLength = 3, which correctly means it spans three tokens (1=out, 2=of, 3=warranty); later, the query parser will include the next three available terms for generating a synonym phrase queries but since we no longer have the 2nd token (of), such count includes also “something”, which is the 3rd available token in the stream.

Before proceeding: this is a known problem, a long-standing issue [1] in Lucene which has a broader domain because it is related with the FilteringTokenFilter, the superclass of StopFilter.

The problem we will try to solve is: how can we manage synonyms and stopwords at query time without generating the conflict above?

A Solution

A note first: the token filter we are going to create is something that deals only with Lucene classes. However, when things need to be plugged in a runtime container (e.g. Apache Solr or Elasticsearch) the deployment procedure depends on the target platform: we won’t cover this part here.

The proposed solution is to create a StopFilter subclass which will be “synonym-aware”; it will check the tokenType and positionLength attributes before deciding if a token needs to be removed from the stream. The goal is to avoid removing those terms which have been defined in the stopwords list but are part of a synonym definition.

The class that we are going to extends is org.apache.lucene.analysis.core.StopFlter. This is an empty class, because all the filtering logic is in the superclasses (org.apache.lucene.analysis.StopFilter and the more generic org.apache.lucene.analysis.FilteringTokenFilter). The stopwords logic resides in the accept() method, which as you can see is very simple:

protected boolean accept() {
  return !stopWords.contains(termAtt.buffer(), 0, termAtt.length());

If the stopwords list contains the current term, it will be removed. So far, so good. We need to extend (actually we could also decorate) the StopFilter class for doing something else before calling the logic above.

First we need to check the token type: if a token has been marked as a SYNONYM then our filter doesn’t have to remove it. Then we need to check the positionLength attribute, because, within a synonym detection context, a position length greater than 1 means we have traversing a multi-term synonym:

public class SynonymAwareStopFilter extends StopFilter {

  private TypeAttribute tAtt = 
  private PositionLengthAttribute plAtt = 

  private int synonymSpans;

  protected SynonymAwareStopFilter(
                         TokenStream in, CharArraySet stopwords) {
    super(in, stopwords);

  protected boolean accept() {
    if (isSynonymToken()) {
      synonymSpans = plAtt.getPositionLength() > 1 
                             ? plAtt.getPositionLength() 
                             : 0;
      return true;

    return (--synonymSpans > 0) || super.accept();

  private boolean isSynonymToken() {
    return "SYNONYM".equals(tAtt.type());

Let’s do some test. We will use Apache Solr 7.4.0 for checking the results. Here is the field type definition, where you can see our SynonymAwareStopFilter:

<fieldtype name="text" class="solr.TextField" autoGeneratePhraseQueries="true">
       <analyzer type="index">
           <tokenizer class="solr.StandardTokenizerFactory"/>
           <filter class="solr.LowerCaseFilterFactory"/>
       <analyzer type="query">
           <tokenizer class="solr.StandardTokenizerFactory"/>
           <filter class="solr.LowerCaseFilterFactory"/>
           <filter class="solr.SynonymGraphFilterFactory" 
           <filter class="sc.SynonymAwareStopFilterFactory" 

and this is a minimal request handler:

<requestHandler name="/def" class="solr.SearchHandler" default="true">
       <lst name="defaults">
           <bool name="sow">false</bool>
           <str name="df">title</str>
           <str name="defType">lucene</str>
           <bool name="debug">true</bool>

Running the previous query:

q=tv went out of warranty something of

we have the following:

title:tv title:went (title:oow PhraseQuery(title:"out of warranty")) title:something

if we use instead the other synonym variant:

q=tv went oow something of

we have the following:

title:tv title:went (PhraseQuery(title:"out of warranty") title:oow) title:something

Everything seems working as expected! This is probably just one specific scenario among those addressed by LUCENE-4065; however, it helped me a lot because this is (at least in my experience) a frequent use case.

As usual, any feedback is warmly welcome. See you next time!



Apache Solr/Elasticsearch: How to Manage Multi-term Concepts out of the Box?

This flash blog post will address a very specific and common problem : how to manage entities/concepts composed by multiple terms in a vanilla Apache Solr/Elasticsearch instance ( no plugins or extensions to install).

The (deployment) context

An Elasticsearch or Apache Solr infrastructure where you cannot install third-party components (e.g. plugins, filters, query parsers). This can happen for several reasons:

  • endogenous factors: lack of required expertise/skills for implementing/installing things in your infrastructure.
  • exogenous factors: search capabilities live in an external and managed context which doesn’t allow custom components. This happens for example with services like the Amazon Elasticsearch Service [1].

The Problem

How can I model multi-terms concepts (i.e. concepts composed by multiple terms)?

Concepts are a fundamental piece of a domain specific vocabulary: “United States of America”, “Phone Number”, “Out of Warranty” are just examples of entities you probably want to manage as a whole; if a user searches something like “How can I transfer a Phone Number?” you probably don’t want to return things about numbers or phones, which have a broader scope (i.e. sacrificing precision in favour of recall) .

Note, “probably” is in bold because there’s not an absolute truth here: everything depends on the functional context where the application is running. Here we assume this requirement, but things could be different in another context.

Index/Query Time Solutions

The problem can be solved using two different approaches that involves Indexing time and Query time configurations :

  • Simple Contraction: using the SynonymFilter with Simple Contraction [2] [3] in order to inject a single term that represents the concepts (with optional synonyms)
Multimedia Messaging Service,Multimedia Text Message => mms
  • Shingles: combining shingles and a keep word filter for generating bigrams and trigrams from a given text (assuming we are limiting our interest only to concepts composed by a maximum of 3 terms)

…an Additional Constraint: Query-Time Only

One of the drawbacks of the first group (index + query time) is the cost of reindexing the whole dataset when a change occurs in the synonyms or in the keywords list. So the additional constraint we will introduce is: we want to be able to change the concepts list at runtime without any reindexing.

Again, this is not an absolute constraint: there are a lot of scenarios where reindex everything is completely ok. In my experience this has a direct correlation with the index size plus how the whole reindexing process takes.

So in other words: if the full reindexing process takes a reasonable amount of time and it doesn’t produce any service interruption, then you should consider removing the “query-time only” constraint.

A Solution

The synonym management has been enhanced in Elasticsearch and Apache Solr [4], with the introduction of “graph-aware” token filters, for enabling a full support of Multi-Word Synonyms. 

Prior to that, both index/query time and query time approaches suffered from some limitations when dealing with synonyms [5].

In any case, the current implementation allows us to correctly manage multi-word synonyms as a whole. So, coming back to our question, we can try to “shape” the synonym filter at our wills for managing compound concepts as well.

First, a compound concept can or cannot have synonyms.

Concept with Synonym(s)

If the concept has one or more synonyms, we are within a regular context of the synonym filter. Here’s a sample content of the synonyms.txt file:

Multimedia Messaging Service,Multimedia Text Message,MMS
USA,United States of America

Here’s the corresponding configuration:

<fieldtype name="txt" 
       <analyzer type="index">
           <tokenizer class="solr.StandardTokenizerFactory"/>
           <filter class="solr.LowerCaseFilterFactory"/>
       <analyzer type="query">
           <tokenizer class="solr.StandardTokenizerFactory"/>
           <filter class="solr.LowerCaseFilterFactory"/>
<field name="title" type="txt" indexed="true" stored="true"/>

"analysis": {
    "filter": {
      "english_synonyms": {
         "type": "synonym_graph",
         "synonyms_path": "synonyms.txt",
         "expand": true
    "analyzer": {
      "text_index_analyzer": {
        "tokenizer": "standard",
        "filter": [ "lowercase" ]
      "text_query_analyzer": {
        "tokenizer": "standard",
        "filter": [
"properties": {
   "title": {
      "type": "text",
      "analyzer": "text_index_analyzer",
      "search_analyzer": "text_query_analyzer"

Issuing the following queries:

q={!lucene}multimedia messaging service&sow=false&df=title

  "query": {
     "match": {
       "title": "Multimedia messaging service"

   "query": {
     "query_string": {
       "query": "Multimedia messaging service",
       "default_field": "title",

produces (use debug=true in Solr or _validate/query?explain=true in Elasticsearch) the following query:

(PhraseQuery(title:"multimedia text message") 
 PhraseQuery(title:"multimedia messaging service"))

which is exactly what we want: the concept expressed in the query has been detected and the output query is looking for its expanded (united states of america) or contracted (usa) form. So far, so good.

Concept without Synonyms

What about if a multi-term concept doesn’t have any variant/synonym but we still want to manage it as a whole? Can we use the synonym filter without synonyms? Let’s try to see what happens.

The first obvious idea is to declare something like this, in our synonyms.txt:

Multimedia Messaging Service  

That is, a line containing our concept without any synonym. Unfortunately this is not working, the engine detects there are no synonyms and the resulting query is something like this:

title:multimedia title:messaging title:service

The same happens if you put a dummy synonym which is removed later, in the indexing chain, by a stopword filter. Something like:

Multimedia Messaging Service, something_that_will_be_configured_as_stopword

So the last chance is to duplicate the entry: something like this:

Multimedia Messaging Service,Multimedia Messaging Service

Here things start to be interesting. Running the same queries above we got the following explain:

(PhraseQuery(title:"multimedia messaging service") 
 PhraseQuery(title:"multimedia messaging service"))

title:"multimedia messaging service"^2.0

The phrase query is doubled, which means the score assigned to matches will reflect this, as illustrated in the example explain below (see the 2.0 boost):

4.544185 = sum of:
  4.544185 = weight(title:"multimedia messaging service" in 1) [SchemaSimilarity], result of:
    4.544185 = score(doc=1,freq=1.0 = phraseFreq=1.0
), product of:
      2.0 = boost
      2.7725887 = idf(), sum of:
        0.6931472 = idf, computed as ... from:
          1.0 = docFreq
          2.0 = docCount
        0.6931472 = idf, computed as ... from:
          1.0 = docFreq
          2.0 = docCount
        0.6931472 = idf, computed as ... from:
          1.0 = docFreq

That is, the score of the match above would have been 2.2720926 (4.544185 / 2) but since we have that artificial boost it has been doubled; at the same time it’s important to underline our concept has been correctly managed as a whole, and the queries above won’t return items related with messages, multimedia or services, which are broader concepts.

Is that boost factor a problem? That actually depends on your application: you should

  • have a representative number of search cases
  • try a plain term-centric search
  • try the approach suggested above
  • use a search quality evaluation tool
  • compare and choose


  • You have an Elasticsearch or Apache Solr cluster
  • You cannot install custom plugins
  • You want to manage compound concepts
  • You don’t want to reindex your corpus when adding / removing / updating the concepts list
  • If a concept has one or more synonyms, this is quite straightforward: use the synonym (graph) filter at query time*
  • if a concept doesn’t have any synonym, you can still use the synonym (graph) filter: just double the concept definition, but keep in mind the double (2.0) boost applied to the corresponding phrase query*

* Apache Solr users: make sure the target field type has autoGeneratePhraseQueries set to true, and the sow parameter (defined in the RequestHandler settings or as a request parameter) set to true as well.






Rated Ranking Evaluator: Help the poor (Search Engineer)

A Software Engineer is always required to give his customers a concrete evidence about deliverables quality. A Search Engineer deals with a specialisation of such generic Software Quality, which is called Search Quality.

What is Search Quality? And why is it so important in a search infrastructure? After all, the “Software Quality” should be omni-comprensive, it should always include everything (and actually it is), but when we are dealing with search systems, the quality is a very abstract term, which is very hard to define in advance.

The functional correctness of a search infrastructure (assuming the correctness is the only factor which influences the system quality – and it isn’t) is naturally associated with human judgments, with opinions, and unfortunately we know opinions can be different among people.

The business stakeholders, which will get a value from a search system, can belong to different categories, can have different expectations, and they can have in mind a different idea about the expected system correctness.

In this scenario a Search Engineer is facing many challenges in terms of choices, and at the end, he has to provide concrete evidences about the functional coverage of those choices.

This is the context where we developed the Rated Ranking Evaluator (hereafter RRE).

What it is?

The Rated Ranking Evaluator (RRE) is a search quality evaluation tool which evaluates the quality of results coming from a search infrastructure.

It helps a Search Engineer in his daily job. Are you a Search Engineer? Are you tuning/implementing/changing/configuring a search infrastructure? Do you want to have something that gives you an evidence about the improvements between changes? RRE could give you a hand on that.

RRE formalises how well a search system satisfies the user information needs, at “technical” level, combining a rich tree-like domain model with several evaluation measures, but also at “functional” level, providing human-readable outputs that could target the business stakeholders.

It encourages an incremental/iterative/immutable approach during the deveoopment and the evolution of a search system: assuming we’re starting our system from version x.y: when it’s time to apply some relevant change to its configuration, instead of applying changes to x.y, is better to clone it and apply those changes to the new fresh version.

In this way, RRE will execute the evaluation process on all available versions, it will provide the delta/trend between  subsequent versions, so you can immediately get a fine-grained picture about where the system is going, in terms of relevance.

This post is only a brief summary about RRE. You can find more detailed information in the project Wiki.

In a few words, what can I get from RRE?

You can configure RRE as a compounding part of your project build cycle. That means, every time a build is triggered, an evaluation process will be executed.

RRE is not tied to a given search platform: it provides a mini-framework for plugging-in different search platforms. At the moment we have two available bindings: Apache Solr and Elasticsearch  (see here for supported versions).

The output evaluation data will be available:

  • as a JSON file: for further elaborations
  • as a spreadsheet: for delivering the evaluation results to someone else (e.g. a business stakeholder)
  • in a Web Console where metrics and their values get refreshed in real time (after each build)

How it works

RRE provides a rich, composite, tree-like, domain model, where the evaluation concept can be seen at different levels.

RRE Domain Model

The Evaluation at the top level is just a container of the nested entities. Note that all entities relationships are 1 to many. In this context, a Corpus is defined as a test dataset. RRE will use it for executing the evaluation process; in a single evaluation process you can have multiple datasets.

A Topic is an information need: it defines a functional requirement on the end-user perspective. Within a topic we can have several queries, which express the same need but more close to a technical layer. RRE provides a further abstraction in the middle: query groups. A Query Group is a group of queries which are supposed to produce the same results (and therefore are associated with the same judgments set).

Queries, which are the technical leaves of RRE domain model, are furtherly decomposed in several perspectives, one for each available version of our system. A query itself is of course a single entity, but during an evaluation session, its concrete execution happens several times, one for each available version. That because RRE needs to measure the search results (i.e. the query executions) against all versions.

For each version we will finally have one or more metrics, depending on the configuration. Last but not least, even if metrics are computed at query/version level, RRE will aggregate those values at upper levels (see the dashed vertical lines in the diagram) so each entity/level in the domain model will offer an aggregate perspective of all available metrics (i.e I could be interested in the NDCG for a given query, or I could just stop my analysis at a topic level).


In order to execute an evaluation process, RRE needs the following things:

  • One or more corpus / test collection: these are the representative datasets of a specific domain, that will be used for populating and querying a target search platform
  • One or more configuration sets: although there’s nothing against having one single configuration, a minimum of two versions are required in order to provide a comparison between evaluation measures.
  • One or more ratings sets: this is where judgments are defined, in terms of relevant documents for each query group.


The RRE concrete output depends on the runtime container where it is running. The RRE core itself is just a library, so when used programmatically within a project, it outputs a set of objects corresponding to the domain model described above.

When it is used as a Maven plugin, it primarily outputs the same structure in JSON format. This data is then used for producing further outputs, like a spreadsheet. The same payload can be sent to another module called RRE Server, which offers an AngularJS based web console that gets automatically refreshed.

The RRE console is very useful when we are doing internal iterations / tries around some issue, which usually requires very short edit-and-immediately-check cycles. Imagine if you can have a couple of monitors on your desk: in the first there’s your favourite IDE, where you change things, run builds. In the second there’s the RRE Console (see below). After each build, just have a look on the console in order to get an immediate feedback of your changes.

Where can I start?

The project repository in Github offers all what you need: a detailed documentation about how it works and how to quick start with RRE.

If you need some help, feel free to contact us! We appreciate any feedback, suggestion and, last but not least, contribution.

Future works

As you can imagine, the topic is quite huge. We have a lot of interesting ideas about the platform evolution.

These are some examples:

  • integration with some tool for building the relevance judgments. That could be some UI or a more sophisticated user interaction collector (which will automatically generates the ratings sets on top of computed online metrics like click through rate, sales rate)
  • Jenkins plugin: for a better integration of RRE into the popular CI tool
  • Gradle plugin
  • Apache Solr Rank Eval API: using the RRE core we could implement a Rank Eval endpoint in Solr, similar to the Rank Eval API provided in Elasticsearch
  • ??? Other? Any suggestion is warmly welcome!


Apache Lucene BlendedInfixSuggester : How It Works, Bugs And Improvements

The Apache Lucene/Solr suggesters are important to Sease : we explored the topic in the past[1] and we strongly believe the autocomplete feature to be vital for a lot of search applications.
This blog post explores in details the current status of the Lucene BlendedInfixSuggester, some bugs of the most recent version ( with the solution attached) and some possible improvements.


The BlendedInfixSuggester is an extension of the AnalyzingInfixSuggester with the additional functionality to weight prefix matches of your query across the matched documents.
It scores higher if a hit is closer to the start of the suggestion.
N.B. at the current stage only the first term in your query will affect the suggestion score

Let’s see some of the configuration parameters from the official wiki:

  • blenderType: used to calculate the positional weight coefficient using the position of the first matching word. Can be one of:
    • position_linear: weightFieldValue*(1 – 0.10*position): Matches to the start will be given a higher score (Default)
    • position_reciprocal: weightFieldValue/(1+position): Matches to the start will be given a score which decays faster than linear
    • position_exponential_reciprocal: weightFieldValue/pow(1+position,exponent): Matches to the start will be given a score which decays faster than reciprocal
      • exponent: an optional configuration variable for the position_reciprocal blenderType used to control how fast the score will increase or decrease. Default 2.0.
Data Structure Auxiliary Lucene Index
Building For each Document, the stored content from the field is analyzed according to the suggestAnalyzerFieldType and then additionally EdgeNgram token filtered.

Finally an auxiliary index is built with those tokens.

Lookup strategy The query is analysed according to the suggestAnalyzerFieldType.

Than a phrase search is triggered against the Auxiliary Lucene index

The suggestions are identified starting at the beginning of each token in the field content.

Suggestions returned The entire content of the field .

This suggester is really common nowadays as it allows to provide suggestions in the middle of a field content, taking advantage of the analysis chain provided with the field.

It will be possible in this way to provide suggestions considering synonymsstop words, stemming and any other token filter used in the analysis and match the suggestion based on internal tokens.
Finally the suggestion is scored, based on the position match.

The simple corpus of document for the examples will be the following :

        "title":"Video gaming: the history"},
        "title":"Nowadays Video games are a phenomenal economic business"},
        "title":"The new generation of PC and Console Video games"},
        "title":"Video games: multiplayer gaming"}]

And a simple synonym mapping : multiplayer, online

Let’s see some example :

Query to autocomplete Suggestions Explanation
  • “Video gaming: the history”
  • “Video game: multiplayer gaming”
  • “Nowadays Video games are a phenomenal economic business”
The input query is analysed, and the tokens produced are the following : “game” .

In the Auxiliary Index , for each of the field content we have the EdgeNgram tokens:

“v”,”vi”,”vid”… , “g”,”ga”,”gam”,“game” .

So the match happens and the suggestion are returned.
N.B. First two suggestions are ranked higher as the matched term happen to be closer to the start of the suggestion

Let’s explore the score of each Suggestion given various Blender Types :

Query gaming
Suggestion First Position Match Position Linear Position Reciprocal Position Exponential Reciprocal
Video gaming: the history 1 1-0.1*position = 0.9 1/(1+position) = 1/2 = 0.5 1/(1+position)^2 = 1/4 = 0.25
Video game: multiplayer gaming 1 1-0.1*position = 0.9 1/(1+position) = 1/2 = 0.5 1/(1+position) = 1/4 = 0.25
Nowadays Video games are a phenomenal economic business 2 1-0.1*position = 0.8 1/(1+position) = 1/3 = 0.3 1/(1+position) = 1/9 = 0.1

The final score of the suggestion will be :

long score = (long) (weight * coefficient)

N.B. the reason I highlighted the data type is because it’s directly affecting the first bug we discuss.

Suggestion Score Approximation

The optional weightField parameter is extremely important for the Blended Infix Suggester.
It assigns the value of the suggestion weight ( extracted from the mentioned field).
The suggestion may come from the product name field, but the suggestion weight depends on how profitable the product suggested is.

<str name=”field”>productName</str>
<str name=”weightField”>profit</str>

So far, so good, but unfortunately there are two problems with that.

Bug 1 – WeightField Not Defined -> Zero suggestion score

How To Reproduce It : Don’t define any weightField in the suggester config
Effect : the suggestion ranking is lost, all the suggestions have 0 score, position of the match doesn’t matter anymore
The weightField is not a mandatory configuration for the BlendedInfixSuggester.
Your use case could not involve any weight for your suggestions and you are just interested in the positional scoring (the main reason the BlendedInfixSuggester exists in the first place).
Unfortunately, this is not possible at the moment :
If the weightField is not defined, each suggestion will have a weight of 0.
This is because the weight associated to each document in the document dictionary is a long. If the field to extract the weight from, is not defined (null), the weight returned will just be 0.
This doesn’t allow to differentiate when a weight should be 0 ( value extracted from the field) or null ( no value at all ).
A solution has been proposed here[3].

Bug 2 – Bad Approximation Of Suggestion Score For Small Weights

There is a misleading data type casting in the score calculation for the suggestion :

long score = (long) (weight * coefficient)

This apparently innocent cast, actually brings very nasty effects if the weight associated to a suggestion is unitary or small enough.

Weight =1
Video gaming: the history
1-0.1*position = 0.9 * 1 =cast= 0
1/(1+position) = 1/2 = 0.5 * 1 =cast= 0
1/(1+position)^2 = 1/4 = 0.25 * 1 =cast= 0

Weight =2
Video gaming: the history
1-0.1*position = 0.9 * 2=cast= 1
1/(1+position) = 1/2 = 0.5 * 2=cast= 1
1/(1+position)^2 = 1/4 = 0.25 * 2=cast= 0

Basically you risk to lose the ranking of your suggestions reducing the score to only few possible values : 0 or 1 ( in edge cases)

A solution has been proposed here[3]

Multi Term Matches Handling

It is quite common to have multiple terms in the autocomplete query, so your suggester should be able to manage multiple matches in the suggestion accordingly.

Given a simple corpus (composed just by the following suggestions) and the query :
“Mini Bar Frid” 

You see these suggestions:

  • 1000 | Mini Bar something Fridge
  • 1000 | Mini Bar something else Fridge
  • 1000 | Mini Bar Fridge something
  • 1000 | Mini Bar Fridge something else
  • 1000 | Mini something Bar Fridge

This is because at the moment, the first matching term wins all ( and the other positions are ignored).
This brings a lot of possible ties (1000), that should be broken to give the user a nice and intuitive ranking.

But intuitively I would expect in the results something like (note that allTermsRequired=true and the schema weight field always returns 1000)

  • Mini Bar Fridge something
  • Mini Bar Fridge something else
  • Mini Bar something Fridge
  • Mini Bar something else Fridge
  • Mini something Bar Fridge

Let’s see a proposed Solution [4] :

Positional Coefficient

Instead of taking into account just the first term position in the suggestion, it’s possible to use all the matching positions from the matched terms [“mini”,”bar”,”fridge”].
Each position match will affect the score with :

  • How much the matched term position is distant from the ideal position match
    • Query : Mini Bar Fri, Ideal Positions : [0,1,2]
    • Suggestion 1Mini Bar something Fridge, Matched Positions:[0,1,3]
    • Suggestion 2Mini Bar something else Fridge, Matched Positions:[0,1,4]
    • Suggestion 2 will be penalised as “Fri” match happens farer (4 > 3) from the ideal position 2
  • Earlier the mis-position happened, stronger the penalty for the score to pay
    • Query : Mini Bar Fri, Ideal Positions : [0,1,2]
    • Suggestion 1Mini Bar something Fridge, Matched Positions:[0,1,3]
    • Suggestion 2Mini something Bar Fridge, Matched Positions:[0,2,3]
    • Suggestion 2 will be additionally penalised as the first mis-match in positions Bar happens closer to the beginning of the suggestion 

Considering only the discountinue position proved useful :

Query1: Bar some
Query2: some
Suggestion : Mini Bar something Fridge
Query 1 Suggestion Matched Terms positions : [1,2]
Query 2 Suggestion Matched Terms positions : [2]

If we compare the suggestion score for both these queries, it would seem unfair to penalise the first one just because it matches 2 terms ( consecutive) while the second query has just one match ( positioned worst than the first match in query1)

Introducing this advanced positional coefficient calculus helped in improving the overall behavior for the experimental test created.
The results obtained were quite promising :

Query : Mini Bar Fri
100 |Mini Bar Fridge something
100 |Mini Bar Fridge something else
100 |Mini Bar Fridge a a a a a a a a a a a a a a a a a a a a a a
26 |Mini Bar something Fridge
22 |Mini Bar something else Fridge
17 |Mini something Bar Fridge
8 |something Mini Bar Fridge
7 |something else Mini Bar Fridge

There is still a tie for the exact prefix matches, but let’s see if we can finalise that improvement as well .

Token Count Coefficient

Let’s focus on the first three ranking suggestions we just saw :

Query : Mini Bar Fri
100 |Mini Bar Fridge something
100 |Mini Bar Fridge something else
100 |Mini Bar Fridge a a a a a a a a a a a a a a a a a a a a a a

Intuitively we want this order to break the ties.
Closer the number of matched terms with the total number of terms for the suggestion, the better.
Ideally we want our top scoring suggestion to just have the matched terms if possible.
We also don’t want to bring strong inconsistencies for the other suggestions, we should ideally only affect the ties.
This is achievable calculating an additional coefficient, dependant on the term counts :
Token Count Coefficient = matched terms count / total terms count

Then we can scale this value accordingly :
90% of the final score will derive from the positional coefficient
10% of the final score will derive from the token count coefficient

Query : Mini Bar Fri
90 * 1.0 + 10*3/4 = 97|Mini Bar Fridge something
90 * 1.0 + 10*3/5 = 96|Mini Bar Fridge something else
90 * 1.0 + 10*3/25 = 91|Mini Bar Fridge a a a a a a a a a a a a a a a a a a a a a a

It will require some additional tuning but the overall idea should bring a better ranking function to the BlendedInfix when multiple terms matches are involved!
If you have any suggestion, feel free to leave a comment below!
Code is available in the Github Pull Request attached to the Lucene Jira issue[4]

[1] Solr Autocomplete
[2] Blended Infix Suggester Solr Wiki
[3] LUCENE-8343
[4] LUCENE-8347

SolrCloud exceptions with Apache Zookeeper

At the time we speak ( Solr 7.3.1 ) SolrCloud is a reliable and stable distributed architecture for Apache Solr.
But it is not perfect and failures happen.

Apache Zookeeper[1] is the system responsible of managing the communications across the SolrCloud cluster.
It contains the shared collections configurations and it has the view of the cluster status.
It is part of the brain of the cluster, a keeper that maintains the cluster healthy and functional.

It is able to answer questions such as :

• Who is the leader for this shard and collection?
• Is this node down ?
• Is this node recovering ?

The Solr nodes communicate with Zookeeper to understand who to contact when running SolrCloud operations.

This lightening blog post will present some practical tips to follow when your client application encounters some classic exceptions dealing with SolrCloud and Apache Zookeeper.
Special thanks to the Apache Solr user mailing list contributors and the Apache Solr community, this post is an aggregation of recommendations from there and from official code and documentation.

org.apache.solr.common.SolrException: Could not load collection from ZK: <collection name>

If you landed here with just that Exception I assume there is a missing :
“ Caused by: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /collections/<collection name>/state.json “ ?

Solr’s zkClientTimeout is used to set ZooKeeper’s sessionTimeout, and that’s what is exceeded when a session expires.
When this kind of exception happens, it means something has gone VERY wrong in the Solr-Zookeeper communication, 30 seconds ( the current default[2] ) is a REALLY long time when applications are trying to communicate.

Recommendation : take care of the different time outs around, don’t keep them too small !
i.e. for zkClientTimeout assign a value >= 30 seconds.
maxSessionTimeout (Zookeeper)
New in 3.3.0: the maximum session timeout in milliseconds that the server will allow the client to negotiate. Defaults to 20 times the tickTime.

zkClientTimeout (Solr)
Controls your client timeout.

Once checked the time outs, let’s explore some possible root causes.
A session expiry can be caused by:
1. Garbage collection on Solr node/Zookeeper – extreme GC pauses can happen with the heap being too small or VERY large
2. Slow IO on disk.
3. Network latency


  1. set up a JVM profiler to monitor closely your Solr and Zookeeper nodes, take particular attention to the garbage collection cycles and the memory usage in general : you don’t want Zk to swap too much! ( GCViewer[3] could be a nice tool for this)
  2. Verify that the Zookeeper node has a fast writing access to the disk : Zookeeper needs fast writes and ideally a separate disk allocated.
  3. Monitor your network and make sure the the solr nodes can talk effectively to the Zookeeper nodes

In case the suggestions are not solving your problem, you may be experiencing a Solr bug.
One of them is[4] which unfortunately has not been fixed yet.

org.apache.solr.client.solrj.SolrServerException: No live SolrServers available to handle this

From the official JavaDoc :

“Tries to query a live server from the list provided in Req. Servers in the dead pool are skipped.
* If a request fails due to an IOException, the server is moved to the dead pool for a certain period of
* time, or until a test request on that server succeeds.
* Servers are queried in the exact order given (except servers currently in the dead pool are skipped).
* If no live servers from the provided list remain to be tried, a number of previously skipped dead servers will be tried.
* Req.getNumDeadServersToTry() controls how many dead servers will be tried.
* If no live servers are found a SolrServerException is thrown.”

What was the status of the cluster at the moment the exception happened ?
Was any Solr server UP and running according to Zookeeper knowledge ?

The recommendation is to check the clusterstate.json when the exception happens.
From the Solr admin UI you can open Cloud->Tree and verify which nodes are up and running.

It could be very much related a node failure ( that could be related to any possible reason including GC)
I’ve seen situations where it was caused by a specific query, the real exception got hidden by a “No live SolrServers…” client exception.
Solr logs should help to identify the inner Solr problem and JVM monitoring could discard any memory/gc problem.
Some people saw this with wildcard queries (when every shard reported a “too many expansions…”
type error, but the exception in the client response was “No live SolrServers…”.

org.apache.solr.common.SolrException: Could not find a healthy node to handle the request

Pretty much same considerations as the “No Live Solr Server”.
This happens when the load balancer SolrJ side is unable to retrieve an alive node, from the cluster ( based on Zookeeper state).
This happens before the previous exception, so the request doesn’t even reach the LoadBalancinghttpSolrClient.

[2] SOLR-5565
[4] SOLR-8868