Synonyms and Stopwords: Vademecum

In this post we’ll cover two additional synonym scenarios and we’ll try to summarise all the previous tips in a concise form. Following the approach of the previous posts [1] [2] [3], everything can be applied both to Apache Solr and Elasticsearch.

Preconditions

  • Synonyms and stopwords at query time: this is not just a “theoretical” constraint; imagine having to manage, for the same customer, a deployment context with a lot of small/medium indexes: you cannot rebuild everything from scratch each time a synonym or a stopword changes.
  • Synonyms, not hypernyms or hyponyms: or better, we aren’t talking about what a thesaurus calls broader, narrower or related terms. Although some of the things below could also be valid in those contexts, the broader or narrower scope introduced with hypernyms, hyponyms or related concepts can have some weird side-effects on the scoring phase.

Test data

Let’s start with the test data.

  • synonyms = [“out of warranty, oow”, “transfer phone number, port number”]
  • stopwords = [“of”, “my”]
  • query analyzer = [ “standard_tokenizer”, “lowercase filter”, “synonyms (graph) filter”, “stopwords filter”]

#1: How can I define Multi-term Concepts?

If you want to manage a multi-term concept as a whole, regardless of whether it has synonyms or not, you can use the synonyms file. Here’s a couple of examples: the first is a concept with one synonym, the second one doesn’t have any synonym:

Multimedia Messaging Service,Multimedia Text Message,MMS
Apache Cassandra, Apache Cassandra

As you can see, when a concept doesn’t have any available synonym, we can just repeat it.

Solr users only: don’t forget the following things:

  • the request handler should use an edismax or lucene query parser, and the SplitOnWhiteSpace flag (sow) must be set to true
  • the field type which includes the synonyms graph filter must have the autoGeneratePhraseQueries set to true

You can read more here [1] about this approach.

Note: this works only as long as the Lucene SynonymMap uses a List/Array for collecting the synonyms associated with a given concept. If and when the implementation switches to a Set-like approach, there’s a high chance this trick will stop working.

#2: What if the query contains multi-term concepts with stopwords?

Imagine a query like this:

q=my car is out of warranty. What can I do?

Well, with the configuration above, the stopwords removal after the synonym detection causes a weird effect on the generated query: the “what” term is wrongly added to the synonym phrase query: “out ? warranty what”.

While the issue affects the FilteringTokenFilter (the superclass of StopFilter) and therefore has a wider scope, for this specific problem we proposed a solution [2], consisting of a specialised StopFilter which is aware of synonym tokens. The result is that terms which are part of a previously detected synonym are not removed, even if they are stopwords. The query analyzer of our field becomes something like this:

<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SynonymGraphFilterFactory" 
        synonyms="synonyms.txt" 
        ignoreCase="false" 
        expand="true"/>
<filter class="io.sease.SynonymAwareStopFilterFactory" 
        words="stopwords.txt" 
        ignoreCase="true"/>

#3: What if the document contains multi-term concepts with “intruder” stopwords?

We have a document like this:

{
  "id": 1,
  "title": "how do I transfer my phone number?"
}

and the query:

q=transfer phone number procedure

at query time, the synonym is correctly detected and phrase clauses are generated, but unfortunately it doesn’t match the document above because of the intermediate “my” stopword.

You can read here [3] the proposed solution for this scenario, which basically consists of a two-step query plan: in the first step, the detected synonyms generate phrase clauses, while in the second they are destructured into term clauses.

#4: What if the query contains multi-term concepts with “intruder” stopwords?

And here we are in the opposite case. We have a document like this:

{ 
  "id": 1, 
  "title": "transfer phone number procedure" 
}

and the query:

q=how do I transfer my phone number?

As you can see, at query time the synonym is not detected because of the “my” stopword between the terms. While the document above could still be part of the response to the generated query, here we are focusing on the missing synonym detection.

A possible solution is to duplicate the synonym filter before and after the stopwords filter:

<fieldtype 
       name="text_with_synonyms_phrases" 
       class="solr.TextField" autoGeneratePhraseQueries="true">
       
       <analyzer type="index">
           <tokenizer class="solr.StandardTokenizerFactory"/>
           <filter class="solr.LowerCaseFilterFactory"/>
       </analyzer>
       <analyzer type="query">
           <tokenizer class="solr.StandardTokenizerFactory"/>
           <filter class="solr.LowerCaseFilterFactory"/>
           <filter class="solr.SynonymGraphFilterFactory" 
                   synonyms="synonyms.txt" 
                   ignoreCase="true" 
                   expand="true"/>
           <filter class="io.sease.SynonymAwareStopFilterFactory" 
                   words="stopwords.txt" 
                   ignoreCase="true"/>
           <filter class="solr.SynonymGraphFilterFactory" 
                   synonyms="synonyms.txt" 
                   ignoreCase="true" 
                   expand="true"/>
       </analyzer>
</fieldtype>

In the first iteration the synonym is not detected; the StopFilter then removes the “my” stopword, so in the second iteration the synonym is correctly recognized. Note that the StopFilter is still the custom class we introduced in #2, because we also want to cover that scenario.

What is the drawback of this approach? It worked in my specific case, but be aware that the SynonymGraphFilter documentation states this explicit warning:

NOTE: this cannot consume an incoming graph; results will be undefined.

#5 (UNSOLVED): What if the query contains multi-term concepts with more than one “intruder” stopword?

This is the worst case, where we have a query like this:

q=out of my warranty

That is: we have a couple of terms which have been declared as stopwords, but the first (of) is potentially part of a synonym (out of warranty) while the second (my) isn’t.

We’re still working on this case, so unfortunately there’s no proposal here; if you have any ideas or feedback, they are warmly welcome.


[1] Multi-terms concepts in Apache Solr / Elasticsearch
[2] SynonymAwareStopFilter
[3] https://sease.io/2018/08/still-synonyms-stopwords-mamma-mia.html

Still Synonyms + Stopwords?? Mamma mia!

The Context

Brief recap of where we arrived in the preceding article: we had the following synonyms and stopwords settings:

  • synonyms = {“out of warranty”,”oow”}
  • stopwords = {“of”}

Both of those filters were configured exclusively at query-time; the synonym filter first and then the stopwords filter.

Using the built-in StopFilter we had a synonym detection issue, caused by the removal of the “of” term from the query string (e.g. “my device ran out of warranty“). For that reason, we introduced a custom StopFilter subclass which was aware of stopwords belonging to synonyms.

The other scenario we are going to describe is a little bit different: let’s suppose we have the following data:

  • synonyms = {test code, tdd, testing}
  • stopwords = {my, your, how, to, in}

Here too, we want to manage synonyms and stopwords only at query time.
We have this document indexed:

   {
      "id": 1,
      "title": "Java programmer: do you want to test your code?"
   }

And a query like this:

"how to test code in Java?"

The Problem: missing synonym match

The query parser matches the “test code” synonym in the query and produces a query like this:

(title:tdd title:testing PhraseQuery(title:"test code")) title:java

unfortunately there’s no match, because the document title contains an intruder: the “your” term between “test” and “code”.

A Solution: invisible queries with and without synonym phrases

In the preceding article we underlined the role of the autoGeneratePhraseQueries flag. It is responsible for creating phrase clauses for all detected multi-term synonyms. If this flag is set to false (or missing), the generated query won’t have any phrase clause, even if a multi-term synonym is detected.

While ordinarily this is not what you would expect, in this specific case it could be a valid alternative for dealing with such a mismatch: a first request would require the “synonym phrasing” behaviour, a second one wouldn’t. The first query would be:

(title:tdd title:testing PhraseQuery(title:"test code")) title:java

After receiving an empty response, a second query will be sent, targeting another (similar) field whose field type has the autoGeneratePhraseQueries parameter set to false. That would generate the following query:

(title:testing title:tdd (+title:test +title:code)) title:java

and here we would get a match!

A couple of notes:

  • in the second try we are requiring the disjoint presence of those two terms (“test” and “code”) in whatever order, with whatever proximity, so the increased recall could produce some unexpected results. If we are using the edismax query parser, a “pf” parameter would help move up those results which better adhere to the entered query, in terms of proximity and term order (see the example after this list).
  • we could put the stop filter at index time, but that violates the precondition: we want a pure query-time management.
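As an illustration of the “pf” hint above, a request like the following (a sketch: the boost and slop values are purely indicative, and the field is the one defined below) would keep the increased recall while rewarding proximity:

/search?q=how to test code in Java&defType=edismax&qf=title_without_synonyms_phrases&pf=title_without_synonyms_phrases^5&ps=2

Here pf re-ranks documents where the query terms appear close together in the given field, and ps (phrase slop) controls how loose that phrase match can be.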

How do we implement such a search workflow? In Solr, we need a couple of fields: the first one is exactly the field + field type we described in the preceding article; the second is similar, the only difference being the autoGeneratePhraseQueries parameter, which is set to false:

<fieldtype 
       name="text_with_synonyms_phrases" 
       class="solr.TextField" autoGeneratePhraseQueries="true">
       
       <analyzer type="index">
           <tokenizer class="solr.StandardTokenizerFactory"/>
           <filter class="solr.LowerCaseFilterFactory"/>
       </analyzer>
       <analyzer type="query">
           <tokenizer class="solr.StandardTokenizerFactory"/>
           <filter class="solr.LowerCaseFilterFactory"/>
           <filter class="solr.SynonymGraphFilterFactory" 
                   synonyms="synonyms.txt" 
                   ignoreCase="false" 
                   expand="true"/>
           <filter class="sc.SynonymAwareStopFilterFactory" 
                   words="stopwords.txt" 
                   ignoreCase="true"/>
       </analyzer>
</fieldtype>
<fieldtype 
       name="text_without_synonyms_phrases" 
       class="solr.TextField" autoGeneratePhraseQueries="false">
       
       <analyzer type="index">
           <tokenizer class="solr.StandardTokenizerFactory"/>
           <filter class="solr.LowerCaseFilterFactory"/>
       </analyzer>
       <analyzer type="query">
           <tokenizer class="solr.StandardTokenizerFactory"/>
           <filter class="solr.LowerCaseFilterFactory"/>
           <filter class="solr.SynonymGraphFilterFactory" 
                   synonyms="synonyms.txt" 
                   ignoreCase="false" 
                   expand="true"/>
           <filter class="sc.SynonymAwareStopFilterFactory" 
                   words="stopwords.txt" 
                   ignoreCase="true"/>
       </analyzer>
</fieldtype>

<field 
      name="title_with_synonyms_phrases" 
      type="text_with_synonyms_phrases" .../>
<field 
      name="title_without_synonyms_phrases" 
      type="text_without_synonyms_phrases" .../>

then, here is the minimal request handler:

<requestHandler name="/search" class="solr.SearchHandler" default="true">
       <lst name="defaults">
           <bool name="sow">false</bool>
           <str name="df">title_with_synonyms_phrases</str>
           <str name="defType">lucene</str> 
       </lst>
   </requestHandler>

A client would first send a request like this:

/search?q=how to test code in Java

And, after receiving an empty response, it would send a second query:

/search?q=how to test code in Java&df=title_without_synonyms_phrases

Another option, which moves the search workflow to the Solr side, is our CompositeRequestHandler [1], a Solr component which invokes a set of RequestHandler instances in a chain: a first request handler, targeting title_with_synonyms_phrases, would be invoked and, in case of zero results, the same query would be sent to another request handler, targeting title_without_synonyms_phrases.
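If you prefer to keep the workflow on the client side, here is a minimal SolrJ sketch of the two-step logic described above (a sketch, not a drop-in implementation: the core name and error handling are illustrative; the field and handler names are those defined in this post):

import java.io.IOException;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrDocumentList;

public class TwoStepSearchClient {
    public static void main(String[] args) throws SolrServerException, IOException {
        try (HttpSolrClient client =
                new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build()) {
            // First try: the default field generates synonym phrase clauses.
            SolrDocumentList results =
                search(client, "how to test code in Java", "title_with_synonyms_phrases");
            if (results.getNumFound() == 0) {
                // Fallback: same query, but synonyms are destructured into term clauses.
                results = search(client, "how to test code in Java", "title_without_synonyms_phrases");
            }
            System.out.println("Found " + results.getNumFound() + " document(s)");
        }
    }

    private static SolrDocumentList search(HttpSolrClient client, String q, String field)
            throws SolrServerException, IOException {
        SolrQuery query = new SolrQuery(q);
        query.setRequestHandler("/search");
        query.set("df", field); // overrides the handler's default field
        return client.query(query).getResults();
    }
}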

Note for Elasticsearch users: you will find some differences in applying what is described above. Although the auto_generate_phrase_queries attribute is also present in Elasticsearch, it doesn’t have the same effect. What you’re looking for is an attribute which is not related to field types: it is a query attribute [2] [3] called auto_generate_synonyms_phrase_query.
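For example, a fallback search equivalent to the second Solr query could be a match query with that attribute explicitly disabled (a sketch based on the examples above):

{
  "query": {
    "match": {
      "title": {
        "query": "how to test code in Java",
        "auto_generate_synonyms_phrase_query": false
      }
    }
  }
}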


[1] https://github.com/SeaseLtd/composite-request-handler
[2] Match Query / Synonyms
[3] Query String Query / Synonyms

Synonyms + Stopwords?? OMG!

The Context

The scenario description is quite simple: we want to use synonyms and stopwords.

Following the path of our previous article, we will introduce an additional component in the analysis chain: a StopFilter, which, as the name suggests, removes a set of words from an incoming token stream.

We will use the following data through the examples:

  • synonyms = [“out of warranty”,”oow”]
  • stopwords = [“of”]

Token filters can be configured at index and/or query time. In this context we are focused on the query side: both synonyms and stopwords will be configured only in the query analyzer.

Working exclusively at query time has a great benefit: we can change things at runtime without any need to reindex. At the same time, no stopword filtering will be executed at index time, so those terms will be uselessly part of the dictionary.

The Problem: synonyms followed by stopwords

We have the following analyzers:

  • index analyzer
    • standard-tokenizer
    • lowercase
  • query analyzer
    • standard-tokenizer
    • lowercase + synonyms + stopwords

Theoretically, in the query analyzer we would have two options: the stopwords filter could be defined before or after the synonym filter. However, the first way (before) doesn’t make much sense, because terms that are stopwords and that are, at the same time, part of a synonym will be removed before the synonym detection. As a consequence, those synonyms won’t be detected: with the example data, issuing a query like

q=out of warranty

the “of” term will be removed by the StopFilter, and the subsequent synonym filter will receive [“out”, “warranty”], which doesn’t match the configured synonym (“out of warranty”).

Elasticsearch users: Elasticsearch doesn’t allow this scenario at all; if you try to use the PUT Settings API with a chain defined as above (first stopwords, then synonyms with some term intersection), it will throw an illegal argument exception saying “term: out of warranty analyzed to a token (warranty) with position increment != 1 (got: 2)”.

Apache Solr instead uses a lenient approach: no errors at index creation, but the problem remains (personally, I prefer the Elasticsearch approach).

So the obvious choice is to postpone the stopwords management until after the synonym filter. Unfortunately, here there’s an issue: the stopword(s) removal has an unwanted side-effect on the generated token graph, and the query parser, which consumes the token stream at the end of the chain, generates a wrong query.

Let’s imagine we have the following query:

q=tv went out of warranty something of

it will generate the following:

title:tv title:went (title:oow PhraseQuery(title:"out ? warranty something"))

As you can see, the synonym (out of warranty -> oow) is correctly detected, but the stopwords filter removes all the “of” tokens, even if the first occurrence is part of a synonym. In the generated query you can see the sneaky effect: the “hole” created by the removal of the first “of” occurrence produces the inclusion, in the phrase query, of the next available token in the stream (“something”, in the example).

In other words, the oow synonym token is marked with positionLength = 3, which correctly means it spans three tokens (1=out, 2=of, 3=warranty); later, the query parser will include the next three available terms when generating the synonym phrase query, but since we no longer have the 2nd token (of), that count also includes “something”, which is the 3rd available token in the stream.
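A quick way to visualise that “hole” is to print the position attributes produced by the query analyzer. Here is a minimal sketch using Lucene’s CustomAnalyzer (the factory SPI names are those of Lucene 7.x, and synonyms.txt / stopwords.txt are the example files, loaded from the classpath):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionLengthAttribute;

public class TokenGraphInspector {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = CustomAnalyzer.builder()
            .withTokenizer("standard")
            .addTokenFilter("lowercase")
            .addTokenFilter("synonymGraph", "synonyms", "synonyms.txt", "expand", "true")
            .addTokenFilter("stop", "words", "stopwords.txt")
            .build();

        try (TokenStream stream =
                analyzer.tokenStream("title", "tv went out of warranty something of")) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            PositionIncrementAttribute posInc = stream.addAttribute(PositionIncrementAttribute.class);
            PositionLengthAttribute posLen = stream.addAttribute(PositionLengthAttribute.class);
            stream.reset();
            int position = 0;
            while (stream.incrementToken()) {
                position += posInc.getPositionIncrement();
                // a position increment > 1 reveals the "hole" left by a removed stopword
                System.out.printf("%-12s position=%d positionLength=%d%n",
                    term, position, posLen.getPositionLength());
            }
            stream.end();
        }
    }
}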

Before proceeding: this is a known, long-standing issue [1] in Lucene, which has a broader domain because it is related to the FilteringTokenFilter, the superclass of StopFilter.

The problem we will try to solve is: how can we manage synonyms and stopwords at query time without generating the conflict above?

A Solution

A note first: the token filter we are going to create deals only with Lucene classes. However, when things need to be plugged into a runtime container (e.g. Apache Solr or Elasticsearch), the deployment procedure depends on the target platform: we won’t cover that part here.

The proposed solution is to create a StopFilter subclass which is “synonym-aware”: it will check the type and positionLength attributes before deciding if a token needs to be removed from the stream. The goal is to avoid removing those terms which have been defined in the stopwords list but are part of a synonym definition.

The class that we are going to extend is org.apache.lucene.analysis.core.StopFilter. This is an empty class, because all the filtering logic is in the superclasses (org.apache.lucene.analysis.StopFilter and the more generic org.apache.lucene.analysis.FilteringTokenFilter). The stopwords logic resides in the accept() method, which as you can see is very simple:

protected boolean accept() {
  return !stopWords.contains(termAtt.buffer(), 0, termAtt.length());
}

If the stopwords list contains the current term, it will be removed. So far, so good. We need to extend (actually we could also decorate) the StopFilter class for doing something else before calling the logic above.

First we need to check the token type: if a token has been marked as a SYNONYM, then our filter must not remove it. Then we need to check the positionLength attribute because, within a synonym detection context, a position length greater than 1 means we are traversing a multi-term synonym:

public class SynonymAwareStopFilter extends StopFilter {

  private TypeAttribute tAtt = 
                             addAttribute(TypeAttribute.class);
  private PositionLengthAttribute plAtt = 
                             addAttribute(PositionLengthAttribute.class);

  private int synonymSpans;

  protected SynonymAwareStopFilter(
                         TokenStream in, CharArraySet stopwords) {
    super(in, stopwords);
  }

  @Override
  protected boolean accept() {
    if (isSynonymToken()) {
      synonymSpans = plAtt.getPositionLength() > 1 
                             ? plAtt.getPositionLength() 
                             : 0;
      return true;
    }

    return (--synonymSpans > 0) || super.accept();
  }

  private boolean isSynonymToken() {
    return "SYNONYM".equals(tAtt.type());
  }
}
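
The Solr configurations in this series reference a SynonymAwareStopFilterFactory, which the original posts don’t show. Here is a minimal sketch of what it could look like, modelled on Lucene’s own StopFilterFactory (a sketch under assumptions: the factory must live in the same package as the filter, since the filter’s constructor is protected, and Lucene 7.x package names are used):

import java.io.IOException;
import java.util.Map;

import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.util.ResourceLoader;
import org.apache.lucene.analysis.util.ResourceLoaderAware;
import org.apache.lucene.analysis.util.TokenFilterFactory;

public class SynonymAwareStopFilterFactory extends TokenFilterFactory
                                           implements ResourceLoaderAware {

    private final String stopWordFiles;
    private final boolean ignoreCase;
    private CharArraySet stopWords;

    public SynonymAwareStopFilterFactory(Map<String, String> args) {
        super(args);
        stopWordFiles = require(args, "words");
        ignoreCase = getBoolean(args, "ignoreCase", false);
        if (!args.isEmpty()) {
            throw new IllegalArgumentException("Unknown parameters: " + args);
        }
    }

    @Override
    public void inform(ResourceLoader loader) throws IOException {
        // loads the stopwords file declared in the schema ("words" attribute)
        stopWords = getWordSet(loader, stopWordFiles, ignoreCase);
    }

    @Override
    public TokenStream create(TokenStream input) {
        return new SynonymAwareStopFilter(input, stopWords);
    }
}
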

Let’s do some tests. We will use Apache Solr 7.4.0 to check the results. Here is the field type definition, where you can see our SynonymAwareStopFilter:

<fieldtype name="text" class="solr.TextField" autoGeneratePhraseQueries="true">
       <analyzer type="index">
           <tokenizer class="solr.StandardTokenizerFactory"/>
           <filter class="solr.LowerCaseFilterFactory"/>
       </analyzer>
       <analyzer type="query">
           <tokenizer class="solr.StandardTokenizerFactory"/>
           <filter class="solr.LowerCaseFilterFactory"/>
           <filter class="solr.SynonymGraphFilterFactory" 
                   synonyms="synonyms.txt" 
                   ignoreCase="false" 
                   expand="true"/>
           <filter class="sc.SynonymAwareStopFilterFactory" 
                   words="stopwords.txt" 
                   ignoreCase="true"/>
       </analyzer>
</fieldtype>

and this is a minimal request handler:

<requestHandler name="/def" class="solr.SearchHandler" default="true">
       <lst name="defaults">
           <bool name="sow">false</bool>
           <str name="df">title</str>
           <str name="defType">lucene</str>
           <bool name="debug">true</bool>
       </lst>
   </requestHandler>

Running the previous query:

q=tv went out of warranty something of

we have the following:

title:tv title:went (title:oow PhraseQuery(title:"out of warranty")) title:something

if we instead use the other synonym variant:

q=tv went oow something of

we have the following:

title:tv title:went (PhraseQuery(title:"out of warranty") title:oow) title:something

Everything seems to work as expected! This is probably just one specific scenario among those addressed by LUCENE-4065; however, it helped me a lot because (at least in my experience) this is a frequent use case.

As usual, any feedback is warmly welcome. See you next time!

 


[1] https://issues.apache.org/jira/browse/LUCENE-4065

Apache Solr/Elasticsearch: How to Manage Multi-term Concepts out of the Box?

This flash blog post will address a very specific and common problem: how to manage entities/concepts composed of multiple terms in a vanilla Apache Solr/Elasticsearch instance (no plugins or extensions to install).

The (deployment) context

An Elasticsearch or Apache Solr infrastructure where you cannot install third-party components (e.g. plugins, filters, query parsers). This can happen for several reasons:

  • endogenous factors: lack of required expertise/skills for implementing/installing things in your infrastructure.
  • exogenous factors: search capabilities live in an external and managed context which doesn’t allow custom components. This happens for example with services like the Amazon Elasticsearch Service [1].

The Problem

How can I model multi-term concepts (i.e. concepts composed of multiple terms)?

Concepts are a fundamental piece of a domain-specific vocabulary: “United States of America”, “Phone Number”, “Out of Warranty” are just examples of entities you probably want to manage as a whole; if a user searches for something like “How can I transfer a Phone Number?”, you probably don’t want to return things about numbers or phones, which have a broader scope (i.e. sacrificing precision in favour of recall).

Note, “probably” is in bold because there’s no absolute truth here: everything depends on the functional context where the application is running. Here we assume this requirement, but things could be different in another context.

Index/Query Time Solutions

The problem can be solved using two different approaches that involve index-time and query-time configurations:

  • Simple Contraction: using the SynonymFilter with Simple Contraction [2] [3] in order to inject a single term that represents the concept (with optional synonyms):
Multimedia Messaging Service,Multimedia Text Message => mms
  • Shingles: combining shingles and a keep-word filter for generating bigrams and trigrams from a given text, assuming we limit our interest to concepts composed of a maximum of 3 terms (see the sketch after this list)
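For the shingle-based alternative, a hedged sketch of the analyzer (the filter parameters are real, but the chain is an illustration meant for a dedicated “concepts” field, not a tested configuration; concepts.txt would contain the allowed bigrams and trigrams):

<analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory" 
            minShingleSize="2" 
            maxShingleSize="3" 
            outputUnigrams="true"/>
    <filter class="solr.KeepWordFilterFactory" 
            words="concepts.txt" 
            ignoreCase="true"/>
</analyzer>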

…an Additional Constraint: Query-Time Only

One of the drawbacks of the first group (index + query time) is the cost of reindexing the whole dataset when a change occurs in the synonyms or in the keywords list. So the additional constraint we will introduce is: we want to be able to change the concepts list at runtime without any reindexing.

Again, this is not an absolute constraint: there are a lot of scenarios where reindexing everything is completely ok. In my experience this has a direct correlation with the index size and with how long the whole reindexing process takes.

So, in other words: if the full reindexing process takes a reasonable amount of time and doesn’t produce any service interruption, then you should consider removing the “query-time only” constraint.

A Solution

Synonym management has been enhanced in Elasticsearch and Apache Solr [4] with the introduction of “graph-aware” token filters, enabling full support of multi-word synonyms.

Prior to that, both the index/query-time and the query-time-only approaches suffered from some limitations when dealing with multi-word synonyms [5].

In any case, the current implementation allows us to correctly manage multi-word synonyms as a whole. So, coming back to our question, we can try to “shape” the synonym filter to our will for managing compound concepts as well.

First of all, a compound concept may or may not have synonyms.

Concept with Synonym(s)

If the concept has one or more synonyms, we are within a regular context of the synonym filter. Here’s a sample content of the synonyms.txt file:

Multimedia Messaging Service,Multimedia Text Message,MMS
USA,United States of America
...

Here’s the corresponding configuration:

<fieldtype name="txt" 
           class="solr.TextField" 
          autoGeneratePhraseQueries="true">
       <analyzer type="index">
           <tokenizer class="solr.StandardTokenizerFactory"/>
           <filter class="solr.LowerCaseFilterFactory"/>
       </analyzer>
       <analyzer type="query">
           <tokenizer class="solr.StandardTokenizerFactory"/>
           <filter class="solr.LowerCaseFilterFactory"/>
           <filter 
              class="solr.SynonymGraphFilterFactory" 
              synonyms="synonyms.txt" 
              expand="true"/>
       </analyzer>
</fieldtype>
<field name="title" type="txt" indexed="true" stored="true"/>

"analysis": {
    "filter": {
      "english_synonyms": {
         "type": "synonym_graph",
         "synonyms_path": "synonyms.txt",
         "expand": true
      }
    },
    "analyzer": {
      "text_index_analyzer": {
        "tokenizer": "standard",
        "filter": [ "lowercase" ]
      },
      "text_query_analyzer": {
        "tokenizer": "standard",
        "filter": [
          "lowercase",
          "english_synonyms"
        ]
      }
    }
...
"properties": {
   "title": {
      "type": "text",
      "analyzer": "text_index_analyzer",
      "search_analyzer": "text_query_analyzer"
   }
}

Issuing the following queries:

q={!lucene}multimedia messaging service&sow=false&df=title

{
  "query": {
     "match": {
       "title": "Multimedia messaging service"
     }
   }
}

{
   "query": {
     "query_string": {
       "query": "Multimedia messaging service",
       "default_field": "title"
     }
   }
}

produces (use debug=true in Solr or _validate/query?explain=true in Elasticsearch) the following query:

(PhraseQuery(title:"multimedia text message") 
 title:MMS 
 PhraseQuery(title:"multimedia messaging service"))

which is exactly what we want: the concept expressed in the query has been detected, and the output query searches for it as a whole, together with its synonym forms (“multimedia text message”, “MMS”). So far, so good.

Concept without Synonyms

What if a multi-term concept doesn’t have any variant/synonym but we still want to manage it as a whole? Can we use the synonym filter without synonyms? Let’s see what happens.

The first obvious idea is to declare something like this, in our synonyms.txt:

Multimedia Messaging Service  
...

That is, a line containing our concept without any synonym. Unfortunately this doesn’t work: the engine detects there are no synonyms and the resulting query is something like this:

title:multimedia title:messaging title:service

The same happens if you put a dummy synonym which is later removed, in the analysis chain, by a stopwords filter. Something like:

Multimedia Messaging Service, something_that_will_be_configured_as_stopword
...

So the last chance is to duplicate the entry, with something like this:

Multimedia Messaging Service,Multimedia Messaging Service
...

Here things start to get interesting. Running the same queries above, we get the following explain:

(PhraseQuery(title:"multimedia messaging service") 
 PhraseQuery(title:"multimedia messaging service"))

title:"multimedia messaging service"^2.0

The phrase query is doubled, which means the score assigned to matches will reflect this, as illustrated in the example explain below (see the 2.0 boost):

4.544185 = sum of:
  4.544185 = weight(title:"multimedia messaging service" in 1) [SchemaSimilarity], result of:
    4.544185 = score(doc=1,freq=1.0 = phraseFreq=1.0
), product of:
      2.0 = boost
      2.7725887 = idf(), sum of:
        0.6931472 = idf, computed as ... from:
          1.0 = docFreq
          2.0 = docCount
        0.6931472 = idf, computed as ... from:
          1.0 = docFreq
          2.0 = docCount
        0.6931472 = idf, computed as ... from:
          1.0 = docFreq
...

That is, the score of the match above would have been 2.2720926 (4.544185 / 2), but since we have that artificial boost it has been doubled; at the same time, it’s important to underline that our concept has been correctly managed as a whole: the queries above won’t return items related to messages, multimedia or services, which are broader concepts.

Is that boost factor a problem? That actually depends on your application: you should

  • have a representative number of search cases
  • try a plain term-centric search
  • try the approach suggested above
  • use a search quality evaluation tool
  • compare and choose

Summary

  • You have an Elasticsearch or Apache Solr cluster
  • You cannot install custom plugins
  • You want to manage compound concepts
  • You don’t want to reindex your corpus when adding / removing / updating the concepts list
  • If a concept has one or more synonyms, this is quite straightforward: use the synonym (graph) filter at query time*
  • If a concept doesn’t have any synonym, you can still use the synonym (graph) filter: just double the concept definition, but keep in mind the double (2.0) boost applied to the corresponding phrase query*

* Apache Solr users: make sure the target field type has autoGeneratePhraseQueries set to true, and the sow parameter (defined in the RequestHandler settings or as a request parameter) set to true as well.

[1] https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/aes-supported-plugins.html

[2] https://www.elastic.co/guide/en/elasticsearch/guide/current/multi-word-synonyms.html#_use_simple_contraction_for_phrase_queries

[3] https://lucene.apache.org/core/6_6_0//analyzers-common/org/apache/lucene/analysis/synonym/SolrSynonymParser.html

[4] https://lucidworks.com/2017/04/18/multi-word-synonyms-solr-adds-query-time-support/#footnote1

[5] https://opensourceconnections.com/blog/2013/10/27/why-is-multi-term-synonyms-so-hard-in-solr/

SolrCloud exceptions with Apache Zookeeper

At the time of writing (Solr 7.3.1), SolrCloud is a reliable and stable distributed architecture for Apache Solr.
But it is not perfect and failures happen.

Apache Zookeeper [1] is the system responsible for managing the communication across the SolrCloud cluster.
It contains the shared collection configurations and it holds the view of the cluster status.
It is part of the brain of the cluster, a keeper that keeps the cluster healthy and functional.

It is able to answer questions such as:

• Who is the leader for this shard and collection?
• Is this node down?
• Is this node recovering?

The Solr nodes communicate with Zookeeper to understand who to contact when running SolrCloud operations.

This lightning blog post will present some practical tips to follow when your client application encounters some classic exceptions while dealing with SolrCloud and Apache Zookeeper.
Special thanks to the Apache Solr user mailing list contributors and the Apache Solr community: this post is an aggregation of recommendations from there and from official code and documentation.

org.apache.solr.common.SolrException: Could not load collection from ZK: <collection name>

If you landed here with just that exception, there is probably a missing cause underneath, such as:
“Caused by: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /collections/<collection name>/state.json”

Solr’s zkClientTimeout is used to set Zookeeper’s sessionTimeout, and that’s what is exceeded when a session expires.
When this kind of exception happens, it means something has gone VERY wrong in the Solr-Zookeeper communication: 30 seconds (the current default [2]) is a REALLY long time for applications trying to communicate.

Recommendation: take care of the different timeouts around, and don’t keep them too small! For example, assign zkClientTimeout a value >= 30 seconds.

maxSessionTimeout (Zookeeper)
New in 3.3.0: the maximum session timeout in milliseconds that the server will allow the client to negotiate. Defaults to 20 times the tickTime.

zkClientTimeout (Solr)
Controls your client timeout.
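As an illustration, this is where the two values live (a sketch with example values, to be adapted to your deployment):

# zoo.cfg (Zookeeper side): with tickTime=2000 the default maxSessionTimeout
# is 20 * tickTime = 40000 ms; raise it if a larger zkClientTimeout is needed
tickTime=2000
maxSessionTimeout=60000

<!-- solr.xml (Solr side), inside the <solrcloud> section -->
<int name="zkClientTimeout">${zkClientTimeout:30000}</int>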

Once the timeouts are checked, let’s explore some possible root causes. A session expiry can be caused by:

1. Garbage collection on the Solr node or on the Zookeeper node – extreme GC pauses can happen if the heap is too small or VERY large
2. Slow IO on disk
3. Network latency

Recommendations

  1. Set up a JVM profiler to monitor your Solr and Zookeeper nodes closely; pay particular attention to the garbage collection cycles and to memory usage in general: you don’t want Zookeeper to swap too much! (GCViewer [3] could be a nice tool for this)
  2. Verify that the Zookeeper node has fast write access to the disk: Zookeeper needs fast writes and, ideally, a separate disk allocated.
  3. Monitor your network and make sure the Solr nodes can talk effectively to the Zookeeper nodes.

If the suggestions above don’t solve your problem, you may be experiencing a Solr bug.
One of them is [4], which unfortunately has not been fixed yet.

org.apache.solr.client.solrj.SolrServerException: No live SolrServers available to handle this

From the official JavaDoc (org/apache/solr/client/solrj/impl/LBHttpSolrClient.java:369):

“Tries to query a live server from the list provided in Req. Servers in the dead pool are skipped. If a request fails due to an IOException, the server is moved to the dead pool for a certain period of time, or until a test request on that server succeeds. Servers are queried in the exact order given (except servers currently in the dead pool are skipped). If no live servers from the provided list remain to be tried, a number of previously skipped dead servers will be tried. Req.getNumDeadServersToTry() controls how many dead servers will be tried. If no live servers are found a SolrServerException is thrown.”

What was the status of the cluster at the moment the exception happened?
Was any Solr server UP and running, according to Zookeeper’s knowledge?

The recommendation is to check the clusterstate.json when the exception happens.
From the Solr admin UI you can open Cloud->Tree and verify which nodes are up and running.

It could very well be related to a node failure (which in turn could have any possible cause, including GC).
I’ve seen situations where it was caused by a specific query, and the real exception got hidden by a “No live SolrServers…” client exception.
Solr logs should help to identify the inner Solr problem, and JVM monitoring could rule out any memory/GC problem.
Some people saw this with wildcard queries (every shard reported a “too many expansions…” type error, but the exception in the client response was “No live SolrServers…”).

org.apache.solr.common.SolrException: Could not find a healthy node to handle the request

Pretty much same considerations as the “No Live Solr Server”.
This happens when the load balancer SolrJ side is unable to retrieve an alive node, from the cluster ( based on Zookeeper state).
This happens before the previous exception, so the request doesn’t even reach the LoadBalancinghttpSolrClient.

[1] https://zookeeper.apache.org
[2] SOLR-5565
[3] https://github.com/chewiebug/GCViewer
[4] SOLR-8868

SolrCloud Leader Election Failing

At the time of writing (Solr 7.3.0), SolrCloud is a reliable and stable distributed architecture for Apache Solr.
But it is not perfect and failures happen.
This lightning blog post will present some practical tips to follow when a specific shard of a collection is down with no leader and the situation is stuck.
The following problem has been experienced with the following Solr versions:

  • 4.10.2
  • 5.4.0

Steps to solve the problem may involve manual interaction with the Zookeeper ensemble [1].
The following steps are extracted from an interesting thread on the Solr user mailing list [2] and from practical experience in the field.
In particular, thanks to Jeff Wartes for the suggestions, which proved useful for me on a couple of occasions.

Problem

  • All nodes for a Shard in a Collection are up and running
  • There is no leader for the shard
  • All the nodes are in a “Recovering” / “Recovery Failed” state
  • Search is down and the situation persists after many minutes (> 5)

Solution

A possible explanation for this problem is that the node-local version of the Zookeeper clusterstate has diverged from the centralized Zookeeper cluster state.
One possible cause for the leader election to break is a Zookeeper failure: for example, you lose >= 50% of the ensemble nodes, or the connectivity among the ensemble nodes, for a certain period of time (this is the scenario I experienced directly).
This failure, even if resolved later, can corrupt the Zookeeper file system.
Some of the SolrCloud collections may remain in an inconsistent status.

It may be necessary to manually delete corrupted files from Zookeeper.
Let’s start from:

collections/<collection>/leader_elect/shard<x>/election
A healthy SolrCloud cluster presents as many core_nodeX entries as the total number of replicas for the shard.
You don’t want duplicates or missing nodes here.
If you’re having trouble getting a sane election, you can try deleting the lowest-numbered entries (as well as any lower-numbered duplicates) and force the election again, possibly followed by restarting the node with that lowest-numbered entry.
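For example, using the Zookeeper command-line client (host, port and the entry name are purely illustrative placeholders):

./zkCli.sh -server localhost:2181
ls /collections/<collection>/leader_elect/shard1/election
delete /collections/<collection>/leader_elect/shard1/election/<lowest-numbered-entry>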

collections/<collection>/leader/shard<x>
Make sure that this folder exists and has the expected replica as a leader.

collections/<collection>/leader_initiated_recovery
This folder can be informative too: it represents replicas that the *leader* thinks are out of sync, usually due to a failed update request.

After having completed the verifications above, there are a couple of Collections API endpoints that may be useful:

Force Leader Election
/admin/collections?action=FORCELEADER&collection=<collectionName>&shard=<shardName>

Force Leader Rebalance
/admin/collections?action=REBALANCELEADERS&collection=collectionName

N.B. rebalancing all the leaders will affect all the shards

 

[1] Apache Zookeeper Solr Cli

[2] Solr Mailing List Thread

[3] Solr Collection API

 

Distributed Search Tips for Apache Solr

Distributed search is the foundation of Apache Solr scalability:

It’s possible to distribute search across different Apache Solr nodes of the same collection (both in a legacy [1] and in a SolrCloud [2] architecture), but it is also possible to distribute search across different collections in a SolrCloud cluster.
Aggregating results from different collections may be useful when you put in place different systems (that were meant to be separate) and you later realize that aggregating the results is an additional useful use case.
This blog will focus on some tricky situations that can happen when running distributed search (for configuration details you can refer to the Solr wiki).

IDF

Inverse Document Frequency affects the score.
This means that a document coming from a big collection can obtain a boost from IDF, in comparison to a similar document from a smaller collection.
This is because the maxDoc count is taken into account as the corpus size, so even if a term has the same document frequency, the IDF will be strongly affected by the collection size.
Distributed IDF [3] partially solved the problem:

When distributing the search across different shards of the same collection, it works quite well.
But using the ExactStatsCache and alternating single-collection distribution and multi-collection distribution in the same SolrCloud cluster will create a caching conflict.

Specifically, if we first execute the inter-collection query, the cached global stats will be the inter-collection global stats; if we then execute a single-collection distributed search, the previous global stats will remain cached (and vice versa).
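For reference, distributed IDF is enabled by declaring a statsCache implementation in solrconfig.xml [3], for example:

<statsCache class="org.apache.solr.search.stats.ExactStatsCache"/>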

Debug Scoring

The real score and the debug score are not aligned when using distributed IDF; this means that the debug query will not show the correct distributed IDF, nor the correct scoring calculation, for distributed searches [4].

Relevancy tuning

The Lucene/Solr score is not probabilistic or normalised.
For the same collection we can get completely different score scales just with different queries.
The situation becomes more complicated when we tune our relevancy adding multiplicative or additive boosts.
Different collections may imply completely different boosting logic that could cause the score of one collection to be on a completely different scale in comparison to another.
We need to be extra careful when tuning relevancy for searches across different collections, and try to configure the distributed request handlers in the most compatible way possible.

Request handler

It is important to carefully specify the request handler to be used when running a distributed search.
The request will hit one collection on one node, and then, when distributing, the same request handler will be called on the other collections across the other nodes.
If necessary, it is possible to configure an aggregator request handler and local request handlers (this may be useful if we want to use different scoring formulas per collection, using local parameters):

Aggregator Request Handler

It is executed on the first node receiving the request.
It will distribute the request and then aggregate the results.
It must describe parameters that are common across all the collections involved in the search.
It is the one specified in the main request, e.g.
http://localhost:8983/solr/collection1/select?

Local Request Handler

It is specified by passing the parameter: shards.qt=
It is executed on each node that receives the distributed query AFTER the first one.
This can be used to apply specific fields or parameters on a per-collection basis.
A local request handler may use fields and search components that are local to the collection involved.

e.g.
http://localhost:8983/solr/collection1/select?q=*:*&collection=collection1,collection2&shards.qt=localSelect

N.B. the use of a local request handler may be useful in case you want to define local query parser rules, such as a local edismax configuration, to affect the score (see the sketch below).
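To make the wiring concrete, here is a minimal sketch of the two handlers in solrconfig.xml (handler names and the qf value are illustrative; depending on the Solr version, the leading slash in the shards.qt value may or may not be required):

<!-- aggregator: receives the main request and distributes it -->
<requestHandler name="/select" class="solr.SearchHandler">
    <lst name="defaults">
        <str name="shards.qt">/localSelect</str>
    </lst>
</requestHandler>

<!-- local handler: executed on each collection receiving the distributed request -->
<requestHandler name="/localSelect" class="solr.SearchHandler">
    <lst name="defaults">
        <str name="defType">edismax</str>
        <str name="qf">title^10 description</str>
    </lst>
</requestHandler>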

Unique Key

The unique key field must be the same across the different collections.
Furthermore, its values should be unique across the different collections, to guarantee proper behaviour.
If we don’t comply with this rule, Solr will fail in aggregating the results and will raise an exception.

[1] https://lucene.apache.org/solr/guide/6_6/distributed-search-with-index-sharding.html
[2] https://lucene.apache.org/solr/guide/6_6/solrcloud.html
[3] https://lucene.apache.org/solr/guide/6_6/distributed-requests.html#DistributedRequests-ConfiguringstatsCache_DistributedIDF_
[4] https://issues.apache.org/jira/browse/SOLR-7759