Still Synonyms + Stopwords?? Mamma mia!

The Context

Brief recap of where we arrived in the preceding article: we had the following synonyms and stopwords settings:

  • synonyms = {“out of warranty”,”oow”}
  • stopwords = {“of”}

Both of those filters were configured exclusively at query-time; the synonym filter first and then the stopwords filter.

Using the built-in StopFilter we had a synonym detection issue because the removal of the “of” term in the query string (e.g. “my device ran out of warranty“). For that reason, we introduced a custom StopFilter subclass which was aware about stopwords in synonyms.

The other scenario we are going to describe is a little bit different: let’s suppose we have the following data:

  • synonyms = {test code, tdd, testing}
  • stopwords = {my, your, how ,to, in}

Still here, we want to manage synonyms and stopwords only at query time.
We have this document indexed:

   {
      "id": 1,
      "title": "Java programmer: do you want to test your code?"
   }

And a query like this:

"how to test code in Java?"

The Problem: missing synonym match

The query parser matches the “test code” synonym in the query and produces a query like this:

(title:tdd title:testing PhraseQuery(title:"test code")) title:java

unfortunately there’s no match, because the document title contains an intruder: the “your” term between the “test” and “code”.

A Solution: invisible queries with and without synonym phrases

In the preceding article we’ve underlined the role of the autoGeneratePhraseQueries flag. It is the responsible of creating phrase clauses for all detected multi-terms synonyms. In case this flag is set to false (or even missing) the generated query won’t have any phrase, even if a multi-term synonym is detected.

While ordinarily this is not what you would expect, in this specific case it could be a valid alternative for dealing with such mismatching: a first request would require the “synonym phrasing” behaviour, a second one wouldn’t. The first query would be:

(title:tdd title:testing PhraseQuery(title:"test code")) title:java

After receiving an empty response, a second query will be sent, targeting another (similar) field related to a field type which has the autoGeneratePhraseQueries parameter will be set to false. That would generates the following query:

(title:testing title:tdd (+title:test +title:code)) title:java

and here we would get a match!

A couple of notes:

  • in the second try we are requiring the disjoint presence of those two terms (“test” and “code”) in whatever order, with whatever proximity, so the increased recall could produce some unexpected results. In case we are using the edismax query parser, a “pf” parameter would be helpful for moving up those results which adhere better to the entered query, in terms of proximity and terms order.
  • we could put the stop filter at index time, but that violates the precondition: we want a pure query-time management.

How to implement such search workflow? In Solr, we need a couple of fields, the first one is exactly the field + field type we described in the preceding article, the second is similar, the only difference is in the autoGeneratePhraseQueries parameter, which is set to false:

<fieldtype 
       name="text_with_synonyms_phrases" 
       class="solr.TextField" autoGeneratePhraseQueries="true">
       
       <analyzer type="index">
           <tokenizer class="solr.StandardTokenizerFactory"/>
           <filter class="solr.LowerCaseFilterFactory"/>
       </analyzer>
       <analyzer type="query">
           <tokenizer class="solr.StandardTokenizerFactory"/>
           <filter class="solr.LowerCaseFilterFactory"/>
           <filter class="solr.SynonymGraphFilterFactory" 
                   synonyms="synonyms.txt" 
                   ignoreCase="false" 
                   expand="true"/>
           <filter class="sc.SynonymAwareStopFilterFactory" 
                   words="stopwords.txt" 
                   ignoreCase="true"/>
       </analyzer>
</fieldtype>
<fieldtype 
       name="text_without_synonyms_phrases" 
       class="solr.TextField" autoGeneratePhraseQueries="false">
       
       <analyzer type="index">
           <tokenizer class="solr.StandardTokenizerFactory"/>
           <filter class="solr.LowerCaseFilterFactory"/>
       </analyzer>
       <analyzer type="query">
           <tokenizer class="solr.StandardTokenizerFactory"/>
           <filter class="solr.LowerCaseFilterFactory"/>
           <filter class="solr.SynonymGraphFilterFactory" 
                   synonyms="synonyms.txt" 
                   ignoreCase="false" 
                   expand="true"/>
           <filter class="sc.SynonymAwareStopFilterFactory" 
                   words="stopwords.txt" 
                   ignoreCase="true"/>
       </analyzer>
</fieldtype>

<field 
      name="title_with_synonyms_phrases" 
      type="text_with_synonyms_phrases .../>
<field 
      name="title_without_synonyms_phrases" 
      type="text_without_synonyms_phrases .../>

then, here is the minimal request handler:

<requestHandler name="/search" class="solr.SearchHandler" default="true">
       <lst name="defaults">
           <bool name="sow">false</bool>
           <str name="df">title_with_synonyms_phrases</str>
           <str name="defType">lucene</str> 
       </lst>
   </requestHandler>

A client would send first a request like this:

/search?q=how to test code in Java

And, after receiving an empty response, it will send a second query:

/search?q=how to test code in Java&df=text_without_synonyms_phrases

Another option, which moves the search workflow on Solr side, is our CompositeRequestHandler [1], a Solr component which invokes in chain a set of RequestHandler instances: a first request handler, targeting the title_with_synonyms_phrases would be invoked and, in case of zero results, the same query will be sent to another request handler, which would target the title_without_synonyms_phrases.

Note for Elasticsearch users: you will find some difference in applying what is described above. Although the auto_generate_phrase_queries attribute is also present in Elasticsearch, it doesn’t have the same effect. What you’re looking for is an attribute which is not related with field types, it is a query attribute [2] [3]  and it is called auto_generate_synonyms_phrase_query.


[1] https://github.com/SeaseLtd/composite-request-handler
[2] Match Query / Synonyms
[3] Query String Query / Synonyms