Synonyms Tips And Tricks
Still Synonyms + Stopwords?? Mamma mia!

The Context

A brief recap of where the preceding article left off: we had the following synonym and stopword settings:

    • synonyms = {“out of warranty”, “oow”}
    • stopwords = {“of”}

Both of those filters were configured exclusively at query-time; the synonym filter first and then the stopwords filter.

Using the built-in StopFilter we had a synonym detection issue because of the removal of the “of” term in the query string (e.g. “my device ran out of warranty”). For that reason, we introduced a custom StopFilter subclass which is aware of stopwords that appear inside synonyms.

The other scenario we are going to describe is a little bit different: let’s suppose we have the following data:

    • synonyms = {test code, tdd, testing}
    • stopwords = {my, your, how, to, in}

Here too, we want to manage synonyms and stopwords only at query time.
We have this document indexed:

   {
      "id": 1,
      "title": "Java programmer: do you want to test your code?"
   }

And a query like this:

"how to test code in Java?"

The Problem: missing synonym match

The query parser matches the “test code” synonym in the query and produces a query like this:

(title:tdd title:testing PhraseQuery(title:"test code")) title:java

Unfortunately there’s no match, because the document title contains an intruder: the “your” term between “test” and “code”.

A Solution: invisible queries with and without synonym phrases

In the preceding article we underlined the role of the autoGeneratePhraseQueries flag: it is responsible for creating phrase clauses for all detected multi-term synonyms. If this flag is set to false (or is missing), the generated query won’t contain any phrase, even if a multi-term synonym is detected.

While ordinarily this is not what you would expect, in this specific case it can be a valid alternative for dealing with such a mismatch: a first request requires the “synonym phrasing” behaviour, a second one doesn’t. The first query would be:

(title:tdd title:testing PhraseQuery(title:"test code")) title:java

After receiving an empty response, a second query is sent, targeting another (similar) field whose field type has the autoGeneratePhraseQueries parameter set to false. That generates the following query:

(title:testing title:tdd (+title:test +title:code)) title:java

and here we would get a match!
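To make the difference between the two queries concrete, here is a toy model in Python. It is only a sketch: the tokenizer is a rough stand-in for StandardTokenizer + LowerCaseFilter, the helper names are made up for illustration, and only the synonym group of each query is modelled (the title:java clause is left out).

```python
import re

def tokenize(text):
    """Lowercase and split on non-alphanumerics (rough stand-in for the index analyzer)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def matches_phrase(tokens, phrase):
    """True if the phrase terms occur at consecutive positions, in order."""
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))

def matches_all_terms(tokens, terms):
    """True if every term occurs somewhere, regardless of order or distance."""
    return set(terms) <= set(tokens)

# No stop filter at index time, so "your" stays in the indexed title.
title = tokenize("Java programmer: do you want to test your code?")

# Synonym group of the first query: tdd OR testing OR "test code" (phrase)
with_phrase = (
    "tdd" in title
    or "testing" in title
    or matches_phrase(title, ["test", "code"])
)

# Synonym group of the second query: tdd OR testing OR (+test +code)
without_phrase = (
    "tdd" in title
    or "testing" in title
    or matches_all_terms(title, ["test", "code"])
)

print(with_phrase)     # False: "your" sits between "test" and "code"
print(without_phrase)  # True: both terms are present
```

The phrase clause fails only because of term positions; once the clause is relaxed to a conjunction of terms, the same title satisfies it.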

A couple of notes:

    • in the second try we only require that both terms (“test” and “code”) are present, in whatever order and with whatever proximity, so the increased recall could produce some unexpected results. If we are using the edismax query parser, a “pf” parameter would be helpful for moving up those results which adhere better to the entered query, in terms of proximity and term order.
    • we could put the stop filter at index time, but that violates the precondition: we want pure query-time management.
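As an illustration of the first note, this is how a client might assemble an edismax request with a phrase boost. Take it as a sketch under assumptions: the field name "title", the slop value and the helper name are picked for the example, and the right values depend on your schema and on how much proximity you want to reward.

```python
from urllib.parse import urlencode

def build_edismax_params(user_query, field="title", phrase_slop=2):
    """Build edismax parameters: match on `field`, boost docs where the terms are close."""
    return {
        "q": user_query,
        "defType": "edismax",
        "sow": "false",          # keep multi-term synonyms detectable
        "qf": field,             # field used for matching
        "pf": field,             # field used for the phrase boost
        "ps": str(phrase_slop),  # slop tolerated by the phrase boost (assumed value)
    }

params = build_edismax_params("how to test code in Java")
print("/search?" + urlencode(params))
```

The pf clause never excludes documents; it only adds score to those where the query terms appear close together, which is what we want after loosening the synonym clause.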

How can we implement such a search workflow? In Solr, we need a couple of fields. The first uses exactly the field and field type described in the preceding article; the second is identical except for the autoGeneratePhraseQueries parameter, which is set to false:

<fieldtype 
       name="text_with_synonyms_phrases" 
       class="solr.TextField" autoGeneratePhraseQueries="true">
       
       <analyzer type="index">
           <tokenizer class="solr.StandardTokenizerFactory"/>
           <filter class="solr.LowerCaseFilterFactory"/>
       </analyzer>
       <analyzer type="query">
           <tokenizer class="solr.StandardTokenizerFactory"/>
           <filter class="solr.LowerCaseFilterFactory"/>
           <filter class="solr.SynonymGraphFilterFactory" 
                   synonyms="synonyms.txt" 
                   ignoreCase="false" 
                   expand="true"/>
           <filter class="sc.SynonymAwareStopFilterFactory" 
                   words="stopwords.txt" 
                   ignoreCase="true"/>
       </analyzer>
</fieldtype>
<fieldtype 
       name="text_without_synonyms_phrases" 
       class="solr.TextField" autoGeneratePhraseQueries="false">
       
       <analyzer type="index">
           <tokenizer class="solr.StandardTokenizerFactory"/>
           <filter class="solr.LowerCaseFilterFactory"/>
       </analyzer>
       <analyzer type="query">
           <tokenizer class="solr.StandardTokenizerFactory"/>
           <filter class="solr.LowerCaseFilterFactory"/>
           <filter class="solr.SynonymGraphFilterFactory" 
                   synonyms="synonyms.txt" 
                   ignoreCase="false" 
                   expand="true"/>
           <filter class="sc.SynonymAwareStopFilterFactory" 
                   words="stopwords.txt" 
                   ignoreCase="true"/>
       </analyzer>
</fieldtype>

<field 
      name="title_with_synonyms_phrases" 
      type="text_with_synonyms_phrases" .../>
<field 
      name="title_without_synonyms_phrases" 
      type="text_without_synonyms_phrases" .../>

Then, here is the minimal request handler:

<requestHandler name="/search" class="solr.SearchHandler" default="true">
       <lst name="defaults">
           <bool name="sow">false</bool>
           <str name="df">title_with_synonyms_phrases</str>
           <str name="defType">lucene</str>
       </lst>
</requestHandler>

A client would first send a request like this:

/search?q=how to test code in Java

And, after receiving an empty response, it would send a second query, this time pointing the default field at the second field:

/search?q=how to test code in Java&df=title_without_synonyms_phrases
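On the client side, the two-step workflow could look like the following Python sketch. The base URL (host, port and core name) and the function names are hypothetical; the field names are the ones declared in the schema above, and the code assumes Solr's standard JSON response with a response.numFound entry.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical endpoint: adjust host, port and core name to your installation.
BASE_URL = "http://localhost:8983/solr/mycore/search"

FIELDS = ("title_with_synonyms_phrases", "title_without_synonyms_phrases")

def solr_search(base_url, params):
    """Send a request to a Solr handler and return the parsed JSON response."""
    with urlopen(f"{base_url}?{urlencode(params)}") as response:
        return json.load(response)

def search_with_fallback(user_query, search=solr_search, base_url=BASE_URL):
    """Query the phrase-generating field first; on zero hits, retry on the other one.

    Returns the field that was queried last and the corresponding response.
    """
    for field in FIELDS:
        result = search(base_url, {"q": user_query, "df": field, "wt": "json"})
        if result["response"]["numFound"] > 0:
            break
    return field, result
```

Calling search_with_fallback("how to test code in Java") issues at most two requests. The search function is a parameter so the workflow can be exercised with a stub, without a running Solr instance.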

Another option, which moves the search workflow to the Solr side, is our CompositeRequestHandler [1], a Solr component that invokes a chain of RequestHandler instances: a first request handler targeting title_with_synonyms_phrases is invoked and, in case of zero results, the same query is sent to a second request handler targeting title_without_synonyms_phrases.

Note for Elasticsearch users: applying what is described above works a little differently. Although the auto_generate_phrase_queries attribute is also present in Elasticsearch, it doesn’t have the same effect. What you’re looking for is not a field type attribute but a query attribute [2] [3], called auto_generate_synonyms_phrase_query.
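For example, the two request bodies could be built like this in Python. It is a sketch that assumes a "title" field whose search analyzer applies the synonyms; the helper name is made up, while auto_generate_synonyms_phrase_query is the match query option mentioned above.

```python
import json

def build_match_query(field, text, synonyms_phrase):
    """Build a match query body, toggling phrase generation for multi-term synonyms."""
    return {
        "query": {
            "match": {
                field: {
                    "query": text,
                    "auto_generate_synonyms_phrase_query": synonyms_phrase,
                }
            }
        }
    }

# First attempt: multi-term synonyms become phrase clauses.
first = build_match_query("title", "how to test code in Java", True)
# Fallback: multi-term synonyms become conjunctions of terms.
second = build_match_query("title", "how to test code in Java", False)

print(json.dumps(second, indent=2))
```

Since the switch lives in the query rather than in the mapping, a single field is enough and the client simply changes the flag between the first and the second request.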

Author

Andrea Gazzarini

Andrea Gazzarini is a curious software engineer, mainly focused on the Java language and Search technologies. With more than 15 years of experience in various software engineering areas, his adventure in the search world began in 2010, when he met Apache Solr and later Elasticsearch.
