Synonyms and Stopwords: Vademecum

In this post we’ll cover two additional synonym scenarios and try to summarise all the previous tips in a concise form. Following the approach of the previous posts [1] [2] [3], everything can be applied to both Apache Solr and Elasticsearch.

Preconditions

  • Synonyms and stopwords at query time: this is not just a “theoretical” constraint. Imagine you have to manage a deployment, belonging to a single customer, with a lot of small / medium indexes: you cannot rebuild everything from scratch each time a synonym or a stopword changes.
  • Synonyms, not hypernyms or hyponyms: or better, we aren’t talking about what a thesaurus calls broader, narrower or related terms. Although some of the things below could also be valid in those contexts, the broader or narrower scope introduced by hypernyms, hyponyms or related concepts can have some weird side effects on the scoring phase.

Test data

Let’s start with the test data.

  • synonyms = [“out of warranty, oow”, “transfer phone number, port number”]
  • stopwords = [“of”, “my”]
  • query analyzer = [ “standard_tokenizer”, “lowercase filter”, “synonyms (graph) filter”, “stopwords filter”]

#1: How can I define Multi-terms Concepts?

If you want to manage a multi-terms concept as a whole, regardless of whether it has synonyms or not, you can use the synonyms file. Here are a couple of examples: the first is a concept with two synonyms, the second doesn’t have any synonym:

Multimedia Messaging Service,Multimedia Text Message,MMS
Apache Cassandra, Apache Cassandra

As you can see, when a concept doesn’t have any available synonym, we can just repeat it.

Solr users only: don’t forget the following things:

  • the request handler should use an edismax or lucene query parser, and the SplitOnWhiteSpace flag (sow) must be set to false
  • the field type which includes the synonyms graph filter must have autoGeneratePhraseQueries set to true

You can read more here [1] about this approach.

Note: this will work as long as the Lucene SynonymMap uses a List/Array for collecting the synonyms associated with a given concept. If and when the implementation switches to a Set-like approach, there’s a high chance this trick will stop working.
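
The note above can be illustrated with a toy model in Python. This is not Lucene's code: expand_rule is a hypothetical helper, and it assumes that an expanding rule pairs its entries by position rather than by value, which is what lets a repeated entry produce a mapping.

```python
def expand_rule(line, collect=list):
    """Toy model of one expanding synonym rule such as 'a, b, c'.

    `collect` decides how the entries of the rule are stored:
    a list keeps the repeated entry, a set silently drops it.
    """
    entries = list(collect(term.strip().lower() for term in line.split(",")))
    mapping = {}
    for i, source in enumerate(entries):
        for j, target in enumerate(entries):
            if i != j:  # assumption: entries are paired by position, not by value
                mapping.setdefault(source, set()).add(target)
    return mapping


with_list = expand_rule("Apache Cassandra, Apache Cassandra")
with_set = expand_rule("Apache Cassandra, Apache Cassandra", collect=set)
print(with_list)  # the concept maps to itself, so it is known as a unit
print(with_set)   # empty: a single entry has nothing to map to
```

With a list-like collection the repeated concept ends up in the map; with a set-like one the rule degenerates to a single entry and no mapping is produced.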

#2: What if the query contains multi-terms concepts with stopwords?

Imagine a query like this:

q=my car is out of warranty. What can I do?

With the configuration above, removing stopwords after synonym detection has a weird effect on the generated query: the “what” term is wrongly added to the synonym phrase query: “out ? warranty what”.

While the issue affects the FilteringTokenFilter (the superclass of StopFilter) and therefore has a wider scope, for this specific problem we proposed a solution [2]: a specialised StopFilter which is aware of synonym tokens. As a result, terms which are part of a previously detected synonym are not removed, even if they are stopwords. The query analyzer of our field becomes something like this:

<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SynonymGraphFilterFactory" 
        synonyms="synonyms.txt" 
        ignoreCase="false" 
        expand="true"/>
<filter class="io.sease.SynonymAwareStopFilterFactory" 
        words="stopwords.txt" 
        ignoreCase="true"/>
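
As a rough illustration of what the synonym-aware filter changes, here is a toy model in Python. It is only a sketch: it works on a flat token list rather than on Lucene's token graph, so it does not reproduce the misplaced “what” term, only which tokens survive. The function names are hypothetical, and a removed stopword is shown as “?” to mark the position gap it leaves.

```python
STOPWORDS = {"of", "my"}
SYNONYM_PHRASES = [("out", "of", "warranty"), ("transfer", "phone", "number")]


def synonym_positions(tokens):
    """Return the token positions covered by a detected multi-term synonym."""
    covered = set()
    for phrase in SYNONYM_PHRASES:
        size = len(phrase)
        for start in range(len(tokens) - size + 1):
            if tuple(tokens[start:start + size]) == phrase:
                covered.update(range(start, start + size))
    return covered


def stop_filter(tokens, protected=frozenset()):
    """Blank out stopwords, except at positions listed in `protected`."""
    return [
        "?" if token in STOPWORDS and position not in protected else token
        for position, token in enumerate(tokens)
    ]


tokens = "my car is out of warranty".split()
plain = " ".join(stop_filter(tokens))
aware = " ".join(stop_filter(tokens, synonym_positions(tokens)))
print(plain)  # ? car is out ? warranty
print(aware)  # ? car is out of warranty
```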

#3: What if the document contains multi-terms concepts with “intruder” stopwords?

We have a document like this:

{
  "id": 1,
  "title": "how do I transfer my phone number?"
}

and the query:

q=transfer phone number procedure

At query time the synonym is correctly detected and phrase clauses are generated, but unfortunately the query doesn’t match the document above because of the intermediate “my” stopword.

You can read the proposed solution for this scenario here [3]: it basically consists of a two-step query plan. In the first step, the detected synonyms generate phrase clauses; in the second, they are broken down into term clauses.

#4: What if the query contains multi-terms concepts with “intruder” stopwords?

And here we are in the opposite case. We have a document like this:

{ 
  "id": 1, 
  "title": "transfer phone number procedure" 
}

and the query:

q=how do I transfer my phone number?

As you can see, at query time the synonym is not detected because of the “my” stopword between its terms. While the document above could still be part of the response to the generated query, here we are focusing on the missing synonym detection.

A possible solution is to apply the synonym filter twice, before and after the stopwords filter:

<fieldtype 
       name="text_with_synonyms_phrases" 
       class="solr.TextField" autoGeneratePhraseQueries="true">
       
       <analyzer type="index">
           <tokenizer class="solr.StandardTokenizerFactory"/>
           <filter class="solr.LowerCaseFilterFactory"/>
       </analyzer>
       <analyzer type="query">
           <tokenizer class="solr.StandardTokenizerFactory"/>
           <filter class="solr.LowerCaseFilterFactory"/>
           <filter class="solr.SynonymGraphFilterFactory" 
                   synonyms="synonyms.txt" 
                   ignoreCase="true" 
                   expand="true"/>
           <filter class="io.sease.SynonymAwareStopFilterFactory" 
                   words="stopwords.txt" 
                   ignoreCase="true"/>
           <filter class="solr.SynonymGraphFilterFactory" 
                   synonyms="synonyms.txt" 
                   ignoreCase="true" 
                   expand="true"/>
       </analyzer>
</fieldtype>

In the first pass the synonym is not detected; then the StopFilter removes the “my” stopword, so in the second pass the synonym is correctly recognized. Note that the StopFilter is still the custom class we introduced in #2, because we also want to cover that scenario.

What is the drawback of this approach? It worked in my specific case, but be aware that the SynonymGraphFilter documentation contains this explicit warning:
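
The two passes can be sketched with a toy model in Python. As an assumption for illustration, it works on a flat token list and drops stopwords without leaving position gaps; the real filters operate on a token graph. All function names are hypothetical.

```python
SYNONYM_PHRASES = [
    ("out", "of", "warranty"),
    ("transfer", "phone", "number"),
    ("port", "number"),
]
STOPWORDS = {"of", "my"}


def find_phrases(tokens):
    """Return (start_index, phrase) for each multi-term synonym in tokens."""
    hits = []
    for start in range(len(tokens)):
        for phrase in SYNONYM_PHRASES:
            if tuple(tokens[start:start + len(phrase)]) == phrase:
                hits.append((start, " ".join(phrase)))
    return hits


def synonym_aware_stop(tokens, hits):
    """Drop stopwords unless they sit inside an already detected synonym."""
    protected = set()
    for start, phrase in hits:
        protected.update(range(start, start + len(phrase.split())))
    return [t for i, t in enumerate(tokens)
            if t not in STOPWORDS or i in protected]


def analyze(query):
    """Synonym pass, synonym-aware stop filter, then a second synonym pass."""
    tokens = query.lower().replace("?", "").split()
    first_pass = find_phrases(tokens)
    filtered = synonym_aware_stop(tokens, first_pass)
    second_pass = find_phrases(filtered)
    return first_pass, filtered, second_pass


first, filtered, second = analyze("how do I transfer my phone number?")
print(first)     # []
print(filtered)  # ['how', 'do', 'i', 'transfer', 'phone', 'number']
print(second)    # [(3, 'transfer phone number')]
```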

NOTE: this cannot consume an incoming graph; results will be undefined.

#5 (UNSOLVED): What if the query contains multi-terms concepts with more than one “intruder” stopword?

This is the worst case, where we have a query like this:

q=out of my warranty

That is: we have a couple of terms which have been declared as stopwords, but the first (of) is potentially part of a synonym (out of warranty) while the second (my) isn’t.
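
To make the difficulty concrete, here is a small Python check (a toy model on flat token lists, with a hypothetical contains_phrase helper). It does not propose a fix; it only shows that neither of the two removals a stop filter can apply blindly leads to the synonym.

```python
SYNONYM = ("out", "of", "warranty")
STOPWORDS = {"of", "my"}


def contains_phrase(tokens, phrase=SYNONYM):
    """True if `phrase` appears as consecutive tokens."""
    size = len(phrase)
    return any(
        tuple(tokens[start:start + size]) == phrase
        for start in range(len(tokens) - size + 1)
    )


tokens = "out of my warranty".split()
nothing_removed = tokens
all_removed = [t for t in tokens if t not in STOPWORDS]
only_my_removed = [t for t in tokens if t != "my"]

print(contains_phrase(nothing_removed))  # False: "my" sits inside the phrase
print(contains_phrase(all_removed))      # False: "of" is gone as well
print(contains_phrase(only_my_removed))  # True
```

Only the selective removal works, and choosing it requires knowing that “of” belongs to a synonym before the synonym has been detected.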

We’re still working on this case, so unfortunately there’s no proposal here. If you have any ideas or feedback, they are warmly welcome.


[1] Multi-terms concepts in Apache Solr / Elasticsearch
[2] SynonymAwareStopFilter
[3] https://sease.io/2018/08/still-synonyms-stopwords-mamma-mia.html

Still Synonyms + Stopwords?? Mamma mia!

The Context

Brief recap of where we arrived in the preceding article: we had the following synonyms and stopwords settings:

  • synonyms = {“out of warranty”, “oow”}
  • stopwords = {“of”}

Both of those filters were configured exclusively at query-time; the synonym filter first and then the stopwords filter.

Using the built-in StopFilter we had a synonym detection issue because of the removal of the “of” term from the query string (e.g. “my device ran out of warranty”). For that reason, we introduced a custom StopFilter subclass which is aware of stopwords in synonyms.

The other scenario we are going to describe is a little bit different: let’s suppose we have the following data:

  • synonyms = {test code, tdd, testing}
  • stopwords = {my, your, how, to, in}

Here again, we want to manage synonyms and stopwords only at query time.
We have this document indexed:

   {
      "id": 1,
      "title": "Java programmer: do you want to test your code?"
   }

And a query like this:

"how to test code in Java?"

The Problem: missing synonym match

The query parser matches the “test code” synonym in the query and produces a query like this:

(title:tdd title:testing PhraseQuery(title:"test code")) title:java

Unfortunately there’s no match, because the document title contains an intruder: the “your” term between “test” and “code”.

A Solution: invisible queries with and without synonym phrases

In the preceding article we underlined the role of the autoGeneratePhraseQueries flag: it is responsible for creating phrase clauses for all detected multi-terms synonyms. If this flag is set to false (or is missing), the generated query won’t contain any phrase, even if a multi-term synonym is detected.

While ordinarily this is not what you would expect, in this specific case it could be a valid alternative for dealing with such a mismatch: a first request would require the “synonym phrasing” behaviour, a second one wouldn’t. The first query would be:

(title:tdd title:testing PhraseQuery(title:"test code")) title:java

After receiving an empty response, a second query is sent, targeting another (similar) field whose field type has the autoGeneratePhraseQueries parameter set to false. That would generate the following query:

(title:testing title:tdd (+title:test +title:code)) title:java

and here we would get a match!

A couple of notes:

  • in the second try we are requiring both terms (“test” and “code”) to be present, but independently of each other: in whatever order, with whatever proximity. The increased recall could therefore produce some unexpected results. If we are using the edismax query parser, a “pf” parameter would be helpful for moving up those results which adhere better to the entered query, in terms of proximity and term order.
  • we could put the stop filter at index time, but that violates the precondition: we want a pure query-time management.
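
For the first note, this is a hedged sketch in Python of the request parameters the second attempt might carry. defType, qf, pf, ps and sow are standard edismax parameters; the field name is the one defined in the schema of this post, while the helper name and the slop value are assumptions to be tuned.

```python
from urllib.parse import urlencode


def second_attempt_params(user_query):
    """Parameters for the term-based retry, with a phrase boost on top."""
    return {
        "defType": "edismax",
        "q": user_query,
        "sow": "false",  # do not split on whitespace before analysis
        "qf": "title_without_synonyms_phrases",  # matching: term clauses
        "pf": "title_without_synonyms_phrases",  # boosting: phrase proximity
        "ps": "2",  # assumed phrase slop for the pf boost
    }


print(urlencode(second_attempt_params("how to test code in Java")))
```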

How do we implement such a search workflow? In Solr, we need a couple of fields: the first is exactly the field + field type we described in the preceding article; the second is similar, the only difference being the autoGeneratePhraseQueries parameter, which is set to false:

<fieldtype 
       name="text_with_synonyms_phrases" 
       class="solr.TextField" autoGeneratePhraseQueries="true">
       
       <analyzer type="index">
           <tokenizer class="solr.StandardTokenizerFactory"/>
           <filter class="solr.LowerCaseFilterFactory"/>
       </analyzer>
       <analyzer type="query">
           <tokenizer class="solr.StandardTokenizerFactory"/>
           <filter class="solr.LowerCaseFilterFactory"/>
           <filter class="solr.SynonymGraphFilterFactory" 
                   synonyms="synonyms.txt" 
                   ignoreCase="false" 
                   expand="true"/>
           <filter class="sc.SynonymAwareStopFilterFactory" 
                   words="stopwords.txt" 
                   ignoreCase="true"/>
       </analyzer>
</fieldtype>
<fieldtype 
       name="text_without_synonyms_phrases" 
       class="solr.TextField" autoGeneratePhraseQueries="false">
       
       <analyzer type="index">
           <tokenizer class="solr.StandardTokenizerFactory"/>
           <filter class="solr.LowerCaseFilterFactory"/>
       </analyzer>
       <analyzer type="query">
           <tokenizer class="solr.StandardTokenizerFactory"/>
           <filter class="solr.LowerCaseFilterFactory"/>
           <filter class="solr.SynonymGraphFilterFactory" 
                   synonyms="synonyms.txt" 
                   ignoreCase="false" 
                   expand="true"/>
           <filter class="sc.SynonymAwareStopFilterFactory" 
                   words="stopwords.txt" 
                   ignoreCase="true"/>
       </analyzer>
</fieldtype>

<field 
      name="title_with_synonyms_phrases" 
      type="text_with_synonyms_phrases" .../>
<field 
      name="title_without_synonyms_phrases" 
      type="text_without_synonyms_phrases" .../>

Then, here is the minimal request handler:

<requestHandler name="/search" class="solr.SearchHandler" default="true">
       <lst name="defaults">
           <bool name="sow">false</bool>
           <str name="df">title_with_synonyms_phrases</str>
           <str name="defType">lucene</str>
       </lst>
</requestHandler>

A client would first send a request like this:

/search?q=how to test code in Java

And, after receiving an empty response, it would send a second query:

/search?q=how to test code in Java&df=title_without_synonyms_phrases
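
On the client side, the two requests could be wired together as in the following Python sketch. The base URL and core name are placeholders, search_with_fallback and solr_get are hypothetical helpers, and the fetch function is injectable so the flow can be tried without a running Solr.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Placeholder: adjust host and core name to your installation.
SEARCH_URL = "http://localhost:8983/solr/mycore/search"


def solr_get(params):
    """Send a GET request to the /search handler and decode the JSON body."""
    with urlopen(SEARCH_URL + "?" + urlencode(params)) as response:
        return json.load(response)


def search_with_fallback(user_query, fetch=solr_get):
    """Try the phrase-based field first, then retry on the term-based one."""
    params = {"q": user_query, "wt": "json"}
    first = fetch(params)
    if first["response"]["numFound"] > 0:
        return first
    # Empty response: same query, but against the field without phrases.
    return fetch({**params, "df": "title_without_synonyms_phrases"})
```

The first call relies on the df default of the request handler; only the retry overrides it.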

Another option, which moves the search workflow to the Solr side, is our CompositeRequestHandler [1], a Solr component which invokes a chain of RequestHandler instances: a first request handler, targeting title_with_synonyms_phrases, would be invoked and, in case of zero results, the same query would be sent to another request handler, targeting title_without_synonyms_phrases.

Note for Elasticsearch users: you will find some differences in applying what is described above. Although the auto_generate_phrase_queries attribute is also present in Elasticsearch, it doesn’t have the same effect. What you’re looking for is an attribute which is not related to field types: it is a query attribute [2] [3] called auto_generate_synonyms_phrase_query.
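
As a minimal sketch of that attribute, the Python snippet below builds the body of a match query with synonym phrases switched on or off. The title field is an assumption, as is a search analyzer with a synonym graph filter configured on it; the helper name is hypothetical.

```python
import json


def match_body(field, text, synonym_phrases):
    """Build a match query, toggling phrase generation for multi-term synonyms."""
    return {
        "query": {
            "match": {
                field: {
                    "query": text,
                    "auto_generate_synonyms_phrase_query": synonym_phrases,
                }
            }
        }
    }


# Second attempt of the workflow above: term clauses instead of phrases.
print(json.dumps(match_body("title", "how to test code in Java", False), indent=2))
```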


[1] https://github.com/SeaseLtd/composite-request-handler
[2] Match Query / Synonyms
[3] Query String Query / Synonyms