Synonyms Tips And Tricks
Synonyms and Stopwords: Vademecum

Synonyms and Stopwords: Vademecum

In this post we’ll cover two additional synonyms scenarios and we’ll try to summarise all previous tips in a coincise form. Following the approach of the previous posts [1] [2] [3], everything can be applied both to Apache Solr and Elasticsearch.

Preconditions

    • Synonyms and stopwords at query time: this is not just a “theoretical” constraint; imagine if you have to manage a deployment context belonging to the same customer with a lot of small / medium indexes: you cannot re-build from scratch everything each time a synonym or a stopword changes.
    • Synonyms, not hypernyms or hyponyms: or better, we aren’t talking about what a thesaurus calls broader, narrower or related terms. Although some of the things below could be also valid in those contexts, the broader or narrower scope introduced with hypernyms, hyponyms or related concepts can have some weird side-effect on the scoring phase.

Test data

Let’s start with the test data.

    • synonyms = [“out of warranty, oow”, “transfer phone number, port number”]
    • stopwords = [“of”, “my”]
    • query analyzer = [ “standard_tokenizer”, “lowercase filter”, “synonyms (graph) filter”, “stopwords filter”]

#1: How can I define Multi-terms Concepts?

If you want to manage a multi-terms concept as a whole, regardless it has synonyms or not, you can use the synonyms file. Here’s a couple of examples: the first is a concept with one synonym, the second one doesn’t have any synonym:

Multimedia Messaging Service,Multimedia Text Message,MMS
Apache Cassandra, Apache Cassandra

As you can see, when a concept doesn’t have any available synonym, we can just repeat it.

Solr users only: don’t forget the following things:

    • the request handler should use an edismax or lucene query parser, and the SplitOnWhiteSpace flag (sow) must be set to true
    • the field type which includes the synonyms graph filter must have the autoGeneratePhraseQueries set to true

You can read more here [1] about this approach.

Note: this will work until the Lucene SynonymMap uses a List/Array for collecting the synonyms associated with a given concept. When and if the implementation will switch to a Set-like approach, there’s a high chance this trick will stop working.

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ut elit tellus, luctus nec ullamcorper mattis, pulvinar dapibus leo.

#2: What if the query contains multi-terms concepts with stopwords?

Imagine a query like this

q=my car is out of warranty. What can I do?

Well, with the configuration above the stopwords removal after the synonyms detection causes a weird effect on the generated query: the “what” term is wrongly added to the synonym phrase query: “out ? warranty what”.

While the issue affects the FilteringTokenFilter (the superclass of StopFilter) and therefore it has a wider scope, for this specific problem we proposed a solution [2], consisting of a specialised StopFilter which is aware about synonym tokens. The result is that terms which are part of a previously detected synonym are not removed, even if they are stopwords. The query analyzer of our field becomes something like this:

<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SynonymGraphFilterFactory" 
        synonyms="synonyms.txt" 
        ignoreCase="false" 
        expand="true"/>
<filter class="io.sease.SynonymAwareStopFilterFactory" 
        words="stopwords.txt" 
        ignoreCase="true"/>

#3: What if the document contains multi-terms concepts with "intruder" stopwords?

We have a document like this:

{
  "id": 1,
  "title": "how do I transfer my phone number?"
}

and the query:

q=transfer phone number procedure

at query time, the synonym is correctly detected and phrase clauses are generated, but unfortunately it doesn’t match the document above because the intermediate “my” stopwords:

You can read here [3] the proposed solution for this scenario, which basically consists of a two-steps query plan: in the first, the detected synonyms generate phrase clauses, while in the second they are destructured in term clauses.

#4: What if the query contains multi-terms concepts with "intruder" stopwords?

And here we are in the opposite case. We have a document like this:

{ 
  "id": 1, 
  "title": "transfer phone number procedure" 
}

and the query:

q=how do I transfer my phone number?

As you can see, at query time the synonym is not detected because the “my” stopword between terms. While the document above could be still be part of the response of the generated query, here we are focusing on the missing synonym detection.

A possible solution is to double the synonym filter before and after the stopwords filter:

<fieldtype 
       name="text_with_synonyms_phrases" 
       class="solr.TextField" autoGeneratePhraseQueries="true">
       
       <analyzer type="index">
           <tokenizer class="solr.StandardTokenizerFactory"/>
           <filter class="solr.LowerCaseFilterFactory"/>
       </analyzer>
       <analyzer type="query">
           <tokenizer class="solr.StandardTokenizerFactory"/>
           <filter class="solr.LowerCaseFilterFactory"/>
           <filter class="solr.SynonymGraphFilterFactory" 
                   synonyms="synonyms.txt" 
                   ignoreCase="true" 
                   expand="true"/>
           <filter class="io.sease.SynonymAwareStopFilterFactory" 
                   words="stopwords.txt" 
                   ignoreCase="true"/>
           <filter class="solr.SynonymGraphFilterFactory" 
                   synonyms="synonyms.txt" 
                   ignoreCase="true" 
                   expand="true"/>
       </analyzer>
</fieldtype>

In the first iteration the synonym is not detected, then the StopFilter removes the “my” stopword so in the second iteration the synonym will be correctly recognized. Note the StopFilter is still the custom class we introduced in #2 because we want to cover also that scenario.

What is the drawback of this approach? This is something which worked in my specific case, but be aware that the SynonymGraphFilter documentation states this explicit warning:

NOTE: this cannot consume an incoming graph; results will be undefined.

#5 (UNSOLVED) What if the query contains multi-terms concepts more than one "intruder" stopwords?

This is the worst case, where we have a query like this:

q=out of my warranty

That is: we have a couple of terms which have been declared as stopwords, but the first (of) is potentially part of a synonym (out of warranty) while the second (my) isn’t.

We’re still working on this case so unfortunately there’s no a proposal here, if you got some idea or feedback, it is warmly welcome.

// our service

Shameless plug for our training and services!

Did I mention we do Apache Solr Beginner and Elasticsearch Beginner training?
We also provide consulting on these topics, get in touch if you want to bring your search engine to the next level!

// STAY ALWAYS UP TO DATE

Subscribe to our newsletter

Did you like this post about Synonyms + Stopwords? Don’t forget to subscribe to our Newsletter to stay always updated from the Information Retrieval world!

Author

Andrea Gazzarini

Andrea Gazzarini is a curious software engineer, mainly focused on the Java language and Search technologies. With more than 15 years of experience in various software engineering areas, his adventure in the search world began in 2010, when he met Apache Solr and later Elasticsearch.

Leave a comment

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.