Synonyms Tips And Tricks
Synonyms + Stopwords?? OMG!

The Context

The scenario description is quite simple: we want to use synonyms and stopwords.

Following the path of our previous article, we will introduce an additional component in the analysis chain: a StopFilter, which, as the name suggests, removes a set of words from an incoming token stream.

We will use the following data through the examples:

  • synonyms = [“out of warranty”,”oow”]
  • stopwords = [“of”]

Token filters can be configured at index and/or query time. In this context we are focused on the query side: both synonyms and stopwords will be configured only in the query analyzer.

Working exclusively at query time has a great benefit: we can change things at runtime without any reindex need. At the same time, no stopwords filtering will be executed at index time so those terms will be uselessly part of the dictionary.

The Problem: synonyms followed by stopwords

We have the following analyzers:

  • index analyzer
    • standard-tokenizer
    • lowercase
  • query analyzer
    • standard-tokenizer
    • lowercase + synonyms + stopwords

Theoretically, in the query analyzer we would have two options: the stopwords filter could be defined before or after the synonym filter. However, the first way (before) doesn’t make so much sense, because terms that are stopwords and that are, at the same time, part of a synonym will be removed before the synonym detection. As consequence of that those synonym won’t be detected: in the example data, issuing a query like

q=out of warranty

the “of” term will be removed by the StopFilter, the subsequent filter would receive [“out”, “warranty”], which doesn’t match the configured synonym (“out of warranty”).

Elasticsearch users: Elasticsearch doesn’t allow this scenario at all; if you try to use the PUT Settings API with a chain defined as above (first stopwords then synonyms with some term intersection), it will throw an illegal argument exception saying “term: out of warranty analyzed to a token (warranty) with position increment != 1 (got: 2)” .

Apache Solr instead uses a lenient approach: no errors at index creation, but the problem remains (personally I prefer the Elasticsearch approach)

So the obvious choice is to postpone the stopwords management after the synonym filter. Unfortunately, here there’s an issue: the stopword(s) removal has some unwanted side-effect in the generated token graph and the query parser generates a wrong query because it consumes the token stream at the end of the chain.

Let’s imagine we have the following query:

q=tv went out of warranty something of

it will generate the following:

title:tv title:went (title:oow PhraseQuery(title:"out ? warranty something"))

As you can see, the synonym (out of warranty -> oow) is correctly detected but the stopwords filter removes all the “of” tokens, even if the first occurrence is part of a synonym. In the generated query you can see the sneaky effect: the “hole” created by the first “of” occurrence removal, produces the inclusion, in the phrase query, of the next available token in the stream (“something”, in the example).

In other words, the oow token synonym is marked with a positionLength = 3, which correctly means it spans three tokens (1=out, 2=of, 3=warranty); later, the query parser will include the next three available terms for generating a synonym phrase queries but since we no longer have the 2nd token (of), such count includes also “something”, which is the 3rd available token in the stream.

Before proceeding: this is a known problem, a long-standing issue [1] in Lucene which has a broader domain because it is related with the FilteringTokenFilter, the superclass of StopFilter.

The problem we will try to solve is: how can we manage synonyms and stopwords at query time without generating the conflict above?

A Solution

A note first: the token filter we are going to create is something that deals only with Lucene classes. However, when things need to be plugged in a runtime container (e.g. Apache Solr or Elasticsearch) the deployment procedure depends on the target platform: we won’t cover this part here.

The proposed solution is to create a StopFilter subclass which will be “synonym-aware”; it will check the tokenType and positionLength attributes before deciding if a token needs to be removed from the stream. The goal is to avoid removing those terms which have been defined in the stopwords list but are part of a synonym definition.

The class that we are going to extends is org.apache.lucene.analysis.core.StopFlter. This is an empty class, because all the filtering logic is in the superclasses (org.apache.lucene.analysis.StopFilter and the more generic org.apache.lucene.analysis.FilteringTokenFilter). The stopwords logic resides in the accept() method, which as you can see is very simple:

protected boolean accept() {
  return !stopWords.contains(termAtt.buffer(), 0, termAtt.length());
}

If the stopwords list contains the current term, it will be removed. So far, so good. We need to extend (actually we could also decorate) the StopFilter class for doing something else before calling the logic above.

First we need to check the token type: if a token has been marked as a SYNONYM then our filter doesn’t have to remove it. Then we need to check the positionLength attribute, because, within a synonym detection context, a position length greater than 1 means we have traversing a multi-term synonym:

public class SynonymAwareStopFilter extends StopFilter {

  private TypeAttribute tAtt = 
                             addAttribute(TypeAttribute.class);
  private PositionLengthAttribute plAtt = 
                             addAttribute(PositionLengthAttribute.class);

  private int synonymSpans;

  protected SynonymAwareStopFilter(
                         TokenStream in, CharArraySet stopwords) {
    super(in, stopwords);
  }

  @Override
  protected boolean accept() {
    if (isSynonymToken()) {
      synonymSpans = plAtt.getPositionLength() > 1 
                             ? plAtt.getPositionLength() 
                             : 0;
      return true;
    }

    return (--synonymSpans > 0) || super.accept();
  }

  private boolean isSynonymToken() {
    return "SYNONYM".equals(tAtt.type());
  }

Let’s do some test. We will use Apache Solr 7.4.0 for checking the results. Here is the field type definition, where you can see our SynonymAwareStopFilter:

<fieldtype name="text" class="solr.TextField" autoGeneratePhraseQueries="true">
       <analyzer type="index">
           <tokenizer class="solr.StandardTokenizerFactory"/>
           <filter class="solr.LowerCaseFilterFactory"/>
       </analyzer>
       <analyzer type="query">
           <tokenizer class="solr.StandardTokenizerFactory"/>
           <filter class="solr.LowerCaseFilterFactory"/>
           <filter class="solr.SynonymGraphFilterFactory" 
                   synonyms="synonyms.txt" 
                   ignoreCase="false" 
                   expand="true"/>
           <filter class="sc.SynonymAwareStopFilterFactory" 
                   words="stopwords.txt" 
                   ignoreCase="true"/>
       </analyzer>
</fieldtype>

and this is a minimal request handler:

<requestHandler name="/def" class="solr.SearchHandler" default="true">
       <lst name="defaults">
           <bool name="sow">false</bool>
           <str name="df">title</str>
           <str name="defType">lucene</str>
           <bool name="debug">true</bool>
       </lst>
   </requestHandler>

Running the previous query:

q=tv went out of warranty something of

we have the following:

title:tv title:went (title:oow PhraseQuery(title:"out of warranty")) title:something

if we use instead the other synonym variant:

q=tv went oow something of

we have the following:

title:tv title:went (PhraseQuery(title:"out of warranty") title:oow) title:something

Everything seems working as expected! This is probably just one specific scenario among those addressed by LUCENE-4065; however, it helped me a lot because this is (at least in my experience) a frequent use case.

As usual, any feedback is warmly welcome. See you next time!

 


[1] https://issues.apache.org/jira/browse/LUCENE-4065

Author

Andrea Gazzarini

Andrea Gazzarini is a curious software engineer, mainly focused on the Java language and Search technologies. With more than 15 years of experience in various software engineering areas, his adventure in the search world began in 2010, when he met Apache Solr and later Elasticsearch.

Comments (8)

  1. Mousavi
    August 26, 2018

    Excellent!

  2. Mousavi
    August 26, 2018

    Is there any available source code? Thank you

    • Andrea Gazzarini
      August 26, 2018

      I’m sorry, we are reoganising the Github repository and the token filter described in the article is not yet there. However, if you have some dev skill, you can use the code embedded in the article.

  3. richa
    March 25, 2019

    After using this code even the stop words are searchable now

  4. richa
    March 25, 2019

    Can you also provide a solution that stop word if inside synonym should work fine, but if searching for any stopword then it should not return any response

    • Andrea Gazzarini
      April 10, 2019

      Hi, thanks for your comment.
      I double-checked the code, executed it with a short example and I can confirm you that is working. Could you please expand a bit what you’re trying to do?

  5. Andrés
    October 18, 2021

    Hi Andrea,

    I was trying the conditional token filter in elasticsearch as follows:
    {“type”: “condition”,
    “filter”: [“stop”],
    “script”: {“source”: “token.getType() != \”SYNONYM\””}

    And it seems to work. Do you think this approach is valid?

    • Andrea Gazzarini
      October 26, 2021

      Hi Andrés,
      nice shot: I think so! I went through my post and re-executed the examples using that approach. It works as expected.

      Andrea

Leave a comment

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this: