Synonyms and Stopwords: Vademecum

In this post we’ll cover two additional synonym scenarios and try to summarise all the previous tips in a concise form. Following the approach of the previous posts [1] [2] [3], everything can be applied to both Apache Solr and Elasticsearch.

Preconditions

  • Synonyms and stopwords at query time: this is not just a “theoretical” constraint. Imagine you have to manage a deployment, belonging to a single customer, with a lot of small / medium indexes: you cannot rebuild everything from scratch each time a synonym or a stopword changes.
  • Synonyms, not hypernyms or hyponyms: or better, we aren’t talking about what a thesaurus calls broader, narrower or related terms. Although some of the things below could also be valid in those contexts, the broader or narrower scope introduced by hypernyms, hyponyms or related concepts can have some weird side effects on the scoring phase.

Test data

Let’s start with the test data.

  • synonyms = [“out of warranty, oow”, “transfer phone number, port number”]
  • stopwords = [“of”, “my”]
  • query analyzer = [ “standard_tokenizer”, “lowercase filter”, “synonyms (graph) filter”, “stopwords filter”]

#1: How can I define Multi-terms Concepts?

If you want to manage a multi-terms concept as a whole, regardless of whether it has synonyms or not, you can use the synonyms file. Here are a couple of examples: the first is a concept with synonyms, the second one doesn’t have any synonym:

Multimedia Messaging Service,Multimedia Text Message,MMS
Apache Cassandra, Apache Cassandra

As you can see, when a concept doesn’t have any available synonym, we can just repeat it.

Solr users only: don’t forget the following things:

  • the request handler should use an edismax or lucene query parser, and the split-on-whitespace flag (sow) must be set to false, so that the whole multi-term sequence reaches the analysis chain in one shot
  • the field type which includes the synonyms graph filter must have the autoGeneratePhraseQueries attribute set to true

You can read more here [1] about this approach.

Note: this works as long as the Lucene SynonymMap uses a List/Array for collecting the synonyms associated with a given concept. If the implementation ever switches to a Set-like approach, there’s a high chance this trick will stop working.

#2: What if the query contains multi-terms concepts with stopwords?

Imagine a query like this

q=my car is out of warranty. What can I do?

Well, with the configuration above, removing the stopwords after the synonym detection has a weird effect on the generated query: the “what” term is wrongly added to the synonym phrase query: “out ? warranty what”.

While the issue affects the FilteringTokenFilter (the superclass of StopFilter) and therefore has a wider scope, for this specific problem we proposed a solution [2], consisting of a specialised StopFilter which is aware of synonym tokens. The result is that terms which are part of a previously detected synonym are not removed, even if they are stopwords. The query analyzer of our field becomes something like this:

<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SynonymGraphFilterFactory" 
        synonyms="synonyms.txt" 
        ignoreCase="false" 
        expand="true"/>
<filter class="io.sease.SynonymAwareStopFilterFactory" 
        words="stopwords.txt" 
        ignoreCase="true"/>
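
The rule described above (“a stopword survives if it belongs to a detected synonym”) can be illustrated with a small toy model in plain Java. This is only a sketch under stated assumptions: the Token class and its partOfSynonym flag are hypothetical names invented for the example, and the code does not use the Lucene TokenFilter API, so the actual filter in [2] may well work differently.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

final class SynonymAwareStopSketch {

    /** Hypothetical token model: a term plus a flag set by the synonym stage. */
    static final class Token {
        final String term;
        final boolean partOfSynonym;

        Token(String term, boolean partOfSynonym) {
            this.term = term;
            this.partOfSynonym = partOfSynonym;
        }
    }

    /** Removes stopwords, but keeps those that belong to a detected synonym. */
    static List<Token> filter(List<Token> tokens, Set<String> stopwords) {
        return tokens.stream()
                .filter(t -> t.partOfSynonym || !stopwords.contains(t.term))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Set<String> stopwords = Set.of("of", "my");
        // Assumption: the synonym stage has already flagged "out of warranty".
        List<Token> analyzed = List.of(
                new Token("my", false),
                new Token("car", false),
                new Token("is", false),
                new Token("out", true),
                new Token("of", true),
                new Token("warranty", true),
                new Token("what", false));

        // "my" is dropped; "of" survives because it is part of "out of warranty".
        String kept = filter(analyzed, stopwords).stream()
                .map(t -> t.term)
                .collect(Collectors.joining(" "));
        System.out.println(kept);
    }
}
```

Since “of” is kept, no position gap is left inside the synonym, which is the gap that pulled “what” into the phrase query in the first place.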

#3: What if the document contains multi-terms concepts with “intruder” stopwords?

We have a document like this:

{
  "id": 1,
  "title": "how do I transfer my phone number?"
}

and the query:

q=transfer phone number procedure

At query time, the synonym is correctly detected and phrase clauses are generated, but unfortunately the query doesn’t match the document above because of the intermediate “my” stopword.

You can read here [3] the proposed solution for this scenario, which basically consists of a two-step query plan: in the first step, the detected synonyms generate phrase clauses, while in the second they are destructured into term clauses.

#4: What if the query contains multi-terms concepts with “intruder” stopwords?

And here we are in the opposite case. We have a document like this:

{ 
  "id": 1, 
  "title": "transfer phone number procedure" 
}

and the query:

q=how do I transfer my phone number?

As you can see, at query time the synonym is not detected because of the “my” stopword between its terms. While the document above could still be part of the response to the generated query, here we are focusing on the missing synonym detection.

A possible solution is to apply the synonym filter twice, before and after the stopwords filter:

<fieldtype 
       name="text_with_synonyms_phrases" 
       class="solr.TextField" autoGeneratePhraseQueries="true">
       
       <analyzer type="index">
           <tokenizer class="solr.StandardTokenizerFactory"/>
           <filter class="solr.LowerCaseFilterFactory"/>
       </analyzer>
       <analyzer type="query">
           <tokenizer class="solr.StandardTokenizerFactory"/>
           <filter class="solr.LowerCaseFilterFactory"/>
           <filter class="solr.SynonymGraphFilterFactory" 
                   synonyms="synonyms.txt" 
                   ignoreCase="true" 
                   expand="true"/>
           <filter class="io.sease.SynonymAwareStopFilterFactory" 
                   words="stopwords.txt" 
                   ignoreCase="true"/>
           <filter class="solr.SynonymGraphFilterFactory" 
                   synonyms="synonyms.txt" 
                   ignoreCase="true" 
                   expand="true"/>
       </analyzer>
</fieldtype>

In the first pass the synonym is not detected; then the StopFilter removes the “my” stopword, so in the second pass the synonym is correctly recognized. Note that the StopFilter is still the custom class we introduced in #2, because we want to cover that scenario as well.
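
To make the two passes concrete, here is a minimal sketch in plain Java that mimics the chain on a list of already tokenized, lowercased terms. It is a toy model, not the Lucene analysis API: containsPhrase and removeStopwords are hypothetical helpers introduced for the example, and the stopword removal here is the plain one, since in this query no synonym is detected before it runs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;

final class TwoPassSynonymSketch {

    /** Returns true if the phrase occurs as a contiguous run of tokens. */
    static boolean containsPhrase(List<String> tokens, List<String> phrase) {
        return Collections.indexOfSubList(tokens, phrase) >= 0;
    }

    /** Returns a copy of the tokens without the stopwords. */
    static List<String> removeStopwords(List<String> tokens, Set<String> stopwords) {
        List<String> kept = new ArrayList<>(tokens);
        kept.removeIf(stopwords::contains);
        return kept;
    }

    public static void main(String[] args) {
        List<String> query =
                Arrays.asList("how", "do", "i", "transfer", "my", "phone", "number");
        List<String> synonym = Arrays.asList("transfer", "phone", "number");
        Set<String> stopwords = Set.of("of", "my");

        // First pass: "my" sits between the synonym terms, so nothing is detected.
        System.out.println(containsPhrase(query, synonym)); // prints false

        // Second pass, after stopword removal: the terms are now contiguous.
        List<String> withoutStopwords = removeStopwords(query, stopwords);
        System.out.println(containsPhrase(withoutStopwords, synonym)); // prints true
    }
}
```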

What is the drawback of this approach? It worked in my specific case, but be aware that the SynonymGraphFilter documentation contains this explicit warning:

NOTE: this cannot consume an incoming graph; results will be undefined.

#5 (UNSOLVED) What if the query contains multi-terms concepts with more than one “intruder” stopword?

This is the worst case, where we have a query like this:

q=out of my warranty

That is: we have a couple of terms which have been declared as stopwords, but the first (of) is potentially part of a synonym (out of warranty) while the second (my) isn’t.

We’re still working on this case, so unfortunately there’s no proposal here. If you have any idea or feedback, it is warmly welcome.


[1] Multi-terms concepts in Apache Solr / Elasticsearch
[2] SynonymAwareStopFilter
[3] https://sease.io/2018/08/still-synonyms-stopwords-mamma-mia.html

Apache Solr: Chaining SearchHandler instances: the CompositeRequestHandler

What are “Invisible Queries”?

This is an extract of an article [1] on Lucidworks.com, by Grant Ingersoll, talking about invisible queries:

“It is often necessary in many applications to execute more than one query for any given user query.  For instance, in applications that require very high precision (only good results, forgoing marginal results), the app. may have several fields, one for exact matches, one for case-insensitve matches and yet another with stemming.  Given a user query, the app may try the query against the exact match field first and if there is a result, return only that set.  If there are no results, then the app would proceed to search the next field, and so on.”

(source: https://lucidworks.com/blog/2009/08/12/fake-and-invisible-queries)

The passage above assumes a scenario where the (client) application issues several subsequent requests to Solr on top of a user query (i.e. one user query => many search engine queries). What if you don’t have such control? Imagine you’re the search engineer of an e-commerce portal that has been built using Magento, which, in this scenario, acts as the Solr client; someone installed and configured the Solr connector and, ok, everything is working: when the user submits a search, the connector forwards the request to Solr, which in turn executes a (single) query according to the configuration.

The context

Now, imagine that the query above returns no results. The whole request / response interaction is over, and the user will see something like “Sorry, no results for your search”. Although this sounds perfectly reasonable, in this post we will focus on a different approach, based on the “invisible queries” idea you can read about in the extract above. The main point here is a precondition: I cannot change the client code, for example because:

  • I don’t want to introduce custom code in my Magento / Drupal instance
  • I don’t know PHP
  • I’m strictly responsible for the search infrastructure and the frontend developer doesn’t want / is not able to properly implement this feature on the client side
  • I want to move as much as possible the search logic in Solr

What I’d like to do is to provide a single entry point (i.e. one single request handler) to my clients, able to execute a workflow like this:

[Figure: Invisible Queries Apache Solr]

The CompositeRequestHandler

The underlying idea is to provide a Facade which is able to chain several handlers; something like this:

<requestHandler name="/search" class="...CompositeRequestHandler">
    <str name="chain">/rh1,/rh2,/rh3</str>
</requestHandler>

where /rh1, /rh2 and /rh3 are standard SearchHandler instances you’ve already declared, that you want to chain in the workflow described in the diagram above.

The CompositeRequestHandler implementation is actually simple: its handleRequestBody method executes the configured handler references sequentially, and it breaks the chain after receiving the first positive query response (usually a query response with numFound > 0, but the latest version of the component also allows you to configure other predicates). The logic is something like this:

chain.stream()
    // Get the request handler associated with a given name
    .map(refName -> requestHandler(request, refName))
    // Only SearchHandler instances are allowed in the chain
    .filter(SearchHandler.class::isInstance)
    // Execute the handler logic
    .map(handler -> executeQuery(request, response, params, handler))
    // Keep only positive responses (here: at least one match)
    .filter(qresponse -> howManyFound(qresponse) > 0)
    // Stop the iteration as soon as the first positive response is found
    .findFirst()
    // or, if we don’t have any positive execution, just return an empty response.
    .orElse(emptyResponse(request, response));
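
The short-circuiting behaviour of the chain can be tried out without a running Solr. The sketch below is a self-contained toy in plain Java: handlers are modelled as simple functions, and the Response class and the execute method are hypothetical names for the example, not the actual CompositeRequestHandler code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

final class HandlerChainSketch {

    /** Stand-in for a query response: only the number of matches is modelled. */
    static final class Response {
        final String handler;
        final long numFound;

        Response(String handler, long numFound) {
            this.handler = handler;
            this.numFound = numFound;
        }
    }

    /** Runs the handlers in order and returns the first response with matches. */
    static Response execute(List<Function<String, Response>> chain, String query) {
        return chain.stream()
                // Streams are lazy: a handler runs only if the previous ones failed
                .map(handler -> handler.apply(query))
                .filter(response -> response.numFound > 0)
                .findFirst()
                // No positive execution: fall back to an empty response
                .orElseGet(() -> new Response("none", 0));
    }

    public static void main(String[] args) {
        List<String> executed = new ArrayList<>();
        List<Function<String, Response>> chain = List.of(
                q -> { executed.add("/rh1"); return new Response("/rh1", 0); },
                q -> { executed.add("/rh2"); return new Response("/rh2", 3); },
                q -> { executed.add("/rh3"); return new Response("/rh3", 7); });

        Response response = execute(chain, "some query");
        // /rh2 is the first handler with results, so /rh3 is never executed.
        System.out.println(response.handler + " " + response.numFound + " " + executed);
    }
}
```

In this toy the fallback uses orElseGet, so the empty response is built only when no handler returned results.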

You can find the source code of CompositeRequestHandler in our Sease GitHub repository. As usual, any feedback is warmly welcome.