The Context
A brief recap of where we arrived in the preceding article. We had the following synonym and stopword settings:
- synonyms = {“out of warranty”, “oow”}
- stopwords = {“of”}
Both of those filters were configured exclusively at query time; the synonym filter first and then the stopwords filter.
Using the built-in StopFilter, we had a synonym detection issue because the “of” term was removed from the query string (e.g. “my device ran out of warranty”). For that reason, we introduced a custom StopFilter subclass that is aware of stopwords inside synonyms.
The other scenario we are going to describe is a little bit different: let’s suppose we have the following data:
- synonyms = {test code, tdd, testing}
- stopwords = {my, your, how, to, in}
Here too, we want to manage synonyms and stopwords only at query time.
We have this document indexed:
{
"id": 1,
"title": "Java programmer: do you want to test your code?"
}
And a query like this:
"how to test code in Java?"
The Problem: missing synonym match
The query parser matches the “test code” synonym in the query and produces a query like this:
(title:tdd title:testing PhraseQuery(title:"test code")) title:java
Unfortunately, there’s no match, because the document title contains an intruder: the “your” term between “test” and “code”.
A Solution: invisible queries with and without synonym phrases
In the preceding article, we’ve underlined the role of the autoGeneratePhraseQueries flag. It is responsible for creating phrase clauses for all detected multi-term synonyms. In case this flag is set to false (or even missing) the generated query won’t have any phrase, even if a multi-term synonym is detected.
While this is usually not what you would expect, in this specific case it could be a valid alternative for dealing with such a mismatch: a first request would require the “synonym phrasing” behaviour, but a second one wouldn’t. The first query would be:
(title:tdd title:testing PhraseQuery(title:"test code")) title:java
After receiving an empty response, a second query will be sent, targeting another (similar) field whose field type has the autoGeneratePhraseQueries parameter set to false. That would generate the following query:
(title:testing title:tdd (+title:test +title:code)) title:java
and here we would get a match!
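The difference between the two synonym clauses can be illustrated with a small simulation. This is only a sketch: the `tokenize`, `phrase_match` and `all_terms_match` helpers are hypothetical stand-ins for what the analyzer and Lucene do, and only the synonym clause of each query is evaluated (the `title:java` clause is left out).

```python
import re

def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (a stand-in for a real analyzer)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def phrase_match(tokens, phrase):
    """True if the phrase terms occur at consecutive positions, in order."""
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))

def all_terms_match(tokens, terms):
    """True if every term occurs somewhere, in any order and at any distance."""
    return set(terms) <= set(tokens)

# No stopword removal at index time, so "your" stays in the indexed title.
title = tokenize("Java programmer: do you want to test your code?")

# Synonym clause of the first query: tdd OR testing OR "test code" (phrase)
first = ("tdd" in title or "testing" in title
         or phrase_match(title, ["test", "code"]))

# Synonym clause of the second query: tdd OR testing OR (+test +code)
second = ("tdd" in title or "testing" in title
          or all_terms_match(title, ["test", "code"]))

print(first, second)  # False True
```

The phrase clause fails because “your” sits between the two terms, while the conjunctive clause only asks for both terms to be present.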
A couple of notes:
- On the second try, we require both terms (“test” and “code”) to be present, but in whatever order and with whatever proximity, so the increased recall could produce some unexpected results. If we are using the edismax query parser, a “pf” parameter would help move up those results which adhere better to the entered query, in terms of proximity and term order.
- We could put the stop filter at index time, but that violates the precondition: we want pure query-time management.
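As a rough illustration of the first note, the second request could carry edismax phrase-boost parameters. The sketch below only builds the request URL; the `/search` path is the one used in this article, the field name follows the naming used for the handler chain, and the `^5` boost is an arbitrary value chosen for the example, not a recommendation.

```python
from urllib.parse import urlencode

# Hypothetical parameters for the second (fallback) request.
params = {
    "q": "how to test code in Java",
    "defType": "edismax",
    # Field searched term by term (no synonym phrases generated).
    "qf": "title_without_synonyms_phrases",
    # Phrase boost: documents where the query terms appear close together,
    # in order, score higher. The boost factor is an assumption.
    "pf": "title_without_synonyms_phrases^5",
}

url = "/search?" + urlencode(params)
print(url)
```

Note that “pf” only affects ranking: it does not restrict the result set, so the extra recall of the second query is preserved.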
How can we implement such a search workflow? In Solr, we need a couple of fields: the first is exactly the field and field type we described in the preceding article; the second is similar, the only difference being the autoGeneratePhraseQueries parameter, which is set to false.
Then, here is the minimal request handler. Its parameters are set to:
- false
- title_with_synonyms_phrases (the default search field)
- lucene (the query parser)
A client would first send a request like this:
/search?q=how to test code in Java
And, after receiving an empty response, it will send a second query:
/search?q=how to test code in Java&df=title_without_synonyms_phrases
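On the client side, the two-step workflow could look roughly like the following. It is a minimal sketch under stated assumptions: `search_with_fallback` and `fake_fetch` are hypothetical names, the HTTP call is injected as a function so the logic stays independent of any client library, and the response is assumed to be Solr's standard JSON format, which reports the hit count under `response.numFound`.

```python
from urllib.parse import urlencode

def search_with_fallback(query, fetch,
                         fallback_field="title_without_synonyms_phrases"):
    """Run the phrase-aware query first; retry on the fallback field if it finds nothing.

    `fetch` is a caller-supplied function (hypothetical here) that takes a
    request path and returns the parsed Solr JSON response as a dict.
    """
    first = fetch("/search?" + urlencode({"q": query}))
    if first["response"]["numFound"] > 0:
        return first
    # Zero hits: same query, but against the field without synonym phrases.
    return fetch("/search?" + urlencode({"q": query, "df": fallback_field}))

def fake_fetch(path):
    """Stub standing in for a real HTTP call: only the fallback request finds a hit."""
    hits = 1 if "df=" in path else 0
    return {"response": {"numFound": hits, "docs": []}}

result = search_with_fallback("how to test code in Java", fake_fetch)
print(result["response"]["numFound"])  # 1
```

Injecting `fetch` keeps the retry decision in one place; a real client would pass a function that performs the HTTP request and parses the JSON body.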
Another option, which moves the search workflow to the Solr side, is our CompositeRequestHandler, a Solr component which invokes a chain of RequestHandler instances: a first request handler, targeting the title_with_synonyms_phrases field, would be invoked and, in case of zero results, the same query would be sent to another request handler, targeting the title_without_synonyms_phrases field.
Note for Elasticsearch users: you will find some differences in applying what is described above. Although the auto_generate_phrase_queries attribute is also present in Elasticsearch, it doesn’t have the same effect. What you’re looking for is not a field type attribute but a query attribute [2] [3], called auto_generate_synonyms_phrase_query.
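For Elasticsearch, the equivalent of the second request could be expressed as a match query with that attribute switched off. The body below is a sketch only: it assumes a `title` field whose search analyzer applies the synonym and stop filters, and it just builds and prints the JSON request body without sending it.

```python
import json

# Request body for the fallback search: multi-term synonyms are expanded
# as conjunctions of terms instead of phrase queries.
body = {
    "query": {
        "match": {
            "title": {  # assumed field name
                "query": "how to test code in Java",
                "auto_generate_synonyms_phrase_query": False,
            }
        }
    }
}

print(json.dumps(body, indent=2))
```

Leaving the attribute out (it defaults to true) would give the behaviour of the first request, with phrase clauses generated for multi-term synonyms.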
Andrea Gazzarini
We are Sease, an Information Retrieval Company based in London, focused on providing R&D project guidance and implementation, Search consulting services, Training, and Search solutions using open source software like Apache Lucene/Solr, Elasticsearch, OpenSearch and Vespa.





