As you can see, the synonym (out of warranty -> oow) is correctly detected, but the stopword filter removes every “of” token, including the first occurrence, which is part of the synonym. The generated query shows the sneaky effect: the “hole” left by removing the first “of” pulls the next available token in the stream (“something”, in the example) into the phrase query.
In other words, the oow synonym token is marked with positionLength = 3, which correctly means it spans three tokens (1=out, 2=of, 3=warranty). Later, the query parser takes the next three available terms to build the synonym phrase query, but since the 2nd token (of) is no longer there, the count also includes “something”, which has become the 3rd available token in the stream.
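To make the counting problem concrete, here is a minimal, self-contained Python sketch. It is not Lucene code: the Token class, the analyze, stop_filter and naive_phrase helpers, and the input text "out of warranty something" are all assumptions made for illustration, and naive_phrase only mimics the "take the next positionLength available terms" behaviour described above.

```python
from dataclasses import dataclass

@dataclass
class Token:
    term: str
    position: int             # absolute position in the token stream
    position_length: int = 1  # number of positions the token spans

STOPWORDS = {"of"}  # hypothetical stopword list

def analyze(text, synonyms):
    """Split on whitespace and inject each multi-word synonym where it starts."""
    words = text.split()
    tokens = [Token(w, i) for i, w in enumerate(words)]
    for phrase, synonym in synonyms.items():
        parts = phrase.split()
        for i in range(len(words) - len(parts) + 1):
            if words[i:i + len(parts)] == parts:
                tokens.append(Token(synonym, i, len(parts)))
    return tokens

def stop_filter(tokens):
    """Drop stopwords without touching the synonym's position_length."""
    return [t for t in tokens if t.term not in STOPWORDS]

def naive_phrase(tokens, synonym):
    """Simplified model: take the next `position_length` single-position
    tokens still available from the synonym's start position onwards."""
    singles = [t for t in tokens
               if t.position_length == 1 and t.position >= synonym.position]
    return [t.term for t in singles[:synonym.position_length]]

tokens = analyze("out of warranty something", {"out of warranty": "oow"})
oow = next(t for t in tokens if t.term == "oow")

before = naive_phrase(tokens, oow)
after = naive_phrase(stop_filter(tokens), oow)

print(before)  # ['out', 'of', 'warranty']
print(after)   # ['out', 'warranty', 'something']
```

The synonym still claims a length of three positions after filtering, so the third slot is filled by whatever term happens to come next.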
Before proceeding: this is a known, long-standing issue [1] in Lucene. Its scope is broader than stopwords because it relates to FilteringTokenFilter, the superclass of StopFilter.
The problem we will try to solve is: how can we manage synonyms and stopwords at query time without generating the conflict above?
Mousavi
August 26, 2018
Excellent!
Mousavi
August 26, 2018
Is there any available source code? Thank you
Andrea Gazzarini
August 26, 2018
I’m sorry, we are reorganising the GitHub repository and the token filter described in the article is not there yet. However, if you have some dev skills, you can use the code embedded in the article.
richa
March 25, 2019
After using this code, even the stop words are searchable now
richa
March 25, 2019
Can you also provide a solution where a stopword inside a synonym works fine, but a search for a stopword on its own does not return any results?
Andrea Gazzarini
April 10, 2019
Hi, thanks for your comment.
I double-checked the code and executed it with a short example, and I can confirm that it is working. Could you please expand a bit on what you’re trying to do?
Andrés
October 18, 2021
Hi Andrea,
I was trying the conditional token filter in Elasticsearch as follows:
{"type": "condition",
"filter": ["stop"],
"script": {"source": "token.getType() != \"SYNONYM\""}}
And it seems to work. Do you think this approach is valid?
Andrea Gazzarini
October 26, 2021
Hi Andrés,
nice shot: I think so! I went through my post and re-executed the examples using that approach. It works as expected.
Andrea