The Apache Lucene/Solr suggesters are important to Sease: we explored the topic in the past[1] and we strongly believe the autocomplete feature to be vital for a lot of search applications.
This blog post explores in detail the current status of the Lucene BlendedInfixSuggester, some bugs in the most recent version (with the solutions attached) and some possible improvements.
BlendedInfixSuggester
The BlendedInfixSuggester is an extension of the AnalyzingInfixSuggester with the additional functionality of weighting prefix matches of your query across the matched documents: a suggestion scores higher if the hit is closer to the start of the suggestion. N.B. at the current stage, only the first term in your query will affect the suggestion score.
Let's see some of the configuration parameters from the official wiki[2]:
- blenderType: used to calculate the positional weight coefficient using the position of the first matching word (a small sketch of these formulas follows this list). It can be one of:
  - position_linear: weightFieldValue*(1 - 0.10*position): matches to the start will be given a higher score (default)
  - position_reciprocal: weightFieldValue/(1+position): matches to the start will be given a score which decays faster than linear
  - position_exponential_reciprocal: weightFieldValue/pow(1+position, exponent): matches to the start will be given a score which decays faster than reciprocal
- exponent: an optional configuration variable for the position_exponential_reciprocal blenderType, used to control how fast the score will increase or decrease. Default 2.0.
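To make the three formulas concrete, here is a minimal Java sketch (illustrative only, not the actual Lucene implementation) that computes the positional coefficient for a given first-match position:

```java
// Minimal sketch of the three positional coefficient formulas listed above.
// Illustrative only, not the actual BlendedInfixSuggester code.
enum BlenderType { POSITION_LINEAR, POSITION_RECIPROCAL, POSITION_EXPONENTIAL_RECIPROCAL }

final class PositionalCoefficient {
    static double coefficient(BlenderType type, int position, double exponent) {
        switch (type) {
            case POSITION_LINEAR:
                return 1 - 0.10 * position;                    // linear decay
            case POSITION_RECIPROCAL:
                return 1.0 / (position + 1);                   // reciprocal decay
            case POSITION_EXPONENTIAL_RECIPROCAL:
                return 1.0 / Math.pow(position + 1, exponent); // fastest decay
            default:
                throw new IllegalArgumentException("Unknown blender type: " + type);
        }
    }

    public static void main(String[] args) {
        // "Video gaming: the history" matches "game" at position 1
        System.out.println(coefficient(BlenderType.POSITION_LINEAR, 1, 2.0));                 // 0.9
        System.out.println(coefficient(BlenderType.POSITION_RECIPROCAL, 1, 2.0));             // 0.5
        System.out.println(coefficient(BlenderType.POSITION_EXPONENTIAL_RECIPROCAL, 1, 2.0)); // 0.25
    }
}
```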
| | Description |
|---|---|
| Data Structure | Auxiliary Lucene Index |
| Building | For each document, the stored content from the field is analysed according to the suggestAnalyzerFieldType and then additionally EdgeNgram token filtered. Finally, an auxiliary index is built with those tokens. |
| Lookup strategy | The query is analysed according to the suggestAnalyzerFieldType. Then a phrase search is triggered against the auxiliary Lucene index. The suggestions are identified starting at the beginning of each token in the field content. |
| Suggestions returned | The entire content of the field. |
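As a rough illustration of the building step (the idea only, not the actual Lucene EdgeNGram token filter), each analysed token is expanded into its edge n-grams before being stored in the auxiliary index:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the "Building" step described above: each analysed
// token is expanded into its edge n-grams, and those grams are what the
// auxiliary index stores.
final class EdgeNgramSketch {
    static List<String> edgeNgrams(String token) {
        List<String> grams = new ArrayList<>();
        for (int end = 1; end <= token.length(); end++) {
            grams.add(token.substring(0, end)); // "v", "vi", "vid", ... "video"
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(edgeNgrams("video")); // [v, vi, vid, vide, video]
        System.out.println(edgeNgrams("game"));  // [g, ga, gam, game]
    }
}
```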
This suggester is really common nowadays as it allows providing suggestions from the middle of the field content, taking advantage of the analysis chain configured for the field.
In this way it is possible to provide suggestions that take into account synonyms, stop words, stemming and any other token filter used in the analysis, and to match the suggestion on internal tokens.
Finally, the suggestion is scored based on the position of the match.
The simple corpus of documents for the examples will be the following:

```json
[
  { "id":"44", "title":"Video gaming: the history" },
  { "id":"11", "title":"Nowadays Video games are a phenomenal economic business" },
  { "id":"55", "title":"The new generation of PC and Console Video games" },
  { "id":"33", "title":"Video games: multiplayer gaming" }
]
```

And a simple synonym mapping: multiplayer, online

Let's see some examples:
Query to autocomplete: "gaming"

Suggestions returned:
- Video gaming: the history
- Video games: multiplayer gaming
- Nowadays Video games are a phenomenal economic business
- The new generation of PC and Console Video games

Explanation: the input query is analysed and the token produced is "game". In the auxiliary index, for each field content we have the EdgeNgram tokens "v", "vi", "vid"…, "g", "ga", "gam", "game", so the match happens and the suggestions are returned. N.B. the first two suggestions are ranked higher as the matched term happens to be closer to the start of the suggestion.
Let's explore the score of each suggestion given the various blender types:

Query: gaming

| Suggestion | First Position Match | Position Linear | Position Reciprocal | Position Exponential Reciprocal |
|---|---|---|---|---|
| Video gaming: the history | 1 | 1 - 0.1*position = 0.9 | 1/(1+position) = 1/2 = 0.5 | 1/(1+position)^2 = 1/4 = 0.25 |
| Video games: multiplayer gaming | 1 | 1 - 0.1*position = 0.9 | 1/(1+position) = 1/2 = 0.5 | 1/(1+position)^2 = 1/4 = 0.25 |
| Nowadays Video games are a phenomenal economic business | 2 | 1 - 0.1*position = 0.8 | 1/(1+position) = 1/3 ≈ 0.33 | 1/(1+position)^2 = 1/9 ≈ 0.11 |
The final score of the suggestion will be:

`long score = (long) (weight * coefficient)`

N.B. the reason I highlighted the data type is that it directly affects the first bug we discuss.
Suggestion Score Approximation
The optional weightField parameter is extremely important for the BlendedInfixSuggester: it assigns each suggestion its weight, extracted from the configured field. E.g. the suggestion may come from the product name field, while the suggestion weight depends on how profitable the suggested product is:

```xml
<str name="field">productName</str>
<str name="weightField">profit</str>
```

So far, so good, but unfortunately there are two problems with that.

Bug 1 – WeightField Not Defined -> Zero Suggestion Score
How To Reproduce It: don't define any weightField in the suggester config.
Effect: the suggestion ranking is lost; all the suggestions have a score of 0 and the position of the match doesn't matter anymore.

The weightField is not a mandatory configuration for the BlendedInfixSuggester. Your use case may not involve any weight for your suggestions and you may just be interested in the positional scoring (the main reason the BlendedInfixSuggester exists in the first place). Unfortunately, this is not possible at the moment: if the weightField is not defined, each suggestion will have a weight of 0. This is because the weight associated to each document in the document dictionary is a long. If the field to extract the weight from is not defined (null), the weight returned will just be 0. This makes it impossible to differentiate between a weight that should be 0 (value extracted from the field) and no weight at all (null). A solution has been proposed here[3].
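To make the problem visible, here is a simplified, hypothetical sketch of the weight extraction described above (not the actual Lucene DocumentDictionary code): since the weight is a primitive long, "no weightField configured" and "weight equal to 0" end up indistinguishable.

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexableField;

// Hypothetical illustration of the behaviour described above, not the actual
// Lucene code: because the weight is a primitive long, a missing weightField
// and a weight of 0 look exactly the same to the suggester.
final class WeightExtractionSketch {
    static long extractWeight(Document doc, String weightField) {
        if (weightField == null) {
            return 0; // no weightField configured -> every suggestion gets weight 0
        }
        IndexableField field = doc.getField(weightField);
        return field == null ? 0 : field.numericValue().longValue();
    }
}
```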
Bug 2 – Bad Approximation Of Suggestion Score For Small Weights

There is a misleading data type cast in the score calculation for the suggestion:

`long score = (long) (weight * coefficient)`

This apparently innocent cast actually brings very nasty effects if the weight associated to a suggestion is unitary or small enough.

Weight = 1, suggestion "Video gaming: the history":
- Position Linear: 1 - 0.1*position = 0.9; 0.9 * 1 = 0.9, cast to long = 0
- Position Reciprocal: 1/(1+position) = 1/2 = 0.5; 0.5 * 1 = 0.5, cast to long = 0
- Position Exponential Reciprocal: 1/(1+position)^2 = 1/4 = 0.25; 0.25 * 1 = 0.25, cast to long = 0

Weight = 2, suggestion "Video gaming: the history":
- Position Linear: 0.9 * 2 = 1.8, cast to long = 1
- Position Reciprocal: 0.5 * 2 = 1.0, cast to long = 1
- Position Exponential Reciprocal: 0.25 * 2 = 0.5, cast to long = 0

Basically you risk losing the ranking of your suggestions, reducing the score to only a few possible values: 0 or 1 (in edge cases). A solution has been proposed here[3].
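A tiny standalone Java demo (not Lucene code) makes the truncation visible:

```java
// Standalone demo of the truncation: casting (weight * coefficient) to long
// collapses small scores to 0 or 1, destroying the suggestion ranking.
public class CastTruncationDemo {
    public static void main(String[] args) {
        double[] coefficients = {0.9, 0.5, 0.25}; // linear, reciprocal, exponential reciprocal
        for (long weight : new long[]{1, 2}) {
            for (double coefficient : coefficients) {
                long buggy = (long) (weight * coefficient);   // current behaviour
                double exact = weight * coefficient;          // what we would like to keep
                System.out.printf("weight=%d coefficient=%.2f -> cast=%d exact=%.2f%n",
                        weight, coefficient, buggy, exact);
            }
        }
    }
}
```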
Multi Term Matches Handling
It is quite common to have multiple terms in the autocomplete query, so your suggester should be able to manage multiple matches in the suggestion accordingly. Given a simple corpus (composed of just the following suggestions) and the query "Mini Bar Frid", you see these suggestions:
- 1000 | Mini Bar something Fridge
- 1000 | Mini Bar something else Fridge
- 1000 | Mini Bar Fridge something
- 1000 | Mini Bar Fridge something else
- 1000 | Mini something Bar Fridge
All of these suggestions share the same weight (1000) and the same first-match position ("Mini" at position 0), so they all get the same score and the ordering is effectively arbitrary. The ranking you would ideally expect is:
- Mini Bar Fridge something
- Mini Bar Fridge something else
- Mini Bar something Fridge
- Mini Bar something else Fridge
- Mini something Bar Fridge
Positional Coefficient
Instead of taking into account just the first term position in the suggestion, it's possible to use all the matching positions of the matched terms ["mini", "bar", "fridge"]. Each position match will then affect the score based on the following (a small sketch follows this list):
- How distant the matched term position is from the ideal position match
- Query : Mini Bar Fri, Ideal Positions : [0,1,2]
- Suggestion 1 : Mini Bar something Fridge, Matched Positions:[0,1,3]
- Suggestion 2 : Mini Bar something else Fridge, Matched Positions:[0,1,4]
- Suggestion 2 will be penalised as the "Fri" match happens farther from the ideal position 2 (4 > 3)
- The earlier the positional mismatch happens, the stronger the penalty the score pays
- Query : Mini Bar Fri, Ideal Positions : [0,1,2]
- Suggestion 1 : Mini Bar something Fridge, Matched Positions:[0,1,3]
- Suggestion 2 : Mini something Bar Fridge, Matched Positions:[0,2,3]
- Suggestion 2 will be additionally penalised as its first positional mismatch ("Bar") happens closer to the beginning of the suggestion
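As a rough illustration of this idea (my own sketch, not the patch attached to the Jira issue[4]), the coefficient could penalise each matched term by its distance from the ideal position, weighting mismatches on earlier terms more heavily:

```java
// Illustrative sketch only (not the LUCENE-8347 patch): penalise each matched
// term by its distance from the ideal position i, and make mismatches on
// earlier terms (small i) cost more than the same mismatch later on.
final class MultiTermPositionalCoefficient {
    static double coefficient(int[] matchedPositions) {
        double coefficient = 1.0;
        for (int i = 0; i < matchedPositions.length; i++) {
            int distance = Math.abs(matchedPositions[i] - i); // ideal position is i
            coefficient -= distance / (10.0 * (i + 1));       // earlier mismatch -> bigger penalty
        }
        return Math.max(coefficient, 0.0);
    }

    public static void main(String[] args) {
        // Query: "Mini Bar Fri", ideal positions [0, 1, 2]
        System.out.println(coefficient(new int[]{0, 1, 3})); // Mini Bar something Fridge      -> ~0.967
        System.out.println(coefficient(new int[]{0, 1, 4})); // Mini Bar something else Fridge -> ~0.933
        System.out.println(coefficient(new int[]{0, 2, 3})); // Mini something Bar Fridge      -> ~0.917
    }
}
```

With these numbers the two penalties described above both hold: the farther "Fri" match scores lower, and the suggestion whose mismatch starts earlier is penalised the most.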
Token Count Coefficient
Let's focus on the first three ranked suggestions we just saw:

Query: Mini Bar Fri
- 100 | Mini Bar Fridge something
- 100 | Mini Bar Fridge something else
- 100 | Mini Bar Fridge a a a a a a a a a a a a a a a a a a a a a a

Intuitively, this is the order we want in order to break the ties: the closer the number of matched terms is to the total number of terms in the suggestion, the better. Ideally we want our top scoring suggestion to contain just the matched terms, if possible. We also don't want to introduce strong inconsistencies for the other suggestions; ideally we should only affect the ties. This is achievable by calculating an additional coefficient, dependent on the term counts:

Token Count Coefficient = matched terms count / total terms count

Then we can scale this value accordingly:
- 90% of the final score will derive from the positional coefficient
- 10% of the final score will derive from the token count coefficient

Query: Mini Bar Fri
- 90 * 1.0 + 10 * 3/4 = 97.5 | Mini Bar Fridge something
- 90 * 1.0 + 10 * 3/5 = 96 | Mini Bar Fridge something else
- 90 * 1.0 + 10 * 3/25 = 91.2 | Mini Bar Fridge a a a a a a a a a a a a a a a a a a a a a a
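Here is a short sketch of how the blend could look (again an illustration only, with the 90/10 split hard-coded as an assumption rather than a tuned value):

```java
// Illustrative sketch only: blend a positional coefficient (90%) with a
// token count coefficient (10%) to break ties between suggestions that
// match in the same positions.
final class BlendedRankingSketch {
    static double score(double positionalCoefficient, int matchedTerms, int totalTerms) {
        double tokenCountCoefficient = (double) matchedTerms / totalTerms;
        return 90 * positionalCoefficient + 10 * tokenCountCoefficient;
    }

    public static void main(String[] args) {
        // Query: "Mini Bar Fri" -> positional coefficient 1.0 for all three suggestions
        System.out.println(score(1.0, 3, 4));  // Mini Bar Fridge something            -> 97.5
        System.out.println(score(1.0, 3, 5));  // Mini Bar Fridge something else       -> 96.0
        System.out.println(score(1.0, 3, 25)); // Mini Bar Fridge a a a ... (22 times) -> 91.2
    }
}
```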
It will require some additional tuning, but the overall idea should bring a better ranking function to the BlendedInfixSuggester when multiple term matches are involved! If you have any suggestions, feel free to leave a comment below!

Code is available in the GitHub pull request attached to the Lucene Jira issue[4].

[1] Solr Autocomplete
[2] Blended Infix Suggester Solr Wiki
[3] LUCENE-8343
[4] LUCENE-8347