The Apache Lucene/Solr suggesters are important to Sease: we explored the topic in the past with this post about Solr autocomplete and we strongly believe the autocomplete feature to be vital for a lot of search applications.
This blog post explores in detail the current status of the Lucene BlendedInfixSuggester, some bugs of the most recent version (with the solution attached) and some possible improvements.
BlendedInfixSuggester
The BlendedInfixSuggester is an extension of the AnalyzingInfixSuggester with the additional functionality to weight prefix matches of your query across the matched documents.
It scores higher if a hit is closer to the start of the suggestion.
N.B. At the current stage, only the first term in your query will affect the suggestion score.
The suggester exposes the following configuration parameters (a minimal Java sketch of the scoring formulas follows the list):
- blenderType: used to calculate the positional weight coefficient using the position of the first matching word. Can be one of:
  - position_linear: weightFieldValue * (1 - 0.10*position): matches closer to the start will be given a higher score (default)
  - position_reciprocal: weightFieldValue / (1 + position): matches closer to the start will be given a score which decays faster than linear
  - position_exponential_reciprocal: weightFieldValue / pow(1 + position, exponent): matches closer to the start will be given a score which decays faster than reciprocal
- exponent: an optional configuration variable for the position_exponential_reciprocal blenderType, used to control how fast the score decays with the position. Default 2.0.
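For quick reference, here is a minimal Java sketch of the three coefficient formulas above (plain arithmetic only, not the Lucene implementation; the resulting coefficient is then multiplied by the weightFieldValue):

```java
public class BlenderCoefficients {

    /** position_linear: higher coefficient for matches closer to the start. */
    static double positionLinear(int position) {
        return 1 - 0.10 * position;
    }

    /** position_reciprocal: decays faster than linear. */
    static double positionReciprocal(int position) {
        return 1.0 / (1 + position);
    }

    /** position_exponential_reciprocal: decays faster than reciprocal (default exponent = 2.0). */
    static double positionExponentialReciprocal(int position, double exponent) {
        return 1.0 / Math.pow(1 + position, exponent);
    }

    public static void main(String[] args) {
        for (int position = 0; position <= 3; position++) {
            System.out.printf("position=%d linear=%.2f reciprocal=%.2f expReciprocal=%.2f%n",
                    position,
                    positionLinear(position),
                    positionReciprocal(position),
                    positionExponentialReciprocal(position, 2.0));
        }
    }
}
```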
| | Description |
|---|---|
| Data Structure | Auxiliary Lucene Index |
| Building | For each document, the stored content of the field is analysed according to the suggestAnalyzerFieldType and then additionally EdgeNgram token-filtered. Finally, an auxiliary index is built with those tokens. |
| Lookup strategy | The query is analysed according to the suggestAnalyzerFieldType. Then a phrase search is triggered against the auxiliary Lucene index. The suggestions are identified starting at the beginning of each token in the field content. |
| Suggestions returned | The entire content of the field. |
This suggester is very popular nowadays as it allows us to provide suggestions that match in the middle of the field content, taking advantage of the analysis chain configured on the field. In this way it is possible to provide suggestions that take synonyms, stop words, stemming and any other token filter used in the analysis into account, and to match the suggestion on internal tokens. Finally, the suggestion is scored based on the position of the match. The simple corpus of documents for the examples will be the following (a minimal indexing and lookup sketch in Java follows the corpus):
[
  {
    "id": "44",
    "title": "Video gaming: the history"
  },
  {
    "id": "11",
    "title": "Nowadays Video games are a phenomenal economic business"
  },
  {
    "id": "55",
    "title": "The new generation of PC and Console Video games"
  },
  {
    "id": "33",
    "title": "Video games: multiplayer gaming"
  }
]
And a simple synonym mapping: multiplayer, online
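To make the mechanics concrete, here is a minimal, self-contained Java sketch assuming a Lucene 8.x-style API. The field name "weight", the use of EnglishAnalyzer (standing in for the article's suggestAnalyzerFieldType) and the omission of the synonym mapping are my own simplifications; the titles are the corpus above:

```java
import java.util.List;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.suggest.DocumentDictionary;
import org.apache.lucene.search.suggest.Lookup.LookupResult;
import org.apache.lucene.search.suggest.analyzing.BlendedInfixSuggester;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class BlendedInfixExample {

    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new EnglishAnalyzer(); // stems "gaming"/"games" to "game"

        // 1) Index the example corpus: "title" must be stored so the
        //    DocumentDictionary can read the suggestion text back.
        Directory mainIndex = new ByteBuffersDirectory();
        try (IndexWriter writer = new IndexWriter(mainIndex, new IndexWriterConfig(analyzer))) {
            String[] titles = {
                "Video gaming: the history",
                "Nowadays Video games are a phenomenal economic business",
                "The new generation of PC and Console Video games",
                "Video games: multiplayer gaming"
            };
            for (String title : titles) {
                Document doc = new Document();
                doc.add(new TextField("title", title, Field.Store.YES));
                // Suggestion weight; 10 so the positional coefficient is not
                // wiped out by the long cast discussed later in this post.
                doc.add(new StoredField("weight", 10L));
                writer.addDocument(doc);
            }
        }

        // 2) Build the suggester: an auxiliary Lucene index is created under the hood.
        Directory suggesterIndex = new ByteBuffersDirectory();
        BlendedInfixSuggester suggester = new BlendedInfixSuggester(suggesterIndex, analyzer);
        try (DirectoryReader reader = DirectoryReader.open(mainIndex)) {
            suggester.build(new DocumentDictionary(reader, "title", "weight"));
        }

        // 3) Lookup: "gaming" matches internal tokens; suggestions whose match
        //    is closer to the start of the title score higher.
        List<LookupResult> results = suggester.lookup("gaming", 10, true, false);
        for (LookupResult result : results) {
            System.out.println(result.value + " | " + result.key);
        }
        suggester.close();
    }
}
```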
Let’s see some examples:
| Query to autocomplete | Suggestions | Explanation |
|---|---|---|
| “gaming” | Video gaming: the history; Video games: multiplayer gaming; Nowadays Video games are a phenomenal economic business; The new generation of PC and Console Video games | The input query is analysed and the token produced is: “game”. In the auxiliary index, for each field content, we have the EdgeNgram tokens: “v”, “vi”, “vid”…, “g”, “ga”, “gam”, “game”. So the match happens and the suggestions are returned. N.B. the first two suggestions are ranked higher as the matched term happens to be closer to the start of the suggestion. |
Let’s explore the score of each Suggestion given various Blender Types:
Query: gaming

| Suggestion | First Position Match | Position Linear | Position Reciprocal | Position Exponential Reciprocal |
|---|---|---|---|---|
| Video gaming: the history | 1 | 1 - 0.1*position = 0.9 | 1/(1+position) = 1/2 = 0.5 | 1/(1+position)^2 = 1/4 = 0.25 |
| Video games: multiplayer gaming | 1 | 1 - 0.1*position = 0.9 | 1/(1+position) = 1/2 = 0.5 | 1/(1+position)^2 = 1/4 = 0.25 |
| Nowadays Video games are a phenomenal economic business | 2 | 1 - 0.1*position = 0.8 | 1/(1+position) = 1/3 ≈ 0.33 | 1/(1+position)^2 = 1/9 ≈ 0.11 |
The final score of the suggestion will be:
long score = (long) (weight * coefficient)
N.B. the reason I highlighted the data type is that it directly affects the bugs we are about to discuss.
Suggestion Score Approximation
The optional weightField parameter is extremely important for the Blended Infix Suggester.
It assigns the value of the suggestion weight (extracted from the mentioned field).
e.g.
The suggestion may come from the product name field, but the suggestion weight depends on how profitable the product suggested is.
<str name="field">productName</str>
<str name="weightField">profit</str>
So far, so good, but unfortunately there are two problems with that.
Bug 1 - WeightField Not Defined -> Zero suggestion score
How To Reproduce It: Don’t define any weightField in the suggester config.
Effect: the suggestion ranking is lost, all the suggestions have a 0 score, position of the match doesn’t matter anymore.
The weightField is not a mandatory configuration for the BlendedInfixSuggester.
Your use case might not involve any weight for your suggestions: you may just be interested in the positional scoring (the main reason the BlendedInfixSuggester exists in the first place).
Unfortunately, this is not possible at the moment:
If the weightField is not defined, each suggestion will have a weight of 0.
This is because the weight associated with each document in the document dictionary is a primitive long. If the field to extract the weight from is not defined (null), the weight returned will just be 0.
This doesn’t allow us to differentiate a weight that should be 0 (an actual value extracted from the field) from a missing weight (no value at all).
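To see the effect, here is a tiny plain-Java sketch (not the actual Lucene code path) of what happens when the weight silently defaults to 0: whatever positional coefficient the blender computes, the final score collapses to 0 for every suggestion.

```java
public class MissingWeightFieldDemo {
    public static void main(String[] args) {
        // Positional coefficients of three suggestions (position_linear, positions 1, 1 and 2).
        double[] coefficients = {0.9, 0.9, 0.8};

        // With no weightField configured, the dictionary hands back 0 instead of
        // "no weight at all", because the weight is a primitive long.
        long weight = 0;

        for (double coefficient : coefficients) {
            long score = (long) (weight * coefficient); // same cast used for the suggestion score
            System.out.println("coefficient=" + coefficient + " -> score=" + score); // always 0
        }
    }
}
```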
A solution has been proposed here [3].
Bug 2 - Bad Approximation Of Suggestion Score For Small Weights
There is a misleading data type casting in the score calculation for the suggestion:
long score = (long) (weight * coefficient)
This apparently innocent cast actually brings very nasty effects if the weight associated with a suggestion is unitary or small enough.
Weight = 1, suggestion “Video gaming: the history”:
- position_linear: (1 - 0.1*position) * weight = 0.9 * 1 = 0.9 → cast → 0
- position_reciprocal: 1/(1+position) * weight = 0.5 * 1 = 0.5 → cast → 0
- position_exponential_reciprocal: 1/(1+position)^2 * weight = 0.25 * 1 = 0.25 → cast → 0

Weight = 2, suggestion “Video gaming: the history”:
- position_linear: (1 - 0.1*position) * weight = 0.9 * 2 = 1.8 → cast → 1
- position_reciprocal: 1/(1+position) * weight = 0.5 * 2 = 1.0 → cast → 1
- position_exponential_reciprocal: 1/(1+position)^2 * weight = 0.25 * 2 = 0.5 → cast → 0
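The same truncation can be reproduced with a few lines of plain Java (a sketch of the arithmetic above, not the actual Lucene code path):

```java
public class SmallWeightTruncationDemo {
    public static void main(String[] args) {
        // Coefficients for a match at position 1: linear, reciprocal, exponential reciprocal.
        double[] coefficients = {0.9, 0.5, 0.25};
        for (long weight : new long[] {1, 2}) {
            for (double coefficient : coefficients) {
                long score = (long) (weight * coefficient); // the problematic cast
                System.out.printf("weight=%d coefficient=%.2f -> score=%d%n", weight, coefficient, score);
            }
        }
    }
}
```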
Basically, you risk losing the ranking of your suggestions, as the score is reduced to only a few possible values: 0 or 1 (in edge cases).
A solution has been proposed here [3].
Multi Term Matches Handling
It is quite common to have multiple terms in the autocomplete query, so your suggester should be able to manage multiple matches in the suggestion accordingly.
Given a simple corpus (composed of just the following suggestions) and the query:
“Mini Bar Frid”
You see these suggestions:
- 1000 | Mini Bar something Fridge
- 1000 | Mini Bar something else Fridge
- 1000 | Mini Bar Fridge something
- 1000 | Mini Bar Fridge something else
- 1000 | Mini something Bar Fridge
This is because, at the moment, the first matching term wins it all (the positions of the other matches are ignored).
This brings a lot of possible ties (everything scores 1000) that should be broken to give the user a nice and intuitive ranking.
Intuitively, I would instead expect something like this in the results (note that allTermsRequired=true and the schema weight field always returns 1000):
- Mini Bar Fridge something
- Mini Bar Fridge something else
- Mini Bar something Fridge
- Mini Bar something else Fridge
- Mini something Bar Fridge
Let’s see a proposed Solution [4]:
Positional Coefficient
Instead of taking into account just the first term position in the suggestion, it is possible to use all the matching positions of the matched terms [“mini”, “bar”, “fridge”].
Each position match will affect the score according to:
- How distant the matched term position is from the ideal position match
  - Query: Mini Bar Fri, Ideal Positions: [0,1,2]
  - Suggestion 1: Mini Bar something Fridge, Matched Positions: [0,1,3]
  - Suggestion 2: Mini Bar something else Fridge, Matched Positions: [0,1,4]
  - Suggestion 2 will be penalised as the “Fri” match happens further (4 > 3) from the ideal position 2
- How early the mis-position happens: the earlier it happens, the stronger the penalty to pay
  - Query: Mini Bar Fri, Ideal Positions: [0,1,2]
  - Suggestion 1: Mini Bar something Fridge, Matched Positions: [0,1,3]
  - Suggestion 2: Mini something Bar Fridge, Matched Positions: [0,2,3]
  - Suggestion 2 will be additionally penalised as the first mismatch in positions (Bar) happens closer to the beginning of the suggestion
Considering only the discontinuous positions proved useful:
Query 1: Bar some
Query 2: some
Suggestion: Mini Bar something Fridge
Query 1 Suggestion Matched Terms positions: [1,2]
Query 2 Suggestion Matched Terms positions: [2]
If we compare the suggestion scores for these two queries, it would seem unfair to penalise the first one just because it matches two (consecutive) terms, while the second query has only one match (positioned worse than the first match in Query 1).
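The exact formula lives in the pull request attached to [4]; purely as an illustration of the two factors above, here is one hypothetical way such a positional coefficient could be computed. The class and method names and the decay factors are my own assumptions, and the discontinuity refinement is deliberately left out for brevity; the sketch reproduces the relative ordering of the results below, not their exact scores:

```java
import java.util.List;

public class PositionalCoefficientSketch {

    /**
     * Hypothetical positional coefficient: every matched query term is compared
     * with its ideal position (0, 1, 2, ...). Terms matching further away from
     * their ideal position pay a penalty, and the penalty is stronger when the
     * mis-position happens early in the suggestion. This simplified sketch does
     * NOT implement the "discontinuous positions only" refinement described above.
     */
    static double positionalCoefficient(List<Integer> matchedPositions) {
        double coefficient = 1.0;
        for (int i = 0; i < matchedPositions.size(); i++) {
            int ideal = i;                                  // ideal position of the i-th query term
            int distance = matchedPositions.get(i) - ideal; // how far the match drifted
            if (distance > 0) {
                // (1 + distance): further from the ideal position -> bigger penalty
                // (1 + 1/(1 + ideal)): earlier mis-positions -> bigger penalty
                coefficient /= (1.0 + distance) * (1.0 + 1.0 / (1.0 + ideal));
            }
        }
        return coefficient;
    }

    public static void main(String[] args) {
        // Query: "Mini Bar Fri" -> ideal positions [0, 1, 2]
        System.out.println(positionalCoefficient(List.of(0, 1, 2))); // Mini Bar Fridge something
        System.out.println(positionalCoefficient(List.of(0, 1, 3))); // Mini Bar something Fridge
        System.out.println(positionalCoefficient(List.of(0, 1, 4))); // Mini Bar something else Fridge
        System.out.println(positionalCoefficient(List.of(0, 2, 3))); // Mini something Bar Fridge
    }
}
```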
Introducing this advanced positional coefficient calculation helped improve the overall behaviour in the experimental tests created.
The results obtained were quite promising:
Query: Mini Bar Fri
100 |Mini Bar Fridge something
100 |Mini Bar Fridge something else
100 |Mini Bar Fridge a a a a a a a a a a a a a a a a a a a a a a
26 |Mini Bar something Fridge
22 |Mini Bar something else Fridge
17 |Mini something Bar Fridge
8 |something Mini Bar Fridge
7 |something else Mini Bar Fridge
There is still a tie for the exact prefix matches, but let’s see if we can also finalise that improvement.
Token Count Coefficient
Let’s focus on the top three ranked suggestions we just saw:
Query: Mini Bar Fri
100 |Mini Bar Fridge something
100 |Mini Bar Fridge something else
100 |Mini Bar Fridge a a a a a a a a a a a a a a a a a a a a a a
Intuitively, this is exactly the order we want once the ties are broken.
The closer the number of matched terms is to the total number of terms in the suggestion, the better.
Ideally, we want our top-scoring suggestion to contain just the matched terms, if possible.
We also don’t want to introduce strong inconsistencies for the other suggestions; ideally we should only affect the ties.
This is achievable by calculating an additional coefficient, dependent on the term counts:
Token Count Coefficient = matched terms count / total terms count.
Then we can scale this value accordingly:
90% of the final score will derive from the positional coefficient
10% of the final score will derive from the token count coefficient
Query: Mini Bar Fri
90 * 1.0 + 10 * 3/4 = 97.5 → 97 | Mini Bar Fridge something
90 * 1.0 + 10 * 3/5 = 96.0 → 96 | Mini Bar Fridge something else
90 * 1.0 + 10 * 3/25 = 91.2 → 91 | Mini Bar Fridge a a a a a a a a a a a a a a a a a a a a a a
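A minimal sketch of this blending, reproducing the numbers above (the 90/10 split comes from the text; the truncation to a long score is my assumption, mirroring the cast discussed earlier):

```java
public class TokenCountCoefficientDemo {

    /** 90% positional coefficient + 10% token count coefficient. */
    static long blendedScore(double positionalCoefficient, int matchedTerms, int totalTerms) {
        double tokenCountCoefficient = (double) matchedTerms / totalTerms;
        return (long) (90 * positionalCoefficient + 10 * tokenCountCoefficient);
    }

    public static void main(String[] args) {
        // Query: "Mini Bar Fri" -> 3 matched terms, positional coefficient 1.0 for all three suggestions
        System.out.println(blendedScore(1.0, 3, 4));  // 97 -> Mini Bar Fridge something
        System.out.println(blendedScore(1.0, 3, 5));  // 96 -> Mini Bar Fridge something else
        System.out.println(blendedScore(1.0, 3, 25)); // 91 -> Mini Bar Fridge a a a ... (25 terms)
    }
}
```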
It will require some additional tuning, but the overall idea should bring a better ranking function to the BlendedInfixSuggester when multiple-term matches are involved!
If you have any suggestions, feel free to leave a comment below!
Code is available in the Github Pull Request attached to the Lucene Jira issue [4].
Need Help With This Topic?
If you’re struggling with BlendedInfixSuggester, don’t worry – we’re here to help! Our team offers expert services and training to help you optimize your search engine and get the most out of your system. Contact us today to learn more!





