The contribution is detailed in the following official Jira issues:
SOLR-12238 [1]
LUCENE-9171 [2]
The code review and merge process was tracked in the GitHub pull request [3].
These new features will be available in Apache Lucene/Solr 8.5.
Most of the changes are on the Lucene side:
– a new token filter that extracts the weight and stores it as a token boost attribute
– query building that checks for boost attributes and, when present, uses them to build boosted queries
This makes the contribution usable by both Apache Solr and Elasticsearch (coming soon).
On the Solr side, the change affected the Solr base query parser, making it compatible with the synonym query style approach.
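To make the two Lucene-side steps concrete, here is a minimal, Lucene-free sketch of the idea: split a weight off each token, then render boosted clauses when a weight is present. The class and method names and the '|' delimiter are assumptions for illustration only; this is not the code merged in the pull request.

```java
import java.util.Arrays;
import java.util.List;

/** Conceptual sketch only: mimics a delimited-boost filter plus boosted query building. */
final class DelimitedBoostSketch {

    /** A term with its boost; 1.0f means no explicit weight was given. */
    static final class BoostedToken {
        final String term;
        final float boost;

        BoostedToken(String term, float boost) {
            this.term = term;
            this.boost = boost;
        }
    }

    /** Splits "tumor|0.8" into term "tumor" and boost 0.8 (the delimiter is an assumption). */
    static BoostedToken parse(String raw, char delimiter) {
        int idx = raw.lastIndexOf(delimiter);
        if (idx < 0) {
            return new BoostedToken(raw, 1.0f); // no weight: neutral boost
        }
        return new BoostedToken(raw.substring(0, idx), Float.parseFloat(raw.substring(idx + 1)));
    }

    /** Renders the tokens as boosted clauses, adding "^boost" only when a weight is present. */
    static String toQueryString(String field, List<BoostedToken> tokens) {
        StringBuilder sb = new StringBuilder("Synonym(");
        for (int i = 0; i < tokens.size(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            BoostedToken t = tokens.get(i);
            sb.append(field).append(':').append(t.term);
            if (t.boost != 1.0f) {
                sb.append('^').append(t.boost);
            }
        }
        return sb.append(')').toString();
    }

    public static void main(String[] args) {
        List<BoostedToken> tokens = Arrays.asList(
                parse("neoplasm|0.7", '|'), parse("tumor|0.8", '|'), parse("tumour|0.6", '|'));
        System.out.println(toQueryString("name", tokens));
    }
}
```

In the real implementation the boost travels as a token attribute through the analysis chain rather than as a parsed string, so treat this only as an illustration of the data flow.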
elkon
April 8, 2020
This is a great contribution!
I was wondering if you know the answer to this question:
I would like to use Solr to index documents with term weights.
Doc1: this(w=0.3) is(w=0.4) the(w=0.1) first(w=0.7) doc(w=0.2)
Doc2: this(w=0.1) is(w=0.2) the(w=0.5) second(w=0.8) doc(w=0.1)
Note that the weight for the same term can be different for two documents.
After indexing I would like the search function to consider these weights when scoring the documents. For example, if the query is “doc”, I would like Doc1 to get a higher score.
Is this possible?
Thanks!
Alessandro Benedetti
April 14, 2020
Hi, yes, it is possible in various ways.
At indexing time, use DelimitedPayloadTokenFilter to attach the weight to each term as a payload.
At query time, you then have two options:
1) Fine-tune your boosting, for example with the edismax query parser. You can create boost factors that multiply the scores of your matching clauses using the payload function query ( https://lucene.apache.org/solr/guide/7_7/function-queries.html#payload-function ). You can do that for all your query terms and apply the math you like, multiplicative or additive.
2) Just use the Payload Score query parser: https://lucene.apache.org/solr/guide/7_5/other-parsers.html#payload-score-parser
And you should be good to go 🙂
For more info about using payloads: https://lucidworks.com/post/solr-payloads/
That blog does not make it very clear how to achieve term weighting at indexing time, but I hope my explanation clarifies it.
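As a rough illustration of the idea (not of Solr's actual scoring), the sketch below parses documents written in a delimited form such as "doc|0.2" and scores a single-term query by the stored weight. The '|' delimiter, the class name, and the "score equals payload" rule are assumptions made for the example; in Solr the payload would be combined with whatever function or parser you choose.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Conceptual sketch only: per-document term weights, as delimited payloads would carry them. */
final class PayloadWeightSketch {

    /** Parses "this|0.3 is|0.4" into term -> weight; tokens without a weight get 1.0. */
    static Map<String, Float> parseDoc(String text) {
        Map<String, Float> weights = new LinkedHashMap<>();
        for (String token : text.trim().split("\\s+")) {
            int idx = token.lastIndexOf('|');
            if (idx < 0) {
                weights.put(token, 1.0f);
            } else {
                weights.put(token.substring(0, idx), Float.parseFloat(token.substring(idx + 1)));
            }
        }
        return weights;
    }

    /** Simplified scoring rule: the score is the stored weight, or 0 when the term is absent. */
    static float score(Map<String, Float> doc, String queryTerm) {
        return doc.getOrDefault(queryTerm, 0f);
    }

    public static void main(String[] args) {
        Map<String, Float> doc1 = parseDoc("this|0.3 is|0.4 the|0.1 first|0.7 doc|0.2");
        Map<String, Float> doc2 = parseDoc("this|0.1 is|0.2 the|0.5 second|0.8 doc|0.1");
        System.out.println("Doc1: " + score(doc1, "doc") + ", Doc2: " + score(doc2, "doc"));
    }
}
```

With the weights from the question, the query "doc" ranks Doc1 above Doc2 because its stored weight for that term is larger.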
Ivana (@icca_kg)
April 14, 2020
Hi Alessandro,
Thanks for the blog and explanation.
I wonder if you tried to see how it works with different types of similarity?
I’m developing a custom Java application with Lucene 8.5.0.
When I use BM25Similarity with the delimitedBoost filter, everything works as expected, but if I switch to BooleanSimilarity nothing happens. The parsed query looks OK: it has the synonyms with the proper boost values, but the final score doesn’t change.
Thanks.
Alessandro Benedetti
April 14, 2020
Hi Ivana,
would you mind debugging the score and pasting the output here?
In Solr it is as easy as debug=results.
In Lucene you can get it using: org.apache.lucene.search.IndexSearcher#explain(org.apache.lucene.search.Query, int)
Now, from a very quick look at the Similarity classes, BM25Similarity has support for boosting:
org/apache/lucene/search/similarities/BM25Similarity.java:219
BM25Scorer(float boost, float k1, float b, Explanation idf, float avgdl, float[] cache) {
this.boost = boost;
….
this.weight = boost * idf.getValue().floatValue();
}
And so does BooleanSimilarity, as its Javadoc states:
Simple similarity that gives terms a score that is equal to their query
boost. This similarity is typically used with disabled norms since neither
document statistics nor index statistics are used for scoring. That said,
if norms are enabled, they will be computed the same way as
{@link SimilarityBase} and {@link BM25Similarity} with
{@link SimilarityBase#setDiscountOverlaps(boolean) discounted overlaps}
so that the {@link Similarity} can be changed after the index has been
created.
Ivana (@icca_kg)
April 16, 2020
Hi Alessandro,
Here’s my debug output and some additional info:
I’m using StandardAnalyzer for search, and my SynonymGraphFilter has the default configuration, as in your example.
Query: +Synonym(morphology_term_original_name_key:neoplasm^0.7 morphology_term_original_name_key:tumor^0.8 morphology_term_original_name_key:tumour^0.6)
1.0 = weight(Synonym(morphology_term_original_name:neoplasm^0.7 morphology_term_original_name:tumor^0.8 morphology_term_original_name:tumour^0.6) in 0) [BooleanSimilarity], result of:
1.0 = score(BooleanWeight), computed from:
1.0 = boost, query boost
If I use the BM25Similarity, the printout is as follows:
0.75188845 = weight(Synonym(morphology_term_original_name:neoplasm^0.7 morphology_term_original_name:tumor^0.8 morphology_term_original_name:tumour^0.6) in 0) [BM25Similarity], result of:
0.75188845 = score(freq=0.8), computed as boost * idf * tf from:
1.3862944 = idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:
1 = n, number of documents containing term
5 = N, total number of documents with field
0.5423729 = tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:
0.8 = termFreq=0.8
1.2 = k1, term saturation parameter
0.75 = b, length normalization parameter
1.0 = dl, length of field
2.4 = avgdl, average length of field
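The BM25 numbers above can be reproduced by hand, which also shows where the synonym weight ends up. In this output the 0.8 appears as freq, not as the boost (boost * idf * tf only matches the reported score with boost = 1.0). A similarity that returns only the query boost, as the BooleanSimilarity Javadoc describes, would therefore not see it. The sketch below is a plain-Java check of that arithmetic; the class name and the simplified booleanScore method are assumptions, and this is a reading of the output rather than a confirmed diagnosis.

```java
/** Recomputes the explain output above with plain arithmetic (no Lucene dependency). */
final class Bm25ExplainCheck {

    /** idf = log(1 + (N - n + 0.5) / (n + 0.5)), with natural logarithm. */
    static double idf(long n, long totalDocs) {
        return Math.log(1 + (totalDocs - n + 0.5) / (n + 0.5));
    }

    /** tf = freq / (freq + k1 * (1 - b + b * dl / avgdl)). */
    static double tf(double freq, double k1, double b, double dl, double avgdl) {
        return freq / (freq + k1 * (1 - b + b * dl / avgdl));
    }

    /** score = boost * idf * tf, as printed in the BM25 explanation. */
    static double bm25(double boost, double freq, long n, long totalDocs,
                       double k1, double b, double dl, double avgdl) {
        return boost * idf(n, totalDocs) * tf(freq, k1, b, dl, avgdl);
    }

    /** Simplified model of a boost-only similarity: freq is deliberately ignored. */
    static double booleanScore(double boost, double freq) {
        return boost;
    }

    public static void main(String[] args) {
        // Values copied from the explain output: freq=0.8, n=1, N=5, k1=1.2, b=0.75, dl=1.0, avgdl=2.4
        System.out.println("BM25:    " + bm25(1.0, 0.8, 1, 5, 1.2, 0.75, 1.0, 2.4));
        System.out.println("Boolean: " + booleanScore(1.0, 0.8));
    }
}
```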
Alessandro Benedetti
April 16, 2020
Hmm, this is suspicious. I am afraid it requires some debugging of BooleanSimilarity to understand the reason.
I don’t see any obvious mistake here!
Roope Koski
September 2, 2020
Hi Alessandro,
Thank you for this and all your other great contributions to search.
How do the weighted synonyms play along with Learning to Rank?
Anna Ruggero
April 2, 2021
Hi Roope,
First of all, you have to decide how to manage free-text queries in the Learning to Rank model. One approach is to add TF, IDF, field length, BM25 score, etc. as features.
Second, one way to incorporate synonyms in the model is to add the Solr score as a feature (since synonyms impact the Solr score), but I wouldn’t recommend this approach: the Solr score depends on several different factors, so the explainability of the model decreases.
Other options could be:
– adding TF, IDF, field length, etc. features for each synonym you use.
– adding the BM25 score computed for each synonym you use.
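As a small sketch of the per-synonym option, the code below turns per-synonym scores for one document into a fixed-order feature vector, using 0.0 when a synonym does not match. The class and method names are hypothetical, and how the per-synonym scores are obtained (for example one query per synonym) is left out; it only illustrates the shape of the features.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Conceptual sketch only: one feature per synonym, in a stable order. */
final class SynonymFeatureSketch {

    /**
     * Builds a feature vector with one slot per synonym. The order follows the
     * synonym list so every document yields features in the same positions.
     */
    static double[] featureVector(List<String> synonyms, Map<String, Double> scoreByTerm) {
        double[] features = new double[synonyms.size()];
        for (int i = 0; i < synonyms.size(); i++) {
            // A synonym that did not match the document contributes 0.0.
            features[i] = scoreByTerm.getOrDefault(synonyms.get(i), 0.0);
        }
        return features;
    }

    public static void main(String[] args) {
        List<String> synonyms = Arrays.asList("neoplasm", "tumor", "tumour");
        Map<String, Double> scores = new HashMap<>();
        scores.put("tumor", 0.75); // hypothetical per-synonym BM25 score for one document
        System.out.println(Arrays.toString(featureVector(synonyms, scores)));
    }
}
```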
There isn’t a single answer to your question; it really depends on your specific scenario, the features you have or want to use, and the search problem you are modeling.
There is a lot of freedom in solving the problem, and you can use all your creativity here.
I hope this helps a bit!