
Introducing Weighted Synonyms in Apache Lucene/Solr

This blog post is about our latest contribution to the Apache Lucene/Solr project: the ability to assign different weights to synonyms.
This contribution aims to help users who deal with complex synonym dictionaries, where it is important to associate a numerical weight with each synonym, for example to boost the ones that are more important in the domain or closer to the original concept.

Our Contribution

The contribution is detailed in the following official Jira issues:
SOLR-12238 [1]
LUCENE-9171 [2]

The code review and merge process was tracked in the GitHub Pull Request [3].

These new features will be available with Apache Lucene/Solr 8.5.

The changes happened mostly on the Lucene side:
– a new token filter that extracts the weight and stores it as a token boost attribute
– query building, which checks for boost attributes and uses them to build boosted queries when present

This makes the contribution usable by both Apache Solr and Elasticsearch (coming soon).

On the Solr side, the change affected the Solr base query parser, to make it compatible with the synonym query style approach.

Apache Solr

Configuration

Enabling query-time weighted synonyms requires two configurations:

- defining the synonyms with the associated weight in the synonyms.txt file following the syntax of the delimitedBoost token filter (you can use the managed REST API to do that if you prefer)

synonyms.txt

				
tiger, tigre|0.9
lynx => lince|0.8, lynx_canadensis|0.9

leopard, big cat|0.8, bagheera|0.9, panthera pardus|0.85
lion => panthera leo|0.9, simba leo|0.8, kimba|0.75

panthera pardus, leopard|0.6
panthera tigris => tiger|0.99

snow leopard, panthera uncia|0.9, big cat|0.8, white_leopard|0.6
panthera onca => jaguar|0.95, big cat|0.85, black panther|0.65
panthera blytheae, oldest|0.5 ancient|0.9 panthera

- defining a fieldType in the schema.xml that applies the delimitedBoost filter after synonyms are expanded at query time

schema.xml

				
<fieldType name="boostedSynonyms" class="solr.TextField" positionIncrementGap="100" synonymQueryStyle="as_distinct_terms" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    ...
  </analyzer>
  <analyzer type="query">
    ...
    <filter name="synonymGraph" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter name="delimitedBoost"/>
  </analyzer>
</fieldType>

N.B. By default ‘|’ is used as the separator for the weights; if you prefer another character, a “delimiter” parameter is available:

				
<filter name="delimitedBoost" delimiter="/"/>
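
To make the syntax concrete, here is a minimal Python sketch of the kind of splitting the delimitedBoost filter performs on each token. It is an illustration, not Lucene's code: the function name split_boost is hypothetical, and the sketch assumes that a token without a delimiter keeps a neutral boost of 1.0.

```python
from __future__ import annotations


def split_boost(token: str, delimiter: str = "|") -> tuple[str, float]:
    """Split 'term|weight' into (term, weight).

    Assumption for this sketch: a token with no delimiter keeps a neutral boost of 1.0.
    """
    term, found, weight = token.partition(delimiter)
    if not found:
        return token, 1.0
    return term, float(weight)


print(split_boost("tigre|0.9"))       # ('tigre', 0.9)
print(split_boost("tiger"))           # ('tiger', 1.0)
print(split_boost("tigre/0.9", "/"))  # ('tigre', 0.9)
```

The second argument plays the role of the “delimiter” parameter shown above.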

That’s all!
Now you are ready to explore the various ways synonym query expansion works and how boosts are applied.

Query Time

At query time, the weight you configured for the synonym is used to build a boost query that wraps the synonym.
At scoring time, it is a multiplicative factor applied to the score produced by the synonym match.

e.g.
Given a <query1, document1> pair where document1 is a search result of query1:
query1 = title:(tiger OR tigre^0.8)
Score(document1) (let’s consider just the tigre^0.8 score component)

				
2.5526304 = weight(title:tigre in 14) [SchemaSimilarity], result of:
  2.5526304 = score(freq=1.0), product of:
    0.8 = boost
    6.531849 = idf
    0.4884969 = tf
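
The explanation above is a plain product, so it can be checked by hand. A small Python check, using only the three factors printed in the explain output:

```python
# Factors copied from the explain output above.
boost = 0.8       # weight attached to the synonym "tigre" in the query
idf = 6.531849
tf = 0.4884969

unboosted = idf * tf
score = boost * unboosted

print(round(unboosted, 5))  # 3.19079
print(round(score, 5))      # 2.55263
```

Without the synonym weight, the same match would have contributed about 3.19 instead of about 2.55.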

How is a query with synonyms parsed?

Apache Solr currently supports three different synonym query styles:

- as_same_term (default): blends the terms' document frequencies, i.e. SynonymQuery(tshirt,tee), where each term is treated as equally important regardless of its rarity in the corpus (blended document frequency)
- pick_best: selects the most significant (rarest document frequency) synonym when scoring, i.e. Dismax(tee,tshirt) with a tie factor of 0
- as_distinct_terms: biases scoring towards documents that contain more synonyms, i.e. (pants OR slacks)
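
A rough way to picture the difference between pick_best and as_distinct_terms is to look at how per-synonym scores are combined. The sketch below is not Solr code: the helper names and the example scores are made up. It only assumes that a disjunction-max query scores a document with its best clause plus a tie factor times the remaining clauses, and that a boolean OR adds the scores of the matching clauses. as_same_term is left out because it blends the term statistics before scoring rather than combining separate scores.

```python
from __future__ import annotations


def dismax(scores: list[float], tie: float = 0.0) -> float:
    """Best clause score plus `tie` times the sum of the other clause scores."""
    best = max(scores)
    return best + tie * (sum(scores) - best)


def boolean_or(scores: list[float]) -> float:
    """Sum of the scores of all matching clauses."""
    return sum(scores)


# Hypothetical per-synonym scores for one document matching both "tee" and "tshirt".
scores = [2.0, 1.5]

print(dismax(scores, tie=0.0))  # pick_best: 2.0
print(boolean_or(scores))       # as_distinct_terms: 3.5
```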

You can configure the synonym query style in the schema.xml:

				
<fieldType name="text_as_distinct" class="solr.TextField" positionIncrementGap="100" synonymQueryStyle="as_distinct_terms" autoGeneratePhraseQueries="true">

The parsed query depends on the synonym query style you choose. For the sake of this blog post, the following examples use the as_same_term (default) synonym query style.
Weighted synonyms are compatible with all the synonym query styles.

Single Term Query - Single Terms Synonym

synonyms.txt

				
tiger, tigre|0.9

Query = title:tiger
Parsed Query:
Synonym(title:tiger title:tigre^0.9)

Single Term Query - Multi Terms Synonyms

synonyms.txt

				
leopard, big cat|0.8, bagheera|0.9, panthera pardus|0.85

Query = title:leopard
Parsed Query:
((title:"big cat")^0.8 OR
(title:bagheera)^0.9 OR
(title:"panthera pardus")^0.85 OR
title:leopard)

N.B. multi-term synonyms are supported, and the weight is parsed as the boost for the phrase query.
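
The shape of this expansion can be mimicked in a few lines of Python. This is only a string-building sketch with hypothetical names (render_expansion is not a Solr API); it quotes multi-term synonyms as phrases and attaches each weight as a boost, mirroring the parsed query shown above.

```python
from __future__ import annotations


def render_clause(field: str, text: str) -> str:
    """Render a term clause, quoting multi-term text as a phrase."""
    return f'{field}:"{text}"' if " " in text else f"{field}:{text}"


def render_expansion(field: str, original: str, synonyms: list[tuple[str, float]]) -> str:
    """Build an OR expression: boosted synonym clauses followed by the original term."""
    clauses = [f"({render_clause(field, text)})^{weight}" for text, weight in synonyms]
    clauses.append(render_clause(field, original))
    return "(" + " OR ".join(clauses) + ")"


print(render_expansion(
    "title",
    "leopard",
    [("big cat", 0.8), ("bagheera", 0.9), ("panthera pardus", 0.85)],
))
# ((title:"big cat")^0.8 OR (title:bagheera)^0.9 OR (title:"panthera pardus")^0.85 OR title:leopard)
```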

Multi Term Query - Multi Terms Synonyms

synonyms.txt

				
snow leopard, panthera uncia|0.9, big cat|0.8, white_leopard|0.6

Query = title:(snow leopard)
Parsed Query:
((title:"panthera uncia")^0.9 OR
(title:"big cat")^0.8 OR
(title:white_leopard)^0.6 OR
title:"snow leopard")

N.B. multi-term synonyms are supported, and the weight is parsed as the boost for the phrase query.

Pre-existent BOOST compatibility

The contribution is compatible with pre-existing boosts (for example, those coming from the edismax query parser).

synonyms.txt

				
snow leopard, panthera uncia|0.9, big cat|0.8, white_leopard|0.6

Edismax Query = snow leopard
qf = title^10
Parsed Query:
((title:"panthera uncia")^0.9 OR
(title:"big cat")^0.8 OR
(title:white_leopard)^0.6 OR
title:"snow leopard")^10.0
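
Because the field boost wraps the whole expansion, the nested boosts multiply: each synonym ends up with its own weight times the qf boost. A quick Python illustration of the effective factors for the query above (the variable names are just for this example):

```python
qf_boost = 10.0  # from qf = title^10

synonym_weights = {
    "panthera uncia": 0.9,
    "big cat": 0.8,
    "white_leopard": 0.6,
    "snow leopard": 1.0,  # the original phrase carries no synonym weight
}

# Effective boost of each clause: synonym weight multiplied by the field boost.
effective = {term: round(weight * qf_boost, 2) for term, weight in synonym_weights.items()}
print(effective)
# {'panthera uncia': 9.0, 'big cat': 8.0, 'white_leopard': 6.0, 'snow leopard': 10.0}
```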


Conclusion

There is still a lot of work to do to improve how synonyms are managed in Apache Lucene/Solr, especially for hypernyms (generalisations) and hyponyms (specialisations).
This is just a first step 🙂

TO BE CONTINUED


8 Responses

elkon says:

April 8, 2020 at 8:55 pm

This is a great contribution!

I was wondering if you know the answer to this question:

I would like to use Solr to index documents with term weights.

Doc1: this(w=0.3) is(w=0.4) the(w=0.1) first(w=0.7) doc(w=0.2)

Doc2: this(w=0.1) is(w=0.2) the(w=0.5) second(w=0.8) doc(w=0.1)

Note that the weight for the same term can be different for two documents.

After indexing I would like the search function to consider these weights when scoring the documents. For example, if the query is “doc”, I would like Doc1 to get a higher score.

Is this possible?

Thanks!

1. Alessandro Benedetti says:
  
  April 14, 2020 at 11:07 am
  
  Hi, yes, it is possible in various ways.
  At indexing time, use the DelimitedPayloadTokenFilter to add the weight as a payload on each term.
  
  Then you need to manage the query time. Here you can:
  1) fine-tune your boosting, using the edismax query parser for example: you can create boost factors to multiply into your scoring clauses using the payload function query ( https://lucene.apache.org/solr/guide/7_7/function-queries.html#payload-function ). You can do that for all your query terms and apply the math you like, multiplicative or additive
  
  2) just use the Payload score query parser: https://lucene.apache.org/solr/guide/7_5/other-parsers.html#payload-score-parser
  
  And you should be good to go 🙂
  
  For more info about using payloads: https://lucidworks.com/post/solr-payloads/
  It is not super clear from that blog how to achieve term weighting at indexing time, but I hope my explanation clarifies it.
  
Ivana (@icca_kg) says:

April 14, 2020 at 7:59 am

Hi Alessandro,

Thanks for the blog and explanation.

I wonder if you tried to see how it works if you have different types of similarity?

I’m developing a custom Java application with Lucene 8.5.0.

When I’m using BM25Similarity and the delimitedBoost filter everything works as expected, but if I switch to BooleanSimilarity nothing happens. The parsed query looks OK, it has synonyms with the proper boost values, but the final score hasn’t changed.

Thanks.

1. Alessandro Benedetti says:
  
  April 14, 2020 at 11:26 am
  
  Hi Ivana,
  do you mind debugging the score and pasting the output here?
  In Solr it is as easy as debug=results.
  In Lucene you can get it using: org.apache.lucene.search.IndexSearcher#explain(org.apache.lucene.search.Query, int)
  
  Now, from a very quick look at the Similarity classes, BM25Similarity has support for boosting:
  
  org/apache/lucene/search/similarities/BM25Similarity.java:219
  BM25Scorer(float boost, float k1, float b, Explanation idf, float avgdl, float[] cache) {
    this.boost = boost;
    ….
    this.weight = boost * idf.getValue().floatValue();
  }
  
  And so does BooleanSimilarity:
  
  Simple similarity that gives terms a score that is equal to their query
  boost. This similarity is typically used with disabled norms since neither
  document statistics nor index statistics are used for scoring. That said,
  if norms are enabled, they will be computed the same way as
  {@link SimilarityBase} and {@link BM25Similarity} with
  {@link SimilarityBase#setDiscountOverlaps(boolean) discounted overlaps}
  so that the {@link Similarity} can be changed after the index has been
  created.
  
  1. Ivana (@icca_kg) says:
    
    April 16, 2020 at 10:24 am
    
    Hi Alessandro,
    
    Here’s my debug output and some additional info:
    
    I’m using StandardAnalyzer for search, and my SynonymGraphFilter has default configuration as in your example.
    
    Query: +Synonym(morphology_term_original_name_key:neoplasm^0.7 morphology_term_original_name_key:tumor^0.8 morphology_term_original_name_key:tumour^0.6)
    
    1.0 = weight(Synonym(morphology_term_original_name:neoplasm^0.7 morphology_term_original_name:tumor^0.8 morphology_term_original_name:tumour^0.6) in 0) [BooleanSimilarity], result of:
      1.0 = score(BooleanWeight), computed from:
        1.0 = boost, query boost
    
    If I use the BM25Similarity, the printout is as follows:
    
    0.75188845 = weight(Synonym(morphology_term_original_name:neoplasm^0.7 morphology_term_original_name:tumor^0.8 morphology_term_original_name:tumour^0.6) in 0) [BM25Similarity], result of:
      0.75188845 = score(freq=0.8), computed as boost * idf * tf from:
        1.3862944 = idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:
          1 = n, number of documents containing term
          5 = N, total number of documents with field
        0.5423729 = tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:
          0.8 = termFreq=0.8
          1.2 = k1, term saturation parameter
          0.75 = b, length normalization parameter
          1.0 = dl, length of field
          2.4 = avgdl, average length of field
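
As a sanity check, the BM25 numbers above can be recomputed from the two formulas printed in the explain output. The Python below is only a worked calculation on those values; it assumes a boost of 1, since no separate boost line appears in the output. In this printout the synonym weight shows up only as termFreq=0.8.

```python
import math

# Values copied from the BM25 explain output above.
n, N = 1, 5                 # docs containing the term, docs with the field
freq = 0.8                  # reported as termFreq
k1, b = 1.2, 0.75           # term saturation and length normalization parameters
dl, avgdl = 1.0, 2.4        # field length and average field length

idf = math.log(1 + (N - n + 0.5) / (n + 0.5))
tf = freq / (freq + k1 * (1 - b + b * dl / avgdl))
score = idf * tf            # boost taken as 1 (assumption, see above)

print(round(idf, 5), round(tf, 5), round(score, 5))  # 1.38629 0.54237 0.75189
```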
    
Alessandro Benedetti says:

April 16, 2020 at 10:57 am

Mmm, it is suspicious. I am afraid it requires some debugging of the BooleanSimilarity to understand the reasons.
I don’t see any obvious mistake here!

Roope Koski says:

September 2, 2020 at 2:16 pm

Hi Alessandro,
Thank you for this and all your other great contributions to search.
How do the weighted synonyms play along with Learning to Rank?

1. Anna Ruggero says:
  
  April 2, 2021 at 12:44 pm
  
  Hi Roope,
  First of all, you have to decide how to manage free-text queries in the Learning to Rank model. One approach could be to add TF, IDF, length of the field, BM25 score, etc. as features.
  Second, one way to incorporate synonyms in the model could be to add the Solr score as a feature (since synonyms impact the Solr score), but I wouldn’t recommend this approach since the Solr score value depends on several different factors, so the explainability of the model decreases.
  Other options could be:
  – adding TF, IDF, field length, etc. features for each synonym you use.
  – adding the BM25 score computed for each synonym you use.
  There isn’t a unique answer to your question: it really depends on your specific scenario, the features you have or want to use, and the search problem you are modeling.
  
  There is a lot of freedom in solving the problem and you can use all your creativity in this.
  
  I hope this helps a bit!
  


Alessandro Benedetti
