Introducing Weighted Synonyms in Apache Lucene/Solr

This blog post is about our latest contribution to the Apache Lucene/Solr project:
the ability to assign different weights to synonyms.
This contribution aims to help users who deal with complex synonym dictionaries, where it is important to associate a numerical weight with each synonym, for example to boost the ones that are more important in the domain or closer to the original concept.

Contribution

The contribution is detailed in the following official Jira issues:
SOLR-12238 [1]
LUCENE-9171 [2]

The code review and merge process was tracked in the GitHub Pull Request [3].

This new feature will be available from Apache Lucene/Solr 8.5.

The changes happened mostly on the Lucene side:
– a new token filter that extracts the weight and stores it as a token boost attribute
– query building, which checks for boost attributes and uses them to build boosted queries when present

This makes the contribution usable by both Apache Solr and Elasticsearch (coming soon).

On the Solr side, the change affected the Solr base query parser, making it compatible with the synonym query style approach.

Apache Solr

Configuration

Enabling query time weighted synonyms requires two configurations:

    • defining the synonyms with the associated weight in the synonyms.txt file following the syntax of the delimitedBoost token filter (you can use the managed REST API to do that if you prefer)

synonyms.txt

tiger, tigre|0.9
lynx => lince|0.8, lynx_canadensis|0.9

leopard, big cat|0.8, bagheera|0.9, panthera pardus|0.85
lion => panthera leo|0.9, simba leo|0.8, kimba|0.75

panthera pardus, leopard|0.6
panthera tigris => tiger|0.99

snow leopard, panthera uncia|0.9, big cat|0.8, white_leopard|0.6
panthera onca => jaguar|0.95, big cat|0.85, black panther|0.65
panthera blytheae, oldest|0.5 ancient|0.9 panthera
    • defining a fieldType in the schema.xml that applies the delimitedBoost filter after synonyms are expanded at query time

schema.xml

<fieldType name="boostedSynonyms" class="solr.TextField" positionIncrementGap="100" synonymQueryStyle="as_distinct_terms" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    ...
  </analyzer>
  <analyzer type="query">
    ...
    <filter name="synonymGraph" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter name="delimitedBoost"/>
  </analyzer>
</fieldType>

N.B. by default ‘|’ is used as the separator for the weights; if you prefer another character, a “delimiter” parameter is available:

<filter name="delimitedBoost" delimiter="/"/>
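To make the delimiter semantics concrete, here is a minimal Python sketch of what splitting a token into term and weight could look like. It is only an illustration, not the Lucene implementation: the function name split_boost is hypothetical, and it assumes that a token without a delimiter keeps a neutral boost of 1.0.

```python
from typing import Tuple


def split_boost(token: str, delimiter: str = "|") -> Tuple[str, float]:
    """Split 'term<delimiter>weight' into (term, weight).

    Assumption: a token that carries no delimiter gets a neutral boost of 1.0.
    """
    # Split on the first occurrence of the delimiter.
    term, sep, weight = token.partition(delimiter)
    if not sep:
        return token, 1.0
    return term, float(weight)


print(split_boost("tigre|0.9"))                  # default '|' delimiter
print(split_boost("tiger"))                      # no weight attached
print(split_boost("lince/0.8", delimiter="/"))   # custom delimiter
```

The sketch works on one token at a time, which is why a weight written after a multi-word synonym ends up attached to the last token of that synonym.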

That’s all!
Now you are ready to explore the various ways synonym query expansion works and how boosts are applied.

Query Time

At query time, the weight you configured for the synonym is used to build a boost query that wraps the synonym.
At scoring time, this boost is a multiplicative factor applied to the score produced by the synonym match.

e.g.
given a <query1, document1> pair where document1 is a search result of query1
query1 = title:(tiger OR tigre^0.8)
Score(document1) (let’s consider just the tigre^0.8 score component):

2.5526304 = weight(title:tigre in 14) [SchemaSimilarity], result of:
    2.5526304 = score(freq=1.0), product of:
      0.8 = boost
      6.531849 = idf
      0.4884969 = tf
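
The multiplicative effect can be checked by hand. The short Python sketch below simply multiplies the three factors from the explain output above; the function name term_score is hypothetical, and small differences in the last digits are expected because Lucene computes scores with 32-bit floats.

```python
def term_score(boost: float, idf: float, tf: float) -> float:
    """Score of a single term match as the product of boost, idf and tf."""
    return boost * idf * tf


# Values copied from the explain output above.
boosted = term_score(boost=0.8, idf=6.531849, tf=0.4884969)
unboosted = term_score(boost=1.0, idf=6.531849, tf=0.4884969)

print(f"with synonym weight:    {boosted:.6f}")
print(f"without synonym weight: {unboosted:.6f}")
```

Without the synonym weight the same match would score about 3.19, so a weight of 0.8 scales the synonym's contribution down by 20%.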

How is a query with synonyms parsed?

Apache Solr currently supports three different Synonym Query Styles:

as_same_term (default): blends the document frequencies of the terms, i.e. SynonymQuery(tshirt,tee), so each term is treated as equally important regardless of its rarity in the corpus (blended document frequency)
pick_best: selects the most significant (rarest document frequency) synonym when scoring, i.e. Dismax(tee,tshirt) with a tie factor of 0
as_distinct_terms: biases scoring towards documents that contain more of the synonyms, i.e. (pants OR slacks)
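
As a rough illustration of how two of these styles differ at scoring time, the Python sketch below combines per-synonym scores that are assumed to be already computed (and already boosted). It is a simplification, not Solr code: the function names and the example scores are made up, and as_same_term is left out because it blends term statistics before a score exists.

```python
from typing import Dict


def pick_best(scores: Dict[str, float]) -> float:
    """Disjunction max with a tie factor of 0: only the best synonym counts."""
    return max(scores.values(), default=0.0)


def as_distinct_terms(scores: Dict[str, float]) -> float:
    """Boolean OR of optional clauses: every matching synonym adds its score."""
    return sum(scores.values())


# Hypothetical per-synonym scores for one document that matches both terms.
matched = {"tee": 1.2, "tshirt": 2.0}

print(pick_best(matched))
print(as_distinct_terms(matched))
```

With pick_best, a document matching several synonyms scores no higher than one matching only the best synonym; with as_distinct_terms, every additional matching synonym raises the score.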

You can configure the synonym query style in the schema.xml:

  <fieldType name="text_as_distinct" class="solr.TextField" positionIncrementGap="100"  synonymQueryStyle="as_distinct_terms" autoGeneratePhraseQueries="true">

Depending on the synonym query style you choose, queries are parsed differently. For the sake of this blog post, the following examples use the as_same_term (default) synonym query style.
Weighted synonyms are compatible with all the synonym query styles.

Single Term Query – Single Term Synonym


synonyms.txt

tiger, tigre|0.9

Query = title:tiger
Parsed Query:
Synonym(title:tiger title:tigre^0.9)

Single Term Query – Multi Term Synonyms


synonyms.txt

leopard, big cat|0.8, bagheera|0.9, panthera pardus|0.85

Query = title:leopard
Parsed Query:
((title:"big cat")^0.8 OR
(title:bagheera)^0.9 OR
(title:"panthera pardus")^0.85 OR
title:leopard)

N.B. multi-term synonyms are supported, and the weight is parsed as the boost for the phrase query.
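
The shape of this expansion can be mimicked with a few lines of Python. This is only a sketch that reproduces the textual form of the parsed query above, not how Solr builds it: all function names are hypothetical, it handles only comma-separated equivalence lines (no => mappings), and it assumes a single weight placed at the end of each synonym.

```python
from typing import Tuple


def parse_entry(entry: str, delimiter: str = "|") -> Tuple[str, float]:
    """Parse 'some synonym|0.8' into ('some synonym', 0.8); default boost 1.0."""
    text, sep, weight = entry.strip().rpartition(delimiter)
    if not sep:
        return entry.strip(), 1.0
    return text.strip(), float(weight)


def render_clause(field: str, text: str, boost: float) -> str:
    """Render a term or phrase clause, wrapped in a boost when it is not 1.0."""
    # Multi-word synonyms become phrase queries.
    body = f'{field}:"{text}"' if " " in text else f"{field}:{text}"
    return body if boost == 1.0 else f"({body})^{boost:g}"


def render_expansion(field: str, original: str, synonyms_line: str) -> str:
    """Render weighted synonyms first, then the original unboosted term."""
    entries = [parse_entry(e) for e in synonyms_line.split(",")]
    clauses = [render_clause(field, t, b) for t, b in entries if t != original]
    clauses.append(render_clause(field, original, 1.0))
    return "(" + " OR ".join(clauses) + ")"


line = "leopard, big cat|0.8, bagheera|0.9, panthera pardus|0.85"
print(render_expansion("title", "leopard", line))
```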

Multi Term Query – Multi Term Synonyms


synonyms.txt

snow leopard, panthera uncia|0.9, big cat|0.8, white_leopard|0.6

Query = title:(snow leopard)
Parsed Query:
((title:"panthera uncia")^0.9 OR
(title:"big cat")^0.8 OR
(title:white_leopard)^0.6 OR
title:"snow leopard")

N.B. multi-term synonyms are supported, and the weight is parsed as the boost for the phrase query.

Compatibility with Pre-existing Boosts

The contribution is compatible with pre-existing boosts (for example, those coming from the edismax query parser).

synonyms.txt

snow leopard, panthera uncia|0.9, big cat|0.8, white_leopard|0.6

Edismax Query = snow leopard
qf = title^10
Parsed Query:
((title:"panthera uncia")^0.9 OR
(title:"big cat")^0.8 OR
(title:white_leopard)^0.6 OR
title:"snow leopard")^10.0

N.B. the qf boost (^10.0) wraps the whole synonym expansion, so it is applied on top of the individual synonym weights.
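
Since nested boosts in Lucene multiply, the effective boost of each clause in this example is the field boost times the synonym weight. The Python sketch below works that out for the query above; the function name effective_boosts is hypothetical, and the original phrase is given a weight of 1.0 because it carries no weight in synonyms.txt.

```python
from typing import Dict


def effective_boosts(field_boost: float, weights: Dict[str, float]) -> Dict[str, float]:
    """Multiply the outer (qf) boost by each inner synonym weight."""
    return {term: field_boost * weight for term, weight in weights.items()}


weights = {
    "panthera uncia": 0.9,
    "big cat": 0.8,
    "white_leopard": 0.6,
    "snow leopard": 1.0,  # original query phrase, no synonym weight
}

boosts = effective_boosts(10.0, weights)
for term, boost in boosts.items():
    print(f"{term}: {boost:.1f}")
```

The relative ordering of the synonyms is therefore preserved whatever field boost is configured in qf.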

Conclusion

There is still a lot of work to do to improve how synonyms are managed in Apache Lucene/Solr, especially hypernyms (generalisations) and hyponyms (specialisations).
This is just a first step 🙂

TO BE CONTINUED

Author

Alessandro Benedetti

Alessandro Benedetti is the founder of Sease Ltd. Senior Search Software Engineer, his focus is on R&D in information retrieval, information extraction, natural language processing, and machine learning.

Comments (8)

  1. elkon
    April 8, 2020

    This is a great contribution!

    I was wondering if you know the answer to this question:

    I would like to use Solr to index documents with term weights.

    Doc1: this(w=0.3) is(w=0.4) the(w=0.1) first(w=0.7) doc(w=0.2)

    Doc2: this(w=0.1) is(w=0.2) the(w=0.5) second(w=0.8) doc(w=0.1)

    Note that the weight for the same term can be different for two documents.

    After indexing I would like the search function to consider these weights when scoring the documents. For example, if the query is “doc”, I would like Doc1 to get a higher score.

    Is this possible?

    Thanks!

  2. Ivana (@icca_kg)
    April 14, 2020

    Hi Alessandro,

    Thanks for the blog and explanation.

    I wonder if you tried to see how it works if you have different types of similarity?

    I’m developing a custom Java application with Lucene 8.5.0.

    When I use BM25Similarity with the delimitedBoost filter, everything works as expected, but if I switch to BooleanSimilarity nothing happens. The parsed query looks OK and has the synonyms with the proper boost values, but the final score doesn’t change.

    Thanks.

    • Alessandro Benedetti
      April 14, 2020

      Hi Ivana,
      do you mind debugging the score and pasting the output here?
      In Solr it is as easy as debug=results.
      In Lucene you can get it using: org.apache.lucene.search.IndexSearcher#explain(org.apache.lucene.search.Query, int)

      Now, from a very quick look at the Similarity classes, BM25Similarity has support for boosting:

      org/apache/lucene/search/similarities/BM25Similarity.java:219
      BM25Scorer(float boost, float k1, float b, Explanation idf, float avgdl, float[] cache) {
        this.boost = boost;
        ….
        this.weight = boost * idf.getValue().floatValue();
      }

      And so does BooleanSimilarity, according to its Javadoc:

      Simple similarity that gives terms a score that is equal to their query boost. This similarity is typically used with disabled norms since neither document statistics nor index statistics are used for scoring. That said, if norms are enabled, they will be computed the same way as {@link SimilarityBase} and {@link BM25Similarity} with {@link SimilarityBase#setDiscountOverlaps(boolean) discounted overlaps} so that the {@link Similarity} can be changed after the index has been created.

      • Ivana (@icca_kg)
        April 16, 2020

        Hi Alessandro,

        Here’s my debug output and some additional info:

        I’m using StandardAnalyzer for search, and my SynonymGraphFilter has default configuration as in your example.

        Query: +Synonym(morphology_term_original_name_key:neoplasm^0.7 morphology_term_original_name_key:tumor^0.8 morphology_term_original_name_key:tumour^0.6)

        1.0 = weight(Synonym(morphology_term_original_name:neoplasm^0.7 morphology_term_original_name:tumor^0.8 morphology_term_original_name:tumour^0.6) in 0) [BooleanSimilarity], result of:
        1.0 = score(BooleanWeight), computed from:
        1.0 = boost, query boost

        If I use the BM25Similarity, the printout is as follows:

        0.75188845 = weight(Synonym(morphology_term_original_name:neoplasm^0.7 morphology_term_original_name:tumor^0.8 morphology_term_original_name:tumour^0.6) in 0) [BM25Similarity], result of:
        0.75188845 = score(freq=0.8), computed as boost * idf * tf from:
        1.3862944 = idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:
        1 = n, number of documents containing term
        5 = N, total number of documents with field
        0.5423729 = tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:
        0.8 = termFreq=0.8
        1.2 = k1, term saturation parameter
        0.75 = b, length normalization parameter
        1.0 = dl, length of field
        2.4 = avgdl, average length of field

  3. Alessandro Benedetti
    April 16, 2020

    Hmm, this is suspicious. I am afraid it requires some debugging of BooleanSimilarity to understand the reasons.
    I don’t see any obvious mistake here!

  4. Roope Koski
    September 2, 2020

    Hi Alessandro,
    Thank you for this and all your other great contributions to search.
    How do the weighted synonyms play along with Learning to Rank?

    • Anna Ruggero
      April 2, 2021

      Hi Roope,
      First of all, you have to decide how to manage free text queries in the Learning to Rank model. One approach could be to add TF, IDF, length of the field, BM25 score, etc. as features.
      Second, one way to incorporate synonyms in the model could be to add the Solr score as a feature (since synonyms impact the Solr score), but I wouldn’t recommend this approach since the Solr score value depends on several different factors, so the explainability of the model decreases.
      Other options could be:
      – adding TF, IDF, field length, etc. features for each synonym you use.
      – adding the BM25 score computed for each synonym you use.
      There isn’t a unique answer to your question; it really depends on your specific scenario, the features you have or want to use, and the search problem you are modeling.

      There is a lot of freedom in solving the problem and you can use all your creativity in this.

      I hope this helps a bit!
