This blog post is about our latest contribution to the Apache Lucene/Solr project:
the ability to assign different weights to synonyms.
This contribution aims to help users who deal with complex synonym dictionaries, where it is important to associate a numerical weight with each synonym, for example to boost the ones that are more important in the domain or closer to the original concept.

Contribution

The contribution is detailed in the following official Jira issues:
https://issues.apache.org/jira/browse/SOLR-12238
https://issues.apache.org/jira/browse/LUCENE-9171

The code review and merge process has been tracked in the Pull Request:
https://github.com/apache/lucene-solr/pull/357

This new feature will be available with Apache Lucene/Solr 8.5.

The changes happened mostly on the Lucene side:
– a new token filter that extracts the weight and stores it as a token boost attribute
– query building, which checks for boost attributes and uses them to build boosted queries when present

This makes the contribution usable by both Apache Solr and Elasticsearch (coming soon).

On the Solr side, the change affected the Solr base query parser, making it compatible with the synonym query style approach.

Apache Solr

Configuration

Enabling query time weighted synonyms requires two configurations:

  • defining the synonyms with the associated weight in the synonyms.txt file following the syntax of the delimitedBoost token filter (you can use the managed REST API to do that if you prefer)

synonyms.txt

tiger, tigre|0.9
lynx => lince|0.8, lynx_canadensis|0.9

leopard, big cat|0.8, bagheera|0.9, panthera pardus|0.85
lion => panthera leo|0.9, simba leo|0.8, kimba|0.75

panthera pardus, leopard|0.6
panthera tigris => tiger|0.99

snow leopard, panthera uncia|0.9, big cat|0.8, white_leopard|0.6
panthera onca => jaguar|0.95, big cat|0.85, black panther|0.65
panthera blytheae, oldest|0.5 ancient|0.9 panthera
  • defining a fieldType in the schema.xml that applies the delimitedBoost filter after synonyms are expanded at query time

schema.xml

<fieldType name="boostedSynonyms" class="solr.TextField" positionIncrementGap="100" synonymQueryStyle="as_distinct_terms" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    ...
  </analyzer>
  <analyzer type="query">
    ...
    <filter name="synonymGraph" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter name="delimitedBoost"/>
  </analyzer>
</fieldType>

N.B. By default ‘|’ is used as the separator for the weights; if you prefer another character, a “delimiter” parameter is available:

<filter name="delimitedBoost" delimiter="/"/>
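To make the syntax concrete, here is a minimal Python sketch of how a weighted token could be split into a term and its boost. It is not the Lucene filter itself: split_boost and its parameters are hypothetical names used only for illustration, and falling back to a boost of 1.0 when no delimiter is present is an assumption of this sketch.

```python
def split_boost(token, delimiter="|", default_boost=1.0):
    """Split 'term<delimiter>weight' into (term, boost).

    Hypothetical helper: it only mimics the idea of the delimitedBoost
    filter, it does not reproduce the Lucene implementation.
    """
    term, separator, raw_weight = token.partition(delimiter)
    if not separator:
        # No delimiter found: keep the token untouched with a neutral boost (assumed).
        return token, default_boost
    return term, float(raw_weight)


print(split_boost("tigre|0.9"))                 # weighted synonym
print(split_boost("tiger"))                     # plain token, neutral boost
print(split_boost("lince/0.8", delimiter="/"))  # custom delimiter
```

Running the filter after the synonym expansion matters because the weight only appears on the tokens that the synonym filter emits from synonyms.txt.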

That’s all!
Now you are ready to explore the various ways synonym query expansion works and how boosts are applied.

Query Time

At query time, the weight you configured for the synonym is used to build a boost query that wraps the synonym.
At scoring time, this weight is a multiplicative factor applied to the score produced by the synonym match.

e.g.
given a <query1, document1> pair where document1 is a search result of query1
query1 = title:(tiger OR tigre^0.8)
Score(document1) (let’s consider just the tigre^0.8 score component):

2.5526304 = weight(title:tigre in 14) [SchemaSimilarity], result of:
    2.5526304 = score(freq=1.0), product of:
      0.8 = boost
      6.531849 = idf
      0.4884969 = tf
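The arithmetic in this explain output can be checked by hand. The short Python sketch below simply multiplies the three factors copied from the output; boosted_term_score is a hypothetical name, and the snippet only re-does the multiplication rather than how idf and tf are derived.

```python
def boosted_term_score(boost, idf, tf):
    """Product of the three factors listed in the explain output."""
    return boost * idf * tf


# Values copied from the explain output for title:tigre
score = boosted_term_score(boost=0.8, idf=6.531849, tf=0.4884969)
unboosted = boosted_term_score(boost=1.0, idf=6.531849, tf=0.4884969)

print(round(score, 5))      # score with the synonym weight applied
print(round(unboosted, 5))  # what the same match would score without it
```

Because the weight is a plain multiplier, a synonym weighted 0.8 contributes 80% of what the same match would contribute unweighted.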

How is a query with synonyms parsed?

Apache Solr currently supports three different Synonym Query Styles:

  • as_same_term (default): blends the terms’ document frequencies, i.e. SynonymQuery(tshirt,tee), where every term is treated as equally important regardless of its rarity in the corpus (blended document frequency)
  • pick_best: selects the most significant synonym (the one with the rarest document frequency) when scoring, i.e. Dismax(tee,tshirt) with a tie factor of 0
  • as_distinct_terms: biases scoring towards documents that contain more synonyms, i.e. (pants OR slacks)
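The difference between pick_best and as_distinct_terms can be sketched with a few lines of Python. This is a simplified model, not Lucene code: combine_synonym_scores is a hypothetical name, the per-term scores are invented inputs, and as_same_term is deliberately left out because it scores the synonyms as a single blended term rather than combining independent per-term scores.

```python
def combine_synonym_scores(style, term_scores, tie=0.0):
    """Combine the scores of the synonyms that matched a document.

    term_scores maps each matching term to its (already boosted) score.
    """
    matched = list(term_scores.values())
    if not matched:
        return 0.0
    if style == "pick_best":
        # Dismax: best score plus tie * the remaining scores (tie is 0 here).
        best = max(matched)
        return best + tie * (sum(matched) - best)
    if style == "as_distinct_terms":
        # Boolean OR: every matching synonym adds its own score.
        return sum(matched)
    raise ValueError("style not modelled in this sketch: " + style)


scores = {"tee": 1.2, "tshirt": 2.0}  # illustrative numbers only
print(combine_synonym_scores("pick_best", scores))
print(combine_synonym_scores("as_distinct_terms", scores))
```

With as_distinct_terms a document containing both synonyms outscores one containing only the best synonym, while pick_best ignores the weaker match entirely.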

You can configure the synonym query style in the schema.xml:

  <fieldType name="text_as_distinct" class="solr.TextField" positionIncrementGap="100"  synonymQueryStyle="as_distinct_terms" autoGeneratePhraseQueries="true">

Depending on the synonym query style you choose, queries will be parsed differently. For the sake of this blog post, the following examples use the as_same_term (default) synonym query style.
Weighted synonyms are compatible with all the synonym query styles.

Single Term Query – Single Terms Synonyms

synonyms.txt

tiger, tigre|0.9

Query = title:tiger
Parsed Query:
Synonym(title:tiger title:tigre^0.9)

Single Term Query – Multi Terms Synonyms

synonyms.txt

leopard, big cat|0.8, bagheera|0.9, panthera pardus|0.85

Query = title:leopard
Parsed Query:
((title:"big cat")^0.8 OR
(title:bagheera)^0.9 OR
(title:"panthera pardus")^0.85 OR
title:leopard)

N.B. multi term synonyms are supported and the weight is parsed as the boost for the phrase query.
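As an illustration of the shape of the parsed query above, the following Python sketch renders a weighted expansion as text. It only mirrors the notation used in this post: render_clause and render_expansion are hypothetical helpers, and the output is not claimed to match Lucene's own query string representation.

```python
def render_clause(field, text, boost=None):
    """Render one clause: a phrase when the synonym has several words."""
    body = f'{field}:"{text}"' if " " in text else f"{field}:{text}"
    return body if boost is None else f"({body})^{boost}"


def render_expansion(field, original, weighted_synonyms):
    """Render the boosted synonyms followed by the unboosted original term."""
    clauses = [render_clause(field, text, boost) for text, boost in weighted_synonyms]
    clauses.append(render_clause(field, original))
    return "(" + " OR ".join(clauses) + ")"


print(render_expansion(
    "title",
    "leopard",
    [("big cat", 0.8), ("bagheera", 0.9), ("panthera pardus", 0.85)],
))
```

The multi-word synonyms become phrase clauses, and the weight is attached to the whole phrase rather than to its individual words.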

Multi Term Query – Multi Terms Synonyms

synonyms.txt

snow leopard, panthera uncia|0.9, big cat|0.8, white_leopard|0.6

Query = title:(snow leopard)
Parsed Query:
((title:"panthera uncia")^0.9 OR
(title:"big cat")^0.8 OR
(title:white_leopard)^0.6 OR
title:"snow leopard")

N.B. multi term synonyms are supported and the weight is parsed as the boost for the phrase query.

Pre-existent Boost Compatibility

The contribution is compatible with pre-existent boosts (for example coming from the edismax query parser).

synonyms.txt

snow leopard, panthera uncia|0.9, big cat|0.8, white_leopard|0.6

Edismax Query = snow leopard
qf = title^10
Parsed Query:
((title:"panthera uncia")^0.9 OR
(title:"big cat")^0.8 OR
(title:white_leopard)^0.6 OR
title:"snow leopard")^10.0

N.B. multi term synonyms are supported and the weight is parsed as the boost for the phrase query.
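Since nested boosts multiply, the effective weight of each clause in the query above is the synonym weight times the qf boost. The Python sketch below works this out; effective_boosts is a hypothetical helper, and the 1.0 weight for the original phrase is an assumption standing in for "no synonym weight".

```python
def effective_boosts(synonym_boosts, field_boost):
    """Multiply every synonym weight by the outer field boost."""
    return {text: weight * field_boost for text, weight in synonym_boosts.items()}


clauses = {
    "panthera uncia": 0.9,
    "big cat": 0.8,
    "white_leopard": 0.6,
    "snow leopard": 1.0,  # original query phrase, no synonym weight (assumed 1.0)
}

for text, boost in effective_boosts(clauses, field_boost=10.0).items():
    print(text, round(boost, 2))
```

The relative ordering of the synonyms is preserved: the field boost scales every clause by the same factor.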

Conclusion

There is still a lot of work to do to improve how synonyms are managed in Apache Lucene/Solr, especially for hypernyms (generalisations) and hyponyms (specialisations).
This is just a first step 🙂

TO BE CONTINUED

6 thoughts on “Introducing Weighted Synonyms in Apache Lucene/Solr”

  1. This is a great contribution!

    I was wondering if you know the answer to this question:

    I would like to use Solr to index documents with term weights.

    Doc1: this(w=0.3) is(w=0.4) the(w=0.1) first(w=0.7) doc(w=0.2)

    Doc2: this(w=0.1) is(w=0.2) the(w=0.5) second(w=0.8) doc(w=0.1)

    Note that the weight for the same term can be different for two documents.

    After indexing I would like the search function to consider these weights when scoring the documents. For example, if the query is “doc”, I would like Doc1 to get a higher score.

    Is this possible?

    Thanks!

    1. Hi, yes, it is possible in various ways.
      At indexing time, use:
      – DelimitedPayloadTokenFilter to add the weight as a payload on each term.

      Then you need to handle query time. Here you can either:
      1) fine-tune your boosting, using the edismax query parser for example
      – you can then create boost factors to multiply with the scores of your matching clauses, using the payload function query ( https://lucene.apache.org/solr/guide/7_7/function-queries.html#payload-function ). You can do that for all your query terms and apply the math you like, multiplicative or additive

      2) or just use the payload score query parser: https://lucene.apache.org/solr/guide/7_5/other-parsers.html#payload-score-parser

      And you should be good to go 🙂

      For more info about using payloads: https://lucidworks.com/post/solr-payloads/
      It is not super clear from that blog how to achieve terms weighting at indexing time, but I hope my explanation clarifies it.

  2. Hi Alessandro,

    Thanks for the blog and explanation.

    I wonder if you tried to see how it works if you have different types of similarity?

    I’m developing a custom Java application with Lucene 8.5.0.

    When I use BM25Similarity with the delimitedBoost filter, everything works as expected, but if I switch to BooleanSimilarity nothing happens. The parsed query looks OK, with the synonyms carrying the proper boost values, but the final score doesn’t change.

    Thanks.

    1. Hi Ivana,
      do you mind debugging the score and pasting the output here?
      In Solr it is as easy as debug=results.
      In Lucene you can get it using: org.apache.lucene.search.IndexSearcher#explain(org.apache.lucene.search.Query, int)

      Now, from a very quick look at the Similarity classes, BM25Similarity has support for boosting:

      org/apache/lucene/search/similarities/BM25Similarity.java:219
      BM25Scorer(float boost, float k1, float b, Explanation idf, float avgdl, float[] cache) {
      this.boost = boost;
      ….
      this.weight = boost * idf.getValue().floatValue();
      }

      And so does BooleanSimilarity:

      Simple similarity that gives terms a score that is equal to their query
      * boost. This similarity is typically used with disabled norms since neither
      * document statistics nor index statistics are used for scoring. That said,
      * if norms are enabled, they will be computed the same way as
      * {@link SimilarityBase} and {@link BM25Similarity} with
      * {@link SimilarityBase#setDiscountOverlaps(boolean) discounted overlaps}
      * so that the {@link Similarity} can be changed after the index has been
      * created.

      1. Hi Alessandro,

        Here’s my debug output and some additional info:

        I’m using StandardAnalyzer for search, and my SynonymGraphFilter has default configuration as in your example.

        Query: +Synonym(morphology_term_original_name_key:neoplasm^0.7 morphology_term_original_name_key:tumor^0.8 morphology_term_original_name_key:tumour^0.6)

        1.0 = weight(Synonym(morphology_term_original_name:neoplasm^0.7 morphology_term_original_name:tumor^0.8 morphology_term_original_name:tumour^0.6) in 0) [BooleanSimilarity], result of:
        1.0 = score(BooleanWeight), computed from:
        1.0 = boost, query boost

        If I use the BM25Similarity, the printout is as follows:

        0.75188845 = weight(Synonym(morphology_term_original_name:neoplasm^0.7 morphology_term_original_name:tumor^0.8 morphology_term_original_name:tumour^0.6) in 0) [BM25Similarity], result of:
        0.75188845 = score(freq=0.8), computed as boost * idf * tf from:
        1.3862944 = idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:
        1 = n, number of documents containing term
        5 = N, total number of documents with field
        0.5423729 = tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:
        0.8 = termFreq=0.8
        1.2 = k1, term saturation parameter
        0.75 = b, length normalization parameter
        1.0 = dl, length of field
        2.4 = avgdl, average length of field
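
The BM25 figures in the explain output above can be re-derived from the two formulas it prints. The Python sketch below does exactly that; bm25_idf and bm25_tf are hypothetical function names, all inputs are copied from the output, and the outer boost is taken as 1 because that is what the final score implies.

```python
import math


def bm25_idf(n, N):
    """idf = log(1 + (N - n + 0.5) / (n + 0.5)), as printed in the explain output."""
    return math.log(1 + (N - n + 0.5) / (n + 0.5))


def bm25_tf(freq, k1, b, dl, avgdl):
    """tf = freq / (freq + k1 * (1 - b + b * dl / avgdl)), as printed in the explain output."""
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))


idf = bm25_idf(n=1, N=5)
tf = bm25_tf(freq=0.8, k1=1.2, b=0.75, dl=1.0, avgdl=2.4)

print(round(idf, 7))
print(round(tf, 7))
print(round(idf * tf, 7))  # boost taken as 1 (assumed from the final score)
```

Note that the synonym weight shows up here inside the term frequency (freq=0.8) rather than as a separate boost factor.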
