This blog post is about our latest contribution to the Apache Lucene/Solr project:
the ability to assign different weights to synonyms.
This contribution aims to help users who deal with complex synonym dictionaries, where it’s important to associate a numerical weight with each synonym, for example to boost the ones that are more important in the domain or closer to the original concept.

Contribution

The contribution is detailed in the following official Jira issues:
https://issues.apache.org/jira/browse/SOLR-12238
https://issues.apache.org/jira/browse/LUCENE-9171

The code review and merge process has been tracked in the Pull Request:
https://github.com/apache/lucene-solr/pull/357

This new feature will be available with Apache Lucene/Solr 8.5.

The changes happened mostly on the Lucene side:
– a new token filter that extracts the weight and stores it as a token boost attribute
– query building that checks for boost attributes and, when present, uses them to build boosted queries

This makes the contribution usable by both Apache Solr and Elasticsearch (coming soon).

On the Solr side, the change affected the Solr base query parser, which is now compatible with the synonym query style approach.

Apache Solr

Configuration

Enabling query time weighted synonyms requires two configurations:

  • defining the synonyms with the associated weight in the synonyms.txt file following the syntax of the delimitedBoost token filter (you can use the managed REST API to do that if you prefer)

synonyms.txt

tiger, tigre|0.9
lynx => lince|0.8, lynx_canadensis|0.9

leopard, big cat|0.8, bagheera|0.9, panthera pardus|0.85
lion => panthera leo|0.9, simba leo|0.8, kimba|0.75

panthera pardus, leopard|0.6
panthera tigris => tiger|0.99

snow leopard, panthera uncia|0.9, big cat|0.8, white_leopard|0.6
panthera onca => jaguar|0.95, big cat|0.85, black panther|0.65
panthera blytheae, oldest|0.5 ancient|0.9 panthera

  • defining a fieldType in the schema.xml that applies the delimitedBoost filter after synonyms are expanded at query time

schema.xml

<fieldType name="boostedSynonyms" class="solr.TextField" positionIncrementGap="100" synonymQueryStyle="as_distinct_terms" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    ...
  </analyzer>
  <analyzer type="query">
    ...
    <filter name="synonymGraph" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter name="delimitedBoost"/>
  </analyzer>
</fieldType>

N.B. by default ‘|’ is used as the separator for the weights. If you prefer another character, a “delimiter” parameter is available:

<filter name="delimitedBoost" delimiter="/"/>
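
To make the role of the delimiter concrete, here is a minimal Python sketch of the kind of split the delimitedBoost filter performs on each token. It is only an illustration, not the Lucene implementation: the function name split_boost is hypothetical, and it assumes that a token without a delimiter keeps a neutral boost of 1.0.

```python
from typing import Tuple

DEFAULT_BOOST = 1.0  # assumption: an unweighted token keeps a neutral boost


def split_boost(token: str, delimiter: str = "|") -> Tuple[str, float]:
    """Split 'term<delimiter>weight' into (term, weight)."""
    term, sep, weight = token.rpartition(delimiter)
    if not sep:
        # No delimiter found: the token carries no explicit weight.
        return token, DEFAULT_BOOST
    return term, float(weight)


if __name__ == "__main__":
    print(split_boost("tigre|0.9"))                 # weighted synonym
    print(split_boost("tiger"))                     # unweighted term
    print(split_boost("lince/0.8", delimiter="/"))  # custom delimiter
```

The term is kept as the token text and the number becomes the boost that query building can later pick up.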

That’s all!
Now you are ready to explore the various ways synonym query expansion works and how boosts are applied.

Query Time

At query time, the weight you configured for the synonym is used to build a boost query that wraps the synonym.
At scoring time, this weight is a multiplicative factor applied to the score produced by the synonym match.

e.g.
given a <query1, document1> pair where document1 is a search result of query1
query1 = title:(tiger OR tigre^0.8)
Score(document1) (let’s consider just the tigre^0.8 score component):

2.5526304 = weight(title:tigre in 14) [SchemaSimilarity], result of:
    2.5526304 = score(freq=1.0), product of:
      0.8 = boost
      6.531849 = idf
      0.4884969 = tf
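
The product in the explain output above can be checked by hand. The following Python sketch simply multiplies the three factors reported there; the function name synonym_score is hypothetical and only serves the illustration.

```python
def synonym_score(boost: float, idf: float, tf: float) -> float:
    """Score of the synonym match as the product of its three factors."""
    return boost * idf * tf


# Values copied from the explain output above
score = synonym_score(boost=0.8, idf=6.531849, tf=0.4884969)
print(round(score, 5))  # 2.55263, in line with the 2.5526304 reported by Solr
```

Without the synonym weight (boost = 1.0) the same match would score idf * tf, so the weight scales the contribution of the synonym proportionally.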

How is a query with synonyms parsed?

Apache Solr currently supports three different Synonym Query Styles:

as_same_term (default): blends the document frequencies of the terms, i.e., SynonymQuery(tshirt,tee), where each term is treated as equally important regardless of its rarity in the corpus (blended document frequency).
pick_best: selects the most significant (rarest document frequency) synonym when scoring, i.e., Dismax(tee,tshirt) with a 0 tie factor.
as_distinct_terms: biases scoring towards documents that contain more synonyms, i.e., (pants OR slacks).
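
The difference between pick_best and as_distinct_terms can be sketched in a few lines of Python. This is a simplified model, not Lucene code: the per-term scores are made-up numbers, the function names are hypothetical, and as_same_term is left out on purpose because it blends term statistics before scoring rather than combining per-term scores.

```python
from typing import Dict


def pick_best(scores: Dict[str, float]) -> float:
    """Dismax with a 0 tie factor: only the best-scoring synonym counts."""
    return max(scores.values())


def as_distinct_terms(scores: Dict[str, float]) -> float:
    """Boolean OR of optional clauses: every matching synonym adds its score."""
    return sum(scores.values())


# Hypothetical per-term scores for one document that matches both synonyms
scores = {"pants": 1.2, "slacks": 2.0}
print(pick_best(scores))          # the document is scored by "slacks" alone
print(as_distinct_terms(scores))  # both matches contribute to the score
```

A document matching several synonyms therefore gains under as_distinct_terms, while under pick_best it scores the same as a document matching only the strongest synonym.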

You can configure the synonym query style in the schema.xml:

  <fieldType name="text_as_distinct" class="solr.TextField" positionIncrementGap="100"  synonymQueryStyle="as_distinct_terms" autoGeneratePhraseQueries="true">

Depending on the synonym query style you choose, a different query will be parsed. For the sake of this blog post, the following examples use the as_same_term (default) synonym query style.
Weighted synonyms are compatible with all the synonym query styles.

Single Term Query – Single Terms Synonyms

synonyms.txt

tiger, tigre|0.9

Query = title:tiger
Parsed Query:
Synonym(title:tiger title:tigre^0.9)

Single Term Query – Multi Terms Synonyms

synonyms.txt

leopard, big cat|0.8, bagheera|0.9, panthera pardus|0.85

Query = title:leopard
Parsed Query:
((title:"big cat")^0.8 OR
(title:bagheera)^0.9 OR
(title:"panthera pardus")^0.85 OR
title:leopard)

N.B. multi term synonyms are supported and the weight is parsed as the boost for the phrase query.
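
As an illustration of how the weights end up in the parsed query, the Python sketch below renders an expansion in the same notation used in this post. It is plain string building under simplifying assumptions, not the Solr query parser: the function names are hypothetical, and any alternative containing a space is treated as a phrase.

```python
from typing import List, Tuple


def render_clause(field: str, text: str, boost: float = 1.0) -> str:
    """Render one alternative; multi-term text becomes a quoted phrase."""
    value = f'"{text}"' if " " in text else text
    clause = f"{field}:{value}"
    # Mirror the notation of this post: only weighted alternatives get ^boost.
    return clause if boost == 1.0 else f"({clause})^{boost}"


def render_expansion(field: str, original: str,
                     synonyms: List[Tuple[str, float]]) -> str:
    """Join the weighted synonyms and the original term with OR."""
    clauses = [render_clause(field, text, weight) for text, weight in synonyms]
    clauses.append(render_clause(field, original))
    return "(" + " OR ".join(clauses) + ")"


print(render_expansion(
    "title", "leopard",
    [("big cat", 0.8), ("bagheera", 0.9), ("panthera pardus", 0.85)],
))
```

The same function applied to the original "snow leopard" produces a phrase clause for the query itself, which is the multi term query case.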

Multi Term Query – Multi Terms Synonyms

synonyms.txt

snow leopard, panthera uncia|0.9, big cat|0.8, white_leopard|0.6

Query = title:(snow leopard)
Parsed Query:
((title:"panthera uncia")^0.9 OR
(title:"big cat")^0.8 OR
(title:white_leopard)^0.6 OR
title:"snow leopard")

N.B. multi term synonyms are supported and the weight is parsed as the boost for the phrase query.

Pre-existing Boost Compatibility

The contribution is compatible with pre-existing boosts (for example, those coming from the edismax query parser).

synonyms.txt

snow leopard, panthera uncia|0.9, big cat|0.8, white_leopard|0.6

Edismax Query = snow leopard
qf = title^10
Parsed Query:
((title:"panthera uncia")^0.9 OR
(title:"big cat")^0.8 OR
(title:white_leopard)^0.6 OR
title:"snow leopard")^10.0

N.B. multi term synonyms are supported and the weight is parsed as the boost for the phrase query.
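
Since the qf boost wraps the whole synonym expansion, each alternative ends up with a combined boost. The Python sketch below works this out for the example above, assuming nested boosts combine multiplicatively and that the original phrase has an implicit weight of 1.0; the function name effective_boosts is hypothetical.

```python
from typing import Dict


def effective_boosts(field_boost: float,
                     weights: Dict[str, float]) -> Dict[str, float]:
    """Combine an outer field boost with each inner synonym weight."""
    return {text: field_boost * weight for text, weight in weights.items()}


weights = {
    "panthera uncia": 0.9,
    "big cat": 0.8,
    "white_leopard": 0.6,
    "snow leopard": 1.0,  # original phrase, no explicit weight
}
for text, boost in effective_boosts(10.0, weights).items():
    print(text, round(boost, 2))
```

The relative ordering between the synonyms is preserved, and the field boost only rescales the whole group against other fields.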

Conclusion

There is still a lot of work to do to improve how synonyms are managed in Apache Lucene/Solr, especially for hypernyms (generalisations) and hyponyms (specialisations).
This is just a first step 🙂

TO BE CONTINUED
