Vector-based search has gained incredible popularity in the last few years: Large Language Models fine-tuned for sentence similarity proved to be quite effective at encoding text into vectors, representing some of the semantics of a sentence in numerical form.
These vectors can be used to run a K-nearest neighbours search and look for documents/paragraphs close to the query in an n-dimensional vector space, effectively mimicking a similarity search in the semantic space.
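The K-nearest neighbours idea can be sketched in a few lines of plain Python. This is a toy illustration, not Solr code: the document vectors are made up, and a real system would get them from an embedding model (and use an approximate index rather than a full scan):

```python
import math

def cosine(a, b):
    # Cosine similarity between two vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def knn(query_vec, docs, k):
    # docs: {doc_id: vector}; rank all docs by similarity, keep the top k
    scored = sorted(docs.items(), key=lambda kv: cosine(query_vec, kv[1]), reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

docs = {
    "doc1": [0.9, 0.1, 0.0],
    "doc2": [0.1, 0.9, 0.0],
    "doc3": [0.8, 0.2, 0.1],
}
print(knn([1.0, 0.0, 0.0], docs, 2))  # ['doc1', 'doc3']
```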
Limitations
Although exciting, vector-based search still presents many limitations nowadays:
- it’s very difficult to explain – e.g. why is document A returned, and why at position K?
- it works better on paragraphs, so text chunking is strongly recommended – most vector embedding models are fine-tuned on sentences of a certain length, so they work better on queries and documents of a similar length
- it doesn’t care about exact keyword matching, while users still rely on keyword searches a lot.
Hybrid Search
It’s reasonable to expect that these limitations will eventually be solved, but right now an extremely popular trend is to mitigate them through search solutions that combine lexical (traditional keyword-based) search with neural (vector-based) search.
So, what does it mean to combine these two worlds?
It starts with the retrieval of two sets of candidates:
- one set of results coming from lexical matches with the query keywords
- a set of results coming from the K-Nearest Neighbours search with the query vector
Then these results must be combined and presented in a ranking that maximises the relevance for the user query.
This blog post focuses on how you can run Hybrid Search with Apache Solr, so let’s dive into the details!
Apache Solr supports Hybrid Search
The heading clarifies the first misconception I’ve read in many places: Apache Solr does support Hybrid Search, but right now it’s not documented (SOLR-17103).
Hybrid approaches are a relatively new trend: back when we contributed the vector search capabilities they were not a big thing, so we didn’t add an explicit section to the reference guide.
But effectively, some sort of Hybrid Search has been supported ever since vector-based search was implemented.
N.B. This blog focuses on query time: your schema needs to be configured with traditional fields and a solr.DenseVectorField (click here if you want to know more about Apache Solr neural search).
Retrieval Stage
To retrieve the two sets of candidates, an old friend comes to the rescue: the Boolean Query Parser [x].
This query parser combines queries through boolean clauses; each clause can be a different type of query, using its own query parser.
Using the Boolean Query Parser we can leverage various boolean conditions to build the hybrid candidates result set with flexibility:
Union
q = {!bool should=$lexicalQuery should=$vectorQuery}&
lexicalQuery = {!type=edismax qf=text_field}term1&
vectorQuery = {!knn f=vector topK=10}[0.001, -0.422, -0.284, ...]
The hybrid candidate result set is the union of the results coming from the two models:
the top-K results coming from the K-Nearest Neighbours search and the <numFound> results coming from the lexical (keyword-based) search.
The cardinality of the combined result set is <= (K + numFound).
The result set doesn’t include any duplicates.
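The union behaviour can be illustrated with a plain-Python sketch (hypothetical document ids; in Solr the deduplication happens inside the Boolean query itself):

```python
def hybrid_union(lexical_hits, vector_hits):
    # Union of the two candidate sets, without duplicates,
    # preserving first-seen order
    seen, merged = set(), []
    for doc_id in lexical_hits + vector_hits:
        if doc_id not in seen:
            seen.add(doc_id)
            merged.append(doc_id)
    return merged

lexical = ["d1", "d2", "d3"]   # numFound = 3 lexical matches
vector = ["d2", "d4"]          # topK = 2 nearest neighbours
union = hybrid_union(lexical, vector)
print(union)                                      # ['d1', 'd2', 'd3', 'd4']
print(len(union) <= len(lexical) + len(vector))   # cardinality <= K + numFound
```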
Intersection
q = {!bool must=$lexicalQuery must=$vectorQuery}&
lexicalQuery = {!type=edismax qf=text_field}term1&
vectorQuery = {!knn f=vector topK=10}[0.001, -0.422, -0.284, ...]
The hybrid candidate result set is the intersection of the results coming from the two models:
only the top-K results coming from the K-Nearest Neighbours search that also satisfy the lexical query are returned.
The cardinality of the combined result set is <= K.
This is effectively post-filtering the K-Nearest Neighbours results, but it also affects the score; let’s see what that means.
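The intersection behaviour can be sketched the same way (toy document ids again; only vector hits that also match the lexical query survive):

```python
def hybrid_intersection(lexical_hits, vector_hits):
    # Only top-K vector results that also satisfy the lexical query survive,
    # so the cardinality is at most K
    lexical = set(lexical_hits)
    return [doc_id for doc_id in vector_hits if doc_id in lexical]

print(hybrid_intersection(["d1", "d2", "d3"], ["d2", "d4"]))  # ['d2']
```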
Bonus point: Pre-Filtering and Post-filtering
With Apache Solr < 9.1, when adding a filter query (fq) to a K-Nearest Neighbours vector search, a post-filter was applied: only the subset of the K-Nearest Neighbours results matching the filter was returned (with a cardinality <= K).
Since 9.1, filter queries are pre-filters by default (SOLR-16246): the filtering condition is applied as Apache Solr (Lucene, actually) retrieves the K-Nearest Neighbours.
So if at least K documents satisfy the filter, you are guaranteed to retrieve K results.
You can still run filter queries as post-filters by specifying the filter cost, in the same fashion you would do post-filtering with any other Solr query (https://solr.apache.org/guide/solr/latest/query-guide/common-query-parameters.html#cache-local-parameter), e.g.:
&q={!knn f=vector topK=10}[1.0, 2.0, 3.0, 4.0]&fq={!frange cache=false l=0.99}$q
N.B. frange has a default cost of 100.
Should I do post-filtering or Boolean Query Parser intersection?
With post-filtering you won’t calculate the lexical score for the filter clause, so our recommendation is to choose one approach or the other depending on whether you want the lexical score to impact the vector-based score (and affect the ranking).
Let’s look at scoring in more detail in the next section.
Ranking Stage
Once we have the hybrid candidate set, we want to calculate a score for each document that reflects the best ordering in terms of relevance for the user query.
Out of the box, we get a score from -1 to 1 from the K-Nearest Neighbours search, which is summed with an unbounded score from the lexical side (that could be way above that scale).
To be fair, there’s no quick answer to how to best combine the two scores and guarantee the best final ranking, but let’s see the options we have in Apache Solr.
N.B. In the future, we would love to see more Hybrid retrieval algorithms implemented, such as Reciprocal Rank Fusion (do you want to make this happen? Take a look at our AI roadmap for Apache Solr)
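Reciprocal Rank Fusion itself is simple to sketch. This is plain Python, not an existing Solr feature (the point of the note above is precisely that it isn’t implemented yet); k = 60 is the constant from the original RRF paper:

```python
def reciprocal_rank_fusion(rankings, k=60):
    # rankings: a list of ranked doc-id lists, one per retrieval model.
    # Each document accumulates 1 / (k + rank) from every list it appears in,
    # so documents ranked well by several models rise to the top.
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

lexical = ["d1", "d2", "d3"]
vector = ["d3", "d1", "d4"]
print(reciprocal_rank_fusion([lexical, vector]))  # ['d1', 'd3', 'd2', 'd4']
```

Note that RRF only needs the ranks, not the raw scores, which sidesteps the whole score-normalisation problem discussed here.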
Sum Normalised Scores
q = {!bool filter=$retrievalStage must=$rankingStage}&
retrievalStage = {!bool should=$lexicalQuery should=$vectorQuery}&
rankingStage = {!func}sum(query($normalisedLexicalQuery),query($vectorQuery))&
normalisedLexicalQuery = {!func}scale(query($lexicalQuery),0,1)&
lexicalQuery = {!type=edismax qf=field1}term1&
vectorQuery = {!knn f=vector topK=10}[0.001, -0.422, -0.284, ...]
The filter clause ignores any scoring and just builds the hybrid result set.
The must clause is responsible for assigning the score, using the appropriate function query.
The lexical score is min-max normalised to be scaled between 0 and 1, and then summed with the K-Nearest Neighbours score.
This simple linear combination of the scores could be a good starting point.
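What the scale() function query does can be mimicked in plain Python (toy scores; in Solr the min-max normalisation is computed over the scores of the whole result set):

```python
def min_max_scale(scores, lo=0.0, hi=1.0):
    # Mirror of Solr's scale(): map the observed [min, max] range onto [lo, hi]
    mn, mx = min(scores.values()), max(scores.values())
    span = (mx - mn) or 1.0  # avoid division by zero when all scores are equal
    return {d: lo + (s - mn) * (hi - lo) / span for d, s in scores.items()}

lexical = {"d1": 12.4, "d2": 3.1, "d3": 7.0}   # unbounded lexical scores
vector = {"d1": 0.91, "d2": 0.88, "d3": 0.95}  # bounded KNN scores
norm_lexical = min_max_scale(lexical)          # now in [0, 1]
hybrid = {d: norm_lexical[d] + vector[d] for d in vector}
print(sorted(hybrid, key=hybrid.get, reverse=True))  # ['d1', 'd3', 'd2']
```

Note how d3, only second-best on the vector side, can still outrank d2 thanks to its stronger lexical score.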
Multiply Normalised Scores
q = {!bool filter=$retrievalStage must=$rankingStage}&
retrievalStage = {!bool should=$lexicalQuery should=$vectorQuery}&
rankingStage = {!func}product(query($normalisedLexicalQuery),query($vectorQuery))&
normalisedLexicalQuery = {!func}scale(query($lexicalQuery),0.1,1)&
lexicalQuery = {!type=edismax qf=field1}term1&
vectorQuery = {!knn f=vector topK=10}[0.001, -0.422, -0.284, ...]
The filter clause ignores any scoring and just builds the hybrid result set.
The must clause is responsible for assigning the score, using the appropriate function query.
The lexical score is min-max normalised to be scaled between 0.1 and 1, and then multiplied by the K-Nearest Neighbours score; the lower bound of 0.1 prevents a zero lexical score from nullifying the vector contribution entirely.
There’s no evidence that this brings better results than the simple sum: it’s always recommended to build a prototype and test your assumptions on real queries and rated datasets (RRE).
Learning To Rank
In hybrid search, or any other search scenario where you need to combine multiple factors (features) to build the final ranking, there’s no quick answer as to which mathematical function to use to combine such features.
Sum? Normalised sum? Product? A linear or non-linear function?
It’s likely you’ll want to solve this problem by letting Machine Learning learn the function.
Apache Solr has supported Learning To Rank since 6.4, and in 9.3 we contributed basic support for vector similarity function queries, which can be used as features in a Learning To Rank model.
If you want to know more about Apache Solr Learning To Rank: https://sease.io/category/learning-to-rank
The idea is to build a training set outside Solr, where you define a feature vector that describes your <query, document> pair and a rating that states how relevant the document is for the query.
This feature vector can contain as many features as you like, but for the sake of this blog we’ll focus on two:
[
  {
    "name": "lexicalScore",
    "class": "org.apache.solr.ltr.feature.SolrFeature",
    "params": { "q": "{!func}scale(query(${lexicalQuery}),0,1)" },
    "store": "feature-store-1"
  },
  {
    "name": "vectorSimilarityScore",
    "class": "org.apache.solr.ltr.feature.SolrFeature",
    "params": { "q": "{!func}vectorSimilarity(FLOAT32, DOT_PRODUCT, vectorField, ${queryVector})" },
    "store": "feature-store-1"
  }
]
The first feature is the normalised lexical score and the second is the vector similarity score.
Then you train a model that learns from the training dataset how to best combine such scores:
e.g. this could be the resulting model after training:
{
  "class": "org.apache.solr.ltr.model.LinearModel",
  "name": "linear",
  "features": [
    { "name": "lexicalScore" },
    { "name": "vectorSimilarityScore" }
  ],
  "params": {
    "weights": {
      "lexicalScore": 1.0,
      "vectorSimilarityScore": 2.0
    }
  }
}
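Scoring with a linear model like this reduces to a weighted sum of the feature values, which a plain-Python sketch makes explicit (feature values are made up; the weights are taken from the example model above):

```python
def linear_model_score(features, weights):
    # A linear LTR model computes a dot product of feature values and weights
    return sum(weights[name] * value for name, value in features.items())

weights = {"lexicalScore": 1.0, "vectorSimilarityScore": 2.0}
doc_features = {"lexicalScore": 0.7, "vectorSimilarityScore": 0.9}
print(linear_model_score(doc_features, weights))  # 0.7*1.0 + 0.9*2.0 = 2.5
```

With these weights, the vector similarity counts twice as much as the lexical score in the final ranking.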
Finally you can rerank your hybrid candidate set:
q = {!bool should=$normalisedLexicalQuery should=$vectorQuery}&
normalisedLexicalQuery = {!func}scale(query($lexicalQuery),0,1)&
lexicalQuery = {!type=edismax qf=field1}term1&
vectorQuery = {!knn f=vector topK=10}[0.001, -0.422, -0.284, ...]&
rq={!ltr model=linear reRankDocs=100 efi.lexicalQuery='{!type=edismax qf=field1 v=term1}'
efi.queryVector='[0.001, -0.422, -0.284, ...]'}&fl=id,score,[features]
Final Considerations
Reach out to us if you want to make it happen!
One Response
Hello,
I am trying to construct a hybrid search query. Our application has a pagination setup that users rely on to retrieve results. While pagination works well with the edismax parser using the rows attribute, we’re experiencing an issue when using BoolQParser: the rows parameter is being overwritten by the topK parameter of the knn query.
Here’s my current query:
{!bool should="{!edismax qf='title_desc'}generative ai" should="{!knn f=title_desc_vector topK=20}[{{vector}}]"}
Is there a way to implement pagination with the knn parser, specifically for this type of query?
Thank you for your help.