Vector-based search has gained incredible popularity in the last few years: Large Language Models fine-tuned for sentence similarity proved to be quite effective at encoding text into vectors, representing some of the semantics of a sentence in numerical form.
These vectors can be used to run a K-nearest neighbours search and look for documents/paragraphs close to the query in an n-dimensional vector space, effectively mimicking a similarity search in the semantic space.
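The K-nearest neighbours idea can be sketched in a few lines of plain Python. This is a toy illustration, not Solr code: the document vectors are made up, and a real system would get them from an embedding model (and use an approximate index rather than a full scan):

```python
import math

def cosine(a, b):
    # Cosine similarity between two vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def knn(query_vec, docs, k):
    # docs: {doc_id: vector}; rank all docs by similarity, keep the top k
    scored = sorted(docs.items(), key=lambda kv: cosine(query_vec, kv[1]), reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

docs = {
    "doc1": [0.9, 0.1, 0.0],
    "doc2": [0.1, 0.9, 0.0],
    "doc3": [0.8, 0.2, 0.1],
}
print(knn([1.0, 0.0, 0.0], docs, 2))  # ['doc1', 'doc3']
```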
Limitations
Although exciting, vector-based search still presents many limitations nowadays:
- it’s very difficult to explain – e.g. why is document A returned, and why at position K?
- it works better on paragraphs, so text chunking is strongly recommended – most vector embedding models are fine-tuned on sentences of a certain length, so they work better on queries and documents of a similar length
- it doesn’t care about exact keyword matching, while users still rely on keyword searches a lot.
Hybrid Search
It’s reasonable to expect that these limitations will eventually be solved, but right now an extremely popular trend is to mitigate them through search solutions that combine lexical (traditional keyword-based) search with neural (vector-based) search.
So, what does it mean to combine these two worlds?
It starts with the retrieval of two sets of candidates:
- one set of results coming from lexical matches with the query keywords
- a set of results coming from the K-Nearest Neighbours search with the query vector
Then these results must be combined and presented in a ranking that maximises the relevance for the user query.
This blog post focuses on how you can run Hybrid Search with Apache Solr, so let’s dive into the details!
Apache Solr supports Hybrid Search
The heading clarifies the first misconception I’ve read in many places: Apache Solr does support Hybrid Search, but right now it’s not documented (SOLR-17103).
Hybrid approaches are a relatively new trend: back when we contributed the vector search capabilities they were not a big thing, so we didn’t add an explicit section to the reference guide.
But effectively, some sort of Hybrid Search has been supported ever since vector-based search was implemented.
N.B. This blog focuses on query time: your schema needs to be configured with traditional fields and a solr.DenseVectorField (click here if you want to know more about Apache Solr neural search).
Retrieval Stage
To retrieve the two sets of candidates, an old friend comes to the rescue: the Boolean Query Parser [x].
This query parser combines queries through boolean clauses; each clause can be a different type of query, using its own query parser.
Using the Boolean Query Parser we can leverage various boolean conditions to build the hybrid candidates result set with flexibility:
Union
q = {!bool should=$lexicalQuery should=$vectorQuery}&
lexicalQuery = {!type=edismax qf=text_field}term1&
vectorQuery = {!knn f=vector topK=10}[0.001, -0.422, -0.284, ...]
The hybrid candidate result set is the union of the results coming from the two models:
the top-K results coming from the K-Nearest Neighbours search and the <numFound> results coming from the lexical (keyword-based) search.
The cardinality of the combined result set is <= (K + numFound).
The result set doesn’t include any duplicates.
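The union behaviour can be illustrated with a plain-Python sketch (hypothetical document ids; in Solr the deduplication happens inside the Boolean query itself):

```python
def hybrid_union(lexical_hits, vector_hits):
    # Union of the two candidate sets, without duplicates,
    # preserving first-seen order
    seen, merged = set(), []
    for doc_id in lexical_hits + vector_hits:
        if doc_id not in seen:
            seen.add(doc_id)
            merged.append(doc_id)
    return merged

lexical = ["d1", "d2", "d3"]   # numFound = 3 lexical matches
vector = ["d2", "d4"]          # topK = 2 nearest neighbours
union = hybrid_union(lexical, vector)
print(union)                                      # ['d1', 'd2', 'd3', 'd4']
print(len(union) <= len(lexical) + len(vector))   # cardinality <= K + numFound
```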
Intersection
q = {!bool must=$lexicalQuery must=$vectorQuery}&
lexicalQuery = {!type=edismax qf=text_field}term1&
vectorQuery = {!knn f=vector topK=10}[0.001, -0.422, -0.284, ...]
The hybrid candidate result set is the intersection of the results coming from the two models:
only the top-K results coming from the K-Nearest Neighbours search that also satisfy the lexical query are returned.
The cardinality of the combined result set is <= K.
This is effectively post-filtering the K-Nearest Neighbours results, but it also affects the score; let’s see what that means.
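The intersection behaviour can be sketched the same way (toy document ids again; only vector hits that also match the lexical query survive):

```python
def hybrid_intersection(lexical_hits, vector_hits):
    # Only top-K vector results that also satisfy the lexical query survive,
    # so the cardinality is at most K
    lexical = set(lexical_hits)
    return [doc_id for doc_id in vector_hits if doc_id in lexical]

print(hybrid_intersection(["d1", "d2", "d3"], ["d2", "d4"]))  # ['d2']
```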
Bonus point: Pre-Filtering and Post-filtering
With Apache Solr < 9.1, when adding a filter query (fq) to a K-Nearest Neighbours vector search, a post-filter was applied: only the subset of the K-Nearest Neighbours results matching the filter was returned (with a cardinality <= K).
Since 9.1, filter queries are pre-filters by default (SOLR-16246): the filtering condition is applied as Apache Solr (Lucene, actually) retrieves the K-Nearest Neighbours.
So if at least K documents satisfy the filter, you are guaranteed to retrieve K results.
You can still run filter queries as post-filters by specifying the filter cost, in the same fashion you would do post-filtering with any other Solr query (https://solr.apache.org/guide/solr/latest/query-guide/common-query-parameters.html#cache-local-parameter), e.g.:
&q={!knn f=vector topK=10}[1.0, 2.0, 3.0, 4.0]&fq={!frange cache=false l=0.99}$q
N.B. frange has a default cost of 100.
Should I do post-filtering or Boolean Query Parser intersection?
With post-filtering you won’t calculate the lexical score for the filter clause, so our recommendation is to choose one approach or the other depending on whether you want the lexical score to impact the vector-based score (and affect the ranking).
Let’s look at scoring in more detail in the next section.
Ranking Stage
Once we have the hybrid candidate set, we want to calculate a score for each document that reflects the best ordering in terms of relevance for the user query.
Out of the box, we get a score from -1 to 1 from the K-Nearest Neighbours search, which is summed with an unbounded score from the lexical side (that could be way above that scale).
To be fair, there’s no quick answer to how to best combine the two scores and guarantee the best final ranking, but let’s see the options we have in Apache Solr.
N.B. In the future, we would love to see more Hybrid retrieval algorithms implemented, such as Reciprocal Rank Fusion (do you want to make this happen? Take a look at our AI roadmap for Apache Solr)
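Reciprocal Rank Fusion itself is simple to sketch. This is plain Python, not an existing Solr feature (the point of the note above is precisely that it isn’t implemented yet); k = 60 is the constant from the original RRF paper:

```python
def reciprocal_rank_fusion(rankings, k=60):
    # rankings: a list of ranked doc-id lists, one per retrieval model.
    # Each document accumulates 1 / (k + rank) from every list it appears in,
    # so documents ranked well by several models rise to the top.
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

lexical = ["d1", "d2", "d3"]
vector = ["d3", "d1", "d4"]
print(reciprocal_rank_fusion([lexical, vector]))  # ['d1', 'd3', 'd2', 'd4']
```

Note that RRF only needs the ranks, not the raw scores, which sidesteps the whole score-normalisation problem discussed here.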
Sum Normalised Scores
q = {!bool filter=$retrievalStage must=$rankingStage}&
retrievalStage = {!bool should=$lexicalQuery should=$vectorQuery}&
rankingStage = {!func}sum(query($normalisedLexicalQuery),query($vectorQuery))&
normalisedLexicalQuery = {!func}scale(query($lexicalQuery),0,1)&
lexicalQuery = {!type=edismax qf=field1}term1&
vectorQuery = {!knn f=vector topK=10}[0.001, -0.422, -0.284, ...]
The filter clause ignores any scoring and just builds the hybrid result set.
The must clause is responsible for assigning the score, using the appropriate function query.
The lexical score is min-max normalised to be scaled between 0 and 1, and then summed with the K-Nearest Neighbours score.
This simple linear combination of the scores could be a good starting point.
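What the scale() function query does can be mimicked in plain Python (toy scores; in Solr the min-max normalisation is computed over the scores of the whole result set):

```python
def min_max_scale(scores, lo=0.0, hi=1.0):
    # Mirror of Solr's scale(): map the observed [min, max] range onto [lo, hi]
    mn, mx = min(scores.values()), max(scores.values())
    span = (mx - mn) or 1.0  # avoid division by zero when all scores are equal
    return {d: lo + (s - mn) * (hi - lo) / span for d, s in scores.items()}

lexical = {"d1": 12.4, "d2": 3.1, "d3": 7.0}   # unbounded lexical scores
vector = {"d1": 0.91, "d2": 0.88, "d3": 0.95}  # bounded KNN scores
norm_lexical = min_max_scale(lexical)          # now in [0, 1]
hybrid = {d: norm_lexical[d] + vector[d] for d in vector}
print(sorted(hybrid, key=hybrid.get, reverse=True))  # ['d1', 'd3', 'd2']
```

Note how d3, only second-best on the vector side, can still outrank d2 thanks to its stronger lexical score.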
Multiply Normalised Scores
q = {!bool filter=$retrievalStage must=$rankingStage}&
retrievalStage = {!bool should=$lexicalQuery should=$vectorQuery}&
rankingStage = {!func}product(query($normalisedLexicalQuery),query($vectorQuery))&
normalisedLexicalQuery = {!func}scale(query($lexicalQuery),0.1,1)&
lexicalQuery = {!type=edismax qf=field1}term1&
vectorQuery = {!knn f=vector topK=10}[0.001, -0.422, -0.284, ...]
The filter clause ignores any scoring and just builds the hybrid result set.
The must clause is responsible for assigning the score, using the appropriate function query.
The lexical score is min-max normalised to be scaled between 0.1 and 1, and then multiplied by the K-Nearest Neighbours score; the lower bound of 0.1 prevents a zero lexical score from nullifying the vector contribution entirely.
There’s no evidence that this brings better results than the simple sum: it’s always recommended to build a prototype and test your assumptions on real queries and rated datasets (RRE).
Learning To Rank
In hybrid search, or any other search scenario where you need to combine multiple factors (features) to build the final ranking, there’s no quick answer as to which mathematical function to use to combine such features.
Sum? Normalised sum? Product? A linear or non-linear function?
It’s likely you’ll want to solve this problem by letting Machine Learning learn the function.
Apache Solr has supported Learning To Rank since 6.4, and in 9.3 we contributed basic support for vector similarity function queries, which can be used as features in a Learning To Rank model.
If you want to know more about Apache Solr Learning To Rank: https://sease.io/category/learning-to-rank
The idea is to build a training set outside Solr, where you define a feature vector that describes your <query, document> pair and a rating that states how relevant the document is for the query.
This feature vector can contain as many features as you like, but for the sake of this blog we’ll focus on two:
[
  {
    "name": "lexicalScore",
    "class": "org.apache.solr.ltr.feature.SolrFeature",
    "params": { "q": "{!func}scale(query(${lexicalQuery}),0,1)" },
    "store": "feature-store-1"
  },
  {
    "name": "vectorSimilarityScore",
    "class": "org.apache.solr.ltr.feature.SolrFeature",
    "params": { "q": "{!func}vectorSimilarity(FLOAT32, DOT_PRODUCT, vectorField, ${queryVector})" },
    "store": "feature-store-1"
  }
]
The first feature is the normalised lexical score and the second is the vector similarity score.
Then you train a model that learns from the training dataset how to best combine such scores:
e.g. this could be the resulting model after training:
{
  "class": "org.apache.solr.ltr.model.LinearModel",
  "name": "linear",
  "features": [
    { "name": "lexicalScore" },
    { "name": "vectorSimilarityScore" }
  ],
  "params": {
    "weights": {
      "lexicalScore": 1.0,
      "vectorSimilarityScore": 2.0
    }
  }
}
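Scoring with a linear model like this reduces to a weighted sum of the feature values, which a plain-Python sketch makes explicit (feature values are made up; the weights are taken from the example model above):

```python
def linear_model_score(features, weights):
    # A linear LTR model computes a dot product of feature values and weights
    return sum(weights[name] * value for name, value in features.items())

weights = {"lexicalScore": 1.0, "vectorSimilarityScore": 2.0}
doc_features = {"lexicalScore": 0.7, "vectorSimilarityScore": 0.9}
print(linear_model_score(doc_features, weights))  # 0.7*1.0 + 0.9*2.0 = 2.5
```

With these weights, the vector similarity counts twice as much as the lexical score in the final ranking.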
Finally you can rerank your hybrid candidate set:
q = {!bool should=$normalisedLexicalQuery should=$vectorQuery}&
normalisedLexicalQuery = {!func}scale(query($lexicalQuery),0,1)&
lexicalQuery = {!type=edismax qf=field1}term1&
vectorQuery = {!knn f=vector topK=10}[0.001, -0.422, -0.284, ...]&
rq={!ltr model=linear reRankDocs=100 efi.lexicalQuery='{!type=edismax qf=field1 v=term1}'
efi.queryVector='[0.001, -0.422, -0.284, ...]'}&fl=id,score,[features]
Final Considerations
Reach out to us if you want to make it happen!
One Response
Hello,
I am trying to construct a hybrid search query. Our application has a pagination setup that users rely on to retrieve results. While pagination works well with the edismax parser using the rows attribute, we’re experiencing an issue when using BoolQParser: the rows parameter is being overwritten by the topK parameter of the knn query.
Here’s my current query:
{!bool should="{!edismax qf='title_desc'}generative ai" should="{!knn f=title_desc_vector topK=20}[{{vector}}]"}
Is there a way to implement pagination with the knn parser, specifically for this type of query?
Thank you for your help.