Hybrid search has been a popular topic in the search community over the past few years.
In our previous blog post, Hybrid Search with Apache Solr, we introduced the concept of hybrid search, discussed its limitations, and explored how to implement it in Apache Solr when using versions that do not yet provide native support. At that time, hybrid search was achievable, but not with a built-in solution.
Things started to change when my colleague Alessandro Benedetti began working on a contribution to Apache Solr to support Reciprocal Rank Fusion (hereafter referred to as RRF). The goal was to enable a clean and efficient way to fuse results from multiple queries, and RRF, widely adopted in the information retrieval community, is one of the most popular algorithms for this task.
A first pull request (PR#2489) was opened at that time, and this initial work was presented at Berlin Buzzwords 2024 (if you are curious, the talk is available here), where the idea generated strong interest from the community.
For this reason, the work was later revisited and further refined. Sonu Sharma and David Smiley helped give the feature a more solid structure, improving its design and integration into the Solr framework.
And here we are today. Starting with Apache Solr versions 9.11 and 10.1 (SOLR-17319), this new feature, called Combined Query, becomes available, enabling the execution of multiple queries of multiple kinds across multiple shards (PR#3418). By default, results from multiple queries are merged into a single ranked result set using the RRF algorithm, while still giving users the flexibility to plug in a custom fusion algorithm when needed.
In this post, we explore this new capability with a hands-on approach, showing how to use it in practice.
Before going into the details of the tutorial, we first provide some background and basic concepts to better understand how this feature works. In particular, we briefly review RRF, the JSON Combined Query DSL, and the principles of Distributed Search.
Reciprocal Rank Fusion (RRF)
Reciprocal Rank Fusion [1] is a simple but powerful algorithm, introduced at the SIGIR Conference in 2009, that combines multiple ranked lists into a single unified result set. Examples of use cases where RRF can be used include hybrid search (lexical + vector search), but also the parallel execution of multiple k-nearest neighbour (kNN) vector queries, as well as the fusion of results from multiple lexical queries.
RRF is based on the concept of reciprocal rank, which is the inverse of the rank of a document in a ranked list of search results. Given a set D of documents to be ranked and a set of rankers R, the score assigned to a document d is computed as the sum of contributions derived from its rank across multiple result lists:
$$\mathrm{RRFscore}(d \in D) = \sum_{r \in R} \frac{1}{k + r(d)}$$
RRF formula (from the original paper)
where:
– k is a ranking constant (typically 60) that balances the influence of high and low rankings
– R is the set of rankers
– r(d) is the rank (position) of document d in a given ranked list r
What makes RRF effective is its simplicity: it combines rankings based only on item positions in each list rather than original raw scores, assigning higher scores to those that are ranked higher in multiple lists. This makes RRF particularly reliable and robust when combining heterogeneous rankers whose scoring functions are not directly comparable.
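To make the formula more tangible, here is a minimal Python sketch of RRF. It is an illustration of the algorithm as described above, not the Solr implementation: the function name rrf_fuse and the toy document ids are assumptions made for this example.

```python
from collections import defaultdict


def rrf_fuse(ranked_lists, k=60):
    """Fuse ranked lists of document ids with Reciprocal Rank Fusion.

    ranked_lists: iterable of lists, each ordered from best to worst.
    Returns (doc_id, score) pairs sorted by descending fused score.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        # Ranks are 1-based, as in the formula: r(d) = 1 for the top document.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by score; ties are broken by doc id to keep the output deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Two toy rankings (hypothetical ids), e.g. one lexical and one vector.
lexical = ["a", "b", "c"]
vector = ["b", "d", "a"]

fused = rrf_fuse([lexical, vector])
for doc_id, score in fused:
    print(doc_id, round(score, 6))
```

Note that only positions matter: "b" wins because it is near the top of both lists, regardless of what the original scores were.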
If you are curious, here is the Solr code where RRF is implemented. This class is responsible for combining the result sets from different rankers and for explaining the resulting scores.
JSON Combined Query DSL
The Combined Query functionality does not introduce a completely new query mechanism from scratch; instead, it is built on top of an API that already exists in Solr: the JSON Request API.
This API provides the framework for submitting requests in JSON format, and within it, the JSON Query DSL (Domain Specific Language) defines how queries are expressed. The DSL is essentially a JSON-based representation of the traditional URL parameter–based syntax (q, fq, sort, and so on), which is more readable and easier to extend, and allows complex queries to be defined in a clear and structured way.
The JSON Combined Query DSL is therefore an extension of the Solr JSON Query DSL and enables multiple queries to be combined using a single JSON input object.
In practice, the Combined Query feature leverages the structure and principles of the existing JSON Query DSL, introducing additional parameters to support query combination and to specify the algorithm used to aggregate the results.
Distributed Search
Distributed Search is a key feature in Solr that enables scalability. It runs a single search query across multiple Solr nodes or shards and merges the results into a single response. This allows Solr to handle large datasets and improve performance. (Here you can find a useful “tips and tricks” blog, Distributed Search Tips for Apache Solr).
As implemented, this new feature works in both Standalone and SolrCloud mode, and it behaves correctly with one or multiple shards.
Internally, however, it always forces a distributed search by calling rb.setForcedDistrib(true) (here is the reference code). This is required because the combined query logic is designed to operate with a coordinator that can collect and merge results coming from multiple executions of the query pipeline.
The execution flow is as follows: each shard executes all the queries independently and returns a separate ranked list for each query. The coordinator then merges the results of the same query coming from different shards, producing a global ranked list per query. Only after this per-query distributed merge is complete does the coordinator apply Reciprocal Rank Fusion across the different queries to compute the final ranking returned to the client.
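The two-stage flow can be simulated in a few lines of Python. This is only a sketch under stated assumptions, not Solr's actual code: the shard names, document ids and scores are invented, merge_shards and rrf are hypothetical helpers, and the per-query merge simply sorts hits by score.

```python
import heapq


def merge_shards(shard_lists, limit):
    """Merge per-shard (doc_id, score) lists for ONE query into a global ranking."""
    all_hits = (hit for hits in shard_lists for hit in hits)
    merged = heapq.nlargest(limit, all_hits, key=lambda hit: hit[1])
    return [doc_id for doc_id, _ in merged]


def rrf(rankings, k=60):
    """Apply RRF across the global per-query rankings."""
    scores = {}
    for ranking in rankings.values():
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical shard responses: shard -> query name -> [(doc_id, score)]
shard_responses = {
    "shard1": {
        "lexical": [("d1", 7.0), ("d3", 5.5)],
        "vector": [("d3", 0.90), ("d1", 0.85)],
    },
    "shard2": {
        "lexical": [("d2", 6.0), ("d4", 1.0)],
        "vector": [("d4", 0.80), ("d2", 0.60)],
    },
}

# Step 1: the coordinator merges the shard results of each query separately.
per_query = {
    name: merge_shards([resp[name] for resp in shard_responses.values()], limit=10)
    for name in ("lexical", "vector")
}

# Step 2: only then is RRF applied across the different queries.
final = rrf(per_query)
print(per_query)
print(final)
```

The point of the ordering is that ranks fed into RRF are global ranks per query, not ranks local to a single shard.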
IMPORTANT TO KNOW!
When running in Standalone (user-managed) mode, this forced distributed execution has an additional configuration implication (even when only a single shard is involved).
Because the query is internally executed through the distributed search infrastructure, the shard URLs specified in the shards parameter must be explicitly allow-listed using the allowUrls property in solr.xml.
When starting Solr in Standalone mode, use this command to pass a JVM system property and define the list of shard URLs that Solr is allowed to contact during internally executed distributed queries. In our case:
bin/solr start --user-managed --jvm-opts "-Dsolr.security.allow.urls=http://localhost:8983/solr/"
Without this configuration, Solr rejects the shard requests with an HTTP 403 error because the target URLs are not included in the configured allow list. In SolrCloud mode, instead, this allow-list is automatically managed via ZooKeeper and requires no manual configuration.
Tutorial – How To Use It
Solr Overview
For this tutorial, let’s imagine we want to perform a hybrid search, meaning a combination of a lexical query and a vector query, using the Combined Query feature.
We therefore have a Solr collection containing documents that include both descriptive text (text field) and the corresponding vector representations (vector field).
To run the test, we use the same data as in the vector search tutorial (for more information, please refer to the related blog post), i.e. a subset of MS MARCO, a collection of large-scale information retrieval datasets designed for deep learning.
To better understand the results, we will first execute a lexical-only query, followed by a vector-only query, and finally a hybrid query combining both approaches using this new feature.
Configuration
In this section, we describe the Solr configuration, including the solrconfig.xml and schema.xml files.
solrconfig.xml
In this example, the configuration file is intentionally kept as simple as possible, showing only the parts that are relevant for the tutorial:
[XML listing not preserved. Surviving values: explicit, text, 2]
The /select request handler is the one commonly used by default, and it will be used to execute both lexical and vector queries.
The part we are most interested in, however, is the new request handler and its associated search component introduced with this contribution, namely the CombinedQuerySearchHandler and the CombinedQueryComponent.
This means that hybrid queries will be executed using the /combined endpoint, which is responsible for coordinating multiple queries and combining their results.
The CombinedQueryComponent search component accepts the following parameter:
– maxCombinerQueries: defines the maximum number of queries that can be combined in a single request. If the parameter is not set, the default value is 5.
schema.xml
Here we define the data model, specifying the fields to be indexed and their types.
We keep it as minimal as possible, including only the necessary fields: a text field with minimal text analysis and a vector field defined as a DenseVectorField (with a dimension of 384, as the embeddings were generated using the all-MiniLM-L6-v2 model):
[XML listing not preserved. Surviving value: id]
Queries
We now run two queries: a lexical query based on keywords, and a vector query that uses a more natural-language formulation.
To be clear, the point here isn’t to judge the quality of the results, but to show how the fusion process works.
Lexical Query
GET http://localhost:8983/solr/ms-marco/select?
q=text:(tax payment id)
&fl=id,text,score
&rows=10
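When sending this request from code, the query string needs to be URL-encoded. A minimal sketch using only the Python standard library follows; the helper name lexical_query_url is an assumption for this example, while the base URL and parameters are the ones used above.

```python
from urllib.parse import urlencode

SOLR_BASE = "http://localhost:8983/solr/ms-marco"


def lexical_query_url(query, fields=("id", "text", "score"), rows=10):
    """Build the /select URL for a lexical query, with URL-encoded parameters."""
    params = {"q": query, "fl": ",".join(fields), "rows": rows}
    return f"{SOLR_BASE}/select?{urlencode(params)}"


url = lexical_query_url("text:(tax payment id)")
print(url)
```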
Here are the results:
"docs": [
{
"id": "7688",
"text": "A tax ID number is not required if you operate a sole proprietorship or an LLC with no employees, in which case you would simply use your own Social Security Number as a tax ID. But you must obtain an EIN if you are a sole proprietor who files pension or excise tax returns.\n",
"score": 6.9102087
},
{
"id": "7684",
"text": "An employer identification number (EIN), also called a tax ID number or taxpayer ID, is required for most business entities. As its name implies, this is the number used by the Internal Revenue Service (IRS) to identify businesses with respect to their tax obligations.\n",
"score": 6.870281
},
{
"id": "7690",
"text": "Download article as a PDF. An employer identification number (EIN), also called a tax ID number or taxpayer ID, is required for most business entities. As its name implies, this is the number used by the Internal Revenue Service (IRS) to identify businesses with respect to their tax obligations.\n",
"score": 6.721591
},
{
"id": "7689",
"text": "If your state taxes personal services, or if you are required to collect sales taxes on your sales, you need a federal tax ID number. All the government forms you will be required to file for your business will require either a Social Security number or a tax ID number.\n",
"score": 6.6496334
},
{
"id": "7687",
"text": "All the government forms you will be required to file for your business will require either a Social Security number or a tax ID number. It's safe to say that any business that has employees and/or pays any kind of taxes will need a federal tax ID. Best advice is, when in doubt, get one. It's easy to do.\n",
"score": 6.3117824
},
{
"id": "7686",
"text": "A. A federal tax identification number (also known as an employer identification number or EIN), is a number assigned solely to your business by the IRS. Your tax ID number is used to identify your business to several federal agencies responsible for the regulation of business.\n",
"score": 5.7892227
},
etc..
]
},
Vector Query
In the following example, we use the natural language query “How is a business identified for tax payment“, which was converted into a vector using the same model employed to generate the document embeddings:
POST http://localhost:8983/solr/ms-marco/select?fl=id,text,score
{
"query": "{!knn f=vector topK=10}[0.0009692322928458452, 0.028254959732294083, -0.005096305627375841, -0.09961161762475967, -0.11519775539636612, 0.06311386823654175, 0.0852086991071701, -0.07076137512922287, 0.04222959280014038, -0.11359747499227524, -0.01666460931301117, -0.0423760749399662, -0.051786914467811584, -0.015746962279081345, -0.061401840299367905, -0.02211417444050312, 0.012279793620109558, 0.028413966298103333, 0.10297070443630219, 0.018956752493977547, ......., -0.050939954817295074]"
}
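Writing such a long vector by hand is impractical, so in practice the query string is usually built from code. Here is a small sketch that formats an embedding into the {!knn} syntax shown above. The helper names knn_query and knn_request_body are assumptions for this example, the three-value vector is only a placeholder (a real one must have the 384 dimensions defined in the schema), and generating the embedding itself is outside the scope of this snippet.

```python
import json


def knn_query(vector, field="vector", top_k=10):
    """Build a {!knn} query string for the given embedding."""
    vector_literal = "[" + ", ".join(repr(float(x)) for x in vector) + "]"
    return f"{{!knn f={field} topK={top_k}}}{vector_literal}"


def knn_request_body(vector, field="vector", top_k=10):
    """Wrap the knn query in the JSON body expected by the JSON Request API."""
    return json.dumps({"query": knn_query(vector, field, top_k)})


# Placeholder embedding, NOT a real all-MiniLM-L6-v2 vector.
example = knn_query([0.25, -0.5, 1.0])
print(example)
print(knn_request_body([0.25, -0.5, 1.0]))
```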
Here are the results:
"docs": [
{
"id": "7686",
"text": "A. A federal tax identification number (also known as an employer identification number or EIN), is a number assigned solely to your business by the IRS. Your tax ID number is used to identify your business to several federal agencies responsible for the regulation of business.\n",
"score": 0.763386
},
{
"id": "7691",
"text": "A. A federal tax identification number (also known as an employer identification number or EIN), is a number assigned solely to your business by the IRS.\n",
"score": 0.76241565
},
{
"id": "7684",
"text": "An employer identification number (EIN), also called a tax ID number or taxpayer ID, is required for most business entities. As its name implies, this is the number used by the Internal Revenue Service (IRS) to identify businesses with respect to their tax obligations.\n",
"score": 0.75415957
},
{
"id": "7687",
"text": "All the government forms you will be required to file for your business will require either a Social Security number or a tax ID number. It's safe to say that any business that has employees and/or pays any kind of taxes will need a federal tax ID. Best advice is, when in doubt, get one. It's easy to do.\n",
"score": 0.7533202
},
{
"id": "7692",
"text": "Letâs start at the beginning. A tax ID number or employer identification number (EIN) is a number you get from the U.S. federal government that gives an identification number to a business, much like a social security number does for a person.\n",
"score": 0.75231916
},
{
"id": "7685",
"text": "A tax ID number or employer identification number (EIN) is a number you get from the U.S. federal government that gives an identification number to a business, much like a social security number does for a person.\n",
"score": 0.75182426
},
etc...
]
},
Hybrid Query Using the Combined Query Feature
Now, let’s combine the two previous queries using the new feature:
POST http://localhost:8983/solr/ms-marco/combined
{
"queries": {
"lexical": {
"lucene": {
"query": "text:(tax payment id)"
}
},
"vector": {
"knn": {
"f": "vector",
"topK" :10,
"query": "[0.0009692322928458452, 0.028254959732294083, -0.005096305627375841, -0.09961161762475967, -0.11519775539636612, 0.06311386823654175, 0.0852086991071701, -0.07076137512922287, 0.04222959280014038, -0.11359747499227524, -0.01666460931301117, -0.0423760749399662, -0.051786914467811584, -0.015746962279081345, -0.061401840299367905, -0.02211417444050312, 0.012279793620109558, 0.028413966298103333, 0.10297070443630219, 0.018956752493977547, ......., -0.050939954817295074]"
}
}
},
"limit": 10,
"fields": ["id", "text", "score"],
"params": {
"combiner": true,
"combiner.query": ["lexical", "vector"],
"combiner.algorithm": "rrf",
"combiner.rrf.k": "60"
}
}
As mentioned above, the endpoint used here is /combined, and the query structure is similar to the JSON Query DSL, where:
– queries: the key used to specify multiple queries. In our case, we combine two queries: the lexical query and the vector one.
– limit: defines how many final documents are returned after all queries are executed and combined.
– fields: specifies which fields are included in the response.
– params: the section where you specify the parameters that control how the queries are combined, including:
  – combiner (default false): set to true to enable the combined query mode.
  – combiner.query: specifies the list of queries to be executed and combined, as defined in the queries key.
  – combiner.algorithm (default rrf): specifies the algorithm used to combine the results. Reciprocal Rank Fusion is the built-in algorithm; however, the system has been designed to support custom fusion algorithms through plugins (Stay tuned! A dedicated blog post about this will be published).
  – combiner.rrf.k (default 60): the k parameter in the RRF formula.
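For readers who prefer to send the request from code, here is a minimal Python sketch that assembles the same JSON structure and posts it with the standard library. The helper names build_combined_request and run_combined_query are assumptions for this example, and the three-value vector is only a placeholder for a real 384-dimensional embedding. The request is only sent if you call run_combined_query against a running Solr instance.

```python
import json
from urllib.request import Request, urlopen

COMBINED_URL = "http://localhost:8983/solr/ms-marco/combined"


def build_combined_request(lexical_query, vector, top_k=10, limit=10, k=60):
    """Assemble the JSON body for a lexical + vector combined query."""
    return {
        "queries": {
            "lexical": {"lucene": {"query": lexical_query}},
            "vector": {
                "knn": {"f": "vector", "topK": top_k, "query": json.dumps(vector)}
            },
        },
        "limit": limit,
        "fields": ["id", "text", "score"],
        "params": {
            "combiner": True,
            "combiner.query": ["lexical", "vector"],
            "combiner.algorithm": "rrf",
            "combiner.rrf.k": str(k),
        },
    }


def run_combined_query(body, url=COMBINED_URL):
    """POST the body to the /combined endpoint and return the parsed response."""
    request = Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urlopen(request) as response:
        return json.load(response)


# Placeholder vector; a real request needs the full query embedding.
body = build_combined_request("text:(tax payment id)", [0.25, -0.5, 1.0])
print(json.dumps(body, indent=2))
```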
Here are the results:
"docs": [
{
"id": "7684",
"text": "An employer identification number (EIN), also called a tax ID number or taxpayer ID, is required for most business entities. As its name implies, this is the number used by the Internal Revenue Service (IRS) to identify businesses with respect to their tax obligations.\n",
"score": 0.032002047
},
{
"id": "7686",
"text": "A. A federal tax identification number (also known as an employer identification number or EIN), is a number assigned solely to your business by the IRS. Your tax ID number is used to identify your business to several federal agencies responsible for the regulation of business.\n",
"score": 0.031544957
},
{
"id": "7687",
"text": "All the government forms you will be required to file for your business will require either a Social Security number or a tax ID number. It's safe to say that any business that has employees and/or pays any kind of taxes will need a federal tax ID. Best advice is, when in doubt, get one. It's easy to do.\n",
"score": 0.031009614
},
{
"id": "7690",
"text": "Download article as a PDF. An employer identification number (EIN), also called a tax ID number or taxpayer ID, is required for most business entities. As its name implies, this is the number used by the Internal Revenue Service (IRS) to identify businesses with respect to their tax obligations.\n",
"score": 0.03079839
},
{
"id": "7692",
"text": "Letâs start at the beginning. A tax ID number or employer identification number (EIN) is a number you get from the U.S. federal government that gives an identification number to a business, much like a social security number does for a person.\n",
"score": 0.02967033
},
{
"id": "7685",
"text": "A tax ID number or employer identification number (EIN) is a number you get from the U.S. federal government that gives an identification number to a business, much like a social security number does for a person.\n",
"score": 0.02964427
},
{
"id": "7688",
"text": "A tax ID number is not required if you operate a sole proprietorship or an LLC with no employees, in which case you would simply use your own Social Security Number as a tax ID. But you must obtain an EIN if you are a sole proprietor who files pension or excise tax returns.\n",
"score": 0.016393442
},
{
"id": "7691",
"text": "A. A federal tax identification number (also known as an employer identification number or EIN), is a number assigned solely to your business by the IRS.\n",
"score": 0.016129032
},
{
"id": "7689",
"text": "If your state taxes personal services, or if you are required to collect sales taxes on your sales, you need a federal tax ID number. All the government forms you will be required to file for your business will require either a Social Security number or a tax ID number.\n",
"score": 0.015625
},
{
"id": "8277",
"text": "Online Bill Payment. The City of Austell offers residents an easy and secure way to view, print, and pay their utility bill, sanitation, and property tax/stormwater bill online. We support electronic bill presentment (viewing) and payment because it is more convenient for residents and better for our environment.\n",
"score": 0.014925373
}
Results Explanation (Debug)
How can we understand how these final scores were calculated using RRF? By using the debug parameter.
If we add debug=all or debug=true to our request, Solr returns all the available debug information about it. The part we are interested in is the section called combinerExplanations:
"combinerExplanations": {
"7684": "org.apache.lucene.search.Explanation:0.032002047 = 1/(60+2) + 1/(60+3) because its ranks were: 2 for query(lexical), 3 for query(vector)\n",
"7686": "org.apache.lucene.search.Explanation:0.031544957 = 1/(60+6) + 1/(60+1) because its ranks were: 6 for query(lexical), 1 for query(vector)\n",
"7687": "org.apache.lucene.search.Explanation:0.031009614 = 1/(60+5) + 1/(60+4) because its ranks were: 5 for query(lexical), 4 for query(vector)\n",
"7690": "org.apache.lucene.search.Explanation:0.03079839 = 1/(60+3) + 1/(60+7) because its ranks were: 3 for query(lexical), 7 for query(vector)\n",
"7692": "org.apache.lucene.search.Explanation:0.02967033 = 1/(60+10) + 1/(60+5) because its ranks were: 10 for query(lexical), 5 for query(vector)\n",
"7685": "org.apache.lucene.search.Explanation:0.02964427 = 1/(60+9) + 1/(60+6) because its ranks were: 9 for query(lexical), 6 for query(vector)\n",
"7688": "org.apache.lucene.search.Explanation:0.016393442 = 1/(60+1) because its ranks were: 1 for query(lexical), not in the results for query(vector)\n",
"7691": "org.apache.lucene.search.Explanation:0.016129032 = 1/(60+2) because its ranks were: not in the results for query(lexical), 2 for query(vector)\n",
"7689": "org.apache.lucene.search.Explanation:0.015625 = 1/(60+4) because its ranks were: 4 for query(lexical), not in the results for query(vector)\n",
etc...
In this block, it is clearly shown how the final score is calculated for each document.
For example, the document with id=7684 is ranked first because it has rank #2 in the keyword search and rank #3 in the vector search.
Applying the RRF formula, 1 / (60 + 2) + 1 / (60 + 3), we obtain its final score: 0.032002047.
The same logic applies to the other documents, based on their respective ranks in each query.
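As a sanity check, the scores can be recomputed from the ranks reported in combinerExplanations. The short Python sketch below does exactly that; the ranks are copied from the debug output, while the variable and function names are our own. Solr reports single-precision floats, so the last digit may differ slightly from the values printed here.

```python
K = 60

# (lexical rank, vector rank) from combinerExplanations; None = not in that list.
ranks = {
    "7684": (2, 3),
    "7686": (6, 1),
    "7687": (5, 4),
    "7690": (3, 7),
    "7692": (10, 5),
    "7685": (9, 6),
    "7688": (1, None),
    "7691": (None, 2),
    "7689": (4, None),
}


def rrf_score(doc_ranks, k=K):
    """Sum 1 / (k + rank) over the lists in which the document appears."""
    return sum(1.0 / (k + r) for r in doc_ranks if r is not None)


scores = {doc_id: rrf_score(r) for doc_id, r in ranks.items()}
ordering = sorted(scores, key=scores.get, reverse=True)

for doc_id in ordering:
    print(doc_id, f"{scores[doc_id]:.9f}")
```

The resulting order matches the combined result list: documents found by both queries come first, followed by those found by only one of them.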
Current Limitations
According to the documentation, the Combined Query feature does not yet support every Solr search feature; please refer to the Solr Reference Guide for the current list of unsupported features.
I hope you found this blog post/tutorial useful, interesting, and easy to follow. Stay tuned for more exciting updates coming soon!
Need Help With This Topic?
If you’re struggling with Reciprocal Rank Fusion, don’t worry – we’re here to help!
Our team offers expert services and training to help you optimize your Solr search engine and get the most out of your system. Contact us today to learn more!





