Apache Solr, Main Blog

Apache Solr Neural Search

The first milestone for Neural Search in Apache Solr has been contributed to the open source community by Sease [1] with the work of Alessandro Benedetti (Apache Lucene/Solr PMC member and committer) and Elia Porciani (Sease R&D software engineer).
It relies on the Apache Lucene implementation [2] for the K-nearest neighbour search.
Special thanks to Christine Poerschke, Cassandra Targett, Michael Gibney, and all the other reviewers who helped a lot in the last phases of the contribution. Even a single comment has been much appreciated, it’s always thanks to the community if we achieve progress.

Let’s start with a short introduction, on how the neural approach is improving search.

We can summarise Search as four primary areas:

generate a representation of the query that specifies the information need
generate a representation of the document that captures the information contained
match the query and the document representations from the corpus of information
assign a score to each matched document to establish a meaningful document ranking by relevance in the results

Neural Search is an industry derivation from the academic field of Neural Information Retrieval [3], and it focuses on improving any of these areas with neural network-based techniques.

Artificial Intelligence, Deep Learning and Vector Representations

More and more frequently, we hear about how Artificial Intelligence (AI from now on) permeates many aspects of our lives.

When we talk about AI we are referring to a set of techniques that enable machines to learn and show intelligence similar to humans.

With the strong and steady advance of computer power in the recent past, AI has seen a resurgence and it is now used in many domains, including software engineering and Information Retrieval (the science that regulates Search Engines and similar systems).

In particular, the advent of Deep Learning [4] introduced the use of deep neural networks to solve complex problems that were very challenging for classic algorithms.

For the sake of this blog post, it suffices to know that Deep Learning can be used to produce a vector representation of both the query and the documents in a corpus of information.

Dense Vector Representation

A traditional inverted index can be considered to model text as a “sparse” vector, in which each term in the corpus corresponds to one vector dimension. In such a model (see also the Bag-of-words approach), the number of dimensions corresponds to the term dictionary cardinality and the vector for any given document contains mostly zeros (hence it is called sparse, as only a few terms that exist in the overall dictionary will be present in any given document).

Dense vector representation contrasts with term-based sparse vector representation in that it distils approximate semantic meaning into a fixed (and limited) number of dimensions.

The number of dimensions in this approach is generally much lower than in the sparse case, and the vector for any given document is dense, as most of its dimensions are populated by non-zero values.

In contrast to the sparse approach (for which tokenizers are used to generate sparse vectors directly from text input) the task of generating vectors must be handled in application logic external to Apache Solr.

Various Deep Learning models such as BERT [5] can encode textual information as dense vectors, to be used for Dense Retrieval strategies.

For additional information, you can refer to our blog post about how to use BERT to improve search relevance.

Approximate Nearest Neighbor

Given a dense vector v that models the information need, the easiest approach for providing dense vector retrieval would be to calculate the distance(euclidean, dot product, etc.) between v and each vector d that represents a document in the corpus of information.

This approach is quite expensive, so many approximate strategies are currently under active research.

An approximate nearest neighbour search algorithm returns results, whose distance from the query vector is at most $c$ times the distance from the query vector to its nearest vectors.
The benefit of this approach is that most of the time, an approximate nearest neighbour is almost as good as the exact one.
In particular, if the distance measure accurately captures the notion of user quality, then small differences in the distance should not matter [6].

Hierarchical Navigable Small World Graph

The strategy implemented in Apache Lucene and used by Apache Solr is based on Navigable Small-world graphs.

It provides an efficient approximate nearest neighbour search for high dimensional vectors[7][8][9][10].

Hierarchical Navigable Small World Graph (HNSW) is a method based on the concept of proximity neighbours graphs:
every vector in the vectorial space associated with the corpus of information is uniquely associated with a

vertex $v_{i}\in V$ in the graph $G(V,E)$ .

Vertices are connected through edges based on their proximity, the closer (in terms of a distance function) ones are linked.

Building the graph is affected by hyper-parameters that regulate how many connections to build per layer and how many layers to build.

At query time, the neighbour structure is navigated to find the closest vector(s) to the target one, starting from a seed node and iterating as we get closer and closer to the target.

I found this blog to be quite useful for a deep dive into the topic.

Apache Lucene Implementation

The first thing to notice is that the current Lucene implementation is not hierarchical.
So there is just one single layer in the graph, see the latest comments in the original Jira issue tracking the development progress[11].
The main reason is to find an easier design, development, and integration process for such a simplified implementation in the Apache Lucene ecosystem.
It’s agreed that introducing the hierarchical layered structure will bring benefits in terms of lower dimension vector management and query time(reducing the candidate nodes to traverse).
This implementation is in progress[12].

So what are the Apache Lucene components involved with the Navigable Small World Graph and K-Nearest Neighbors capability?
Let’s explore the code:

N.B. You can skip this paragraph if you are not interested in Lucene internals and codecs

The org.apache.lucene.document.KnnVectorField is the entry point:
it requires at index time the vector dimension and the similarity Function(the vector distance function to use when building the NSW graph).
Those are set in the org.apache.lucene.document.FieldType through the #setVectorDimensionsAndSimilarityFunction method.

When updating the document field schema org.apache.lucene.index.IndexingChain#updateDocFieldSchema the information is extracted from the FieldType and saved in the org.apache.lucene.index.IndexingChain.FieldSchema.

And from the FieldSchema the KnnVectorField configurations finally reach the org.apache.lucene.index.FieldInfo in org.apache.lucene.index.IndexingChain#initializeFieldInfo.

Now the Lucene codec has all the field-specific configurations necessary to build the NSW graph.
Let’s see how:

First of all from the org.apache.lucene.codecs.lucene90.Lucene90Codec#Lucene90Codec you see that the default format for the KNN vectors is the org.apache.lucene.codecs.lucene90.Lucene90HnswVectorsFormat.

The associated index writer is the org.apache.lucene.codecs.lucene90.Lucene90HnswVectorsWriter.
This component has access to the FieldInfo previously initialized when writing fields to the index in org.apache.lucene.codecs.lucene90.Lucene90HnswVectorsWriter#writeField.
When writing a KnnVectorField, the org.apache.lucene.util.hnsw.HnswGraphBuilder is involved and finally a
org.apache.lucene.util.hnsw.HnswGraph is built.

Apache Solr implementation

Available from Apache Solr 9.0

This first contribution allows to index single-valued dense vector fields and to search for K-nearest neighbours using an approximated distance function.

Current features :

DenseVectorField type
Knn Query Parser

DenseVectorField Type

The dense vector field gives the possibility of indexing and searching dense vectors of float elements.

e.g.

[1.0, 2.5, 3.7, 4.1]

Here’s how DenseVectorField should be configured in the schema:

				
					<fieldType name="knn_vector" class="solr.DenseVectorField" vectorDimension="4" similarityFunction="cosine"/> <field name="vector" type="knn_vector" indexed="true" stored="true"/>

Parameter Name	Required	Default	Description	Accepted values
`vectorDimension`	True		The dimension of the dense vector to pass in.	Integer < = 1024
`similarityFunction`	False	`euclidean`	Vector similarity function; used in search to return top K most similar vectors to a target vector.	`euclidean`, `dot_product` or `cosine`.

euclidean: Euclidean distance
dot_product: Dot product. NOTE: This similarity is intended as an optimized way to perform cosine similarity. To use it, all vectors must be of unit length, including both document and query vectors. Using dot product with vectors that are not unit length can result in errors or poor search results..
cosine: Cosine similarity. NOTE: the preferred way to perform cosine similarity is to normalize all vectors to unit length, and instead use DOT_PRODUCT. You should only use this function if you need to preserve the original vectors and cannot normalize them in advance.

DenseVectorField supports the attributes: indexed, stored.

N.B. currently multivalue is not supported.

Customise the Index Codec

To use the following advanced parameters that customise the codec format and the hyper-parameter of the HNSW algorithm make sure you set this configuration in the solrconfig.xml:

				
					<codecFactory class="solr.SchemaCodecFactory"/>
...

Here’s how DenseVectorField can be configured with the advanced codec hyper-parameters:

				
					<fieldType name="knn_vector" class="solr.DenseVectorField" vectorDimension="4" similarityFunction="cosine" codecFormat="Lucene90HnswVectorsFormat" hnswMaxConnections="10" hnswBeamWidth="40"/>
<field name="vector" type="knn_vector" indexed="true" stored="true"/>

Parameter Name	Required	Default	Description	Accepted values
`codecFormat`	False	`Lucene90HnswVectorsFormat`	Specifies the knn codec implementation to use	`Lucene90HnswVectorsFormat`
`hnswMaxConnections`	False	`16`	`Lucene90HnswVectorsFormat` only: Controls how many of the nearest neighbor candidates are connected to the new node. It has the same meaning as `M` from the paper[8].	`Integer`
`hnswBeamWidth`	False	`100`	`Lucene90HnswVectorsFormat` only: It is the number of nearest neighbor candidates to track while searching the graph for each newly inserted node. It has the same meaning as `efConstruction` from the paper[8].	`Integer`

Please note that the codecFormat accepted values may change in future releases.

NOTE Lucene index back-compatibility is only supported for the default codec. If you choose to customize the codecFormat in your schema, upgrading to a future version of Solr may require you to either switch back to the default codec and optimize your index to rewrite it into the default codec before upgrading, or re-build your entire index from scratch after upgrading.

For the hyper-parameters of the HNSW implementation, see [8] for details.

How to Index a Vector

Here’s how a DenseVectorField should be indexed:

JSON

				
					[{ "id": "1",
"vector": [1.0, 2.5, 3.7, 4.1]
},
{ "id": "2",
"vector": [1.5, 5.5, 6.7, 65.1]
}
]

XML

				
					<field name="id">1
<field name="vector">1.0
<field name="vector">2.5
<field name="vector">3.7
<field name="vector">4.1


<field name="id">2
<field name="vector">1.5
<field name="vector">5.5
<field name="vector">6.7
<field name="vector">65.1

Java – SolrJ

				
					final SolrClient client = getSolrClient();

final SolrInputDocument d1 = new SolrInputDocument();
d1.setField("id", "1");
d1.setField("vector", Arrays.asList(1.0f, 2.5f, 3.7f, 4.1f));


final SolrInputDocument d2 = new SolrInputDocument();
d2.setField("id", "2");
d2.setField("vector", Arrays.asList(1.5f, 5.5f, 6.7f, 65.1f));

client.add(Arrays.asList(d1, d2));

Knn Query Parser

The knn K-Nearest Neighbors query parser allows finding the k-nearest documents to the target vector according to indexed dense vectors in the given field.

It takes the following parameters:

Parameter Name	Required	Default	Description
`f`	True		The DenseVectorField to search in.
`topK`	False	10	How many k-nearest results to return.

Here’s how to run a KNN search:

				
					&q={!knn f=vector topK=10}[1.0, 2.0, 3.0, 4.0]

The search results retrieved are the K-nearest to the vector in the input [1.0, 2.0, 3.0, 4.0], ranked by the similarityFunction configured at indexing time.

Usage with Filter Queries

The knn query parser can be used in filter queries:

				
					&q=id:(1 2 3)&fq={!knn f=vector topK=10}[1.0, 2.0, 3.0, 4.0]

The knn query parser can be used with filter queries:

				
					&q={!knn f=vector topK=10}[1.0, 2.0, 3.0, 4.0]&fq=id:(1 2 3)

IMPORTANT When using knn in these scenarios make sure you have a clear understanding of how filter queries work in Apache Solr:
The Ranked List of document IDs resulting from the main query q is intersected with the set of document IDs deriving from each filter query fq.e.g.Ranked List from q=[ID1, ID4, ID2, ID10] Set from fq={ID3, ID2, ID9, ID4} = [ID4,ID2]

Usage as Re-Ranking Query

The knn query parser can be used to rerank first-pass query results:

				
					&q=id:(3 4 9 2)&rq={!rerank reRankQuery=$rqq reRankDocs=4 reRankWeight=1}&rqq={!knn f=vector topK=10}[1.0, 2.0, 3.0, 4.0]

IMPORTANT When using knn in reranking pay attention to the topK parameter.
The second pass score(deriving from knn) is calculated only if the document d from the first pass is within the K-nearest neighbours (in the whole index) of the target vector to search.
This means the second pass knn is executed on the whole index anyway, which is a current limitation.
The final ranked list of results will have the first pass score(main query q) added to the second pass score(the approximated similarityFunction distance to the target vector to search) multiplied by a multiplicative factor(reRankWeight).
So if the document d is not present in the knn results, you get a zero contribution to the original score even if the distance vector calculation with the target query vector is not zero.
Details about using the ReRank Query Parser can be found in the Apache Solr Wiki [13] section.

Future Works

This is just the beginning of the Apache Solr neural capabilities, more is coming! Don’t forget to keep an eye on our Apache Solr blog posts category.
I hope you enjoyed the overview and stay tuned for updates!

references

[6] Andoni, A.; Indyk, P. (2006-10-01). “Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions”. 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS’06). pp. 459–468. CiteSeerX 10.1.1.142.3471. doi:10.1109/FOCS.2006.49. ISBN 978-0-7695-2720-8.

[7] Yury Malkov, Alexander Ponomarenko, Andrey Logvinov, Vladimir Krylov, Approximate nearest neighbor algorithm based on navigable small world graphs,
Information Systems, Volume 45, 2014, Pages 61-68, ISSN 0306-4379, https://doi.org/10.1016/j.is.2013.10.006.

[8] Malkov, Yury; Yashunin, Dmitry (2016). “Efficient and robust approximate nearest neighbour search using Hierarchical Navigable Small World graphs”. arXiv:1603.09320 [cs.DS].

Need Help with this topic?

Did I mention we do Artificial Intelligence in Search training and specific one-day training about Deep Learning for Search? We also provide consulting on these topics, get in touch if you want to bring your search engine to the next level with the power of AI!

Click Here

apache lucene, apache solr, approximate nearest neighbor, artificial intelligence, deep learning, machine learning, neural search

Sign up for our Newsletter

Did you like this post? Don’t forget to subscribe to our Newsletter to stay always updated in the Information Retrieval world!

8 Responses

Michael Pulis says:

March 2, 2022 at 10:58 am

Would it be possible to represent the entire document as a list of vectors, and preform the search on this list of vector representations of the document and query, using Apache Solr ?

Loading...

Reply
1. Alessandro Benedetti says:
  
  March 11, 2022 at 5:36 pm
  
  Hi Michael, we are working in that direction. The short answer right now is no, only single-valued vectors are accepted for a dense vector field.
  
  Loading...
  
  Reply
  1. Dallan Quass says:
    
    July 9, 2022 at 9:53 pm
    
    I’m extremely thankful you are working on dense vector representation! Do you have a guesstimate on when multi-valued vectors would be supported? Then we could query sentence embeddings alongside keywords and metadata, which would be wonderful!
    
    Loading...
Pingback: Migrating from Solr to Elasticsearch, and their differences | by Canva Engineering - Business - News - Health - Technology - Finance - Pal Website
Yvonne says:

September 10, 2023 at 1:58 pm

Hi, is the topK currently work for values other than 10? and what is the maximum of values returned? Thanks very much!

Loading...

Reply
1. Ilaria Petreti says:
  
  September 17, 2023 at 9:41 am
  
  Hi Yvonne,
  with the topK parameter you can specify the number of k-nearest results you want to return.
  Currently, 10 is just the default value but you can specify any integer >= 1
  https://solr.apache.org/guide/solr/latest/query-guide/dense-vector-search.html#knn-query-parser
  
  Loading...
  
  Reply
  1. Yvonne says:
    
    September 17, 2023 at 11:30 am
    
    Currently I use parameter other than 10, but the returned results are still size of 10:
    # print(“Encoded Query:”, search_list)
    knn_query = ‘{!knn f=vector topK=15}’ + str(search_list)
    # print(knn_query)
    results = self.solr.search(q=knn_query)
    print(len(results))
    
    Loading...
  2. Ilaria Petreti says:
    
    September 18, 2023 at 2:20 pm
    
    This happens for the ‘rows’ parameter which specifies the maximum number of documents from the complete result set that Solr should return to the client at one time (for the pagination). The default value is 10.
    If you have set topK=15 for the dense vector search, try to add rows=15 to the query as well.
    e.g.
    &q={!knn f=vector topK=15}[1.0, 2.0, 3.0, 4.0]&rows=15
    
    Loading...

About the company

about our work

Rated Ranking Evaluator (RRE)

Rated Ranking Evaluator Enterprise (RREE)

Apache Solr LLM Highlighter plugin

News

Main Blog

TIPS AND TRICKS

LATEST BLOG POST

contact us

Don't miss all the news - subscribe to our newsletter!

Apache Solr Neural Search

Artificial Intelligence, Deep Learning and Vector Representations

Dense Vector Representation

Approximate Nearest Neighbor

Hierarchical Navigable Small World Graph

Apache Lucene Implementation

Apache Solr implementation

DenseVectorField Type

Customise the Index Codec

How to Index a Vector

Java – SolrJ

Knn Query Parser

Usage with Filter Queries

Usage as Re-Ranking Query

Future Works

references

Need Help with this topic?​

Other posts you may find useful

Apache Solr/Elasticsearch: How to Manage Multi-term Concepts out of the Box?

OpenSearch Neural Search Tutorial: How Filtering Works

Semantic Search (Text to Vector) with Apache Solr

Alessandro Benedetti

Alessandro Benedetti

Follow Us

Top Categories

Recent Posts

Retrieval and Responsibility: The Ethics of Augmented Knowledge

Faster Vector Search: Early Termination Strategy Now in Apache Solr

OpenSearch and Large Language Models

Monthly video

Sign up for our Newsletter

8 Responses

Leave a Reply Cancel reply

Rated Ranking Evaluator
(RRE)

Need Help with this topic?