Apache Solr Main Blog

The first milestone for Neural Search in Apache Solr has been contributed to the open source community by Sease[1] with the work of Alessandro Benedetti(Apache Lucene/Solr PMC member and committer) and Elia Porciani(Sease R&D software engineer).
It relies on the Apache Lucene implementation[2] for K-nearest neighbor search.
Special thanks to Christine Poerschke, Cassandra Targett, Michael Gibney, and all the other reviewers that helped a lot in the last phases of the contribution. Even a single comment has been much appreciated, it’s always thanks to the community if we achieve progress.

Let’s start from a short introduction, on how the neural approach is improving search.

We can summarise Search as four primary areas:

  • generate a representation of the query that specifies the information need
  • generate a representation of the document that captures the information contained
  • match the query and the document representations from the corpus of information
  • assign a score to each matched document in order to establish a meaningful document ranking by relevance in the results

Neural Search is an industry derivation from the academic field of Neural information Retrieval[3], and it focuses on improving any of these areas with neural network based techniques.

Artificial Intelligence, Deep Learning and Vector Representations

More and more frequently, we hear about how Artificial Intelligence (AI from now on) permeates many aspects of our lives.

When we talk about AI we are referring to a set of techniques that enable machines to learn and show intelligence similar to humans.

With the strong and steady advance of computer power in the recent past, AI has seen a resurgence and it is now used in many domains, including software engineering and Information Retrieval (the science that regulates Search Engines and similar systems).

In particular, the advent of Deep Learning[4] introduced the use of deep neural networks to solve complex problems that were very challenging for classic algorithms.

For the sake of this blog post, it suffices to know that Deep Learning can be used to produce a vector representation of both the query and the documents in a corpus of information.

Dense Vector Representation

A traditional inverted index can be considered to model text as a “sparse” vector, in which each term in the corpus corresponds to one vector dimension. In such a model (see also the Bag-of-words approach), the number of dimensions is corresponding to the term dictionary cardinality and the vector for any given document contains mostly zeros (hence it is called sparse, as only a few terms that exist in the overall dictionary will be present in any given document).

Dense vector representation contrasts with term-based sparse vector representation in that it distills approximate semantic meaning into a fixed (and limited) number of dimensions.

The number of dimensions in this approach is generally much lower than the sparse case, and the vector for any given document is dense, as most of its dimensions are populated by non-zero values.

In contrast to the sparse approach (for which tokenizers are used to generate sparse vectors directly from text input) the task of generating vectors must be handled in application logic external to Apache Solr.

Various Deep Learning models such as BERT[5] are able to encode textual information as dense vectors, to be used for Dense Retrieval strategies.

For additional information, you can refer to this blog post of ours.

Approximate Nearest Neighbor

Given a dense vector v that models the information need, the easiest approach for providing dense vector retrieval would be to calculate the distance(euclidean, dot product, etc.) between v and each vector d that represents a document in the corpus of information.

This approach is quite expensive, so many approximate strategies are currently under active research.

An approximate nearest neighbor search algorithm returns results, whose distance from the query vector is at most c times the distance from the query vector to its nearest vectors.
The benefit of this approach is that most of the time, an approximate nearest neighbor is almost as good as the exact one.
In particular, if the distance measure accurately captures the notion of user quality, then small differences in the distance should not matter[6]

Hierarchical Navigable Small World Graph

The strategy implemented in Apache Lucene and used by Apache Solr is based on Navigable Small-world graphs.

It provides an efficient approximate nearest neighbor search for high dimensional vectors[7][8][9][10].

Hierarchical Navigable Small World Graph (HNSW) is a method based on the concept of proximity neighbors graphs:
every vector in the vectorial space associated with the corpus of information is uniquely associated with a

vertex v_{i}\in V in the graph G(V,E)

Vertices are connected through edges based on their proximity, the closer (in terms of a distance function) ones are linked.

Building the graph is affected by hyper-parameters that regulate how many connections to build per layer and how many layers to build.

At query time, the neighbor structure is navigated to find the closest vector(s) to the target one, starting from a seed node and iterating as we get closer and closer to the target.

I found this blog to be quite useful for a deep dive into the topic.

Apache Lucene Implementation

The first thing to notice is that the current Lucene implementation is not hierarchical.
So there is just one single layer in the graph, see the latest comments in the original Jira issue tracking the development progress[11].
The main reason is to find in the easier design, development, and integration process for such a simplified implementation in the Apache Lucene ecosystem.
It’s agreed that introducing the hierarchical layered structure will bring benefits in terms of lower dimension vector management and query time(reducing the candidate nodes to traverse).
This implementation is in progress[12].

So what are the Apache Lucene components involved with the Navigable Small World Graph and K-Nearest Neighbors capability?
Let’s explore the code:

N.B. you can just skip this paragraph if you are not interested in Lucene internals and codecs

The org.apache.lucene.document.KnnVectorField is the entry point:
it requires at index time the vector dimension and the similarity Function(the vector distance function to use when building the NSW graph).
Those are set in the org.apache.lucene.document.FieldType through the #setVectorDimensionsAndSimilarityFunction method.

When updating the document field schema org.apache.lucene.index.IndexingChain#updateDocFieldSchema the information is extracted from the FieldType and saved in the org.apache.lucene.index.IndexingChain.FieldSchema.

And from the FieldSchema the KnnVectorField configurations finally reach the org.apache.lucene.index.FieldInfo in org.apache.lucene.index.IndexingChain#initializeFieldInfo.

Now the Lucene codec has all the field-specific configurations necessary to build the NSW graph.
Let’s see how:

First of all from the org.apache.lucene.codecs.lucene90.Lucene90Codec#Lucene90Codec you see that the default format for the KNN vectors is the org.apache.lucene.codecs.lucene90.Lucene90HnswVectorsFormat.

The associated index writer is the org.apache.lucene.codecs.lucene90.Lucene90HnswVectorsWriter.
This component has access to the FieldInfo previously initialized when writing fields to the index in org.apache.lucene.codecs.lucene90.Lucene90HnswVectorsWriter#writeField.
When writing a KnnVectorField, the org.apache.lucene.util.hnsw.HnswGraphBuilder is involved and finally a
org.apache.lucene.util.hnsw.HnswGraph is built.

Apache Solr implementation

Available from Apache Solr 9.0

Expected First Quarter 2022

This first contribution allows to index single-valued dense vector fields and to search for K-nearest neighbors using an approximated distance function.

Current features :

  • DenseVectorField type
  • Knn Query Parser

DenseVectorField

The dense vector field gives the possibility of indexing and searching dense vectors of float elements.

e.g.

[1.0, 2.5, 3.7, 4.1]

Here’s how DenseVectorField should be configured in the schema:

<fieldType name="knn_vector" class="solr.DenseVectorField" vectorDimension="4" similarityFunction="cosine"/>
<field name="vector" type="knn_vector" indexed="true" stored="true"/>

Parameter Name Required Default Description Accepted values
vectorDimension True   The dimension of the dense vector to pass in. Integer < = 1024
similarityFunction False euclidean Vector similarity function; used in search to return top K most similar vectors to a target vector. euclideandot_product or cosine.

  • euclideanEuclidean distance
  • dot_productDot productNOTE: this similarity is intended as an optimized way to perform cosine similarity. In order to use it, all vectors must be of unit length, including both document and query vectors. Using dot product with vectors that are not unit length can result in errors or poor search results..
  • cosineCosine similarityNOTE: the preferred way to perform cosine similarity is to normalize all vectors to unit length, and instead use DOT_PRODUCT. You should only use this function if you need to preserve the original vectors and cannot normalize them in advance.

DenseVectorField supports the attributes: indexedstored.

N.B. currently multivalue is not supported

Customise the Index Codec

To use the following advanced parameters that customise the codec format and the hyper-parameter of the HNSW algorithm make sure you set this configuration in the solrconfig.xml:


<codecFactory class="solr.SchemaCodecFactory"/>
...

Here’s how DenseVectorField can be configured with the advanced codec hyper-parameters:

<fieldType name="knn_vector" class="solr.DenseVectorField" vectorDimension="4" similarityFunction="cosine" codecFormat="Lucene90HnswVectorsFormat" hnswMaxConnections="10" hnswBeamWidth="40"/>
<field name="vector" type="knn_vector" indexed="true" stored="true"/>

Parameter Name Required Default Description Accepted values
codecFormat False Lucene90HnswVectorsFormat Specifies the knn codec implementation to use Lucene90HnswVectorsFormat
hnswMaxConnections False 16 Lucene90HnswVectorsFormat only:
Controls how many of the nearest neighbor candidates are connected to the new node.
It has the same meaning as M from the paper[8].
Integer
hnswBeamWidth False 100 Lucene90HnswVectorsFormat only: It is the number of nearest neighbor candidates to track while searching the graph for each newly inserted node.
It has the same meaning as efConstruction from the paper[8].
Integer

Please note that the codecFormat accepted values may change in future releases.

NOTE Lucene index back-compatibility is only supported for the default codec. If you choose to customize the codecFormat in your schema, upgrading to a future version of Solr may require you to either switch back to the default codec and optimize your index to rewrite it into the default codec before upgrading, or re-build your entire index from scratch after upgrading.

For the hyper-parameters of the HNSW implementation, see [8] for details.

How to Index a Vector

Here’s how a DenseVectorField should be indexed:

JSON

[{ "id": "1",
"vector": [1.0, 2.5, 3.7, 4.1]
},
{ "id": "2",
"vector": [1.5, 5.5, 6.7, 65.1]
}
]

XML



<field name="id">1
<field name="vector">1.0
<field name="vector">2.5
<field name="vector">3.7
<field name="vector">4.1


<field name="id">2
<field name="vector">1.5
<field name="vector">5.5
<field name="vector">6.7
<field name="vector">65.1

Java – SolrJ

final SolrClient client = getSolrClient();

final SolrInputDocument d1 = new SolrInputDocument();
d1.setField("id", "1");
d1.setField("vector", Arrays.asList(1.0f, 2.5f, 3.7f, 4.1f));


final SolrInputDocument d2 = new SolrInputDocument();
d2.setField("id", "2");
d2.setField("vector", Arrays.asList(1.5f, 5.5f, 6.7f, 65.1f));

client.add(Arrays.asList(d1, d2));

knn Query Parser

The knn K-Nearest Neighbors query parser allows finding the k-nearest documents to the target vector according to indexed dense vectors in the given field.

It takes the following parameters:

Parameter Name Required Default Description
f True   The DenseVectorField to search in.
topK False 10 How many k-nearest results to return.

Here’s how to run a KNN search:

&q={!knn f=vector topK=10}[1.0, 2.0, 3.0, 4.0]

The search results retrieved are the K-nearest to the vector in input [1.0, 2.0, 3.0, 4.0], ranked by the similarityFunction configured at indexing time.

Usage with Filter Queries

The knn query parser can be used in filter queries:

&q=id:(1 2 3)&fq={!knn f=vector topK=10}[1.0, 2.0, 3.0, 4.0]

The knn query parser can be used with filter queries:

&q={!knn f=vector topK=10}[1.0, 2.0, 3.0, 4.0]&fq=id:(1 2 3)

IMPORTANT When using knn in these scenarios make sure you have a clear understanding of how filter queries work in Apache Solr:
The Ranked List of document IDs resulting from the main query q is intersected with the set of document IDs deriving from each filter query fq.e.g.Ranked List from q=[ID1, ID4, ID2, ID10]  Set from fq={ID3, ID2, ID9, ID4} = [ID4,ID2]

Usage as Re-Ranking Query

The knn query parser can be used to rerank first pass query results:

&q=id:(3 4 9 2)&rq={!rerank reRankQuery=$rqq reRankDocs=4 reRankWeight=1}&rqq={!knn f=vector topK=10}[1.0, 2.0, 3.0, 4.0]

IMPORTANT When using knn in reranking pay attention to the topK parameter.
The second pass score(deriving from knn) is calculated only if the document d from the first pass is within the K-nearest neighbors(in the whole index) of the target vector to search.
This means the second pass knn is executed on the whole index anyway, which is a current limitation.
The final ranked list of results will have the first pass score(main query q) added to the second pass score(the approximated similarityFunction distance to the target vector to search) multiplied by a multiplicative factor(reRankWeight).
So if the document d is not present in the knn results, you get a zero contribution to the original score even if the distance vector calculation with the target query vector is not zero.
Details about using the ReRank Query Parser can be found in the Apache Solr Wiki[13] section.

Future Works

This is just the beginning for the Apache Solr neural capabilities, more is coming!
I hope you enjoyed the overview and stay tuned for updates!

References

[6] Andoni, A.; Indyk, P. (2006-10-01). “Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions”. 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS’06). pp. 459–468. CiteSeerX 10.1.1.142.3471doi:10.1109/FOCS.2006.49ISBN 978-0-7695-2720-8.

[7] Yury Malkov, Alexander Ponomarenko, Andrey Logvinov, Vladimir Krylov,
Approximate nearest neighbor algorithm based on navigable small world graphs,
Information Systems,
Volume 45,
2014,
Pages 61-68,
ISSN 0306-4379,
https://doi.org/10.1016/j.is.2013.10.006.

[8]  Malkov, Yury; Yashunin, Dmitry (2016). “Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs”. arXiv:1603.09320 [cs.DS].

// our service

Shameless plug for our training and services!

Did I mention we do Artificial Intelligence in Search training and specific one-day training about Deep Learning for Search?
We also provide consulting on these topics, get in touch if you want to bring your search engine to the next level with the power of AI!

// STAY ALWAYS UP TO DATE

Subscribe to our newsletter

Did you like this post about Apache Solr Neural Search? Don’t forget to subscribe to our Newsletter to stay always updated from the Information Retrieval world!

Author

Alessandro Benedetti

Alessandro Benedetti is the founder of Sease Ltd. Senior Search Software Engineer, his focus is on R&D in information retrieval, information extraction, natural language processing, and machine learning.

Comments (2)

  1. Michael Pulis
    March 2, 2022

    Would it be possible to represent the entire document as a list of vectors, and preform the search on this list of vector representations of the document and query, using Apache Solr ?

    • Alessandro Benedetti
      March 11, 2022

      Hi Michael, we are working in that direction. The short answer right now is no, only single-valued vectors are accepted for a dense vector field.

Leave a comment

Your email address will not be published.

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this: