Apache Solr Neural Search
The first milestone for Neural Search in Apache Solr has been contributed to the open source community by Sease[1] with the work of Alessandro Benedetti (Apache Lucene/Solr PMC member and committer) and Elia Porciani (Sease R&D software engineer).
It relies on the Apache Lucene implementation[2] for K-nearest neighbor search.
Special thanks to Christine Poerschke, Cassandra Targett, Michael Gibney, and all the other reviewers who helped a lot in the last phases of the contribution. Even a single comment has been much appreciated; it is always thanks to the community that we achieve progress.
Let’s start with a short introduction to how the neural approach is improving search.
We can summarise Search as four primary areas:
- generate a representation of the query that specifies the information need
- generate a representation of the document that captures the information contained
- match the query and the document representations from the corpus of information
- assign a score to each matched document in order to establish a meaningful document ranking by relevance in the results
Neural Search is an industry derivation from the academic field of Neural Information Retrieval[3], and it focuses on improving any of these areas with neural network based techniques.
Artificial Intelligence, Deep Learning and Vector Representations
More and more frequently, we hear about how Artificial Intelligence (AI from now on) permeates many aspects of our lives.
When we talk about AI we are referring to a set of techniques that enable machines to learn and show intelligence similar to humans.
With the strong and steady advance of computer power in the recent past, AI has seen a resurgence and it is now used in many domains, including software engineering and Information Retrieval (the science that regulates Search Engines and similar systems).
In particular, the advent of Deep Learning[4] introduced the use of deep neural networks to solve complex problems that were very challenging for classic algorithms.
For the sake of this blog post, it suffices to know that Deep Learning can be used to produce a vector representation of both the query and the documents in a corpus of information.
Dense Vector Representation
A traditional inverted index can be considered to model text as a “sparse” vector, in which each term in the corpus corresponds to one vector dimension. In such a model (see also the Bag-of-words approach), the number of dimensions corresponds to the term dictionary cardinality and the vector for any given document contains mostly zeros (hence it is called sparse, as only a few terms that exist in the overall dictionary will be present in any given document).
Dense vector representation contrasts with term-based sparse vector representation in that it distills approximate semantic meaning into a fixed (and limited) number of dimensions.
The number of dimensions in this approach is generally much lower than the sparse case, and the vector for any given document is dense, as most of its dimensions are populated by non-zero values.
In contrast to the sparse approach (for which tokenizers are used to generate sparse vectors directly from text input), the task of generating dense vectors must be handled in application logic external to Apache Solr.
Various Deep Learning models such as BERT[5] are able to encode textual information as dense vectors, to be used for Dense Retrieval strategies.
For additional information, you can refer to this blog post of ours.
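As a rough illustration of what that external application logic can look like, here is a minimal Java sketch that sends a piece of text to a hypothetical embedding service (the URL, the JSON contract, and the model behind it are assumptions for illustration, not part of Solr) and receives back the dense vector to be indexed:
Java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class EmbeddingClientSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical service wrapping a model such as BERT; endpoint and payload shape are assumptions.
        HttpClient http = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8000/embed"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString("{\"text\": \"neural search in Apache Solr\"}"))
                .build();
        // The response body is expected to contain the dense vector (e.g. a JSON array of floats),
        // which the indexing application then stores in the Solr dense vector field.
        HttpResponse<String> response = http.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}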
Approximate Nearest Neighbor
Given a dense vector v that models the information need, the easiest approach for providing dense vector retrieval would be to calculate the distance (Euclidean, dot product, etc.) between v and each vector d that represents a document in the corpus of information.
This approach is quite expensive, so many approximate strategies are currently under active research.
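To make the cost of the exact approach concrete, here is a minimal brute-force sketch (plain Java, not Solr or Lucene code): the query vector is scored against every document vector in the corpus, and this linear scan over the whole corpus is exactly what approximate strategies try to avoid.
Java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class ExactKnnSketch {

    // Euclidean distance between two vectors of equal dimension.
    static double euclidean(float[] a, float[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += (a[i] - b[i]) * (a[i] - b[i]);
        }
        return Math.sqrt(sum);
    }

    // Exact K-nearest neighbors: one distance computation per document in the corpus.
    static List<Integer> topK(float[] query, List<float[]> corpus, int k) {
        return IntStream.range(0, corpus.size())
                .boxed()
                .sorted(Comparator.comparingDouble((Integer i) -> euclidean(query, corpus.get(i))))
                .limit(k)
                .collect(Collectors.toList());
    }
}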
An approximate nearest neighbor search algorithm returns results whose distance from the query vector is at most c times the distance from the query vector to its nearest vectors.
The benefit of this approach is that, most of the time, an approximate nearest neighbor is almost as good as the exact one.
In particular, if the distance measure accurately captures the notion of user quality, then small differences in the distance should not matter[6].
Hierarchical Navigable Small World Graph
The strategy implemented in Apache Lucene and used by Apache Solr is based on Navigable Small-world graphs.
It provides an efficient approximate nearest neighbor search for high dimensional vectors[7][8][9][10].
Hierarchical Navigable Small World Graph (HNSW) is a method based on the concept of proximity neighbor graphs: every vector in the vector space associated with the corpus of information is uniquely associated with a vertex in the graph.
Vertices are connected through edges based on their proximity: the closer ones (in terms of a distance function) are linked.
Building the graph is affected by hyper-parameters that regulate how many connections to build per layer and how many layers to build.
At query time, the neighbor structure is navigated to find the closest vector(s) to the target one, starting from a seed node and iterating as we get closer and closer to the target.
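As a heavily simplified sketch of that navigation idea (a single layer and one greedy walk, with no candidate beam, unlike the real HNSW algorithm), the search can be pictured like this, assuming the graph is given as an adjacency list:
Java
import java.util.List;
import java.util.Map;

public class GreedyGraphSearchSketch {

    // Greedy walk over a pre-built proximity graph: start from a seed vertex and
    // repeatedly move to the neighbour that is closer to the target, until no
    // neighbour improves the distance. The result is an approximate nearest neighbour.
    static int greedySearch(Map<Integer, List<Integer>> graph,   // vertex id -> linked neighbour ids
                            Map<Integer, float[]> vectors,       // vertex id -> vector
                            float[] target,
                            int seed) {
        int current = seed;
        double best = euclidean(vectors.get(current), target);
        boolean improved = true;
        while (improved) {
            improved = false;
            for (int neighbour : graph.getOrDefault(current, List.of())) {
                double distance = euclidean(vectors.get(neighbour), target);
                if (distance < best) {
                    best = distance;
                    current = neighbour;
                    improved = true;
                }
            }
        }
        return current;
    }

    static double euclidean(float[] a, float[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += (a[i] - b[i]) * (a[i] - b[i]);
        }
        return Math.sqrt(sum);
    }
}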
I found this blog to be quite useful for a deep dive into the topic.
Apache Lucene Implementation
The first thing to notice is that the current Lucene implementation is not hierarchical.
So there is just one single layer in the graph; see the latest comments in the original Jira issue tracking the development progress[11].
The main reason lies in the easier design, development, and integration process of such a simplified implementation within the Apache Lucene ecosystem.
It’s agreed that introducing the hierarchical layered structure will bring benefits in terms of lower-dimensional vector management and query time (reducing the candidate nodes to traverse).
This implementation is in progress[12].
So what are the Apache Lucene components involved with the Navigable Small World Graph and K-Nearest Neighbors capability?
Let’s explore the code:
N.B. you can just skip this paragraph if you are not interested in Lucene internals and codecs
The org.apache.lucene.document.KnnVectorField is the entry point:
it requires at index time the vector dimension and the similarity function (the vector distance function to use when building the NSW graph).
Those are set in the org.apache.lucene.document.FieldType through the #setVectorDimensionsAndSimilarityFunction method.
When updating the document field schema in org.apache.lucene.index.IndexingChain#updateDocFieldSchema, the information is extracted from the FieldType and saved in the org.apache.lucene.index.IndexingChain.FieldSchema.
And from the FieldSchema the KnnVectorField configurations finally reach the org.apache.lucene.index.FieldInfo in org.apache.lucene.index.IndexingChain#initializeFieldInfo.
Now the Lucene codec has all the field-specific configurations necessary to build the NSW graph.
Let’s see how:
First of all, from org.apache.lucene.codecs.lucene90.Lucene90Codec#Lucene90Codec you can see that the default format for KNN vectors is org.apache.lucene.codecs.lucene90.Lucene90HnswVectorsFormat.
The associated index writer is the org.apache.lucene.codecs.lucene90.Lucene90HnswVectorsWriter.
This component has access to the FieldInfo previously initialized when writing fields to the index in org.apache.lucene.codecs.lucene90.Lucene90HnswVectorsWriter#writeField.
When writing a KnnVectorField, the org.apache.lucene.util.hnsw.HnswGraphBuilder is involved, and finally an org.apache.lucene.util.hnsw.HnswGraph is built.
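If you want to experiment with these components without going through Solr, here is a minimal sketch using the public Lucene 9.0 APIs (KnnVectorField for indexing and KnnVectorQuery for searching). It reuses the 4-dimensional example vectors from this post and only touches the public API, not the internal codec path described above:
Java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.KnnVectorField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.VectorSimilarityFunction;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.KnnVectorQuery;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class LuceneKnnSketch {
    public static void main(String[] args) throws Exception {
        try (Directory dir = new ByteBuffersDirectory()) {
            // Index two documents, each with a 4-dimensional vector field.
            try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
                Document d1 = new Document();
                d1.add(new StringField("id", "1", Field.Store.YES));
                d1.add(new KnnVectorField("vector", new float[]{1.0f, 2.5f, 3.7f, 4.1f},
                        VectorSimilarityFunction.COSINE));
                writer.addDocument(d1);

                Document d2 = new Document();
                d2.add(new StringField("id", "2", Field.Store.YES));
                d2.add(new KnnVectorField("vector", new float[]{1.5f, 5.5f, 6.7f, 65.1f},
                        VectorSimilarityFunction.COSINE));
                writer.addDocument(d2);
            }

            // Search for the 2 nearest neighbours of the target vector.
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                TopDocs hits = searcher.search(
                        new KnnVectorQuery("vector", new float[]{1.0f, 2.0f, 3.0f, 4.0f}, 2), 2);
                for (ScoreDoc hit : hits.scoreDocs) {
                    System.out.println(searcher.doc(hit.doc).get("id") + " score=" + hit.score);
                }
            }
        }
    }
}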
Apache Solr Implementation
Available from Apache Solr 9.0 (expected First Quarter 2022).
This first contribution allows indexing single-valued dense vector fields and searching for the K-nearest neighbors using an approximate distance function.
Current features:
- DenseVectorField type
- Knn Query Parser
DenseVectorField
The dense vector field gives the possibility of indexing and searching dense vectors of float elements.
e.g.
[1.0, 2.5, 3.7, 4.1]
Here’s how DenseVectorField should be configured in the schema:
<fieldType name="knn_vector" class="solr.DenseVectorField" vectorDimension="4" similarityFunction="cosine"/>
<field name="vector" type="knn_vector" indexed="true" stored="true"/>
| Parameter Name | Required | Default | Description | Accepted values |
|---|---|---|---|---|
| vectorDimension | True | | The dimension of the dense vector to pass in. | Integer <= 1024 |
| similarityFunction | False | euclidean | Vector similarity function; used in search to return the top K most similar vectors to a target vector. | euclidean, dot_product or cosine |
- euclidean: Euclidean distance.
- dot_product: Dot product. NOTE: this similarity is intended as an optimized way to perform cosine similarity. In order to use it, all vectors must be of unit length, including both document and query vectors (see the normalization sketch below). Using dot product with vectors that are not unit length can result in errors or poor search results.
- cosine: Cosine similarity. NOTE: the preferred way to perform cosine similarity is to normalize all vectors to unit length and instead use dot_product. You should only use this function if you need to preserve the original vectors and cannot normalize them in advance.
DenseVectorField supports the attributes: indexed, stored.
N.B. currently multi-valued fields are not supported.
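As mentioned in the dot_product note above, normalization to unit length must happen on the application side, for both document and query vectors, before indexing and before querying. A minimal Java sketch of such a normalization helper (a hypothetical utility, not part of Solr) could look like this:
Java
public class VectorNormalizationSketch {

    // Scales a vector to unit length (L2 norm = 1) so that dot_product
    // behaves like cosine similarity; apply it to both document and query vectors.
    static float[] normalize(float[] vector) {
        double norm = 0;
        for (float value : vector) {
            norm += value * value;
        }
        norm = Math.sqrt(norm);
        float[] unit = new float[vector.length];
        for (int i = 0; i < vector.length; i++) {
            unit[i] = (float) (vector[i] / norm);
        }
        return unit;
    }
}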
Customise the Index Codec
To use the following advanced parameters that customise the codec format and the hyper-parameters of the HNSW algorithm, make sure you set this configuration in solrconfig.xml:
<codecFactory class="solr.SchemaCodecFactory"/>
...
Here’s how DenseVectorField can be configured with the advanced codec hyper-parameters:
<fieldType name="knn_vector" class="solr.DenseVectorField" vectorDimension="4" similarityFunction="cosine" codecFormat="Lucene90HnswVectorsFormat" hnswMaxConnections="10" hnswBeamWidth="40"/>
<field name="vector" type="knn_vector" indexed="true" stored="true"/>
| Parameter Name | Required | Default | Description | Accepted values |
|---|---|---|---|---|
| codecFormat | False | Lucene90HnswVectorsFormat | Specifies the knn codec implementation to use. | Lucene90HnswVectorsFormat |
| hnswMaxConnections | False | 16 | Lucene90HnswVectorsFormat only: controls how many of the nearest neighbor candidates are connected to the new node. It has the same meaning as M from the paper[8]. | Integer |
| hnswBeamWidth | False | 100 | Lucene90HnswVectorsFormat only: the number of nearest neighbor candidates to track while searching the graph for each newly inserted node. It has the same meaning as efConstruction from the paper[8]. | Integer |
Please note that the codecFormat accepted values may change in future releases.
NOTE: Lucene index back-compatibility is only supported for the default codec. If you choose to customize the codecFormat in your schema, upgrading to a future version of Solr may require you to either switch back to the default codec and optimize your index to rewrite it into the default codec before upgrading, or re-build your entire index from scratch after upgrading.
For the hyper-parameters of the HNSW implementation, see [8] for details.
How to Index a Vector
Here’s how a DenseVectorField should be indexed:
JSON
[{ "id": "1",
"vector": [1.0, 2.5, 3.7, 4.1]
},
{ "id": "2",
"vector": [1.5, 5.5, 6.7, 65.1]
}
]
XML
<field name="id">1
<field name="vector">1.0
<field name="vector">2.5
<field name="vector">3.7
<field name="vector">4.1
<field name="id">2
<field name="vector">1.5
<field name="vector">5.5
<field name="vector">6.7
<field name="vector">65.1
Java – SolrJ
final SolrClient client = getSolrClient();
final SolrInputDocument d1 = new SolrInputDocument();
d1.setField("id", "1");
d1.setField("vector", Arrays.asList(1.0f, 2.5f, 3.7f, 4.1f));
final SolrInputDocument d2 = new SolrInputDocument();
d2.setField("id", "2");
d2.setField("vector", Arrays.asList(1.5f, 5.5f, 6.7f, 65.1f));
client.add(Arrays.asList(d1, d2));
knn Query Parser
The knn (K-Nearest Neighbors) query parser allows finding the k-nearest documents to the target vector, according to the indexed dense vectors in the given field.
It takes the following parameters:
| Parameter Name | Required | Default | Description |
|---|---|---|---|
| f | True | | The DenseVectorField to search in. |
| topK | False | 10 | How many k-nearest results to return. |
Here’s how to run a KNN search:
&q={!knn f=vector topK=10}[1.0, 2.0, 3.0, 4.0]
The search results retrieved are the K-nearest to the input vector [1.0, 2.0, 3.0, 4.0], ranked by the similarityFunction configured at indexing time.
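For completeness, here is a minimal SolrJ sketch of the same search; it assumes the same getSolrClient() helper used in the indexing example above, and SolrQuery/QueryResponse come from the SolrJ client library (org.apache.solr.client.solrj):
Java – SolrJ
final SolrClient client = getSolrClient();  // same helper as in the indexing example above
final SolrQuery query = new SolrQuery("{!knn f=vector topK=10}[1.0, 2.0, 3.0, 4.0]");
final QueryResponse response = client.query(query);
// Each returned document is one of the K-nearest neighbours of the target vector.
response.getResults().forEach(doc -> System.out.println(doc.getFieldValue("id")));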
Usage with Filter Queries
The knn query parser can be used in filter queries:
&q=id:(1 2 3)&fq={!knn f=vector topK=10}[1.0, 2.0, 3.0, 4.0]
The knn query parser can also be used as the main query in combination with filter queries:
&q={!knn f=vector topK=10}[1.0, 2.0, 3.0, 4.0]&fq=id:(1 2 3)
IMPORTANT: When using knn in these scenarios, make sure you have a clear understanding of how filter queries work in Apache Solr: the ranked list of document IDs resulting from the main query q is intersected with the set of document IDs deriving from each filter query fq.
e.g.
Ranked list from q = [ID1, ID4, ID2, ID10], set from fq = {ID3, ID2, ID9, ID4} → resulting ranked list = [ID4, ID2]
Usage as Re-Ranking Query
The knn query parser can be used to rerank first pass query results:
&q=id:(3 4 9 2)&rq={!rerank reRankQuery=$rqq reRankDocs=4 reRankWeight=1}&rqq={!knn f=vector topK=10}[1.0, 2.0, 3.0, 4.0]
IMPORTANT: When using knn in re-ranking, pay attention to the topK parameter. The second pass score (deriving from knn) is calculated only if the document d from the first pass is within the K-nearest neighbors (in the whole index) of the target vector to search. This means the second pass knn is executed on the whole index anyway, which is a current limitation. The final ranked list of results will have the first pass score (main query q) added to the second pass score (the approximated similarityFunction distance to the target vector to search) multiplied by a multiplicative factor (reRankWeight). So if the document d is not present in the knn results, you get a zero contribution to the original score even if the distance vector calculation with the target query vector is not zero. Details about using the ReRank Query Parser can be found in the Apache Solr Wiki[13] section.
Future Works
This is just the beginning for the Apache Solr neural capabilities, more is coming!
I hope you enjoyed the overview and stay tuned for updates!
References
[6] Andoni, A.; Indyk, P. (2006-10-01). “Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions”. 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS’06). pp. 459–468. CiteSeerX 10.1.1.142.3471. doi:10.1109/FOCS.2006.49. ISBN 978-0-7695-2720-8.
[7] Malkov, Yury; Ponomarenko, Alexander; Logvinov, Andrey; Krylov, Vladimir (2014). “Approximate nearest neighbor algorithm based on navigable small world graphs”. Information Systems, Volume 45, Pages 61-68. ISSN 0306-4379. https://doi.org/10.1016/j.is.2013.10.006.
[8] Malkov, Yury; Yashunin, Dmitry (2016). “Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs”. arXiv:1603.09320 [cs.DS].
Shameless plug for our training and services!
Did I mention we do Artificial Intelligence in Search training and specific one-day training about Deep Learning for Search?
We also provide consulting on these topics, get in touch if you want to bring your search engine to the next level with the power of AI!
Author
Alessandro Benedetti
Alessandro Benedetti is the founder of Sease Ltd. A Senior Search Software Engineer, his focus is on R&D in information retrieval, information extraction, natural language processing, and machine learning.
Comments (8)
Michael Pulis
March 2, 2022
Would it be possible to represent the entire document as a list of vectors, and perform the search on this list of vector representations of the document and query, using Apache Solr?
Alessandro Benedetti
March 11, 2022
Hi Michael, we are working in that direction. The short answer right now is no, only single-valued vectors are accepted for a dense vector field.
Dallan Quass
July 9, 2022
I’m extremely thankful you are working on dense vector representation! Do you have a guesstimate on when multi-valued vectors would be supported? Then we could query sentence embeddings alongside keywords and metadata, which would be wonderful!
Yvonne
September 10, 2023
Hi, does topK currently work for values other than 10? And what is the maximum number of values returned? Thanks very much!
Ilaria Petreti
September 17, 2023
Hi Yvonne,
with the topK parameter you can specify the number of k-nearest results you want to return.
Currently, 10 is just the default value but you can specify any integer >= 1
https://solr.apache.org/guide/solr/latest/query-guide/dense-vector-search.html#knn-query-parser
Yvonne
September 17, 2023
Currently I use a parameter other than 10, but the returned results are still of size 10:
# print("Encoded Query:", search_list)
knn_query = '{!knn f=vector topK=15}' + str(search_list)
# print(knn_query)
results = self.solr.search(q=knn_query)
print(len(results))
Ilaria Petreti
September 18, 2023
This happens because of the ‘rows’ parameter, which specifies the maximum number of documents from the complete result set that Solr should return to the client at one time (for pagination). The default value is 10.
If you have set topK=15 for the dense vector search, try to add rows=15 to the query as well.
e.g.
&q={!knn f=vector topK=15}[1.0, 2.0, 3.0, 4.0]&rows=15