We have recently been working on contributing knn search to Solr, building on the latest Lucene developments. The goal of this blog post is to share some of the benchmark measurements gathered during the development process.
Setup and collection
To benchmark our solution, we set up our Solr instances using dockerized Solr on a t3.large AWS machine (2 vCPUs, 8 GB RAM).
We chose the MS MARCO collection for document retrieval. We used only a subset of the documents, as it is very expensive to transform all of them with BERT and our goal was only to perform a simple benchmark.
We took the first 461K documents from the collection. We then applied BERT to all the documents in the sub-sample and to all the queries, and stored the embeddings in separate files structured as follows:
- one line for each vector
- each vector is a comma separated list of float values
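The file layout described above (one vector per line, comma-separated floats) can be read back with a few lines of Python. This is only a minimal sketch: the function name `read_vectors` and the `expected_dim` check are illustrative assumptions and not part of the original tooling.

```python
import io


def read_vectors(lines, expected_dim=768):
    """Yield one vector (a list of floats) per non-empty line.

    `lines` is any iterable of strings, e.g. an open text file.
    `expected_dim` is a hypothetical sanity check on the vector length.
    """
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, e.g. a trailing newline
        vector = [float(value) for value in line.split(",")]
        if len(vector) != expected_dim:
            raise ValueError(
                f"line {line_number}: expected {expected_dim} values, "
                f"got {len(vector)}"
            )
        yield vector


# Tiny in-memory example with 3-dimensional vectors instead of 768.
sample = io.StringIO("0.1,0.2,0.3\n-1.5,2.0,0.0\n")
vectors = list(read_vectors(sample, expected_dim=3))
print(vectors)
```

With a real embeddings file, the same function could be called on `open(path, encoding="utf-8")`, which streams the vectors instead of loading the whole 3.8 GB file at once.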
Here are the stats of the resulting files:
|embedding vector length|768|
|document file size|3.1 GB|
|embedding file size|3.8 GB|
|AVG document length|1087 words|
|AVG query length|5.9 words|
Indexing speed and size
We created two indexes: one with the documents indexed as text only, and the other with the embeddings (using the DenseVectorField field type). Here are the results of the indexing process.
Text index:
- Indexing time: 15 minutes
- Index size (after optimization, no stored fields): 1.17 GB
Vector index:
- Indexing time: 32 minutes
- Index size (after optimization, no stored fields): 1.34 GB
NB: these numbers are very specific to this use case.
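For readers who want to reproduce a vector index like the one above, the following sketch builds a Schema API payload that declares a DenseVectorField. The exact schema used in the benchmark is not shown in this post, so the field type name `knn_vector`, the field name `vector`, and the choice of `cosine` similarity are assumptions made for illustration; `stored` is set to false to mirror the "no stored fields" measurement.

```python
import json


def vector_schema_commands(field_name="vector", dimension=768,
                           similarity="cosine"):
    """Return Schema API commands declaring a dense vector field.

    The names "knn_vector" and "vector" are hypothetical; the similarity
    function is an assumption, since the post does not state which was used.
    """
    return {
        "add-field-type": {
            "name": "knn_vector",
            "class": "solr.DenseVectorField",
            "vectorDimension": dimension,
            "similarityFunction": similarity,
        },
        "add-field": {
            "name": field_name,
            "type": "knn_vector",
            "indexed": True,
            "stored": False,
        },
    }


commands = vector_schema_commands()
print(json.dumps(commands, indent=2))
```

The printed JSON could then be POSTed to the schema endpoint of a collection; the sketch deliberately stops at building the payload so that it runs without a live Solr instance.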
We wanted some indication that stored fields are managed correctly. We created another Solr index in which we indexed the vectors as a multivalued FloatPointField. Then we compared the space occupied by the stored fields alone in the two Solr instances.
The stored fields for the DenseVectorField field type take 1420 MB, while the stored fields for FloatPointField take 1480 MB. The gap is about 60 MB (roughly 4%), so there is basically no difference in space occupancy between DenseVectorField and a multivalued FloatPointField.
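As a quick sanity check on these figures, the arithmetic can be worked through in a few lines. This is a back-of-the-envelope sketch under stated assumptions: exactly 461,000 documents (the post only says "461K"), 4 bytes per float, and decimal megabytes.

```python
NUM_DOCS = 461_000      # assumption: "461K" taken as exactly 461,000
DIMENSION = 768         # embedding vector length from the stats table
BYTES_PER_FLOAT = 4     # assumption: 32-bit floats

# Raw payload if every vector were stored as plain 32-bit floats.
raw_bytes = NUM_DOCS * DIMENSION * BYTES_PER_FLOAT
raw_mb = raw_bytes / 1_000_000  # assumption: decimal megabytes

# Measured stored-field sizes reported in the post.
dense_vector_mb = 1420
float_point_mb = 1480
relative_gap = (float_point_mb - dense_vector_mb) / dense_vector_mb

print(f"raw float payload: {raw_mb:.0f} MB")   # prints "raw float payload: 1416 MB"
print(f"relative gap: {relative_gap:.1%}")     # prints "relative gap: 4.2%"
```

Under these assumptions the raw float payload comes to about 1416 MB, which is close to the 1420 MB measured for DenseVectorField.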
To measure query performance, we took the average round-trip time of the REST calls. We repeated the measurements after optimizing the index.
| |before optimization|after optimization|
|knn vector queries|22ms|8ms|
From the results in the table, we can see that in this specific use case knn vector search is more efficient than full-text search (especially when the index is optimized). However, keep in mind that to execute the queries we had already transformed the text queries into vectors in a preprocessing step. This step has a non-negligible cost. We decided to exclude it from the benchmarks as it depends on the model used.
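The measurement approach described above (average round-trip time over precomputed query vectors) could be sketched as follows. This is not the harness used for the numbers in this post: the field name `vector`, the `topK` value, and the `send` callable are hypothetical, and a stand-in function replaces the real HTTP call so the example runs without a Solr instance.

```python
import time
from statistics import mean


def knn_query(vector, field="vector", top_k=10):
    """Build the `q` parameter for Solr's knn query parser.

    `field` and `top_k` are illustrative defaults, not benchmark settings.
    """
    values = ", ".join(str(v) for v in vector)
    return "{!knn f=%s topK=%d}[%s]" % (field, top_k, values)


def average_round_trip_ms(send, queries):
    """Call `send(query)` for each query and return the mean time in ms."""
    timings = []
    for query in queries:
        start = time.perf_counter()
        send(query)
        timings.append((time.perf_counter() - start) * 1000.0)
    return mean(timings)


def fake_send(query):
    """Stand-in for the real REST call to Solr's select endpoint."""
    time.sleep(0.001)


# Query vectors are assumed to be precomputed, as in the benchmark.
queries = [
    knn_query([0.1, 0.2, 0.3], top_k=5),
    knn_query([1.0, 0.0, -1.0], top_k=5),
]
avg_ms = average_round_trip_ms(fake_send, queries)
print(queries[0])
print(f"average round trip: {avg_ms:.2f} ms")
```

In a real run, `fake_send` would be replaced by a function that issues the HTTP request, and the embedding step would stay outside the timed region, matching the choice to exclude it from the benchmark.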