Neural Search in Apache Solr has been contributed to the Open Source community by Sease[1] with the work of Alessandro Benedetti (Apache Lucene/Solr PMC member and committer) and Elia Porciani (Sease R&D software engineer).
It relies on the Apache Lucene implementation[2] for K-nearest neighbor search.

For more information about the contribution see the main blog post.

Setup and collection

To benchmark our solution we set up our Solr instances using dockerized Solr on a t3.large AWS machine (2 vCPUs, 8 GB RAM).

We have chosen the MS MARCO collection for document retrieval. We used only a subset of the documents, as transforming all of them with BERT is very expensive and our goal was only to run a simple benchmark.

We have taken the first 461K documents from the collection. We then applied BERT to every document in the sub-sample and to every query, and stored the resulting embeddings in separate files structured in the following way:

  • one line for each vector
  • each vector is a comma separated list of float values
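The file format above can be sketched with a couple of small helpers (a minimal sketch; the function names are our own, not part of the benchmark code):

```python
# One vector per line, each a comma-separated list of float values,
# as described in the post.

def write_embeddings(path, vectors):
    """Write one comma-separated vector per line."""
    with open(path, "w") as f:
        for vec in vectors:
            f.write(",".join(str(x) for x in vec) + "\n")

def read_embeddings(path):
    """Parse the file back into lists of floats."""
    with open(path) as f:
        return [[float(x) for x in line.split(",")] for line in f]
```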

Here are the stats of the resulting files:

  • #documents: 461K
  • embedding vector length: 768
  • document file size: 3.1 GB
  • embedding file size: 3.8 GB
  • #queries: 5750
  • AVG document length: 1087 words
  • AVG query length: 5.9 words

Indexing speed and size

We created two indexes: one with the documents indexed as text only, and the other with the embeddings (using the DenseVectorField field type). Here are the results of the indexing process.

Text

  • Indexing time: 15 minutes
  • Index size (after optimization, no stored fields): 1.17 GB

Embeddings

  • Indexing time: 32 minutes
  • Index size (after optimization, no stored fields): 1.34 GB
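For reference, a DenseVectorField is declared in the schema along these lines (a sketch based on the standard Solr 9 syntax; the field and type names here are assumed, not taken from the benchmark):

```xml
<!-- Field type for 768-dimensional BERT embeddings;
     similarityFunction can be euclidean, dot_product, or cosine. -->
<fieldType name="knn_vector" class="solr.DenseVectorField"
           vectorDimension="768" similarityFunction="cosine"/>

<!-- The vector field itself (name "vector" is our assumption). -->
<field name="vector" type="knn_vector" indexed="true" stored="true"/>
```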

N.B. These numbers are very specific to this use case.

Stored fields

We wanted some indication that stored fields are managed correctly. We created another Solr index where we indexed the embedding data as a multivalued FloatPointField. Then, we compared the space occupied by the stored fields alone between the two Solr indexes.

The stored fields for the DenseVectorField field type take 1420 MB, while the stored fields for the multivalued FloatPointField take 1480 MB: the space occupancy is essentially the same when we store the same data with DenseVectorField or with a multivalued FloatPointField.

Query performance

To measure query performance, we took the average round-trip time of the REST call execution. We repeated the measurements after index optimization.

                       before optimization   after optimization
  text queries                32 ms                 27 ms
  knn vector queries          22 ms                  8 ms
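The averaging of round-trip times can be sketched as follows (a minimal helper of our own; in the benchmark each call would be one HTTP request to Solr):

```python
import time

def avg_round_trip_ms(run_query, n=100):
    """Return the average wall-clock time in milliseconds over n
    executions of run_query, a callable issuing one query."""
    start = time.perf_counter()
    for _ in range(n):
        run_query()
    return (time.perf_counter() - start) / n * 1000.0
```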

From the results in the table, we can see that in this specific use case knn vector search is more efficient than full-text search (especially once the index is optimized). However, keep in mind that before executing the queries we had already transformed the text queries into vectors in a preprocessing step, and that this step has a non-negligible cost. We decided to exclude it from the benchmarks as it depends on the model used.
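For context, a knn vector query in Solr 9 uses the knn query parser, sent for example through the JSON Request API (a sketch; the field name "vector" is our assumption, and the full 768-value query vector is elided):

```json
{
  "query": "{!knn f=vector topK=10}[0.012, -0.043, ...]"
}
```

Here topK controls how many nearest neighbors are retrieved, and the bracketed list is the query embedding produced by the preprocessing step mentioned above.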


Shameless plug for our training and services!

Did I mention we offer Artificial Intelligence in Search training and a specific one-day training about Deep Learning for Search?
We also provide consulting on these topics; get in touch if you want to bring your search engine to the next level with the power of AI!


Subscribe to our newsletter

Did you like this post about the Apache Solr Neural Search knn benchmark? Don't forget to subscribe to our Newsletter to stay up to date with the Information Retrieval world!

Author

Elia Porciani

Elia is a Software Engineer passionate about algorithms and data structures for search engines and efficiency. He is an active part of the information retrieval research community, attending international conferences such as SIGIR and ECIR.
