
Apache Solr Multivalued Vectors Tutorial

Hi there!

In this blog post, I’m going to introduce a new feature added by Alessandro Benedetti in the upcoming Apache Solr 10.1, thanks to an anonymous, generous sponsor: multivalued vectors!

The scope of this contribution (SOLR-18074) is to allow any fieldType based on solr.DenseVectorField to be multivalued, in addition to stored and indexed (which were already supported). The main features added are:

  • Index vectors as multivalued fields, meaning that we can index more than one vector for each document. This has been done by leveraging internal nested vectors.
  • Return the vectors like any other multivalued field. This is done by leveraging a child transformer in both of the following cases:
    • return all the vectors associated with the document
    • return just the highest-scoring vector from each document

This contribution can be used in a lot of cases. The first one that comes to my mind is using embeddings of chunks in vector search instead of a single embedding of the whole document. A whole-document embedding can hurt the quality of your vector search: if the document is big, it probably covers a lot of different topics, so its single vector becomes a blurred mix of all of them and captures little of each. Splitting the document into chunks is a good way to mitigate this problem, since each chunk contains less information than the whole document and its embedding is more focused.
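To make the chunking idea concrete, here is a minimal sketch of a word-window splitter with overlap. It is not part of Solr or of the contribution: the class name TextChunker, the window size, and the overlap are all assumptions I picked for illustration, and the embedding step (one vector per chunk) is left to whatever model you use.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not a Solr API): splits a text into overlapping word windows.
class TextChunker {

    // chunkSize and overlap are counted in words; both values are illustrative choices.
    static List<String> chunk(String text, int chunkSize, int overlap) {
        if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
            throw new IllegalArgumentException("need chunkSize > 0 and 0 <= overlap < chunkSize");
        }
        List<String> chunks = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return chunks;
        }
        String[] words = trimmed.split("\\s+");
        int step = chunkSize - overlap;
        for (int start = 0; start < words.length; start += step) {
            int end = Math.min(start + chunkSize, words.length);
            chunks.add(String.join(" ", Arrays.copyOfRange(words, start, end)));
            if (end == words.length) {
                break; // the last window already reached the end of the text
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        for (String c : chunk("one two three four five six seven", 3, 1)) {
            System.out.println(c);
        }
    }
}
```

Each chunk would then be embedded separately, and the resulting vectors indexed together in one multivalued vector field of the same document.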

Internally, the feature leverages the Diversifying children query at query time and, at indexing time, models multiple vectors per document through nested vectors.
At query time, you can use the same queries seen in the nested vectors tutorial (Searching Children, Finding Parents: Nested KNN Vector Search in Solr), plus a dedicated functionality of the child transformer that renders the multivalued vector in the search results (flattened).

Index Time

At index time, we must be sure some prerequisites are satisfied.

  1. RemoveBlankFieldUpdateProcessorFactory doesn’t work with multivalued vectors at the moment. It must be removed if you are using it in your configuration (the default configuration uses it).
  2. A vector fieldType must be defined in your schema. One example is the one in the official DenseVectorField documentation.
  3. A new field to accommodate multivalued vectors must be declared in your schema, of the type just defined. Be sure to define this field as multivalued.

I report below the definition for the field and the fieldType to be sure you are able to reproduce the example as easily as possible:

				
					<fieldType name="knn_vector" class="solr.DenseVectorField" vectorDimension="4" similarityFunction="cosine"/>
				
			
				
					<field name="vector_multivalued" type="knn_vector" indexed="true" stored="true" multiValued="true"/>
				
			

Following the documentation, you can index vectors with the following JSON payload:

				
					[
    { 
        "id": "1",
        "color_s":"RED",
        "vector_multivalued": [
            [1.0, 2.0, 3.0, 4.0],
            [5.0, 6.0, 7.0, 8.0]
            ]
    },
    { 
        "id": "2",
        "color_s":"BLUE",
        "vector_multivalued": [
            [1.0, 2.0, 3.0, 4.0],
            [5.0, 6.0, 7.0, 8.0]
            ]
    }
]
				
			

In the same way, it can be easily done through the SolrJ API in Java as follows:

				
					Http2SolrClient client = new Http2SolrClient.Builder(solrUrl).build();

final SolrInputDocument d1 = new SolrInputDocument();
d1.setField("id", "1");
d1.setField("color_s", "RED");
List<List<Float>> floatVectors1 = new ArrayList<>(2);
floatVectors1.add(Arrays.asList(1.0f, 2.0f, 3.0f, 4.0f));
floatVectors1.add(Arrays.asList(5.0f, 6.0f, 7.0f, 8.0f));
d1.setField("vector_multivalued", floatVectors1);


final SolrInputDocument d2 = new SolrInputDocument();
d2.setField("id", "2");
d2.setField("color_s", "BLUE");
List<List<Float>> floatVectors2 = new ArrayList<>(2);
floatVectors2.add(Arrays.asList(1.0f, 2.0f, 3.0f, 4.0f));
floatVectors2.add(Arrays.asList(5.0f, 6.0f, 7.0f, 8.0f));
d2.setField("vector_multivalued", floatVectors2);

client.add(Arrays.asList(d1, d2));
				
			

where solrUrl is a String containing the endpoint of your collection (e.g., “http://localhost:8983/solr/YOUR_COLLECTION”).
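In practice, embeddings often arrive as primitive float arrays rather than as lists. The following is a minimal sketch, under that assumption, of a helper that converts them into the List<List<Float>> shape used in the SolrJ snippet and checks each vector against the vectorDimension declared in the schema (4 in this tutorial). The class and method names are hypothetical, not part of SolrJ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not a SolrJ API): prepares the value for a multivalued vector field.
class VectorFieldValues {

    // Converts raw embeddings into nested lists, rejecting vectors whose
    // length differs from the dimension configured on the fieldType.
    static List<List<Float>> toFieldValue(float[][] vectors, int expectedDimension) {
        List<List<Float>> value = new ArrayList<>(vectors.length);
        for (float[] vector : vectors) {
            if (vector.length != expectedDimension) {
                throw new IllegalArgumentException(
                    "expected " + expectedDimension + " components but got " + vector.length);
            }
            List<Float> boxed = new ArrayList<>(vector.length);
            for (float component : vector) {
                boxed.add(component);
            }
            value.add(boxed);
        }
        return value;
    }

    public static void main(String[] args) {
        float[][] embeddings = {
            {1.0f, 2.0f, 3.0f, 4.0f},
            {5.0f, 6.0f, 7.0f, 8.0f}
        };
        System.out.println(toFieldValue(embeddings, 4));
    }
}
```

The returned list can be passed to setField("vector_multivalued", ...) exactly like floatVectors1 in the snippet above. Failing early on a wrong dimension keeps the error on the client side instead of surfacing at indexing time.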

Regardless of the method used during indexing, if you then query your collection with a “catch-all” query, you will obtain 6 documents, where:

  • 2 of them are parents, in which the field id is the same as the _root_ field.
  • 4 of them are children, specifically 2 children for each parent. The children are actual documents, each with an id different from its _root_ and with a vector_multivalued field that contains just one vector.

This is evidence of what happens internally: nested documents are used to model multiple vectors per parent document (this happens automatically).

Note that the other field color_s stays in the parent documents.
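The parent/child rule described above (a parent has id equal to _root_, a child does not) can be sketched in plain Java. To keep the example self-contained, documents are modelled as simple maps rather than SolrJ objects, and the child ids used here are invented placeholders: the real ids generated by Solr will look different.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not a Solr API): separates parents from the
// internally generated child documents of a catch-all response.
class NestedDocSplitter {

    // A parent is a document whose id equals its _root_ value.
    static boolean isParent(Map<String, Object> doc) {
        Object id = doc.get("id");
        return id != null && id.equals(doc.get("_root_"));
    }

    // Groups the child documents by the id of the parent they belong to.
    static Map<String, List<Map<String, Object>>> childrenByRoot(List<Map<String, Object>> docs) {
        Map<String, List<Map<String, Object>>> grouped = new LinkedHashMap<>();
        for (Map<String, Object> doc : docs) {
            if (!isParent(doc)) {
                String root = String.valueOf(doc.get("_root_"));
                grouped.computeIfAbsent(root, k -> new ArrayList<>()).add(doc);
            }
        }
        return grouped;
    }

    // Small factory to build a document with only the two fields the rule needs.
    static Map<String, Object> doc(String id, String root) {
        Map<String, Object> d = new LinkedHashMap<>();
        d.put("id", id);
        d.put("_root_", root);
        return d;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> docs = List.of(
            doc("1", "1"), doc("1-child-a", "1"), doc("1-child-b", "1"), // placeholder child ids
            doc("2", "2"), doc("2-child-a", "2"), doc("2-child-b", "2"));
        childrenByRoot(docs).forEach((root, children) ->
            System.out.println("parent " + root + " has " + children.size() + " children"));
    }
}
```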

Query Time

Now we want to query our collection. How can we do it? I’ll show you 3 (+1) different queries; in this way, you can gain a better understanding and develop your own.

A brief description can be found in the documentation:

Behind the scenes, a multivalued vector field is handled by Solr as nested documents with a single vector each (see the parameters for the knn query parser that deal with nested vectors parents.preFilter and ‘childrenOf’).

Some Simple Queries

In the following query, we want to return the top 2 documents that match the vector query. This is how we do it with the configuration set up above.

				
					q={!parent which=$allParents score=max v=$children.q}&
children.q={!knn f=vector_multivalued topK=2 childrenOf=$allParents}[1.0, 2.0, 3.0, 4.0]&
allParents=*:* -_nest_path_:*
				
			

The query above returns the following documents in the response:

				
					[
    {
        "id": "1",
        "color_s": "RED"
    },
    {
        "id": "2",
        "color_s": "BLUE"
    }
]
				
			

What if we want to filter out some parent documents? We can use the parents.preFilter parameter, available in the knn query parser when we query a DenseVectorField in Solr.

Here is an example of how we can do it for the color_s field:

				
					q={!parent which=$allParents score=max v=$children.q}&
children.q={!knn f=vector_multivalued topK=2 parents.preFilter=$someParents childrenOf=$allParents}[1.0, 2.0, 3.0, 4.0]&
allParents=*:* -_nest_path_:*&
someParents=color_s:RED
				
			

As expected, we obtain:

				
					[
    {
        "id": "1",
        "color_s": "RED"
    }
]
				
			

That’s it for these first very simple queries!
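If you send these queries over plain HTTP, keep in mind that the local-params syntax contains spaces, braces, and dollar signs that must be URL-encoded. Below is a minimal sketch that assembles the same parameters shown above and encodes them with the Java standard library only. The class and method names are hypothetical; with SolrJ you would normally pass the same parameter names and values to the client and let it handle the encoding.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper (not a Solr API): builds the request parameters used in this section.
class MultivaluedVectorQuery {

    // parentFilter may be null; when present it is sent as $someParents
    // and referenced from the knn parser through parents.preFilter.
    static Map<String, String> params(String field, int topK, String vector, String parentFilter) {
        Map<String, String> p = new LinkedHashMap<>();
        p.put("q", "{!parent which=$allParents score=max v=$children.q}");
        String preFilter = parentFilter == null ? "" : " parents.preFilter=$someParents";
        p.put("children.q",
            "{!knn f=" + field + " topK=" + topK + preFilter + " childrenOf=$allParents}" + vector);
        p.put("allParents", "*:* -_nest_path_:*");
        if (parentFilter != null) {
            p.put("someParents", parentFilter);
        }
        return p;
    }

    // URL-encodes every name and value and joins them with '&'.
    static String toQueryString(Map<String, String> params) {
        StringJoiner joiner = new StringJoiner("&");
        for (Map.Entry<String, String> e : params.entrySet()) {
            joiner.add(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        Map<String, String> p = params("vector_multivalued", 2, "[1.0, 2.0, 3.0, 4.0]", "color_s:RED");
        System.out.println(toQueryString(p));
    }
}
```

Passing null as the last argument reproduces the first query of this section, while a value such as color_s:RED reproduces the filtered one.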

Rendering of Results

You probably noticed that, in the first two queries, no vector is displayed in the response. If you also want to display the vectors, as mentioned at the beginning, there are 2 ways to render the result:

  • return all vectors associated with each of the top documents
  • return only the top vector associated with each of the top documents

Both of them leverage the child transformer to do so. Let’s see them together.

All Vectors Returned

In the following query, we want to return the top 2 documents that match the vector query, alongside all the vectors associated with each returned document. This is how we do it with the configuration set up above.

				
					q={!parent which=$allParents score=max v=$children.q}&
children.q={!knn f=vector_multivalued topK=2 childrenOf=$allParents}[1.0, 2.0, 3.0, 4.0]&
allParents=*:* -_nest_path_:*&
fl=id,color_s,vector_multivalued,[child fl="vector_multivalued"]
				
			

This leads to the following response:

				
					[
    {
        "id": "1",
        "color_s": "RED",
        "vector_multivalued": [
            [1.0, 2.0, 3.0, 4.0],
            [5.0, 6.0, 7.0, 8.0]
        ]
    },
    {
        "id": "2",
        "color_s": "BLUE",
        "vector_multivalued": [
            [1.0, 2.0, 3.0, 4.0],
            [5.0, 6.0, 7.0, 8.0]
        ]
    }
]
				
			
Only the Top Vector Returned

It might be too much to display all vectors; maybe you are interested only in the one that best matches your query’s embedding. Similarly to before, we can add to the result only the top-scoring vector of each returned document. This is how we do it with the configuration set up above.

				
					q={!parent which=$allParents score=max v=$children.q}&
children.q={!knn f=vector_multivalued topK=2 childrenOf=$allParents}[1.0, 2.0, 3.0, 4.0]&
allParents=*:* -_nest_path_:*&
fl=id,color_s,vector_multivalued,[child fl="vector_multivalued" childFilter=$children.q]
				
			

Note that here we added filtering with the childFilter parameter from the child transformer. This ends up in the following response:

				
					[
    {
        "id": "1",
        "color_s": "RED",
        "vector_multivalued": [
            [1.0, 2.0, 3.0, 4.0]
        ]
    },
    {
        "id": "2",
        "color_s": "BLUE",
        "vector_multivalued": [
            [1.0, 2.0, 3.0, 4.0]
        ]
    }
]
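To see why [1.0, 2.0, 3.0, 4.0] is the vector that survives for each document, we can compute the cosine similarity (the similarityFunction configured on the fieldType) between the query vector and both stored vectors. This is only a sketch of the raw formula with hypothetical class and method names; the score Solr actually reports may be a rescaled version of this value.

```java
// Hypothetical helper (not a Solr API): raw cosine similarity between float vectors.
class CosineDemo {

    static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same dimension");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Returns the index of the candidate most similar to the query.
    static int bestMatch(float[] query, float[][] candidates) {
        int best = 0;
        for (int i = 1; i < candidates.length; i++) {
            if (cosine(query, candidates[i]) > cosine(query, candidates[best])) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        float[] query = {1.0f, 2.0f, 3.0f, 4.0f};
        float[][] stored = {
            {1.0f, 2.0f, 3.0f, 4.0f},
            {5.0f, 6.0f, 7.0f, 8.0f}
        };
        System.out.println("best vector index: " + bestMatch(query, stored));
    }
}
```

The first stored vector is identical to the query, so its cosine similarity is 1, while the second one points in a slightly different direction and scores a bit lower (roughly 0.97).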
				
			

You should now have everything you need to start working with multivalued vectors in Apache Solr. Hopefully, this post proved to be helpful and gave you some useful insights. Keep an eye on our website for updates, news, and future blog posts.

Need Help with this topic?

If you're struggling with Apache Solr multivalued vectors, don't worry - we're here to help! Our team offers expert services and training to help you optimize your Solr search engine and get the most out of your system. Contact us today to learn more!
