Dense vector search was introduced in Apache Solr 9.0 in 2022, and since then it has seen substantial adoption from the community.
Text vectorisation had to happen outside Solr, as there was no support for transparently encoding text to vectors within the search engine.
Apache Solr 9.8 changes this, introducing a module that allows interaction with well-known large language model providers such as OpenAI, Cohere, Hugging Face, and Mistral AI via the open-source library LangChain4j.
Traditional (keyword) Search Problems
One of the biggest problems we’ve seen in many traditional search engines is the ‘vocabulary mismatch problem’:
incorrect or non-exhaustive search results are returned because the terms used at query time (the lexicon) don’t match the terms used in the documents of the corpus, i.e. queries and documents use different terms to describe the same (or closely related) concepts.
Vector Search
A solution to the problem is to use Large Language Models (specifically fine-tuned for sentence similarity) to encode text to a numerical vector, in a way that sentences that are semantically similar are encoded to vectors that are close to each other in the vector space.
In this way, searching for content that is semantically close to a query sentence maps to running a k-nearest-neighbor query on vectors.
Up to Solr 9.7, as you can see from the diagram, text vectorisation had to happen outside Solr: the search engine was only able to handle vectors, and didn’t support end-to-end semantic search transparently.
Semantic Search from Apache Solr 9.8
This changes with Apache Solr 9.8: with the introduction of the LLM module, you can configure Solr to talk with an external service to do the text vectorisation for you, offering a transparent semantic search experience end-to-end.
Once configured with a vectorisation model (and we’ll see shortly how to do it), Solr is able to encode text to vectors (both at query and indexing time) and run vector search to find results relevant to the user’s information need.
llm module (from Apache Solr 9.8, January 2025)
This module:
- stores the configuration to access text vectorisation APIs external to Solr (LangChain4j is used internally to interact with such APIs)
- implements a query parser (that encodes the query to a vector and then builds a knn query)
- implements an update request processor to vectorise the content of textual fields
To enable the module you can follow the standard Solr documentation:
- bin/solr start -e techproducts -Dsolr.modules=llm
- add a <str name="modules"> tag to your solr.xml (see the sketch after this list)
- environment variable SOLR_MODULES (e.g. in solr.in.sh or solr.in.cmd)
- system property solr.modules
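For the solr.xml option, a minimal sketch might look like this (only the modules entry is shown; the rest of your solr.xml stays as it is):

<solr>
  <str name="modules">llm</str>
</solr>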
Once enabled you can configure and use its internal components (the query parser and the update request processor).
Models
A text-to-vector model has the responsibility of encoding text to a vector.
At the time of writing, only external models are supported: the text encoding doesn’t happen in the Solr JVM nor locally, but only through external APIs.
A model (with the parameters to access it) is described via a JSON payload: the Solr vectorisation model specifies the parameters to access the APIs; the model doesn’t run internally in Solr.
A model is described by these parameters:
class
| Required | Default: none |
The model implementation. Accepted values:
- dev.langchain4j.model.huggingface.HuggingFaceEmbeddingModel
- dev.langchain4j.model.mistralai.MistralAiEmbeddingModel
- dev.langchain4j.model.openai.OpenAiEmbeddingModel
- dev.langchain4j.model.cohere.CohereEmbeddingModel
name
| Required | Default: none |
The identifier of your model, used by any component that intends to use the model (e.g. the knn_text_to_vector query parser).
params
| Optional | Default: none |
Each model class has potentially different parameters. Many are shared, but for the full set of parameters of the model you are interested in, please refer to the official documentation of the LangChain4j version included in Solr: Vectorisation Models in LangChain4j.
Currently four models are supported: Hugging Face, MistralAI, OpenAI and Cohere.
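As an illustration, a model definition for OpenAI might look like the JSON below. The params follow the LangChain4j OpenAiEmbeddingModel builder; the name, apiKey, and modelName values here are placeholders (the name ‘dummy-1’ is reused in the snippets later in this post), so adapt them to your setup:

{
  "class": "dev.langchain4j.model.openai.OpenAiEmbeddingModel",
  "name": "dummy-1",
  "params": {
    "apiKey": "sk-your-api-key",
    "modelName": "text-embedding-3-small"
  }
}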
To upload a model from a file (e.g. /path/myModel.json), please run:
curl -XPUT 'http://localhost:8983/solr/techproducts/schema/text-to-vector-model-store' --data-binary "@/path/myModel.json" -H 'Content-type:application/json'
To view all models:
http://localhost:8983/solr/techproducts/schema/text-to-vector-model-store
To view a model (‘model1’):
http://localhost:8983/solr/techproducts/schema/text-to-vector-model-store/model1
To delete a model (‘model1’):
curl -XDELETE 'http://localhost:8983/solr/techproducts/schema/text-to-vector-model-store/model1'
Indexing Time
The ‘llm’ module introduces ‘solr.llm.textvectorisation.update.processor.TextToVectorUpdateProcessor’, a component that takes a Solr document in input and enriches it with a vector encoding of a textual field. It can be configured in an update request processor chain, for example like this (the chain name and surrounding processors are illustrative; the vectorisation processor is registered via its factory):
<updateRequestProcessorChain name="textToVector">
  <processor class="solr.llm.textvectorisation.update.processor.TextToVectorUpdateProcessorFactory">
    <str name="inputField">_text_</str>
    <str name="outputField">vector</str>
    <str name="model">dummy-1</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
Adding this component to an update request processor chain means that all documents you index will be enriched with a vector, encoded from the ‘inputField’ using the model in the parameters.
The content of your document (‘inputField’) is sent to a remote hosted model. Be careful with your performance and privacy requirements!
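For the output field to be indexed and searchable, your schema needs a dense vector field. A minimal sketch, assuming a model that produces 384-dimensional embeddings (adjust vectorDimension to your model’s embedding size):

<fieldType name="knn_vector" class="solr.DenseVectorField" vectorDimension="384" similarityFunction="cosine"/>
<field name="vector" type="knn_vector" indexed="true" stored="true"/>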
Enrich Documents with Vectors on a Second Pass
Naive Approach
Vectorisation is slow (especially when network latency is added to the picture), so you may want to first index your documents and then, in the background, add the vectorised fields.
Unfortunately, right now Solr doesn’t offer the capability of building the vector data structures in the background after the traditional indexing is complete (i.e. making documents searchable first and vectorising them later).
There are still some workarounds that can mitigate the situation a bit.
A first approach is to define two update request processor chains, identical except that one adds the vectorisation.
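A minimal sketch of the first, ‘no-vectorisation’ chain (the processor list is illustrative; keep whatever processors you already use):

<updateRequestProcessorChain name="no-vectorisation">
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>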
You first target the ‘no-vectorisation’ chain and index all your documents.
The second, ‘vectorisation’ chain is identical, plus the vectorisation processor:

<updateRequestProcessorChain name="vectorisation">
  <processor class="solr.llm.textvectorisation.update.processor.TextToVectorUpdateProcessorFactory">
    <str name="inputField">_text_</str>
    <str name="outputField">vector</str>
    <str name="model">dummy-1</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
Once it’s finished, you re-index all your documents targeting the second chain.
The effect you’ll see is that vectors are added incrementally, while your documents become searchable lexically in a shorter amount of time.
Internally, Solr re-indexes everything, so there’s a lot of data traffic (data is sent again to Solr), a lot of CPU waste (text is processed again and data structures are rebuilt), a lot of deletions behind the scenes (each updated document is deleted and added again), and a lot of segment merges (as new segments are created).
Be careful with this approach, as it affects the number of segments, deleted docs, and merges that happen behind the scenes in Solr. For small-scale systems this may be negligible, but as you scale up, this downside can become extremely significant.
Partial Updates
A slightly better solution is to use partial updates: you avoid sending the full document again to Solr when you want to add vectors.
This time the chains look slightly different: the ‘no-vectorisation’ chain looks exactly the same, and you first index all your documents in a traditional way, executing all the update request processors you like.
Then, when you want to run the second pass to add the vectors, you target a new chain that only contains the text-to-vector processor (+ the mandatory ones).
<updateRequestProcessorChain name="vectorisation">
  <processor class="solr.llm.textvectorisation.update.processor.TextToVectorUpdateProcessorFactory">
    <str name="inputField">_text_</str>
    <str name="outputField">vector</str>
    <str name="model">dummy-1</str>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
At this point, you can define a boolean field in your schema that keeps track of which documents have been vectorised already, and just partially update that field, targeting the vectorisation chain. Re-indexing all your docs then becomes sending, for each document, a partial update such as: {"id":"mydoc","vectorised":{"set":true}}
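A sketch of what such a request might look like, assuming the chain above is named ‘vectorisation’ and a boolean ‘vectorised’ field exists in the schema (the update.chain parameter selects the chain per request):

curl -X POST 'http://localhost:8983/solr/techproducts/update?update.chain=vectorisation&commit=true' -H 'Content-type:application/json' -d '[{"id":"mydoc","vectorised":{"set":true}}]'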
Each partial update just sets the new field to ‘true’, with no networking waste. Solr sets the field and runs the vectorisation chain, which takes the field to be vectorised as input and encodes it for each document.
The benefits in comparison to the naive solution are still limited, as all the deletes and inserts happen anyway, but at least we avoid transferring the full documents over the network.
A big nice-to-have here would be an equivalent of in-place updates for vectors.
Interested in helping or sponsoring new Solr features like in place vectorisation? Reach out to us!
Query Time
Once you have your vectors indexed, running a natural language query is quite simple with the ‘org.apache.solr.llm.texttovector.search.TextToVectorQParserPlugin’.
First you define it in your solrconfig.xml:
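A minimal sketch of such a definition (the class name follows the plugin named above; the parser name must match the one you use in queries):

<queryParser name="knn_text_to_vector" class="org.apache.solr.llm.texttovector.search.TextToVectorQParserPlugin"/>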
Then, at query time, you just use it as you would for the knn query parser, but rather than passing the vector, you pass the natural language query and a reference to the model you want to use for vectorisation.
?q={!knn_text_to_vector model=modelName f=vectorField topK=10}a query
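For example, a full request might look like this (the model and field names refer to the illustrative ‘dummy-1’ model and ‘vector’ field used earlier; --data-urlencode takes care of encoding the local params):

curl 'http://localhost:8983/solr/techproducts/select' --data-urlencode 'q={!knn_text_to_vector model=dummy-1 f=vector topK=10}memory card for cameras'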
Behind the scenes, Solr will use the model to vectorise your query text and then run a vector search.
The text of your query is sent to a remote hosted model. Be careful with your performance and privacy requirements!
What's next?
There’s plenty of future work we want to contribute to enhance this new, exciting functionality:
- Local models (to run on your local machine or potentially in the Solr JVM)
- In-place vectorisation updates
- Vector search optimisations
- Retrieval augmented generation
- Conversational search (including short/long-term memory)
- Query/document expansion
- LLM highlighter/explainer
- From natural language to structured queries
- Better hybrid search