The AI side of the Vespa Search Engine

Hi readers!
In this blog post, we are going to explore the vector search features available in Vespa 8.626.55 (12th January 2026).

Everything written here is inspired by Alessandro Benedetti’s book: How Large Language Models Can Help Your Search Project.
We also carried out an additional study to cover the features introduced between versions 8.530.11 and 8.626.55.

Vespa

First of all, what is Vespa?

The Vespa project is a search engine (written in Java and C++ and exposing REST APIs) that enables organisations to efficiently manage and analyse large, evolving datasets using a combination of vector/traditional search, structured data handling, and machine-learned model inference.

LLM-Based Features

At the time of this analysis, Vespa supports these LLM-based features:

  • Text and multimodal vectorisation (both index and query time)
  • Vector search capabilities (quantisation and GPU acceleration)
    • Exact and Approximate Nearest Neighbor
    • Binary Quantization
    • Filtering and Post-Filtering
  • Multi-Vector Search
  • Learned Sparse Retrieval
    • SPLADE
    • For both the tasks of retrieval and re-ranking
  • Cross-Encoders and Late Interaction Models Reranking
    • ColBERT
  • Possibility to use Local, Remote LLMs and also models on private HuggingFace hubs
  • Hybrid search through linear score combinations and fusion techniques
  • Retrieval Augmented Generation
  • Document Enrichment through LLMs
    • Named Entity Recognition
    • Categorisation and Tagging
    • Generation of relevant keywords, queries and questions
    • Translation
    • LLM Chunking
  • ACORN-1 technique
  • Adaptive Beam Search strategy

Let’s go into a bit of detail about all these features!

Text/Multi-Modal Vectorisation

Vespa can interact with locally or remotely hosted LLMs fine-tuned for vector embedding generation to provide users with a transparent semantic search experience.

This applies to both text and multi-modal content, so images, audio and video can also be encoded as vectors.

Vespa can host certain families of models directly in process through ONNX files (be aware that this is computationally expensive) or interact with external models. This means you can rely on a set of embedders explicitly supported that can run in the Vespa process or create a customised embedder implementation that calls models external to Vespa (potentially self-hosted locally or hosted remotely as a service).

What you need is:

  1. Define an embedder component which specifies the model to use for generating vectors.
    This needs to be added to the services.xml file.
  2. Define a vector field for hosting the embeddings. Here, in the indexing pipeline, you also define which embedder to use to populate this field.
    This needs to be added to the schema.
  3. Call the same embedder at query time to convert the query into a vector as well, which is what makes vector search possible.
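The three steps above can be sketched in a few lines of Python. This is only an illustration of the flow: `toy_embed` is a hypothetical stand-in for a real embedding model (a hashed bag of words), and the `documents` and `index` structures are invented for the example. In Vespa, the embedder is declared in services.xml and wired into the schema rather than called like this.

```python
import math
import zlib

def toy_embed(text, dims=8):
    """Hypothetical stand-in for a real embedding model:
    a hashed bag of words, L2-normalised."""
    vec = [0.0] * dims
    for token in text.lower().split():
        bucket = zlib.crc32(token.encode("utf-8")) % dims
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a, b):
    """Cosine similarity of two already-normalised vectors."""
    return sum(x * y for x, y in zip(a, b))

# Steps 1-2: at indexing time the embedder populates each document's vector field.
documents = {"d1": "statin drugs and cholesterol", "d2": "hiking trails in the alps"}
index = {doc_id: toy_embed(text) for doc_id, text in documents.items()}

# Step 3: at query time the same embedder converts the query into a vector.
def search(query):
    q = toy_embed(query)
    return sorted(index, key=lambda doc_id: cosine(q, index[doc_id]), reverse=True)

print(search("cholesterol drugs"))
```

The point of the sketch is that documents and queries must go through the same embedder, otherwise their vectors do not live in the same space and cannot be compared.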

For more details about the implementation, you can have a look at the Vespa documentation on embeddings.

Vector Search

Vespa implements vector search through both exact nearest neighbor (very expensive) and approximate nearest neighbor (using a Hierarchical Navigable Small World graph under the hood).
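To see why exact search is expensive, here is a minimal brute-force sketch in Python. It is not Vespa code: the toy corpus and function names are made up for the example. Every stored vector is compared with the query, so the cost grows linearly with the number of documents and with the vector dimensionality.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def exact_nearest_neighbors(query, vectors, k):
    """Brute-force search: score every stored vector, keep the k closest."""
    scored = sorted((euclidean(query, v), doc_id) for doc_id, v in vectors.items())
    return [doc_id for _, doc_id in scored[:k]]

# Hypothetical toy corpus of 2-dimensional embeddings.
corpus = {"a": [0.0, 0.0], "b": [1.0, 1.0], "c": [5.0, 5.0]}
print(exact_nearest_neighbors([0.9, 1.2], corpus, k=2))  # ['b', 'a']
```

An approximate index such as HNSW avoids the full scan by visiting only a small part of a neighbor graph, trading a little recall for much lower latency.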

In terms of vector data types, Vespa supports 64-bit double, 32-bit float, 16-bit float and 8-bit signed integer.

Vespa supports vector quantisation. In general, the level of quantisation is a trade-off between the accuracy of the nearest neighbor search and the memory footprint of the vectors. Out of the box, Vespa supports only naive binary quantisation, through 'converters'.
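As a rough illustration of what naive binary quantisation means, the sketch below keeps only the sign of each component and compares the resulting bit strings with the Hamming distance. The helper names and sample vectors are chosen for this example and are not Vespa's converters.

```python
def binarize(vector):
    """Naive binary quantisation: keep one bit per component (1 if positive),
    packed into a single integer."""
    bits = 0
    for value in vector:
        bits = (bits << 1) | (1 if value > 0 else 0)
    return bits

def hamming(a, b):
    """Number of differing bits between two packed binary vectors."""
    return bin(a ^ b).count("1")

# Two hypothetical 8-dimensional float embeddings.
full = [0.12, -0.40, 0.93, -0.05, 0.31, 0.77, -0.68, 0.02]
other = [0.10, 0.35, 0.88, -0.12, -0.29, 0.64, -0.71, 0.05]

print(format(binarize(full), "08b"))               # 10101101
print(hamming(binarize(full), binarize(other)))    # 2
```

Each 32-bit float component shrinks to a single bit, so the memory footprint drops by a factor of 32, at the price of losing all magnitude information.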

Vespa also supports filtering and post-filtering to combine vectors with lexical search and hybrid search.

What you need is:

  1. Define a vector field for hosting the embeddings. You mostly need to specify the vector element type (e.g. 32-bit float), the vector dimensionality and the distance metric to use at indexing time to build the HNSW graph.
    This needs to be added to the schema.
  2. Define a rank profile for nearest neighbor search and then run the query calling that profile.

For more details about the implementation, you can have a look at the Vespa documentation on vector search.

Multi-Vector Search

For vector search on large text, it is very common to split long documents into chunks, vectorise chunks and then perform vector search. The idea is to retrieve the best chunks and then return the original parent document.

Vespa supports this, allowing the user to store multiple vectors per field in the document.

What you need is:

  1. Define a multi-valued vector field that hosts more than one vector per document (one vector for each chunk).
    This needs to be added to the schema.
  2. Define a rank profile for nearest neighbor search and then run the query calling that profile. Here, the match-features parameter can be added to return the label of the closest vector and, therefore, the best-matching paragraph.
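Conceptually, multi-vector search scores a document by its closest chunk and remembers which chunk that was. Here is a small sketch of the idea in plain Python. The data layout (a dictionary of chunk labels per document) and the function names are assumptions made for the example, not Vespa's internal representation.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def best_chunk_per_document(query, documents):
    """For each parent document, keep its closest chunk vector and that chunk's label.

    Returns (doc_id, chunk_label, distance) tuples sorted by distance.
    """
    results = []
    for doc_id, chunks in documents.items():
        label, distance = min(
            ((label, euclidean(query, vec)) for label, vec in chunks.items()),
            key=lambda pair: pair[1],
        )
        results.append((doc_id, label, distance))
    return sorted(results, key=lambda r: r[2])

# Hypothetical documents, each with one vector per paragraph.
docs = {
    "doc1": {"p0": [0.0, 1.0], "p1": [1.0, 0.0]},
    "doc2": {"p0": [0.9, 0.9], "p1": [-1.0, -1.0]},
}
ranked = best_chunk_per_document([1.0, 0.1], docs)
print([(doc_id, label) for doc_id, label, _ in ranked])  # [('doc1', 'p1'), ('doc2', 'p0')]
```

The parent document is what gets returned, while the chunk label plays the role of the "closest vector" information mentioned above.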

For more details about the implementation, you can have a look at the Vespa documentation on working with chunks.

Learned Sparse Retrieval

Vespa supports SPLADE models, enabling learned sparse retrieval for efficient and effective search.

These models represent queries and documents in a high-dimensional sparse space, allowing Vespa to perform both retrieval and re-ranking with high precision. By leveraging SPLADE, Vespa combines the advantages of neural embeddings with the efficiency of inverted index-based search, making it suitable for large-scale information retrieval applications.

What you need is:

  1. Define an embedder component which specifies the model to use for generating vectors (in this case, a SPLADE model).
    This needs to be added to the services.xml file.
  2. Define a weightedset field for hosting the sparse representation.
    This needs to be added to the schema.
  3. Make a query using the wand operator for top-k retrieval.
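The scoring behind this kind of retrieval is a sparse dot product between the query's token weights and the document's token weights. The sketch below computes it by brute force on a hypothetical toy corpus. It is not the WAND algorithm itself: WAND-style operators aim to return the same top-k while skipping documents that cannot make it into the result.

```python
import heapq

def sparse_dot(query_weights, doc_weights):
    """Dot product over the tokens shared by the query and the document."""
    return sum(w * doc_weights[t] for t, w in query_weights.items() if t in doc_weights)

def top_k(query_weights, docs, k):
    """Brute-force top-k by sparse dot product, ignoring documents with no overlap."""
    scored = ((sparse_dot(query_weights, weights), doc_id) for doc_id, weights in docs.items())
    return [doc_id for score, doc_id in heapq.nlargest(k, scored) if score > 0]

# Hypothetical sparse representations: token -> integer weight.
query = {"statin": 3, "cholesterol": 2, "cancer": 2}
docs = {
    "d1": {"statin": 2, "drug": 1},
    "d2": {"cholesterol": 3, "cancer": 1, "diet": 2},
    "d3": {"weather": 5},
}
print(top_k(query, docs, k=2))  # ['d2', 'd1']
```

In a real SPLADE setup the weights come from the model, which also expands queries and documents with related tokens that do not appear in the original text.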

For more details about the implementation, you can have a look at this Vespa blog post and the Vespa documentation on SPLADE models.

Cross-Encoders and Late Interaction Models for Reranking

Vespa supports cross-encoders through the Open Neural Network Exchange (ONNX) format. Each model depends on a specific tokeniser, so the tokeniser must also be passed to Vespa to keep the two aligned.

An interesting optimisation offered by Vespa is the possibility of pre-tokenising the documents and the query, so that this step can be skipped when the similarity score is computed and performance improves. Normally, in cross-encoding, tokenisation is repeated for each query-document pair individually.

On the other hand, late interaction models encode both the query and the documents into a multi-vector representation. They tend to be less expensive than cross-encoders but still more costly than running approximate nearest neighbor on bi-encoders’ vectorised text. For this reason, they are generally employed as rerankers as well.

What you need is:

  1. Define an embedder component which specifies the model to use for generating vectors (in this case, a ColBERT model).
    This needs to be added to the services.xml file.
  2. Define a vector field for hosting the embeddings. The cardinality of the tensor is the number of tokens, while each vector has a dimensionality that depends on the model. Here, you also define which embedder to use to populate this field.
    This needs to be added to the schema.
  3. Define a rank profile for max similarity implementation and then run the query calling that profile.
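The "max similarity" (MaxSim) scoring used by late interaction models such as ColBERT can be sketched as follows: for every query token vector, take the best-matching document token vector, then sum those maxima. The token vectors below are hypothetical 2-dimensional values chosen to keep the arithmetic easy to follow. In Vespa, this computation lives in the rank profile rather than in application code.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def max_sim(query_tokens, doc_tokens):
    """Late-interaction score: sum over query tokens of the maximum
    similarity with any document token."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)

# One vector per token (hypothetical values).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.5, 0.5]]
doc_b = [[0.0, 1.0], [0.0, 0.5]]

print(max_sim(query, doc_a))  # 1.5
print(max_sim(query, doc_b))  # 1.0
```

The cost grows with the number of query tokens times the number of document tokens, which is why this kind of scoring is usually applied to a short list of candidates as a reranking step.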

For more details about the implementation, you can have a look at the Vespa documentation on Cross Encoders.

Local or Remote LLMs and Private Model Hubs

Vespa supports the integration with general-purpose LLMs directly in the process or through external services (self-hosted or remote APIs).

Running LLMs locally offers various advantages, particularly in terms of data security and privacy; sensitive information remains within the confines of the application or network, eliminating the risks associated with sharing delicate data with external services.

This is also the reason why Vespa has lately introduced support for models from private HuggingFace hubs.

For local and remote models, what you need is:

  1. Define an inference engine. For local models, Vespa internally uses llama.cpp, so all models supported by llama.cpp are compatible. For remote models, what matters is compatibility with the API rather than with the specific LLM service, that is, compatible input/output.
    This needs to be added to the services.xml file.

For more details about the implementation, you can have a look at the Vespa documentation on running local LLMs.

Hybrid Search

Hybrid search is a technique that combines different retrieval methods to improve search quality.
A typical way of implementing it starts with the retrieval of two sets of candidates:

  • One set of results comes from lexical matches with the query keywords 
  • A second set of results comes from the K-Nearest Neighbors search with the query vector

Then these results must be combined and presented in a ranking that maximises the relevance for the user query.
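One common way to combine two ranked lists without having to normalise their scores is Reciprocal Rank Fusion (RRF). The sketch below is a generic Python illustration of that technique, not Vespa's implementation. The constant `k=60` is the value commonly used in the literature, and the document ids are invented.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists: each document earns 1 / (k + rank) per list
    it appears in, and documents are sorted by their total."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending fused score, breaking ties by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))

lexical = ["d1", "d2", "d3"]   # hypothetical lexical results, best first
vector = ["d3", "d1", "d4"]    # hypothetical nearest neighbor results, best first
print(reciprocal_rank_fusion([lexical, vector]))  # ['d1', 'd3', 'd2', 'd4']
```

Documents that appear in both lists (d1 and d3) rise to the top, which is usually the behaviour we want from a hybrid ranking.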

Vespa also supports this strategy since it allows us to:

  1. Define the query to execute, which in this case will contain all the retrieval strategies we want to use (lexical + vector).
  2. Explicitly define the formula to compute the relevance of the retrieved documents.

This is a hybrid query example, as shown in the Vespa tutorial:

				
    vespa query \
    'yql=select * from doc where ({targetHits:10}userInput(@user-query)) or ({targetHits:10}nearestNeighbor(embedding,e))' \
    'user-query=Do Cholesterol Statin Drugs Cause Breast Cancer?' \
    'input.query(e)=embed(@user-query)' \
    'hits=1' \
    'language=en' \
    'ranking=hybrid'

Here we can see that we have:

  1. A lexical query: {targetHits:10}userInput(@user-query) from which we take the top 10 documents.
  2. A vector query: {targetHits:10}nearestNeighbor(embedding,e) from which we take the top 10 documents.
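If you prefer to call the HTTP query API instead of the CLI, the same parameters can be sent as a query string. The sketch below only builds the URL. The base URL `http://localhost:8080/search/` is an assumption about a default local deployment, and `build_hybrid_query_url` is a helper name invented for this example.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

def build_hybrid_query_url(user_query, base_url="http://localhost:8080/search/"):
    """Build a query URL carrying the same parameters as the CLI example."""
    params = {
        "yql": "select * from doc where ({targetHits:10}userInput(@user-query)) "
               "or ({targetHits:10}nearestNeighbor(embedding,e))",
        "user-query": user_query,
        "input.query(e)": "embed(@user-query)",
        "hits": 1,
        "language": "en",
        "ranking": "hybrid",
    }
    return base_url + "?" + urlencode(params)

url = build_hybrid_query_url("Do Cholesterol Statin Drugs Cause Breast Cancer?")
# Decode the query string again to check what would be sent.
print(parse_qs(urlsplit(url).query)["ranking"])  # ['hybrid']
```

`urlencode` takes care of escaping the braces, parentheses and spaces in the YQL string, which is easy to get wrong by hand.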

Then we can define how to rank these documents, therefore how to compute their relevance. For example:

				
    expression: closeness(field, embedding) * (1 + (bm25(title) + bm25(text)))

So, for the candidate documents retrieved by the two operators (about 20, fewer when the same document is returned by both), the relevance is the combination of the closeness between the query vector and the document vector, and the bm25 value on the title and the text fields.
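To get a feeling for how this formula behaves, we can evaluate it by hand on a few made-up feature values. The numbers below are hypothetical and only meant to show the shape of the expression, not what Vespa would return for a real document.

```python
def hybrid_relevance(closeness, bm25_title, bm25_text):
    """Mirror of the rank expression:
    closeness * (1 + (bm25(title) + bm25(text)))."""
    return closeness * (1 + (bm25_title + bm25_text))

# Hypothetical feature values for three candidates.
candidates = {
    "zero_closeness": hybrid_relevance(closeness=0.0, bm25_title=4.0, bm25_text=6.0),
    "no_lexical_match": hybrid_relevance(closeness=0.5, bm25_title=0.0, bm25_text=0.0),
    "both_signals": hybrid_relevance(closeness=0.5, bm25_title=4.0, bm25_text=6.0),
}
print(candidates)
# {'zero_closeness': 0.0, 'no_lexical_match': 0.5, 'both_signals': 5.5}
```

Because the formula is multiplicative, closeness acts as the base score and bm25 as a boost: a candidate with no lexical match keeps its closeness, while a candidate whose closeness is 0 scores 0 whatever its bm25 values are.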

Retrieval Augmented Generation

Retrieval-Augmented Generation (RAG) is a method that combines information retrieval with text generation. Instead of generating responses solely from the model’s internal knowledge, a RAG system first retrieves relevant documents or data from an external knowledge source and then uses that information to generate more accurate, context-aware responses.

In Vespa, this is supported through an RAGSearcher that first performs the query as specified by the user, creates a prompt based on the results, and queries the language model to generate a response.

What you need is:

  1. Specify the LLM connection (which model you want to use for the generation) and the RAGSearcher (the search chain that will execute as described before).
    This needs to be added to the services.xml file.
  2. Create a query that passes the prompt, the context and the searchChain to use (the searchChain name is defined in the RAGSearcher).
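The flow described above (run the query, build a prompt from the results, ask the language model) can be sketched in plain Python. Everything here is hypothetical: `fake_search` and `fake_generate` stand in for the real search chain and LLM connection, and the prompt template is just one possible layout, not the one used by Vespa.

```python
def build_prompt(question, hits):
    """Assemble a prompt from the retrieved hits followed by the question."""
    context = "\n".join(f"- {hit}" for hit in hits)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

def rag_answer(question, search, generate, top_k=3):
    """Retrieve, build the prompt, then ask the language model."""
    hits = search(question)[:top_k]
    return generate(build_prompt(question, hits))

# Stand-ins for the real search chain and LLM client (both hypothetical).
def fake_search(query):
    return ["Statins lower LDL cholesterol.", "Unrelated passage."]

def fake_generate(prompt):
    return f"(model output for a prompt of {len(prompt)} characters)"

print(build_prompt("Do statins lower cholesterol?", fake_search("statins")))
print(rag_answer("Do statins lower cholesterol?", fake_search, fake_generate))
```

Keeping retrieval and generation behind two plain functions makes it easy to swap either side, which mirrors the separation between the search chain and the LLM connection in the configuration.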

For more details about the implementation, you can have a look at the Vespa documentation on RAG search.

Document Enrichment

Document enrichment enables the automatic generation of document field values using LLMs or custom code during feeding.

Examples of enrichment tasks include:

  • Named entity recognition (e.g. extracting people, organisations, locations…).
  • Categorisation and tagging (e.g., sentiment and topic analysis) to be later used for filtering and faceting.
  • Generation of relevant keywords, queries, and questions for document expansion.
  • Translation of content for multilingual search

What you need is:

  1. Define the generator component with the prompt to pass to the specified LLM.
    This needs to be added to the services.xml file.
  2. Define a field storing the generation results. Here, in the indexing pipeline, you call the generator component used to populate this field.
    This needs to be added to the schema.

For more details about the implementation, you can have a look at the Vespa documentation on Document Enrichment.

ACORN-1

Vespa implements a new technique called ACORN. This is a novel approach that aims to obtain better results for Filtered Nearest Neighbor Search, that is, searches that apply both filtering and vector search.

The idea behind the algorithm is to:

  1. Apply the filter first to avoid computing vector distances on nodes that do not satisfy the query time condition.
  2. Compute the nearest neighbor only on the remaining nodes.

To maintain connectivity between nodes inside the filtered HNSW graph and allow the user to obtain results, ACORN:

  1. Increases the number of connections between nodes, creating a denser graph at indexing time.
  2. Considers 2-hop neighbors during the graph traversal at query time.
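To make the 2-hop idea more concrete, here is a deliberately simplified sketch on a tiny hand-made graph. It is not Vespa's implementation: there is no beam and no early termination, and the graph, vectors and filter are invented. It only shows how skipping over a filtered-out node to reach its neighbors keeps the filtered graph connected, without ever computing a distance for the node that fails the filter.

```python
import heapq

def filtered_neighbors(graph, node, passes_filter):
    """Neighbors that satisfy the filter; a neighbor that fails it is replaced
    by its own neighbors (2-hop) that satisfy the filter."""
    result = []
    for n in graph[node]:
        if passes_filter(n):
            result.append(n)
        else:
            result.extend(m for m in graph[n] if m != node and passes_filter(m))
    return list(dict.fromkeys(result))  # drop duplicates, keep order

def filter_first_search(graph, vectors, query, entry, passes_filter, k):
    """Best-first traversal that computes distances only for nodes passing the filter."""
    def dist(n):
        return sum((a - b) ** 2 for a, b in zip(vectors[n], query))

    start = [entry] if passes_filter(entry) else filtered_neighbors(graph, entry, passes_filter)
    visited = set(start)
    frontier = [(dist(n), n) for n in start]
    heapq.heapify(frontier)
    found = []
    while frontier:
        d, node = heapq.heappop(frontier)
        found.append((d, node))
        for n in filtered_neighbors(graph, node, passes_filter):
            if n not in visited:
                visited.add(n)
                heapq.heappush(frontier, (dist(n), n))
    return [node for _, node in sorted(found)[:k]]

# A chain 0 - 1 - 2 - 3 where node 1 does not satisfy the filter.
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
vectors = {0: [0.0], 1: [1.0], 2: [2.0], 3: [3.0]}
allowed = {0, 2, 3}

print(filter_first_search(graph, vectors, [2.9], 0, lambda n: n in allowed, k=2))  # [3, 2]
```

Without the 2-hop expansion, the search starting at node 0 would stop immediately, because its only direct neighbor is filtered out.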

As explained in the related Vespa blog post, to make use of this new search strategy for HNSW in Vespa, one has to adjust the rank profile parameter/query API parameter:

  • filter-first-threshold/ranking.matching.filterFirstThreshold – default 0.00.

This strategy is used when the hit ratio of the given filter is below filter-first-threshold, so with the default of 0.00 it is disabled for now. According to the same blog post, the Vespa team plans to adjust the default values of these parameters at some point in the future, so that ACORN-1 is enabled by default and the fallback to an exact search occurs later.
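The rule can be written down as a one-line check. The function below is only a restatement of the sentence above in Python, not Vespa code, and the threshold of 0.3 used in the example is an arbitrary value picked for illustration.

```python
def use_filter_first(hit_ratio, filter_first_threshold=0.0):
    """True when the filter-first strategy would apply, i.e. the fraction of
    documents passing the filter is below the configured threshold."""
    return hit_ratio < filter_first_threshold

# With the default threshold of 0.0 the strategy never kicks in,
# because a hit ratio cannot be negative.
print(use_filter_first(0.05))        # False
# With a hypothetical threshold of 0.3, restrictive filters use filter-first.
print(use_filter_first(0.05, 0.3))   # True
print(use_filter_first(0.50, 0.3))   # False
```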

Adaptive Beam Search

A further addition for improving recall during vector search is the implementation of the Adaptive Beam Search strategy.
This makes it possible to increase the number of retrieved results based on a distance threshold from the query vector, instead of simply increasing the number of target hits, reducing the chance of getting stuck in a local minimum of the HNSW graph.

In Vespa this behavior is exposed through the exploration-slack rank profile parameter (or the equivalent query API parameter):

  • exploration-slack/ranking.matching.explorationSlack – default 0.00.
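The sketch below gives an intuition for what a slack value does. It is a simplified reading of the idea and not Vespa's implementation: the termination rule (keep exploring while the closest unexplored candidate is within `(1 + slack)` times the current k-th best distance) is an assumption made for this example, and the graph is a toy one built so that the search gets stuck without slack.

```python
import heapq

def beam_search(graph, vectors, query, entry, k, slack=0.0):
    """Best-first graph search that keeps exploring while the closest unexplored
    candidate is within (1 + slack) times the current k-th best distance."""
    def dist(n):
        return sum((a - b) ** 2 for a, b in zip(vectors[n], query)) ** 0.5

    visited = {entry}
    candidates = [(dist(entry), entry)]  # min-heap of nodes still to expand
    best = []                            # max-heap (negated distances) of the k closest
    while candidates:
        d, node = heapq.heappop(candidates)
        if len(best) == k and d > (1.0 + slack) * -best[0][0]:
            break  # nothing left within the slack-extended radius
        heapq.heappush(best, (-d, node))
        if len(best) > k:
            heapq.heappop(best)  # drop the farthest of the k + 1 results
        for n in graph[node]:
            if n not in visited:
                visited.add(n)
                heapq.heappush(candidates, (dist(n), n))
    return [n for _, n in sorted((-neg_d, n) for neg_d, n in best)]

# Toy chain E - A - B - C: B is slightly farther from the query than A,
# but it is the only way to reach C, the true nearest neighbor.
graph = {"E": ["A"], "A": ["E", "B"], "B": ["A", "C"], "C": ["B"]}
vectors = {"E": [5.0], "A": [3.0], "B": [3.5], "C": [1.0]}

print(beam_search(graph, vectors, [0.0], "E", k=1))             # ['A']
print(beam_search(graph, vectors, [0.0], "E", k=1, slack=0.2))  # ['C']
```

With no slack, the search stops at A because its neighbor B looks worse. A slack of 0.2 lets the search step through B and find C, at the cost of exploring more nodes.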

Need Help With This Topic?

If you're struggling with AI features in Vespa, don't worry - we're here to help! Our team offers expert services and training to help you optimise your Vespa search engine and get the most out of your system. Contact us today to learn more!

We are Sease, an Information Retrieval Company based in London, focused on providing R&D project guidance and implementation, Search consulting services, Training, and Search solutions using open source software like Apache Lucene/Solr, Elasticsearch, OpenSearch and Vespa.
