Modern information retrieval systems face a trade-off between accuracy and efficiency. On the one hand, traditional methods like BM25 are fast and scalable, but they lack deep semantic understanding. On the other hand, cross-encoder models achieve high accuracy by jointly encoding queries and documents, but at a high computational cost, making them impractical for real-time retrieval at scale.
This is where ColBERT (short for Contextualised Late Interaction over BERT) steps in. The ColBERT paper introduced a model that leverages deep language models, specifically BERT, for efficient information retrieval, using a novel late interaction mechanism to balance cost and quality.
In this blog post, we will examine how ColBERT works, how its successors have improved it, and what real-world applications of ColBERT are currently available.
Outline:
- ColBERT Architecture
  - How ColBERT Works
  - Query and Document Encoders
  - MaxSim Operation
- ColBERT Model Family
  - ColBERT-v2
  - ColBERTer
  - ColBERT-XM
  - Jina-ColBERT-v2
- ColBERT in Industrial Applications
ColBERT Architecture
How ColBERT Works
ColBERT proposes a unique approach to computing relevance between a query and a document. Instead of jointly encoding them (like cross-encoders do), ColBERT encodes them separately using BERT, and matches them later using a fast, token-level similarity operation called MaxSim.
Each query or document is encoded into a set of contextual embeddings, one vector per token. At retrieval time, ColBERT calculates the dot product between every query token vector and every document token vector; since the vectors are L2-normalised, this is equivalent to cosine similarity. For each query token, it keeps only the maximum similarity score across all document tokens (MaxSim) and sums these scores to get the final relevance score.
This technique is not only effective but also highly parallelisable and GPU-friendly, so it works well for large-scale search.
The following diagram illustrates the high-level architecture of ColBERT, as explained above.
Query and Document Encoders
ColBERT uses the same BERT backbone for both the query and document encoders, but it conditions the inputs with a role marker, [Q] for queries and [D] for documents, alongside the usual [CLS] token. Queries are tokenised and padded/truncated to a fixed budget (commonly 32 tokens). For documents, uninformative tokens such as punctuation can optionally be pruned at indexing time: after tokenisation, they are identified and their corresponding vectors dropped so they do not consume space. After BERT, a linear projection reduces the per-token dimensionality (e.g., 768→128) to save storage and speed up inference, and all output vectors are L2-normalised so that dot products are equivalent to cosine similarity.
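To make this concrete, here is a minimal sketch of ColBERT-style encoding in PyTorch with the Hugging Face transformers library. Note the assumptions: it uses a vanilla bert-base-uncased checkpoint, an untrained projection layer, and plain-text [Q]/[D] markers. A real ColBERT checkpoint registers the markers as special tokens, trains the projection end to end, pads queries with [MASK] tokens (query augmentation), and prunes punctuation vectors from documents.

```python
# Minimal sketch of ColBERT-style encoding. Assumes a vanilla BERT
# checkpoint and an untrained projection; real ColBERT trains both.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
projection = torch.nn.Linear(768, 128, bias=False)  # 768 -> 128 reduction

def encode(text: str, is_query: bool) -> torch.Tensor:
    # Prepend the role marker; [CLS] is added automatically by the tokenizer.
    marker = "[Q] " if is_query else "[D] "
    enc = tokenizer(
        marker + text,
        return_tensors="pt",
        max_length=32 if is_query else 512,   # fixed 32-token query budget
        padding="max_length" if is_query else False,
        truncation=True,
    )
    hidden = bert(**enc).last_hidden_state    # (1, seq_len, 768)
    vecs = projection(hidden)                 # (1, seq_len, 128)
    # L2-normalise so dot products equal cosine similarities.
    return torch.nn.functional.normalize(vecs, dim=-1).squeeze(0)

query_vecs = encode("what is late interaction?", is_query=True)
doc_vecs = encode("ColBERT encodes queries and documents separately.", is_query=False)
print(query_vecs.shape, doc_vecs.shape)  # (32, 128) and (num_tokens, 128)
```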
MaxSim Operation
After encoding, ColBERT computes the relevance score between a query and a document using the Late Interaction mechanism. This involves:
- Creating a similarity matrix of dot products between every query and document token.
- Selecting, for each query token, the maximum similarity across all document tokens (MaxSim).
- Summing these maximum scores to produce a single relevance score.
To formalise this, let the query $Q$ be encoded into token embeddings $\{q_1, \dots, q_n\}$ and the document $D$ into token embeddings $\{d_1, \dots, d_m\}$. The MaxSim operation then gives the relevance score:

$$S(Q, D) = \sum_{i=1}^{n} \max_{j \in \{1, \dots, m\}} q_i \cdot d_j$$
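To illustrate, the following sketch computes the MaxSim score for token matrices that are already encoded and L2-normalised; the random tensors here are stand-ins for real encoder output.

```python
# MaxSim scoring sketch: Q and D are L2-normalised token-embedding matrices.
import torch

def maxsim_score(Q: torch.Tensor, D: torch.Tensor) -> torch.Tensor:
    """Q: (n, dim) query token vectors; D: (m, dim) document token vectors."""
    sim = Q @ D.T                           # (n, m) matrix of dot products
    max_per_query = sim.max(dim=1).values   # MaxSim: best match per query token
    return max_per_query.sum()              # sum over query tokens

# Toy usage: a 3-token query against a 5-token document in 128 dimensions.
Q = torch.nn.functional.normalize(torch.randn(3, 128), dim=-1)
D = torch.nn.functional.normalize(torch.randn(5, 128), dim=-1)
print(maxsim_score(Q, D))
```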
This ColBERT architecture enables fast, scalable retrieval with improved accuracy compared to bi-encoders and significantly faster inference than cross-encoders.
ColBERT Model Family
The initial ColBERT approach is promising, but it has a major drawback: storing one vector per token yields very large indexes, which is a serious challenge in terms of storage and memory. Major variants of ColBERT have therefore been introduced to improve on this. In this context, we refer to them as the ColBERT model family, as they all share a common architectural foundation.
ColBERT-v2
ColBERT-v2 (paper, 2021) is an improved successor to the original ColBERT (paper, 2020). It improves quality through knowledge distillation from a stronger teacher model, and tackles efficiency through a residual compression technique for token embeddings that substantially reduces index size.
ColBERTer
ColBERTer (paper, 2022) is an enhanced, storage-reduced variant of ColBERT (paper, 2020). It improves interpretability by aggregating subword pieces into whole-word vectors (BOW2) and cuts index storage by reducing both the number and the size of token vectors. It also unifies fast single-vector retrieval with multi-vector late-interaction refinement via explicit multi-task training.
ColBERT-XM
ColBERT-XM (paper, 2024) is a multilingual late-interaction retriever that plugs XMOD (cross-lingual, modular language model) adapters into ColBERT. Despite English-only fine-tuning, it performs zero-shot retrieval in many languages and maintains a small index via compressed representations.
Jina-ColBERT-v2
Jina-ColBERT-v2 (paper, 2024) was introduced to support multilingual and long-context late interaction retrieval. It uses the same ColBERT architecture with a different backbone model, an enhanced XLM-RoBERTa (Jina-XLM-RoBERTa). Jina-ColBERT-v2 outperforms ColBERT-v2 on multilingual information retrieval tasks, as demonstrated on heterogeneous benchmarks such as BEIR, while maintaining fast inference and significantly lower storage requirements.
ColBERT in Industrial Applications
We have covered ColBERT conceptually; now let's see how to use it. In a nutshell, ColBERT is an embedding model that produces a multi-vector representation of input text, generating one vector per token. This captures more of the semantic nuance of the input text than dense embedding models, which represent an entire input text with a single vector.
The original ColBERT (v1 and v2) models and tooling were developed at Stanford; the reference implementations can be found in the stanford-futuredata/ColBERT repository. One of the easiest ways to use ColBERT in applications is via the RAGatouille library. It wraps ColBERT with simple APIs for indexing and search, and supports training/fine-tuning, which is useful for general retrieval and RAG pipelines. RAGatouille supports both fine-tuning an existing ColBERT model and building a new ColBERT model from a BERT/RoBERTa-like base.
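As a quick illustration, here is a minimal indexing-and-search sketch with RAGatouille (assuming `pip install ragatouille`; the colbert-ir/colbertv2.0 checkpoint is downloaded on first use, and the toy collection below is our own):

```python
from ragatouille import RAGPretrainedModel

# Load a pre-trained ColBERT checkpoint.
RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

# Build an index over a (toy) document collection.
RAG.index(
    collection=[
        "ColBERT encodes queries and documents into per-token vectors.",
        "BM25 is a classic lexical ranking function.",
    ],
    index_name="colbert_demo",
)

# Search the index.
results = RAG.search(query="How does late interaction retrieval work?", k=2)
for r in results:
    print(r["score"], r["content"])
```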
Building on ColBERT's architectural foundation, Jina ColBERT v2 introduced a multilingual variant of the late interaction retriever, supporting retrieval in 89 languages. It uses Matryoshka Representation Learning to enable flexible embedding sizes with minimal quality loss: output embeddings of 128, 96, or 64 dimensions, which consequently saves significant storage costs. The model is available on Hugging Face and via the Jina Search Foundation API.
On the infrastructure side, ColBERT Live!, developed by DataStax, provides a search pipeline that integrates ColBERT into vector databases, supporting live document updates (add/update/remove) without full reindexing, as well as filtering and complex predicates on structured fields. In addition, ColBERT Live! supports both document and query embedding pooling, which eliminates low-signal vectors from the document/query embeddings, improving relevance and making searches faster. The code is available in the colbert-live repository.
Moreover, Vespa has introduced a ColBERT embedder that reduces the vector storage footprint with compression support. The embedder maps text to token embeddings, representing a text as multiple contextualised embeddings. Building on that foundation, Vespa's long-context ColBERT implementation extends the embedder to enable efficient retrieval over long documents.
In addition, Qdrant supports ColBERT multi-vector generation and indexing into vector space via its FastEmbed library.
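A small sketch of what that looks like with FastEmbed (assuming `pip install fastembed`; the model weights are downloaded on first use):

```python
from fastembed import LateInteractionTextEmbedding

# Load a ColBERT checkpoint supported by FastEmbed.
model = LateInteractionTextEmbedding("colbert-ir/colbertv2.0")

# Documents and queries are embedded with different preprocessing,
# hence the separate methods.
doc_embeddings = list(model.embed(["ColBERT produces one vector per token."]))
query_embeddings = list(model.query_embed("what is late interaction?"))

# Each result is a (num_tokens, dim) array, i.e. a multi-vector per text.
print(doc_embeddings[0].shape, query_embeddings[0].shape)
```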
Lastly, there are tools for training and fine-tuning ColBERT models. For example, RAGatouille's RAGTrainer supports fine-tuning existing ColBERT model instances and training new ones from other kinds of transformers. Another example is PyLate, a library designed to simplify and optimise fine-tuning, inference, and retrieval with ColBERT models. PyLate lists a number of pre-trained ColBERT models, for example jinaai/jina-colbert-v2 and answerdotai/answerai-colbert-small-v1 (all the available models).
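For instance, encoding with PyLate looks roughly like this (a sketch assuming `pip install pylate` and the answerdotai/answerai-colbert-small-v1 checkpoint mentioned above):

```python
from pylate import models

# Load a pre-trained ColBERT model through PyLate.
model = models.ColBERT(model_name_or_path="answerdotai/answerai-colbert-small-v1")

# Queries and documents are encoded differently, controlled by is_query.
query_vecs = model.encode(["what is late interaction?"], is_query=True)
doc_vecs = model.encode(["ColBERT encodes each token separately."], is_query=False)

# Each entry is a (num_tokens, dim) multi-vector representation.
print(query_vecs[0].shape, doc_vecs[0].shape)
```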
In conclusion, ColBERT has evolved from a research idea into a practical ecosystem—with robust models, convenient wrappers, and production-ready infrastructure.
Thanks for reading — stay tuned for more insights!
Need Help With This Topic?
If you’re struggling with ColBERT, don’t worry – we’re here to help!
Our team offers expert services and training to help you optimize your search engine and get the most out of your system. Contact us today to learn more!