We are delighted to announce the eleventh London Information Retrieval Meetup, a free evening meetup aimed at Information Retrieval enthusiasts and professionals who are curious to explore and discuss the latest trends in the field.
Due to COVID-19, this edition of the meetup will be held online.
JO KRISTIAN BERGUM
Jo works as a distinguished engineer at Yahoo where he spends his time working on Vespa, the open-source big data serving engine.
Taking the neural search paradigm shift to production
Search is going through a paradigm shift, sometimes referred to as the “BERT revolution.” The introduction of pre-trained language transformer models like BERT has brought significant advancements in search and document ranking state-of-the-art.
Bringing these promising methods to production in an end-to-end search serving system is not trivial. It requires substantial middleware glue and deployment effort to connect open-source tools like Apache Lucene, vector search libraries (e.g., FAISS), and model inference servers. However, the open-source serving engine Vespa, which Yahoo has developed since 2003, offers features that enable implementing state-of-the-art retrieval and ranking methods using a single serving engine stack, significantly reducing deployment complexity, cost, and failure modes.
This talk gives an overview of the Vespa search serving architecture and the features that enable expressing state-of-the-art retrieval and ranking methods. We dive into Vespa’s implementations of sub-linear retrieval algorithms for sparse and dense representations to efficiently produce candidate documents for (re-)ranking. Vespa allows expressing the end-to-end multi-stage retrieval and ranking pipeline, including inference using transformer models. We also touch on real-world application constraints, such as filtering and search result diversification.
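As a taste of what such a multi-stage pipeline can look like, here is a minimal, illustrative sketch of a Vespa schema combining sparse (BM25) and dense (nearest-neighbor) retrieval with phased ranking. All field names, the embedding dimension, and the rerank count are hypothetical choices for illustration, not taken from the talk:

```
# Hypothetical Vespa schema sketch: hybrid sparse + dense retrieval
# with two-phase ranking. Names and dimensions are illustrative only.
schema passage {
    document passage {
        field text type string {
            indexing: index | summary   # inverted index for BM25
        }
        field embedding type tensor<float>(x[384]) {
            indexing: attribute | index # enables approximate nearest-neighbor search
        }
    }

    rank-profile hybrid {
        # First phase: cheap score over all retrieved candidates,
        # mixing the sparse and dense signals.
        first-phase {
            expression: bm25(text) + closeness(field, embedding)
        }
        # Second phase: more expensive re-ranking of the top candidates only.
        second-phase {
            rerank-count: 100
            expression: bm25(text) + 10 * closeness(field, embedding)
        }
    }
}
```

The point of the phased design is that the expensive ranking expression (in production this could include transformer model inference) runs only on the top candidates from the first phase, keeping latency bounded while still applying a strong model where it matters.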