Apache Lucene/Solr AI Roadmap – Do You Want to Make It Happen?

In the last few years at Sease, we’ve been contributing a lot: from mailing list and Slack support to bug fixes and new features, we’ve been extremely active in the Open Source scene.
Especially in the Apache Lucene and Solr projects, led by our director Alessandro Benedetti, we managed to push many different contributions upstream.
Andrea, Elia, Anna, Ilaria and Daniele have all been able to publish some work. They are not committers yet, but it’s rewarding and encouraging, and we are proud of them.
We were also able to share our work at various international conferences, and that’s definitely a plus! (Sharing is caring)

Sease has invested a considerable amount of time and money to bring many important features to Lucene and Solr. Let me list the major ones (if you take a look at the git logs, there are many others):

APACHE SOLR
  • Neural/Vector-based search
  • Pre-filtering for Neural/Vector Search
  • Function queries for vector distance similarity (to be used with Learning To Rank)
  • Byte/Float encodings support for dense vectors
  • High-dimensional vectors support
  • Learning To Rank feature extractor improvements (null and log all capabilities)
  • Learning To Rank various bug fixes and improvements
  • Learning To Rank Interleaving
  • Weighted Synonyms
APACHE LUCENE
  • Auto-generated synonyms on the fly using Word2Vec
  • Function queries for vector distance similarity
  • Bug fixes for HNSW
  • Weighted synonyms
  • Document classification

Although we are happy and proud of what we achieved fully self-funded (aside from the Apache Solr Interleaving work, which was sponsored by AdoreBeauty), we have to admit that Apache Solr is lagging behind other search technologies with regard to AI and Machine Learning investment, and that’s a shame!

We are a small company, and although we truly believe Open Source to be fundamental to connecting academia and industry in Information Retrieval, we can dedicate only limited time to contributions that are not sponsored.

The purpose of this post is to illustrate what we have on our roadmap and to ask for funds to make it happen!

Yes, exactly: we’re asking for your money (well, your company’s money) for a greater good, since the outcome of the work will be fully contributed back to Apache Lucene/Solr.


The idea for this blog post comes from the interesting talk by Matt Yonkovit at the Community Over Code conference (formerly ApacheCon): https://communityovercode.org/schedule-list/#SY005
You should not be ashamed to ask for money to make your favourite Open Source project better!

Without further ado, let’s see what’s on the plate!

Artificial Intelligence

1) Apache Solr Neural Highlighter

A highlighter that takes a language model as input and uses it at runtime to build a snippet for each document, containing the paragraph of text most relevant to the query.
This component is based not on lexical keyword matching but on semantic matching against the information requested.
It’s a wonderful addition to the Solr explainability tools, especially valuable for end users looking for quick, useful information in each search result.
We have already implemented this work as a commercial plugin, but we would be happy to make it Open Source if sponsored.
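
To make the idea concrete, here is a minimal Python sketch of snippet selection by similarity. It is not the code of our plugin and not a Solr API: the function names are hypothetical, and `toy_embed` is a bag-of-words placeholder standing in for a real language model, so in this form the matching is still lexical. Swapping in a real sentence-embedding model is what would make it semantic.

```python
import math
from collections import Counter
from typing import Callable, Dict

Vector = Dict[str, float]


def toy_embed(text: str) -> Vector:
    """Toy stand-in for a language model: a bag-of-words count vector.

    ASSUMPTION: a real neural highlighter would call a sentence-embedding
    model here; this placeholder only keeps the sketch self-contained.
    """
    counts = Counter(text.lower().split())
    return {token: float(count) for token, count in counts.items()}


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(weight * b.get(token, 0.0) for token, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_snippet(query: str, document: str,
                 embed: Callable[[str], Vector] = toy_embed) -> str:
    """Return the paragraph (blank-line separated) most similar to the query."""
    paragraphs = [p.strip() for p in document.split("\n\n") if p.strip()]
    if not paragraphs:
        return ""
    query_vector = embed(query)
    return max(paragraphs, key=lambda p: cosine(query_vector, embed(p)))
```

The `embed` parameter is the only place a model would plug in, which is roughly how we picture the highlighter being configured with a language model.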

2) Apache Solr End-to-End Neural Search

This work brings to Solr the ability to parse natural language queries and encode them to vectors, in order to run vector-based search.
All out of the box, with no external integration needed.
Various components need to be designed and developed:
– a Language Model Handler: a server responsible for loading and handling language models and offering an inference endpoint, independently scalable and fully dedicated to Solr
– an update request processor to enrich documents at indexing time, potentially chunking them and encoding them to vectors. It can be configured to talk to an external inference service or to interact directly with the Language Model Handler
– a query parser that can be configured to talk to an external inference service or to interact directly with the Language Model Handler. Its main responsibility is to encode the user query from text to vector and then run a K nearest neighbour search
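
The query-side step can be sketched in a few lines of Python. This is only an illustration of the flow (text in, vector out, kNN query to Solr), written client-side; the proposed work would do this inside Solr. The `{!knn f=... topK=...}[...]` syntax is Solr's existing knn query parser, while `fake_encode`, the field name `vector` and the function names are assumptions made up for this example.

```python
from typing import Callable, Dict, List


def fake_encode(text: str) -> List[float]:
    """HYPOTHETICAL encoder: NOT a language model, just a deterministic stub.

    A real setup would call an inference endpoint and get back an embedding
    with the same dimension as the indexed dense vector field.
    """
    return [float(len(text)), float(text.count(" "))]


def build_knn_params(text: str,
                     encode: Callable[[str], List[float]],
                     field: str = "vector",
                     top_k: int = 10) -> Dict[str, str]:
    """Encode the query text and wrap the vector in a Solr knn query."""
    vector = encode(text)
    if not vector:
        raise ValueError("the encoder returned an empty vector")
    vector_literal = "[" + ", ".join(repr(float(x)) for x in vector) + "]"
    return {"q": f"{{!knn f={field} topK={top_k}}}{vector_literal}"}


params = build_knn_params("hello world", fake_encode, field="vector", top_k=3)
print(params["q"])  # {!knn f=vector topK=3}[11.0, 1.0]
```

The returned dictionary would be sent as request parameters to a Solr select handler; that HTTP call is left out to keep the sketch runnable without a server.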

3) Better Hybrid Search in Apache Solr

The idea here is to implement various approaches to combine and re-score search results coming from both lexical and neural models.
We’re talking about Reciprocal Rank Fusion algorithms and better support in Learning To Rank for vector similarity as a feature.
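
For readers new to it, Reciprocal Rank Fusion scores each document by summing 1 / (k + rank) over every ranked list in which it appears, then sorts by that sum. Below is a minimal Python sketch; the function name and the document ids are made up for the example, and k = 60 is a commonly used constant rather than anything Solr mandates.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists of document ids with Reciprocal Rank Fusion.

    Ranks are 1-based. Each list is assumed to contain no duplicate ids.
    Ties are broken by document id so the output is deterministic.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


lexical = ["a", "b", "c"]   # e.g. results of a keyword query
neural = ["c", "a", "d"]    # e.g. results of a vector query
print(reciprocal_rank_fusion([lexical, neural]))  # ['a', 'c', 'b', 'd']
```

Because only ranks are used, the lexical and neural scores never need to be normalised against each other, which is the main appeal of this approach for hybrid search.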

4) Apache Solr Large Language Model Query Rewriter

This component is responsible for parsing a natural language query and building a structured Solr query, leveraging the interaction with a configured Large Language Model and the internal Solr index.
The result will be a new, easy-to-debug Solr query that combines the power of the Solr inverted index terms with the query expansion and understanding capabilities of LLMs.
For example, wouldn’t it be cool in your music search engine to express your need as “I want intense music for a surfing session” and get back a structured Solr query with the best terms for each field of your Solr document (and consequently highly relevant results)?
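
As a rough illustration of the last step only, here is a Python sketch that turns per-field terms into a structured Solr query string. Everything in it is an assumption for the sake of the example: the LLM call is left out entirely, `field_terms` stands for whatever the model would suggest, the field names `genre` and `mood` are invented, and joining fields with AND is just one possible design.

```python
from typing import Dict, List


def quote_term(term: str) -> str:
    """Wrap a term in double quotes, escaping backslashes and quotes."""
    escaped = term.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def build_structured_query(field_terms: Dict[str, List[str]]) -> str:
    """Build `field:("t1" OR "t2") AND ...` from per-field term lists.

    Fields with no terms are skipped; with nothing left we fall back to
    the match-all query `*:*`.
    """
    clauses = []
    for field, terms in field_terms.items():
        if terms:
            joined = " OR ".join(quote_term(t) for t in terms)
            clauses.append(f"{field}:({joined})")
    return " AND ".join(clauses) if clauses else "*:*"


# HYPOTHETICAL output of an LLM for the surfing-session request.
suggested = {"genre": ["surf rock", "punk"], "mood": ["energetic"]}
print(build_structured_query(suggested))
# genre:("surf rock" OR "punk") AND mood:("energetic")
```

The point of producing a plain query string is the debuggability mentioned above: you can read it, edit it and run it by hand.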

5) Apache Solr Retrieval Augmented Generation

Once configured with a Large Language Model (inference can happen locally, on a dedicated Language Model Handler component, or remotely, by accessing external APIs), this component will take as input the query and the top-k results as context (coming from lexical, neural or hybrid search) and use the LLM to craft the perfect answer with citations.
Grounded generative AI!
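
A small Python sketch of the "grounding" part: assembling a prompt from the query and the top-k results, with numbered passages the model can cite. The prompt wording, the `text` field name and the function name are assumptions for illustration, and the call to the LLM itself is intentionally omitted.

```python
from typing import Dict, List


def build_rag_prompt(query: str,
                     results: List[Dict[str, str]],
                     text_field: str = "text",
                     max_chars: int = 500) -> str:
    """Build a prompt that asks the model to answer only from the passages.

    `results` are the top-k documents (already ranked); each passage is
    truncated to `max_chars` and numbered so it can be cited as [n].
    """
    lines = [
        "Answer the question using only the numbered passages below.",
        "Cite passages as [n] after each claim.",
        "",
    ]
    for n, doc in enumerate(results, start=1):
        passage = str(doc.get(text_field, ""))[:max_chars]
        lines.append(f"[{n}] (id={doc.get('id')}) {passage}")
    lines += ["", f"Question: {query}", "Answer:"]
    return "\n".join(lines)


top_k = [
    {"id": "d1", "text": "HNSW is a graph index."},
    {"id": "d2", "text": "Solr is a search server."},
]
print(build_rag_prompt("What is HNSW?", top_k))
```

Keeping the document id next to each passage number lets the component map a citation like [1] in the generated answer back to the Solr document it came from.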

6) Apache Lucene Multi-Valued Vectors

This work is in progress and you can find various presentations about it at international conferences:
https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/YPUPAA/

The scope is to give Lucene the ability to index multiple vectors per field (quite useful when working with long documents).
This will make the capability available to the other Lucene-based search engines (Solr, Elasticsearch and OpenSearch).
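
To show why several vectors per document help with long texts, here is a brute-force Python sketch where each document holds one vector per chunk and is scored by its best-matching chunk. This is an assumption-laden toy: it does not use Lucene or HNSW, the function names are hypothetical, and taking the maximum is only one possible way to aggregate chunk scores.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two dense vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def score_document(query: Sequence[float],
                   vectors: Sequence[Sequence[float]]) -> float:
    """ASSUMPTION: a document scores as its most similar vector (max)."""
    return max(cosine(query, v) for v in vectors)


def knn(query: Sequence[float],
        docs: Dict[str, List[List[float]]],
        k: int) -> List[str]:
    """Exhaustive top-k over documents that each hold one or more vectors."""
    scored = [(score_document(query, vs), doc_id)
              for doc_id, vs in docs.items() if vs]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:k]]


docs = {
    "long": [[1.0, 0.0], [0.0, 1.0]],  # two chunks, two vectors
    "short": [[0.7, 0.7]],             # a single vector
}
print(knn([0.0, 1.0], docs, k=1))  # ['long']
```

With a single averaged vector per document, the second chunk of `long` would be diluted by the first; keeping both lets the relevant chunk win on its own.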

Learning to Rank

Feature Vector Cache Improvements

We’ve been working a lot on this functionality.
Currently, the feature vector cache is not used at ranking time, and query-level, document-level and query-document-level features are not cached independently.
Addressing this could bring substantial performance improvements at query time for Learning To Rank:
https://issues.apache.org/jira/browse/SOLR-10448
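
The benefit of caching the three feature groups independently can be sketched in Python as follows. This is not how Solr's Learning To Rank cache is implemented; the class, its attributes and the extractor callables are all hypothetical, and the sketch ignores eviction and index updates.

```python
from typing import Callable, Dict, List, Tuple

Features = List[float]


class FeatureCache:
    """Caches query-level, document-level and query-document-level features
    in three separate maps, so each group is computed once per key."""

    def __init__(self,
                 query_fn: Callable[[str], Features],
                 doc_fn: Callable[[str], Features],
                 pair_fn: Callable[[str, str], Features]) -> None:
        self._query_fn = query_fn
        self._doc_fn = doc_fn
        self._pair_fn = pair_fn
        self.query_cache: Dict[str, Features] = {}
        self.doc_cache: Dict[str, Features] = {}
        self.pair_cache: Dict[Tuple[str, str], Features] = {}

    def features(self, query: str, doc_id: str) -> Features:
        """Return the concatenated feature vector for (query, doc_id)."""
        if query not in self.query_cache:
            self.query_cache[query] = self._query_fn(query)
        if doc_id not in self.doc_cache:
            self.doc_cache[doc_id] = self._doc_fn(doc_id)
        key = (query, doc_id)
        if key not in self.pair_cache:
            self.pair_cache[key] = self._pair_fn(query, doc_id)
        return (self.query_cache[query]
                + self.doc_cache[doc_id]
                + self.pair_cache[key])
```

When re-ranking 100 documents for one query, the query-level extractor then runs once instead of 100 times, and document-level features are reused across different queries; only the query-document features have to be computed per pair.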

In general, we’ve been maintaining both the Learning To Rank and Neural Search areas in Apache Solr practically single-handedly. We have the skills, but we need more funds to sustain this line of work and contributions. So if you have any cool ideas you would like to see, we are happy to design them, implement them and commit them to the official Open Source project!
The Open Source landscape is a continuous give and take, so your donation could be fundamental to making Solr better!

Want to fund open source contributions?

If you want to make a difference in the open source landscape, feel free to get in touch.

Author

Alessandro Benedetti

Alessandro Benedetti is the founder of Sease Ltd. A Senior Search Software Engineer, he focuses on R&D in information retrieval, information extraction, natural language processing, and machine learning.
