Apache Lucene/Solr AI Roadmap – Do You Want to Make It Happen?

In the last few years at Sease, we’ve been contributing a lot: from mailing list and Slack support to bug fixes and new features, we’ve been extremely active in the Open Source scene.
Especially in the Apache Lucene and Solr projects, led by our director Alessandro Benedetti, we managed to push many different contributions upstream.
Andrea, Elia, Anna, Ilaria and Daniele have all been able to publish some work; they are not committers yet, but it’s rewarding and encouraging, and we are proud of them.
We were also able to share our work at various international conferences, and that’s definitely a plus! (Sharing is caring)

Sease invested a considerable amount of time and money to bring many important features to Lucene and Solr. Let me list the major ones (if you take a look at the git logs, there are many others):

APACHE SOLR

Neural/Vector-based search
Pre-filtering for Neural/Vector Search
Function queries for vector distance similarity (to be used with Learning To Rank)
Byte/Float encodings support for dense vectors
High-dimensional vectors support
Learning To Rank feature extractor improvements (null and log all capabilities)
Learning To Rank various bug fixes and improvements
Learning To Rank Interleaving
Weighted Synonyms

APACHE LUCENE

Auto-generated synonyms on the fly using Word2Vec
Function queries for vector distance similarity
Bug fixes for HNSW
Weighted synonyms
Document classification

Although we are definitely happy and proud of what we achieved fully self-funded (aside from the Apache Solr Interleaving work that was sponsored by AdoreBeauty), we have to admit that Apache Solr is lagging behind other search technologies with regard to AI and Machine Learning investment, and that’s a shame!

We are a small company and, although we truly believe Open Source to be fundamental to connecting academia and industry in Information Retrieval, we can dedicate only limited time to our contributions if they are not sponsored.

The aim of this post is to illustrate what we have on our roadmap and ask for funds to make it happen!

Yes, exactly: we’re asking for your money (well, your company’s money) for a greater good, since the outcome of the work will be fully contributed back to Apache Lucene/Solr.

The idea for this blog post comes from the interesting talk by Matt Yonkovit at the Community Over Code conference (formerly ApacheCon): https://communityovercode.org/schedule-list/#SY005
You should not be ashamed to ask for money to make your favourite Open Source project better!

Without further ado, let’s see what’s on the plate!

Artificial Intelligence

1) Apache Solr Neural Highlighter

A highlighter that takes a language model as input and uses it at runtime to build a snippet for each document, containing the paragraph of text most relevant to the query.
This component is not based on lexical keyword matching but on semantic matching of the information requested.
It’s a wonderful addition to the Solr explainability tools, especially valuable for end users looking for quick, useful information from each search result.
This work in particular has already been implemented by us as a commercial plugin, but we would be happy to make it Open Source if sponsored.
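
To make the idea of semantic snippet selection more concrete, here is a minimal sketch in Python. It is not the code of our plugin: the function names are illustrative, and `toy_embed` is a hypothetical stand-in for a real language model (it only counts words from two hand-picked topics). In a real setup, the `embed` argument would be the inference call of the configured model.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either one is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_snippet(query: str, paragraphs: List[str],
                 embed: Callable[[str], Vector]) -> str:
    """Return the paragraph whose embedding is closest to the query embedding."""
    query_vec = embed(query)
    return max(paragraphs, key=lambda p: cosine(query_vec, embed(p)))


# ASSUMPTION: a toy "model" used only to keep the sketch runnable.
# Each dimension counts the words of a text that belong to one topic.
TOPICS = {
    "sea": {"surf", "wave", "waves", "ocean", "beach"},
    "snow": {"ski", "snow", "mountain"},
}


def toy_embed(text: str) -> List[float]:
    words = text.lower().split()
    return [float(sum(w in vocab for w in words)) for vocab in TOPICS.values()]
```

Calling `best_snippet("surf", paragraphs, toy_embed)` on a document split into paragraphs returns the paragraph about the sea even when the word "surf" never appears in it, which is the behaviour a keyword-based highlighter cannot offer.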

2) Apache Solr End-to-End Neural Search

This work brings to Solr the ability to parse natural language queries and encode them to vectors, to then run vector-based search.
All out of the box, with no external integration needed.
Various components need to be designed and developed:
– a Language Model Handler: a server responsible for loading and handling language models and offering an inference endpoint, independently scalable and fully dedicated to Solr.
– an update request processor to enrich documents at indexing time, potentially chunking them and encoding them to vectors. It can be configured to talk with an external inference service or to interact directly with the Language Model Handler.
– a query parser that can be configured to talk with an external inference service or to interact directly with the Language Model Handler. Its main responsibility is to encode the user query from text to vector and then run a K-nearest-neighbour search.
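
As a rough illustration of what the query parser would automate, the sketch below does the same steps by hand on the client side: encode the text, then build the request parameters for the existing Solr `knn` query parser. The `encode` callable is a placeholder for whatever inference service is used, and the field name `vector` and the function name are assumptions made for the example.

```python
from typing import Callable, Dict, Sequence


def build_knn_params(query_text: str,
                     encode: Callable[[str], Sequence[float]],
                     field: str = "vector",
                     top_k: int = 10) -> Dict[str, str]:
    """Encode the query text and wrap the vector in a Solr knn query."""
    # ASSUMPTION: `encode` calls the language model and returns the embedding.
    vector = encode(query_text)
    vector_literal = "[" + ", ".join(repr(float(v)) for v in vector) + "]"
    # Syntax of the Solr knn query parser: {!knn f=<field> topK=<k>}[v1, v2, ...]
    query = "{!knn f=" + field + " topK=" + str(top_k) + "}" + vector_literal
    return {"q": query}
```

The returned dictionary can be sent as request parameters to the `/select` handler of a collection whose schema defines a dense vector field with the chosen name. With the proposed work, this encoding step would happen inside Solr instead of in client code.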

3) Better Hybrid Search in Apache Solr

The idea here is to implement various approaches to combine and re-score search results coming from both lexical and neural models.
We’re talking about Reciprocal Rank Fusion algorithms and better support in Learning To Rank for vector similarity as a feature.
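
For readers not familiar with Reciprocal Rank Fusion, here is a minimal sketch of the basic algorithm: each document gets a score of 1 / (k + rank) from every result list it appears in, and the scores are summed. The constant `k = 60` is the value commonly used in the literature; the function name and the tie-breaking by document id are our own choices for the example, not an existing Solr API.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document ids into a single ranking."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks start at 1, so the top document contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


lexical_results = ["a", "b", "c"]   # e.g. from a keyword query
neural_results = ["c", "a", "d"]    # e.g. from a vector query
fused = reciprocal_rank_fusion([lexical_results, neural_results])
```

Documents that appear near the top of both lists ("a" and "c" here) end up ahead of documents found by only one of the two models, without any need to normalise the original scores.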

4) Apache Solr Large Language Model Query Rewriter

This component has the responsibility of parsing a natural language query and building a structured Solr Query, leveraging the interaction with a configured Large Language Model and the internal Solr index.
The result will be a new, easy-to-debug Solr query that leverages the combined power of the Solr inverted index terms and the query expansion and understanding capabilities of LLMs.
For example, wouldn’t it be cool, in your music search engine, to express your need as “I want some intense music for a surfing session” and get back a structured Solr query with the best terms for each field of your Solr document (and consequently highly relevant results)?

5) Apache Solr Retrieval Augmented Generation

Once configured with a Large Language Model (inference can happen locally on a dedicated Language Model Handler component or remotely, accessing external APIs), this component will be able to take as input the query and the top-k results as context (coming from lexical, neural or hybrid search) and use the LLM to craft the perfect answer with citations.
Grounded generative AI!
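
A simple way to picture the grounding step is the prompt assembly below. It is only a sketch under our own assumptions: the prompt wording, the function name and the `id`/`text` keys of each result are illustrative, and the actual call to the LLM is left out on purpose.

```python
from typing import Dict, List


def build_grounded_prompt(query: str, results: List[Dict[str, str]]) -> str:
    """Build a prompt that asks the LLM to answer only from the top-k results."""
    # Number each retrieved document so the answer can cite it as [1], [2], ...
    context_lines = [
        f"[{i}] (id={doc['id']}) {doc['text']}"
        for i, doc in enumerate(results, start=1)
    ]
    return "\n".join([
        "Answer the question using only the numbered sources below.",
        "Cite the sources you use with their number in square brackets.",
        "",
        "Sources:",
        *context_lines,
        "",
        f"Question: {query}",
        "Answer:",
    ])
```

Because every source carries its Solr document id, the citations in the generated answer can be mapped back to the search results they came from.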

6) Apache Lucene Multi-Valued Vectors

This work is in progress and you can find various presentations about it at international conferences:
https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/YPUPAA/

The aim is to give Lucene the ability to index multiple vectors per field (quite useful when working with long documents).
This will make the capability available to the Lucene-based search engines (Solr, Elasticsearch and OpenSearch).
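
To show why multiple vectors per field help with long documents, here is a brute-force sketch in plain Python. It assumes each document is split into chunks with one vector per chunk, and that a document is scored by its best-matching chunk. Taking the maximum is just one possible aggregation chosen for the example, and an exhaustive scan is used instead of an HNSW graph; none of the names below are Lucene APIs.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot-product similarity between two vectors of the same length."""
    return sum(x * y for x, y in zip(a, b))


def score_document(query_vec: Vector, doc_vectors: List[Vector]) -> float:
    # ASSUMPTION: a document is as relevant as its most similar chunk.
    return max(dot(query_vec, v) for v in doc_vectors)


def top_k(query_vec: Vector, docs: Dict[str, List[Vector]], k: int) -> List[str]:
    """Return the ids of the k best documents, each one listed only once."""
    ranked = sorted(docs, key=lambda d: (-score_document(query_vec, docs[d]), d))
    return ranked[:k]


docs = {
    "long": [[1.0, 0.0], [0.0, 1.0]],   # two chunks about different topics
    "short": [[0.6, 0.6]],              # a single chunk
}
best = top_k([0.0, 1.0], docs, k=2)
```

The long document is matched through its second chunk and still appears only once in the results, which is what indexing the chunks as separate documents does not give you for free.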

Learning to Rank

Feature Vector Cache Improvements

We’ve been working a lot on this functionality.
Currently, the feature vector cache is not used at ranking time, and query-level, document-level and query-document-level features are not independently cached.
Addressing this could bring substantial performance improvements at query time for Learning to Rank:
https://issues.apache.org/jira/browse/SOLR-10448
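
The intuition behind caching the three kinds of features independently can be sketched as follows. This is a conceptual illustration, not the design tracked in the Jira issue and not Solr code: the class, its dictionaries and the `computations` counter are hypothetical names introduced for the example.

```python
from typing import Callable, Dict, List, Tuple


class ScopedFeatureCache:
    """Cache query, document and query-document features separately."""

    def __init__(self,
                 query_features: Callable[[str], List[float]],
                 document_features: Callable[[str], List[float]],
                 query_document_features: Callable[[str, str], List[float]]):
        self._query_fn = query_features
        self._doc_fn = document_features
        self._pair_fn = query_document_features
        self._query_cache: Dict[str, List[float]] = {}
        self._doc_cache: Dict[str, List[float]] = {}
        self._pair_cache: Dict[Tuple[str, str], List[float]] = {}
        self.computations = 0  # how many times a feature function really ran

    def _get(self, cache: dict, key, compute: Callable[[], List[float]]) -> List[float]:
        if key not in cache:
            cache[key] = compute()
            self.computations += 1
        return cache[key]

    def feature_vector(self, query: str, doc_id: str) -> List[float]:
        """Assemble the full feature vector, reusing every cached part."""
        q = self._get(self._query_cache, query, lambda: self._query_fn(query))
        d = self._get(self._doc_cache, doc_id, lambda: self._doc_fn(doc_id))
        qd = self._get(self._pair_cache, (query, doc_id),
                       lambda: self._pair_fn(query, doc_id))
        return q + d + qd
```

With this split, query-level features are computed once per query instead of once per re-ranked document, and document-level features can be reused across different queries.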

In general, we’ve been practically single-handedly maintaining both the Learning To Rank and Neural Search areas in Apache Solr. We have the skills, but we need more funds to sustain this line of work and contributions. So if you have any cool ideas you would like to see, we are happy to design and implement them and commit them to the official Open Source project!
The Open Source landscape is a continuous give and take, so your donation could be fundamental to making Solr better!

Want to fund open source contributions?

If you want to make a difference in the open source landscape, feel free to get in touch.
