Community Over Code EU 2024 Apache Lucene/Solr Birds of a Feather

June 25, 2024
Community Over Code EU 2024 (formerly ApacheCon) took place from the 3rd to the 5th of June in Bratislava (Slovakia).

It was a great conference, full of energy, passion and discussions about the Open Source projects we love.

It was hugely pleasant and thought-provoking to meet many fellow Lucene/Solr committers, contributors and practitioners during and after the conference. Bratislava has a small city centre, so it was quite easy to meet up and continue the discussions after the sessions: such a great location!

The search track at the conference was somewhat short. There were interesting talks (we talked about Hybrid Search in Apache Solr), but I hope we’ll get more talks in the next editions!

On the 3rd of June, Anshum and I organised a Lucene/Solr Birds of a Feather (BoF) to collect feedback from the community and assess the status of the projects.

This blog post summarises that event. Huge thanks to Stefan Vodita, a Lucene committer, who took notes during the event. Without his effort, this blog post would not have been possible.

Roughly 10 people participated (with some on and off during the 60-90 minutes of the event):

Anshum

Part of Lucene and Solr from 2006 to now. Committer and PMC member; works at Apple, helping people use Solr and Lucene.

Andreea

Works at Emag. Stores keywords in Solr as very small documents. They plan to use Apache Solr more extensively; right now it is used to provide some form of per-query seller promotion.

Alex

Works at Microsoft on managed Cassandra (which ships with a Lucene extension).

Atita

Has worked in search since 2007. She works with a lot of vector search use cases right now. She’s very passionate about the Solr community.

Reinhart

He follows many projects around Solr. Nowadays he works mostly in project management. Last year his company started developing a search analytics platform.
This came from the observation that many projects suffer from a lack of customer data. He noticed OpenSearch is looking in this direction, but Solr is missing it (SOLR-10359).

Claude

Likes to experiment with search space reduction strategies. Interested in indexing bit-vectors for Bloom filter applications.

Adrien

Currently working at AMD as a firmware developer for network cards. Worked with Lucene at Amazon as an intern. Curious to hear what’s new.

Stefan

Works in Amazon product search and on Lucene. Interested in aggregations (Did you know Lucene has aggregations too?).

Ranimier

Works on Cassandra, which uses a bit of Lucene. Trying to contribute JVector.

Alessandro

Apache Lucene/Solr committer and Solr PMC member. Director of Sease and author of this blog post 🙂

The session started quite informally with a round of introductions from all participants, and then we moved to a “raise your hand” model to discuss interesting points about Lucene and Solr.

Are Lucene and Solr offering Vector-Based search?

The first point of discussion, raised by Atita, was that not many people know that Apache Lucene and Solr are capable of vector-based and hybrid search: this is visible from LinkedIn discussions, insight diagrams and explanatory posts on the matter.
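For anyone who has not seen these features in action, here is a minimal sketch of what the request parameters for a vector query and a hybrid query can look like. It relies on Solr's `{!knn}` query parser (available since Solr 9.0) and on the `{!bool}` parser to combine a lexical clause with a vector clause. The helper names and the `vector` and `title` field names are hypothetical, and the sketch assumes the collection already has a dense vector field defined in its schema.

```python
from urllib.parse import urlencode


def knn_params(field, vector, top_k=10):
    """Build request parameters for a Solr {!knn} query (hypothetical helper)."""
    # Solr expects the query vector as a bracketed, comma-separated list.
    vector_literal = "[" + ", ".join(str(x) for x in vector) + "]"
    return {"q": f"{{!knn f={field} topK={top_k}}}{vector_literal}"}


def hybrid_params(text_field, text, vector_field, vector, top_k=10):
    """Combine a lexical and a vector clause as two 'should' clauses."""
    return {
        "q": "{!bool should=$lexicalQuery should=$vectorQuery}",
        "lexicalQuery": f"{{!type=edismax qf={text_field}}}{text}",
        "vectorQuery": knn_params(vector_field, vector, top_k)["q"],
    }


if __name__ == "__main__":
    # Field names below are assumptions made for illustration only.
    print(knn_params("vector", [1.0, 2.0, 3.0, 4.0], top_k=3)["q"])
    # The URL-encoded form is what would be appended to /select?
    print(urlencode(hybrid_params("title", "solr", "vector", [1.0, 2.0, 3.0, 4.0])))
```

The sketch only builds the parameters, so it runs without a Solr instance; sending them to a collection's `/select` handler is left to whichever HTTP client you prefer.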

There’s been an effort to promote these features. I’ve done this personally with my conference talks and blog posts, but it’s clearly not enough.

The overall idea is that the PMC should focus more on these activities and encourage non-coder contributors.

Wait a minute!

Do you like blogging? Evangelising tech? Contributing to websites? We (Apache Lucene/Solr projects) want you!

Having online demos could help a lot in showcasing such functionalities, but we (Apache Solr project) need servers and maintainers (something similar to Search.py by Mike McCandless).

With regard to an official blog, I am delighted to share that we (Apache Solr PMC) launched the Solr Blog initiative, and anyone is welcome to contribute their own or external blog posts!

Another interesting idea is to aggregate Apache Lucene/Solr videos on a dedicated official channel, definitely something to look into.

Short-Term Action: Pull Request to change both Solr and Lucene websites to clearly advertise their support for vector-based search and hybrid search
– 13514
– 17341

Mid-Term Action: As Atita suggested, ‘Demos speak louder than words’: it would be cool to have a live demo showcasing Solr’s capabilities on a well-known, easy-to-understand public dataset (MS Marco?).

What about the default Apache Solr configurations?

How many times have you seen your company or your clients using the default schema.xml or solrconfig.xml, even with exactly the same default comments?

In my case, a lot.

To counteract this, at a certain point there was an initiative to provide a minimal schema.xml/solrconfig.xml, plus example directories showing how to configure Solr for the most common scenarios (simplest solr config), but it’s still an open issue.

Sometimes starting from old, outdated default schemas even leads to unexpected and incorrect configurations (for example, creating unique fields instead of unique values).

Ideally, we want to have a minimal configuration and very clear examples, well-categorized and simple enough to make Solr look good.

Too many ways of doing the same things

This is a recurring topic that we discussed both in New Orleans (ApacheCon 2022) and in Halifax (Community Over Code 2023).

Many features in Lucene and Solr can be implemented in different ways and components.

At the same time, there’s always going to be THE ideal solution, but a person or a team may well end up implementing a feature in a more-or-less OK way, only to discover later that it doesn’t scale well and come away with the wrong impression of the project.

Flexibility is powerful for the expert user but very delicate for the average practitioner.

This is still an open problem, and we should probably address it with better documentation and more deprecations.

Cassandra and vector-based search

Cassandra has integrated vector-based search using JVector.

One of the reasons for going with JVector was that it proved to be quite difficult and time-consuming to contribute the functionality to Lucene.

The investment in this work was driven more by the hype and the sentiment in the community than by immediate requirements from clients.

There were rumours that the integration happened in a week using GitHub Copilot, but in reality it’s more likely that this was true only of the first prototype, and that for production the code was replaced by something less risky (especially from a licensing perspective).

Index Migration Tool

Once you upgrade your index and do just one commit with the new version, rolling back is no longer possible. In the Lucene code base and Solr documentation, there’s a description of an index migration tool, but it’s barely usable and does pretty much nothing.

The consensus was to remove it. A short discussion took place shortly afterwards on the dev mailing list, independently confirming the impression from the Birds of a Feather.

This tool just causes confusion and can lead users in the wrong direction.

And that’s the conclusion for the Birds Of A Feather 2024 summary!

What do you think? What are the main pain points of Apache Lucene/Solr in 2024 and what can we do to improve them?

Let us know in the comments!

Alessandro Benedetti