Community Over Code EU 2024 (formerly ApacheCon) took place from the 3rd to the 5th of June in Bratislava (Slovakia).
It was a great conference, full of energy, passion and discussions about the Open Source projects we love.
It was hugely pleasant and thought-provoking to meet many fellow Lucene/Solr committers, contributors and practitioners during and after the conference. Bratislava has a small city centre, so it was quite easy to meet up and continue the discussions afterwards: it was such a great location!
The search track during the conference was somewhat short. There were interesting talks (we talked about Hybrid Search in Apache Solr), but I hope we’ll get more talks in the next editions!
On the 3rd of June, Anshum and I organised a Lucene/Solr Birds of a Feather (BoF) to collect feedback from the community and assess the status of the projects.
This blog post summarises that event. Huge thanks to Stefan Vodita, a Lucene committer, who took notes during the event. Without his effort, this blog post would not have been possible.
Roughly 10 people participated (with some on and off during the 60-90 minutes of the event):
Anshum
Part of Lucene and Solr from 2006 to now. Committer and PMC member; works at Apple, helping people use Solr and Lucene.
Andreea
Works at Emag. Stores keywords in Solr, as very small docs. They plan to use Apache Solr more extensively; right now it is used to provide some sort of promotion of sellers per query.
Alex
Works at Microsoft on managed Cassandra (that is shipped with a Lucene extension).
Atita
Has worked in search since 2007. She works with a lot of vector search use cases right now. She’s very passionate about the Solr community.
Reinhart
He’s following many projects around Solr. Nowadays he works mostly in project management. Last year his company started developing a search analytics platform.
This came from the observation that many projects suffer from a lack of customer data. He noticed OpenSearch is looking in this direction, but Solr is missing it (SOLR-10359).
Claude
Likes to experiment with search space reduction strategies. Interested in indexing bit-vectors for Bloom filter applications.
Adrien
Currently working at AMD as a firmware developer for network cards. Worked with Lucene at Amazon as an intern. Curious to hear what’s new.
Stefan
Works in Amazon product search and Lucene. Interested in aggregations (Do you know Lucene has aggregations too?).
Ranimier
Works on Cassandra, which uses a bit of Lucene. Trying to contribute JVector.
Alessandro
Apache Lucene/Solr committer and Solr PMC member. Director of Sease and author of this blog post 🙂
The session started quite informally with a round of introductions from all participants, and then we moved to a “raise your hand” model to discuss interesting points about Lucene and Solr.
Are Lucene and Solr offering Vector-Based search?
The first point of discussion, raised by Atita, was that not many people know that Apache Lucene and Solr are capable of vector-based and hybrid search: this is visible from LinkedIn discussions, insight diagrams and outreach posts on the matter.
There has been some effort to promote these features (I’ve done this myself with my talks at conferences and blog posts), but it’s clearly not enough.
The overall idea is that the PMC should focus more on these activities and encourage non-coder contributors.
Wait a minute!
Having online demos could help a lot in showcasing such functionalities, but we (Apache Solr project) need servers and maintainers (something similar to Search.py by Mike McCandless).
With regard to an official blog, I am delighted to share that we (Apache Solr PMC) launched the Solr Blog initiative, and anyone is welcome to contribute their own or external blog posts!
Another interesting suggestion was to aggregate Apache Lucene/Solr videos on a dedicated official channel, definitely something to look into.
Short-Term Action: Pull Request to change both Solr and Lucene websites to clearly advertise their support for vector-based search and hybrid search.
– 13514
– 17341
Mid-Term Action: As Atita put it, ‘Demos speak louder than words’: it would be cool to have a live demo showcasing Solr’s capabilities on a well-known, easy-to-understand public dataset (MS Marco?).
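To give a flavour of the feature under discussion, here is a minimal Python sketch of how a vector query and a simple hybrid query can be expressed as Solr request parameters. It relies on the {!knn}, {!bool} and {!edismax} query parsers; the field names (title, vector), the toy three-dimensional vector and the helper function names are assumptions made up for illustration. A real setup needs a dense vector field in the schema and query vectors produced by the same model used at indexing time.

```python
from urllib.parse import urlencode


def knn_query(field, vector, top_k=10):
    """Build a Solr {!knn} query string for a dense vector field."""
    values = ", ".join(repr(float(v)) for v in vector)
    return f"{{!knn f={field} topK={top_k}}}[{values}]"


def hybrid_params(text, vector, text_field="title", vector_field="vector", top_k=10):
    """Combine a lexical clause and a vector clause as optional (should) clauses.

    Field names are hypothetical; adapt them to your own schema.
    """
    return {
        "q": "{!bool should=$lexicalQuery should=$vectorQuery}",
        "lexicalQuery": f"{{!edismax qf={text_field}}}{text}",
        "vectorQuery": knn_query(vector_field, vector, top_k),
        "fl": "id,score",
    }


# Toy example: a real query vector would come from an embedding model.
params = hybrid_params("apache solr", [0.1, 0.2, 0.3])
query_string = urlencode(params)
print(query_string)
```

The resulting string can be appended to a collection's /select endpoint. Combining the two clauses with should is only one possible way to do hybrid search, not the only one.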
What about the default Apache Solr configurations?
How many times have you seen your company or your clients using the default schema.xml or solrconfig.xml, even with exactly the same default comments?
In my case, a lot.
To counteract this, at one point there was an initiative to provide a minimal schema.xml/solrconfig.xml, plus example directories showing how to configure Solr for the most common scenarios (simplest solr config), but it’s still an open issue.
Relying on the defaults sometimes even leads to unexpected and incorrect configurations, because the original schemas are old and outdated (for example, creating unique fields instead of unique values).
Ideally, we want to have a minimal configuration and very clear examples, well-categorized and simple enough to make Solr look good.
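As an illustration of what a small, explicit configuration can look like, the sketch below builds a Schema API payload that declares just two fields instead of inheriting a full default schema. The add-field-type and add-field commands and solr.DenseVectorField are Solr features, but the field names, the type name knn_vector, the dimension of 384 and the use of text_general (which assumes that type is already defined in the collection's schema) are my own assumptions, not a recommended configuration.

```python
import json


def minimal_schema_commands(text_field="title", vector_field="vector",
                            dimension=384, similarity="cosine"):
    """Return Schema API commands for one text field and one vector field.

    All names and values here are hypothetical examples.
    """
    return {
        "add-field-type": {
            "name": "knn_vector",
            "class": "solr.DenseVectorField",
            "vectorDimension": dimension,
            "similarityFunction": similarity,
        },
        "add-field": [
            # Assumes a "text_general" field type already exists in the schema.
            {"name": text_field, "type": "text_general",
             "indexed": True, "stored": True},
            {"name": vector_field, "type": "knn_vector",
             "indexed": True, "stored": True},
        ],
    }


payload = json.dumps(minimal_schema_commands(), indent=2)
print(payload)
```

The JSON can be sent with a POST request to the collection's /schema endpoint. Keeping the payload this short makes every setting a deliberate choice rather than a leftover from a default file.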
Too many ways of doing the same things
This is a recurring topic: we discussed it both in New Orleans (ApacheCon 2022) and Halifax (Community Over Code 2023).
Many features in Lucene and Solr can be implemented in different ways and with different components.
There is always going to be THE ideal solution, but a person or a team may well end up implementing a feature in a (more or less) OK way, only to discover later that it doesn’t scale well and come away with the wrong impression of the project.
Flexibility is powerful for the expert user but very delicate for the average practitioner.
This is still an open problem, and we should probably address it with better documentation and more deprecations.
Cassandra and vector-based search
Cassandra has integrated vector-based search using JVector.
One of the reasons was that it proved to be quite difficult and time-consuming to contribute the functionality to Lucene.
The investment in this work was driven more by the hype and the sentiment in the community than by immediate requirements from clients.
There were rumours that the integration happened in a week using GitHub Copilot, but in reality it’s more likely that this was only true of the first prototype, and that for production-level code it was then replaced by something less risky (especially from a licensing perspective).
Index Migration Tool
Once you upgrade your index and do just one commit with the new version, rolling back is no longer possible. In the Lucene code base and Solr documentation, there’s a description of an index migration tool, but it’s barely usable and does pretty much nothing.
The consensus was to remove it, and a short discussion took place on the dev mailing list soon after, independently confirming the impression from the Birds of a Feather.
The tool is simply confusing and can lead users in the wrong direction.
And that’s the conclusion for the Birds Of A Feather 2024 summary!
What do you think? What are the main pain points of Apache Lucene/Solr in 2024 and what can we do to improve them?
Let us know in the comments!





