Community Over Code North America 2023 (formerly ApacheCon) took place from the 7th to the 10th of October in Halifax, Canada.
As usual, it was an amazing conference, full of energy and passion for the Open Source projects we love.
I am personally really grateful for the time spent with many fellow Lucene/Solr committers during and after the conference, something I would definitely like to do more often.
Aside from the search track during the conference, we had the pleasure of hacking around a bit at the Hackathon Day organised by Eric Pugh on the 8th of October, and of checking the status of the project at the Apache Lucene/Solr Birds of a Feather (BoF) on the 9th of October, organised jointly by David Smiley, Eric Pugh and myself (Alessandro Benedetti).
The main activity was to go over the pain points identified last year in New Orleans and see whether progress had been made in any area.
More than 20 people attended for a great discussion, both from the Apache Lucene and Apache Solr communities, and that makes us very proud: community over code indeed!
We would have needed much more time to explore everything calmly, but it’s a first step; we’ll definitely organise more sessions at future conferences, possibly different Birds of a Feather on different topics.
Special thanks go to Stefan Vodita who took notes as the Birds Of a Feather was happening.
So let’s explore what we have achieved as a community in the past year, for the main pain points identified last year:
Regarding ontology management, not much has happened, but does it really matter anymore?
With the advent of Large Language Models, sentence transformers that encode text into vectors, and auto-generated synonyms produced by neural networks, the need for ontology management and knowledge base integrations has slowly faded.
There have been improvements in the way positional queries are implemented over multi-term synonyms, but the community seems to agree that this topic is no longer a pain point or a priority.
Cross-index (cross-collection) joins have been improved.
K-NN parent-child joins have also been enhanced, with a huge piece of work in Lucene 9.8 that allows modelling nested vectors for a parent document. Performance may still need some care, but the feature is solid and available for users to experiment with: https://github.com/apache/lucene/pull/12434
The general consensus was that Index-time joins are now decently fast while query-time joins still remain a costly operation.
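To make the index-time vs query-time distinction concrete, here is a sketch of the two join flavours as Solr query fragments (the collection and field names are hypothetical, purely for illustration):

```
# Query-time join within one collection: match parent documents whose
# id is referenced by child documents matching the inner query.
# This is the flavour that remains costly at query time.
q={!join from=parent_id to=id}color:red

# Cross-collection join (one of the improved areas), joining against a
# hypothetical "products" collection:
q={!join method=crossCollection fromIndex=products from=product_id to=id}category:books
```

Index-time (block) joins avoid this per-query cost by indexing parent and child documents together as a block, which is why they came out as the faster option in the discussion.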
Showcase of core features
Little progress happened in this area and the problem remains.
Sure, there are some additional tutorials (especially for vector-based search) with added example datasets, but nothing major happened to make it easy to navigate the huge jungle of Lucene/Solr functionalities.
From the Lucene side, there’s the demo package available, but not much more.
Metrics at collection/node level
Not much progress has happened in this area either. It should be possible to use Grafana for this, but we need at least some documentation explaining how, and ideally Solr should export such metrics natively.
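For anyone wanting to experiment in the meantime, Solr does ship a Prometheus exporter module that can feed Grafana dashboards. A minimal sketch, assuming a local standalone Solr on port 8983 (the script path and flags vary by Solr version, so treat this as a starting point rather than a recipe):

```
# Hypothetical exporter invocation (check your version's docs for exact path/flags):
#   bin/solr-exporter -p 9854 -b http://localhost:8983/solr -f conf/solr-exporter-config.xml
#
# prometheus.yml: scrape the exporter so Grafana can chart node/collection metrics
scrape_configs:
  - job_name: 'solr'
    scrape_interval: 15s
    static_configs:
      - targets: ['localhost:9854']
```

The missing piece identified at the BoF is exactly this glue: documented, maintained dashboards rather than each user rediscovering the setup.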
Finding new issues to work on
Unfortunately labels both on Github and Jira are still not used consistently.
This doesn’t apply only to the newdev label, but to labelling in general. To a certain extent, GitHub made labelling even worse.
Additionally, there are duplicate labels meaning the same thing, which adds further fragmentation.
We need to do better as a community: if you create an issue or pull request, please take the extra 5 minutes to label it properly!
Splainer has been integrated as a module, which is a good improvement, but it is generally still quite hard to explain and compare results.
With the addition of vector-based search, the situation has degraded even further: effectively there is no explanation for HNSW K-NN search; we only get an additional message stating that “the result is within the top-k”.
Is there a need for more advanced explainability tools at the Lucene level? That’s an interesting question for sure!
At the moment, Lucene guarantees backward compatibility for the default codec, but not for the others.
For experimental codecs the situation is more complicated: maintaining backward compatibility for them is quite costly and manual.
Too many ways of doing things
There are still far too many faceting modules and highlighters, but at least the analytics module has been deprecated.
Is this necessarily a bad thing? For expert users, it gives great flexibility but for new users, it’s a nightmare.
Furthermore, it should be possible to merge some of these into single functionalities but it takes effort.
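Faceting is a good example of this duplication: the same field facet can be requested through at least two different Solr APIs, each backed by different code. A sketch, with a hypothetical "category" field:

```
# Classic facet parameters:
q=*:*&facet=true&facet.field=category

# JSON Facet API, expressing the same request through a different module:
q=*:*&json.facet={categories:{type:terms,field:category}}
```

Merging such parallel implementations into a single functionality is exactly the kind of effort mentioned above: valuable for new users, but expensive to do without breaking expert workflows.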
The Apache Solr Reference Guide has grown bigger and gained better search functionality (which is great for a search engine technology! :))
We should also have more opinionated sections where practitioners and experts share their advice.
From the Lucene side, unfortunately, not much has happened: there were some minor improvements to the faceting Javadocs, but nothing major.
Contributing and getting attention
We got the Apache Solr monthly meetup initiative running and that’s definitely a big improvement.
Also, the hackathon organised at Community Over Code was definitely beneficial for the community.
We should also periodically monitor open/closed Pull Request rates and raise an alert if certain Pull Requests are ignored for too long.
We should probably also review open Pull Requests periodically; this could happen at meetups or when meeting at conferences.
We also identified some additional concerns:
It’s clear we are lagging behind other open source search engines. We definitely need more machine learning experts as contributors and more funds to sponsor these contributions: many of the other companies are investing substantial amounts of money in this, while Solr has mostly relied on the self-funded effort of my company, Sease. A blog post will follow on our next plans and how to make them happen!
Was the migration to GitHub a success for Lucene?
Well, labelling got worse; for example, it is not easy to identify what ends up in a release, and that should be automated.
GitHub search, on the other hand, seems to work considerably better.
Also, the GitHub interface is much more user-friendly for young/new users.
This is an everlasting pain point for contributors: it’s a file that defines where a specific feature ends up in a release, but it’s manual, easy to forget, and prone to conflicts.
We should definitely find a better way of doing it, but no pragmatic solution has been found yet.
What do you think? What are the main pain points of Apache Lucene/Solr in 2023 and what can we do to improve them?
Let us know in the comments!
Subscribe to our newsletter
Did you like this post about Apache Lucene/Solr: the Top 10 Pain Points? Don’t forget to subscribe to our Newsletter to stay up to date with the Information Retrieval world!