
Apache Lucene/Solr: the Top 10 Pain Points – Community Over Code 2023 Edition

Community Over Code North America 2023 (formerly ApacheCon) took place from the 7th to the 10th of October in Halifax (Canada).
As usual, it was an amazing conference, full of energy and passion for the Open Source projects we love.
I am personally really grateful for the time spent with many fellow Lucene/Solr committers during and after the conference, something I would definitely like to do more often.

Aside from the search track during the conference, we had the pleasure of hacking around a bit at the Hackathon Day organised by Eric Pugh on the 8th of October, and of checking the status of the project at the Apache Lucene/Solr Birds of a Feather (BoF) on the 9th of October, jointly organised by David Smiley, Eric Pugh and myself (Alessandro Benedetti).
The main activity was to go over the pain points identified last year in New Orleans and see whether any progress had been made in each area.
More than 20 people from both the Apache Lucene and Apache Solr communities attended for a great discussion, and that makes us very proud: community over code indeed!

We would have needed much more time to explore everything calmly, but it’s a first step. We’ll definitely organise more at future conferences, possibly separate Birds of a Feather sessions on different topics.
Special thanks go to Stefan Vodita, who took notes while the Birds of a Feather was happening.
So let’s explore what we have achieved as a community over the past year on each of the main pain points identified last year:

With regard to managing ontologies, not much has happened, but does it really matter anymore?
With the advent of Large Language Models, sentence transformers that encode text into vectors, and synonyms auto-generated by neural networks, the need for ontology management and knowledge base integrations has slowly faded.
The way positional queries are implemented over multi-term synonyms has improved, but the community seems to agree that this topic is no longer a pain point or a priority…

Cross-index (cross-collection) joins have been improved.
Also, K-NN parent-child joins have been enhanced by a huge piece of work in Lucene 9.8 that allows nested vectors to be modelled for a parent document. Sure, the performance may still need some care, but the feature is solid and available for users to experiment with: https://github.com/apache/lucene/pull/12434
The general consensus was that index-time joins are now decently fast, while query-time joins remain a costly operation.
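
To give an intuition of what "nested vectors for a parent document" means, here is a minimal conceptual sketch in plain Python. It is not how Lucene implements the feature (Lucene relies on block joins and HNSW graphs); it is a brute-force illustration that assumes each parent is scored by its best-matching child vector. The function names and the toy data are hypothetical.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def knn_parents(query, children, k):
    """Return the top-k (parent_id, score) pairs, best score first.

    `children` is an iterable of (parent_id, vector) pairs: each parent
    document owns one or more child vectors (for example, one per paragraph).
    Assumption: a parent is scored by its best-matching child, so it appears
    at most once in the results.
    """
    best = {}
    for parent_id, vector in children:
        score = cosine(query, vector)
        if score > best.get(parent_id, float("-inf")):
            best[parent_id] = score
    return heapq.nlargest(k, best.items(), key=lambda item: item[1])


# Hypothetical toy data: two parents with two child vectors each.
children = [
    ("doc-1", [1.0, 0.0]),
    ("doc-1", [0.0, 1.0]),
    ("doc-2", [0.6, 0.8]),
    ("doc-2", [-1.0, 0.0]),
]
top = knn_parents([1.0, 0.0], children, k=2)
```

The point of the sketch is that a parent shows up only once, however many of its children match the query, so the top-k is not flooded by chunks of the same document.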

Little progress has been made in this area and the problem remains.
Sure, there are some additional tutorials (especially for vector-based search) with added example datasets, but nothing major happened to make it easy to navigate the huge jungle of Lucene/Solr functionality.
On the Lucene side, there’s the demo package available, but not much more.

Not much progress has been made in this area either. It should be possible to use Grafana to do it, but we need at least some documentation explaining how, and ideally Solr would export such metrics natively.

Unfortunately, labels on both GitHub and Jira are still not used consistently.
This doesn’t apply only to the newdev label, but more generally. To a certain extent, GitHub made labelling even worse.
Additionally, there are duplicate labels that mean the same thing, which adds further fragmentation.
We need to do better as a community: if you create an issue or pull request, please take the extra 5 minutes to label it properly!
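
As a rough illustration of how duplicate labels could be spotted automatically, here is a small Python sketch. It assumes that many duplicates are just spelling variants of the same name (case, spaces, hyphens), so it will not catch true synonyms. The label names below are hypothetical, not taken from the actual Lucene or Solr trackers.

```python
import re
from collections import defaultdict


def normalise(label):
    """Lower-case a label and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", label.lower())


def find_duplicate_labels(labels):
    """Group labels whose normalised form collides, e.g. 'newdev' and 'new-dev'."""
    groups = defaultdict(list)
    for label in labels:
        groups[normalise(label)].append(label)
    return [sorted(group) for group in groups.values() if len(group) > 1]


# Hypothetical label names, for illustration only.
labels = ["newdev", "new-dev", "New Dev", "bug", "documentation", "Bug"]
duplicates = find_duplicate_labels(labels)
```

Each group returned is a candidate for merging into a single label; a human still needs to decide which spelling survives.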

Splainer has been integrated as a module, which is a good improvement, but in general it is still quite hard to explain and compare results.
With the addition of vector-based search the situation has degraded even further: effectively there’s no explanation for the HNSW K-NN search; we only get an additional message stating that “the result is within the top-k”.
Is there a need for more advanced explainability tools at the Lucene level? That’s an interesting question for sure!

At the moment Lucene guarantees back-compatibility for the default codec, but not for the others.
For experimental codecs, the situation is a bit more complicated: maintaining back-compatibility for them is quite costly and manual.

There are still way too many faceting modules and highlighters, but at least the analytics module has been deprecated.
Is this necessarily a bad thing? For expert users, it gives great flexibility, but for new users, it’s a nightmare.
Furthermore, it should be possible to merge some of these into single pieces of functionality, but that takes effort.

The Apache Solr reference guide got bigger and gained better search functionality (which is great for a search engine technology! :))
Also, we should have more opinionated sections offering advice from practitioners and experts.
On the Lucene side, unfortunately, not much has happened: we had some minor improvements to the Javadocs for faceting, but nothing major.

We got the Apache Solr monthly meetup initiative running and that’s definitely a big improvement.
Also, the hackathon organised at Community Over Code was definitely beneficial for the community.
We should also periodically monitor open/closed Pull Request rates and raise an alert if certain Pull Requests are ignored for too long.
We should probably review open Pull Requests periodically; this should happen at meetups or when we meet at conferences.
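
The kind of monitoring suggested above could start from something as simple as the following Python sketch. It is only an illustration under stated assumptions: the `PullRequest` record, the 30-day threshold and the sample data are all hypothetical, and real data would have to be fetched from GitHub.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class PullRequest:
    number: int
    state: str           # "open" or "closed"
    last_activity: date  # date of the last comment, review or commit


def stale_pull_requests(pulls, today, max_idle_days=30):
    """Return open pull requests with no activity for more than max_idle_days."""
    cutoff = today - timedelta(days=max_idle_days)
    return [pr for pr in pulls if pr.state == "open" and pr.last_activity < cutoff]


def open_ratio(pulls):
    """Fraction of pull requests that are still open (0.0 for an empty list)."""
    if not pulls:
        return 0.0
    return sum(pr.state == "open" for pr in pulls) / len(pulls)


# Hypothetical records, for illustration only.
pulls = [
    PullRequest(1, "open", date(2023, 6, 1)),
    PullRequest(2, "closed", date(2023, 9, 20)),
    PullRequest(3, "open", date(2023, 10, 5)),
    PullRequest(4, "closed", date(2023, 5, 2)),
]
today = date(2023, 10, 10)
stale = stale_pull_requests(pulls, today)
```

The list of stale Pull Requests could then feed a periodic reminder on the mailing list, or the agenda of the next meetup.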

We also identified additional concerns:

It’s clear we are lagging behind other open source search engines. We definitely need more contributors with machine learning expertise and more funds to sponsor these contributions. Many of the other companies are investing a substantial amount of money in this, while on the Solr side it has mostly been the self-funded effort of my company, Sease. A blog post will follow on our next plans and how to make them happen!

Was the migration to GitHub a success for Lucene?
Well, labelling got worse: for example, it is not easy to identify what ends up in a release, and that should be automated.
GitHub search, on the other hand, seems to work considerably better.
Also, the GitHub interface is much more user-friendly for young/new users.

This is an everlasting pain point for contributors: it’s a file that defines where a specific feature ends up in a release, but it’s manual, easy to forget and prone to conflicts.
We should definitely find a better way of doing it, but no pragmatic solution has been found yet.

And that’s the conclusion for the Birds Of A Feather 2023 summary!

Search Solutions ’23 is coming to London on the 21st and 22nd of November, with a London Information Retrieval Meetup on the 20th of November.
Another great occasion for search practitioners to meet, if you are in London during that period!

What do you think? What are the main pain points of Apache Lucene/Solr in 2023 and what can we do to improve them?
Let us know in the comments!
