Apache Solr: the Top 10 Things We Can Do Better – ApacheCon 2022 Edition

ApacheCon North America 2022 concluded last Thursday, the 6th of October, in New Orleans (Louisiana). It was a terrific conference, vibrant and thought-provoking.
For the Apache Solr community especially, it was a fantastic moment to gather.
I personally enjoyed the company of many fellow Lucene/Solr committers during and after the conference, and this makes these events even more special.
The Apache Solr Birds of a Feather (BoF) session, organized by David Smiley, was fundamental to exploring what users and would-be contributors think about our beloved project, and oh boy, how much we discovered!
18 people attended: 7 were only interested in Apache Lucene and 11 had some (or a lot of) experience with Apache Solr.
As an Apache Lucene/Solr committer and PMC member, my priority is to improve the quality of the project and to involve the community more, so let’s explore the top 10 things we identified as current Apache Solr pain points!

10. Managing ontologies

There was a time when another Apache project was on the rise as the king of ontology management for text enrichment: its name was Apache Stanbol.
I am using the past tense here because the project has been retired to the Attic, so it’s not maintained anymore.
Since then, it has been hard to fill the gap with open-source software, and a plugin/extension of Apache Solr could be spot on.
What about a dedicated Apache Solr collection/managed resource to store/search ontologies from a knowledge base (maybe directly importing triple stores/RDF?), possibly providing a tagging service to be used in conjunction with an update request processor (maybe a Text Tagger 2.0)?
This area has not been actively researched and developed recently, but this means there’s an opportunity to shine through new contributions!

9. (Performant?) Joins

Joins, strictly speaking, are implemented in both Apache Lucene and Solr. They provide the ability to run searches over nested and related documents (joined at query time or at index time), including across indexes and collections on different instances.
Joins are definitely an amazing feature, but they have always been recommended only as a last resort, given the performance implications.
Many users are using (and possibly abusing?) this feature anyway, certainly with difficulties.
Possibly it’s time to invest strongly in this direction and improve/re-design part of this feature.
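
For readers who have not met the feature, a query-time join is expressed through the local parameters of the join query parser. Here is a minimal Python sketch that only assembles the request parameters and never contacts a Solr instance. The field names (manufacturer_id, id), the core name (manufacturers) and the helper build_join_params are hypothetical, chosen purely for illustration.

```python
from urllib.parse import urlencode


def build_join_params(from_field, to_field, inner_query, from_index=None):
    """Build request parameters for a Solr query-time join (hypothetical helper).

    The inner query runs first; the values of `from_field` in its matches are
    then looked up in `to_field` of the documents being searched.
    """
    local_params = f"from={from_field} to={to_field}"
    if from_index is not None:
        # fromIndex points the inner query at a different core/collection.
        local_params += f" fromIndex={from_index}"
    return {"q": f"{{!join {local_params}}}{inner_query}"}


# Assumed schema: products carry a manufacturer_id, manufacturers have an id.
params = build_join_params("id", "manufacturer_id", "country:Italy",
                           from_index="manufacturers")
print(params["q"])
print(urlencode(params))
```

The inner query has to be fully evaluated before the outer documents can be filtered, which is one reason the cost grows with the number of matching "from" documents.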

8. Showcase of core features

Sure, there is the Apache Solr reference guide, an immense source of documentation that includes many tutorials and examples of how to use certain features. But what about a core section covering the most common use cases with clear, detailed examples? Something like a “getting started” on steroids, potentially with additional code examples and a live Solr instance, managed by the community, to interact with.

7. Metrics at collection/node level

Currently, it’s possible to investigate in detail what happens at the core level, but it’s not easy to gather information and stats at the collection level (distributed) or at the node level (summing up all the cores).
This matters especially for usage stats, such as the total number of query requests received across all request handlers and the overall indexing throughput.
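
To give an idea of the manual work involved today, here is a minimal Python sketch that sums a per-core request counter into per-collection and node totals. It assumes the per-core registries are named solr.core.<collection>.<shard>.<replica> and that each counter exposes a "count" field; the exact response shape depends on the Solr version, so the sample data below is hand-written and the helper requests_per_collection is hypothetical.

```python
from collections import defaultdict


def requests_per_collection(metrics, metric_key="QUERY./select.requests"):
    """Sum a per-core counter into per-collection totals and a node total."""
    per_collection = defaultdict(int)
    for registry, values in metrics.items():
        parts = registry.split(".")
        # Assumed registry naming: solr.core.<collection>.<shard>.<replica>
        if len(parts) < 3 or parts[:2] != ["solr", "core"]:
            continue  # skip non-core registries such as solr.jvm
        count = values.get(metric_key, {}).get("count", 0)
        per_collection[parts[2]] += count
    return dict(per_collection), sum(per_collection.values())


# Hand-written sample shaped like the metrics section of one node's response.
sample = {
    "solr.core.products.shard1.replica_n1": {"QUERY./select.requests": {"count": 120}},
    "solr.core.products.shard2.replica_n3": {"QUERY./select.requests": {"count": 80}},
    "solr.core.logs.shard1.replica_n1": {"QUERY./select.requests": {"count": 15}},
    "solr.jvm": {"memory.heap.used": 1024},
}

by_collection, node_total = requests_per_collection(sample)
print(by_collection)
print(node_total)
```

This covers a single node only: a true collection-level view would also need to merge the responses of every node hosting a replica, and the simple split on dots would misread collection names that themselves contain a dot.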

6. Finding new issues to work on

We are reaching the main pain points here: Apache projects are first communities and then technological solutions, so we should facilitate how new people get into the project.
And what better way to enter a software project than to start coding? (ok, maybe this opinion is biased, I am a software engineer after all :))

Currently, there is a “newdev” label in the Apache Solr Jira, but how many people know that? (For example, I didn’t.)
Yes, it is explained here (https://cwiki.apache.org/confluence/display/solr/howtocontribute), but I couldn’t find any reference to it in the Apache Solr reference guide (please allow me the pun).
Potentially we need more: maybe different tiers of issues (new-dev-tier1, new-dev-tier2…) in increasing order of difficulty, so that new contributors can choose depending on how much time they want to (or can) dedicate.
And we definitely should update the reference guide accordingly and evangelize this more at conferences and meetups.

5. Explainability

Why has document A been returned by my query? Why is document B in 7th position and document A in 3rd?
All of us search folks have heard these questions and observations multiple times.
But explaining why a document has been returned with a certain score is quite hard, even for seasoned engineers.
Sure, there’s the debug results functionality, which is fantastic, but not that easy to read.
A (possibly retired?) Chrome plugin was a wonderful solution (we should integrate it into the Apache Solr admin UI code base!) and http://splainer.io is an external service that aims to help with explainability, but we can do better…
A user (or power user) should be able to get everything needed directly from Apache Solr itself. There’s definitely room for contribution here (and a lot of fun playing with novel strategies).
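
As a small illustration of the kind of post-processing users do by hand today, the Python sketch below turns a structured explanation into indented, readable lines. It assumes the nested value/description/details shape returned when debug.explain.structured=true is set; the sample tree is hand-written (not real Solr output) and the helper format_explain is hypothetical.

```python
def format_explain(node, depth=0):
    """Render a structured explain tree as indented 'score = description' lines."""
    lines = [f"{'  ' * depth}{node['value']:.3f} = {node['description']}"]
    for child in node.get("details", []):
        lines.extend(format_explain(child, depth + 1))
    return lines


# Parameters asking Solr for per-document scoring explanations as nested JSON.
debug_params = {
    "q": "title:solr OR body:solr",
    "debug": "results",
    "debug.explain.structured": "true",
}

# Hand-written sample in the assumed shape of one structured explanation.
sample = {
    "match": True,
    "value": 1.5,
    "description": "sum of:",
    "details": [
        {"match": True, "value": 1.0, "description": "weight(title:solr)"},
        {"match": True, "value": 0.5, "description": "weight(body:solr)"},
    ],
}

print("\n".join(format_explain(sample)))
```

Even this tiny formatter makes the contribution of each clause easier to compare across two documents, which hints at how much more a built-in explainability view could do.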

4. Back Compatibility

In the audience, two people expressed concern about the back-compatibility of Apache Solr releases, something that may hold users and enterprises back from upgrading (which in turn slows down the progress of the project, since adoption is what leads to the discovery of bugs and improvements).
We should definitely make upgrading as painless as possible, and careful back compatibility is crucial.
I personally think we are doing a good job already, but if people are noticing it, there’s room for improvement!

3. Too many ways of doing things

Two votes here as well. In Apache Solr, incremental iterations in development have brought many different ways of solving the same problems.
Faceting? Highlighting? Grouping (or collapsing)? And more…
Most of the time, each new iteration of a feature brought improvements and new capabilities but couldn’t manage to achieve “that little thingy” that the original implementation was doing sooo well…
In this way, we ended up accumulating great flexibility but also great confusion when new users arrive and need to choose among the multiple implementations of the feature they need.
Is it time to deprecate more and maybe spend extra effort in merging some of the various implementations and only get the best of those worlds?

2. Lucene/Solr documentation

Three votes to ease the steep learning curve to enter the Lucene/Solr world.
The reference guide is great and updated with each release, but not that deep into the internals.
Books are amazing for understanding the core and internals but get outdated very quickly.
The code base is immense with some good Javadocs but hard to navigate.
What are the options here?
Maybe a hosted book, detailed and maintained by the community.
Do we need more examples and code, maybe directly in the related Java package or a demo folder?
Tests are good but definitely not always the right way to approach a feature and understand it in depth.

Should we differentiate between the user path and the developer path?
There’s definitely a lot to improve here and a lot of room for new contributions (which will in turn make further contributions easier, a positive self-reinforcing loop!).

1. Contributing and getting attention

Finally, we reach the top-voted current difficulty with Apache Solr: contributing and getting enough traction to get reviews and get contributions merged.
Four people from the audience raised this issue: even if you end up with a working patch for a new feature or an existing bug, it’s really difficult to get the necessary attention if you are not in the community already or don’t know the right people.
It’s true that all committers and PMC members are volunteers, but it’s a shame that some new contributors have this feeling. We should definitely aim to be more welcoming and careful.
For sure, new contributors are encouraged to do their homework. Short contributions, laser-focused details, tests, and GitHub pull requests that tag as reviewers the people who recently touched that area of code all definitely help.
From the community side, we can aim for more frequent initiatives and meetups where committers participate and contributors are able to pinpoint their issues of interest.
Especially during conferences such as ApacheCon, adding a hackathon pre/post conference would be delightful to unblock some of these issues.
Search Solutions ’22 is coming to London on the 22nd and 23rd of November, with a London Information Retrieval Meetup in close proximity (we are finalizing the dates with our partners). It may be a good occasion for British contributors to start.