Digging in the Solr code: 5 minutes howto

Let’s say you need to write a component, a request handler, or in general some piece of custom code that needs to be plugged into Solr. Or you need a deeper understanding of some Lucene/Solr internals, following what actually happens within the code.

I know: unit tests, integration tests, everything to make sure things behave as you would expect. But here I’m talking about something different: while developing, it is (at least for me) very useful to have a productive debug environment where, using short dev iterations, I can follow step by step what’s happening within the code and take a deep look at how things actually work behind the scenes.

In my experience I found that useful in a couple of scenarios:

  • I have to write some Solr add-on: in this case I want a development environment that lets me write and debug code as fast as possible
  • I have to study some Solr internals: let’s say, for example, I need to check what happens at retrieval time when a field is both docValues="true" and stored="true"; where does Solr get the field value from?

Let’s see how both of them can be accomplished in a few minutes!

Step #1: clone our template repository

Clone the following repository [1]

Once imported in your favourite IDE, the project layout will look like this:

[Image: template-project-imported-1.png]

As you can see, the template project provides:

  • A custom TokenFilter which simply prints the output tokens to standard output during text analysis. Note this is just an example (useful if you want to debug an analyzer): I could have created a SearchComponent, a Tokenizer or whatever else I needed.
  • a sample Solr configuration, with a minimal set of things configured
  • a Test Supertype layer (BaseIntegrationTest) and a sample Test (Tests) which loads some data, executes a query and then prints out the results.

Surprisingly, that’s all! There’s no second step!

Use Case #1: implement, debug and test an add-on

As previously said, the example repository already contains a simple add-on: a TokenFilter that prints to standard output each token produced in the analysis chain. The filter has been declared in the Solr configuration as part of the “text” field type analyzer:

<fieldType name="text" class="solr.TextField">
        <!-- excerpt: the rest of the analyzer definition is omitted here -->
        <filter class="io.sease.labs.solr.SystemOutTokenFilterFactory"/>
</fieldType>

The test class triggers that analyzer because it indexes some documents, so if you run it as a plain JUnit test, you will see the following output:

startOffset=0,endOffset=6,positionIncrement=1,positionLength=1,type=word => Object
startOffset=7,endOffset=15,positionIncrement=1,positionLength=1,type=word => Oriented
startOffset=16,endOffset=24,positionIncrement=1,positionLength=1,type=word => Software
startOffset=25,endOffset=37,positionIncrement=1,positionLength=1,type=word => Construction
startOffset=0,endOffset=6,positionIncrement=1,positionLength=1,type=word => Design
startOffset=7,endOffset=16,positionIncrement=1,positionLength=1,type=word => Patterns:
startOffset=17,endOffset=25,positionIncrement=1,positionLength=1,type=word => Elements
startOffset=26,endOffset=28,positionIncrement=1,positionLength=1,type=word => of
startOffset=29,endOffset=37,positionIncrement=1,positionLength=1,type=word => Reusable
startOffset=38,endOffset=53,positionIncrement=1,positionLength=1,type=word => Object-Oriented
startOffset=54,endOffset=62,positionIncrement=1,positionLength=1,type=word => Software
DOC 1 
id = 1
title = Object Oriented Software Construction

DOC 2 
id = 2
title = Design Patterns: Elements of Reusable Object-Oriented Software
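
To make the attributes in that output easier to read, here is a minimal plain-Java sketch (no Lucene dependency) that produces the same kind of lines for the two titles. It assumes the analyzer splits on whitespace and leaves tokens untouched, which is only inferred from the output above. The names `TokenOffsetsSketch` and `describe` are hypothetical, and the real filter in the template project works on Lucene token attributes instead.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, Lucene-free model of the printed token attributes.
class TokenOffsetsSketch {

    // Splits the text on whitespace (an assumption about the analyzer)
    // and formats one line per token, with its start and end offsets.
    static List<String> describe(String text) {
        List<String> lines = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            // Skip the whitespace that separates tokens.
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            if (i == text.length()) {
                break;
            }
            int start = i;
            // Consume the token up to the next whitespace (or end of text).
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            lines.add("startOffset=" + start + ",endOffset=" + i
                    + ",positionIncrement=1,positionLength=1,type=word => "
                    + text.substring(start, i));
        }
        return lines;
    }

    public static void main(String[] args) {
        describe("Object Oriented Software Construction").forEach(System.out::println);
        describe("Design Patterns: Elements of Reusable Object-Oriented Software")
                .forEach(System.out::println);
    }
}
```

Note how startOffset and endOffset are character positions within each title, which is why they restart from 0 for the second document.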

If you put a breakpoint in the token filter and then re-run the Tests class in debug mode, the debugger will stop at that line, as expected.

Use Case #2: debugging Solr internals

In this case there’s no custom code because, remember, the goal is to investigate some Solr internals. Specifically, the question I have to answer in this example is: assuming we have a field

<field name="myfield" type="string" docValues="true" stored="true"/>

and a request that asks for that field (fl=myfield), where does Solr get the field value from [1]?

The first thing I have to do is to change something in the project:

    • schema.xml: add the field definition above
    • Tests class: change the query parameters (adding fl=myfield) and add some value for the myfield field in the indexed documents.

Now, a premise: since the goal of this blog post is not to actually answer the question above, we will skip the whole investigation phase needed to understand the overall query execution flow and to find the right place for the breakpoint.

After some investigation, we find that the RetrieveFieldOptimizer [2] class plays a fundamental role in that process, so let’s open it and set a breakpoint:

[Image: retrieveFieldOptimizer-1.png]

As you can see, the name and the intent of that class are quite clear, but I still want to see what happens at runtime: let’s start the Tests class in debug mode and, as expected, the debugger stops at the breakpoint:

[Image: debug_1.png]

I can see the field “myfield” has been collected in the “storedFields” set, while the dvFields (DocValues fields) set is empty, even though the field has the docValues flag enabled. That probably tells me something…

Moving forward, we arrive at the optimize method, where we meet the optimisation described in SOLR-8344 [3]:

[Image: debug_2.png]

Again, this is just an example and the goal here is not to describe the findings; briefly, however, it says that if all requested fields

    • have the docValues and stored flags enabled
    • are not multivalued

then Solr retrieves the values only from docValues.
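
The rule just described can be written down as a small plain-Java sketch. It is only a model of the condition as stated in this post, not Solr’s actual code: `DocValuesOnlyRuleSketch`, `FieldFlags`, `canUseDocValuesOnly` and the `tags` field are hypothetical names introduced for illustration.

```java
import java.util.List;

// Hypothetical model of the rule described above; not Solr code.
class DocValuesOnlyRuleSketch {

    // Stand-in for the schema flags of a requested field.
    static final class FieldFlags {
        final String name;
        final boolean stored;
        final boolean docValues;
        final boolean multiValued;

        FieldFlags(String name, boolean stored, boolean docValues, boolean multiValued) {
            this.name = name;
            this.stored = stored;
            this.docValues = docValues;
            this.multiValued = multiValued;
        }
    }

    // True when every requested field is stored, has docValues enabled
    // and is single-valued, i.e. the case where values can come from docValues only.
    static boolean canUseDocValuesOnly(List<FieldFlags> requestedFields) {
        return requestedFields.stream()
                .allMatch(f -> f.stored && f.docValues && !f.multiValued);
    }

    public static void main(String[] args) {
        // Mirrors <field name="myfield" type="string" docValues="true" stored="true"/>
        FieldFlags myfield = new FieldFlags("myfield", true, true, false);
        // Hypothetical multivalued field, used only to show the negative case.
        FieldFlags tags = new FieldFlags("tags", true, true, true);

        System.out.println(canUseDocValuesOnly(List.of(myfield)));
        System.out.println(canUseDocValuesOnly(List.of(myfield, tags)));
    }
}
```

With only myfield requested, the condition holds; adding a single multivalued field to the request is enough to make it fail.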

Happy New Year!

[2] Note that this class (and the optimisation as well) was introduced in Solr 7.

Andrea Gazzarini

Andrea Gazzarini is a curious software engineer, mainly focused on the Java language and Search technologies. With more than 15 years of experience in various software engineering areas, his adventure in the search world began in 2010, when he met Apache Solr and later Elasticsearch.
