DocValues vs Stored Fields

This blog post aims to give a better understanding of docvalues and stored fields in Apache Solr for the operations in which they can be used interchangeably.

Although stored fields and docvalues were created for different purposes, both can be used to store the values of document fields and retrieve them when needed.

When you define your Solr schema, you might ask yourself whether your fields should be defined as docvalues, stored, or both. To decide, you need to understand how your application uses each field from a functional point of view. If both methods can serve your goals, you should understand how choosing one over the other will affect the performance and the space usage of your search engine.

Below, I give some tools for analyzing which method is the best choice in which scenario.

Stored fields

The main usage of stored fields is to return field values at query time. You can specify a field to be stored as follows:

<field name="name" type="string" stored="true"/>
 

Stored fields are organised in a row manner. This means that, given a set of fields, the values of these fields are concatenated in a row for each document. The rows are then stored sequentially on disk according to their Lucene doc id. Each row may have a different size depending on the number of fields defined for that document and on their data types (e.g. string or text fields have variable size). Pointers to each row are stored to allow fast access to them.

Let’s say you want to retrieve the field A from the document with lucene id lid:

  1. Find the address of the row of lid.
  2. Read the values in the row until you reach the value of the field A.
  3. Return the value of the field A.
The second point is crucial here. Because we have a pointer only to the beginning of the row, in the worst case we have to read all the stored values in the row. This implies that returning all the stored fields in a row should cost more or less the same as returning only a subset. This is an important factor to take into account when we define our Solr schema.
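
The three steps can be sketched with a toy row store. This is only an illustration of the access pattern, assuming a made-up text encoding (`name=value` lines, with values that contain no newline); the `RowStore` class and its methods are hypothetical and do not reflect Lucene's on-disk format.

```python
from typing import Dict, List, Optional, Tuple


class RowStore:
    """Toy row-oriented store: one pointer per document, fields laid out in a row."""

    def __init__(self) -> None:
        self.data = bytearray()       # all rows, one after another
        self.offsets: List[int] = []  # offsets[doc_id] = start of that document's row

    def add(self, fields: Dict[str, str]) -> int:
        """Append one document and return its doc id."""
        self.offsets.append(len(self.data))
        for name, value in fields.items():
            self.data += f"{name}={value}\n".encode("utf-8")
        self.data += b"\n"  # an empty line marks the end of the row
        return len(self.offsets) - 1

    def get(self, doc_id: int, wanted: str) -> Tuple[Optional[str], int]:
        """Return (value, number of entries read) for one field of one document."""
        pos = self.offsets[doc_id]  # step 1: jump to the beginning of the row
        scanned = 0
        while True:
            end = self.data.index(b"\n", pos)
            if end == pos:  # end of row reached: the field is not there
                return None, scanned
            name, value = self.data[pos:end].decode("utf-8").split("=", 1)
            scanned += 1    # step 2: every entry before the target is read
            if name == wanted:
                return value, scanned  # step 3: return the value
            pos = end + 1


store = RowStore()
doc = store.add({"id": "1", "title": "solr", "A": "x"})
print(store.get(doc, "A"))   # the whole row is read to reach the last field
print(store.get(doc, "id"))  # the first field is found immediately
```

The second element of the returned tuple counts how many entries were read, which makes the worst case visible: a field at the end of the row costs as much as reading the entire row.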

Stored fields space occupancy

Understanding how much space stored fields will take is a difficult task. In fact, when data is compressed, factors such as data redundancy and distribution play a fundamental role.

Compression is applied to the rows formed as described above. Two modes can be specified: BEST_SPEED and BEST_COMPRESSION. BEST_SPEED is the default and uses LZ4, a very fast compression algorithm with a good but not optimal compression ratio. To set BEST_COMPRESSION instead, you just need to enable it in solrconfig.xml as follows:

<codecFactory class="solr.SchemaCodecFactory">
  <str name="compressionMode">BEST_COMPRESSION</str>
</codecFactory>
With this option, Lucene's default codec compresses stored fields with DEFLATE instead of LZ4, trading slower retrieval for a better compression ratio.
One last remark about stored fields compression: the same row can contain very heterogeneous data, which is not the best scenario for obtaining an optimal compression ratio.

Docvalues

Docvalues were introduced to improve the performance of some operations that would otherwise be very expensive: sorting, faceting and grouping. If these operations are frequent and important in your search engine use case, you should seriously consider enabling docvalues on the fields involved.

To enable docvalues, just set the docValues property as follows:

<field name="name" type="string" docValues="true"/>

 

Docvalues can be used in place of stored fields for retrieving field values at query time, but they are serialised in a column fashion. There are 3 important differences in behaviour to take into consideration:

  1. Text fields cannot be set as docvalues. If your field has a text type and you want to return it, you are forced to set it as stored.
  2. Multivalued fields are returned in a different order: when you use stored fields, values are returned in the order in which they were indexed (the holding data structure acts as a List). Multivalued fields with docValues enabled, instead, internally use SORTED_SET or SORTED_NUMERIC depending on their type: both return values in sorted order, and SORTED_SET also removes duplicates.
  3. If you want to return docvalues with fl=*, you need to specify this in the field definition: useDocValuesAsStored="true" (note that this is the default value starting from schema version 1.6).
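
The second difference is easy to overlook, so here is a minimal sketch of what it means for a multivalued string field. The two functions are hypothetical helpers that only mimic the observable behaviour (a List for stored fields, a sorted set for SORTED_SET docvalues); they are not Solr code.

```python
from typing import List


def stored_view(values: List[str]) -> List[str]:
    """Stored fields: values come back in indexing order, duplicates included."""
    return list(values)


def sorted_set_view(values: List[str]) -> List[str]:
    """SORTED_SET-like docvalues: values come back sorted and de-duplicated."""
    return sorted(set(values))


indexed = ["solr", "lucene", "solr"]
print(stored_view(indexed))      # ['solr', 'lucene', 'solr']
print(sorted_set_view(indexed))  # ['lucene', 'solr']
```

If your application relies on the original order of the values, or on repeated values, this is a functional reason to keep the field stored.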
As anticipated, fields are serialised by column. Take as an example a generic field A. For each document in the collection, we store the values of the field A consecutively, ordered by Lucene document id. In this way, we have a run of data all belonging to the same type. Notice that for some types (e.g. string), each value has a different size and it is not predictable how much space each element requires.

Accessing the value of field A for a document X can be done as follows:

  1. Find the address of document X in the column of field A.
  2. Read the value and return it.
Finding the right address is actually a tricky operation for two reasons:
  • The field could be empty for some documents.
  • We don’t want a pointer for each docvalue, otherwise too much space would be used just for pointers.
However, because we can read the value of interest directly, we can expect that reading a single field value is faster with docvalues than with stored fields, while reading more fields requires repeating the same procedure once per field. Moreover, because the values of different fields are scattered across separate columns, reading more of them causes cache misses, which degrades performance.
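
To make the lookup concrete, here is a minimal sketch of a column that tolerates documents without a value. It is a simplification under my own assumptions: the `SparseColumn` class is hypothetical, it keeps a sorted list of the doc ids that have a value and finds the position with a binary search, whereas Lucene uses more compact structures for the same job.

```python
from bisect import bisect_left
from typing import List, Optional


class SparseColumn:
    """Toy column for one field: only documents that have a value are recorded."""

    def __init__(self) -> None:
        self.doc_ids: List[int] = []  # sorted ids of the documents with a value
        self.values: List[str] = []   # values[i] belongs to doc_ids[i]

    def add(self, doc_id: int, value: str) -> None:
        # Documents are assumed to arrive in increasing doc id order.
        if self.doc_ids and doc_id <= self.doc_ids[-1]:
            raise ValueError("doc ids must be strictly increasing")
        self.doc_ids.append(doc_id)
        self.values.append(value)

    def get(self, doc_id: int) -> Optional[str]:
        # Step 1: find the position of the document in the column.
        i = bisect_left(self.doc_ids, doc_id)
        # Step 2: read the value, if the document has one.
        if i < len(self.doc_ids) and self.doc_ids[i] == doc_id:
            return self.values[i]
        return None


column_a = SparseColumn()
column_a.add(0, "red")
column_a.add(3, "green")
column_a.add(7, "blue")
print(column_a.get(3))  # 'green'
print(column_a.get(4))  # None: document 4 has no value for this field
```

Returning N fields for one document means N independent lookups like this one, each in a different column, which is where the cost of retrieving many docvalues fields comes from.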

Docvalues space occupancy

It is worth spending a few words on the space occupancy of docvalues without digging into details. Things here are a bit more complicated than for stored fields. Depending on the column, data is stored and compressed in a different manner, so a column of numeric data will be stored and compressed differently from a column of strings.

  • Data is stored in a column manner, so homogeneous data will be close to each other.
  • Specialised compression is applied depending on the data type.
For these reasons, we can expect docvalues to use less space, and to use it more efficiently, than stored fields.
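
As an example of type-specific compression, a numeric column can store each value as an offset from the column minimum, using only as many bits as the largest offset needs. The sketch below shows this idea in isolation; the function names are hypothetical, and Lucene chooses among several encodings for numeric docvalues, so take it as an illustration rather than the actual format.

```python
from typing import List, Tuple


def bits_per_value(values: List[int]) -> int:
    """Bits needed per value when storing offsets from the column minimum."""
    spread = max(values) - min(values)
    return max(1, spread.bit_length())


def pack(values: List[int]) -> Tuple[int, int, int]:
    """Pack a non-empty column into (minimum, bit width, packed integer)."""
    base = min(values)
    width = bits_per_value(values)
    packed = 0
    for i, value in enumerate(values):
        packed |= (value - base) << (i * width)
    return base, width, packed


def unpack(base: int, width: int, packed: int, count: int) -> List[int]:
    """Rebuild the original column from its packed form."""
    mask = (1 << width) - 1
    return [base + ((packed >> (i * width)) & mask) for i in range(count)]


years = [2004, 2011, 1998, 2020, 2004]
base, width, packed = pack(years)
print(width)                                     # 5 bits per value instead of 64
print(unpack(base, width, packed, len(years)))   # the original values
```

This only works well because a column contains values of a single field: the same trick applied to a row mixing numbers and strings would have nothing homogeneous to exploit.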

Benchmarks

Having discussed stored fields and docvalues, we now have some knowledge of how they work. I created a Solr instance to provide some numbers about the performance of stored fields and docvalues when returning fields.

This benchmark is not meant to be exhaustive. In other scenarios, things could be very different.

We’ve created an index in this way: we’ve indexed 1 million documents taken from Wikipedia. For each document we’ve added:
  • 100 random stored string fields of 15 characters each
  • 100 random docvalues string fields of 15 characters each
The document and query collections have been taken from https://github.com/tantivy-search/search-benchmark-game. We used real collections for simulating a realistic scenario.
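
A request that returns a chosen number of fields can be built with the standard q, fl and rows parameters of the Solr select handler. The following is only a sketch: the base URL, the collection name and the field names are assumptions of mine for illustration, and the scripts actually used for the benchmark are the ones in the repository linked at the end of the post.

```python
from typing import List
from urllib.parse import urlencode


def select_url(base_url: str, collection: str, query: str,
               fields: List[str], rows: int) -> str:
    """Build a Solr /select URL that returns only the given fields."""
    params = {"q": query, "fl": ",".join(fields), "rows": rows}
    return f"{base_url}/{collection}/select?{urlencode(params)}"


# Hypothetical field names: 10 of the random stored string fields.
stored_fields = [f"stored_{i}" for i in range(10)]
print(select_url("http://localhost:8983/solr", "wikipedia", "solr", stored_fields, 100))
```

Changing the length of the field list, and switching between stored and docvalues fields, gives the different measurement points of a benchmark like this one.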
 
Execution details:
  • CPU: AMD RYZEN 3600
  • RAM: 32 GB
  • Index size: 9.07 GB
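[Chart: query time vs. number of fields returned, top 100]
[Chart: query time vs. number of fields returned, top 200]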
Retrieving a high number of fields leads to a significant worsening of performance when docvalues are used. The (almost) surprising thing, instead, is that when fewer than 20 fields are returned, docvalues perform better than stored fields, and the difference shrinks as the number of returned fields increases. This is due to a better management of docvalues in main memory.
Asking for 9 docvalues fields and 1 stored field gives an average query time of 6.86 (more than returning 10 stored fields).
Furthermore, the charts above show that query times increase linearly with the number of fields returned. This makes the speedup of stored fields over docvalues predictable.
 

In conclusion, docvalues bring several performance benefits (faceting, sorting and grouping), and they can even speed up field retrieval if only a few docvalues fields and no stored fields are returned. Moreover, docvalues are likely to use less space than stored fields. If the use case requires many fields to be returned, stored fields are the way to go.

The experiments can be replicated by using the Solr configuration and the Python scripts in https://github.com/SeaseLtd/solr-field-retrieval-benchmark. The git repository includes only a small sample of the data used for the benchmarks.


Author

Elia Porciani

Elia is a Software Engineer passionate about algorithms and data structures concerning search engines and efficiency. He is an active part of the information retrieval research community, attending international conferences such as SIGIR and ECIR.

Comment (1)

  1. christophe CERQUEIRA
    June 25, 2020

    thx for you interesting article , i have a question ; can you explain me the option DocValuesFormat and how to Defines a custom DocValuesFormat = “Disk” , because my facetting requests have an out of memory. I don’t know that is the solution to force solr user disc when facetting request are excuted
