This blog post aims to give a better understanding of docValues and stored fields in Apache Solr for the operations in which they can be used interchangeably.
Although stored fields and docValues were created for different purposes, both can be used to store the values of document fields and retrieve them when needed.
When you are defining your Solr schema, you might ask yourself whether your fields should be defined as docValues, stored, or both. To decide, you need to understand how your application uses each field from a functional point of view. If both methods can serve your goals, you should understand how choosing one over the other will affect the performance and the space usage of your search engine.
Below, I’m giving some tools for analyzing which method is the best choice in which scenario.
The main usage of stored fields is to return field values at query time. You can specify a field to be stored as follows:
<field name="name" type="string" stored="true"/>
Stored fields are organised in a row-oriented manner. This means that, given a set of fields, the values of these fields are concatenated in a row for each document. The rows are then stored sequentially on disk according to their Lucene doc id. Each row may have a different size, depending on the number of fields defined for that document and on their data types (e.g. string or text fields have variable size). A pointer to each row is stored to allow fast access to it.
Let’s say you want to retrieve the field A from the document with lucene id lid:
- Find the address of the row for lid.
- Read the values in the row until you reach the value of the field A.
- Return the value of the field A.
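The three steps above can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the row layout (a field count, then field number, length and bytes for each value) and the names `build_stored_rows` and `read_stored_field` are hypothetical and chosen for illustration; Lucene's real stored fields format is different and also compresses the data.

```python
import struct

def build_stored_rows(docs, field_order):
    """Serialise every document as one row and remember where each row starts."""
    data = bytearray()
    offsets = []  # offsets[lid] = start of the row for doc id `lid`
    for doc in docs:
        offsets.append(len(data))
        present = [f for f in field_order if f in doc]
        data += struct.pack(">H", len(present))          # number of fields in this row
        for f in present:
            value = doc[f].encode("utf-8")
            # field number (2 bytes) + value length (4 bytes) + value bytes
            data += struct.pack(">HI", field_order.index(f), len(value))
            data += value
    return bytes(data), offsets

def read_stored_field(data, offsets, field_order, lid, field):
    pos = offsets[lid]                                   # 1. find the address of the row
    (n_fields,) = struct.unpack_from(">H", data, pos)
    pos += 2
    target = field_order.index(field)
    for _ in range(n_fields):                            # 2. walk the row value by value
        field_no, length = struct.unpack_from(">HI", data, pos)
        pos += 6
        if field_no == target:
            return data[pos:pos + length].decode("utf-8")  # 3. return the value
        pos += length                                    # skip a value we do not need
    return None                                          # field not set for this document

field_order = ["A", "B", "C"]
docs = [{"A": "x", "B": "yy"}, {"B": "only-b"}, {"A": "café", "C": "z"}]
data, offsets = build_stored_rows(docs, field_order)
print(read_stored_field(data, offsets, field_order, 0, "B"))
```

Note how reading field B of document 0 has to step over field A first: the cost of retrieving one field grows with the number and size of the fields stored before it in the row.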
Stored fields space occupancy
Understanding how much space stored fields will take is a difficult task. In fact, because the data is compressed, factors such as data redundancy and distribution play a fundamental role.
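To get a feeling for how much redundancy matters, here is a small experiment. It is a sketch under assumptions, not a measurement of Solr: it uses `zlib` (DEFLATE) from the Python standard library as a stand-in compressor, and it compresses one big buffer rather than the blocks of documents that Lucene works on. The value length of 15 characters and the vocabulary of 20 distinct values are arbitrary choices.

```python
import random
import string
import zlib

random.seed(42)  # fixed seed so the experiment is repeatable

def random_value(length=15):
    return "".join(random.choices(string.ascii_lowercase, k=length))

n_docs = 10_000

# Case 1: every document holds a different random value (little redundancy).
unique_rows = "".join(random_value() for _ in range(n_docs)).encode("ascii")

# Case 2: every document holds one of only 20 distinct values (high redundancy).
vocabulary = [random_value() for _ in range(20)]
redundant_rows = "".join(random.choice(vocabulary) for _ in range(n_docs)).encode("ascii")

def compression_ratio(raw):
    """Compressed size divided by raw size: lower means better compression."""
    return len(zlib.compress(raw)) / len(raw)

unique_ratio = compression_ratio(unique_rows)
redundant_ratio = compression_ratio(redundant_rows)
print(f"unique values:    {unique_ratio:.2f}")
print(f"redundant values: {redundant_ratio:.2f}")
```

Both buffers have exactly the same raw size, yet the redundant one compresses far better. This is why the on-disk size of stored fields cannot be predicted from the number and length of the fields alone.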
With the default settings, compression is done by applying the LZ4 algorithm to the rows formed as described above. LZ4 is a very fast compression algorithm with a good but not optimal compression ratio. Two modes are available: BEST_SPEED, which is the default, and BEST_COMPRESSION, which gives up some speed for a better compression ratio. To use BEST_COMPRESSION, set the compressionMode of the codec factory in solrconfig.xml as follows:
<codecFactory class="solr.SchemaCodecFactory">
  <str name="compressionMode">BEST_COMPRESSION</str>
</codecFactory>
DocValues were introduced to improve the performance of some operations that would otherwise be very expensive: sorting, faceting and grouping. If these operations are frequent and important in your search engine use case, you should seriously consider enabling docValues on the fields involved.
To enable docValues, just set the docValues property as follows:
<field name="name" type="string" docValues="true"/>
DocValues can be used in place of stored fields for retrieving field values at query time, but they are serialised in a column-oriented fashion. There are 3 important differences in behaviour to take into consideration:
- Text fields cannot be set as docValues. If your field has a text type and you want to return it, you are forced to set it as stored.
- Multivalued fields are returned in a different order: when you use stored fields, the values are returned in the order in which they were indexed (the holding data structure acts as a List). Multivalued fields with docValues enabled are instead held in a pre-sorted structure (SORTED_SET or SORTED_NUMERIC depending on their type), so values come back in sorted order. SORTED_SET also removes duplicates, while SORTED_NUMERIC keeps them.
- If you want docValues fields to be returned with fl=*, you need to specify this in the field definition: useDocValuesAsStored="true" (note this is the default value starting from schema version 1.6)
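The ordering difference for multivalued fields is easy to overlook, so here is a tiny illustration. It does not call Solr; it only mimics, with plain Python collections, what a client would see for a multivalued string field in the two cases. The sample values are made up.

```python
# Values of a multivalued string field, in the order they were indexed.
indexed_values = ["solr", "lucene", "solr", "elastic"]

# Returned from stored fields: list semantics, indexing order and duplicates are kept.
from_stored = list(indexed_values)

# Returned from docValues (SORTED_SET for strings): sorted, duplicates removed.
from_docvalues = sorted(set(indexed_values))

print(from_stored)
print(from_docvalues)
```

If your application relies on the position of a value inside a multivalued field (for example, "the first author"), or on repeated values, that information is lost when a string field is retrieved from docValues.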
Accessing the value of field A for a document X can be done as follows:
- Find the address of document X in the column of field A.
- Read the value and return it.
Finding that address is not trivial, for two reasons:
- The field could be empty for some documents.
- We don’t want to have a pointer for each docValue, otherwise too much space would be used only for pointers.
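The lookup and its two complications (documents without a value, and no pointer per value) can be sketched as follows. This is a simplified model, not Lucene's actual encoding: the class name `SparseNumericColumn` is hypothetical, the list of doc ids that have a value is searched with a plain binary search, and values are fixed-width so that an address can be computed instead of stored.

```python
import struct
from bisect import bisect_left

class SparseNumericColumn:
    """One column of fixed-width integers, with values only for some documents."""

    WIDTH = 8  # every value takes 8 bytes, so no per-value pointer is needed

    def __init__(self, values_by_doc):
        # Sorted ids of the documents that actually have a value for this field.
        self.doc_ids = sorted(values_by_doc)
        # The values themselves, packed back to back in doc id order.
        self.data = b"".join(struct.pack(">q", values_by_doc[d]) for d in self.doc_ids)

    def get(self, doc_id):
        # 1. Find the position of the document inside the column.
        i = bisect_left(self.doc_ids, doc_id)
        if i == len(self.doc_ids) or self.doc_ids[i] != doc_id:
            return None  # the field is empty for this document
        # 2. Compute the address from the position, then read and return the value.
        (value,) = struct.unpack_from(">q", self.data, i * self.WIDTH)
        return value

price = SparseNumericColumn({0: 10, 3: 25, 4: -7})
print(price.get(3), price.get(1))
```

Unlike the row-oriented case, reading the value never touches the other fields of the document: only the column of the requested field is accessed.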
Docvalues space occupancy
- Data is stored in a column-oriented manner, so homogeneous data will be close to each other.
- Specialised compression is applied depending on the data type.
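As an example of type-specific encoding, single-valued string docValues are kept as ordinals into a sorted dictionary of the unique terms. The sketch below shows the idea only; the function names are hypothetical, the `-1` marker for a missing value is an assumption of this example, and the real on-disk format adds further compression on top.

```python
def encode_sorted_column(values):
    """Dictionary-encode a string column: unique sorted terms plus one ordinal per doc."""
    dictionary = sorted({v for v in values if v is not None})
    ordinal_of = {term: i for i, term in enumerate(dictionary)}
    # -1 marks a document without a value (an assumption of this sketch).
    ordinals = [ordinal_of[v] if v is not None else -1 for v in values]
    return dictionary, ordinals

def decode_value(dictionary, ordinals, doc_id):
    ordinal = ordinals[doc_id]
    return None if ordinal < 0 else dictionary[ordinal]

colours = ["red", "blue", "red", None, "green", "blue"]
dictionary, ordinals = encode_sorted_column(colours)

# Number of bits needed to store one ordinal for this dictionary size.
bits_per_ordinal = max(1, (len(dictionary) - 1).bit_length())

print(dictionary, ordinals, bits_per_ordinal)
```

Each repeated string is stored once in the dictionary and every document only holds a small integer. Because the dictionary is sorted, comparing two ordinals is equivalent to comparing the two strings, which is part of what makes sorting and faceting on docValues cheap.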
This benchmark is not meant to be exhaustive. In other scenarios the results could be very different.
- 100 random stored string fields of 15 characters each
- 100 random docvalues string fields of 15 characters each
- CPU: AMD RYZEN 3600
- RAM: 32 GB
- Index size: 9.07 GB
In conclusion, docValues bring several performance benefits (faceting, sorting and grouping), and they can even speed up field retrieval when only a few docValues fields and no stored fields are used. Moreover, docValues are likely to use less space than stored fields. If the use case requires many fields to be returned, using stored fields is the way to go.
The experiments can be replicated by using the Solr configuration and the Python scripts in https://github.com/SeaseLtd/solr-field-retrieval-benchmark. The git repository includes only a small sample of the data used for the benchmarks.