Exploring Solr Internals : The Lucene Inverted Index

This blog post is about the Lucene Inverted Index and how Apache Solr internally works.

When working with Solr, understanding and properly configuring the underlying Lucene index is fundamental to deeply controlling your search.
With a better knowledge of what the index looks like and how each component is used, you can build a more performant, lightweight and efficient solution.
The scope of this blog post is to explore the different components of the Inverted Index.
This will be more about the data structures and how they contribute to providing search-related functionalities.
For the low-level approaches used to store these data structures, please refer to the official Lucene documentation [1].

The Lucene Inverted Index

The Inverted Index is the basic data structure used by Lucene to provide Search in a corpus of documents.
It is quite similar to the index at the end of a book.
From Wikipedia:

“In computer science, an inverted index (also referred to as postings file or inverted file) is an index data structure storing a mapping from content, such as words or numbers, to its locations in a database file, or in a document or a set of documents.”

In Memory/On Disk

The Inverted index is the core data structure that is used to provide Search.
We are going to see in detail all the components involved.
 
It’s important to know where the Inverted Index will be stored.
Assuming we are using a filesystem Lucene Directory, the index will be stored on disk for durability (we will not cover the commit concept and policies here).
Modern implementations of the filesystem Directory leverage the OS memory-mapping feature to load chunks of the index (or possibly the entire index) into memory (RAM) when necessary.
The index on the file system looks like a collection of immutable segments.
Each segment is a fully working Inverted Index, built from a set of documents.
A segment is a partition of the full index: it represents a part of it and is fully searchable.
Each segment is composed of a number of binary files, each storing a particular data structure relevant to the index, in compressed form [2].
To simplify: during the life of our index, as we index data, we build segments, which are merged from time to time (depending on the configured Merge Policy).
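As a rough illustration of the storage side (plain Lucene rather than Solr, a recent Lucene version is assumed and the index path is hypothetical), opening a filesystem Directory usually gives you the memory-mapped implementation mentioned above:

```java
import java.nio.file.Paths;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// A minimal sketch: FSDirectory.open picks the best implementation for the
// platform; on most 64-bit systems that is MMapDirectory, which relies on OS
// memory mapping to page segment files into RAM on demand.
public class DirectoryExample {
    public static void main(String[] args) throws Exception {
        try (Directory directory = FSDirectory.open(Paths.get("/path/to/index"))) {
            System.out.println(directory.getClass().getSimpleName()); // typically MMapDirectory
        }
    }
}
```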
 
The scope of this post, however, is not the indexing process but the structure of the index it produces.

Hands on!

Let’s take 3 documents as input, each with 2 simple fields, and see what a full Inverted Index looks like:

Doc0
    {  “id”:”c”,
       “title”:”video game history”
    },

Doc1
    {  “id”:”a”,
       “title”:”game video review game”
    },

Doc2
    {  “id”:”b”,
       “title”:”game store”
    }

Depending on the configuration in the schema.xml, at indexing time we generate the related data structures.
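As a rough sketch of what happens under the hood (using the plain Lucene API rather than Solr; the index path and the field types are assumptions for illustration only), indexing our 3 documents could look like this:

```java
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// A minimal sketch: "id" is indexed as a single untokenised term, "title" is
// analysed into the terms that populate the inverted index shown below.
// TextField indexes docs, freqs and positions by default; offsets would need a
// custom FieldType (see the schema configuration section later in the post).
public class IndexExampleDocs {
    public static void main(String[] args) throws Exception {
        try (Directory dir = FSDirectory.open(Paths.get("/tmp/example-index"));
             IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            writer.addDocument(doc("c", "video game history"));    // Doc0
            writer.addDocument(doc("a", "game video review game")); // Doc1
            writer.addDocument(doc("b", "game store"));              // Doc2
            writer.commit();
        }
    }

    private static Document doc(String id, String title) {
        Document doc = new Document();
        doc.add(new StringField("id", id, Field.Store.YES));
        doc.add(new TextField("title", title, Field.Store.YES));
        return doc;
    }
}
```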

Let’s look at the inverted index in its complete form, then explain how each component is used and when parts of it can be omitted.

Field: id

Ordinal | Term | Document Frequency | Posting List
0       | a    | 1                  | 1 : 1 : [1] : [0-1]
1       | b    | 1                  | 2 : 1 : [1] : [0-1]
2       | c    | 1                  | 0 : 1 : [1] : [0-1]

Field: title

Ordinal | Term    | Document Frequency | Posting List
0       | game    | 3                  | 0 : 1 : [2] : [6-10], 1 : 2 : [1, 4] : [0-4, 18-22], 2 : 1 : [1] : [0-4]
1       | history | 1                  | 0 : 1 : [3] : [11-18]
2       | review  | 1                  | 1 : 1 : [3] : [11-17]
3       | store   | 1                  | 2 : 1 : [2] : [5-10]
4       | video   | 2                  | 0 : 1 : [1] : [0-5], 1 : 1 : [2] : [5-10]

This may look scary at first, so let’s analyse the different components of the data structure.

Term Dictionary
The term dictionary is a sorted skip list containing all the unique terms for the specific field.
Two operations are permitted, starting from a pointer in the dictionary:
next() -> to iterate over the terms one by one
advance(BytesRef b) -> to jump to the first entry >= the input (this operation is O(log n), where n is the number of unique terms).
 
An auxiliary Automaton is stored in memory, accepting a set of smartly calculated prefixes of the terms in the dictionary.
It is a weighted Automaton, with a weight associated with each prefix (i.e. the offset to look at in the Term Dictionary).
This automaton is used at query time to identify a starting point to look from in the dictionary.
 
When we run a query (a TermQuery, for example):
 
1) we give the query as input to the in-memory Automaton, and an Offset is returned
2) we access the location associated with the Offset in the Term Dictionary
3) we advance to the BytesRef representation of the TermQuery’s term
4) if the term is a match for the TermQuery, we return the associated Posting List
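A minimal sketch of this lookup with the plain Lucene API (the index path is the hypothetical one used above; in the TermsEnum API the advance operation described above is exposed as seekCeil, and next() iterates term by term):

```java
import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;

// A minimal sketch: walk the term dictionary of the "title" field and position
// the enum on "game" (or the smallest term >= "game") in each segment.
public class TermDictionaryLookup {
    public static void main(String[] args) throws Exception {
        try (IndexReader reader = DirectoryReader.open(
                FSDirectory.open(Paths.get("/tmp/example-index")))) {
            for (LeafReaderContext leaf : reader.leaves()) { // one leaf per segment
                Terms terms = leaf.reader().terms("title");
                if (terms == null) continue;
                TermsEnum termsEnum = terms.iterator();
                // seekCeil plays the role of advance(): jump to the first term >= "game"
                if (termsEnum.seekCeil(new BytesRef("game")) != TermsEnum.SeekStatus.END) {
                    System.out.println(termsEnum.term().utf8ToString()
                            + " docFreq=" + termsEnum.docFreq());
                }
            }
        }
    }
}
```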
 
Document Frequency
This is simply the number of documents in the corpus containing the term t in the field f.
For the term game, we have 3 documents in our corpus that contain the term in the field title.
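A minimal sketch of reading this value with the plain Lucene API (same hypothetical index as above); IndexReader.docFreq aggregates the count across all segments:

```java
import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.FSDirectory;

// A minimal sketch: document frequency of the term "game" in the field "title".
public class DocumentFrequencyExample {
    public static void main(String[] args) throws Exception {
        try (IndexReader reader = DirectoryReader.open(
                FSDirectory.open(Paths.get("/tmp/example-index")))) {
            int df = reader.docFreq(new Term("title", "game"));
            System.out.println("docFreq(title:game) = " + df); // 3 in our example corpus
        }
    }
}
```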
Posting List
The posting list is the sorted skip list of the Doc Ids of the documents containing the related term.
It’s used to return the documents for the searched term.
Let’s have a deep look at a complete Posting List for the term game in the field title :

0 : 1 : [2] : [6-10],
1 : 2 : [1, 4] : [0-4, 18-22],
2 : 1 : [1] : [0-4]

Each element of this posting list is :
Document Ordinal : Term Frequency : [array of Term Positions] : [array of Term Offsets].

Document Ordinal -> The ordinal (internal Lucene ID) of the document in the corpus containing the related term.
Never rely on this ordinal at the application level, as it may change over time (during segment merges, for example).

e.g.
According to the starting ordinal of each Posting List element:
Doc0 (ordinal 0),
Doc1 (ordinal 1) and
Doc2 (ordinal 2)
contain the term game in the field title.

Term Frequency -> The number of occurrences of the term in the Posting List element.
e.g.
0 : 1 
Doc0 contains 1 occurrence of the term game in the field title.
1 : 2
Doc1 contains 2 occurrences of the term game in the field title.
2 : 1
Doc2 contains 1 occurrence of the term game in the field title.

Term Positions Array -> For each occurrence of the term in the Posting List element, it contains the related position in the field content.
e.g.
0 : [2]
Doc0 1st occurrence of the term game in the field title occupies the 2nd position in the field content.
( “title”:”video(1) game(2) history(3)” )
1 : [1, 4] 
Doc1 1st occurrence of the term game in the field title occupies the 1st position in the field content.
Doc1 2nd occurrence of the term game in the field title occupies the 4th position in the field content.
( “title”:”game(1) video(2) review(3) game(4)” )
2 : [1] 
Doc2 1st occurrence of the term game in the field title occupies the 1st position in the field content.
( “title”:”game(1) store(2)” )

Term Offsets Array ->  For each occurrence of the term in the Posting List element, it contains the related character offset in the field content.
e.g.
0 : [6-10]
Doc0 1st occurrence of the term game in the field title starts at character offset 6 in the field content and ends at character offset 10 (exclusive).
( “title”:”video(1) game(2) …” )
( “title”:”01234 5 6789 10  …” )
1 : [0-4, 18-22]
Doc1 1st occurrence of the term game in the field title starts at character offset 0 in the field content and ends at character offset 4 (exclusive).
Doc1 2nd occurrence of the term game in the field title starts at character offset 18 in the field content and ends at character offset 22 (exclusive).
( “title”:”game(1) video(2) review(3) game(4)” )
( “title”:”0123 4   video(2) review(3) 18192021 22” )
2 : [0-4]
Doc2 1st occurrence of the term game in the field title starts at character offset 0 in the field content and ends at character offset 4 (exclusive).
( “title”:”game(1) store(2)” )
( “title”:”0123 4 store(2)” )
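A minimal sketch of iterating this posting list with the plain Lucene API (same hypothetical index as above). Note that Lucene counts positions from 0, while the walkthrough above counts from 1 for readability, and offsets are returned only if they were indexed (see the schema configuration section below):

```java
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;

// A minimal sketch: for each element of the posting list of title:game, print
// the document ordinal, the term frequency, the positions and the offsets.
// startOffset()/endOffset() return -1 when offsets were not indexed.
public class PostingListExample {
    public static void main(String[] args) throws Exception {
        try (IndexReader reader = DirectoryReader.open(
                FSDirectory.open(Paths.get("/tmp/example-index")))) {
            for (LeafReaderContext leaf : reader.leaves()) { // one leaf per segment
                Terms terms = leaf.reader().terms("title");
                if (terms == null) continue;
                TermsEnum termsEnum = terms.iterator();
                if (!termsEnum.seekExact(new BytesRef("game"))) continue;
                PostingsEnum postings = termsEnum.postings(null, PostingsEnum.ALL);
                for (int doc = postings.nextDoc();
                     doc != DocIdSetIterator.NO_MORE_DOCS;
                     doc = postings.nextDoc()) {
                    int freq = postings.freq();
                    List<Integer> positions = new ArrayList<>();
                    List<String> offsets = new ArrayList<>();
                    for (int i = 0; i < freq; i++) {
                        positions.add(postings.nextPosition());
                        offsets.add(postings.startOffset() + "-" + postings.endOffset());
                    }
                    System.out.println(doc + " : " + freq + " : " + positions + " : " + offsets);
                }
            }
        }
    }
}
```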

Live Documents
Live Documents is a lightweight data structure that keeps track of the documents that are alive in the current status of the index.
It simply associates a bit with each document: 1 (alive) or 0 (deleted).
This is used to keep queries aligned with the status of the index, avoiding returning deleted documents.
Note: deleted documents are removed from the index data structures (and the disk space released) only when a segment merge happens.
This means that you can delete a set of documents and the space in the index will be claimed back only after the first merge involving the related segments happens.
Assuming we delete Doc2, this is what the Live Documents data structure looks like:
 
Ordinal | Alive
0       | 1
1       | 1
2       | 0
Ordinal -> The Internal Lucene document ID
Alive -> A bit that specifies if the document is alive or deleted.
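A minimal sketch of reading this bitset with the plain Lucene API (same hypothetical index as above); per segment, getLiveDocs() returns null when there are no deletions, meaning every document is alive:

```java
import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Bits;

// A minimal sketch: print, per segment, whether each document ordinal is alive.
public class LiveDocumentsExample {
    public static void main(String[] args) throws Exception {
        try (IndexReader reader = DirectoryReader.open(
                FSDirectory.open(Paths.get("/tmp/example-index")))) {
            for (LeafReaderContext leaf : reader.leaves()) {
                Bits liveDocs = leaf.reader().getLiveDocs(); // null = no deletions
                for (int ord = 0; ord < leaf.reader().maxDoc(); ord++) {
                    boolean alive = liveDocs == null || liveDocs.get(ord);
                    System.out.println("ordinal " + ord + " alive=" + alive);
                }
            }
        }
    }
}
```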
Norms

Norms is a data structure that provides a length-normalisation and boost factor per field per document.
The numeric value is derived from the length of the field content and possibly an indexing-time boost factor as well.
Norms are used when scoring the documents retrieved for a query.
In detail, it’s a way to improve the relevancy of documents containing the term in short fields:
short field contents will win over long field contents when matching a query with Norms enabled.
A boost factor can also be associated with the field at indexing time.

Field: title

Doc Ordinal | Norm
0           | 0.78
1           | 0.56
2           | 0.98
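A minimal sketch of reading the norms with the plain Lucene API (same hypothetical index as above). Note that recent Lucene versions expose the norm as a long value encoding the field length, while the pre-computed floats shown in the table reflect older Lucene versions:

```java
import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.store.FSDirectory;

// A minimal sketch: norms are exposed per segment as a NumericDocValues
// over the "title" field, one value per document.
public class NormsExample {
    public static void main(String[] args) throws Exception {
        try (IndexReader reader = DirectoryReader.open(
                FSDirectory.open(Paths.get("/tmp/example-index")))) {
            for (LeafReaderContext leaf : reader.leaves()) {
                NumericDocValues norms = leaf.reader().getNormValues("title");
                if (norms == null) continue; // norms omitted for this field
                while (norms.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
                    System.out.println("ordinal " + norms.docID()
                            + " norm=" + norms.longValue());
                }
            }
        }
    }
}
```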
Schema Configuration
When configuring a field in Solr (or directly in Lucene) it is possible to specify a set of field attributes to control which data structures are going to be produced.
Let’s take a look at the different Lucene Index Options:
Lucene Index Option: NONE
Solr schema: indexed=”false”
Description: The inverted index will not be built.
To use when:
– You don’t need to search in your corpus of documents.

Lucene Index Option: DOCS
Solr schema: omitTermFreqAndPositions=”true”
Description: The posting list for each term will simply contain the document ids (ordinals) and nothing else.
e.g. game -> 0, 1, 2
To use when:
– You don’t need to search in your corpus with phrase or positional queries.
– You don’t need the score to be affected by the number of occurrences of a term in a document field.

Lucene Index Option: DOCS_AND_FREQS
Solr schema: omitPositions=”true”
Description: The posting list for each term will contain the document ids (ordinals) and the term frequency in each document.
e.g. game -> 0 : 1, 1 : 2, 2 : 1
To use when:
– You don’t need to search in your corpus with phrase or positional queries.
– You do need scoring to take term frequencies into consideration.

Lucene Index Option: DOCS_AND_FREQS_AND_POSITIONS
Solr schema: default when indexed=”true”
Description: The posting list for each term will contain the term positions in addition.
e.g. game -> 0 : 1 : [2], 1 : 2 : [1, 4], 2 : 1 : [1]
To use when:
– You do need to search in your corpus with phrase or positional queries.
– You do need scoring to take term frequencies into consideration.

Lucene Index Option: DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS
Solr schema: storeOffsetsWithPositions=”true”
Description: The posting list for each term will contain the term offsets in addition.
e.g. game -> 0 : 1 : [2] : [6-10], 1 : 2 : [1, 4] : [0-4, 18-22], 2 : 1 : [1] : [0-4]
To use when:
– You want to use the Postings Highlighter, a fast version of highlighting that uses the posting list instead of the term vector.

Lucene Index Option: omitNorms
Solr schema: omitNorms=”true”
Description: The norms data structure will not be built.
To use when:
– You don’t need to boost short field contents.
– You don’t need indexing-time boosting per field.
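In plain Lucene these options map directly to the IndexOptions set on a FieldType (Solr derives this for you from the schema.xml attributes above); a minimal sketch, with field names taken from our example:

```java
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.IndexOptions;

// A minimal sketch: "title" keeps positions and offsets (and norms),
// "id" keeps only the document ids in its posting list.
public class FieldTypeExample {
    public static void main(String[] args) {
        FieldType titleType = new FieldType();
        titleType.setTokenized(true);
        titleType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
        titleType.setOmitNorms(false); // equivalent to omitNorms="false"
        titleType.freeze();

        FieldType idType = new FieldType();
        idType.setTokenized(false);
        idType.setIndexOptions(IndexOptions.DOCS); // equivalent to omitTermFreqAndPositions="true"
        idType.freeze();

        Field title = new Field("title", "video game history", titleType);
        Field id = new Field("id", "c", idType);
        System.out.println(titleType.indexOptions() + " / " + idType.indexOptions());
    }
}
```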

Need Help With This Topic?

If you’re struggling with understanding the Lucene Inverted Index, don’t worry – we’re here to help! Our team offers expert services and training to help you optimize your search engine and get the most out of your system. Contact us today to learn more!

We are Sease, an Information Retrieval Company based in London, focused on providing R&D project guidance and implementation, Search consulting services, Training, and Search solutions using open source software like Apache Lucene/Solr, Elasticsearch, OpenSearch and Vespa.


8 Responses

  1. Thanks for such a great article. I have a query about Live Documents: why does Lucene retain deleted documents with alive status ‘0’, with the actual deletion happening only after a segment merge? What’s the advantage here?

    1. Hi, one of the reasons Lucene is pretty fast in indexing (and searching, to a certain extent) is the fact that index segments are immutable.
      Changing a bit in the live documents data structure makes deletions visible really fast in search.
      So it is unbalanced in favour of visibility, in contrast to the disk space, which gets released only on segment merge.

      Operating a full delete each time would imply a modification of the posting lists containing that document.
      Definitely not a trivial task.

      On segment merge, the posting lists are merged anyway, so at that point it is much easier and quicker to claim back the space.
      There may be many more reasons, but I hope this is a sufficient explanation!

  2. Hi, thanks for the article. I have a question. Is there any REST API available to query the inverted index contents of a field? (Note: the field is of type ‘keyword’ and the doc_values are disabled for it.) Or is there any way I can get the list of contents of a field (type ‘keyword’) without enabling the doc_values?

    1. If you intend to retrieve the terms for a field, there are various ways in both Solr and Elasticsearch (REST search servers on top of Apache Lucene).
      Lucene on its own is not a server, just a library.
      In Apache Solr you can use the Terms Component: https://lucene.apache.org/solr/guide/8_5/the-terms-component.html
      In Elasticsearch I would go with the terms aggregation: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html
      Or the Term Vectors endpoint: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-termvectors.html

      1. Thanks for your quick reply Alessandro Benedetti.

        My business problem is “get all the distinct values/terms of a field (type: keyword)”.
        As you suggested, I can do that with the Elasticsearch terms aggregation only when the field has doc_values enabled. I got the doc_values enabled and I’m trying to run a terms aggregation to solve the same business problem. I got the list of unique values for my field. But now the issue is that Elasticsearch goes through all the documents to fetch the list of unique values present in that index (confirmed by evaluating the response below), while I only have around 3000 unique values for that field.

        “hits”: {
        “total”: 14041450,
        “max_score”: 0.0,
        “hits”: []
        }

        Is there a way to get the list of unique values of a field with hits limited to the number of unique values? (A lot of documents are added on a daily basis and the terms of the field could grow rapidly; in that case I do not want Elasticsearch to go through all the documents just to get the list of unique values of the field.)
