The inverted index is the core data structure used to provide search.
We are going to look in detail at all the components involved.
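As a mental model before diving in, an inverted index can be sketched as a map from each term to the list of IDs of the documents containing it (its posting list). The following toy Python sketch is purely illustrative (the documents and the `build_inverted_index` helper are made up, and this is nothing like Lucene's actual on-disk layout):

```python
def build_inverted_index(docs):
    """Map each term to the sorted list of doc IDs containing it (a posting list)."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index.setdefault(term, []).append(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {
    0: "the quick brown fox",
    1: "the lazy dog",
    2: "the quick dog",
}
index = build_inverted_index(docs)
print(index["quick"])  # doc IDs whose text contains "quick"
print(index["dog"])
```

Searching for a term is then a dictionary lookup followed by a scan of its posting list, which is what makes the structure fast for term queries.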
It’s important to know where the inverted index will be stored.
Assuming we are using a filesystem-based Lucene Directory, the index will be stored on disk for durability (we will not cover the commit concept and policies here).
Modern implementations of the filesystem Directory leverage the OS memory-mapping feature to load chunks of the index (or possibly the entire index) into memory (RAM) when necessary.
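The memory-mapping idea can be illustrated with Python's standard `mmap` module (a toy analogy, not what Lucene's Java implementation actually does): the file is mapped into the process address space, and the OS pages its contents into RAM lazily, on first access:

```python
import mmap
import os
import tempfile

# Write a small fake "index file" to disk (a stand-in for a segment file).
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"term:apple|docs:1,4,7;term:banana|docs:2,3")
    path = f.name

# Memory-map the file: nothing is read eagerly; the OS loads pages on demand.
with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        chunk = mm[0:16]  # touching this byte range faults the page into RAM
        print(chunk)

os.unlink(path)  # clean up the temporary file
```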
On the file system, the index looks like a collection of immutable segments.
Each segment is a fully working inverted index, built from a set of documents.
A segment is a partition of the full index: it represents a portion of it and is fully searchable on its own.
Each segment is composed of a number of binary files, each storing a particular data structure relevant to the index, in compressed form [1].
To simplify: over the life of our index, as we index data we build segments, which are merged from time to time (depending on the configured merge policy).
The scope of this post, however, is not the indexing process but the structure of the index it produces.
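The segment lifecycle described above can be sketched with a toy model (all names here are hypothetical, and real Lucene merges also renumber document IDs, which this sketch ignores): each segment is a small immutable inverted index, and a merge builds one fresh segment out of the posting lists of the old ones:

```python
def make_segment(docs):
    """Build an immutable mini inverted index: term -> tuple of sorted doc IDs."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.split()):
            index.setdefault(term, []).append(doc_id)
    return {t: tuple(sorted(ids)) for t, ids in index.items()}

def merge_segments(*segments):
    """Merge segments by combining their posting lists term by term."""
    merged = {}
    for seg in segments:
        for term, postings in seg.items():
            merged[term] = tuple(sorted(set(merged.get(term, ())) | set(postings)))
    return merged

seg1 = make_segment({0: "apache lucene", 1: "inverted index"})
seg2 = make_segment({2: "lucene index"})
merged = merge_segments(seg1, seg2)
print(merged["lucene"])  # postings drawn from both source segments
```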
Prasanna kumar
November 22, 2018
Thanks for such a detailed article Alessandro 🙂
ashutosh
May 7, 2020
Thanks for such a great article. I have a query about live documents: why does Lucene retain deleted documents with an alive status of ‘0’, with actual removal from the index happening only after a segment merge? What’s the advantage here?
Alessandro Benedetti
May 7, 2020
Hi, one of the reasons Lucene is pretty fast at indexing (and searching, to a certain extent) is the fact that index segments are immutable.
Flipping a bit in the live documents data structure makes deletions visible in search very quickly.
So the trade-off is unbalanced in favour of visibility, at the cost of disk space, which gets released only on segment merge.
Performing a full delete each time would imply modifying every posting list containing that document.
Definitely not a trivial task.
On segment merge the posting lists are merged anyway, so at that point it is much easier and quicker to reclaim the space.
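The idea can be sketched in Python (a toy model, not Lucene internals; the names are made up): deletion just clears a bit in a live-docs bitmap, search filters on that bitmap at query time, and only a merge rewrites the posting lists without the dead documents:

```python
# Toy segment: immutable posting lists plus a mutable live-docs bitmap.
postings = {"lucene": [0, 1, 2], "index": [1, 2]}
live_docs = [True, True, True]  # one flag per doc ID in the segment

def delete(doc_id):
    live_docs[doc_id] = False  # O(1): flip a bit, posting lists untouched

def search(term):
    # Filter out dead documents at query time.
    return [d for d in postings.get(term, []) if live_docs[d]]

def merge():
    # On merge, rewrite the postings, physically dropping dead docs.
    return {t: [d for d in ids if live_docs[d]] for t, ids in postings.items()}

delete(1)
print(search("lucene"))  # doc 1 is no longer visible to search
print(merge()["index"])  # the merged segment reclaims doc 1's space
```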
There may be many more reasons, but I hope this is a sufficient explanation!
ashutosh
May 8, 2020
Thanks for the explanation.
Shyamala
June 4, 2020
Hi, thanks for the article. I have a question: is there any REST API available to query the inverted index contents of a field? (Note: the field is of type ‘keyword’ and doc_values are disabled for it.) Or is there any way I can get the list of contents of a field (type ‘keyword’) without enabling doc_values?
Alessandro Benedetti
June 4, 2020
If you intend to retrieve the terms of a field, there are various ways in both Solr and Elasticsearch (REST search servers built on top of Apache Lucene).
Lucene on its own is not a server, just a library.
In Apache Solr you can use the Terms Component: https://lucene.apache.org/solr/guide/8_5/the-terms-component.html
In Elasticsearch I would go with the terms aggregation: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html
Or the Term Vectors endpoint: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-termvectors.html
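For illustration, a terms aggregation request body sent to the `_search` endpoint might look like the following (the aggregation and field names here are hypothetical); setting the top-level `"size": 0` suppresses the document hits so that only the aggregation buckets are returned:

```json
{
  "size": 0,
  "aggs": {
    "distinct_values": {
      "terms": {
        "field": "my_keyword_field",
        "size": 10000
      }
    }
  }
}
```

The `size` inside the `terms` block controls how many buckets (distinct values) come back, which is separate from the top-level `size` controlling the hits.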
Shyamala
June 5, 2020
Thanks for your quick reply Alessandro Benedetti.
My business problem is: “get all the distinct values/terms of a field (type: keyword)”.
As you suggested, I can use the Elasticsearch terms aggregation only when the field has doc_values enabled. I got doc_values enabled and am now running a terms aggregation to solve this business problem, and I do get the list of unique values for my field. But the issue now is that Elasticsearch goes through all the documents to fetch the list of unique values present in the index (confirmed by evaluating the response below), while I have only around 3000 unique values for that field:
"hits": {
  "total": 14041450,
  "max_score": 0.0,
  "hits": []
}
Is there a way to get the list of unique values of a field with hits limited to the number of unique values? (A lot of documents are added on a daily basis and the terms of the field could grow rapidly; I do not want Elasticsearch to go through all the documents just to get the list of unique values of the field.)
teddie
July 2, 2020
Thank you, truly a great post!