Synonyms + Stopwords?? OMG!

The Context

The scenario description is quite simple: we want to use synonyms and stopwords.

Following the path of our previous article, we will introduce an additional component in the analysis chain: a StopFilter, which, as the name suggests, removes a set of words from an incoming token stream.

We will use the following data through the examples:

  • synonyms = [“out of warranty”,”oow”]
  • stopwords = [“of”]

Token filters can be configured at index and/or query time. In this context we are focused on the query side: both synonyms and stopwords will be configured only in the query analyzer.

Working exclusively at query time has a great benefit: we can change things at runtime without any reindex need. At the same time, no stopwords filtering will be executed at index time so those terms will be uselessly part of the dictionary.

The Problem: synonyms followed by stopwords

We have the following analyzers:

  • index analyzer
    • standard-tokenizer
    • lowercase
  • query analyzer
    • standard-tokenizer
    • lowercase + synonyms + stopwords

Theoretically, in the query analyzer we would have two options: the stopwords filter could be defined before or after the synonym filter. However, the first way (before) doesn’t make so much sense, because terms that are stopwords and that are, at the same time, part of a synonym will be removed before the synonym detection. As consequence of that those synonym won’t be detected: in the example data, issuing a query like

q=out of warranty

the “of” term will be removed by the StopFilter, the subsequent filter would receive [“out”, “warranty”], which doesn’t match the configured synonym (“out of warranty”).

Elasticsearch users: Elasticsearch doesn’t allow this scenario at all; if you try to use the PUT Settings API with a chain defined as above (first stopwords then synonyms with some term intersection), it will throw an illegal argument exception saying “term: out of warranty analyzed to a token (warranty) with position increment != 1 (got: 2)” .

Apache Solr instead uses a lenient approach: no errors at index creation, but the problem remains (personally I prefer the Elasticsearch approach)

So the obvious choice is to postpone the stopwords management after the synonym filter. Unfortunately, here there’s an issue: the stopword(s) removal has some unwanted side-effect in the generated token graph and the query parser generates a wrong query because it consumes the token stream at the end of the chain.

Let’s imagine we have the following query:

q=tv went out of warranty something of

it will generate the following:

title:tv title:went (title:oow PhraseQuery(title:"out ? warranty something"))

As you can see, the synonym (out of warranty -> oow) is correctly detected but the stopwords filter removes all the “of” tokens, even if the first occurrence is part of a synonym. In the generated query you can see the sneaky effect: the “hole” created by the first “of” occurrence removal, produces the inclusion, in the phrase query, of the next available token in the stream (“something”, in the example).

In other words, the oow token synonym is marked with a positionLength = 3, which correctly means it spans three tokens (1=out, 2=of, 3=warranty); later, the query parser will include the next three available terms for generating a synonym phrase queries but since we no longer have the 2nd token (of), such count includes also “something”, which is the 3rd available token in the stream.

Before proceeding: this is a known problem, a long-standing issue [1] in Lucene which has a broader domain because it is related with the FilteringTokenFilter, the superclass of StopFilter.

The problem we will try to solve is: how can we manage synonyms and stopwords at query time without generating the conflict above?

A Solution

A note first: the token filter we are going to create is something that deals only with Lucene classes. However, when things need to be plugged in a runtime container (e.g. Apache Solr or Elasticsearch) the deployment procedure depends on the target platform: we won’t cover this part here.

The proposed solution is to create a StopFilter subclass which will be “synonym-aware”; it will check the tokenType and positionLength attributes before deciding if a token needs to be removed from the stream. The goal is to avoid removing those terms which have been defined in the stopwords list but are part of a synonym definition.

The class that we are going to extends is org.apache.lucene.analysis.core.StopFlter. This is an empty class, because all the filtering logic is in the superclasses (org.apache.lucene.analysis.StopFilter and the more generic org.apache.lucene.analysis.FilteringTokenFilter). The stopwords logic resides in the accept() method, which as you can see is very simple:

protected boolean accept() {
  return !stopWords.contains(termAtt.buffer(), 0, termAtt.length());

If the stopwords list contains the current term, it will be removed. So far, so good. We need to extend (actually we could also decorate) the StopFilter class for doing something else before calling the logic above.

First we need to check the token type: if a token has been marked as a SYNONYM then our filter doesn’t have to remove it. Then we need to check the positionLength attribute, because, within a synonym detection context, a position length greater than 1 means we have traversing a multi-term synonym:

public class SynonymAwareStopFilter extends StopFilter {

  private TypeAttribute tAtt = 
  private PositionLengthAttribute plAtt = 

  private int synonymSpans;

  protected SynonymAwareStopFilter(
                         TokenStream in, CharArraySet stopwords) {
    super(in, stopwords);

  protected boolean accept() {
    if (isSynonymToken()) {
      synonymSpans = plAtt.getPositionLength() > 1 
                             ? plAtt.getPositionLength() 
                             : 0;
      return true;

    return (--synonymSpans > 0) || super.accept();

  private boolean isSynonymToken() {
    return "SYNONYM".equals(tAtt.type());

Let’s do some test. We will use Apache Solr 7.4.0 for checking the results. Here is the field type definition, where you can see our SynonymAwareStopFilter:

<fieldtype name="text" class="solr.TextField" autoGeneratePhraseQueries="true">
       <analyzer type="index">
           <tokenizer class="solr.StandardTokenizerFactory"/>
           <filter class="solr.LowerCaseFilterFactory"/>
       <analyzer type="query">
           <tokenizer class="solr.StandardTokenizerFactory"/>
           <filter class="solr.LowerCaseFilterFactory"/>
           <filter class="solr.SynonymGraphFilterFactory" 
           <filter class="sc.SynonymAwareStopFilterFactory" 

and this is a minimal request handler:

<requestHandler name="/def" class="solr.SearchHandler" default="true">
       <lst name="defaults">
           <bool name="sow">false</bool>
           <str name="df">title</str>
           <str name="defType">lucene</str>
           <bool name="debug">true</bool>

Running the previous query:

q=tv went out of warranty something of

we have the following:

title:tv title:went (title:oow PhraseQuery(title:"out of warranty")) title:something

if we use instead the other synonym variant:

q=tv went oow something of

we have the following:

title:tv title:went (PhraseQuery(title:"out of warranty") title:oow) title:something

Everything seems working as expected! This is probably just one specific scenario among those addressed by LUCENE-4065; however, it helped me a lot because this is (at least in my experience) a frequent use case.

As usual, any feedback is warmly welcome. See you next time!



Apache Solr/Elasticsearch: How to Manage Multi-term Concepts out of the Box?

This flash blog post will address a very specific and common problem : how to manage entities/concepts composed by multiple terms in a vanilla Apache Solr/Elasticsearch instance ( no plugins or extensions to install).

The (deployment) context

An Elasticsearch or Apache Solr infrastructure where you cannot install third-party components (e.g. plugins, filters, query parsers). This can happen for several reasons:

  • endogenous factors: lack of required expertise/skills for implementing/installing things in your infrastructure.
  • exogenous factors: search capabilities live in an external and managed context which doesn’t allow custom components. This happens for example with services like the Amazon Elasticsearch Service [1].

The Problem

How can I model multi-terms concepts (i.e. concepts composed by multiple terms)?

Concepts are a fundamental piece of a domain specific vocabulary: “United States of America”, “Phone Number”, “Out of Warranty” are just examples of entities you probably want to manage as a whole; if a user searches something like “How can I transfer a Phone Number?” you probably don’t want to return things about numbers or phones, which have a broader scope (i.e. sacrificing precision in favour of recall) .

Note, “probably” is in bold because there’s not an absolute truth here: everything depends on the functional context where the application is running. Here we assume this requirement, but things could be different in another context.

Index/Query Time Solutions

The problem can be solved using two different approaches that involves Indexing time and Query time configurations :

  • Simple Contraction: using the SynonymFilter with Simple Contraction [2] [3] in order to inject a single term that represents the concepts (with optional synonyms)
Multimedia Messaging Service,Multimedia Text Message => mms
  • Shingles: combining shingles and a keep word filter for generating bigrams and trigrams from a given text (assuming we are limiting our interest only to concepts composed by a maximum of 3 terms)

…an Additional Constraint: Query-Time Only

One of the drawbacks of the first group (index + query time) is the cost of reindexing the whole dataset when a change occurs in the synonyms or in the keywords list. So the additional constraint we will introduce is: we want to be able to change the concepts list at runtime without any reindexing.

Again, this is not an absolute constraint: there are a lot of scenarios where reindex everything is completely ok. In my experience this has a direct correlation with the index size plus how the whole reindexing process takes.

So in other words: if the full reindexing process takes a reasonable amount of time and it doesn’t produce any service interruption, then you should consider removing the “query-time only” constraint.

A Solution

The synonym management has been enhanced in Elasticsearch and Apache Solr [4], with the introduction of “graph-aware” token filters, for enabling a full support of Multi-Word Synonyms. 

Prior to that, both index/query time and query time approaches suffered from some limitations when dealing with synonyms [5].

In any case, the current implementation allows us to correctly manage multi-word synonyms as a whole. So, coming back to our question, we can try to “shape” the synonym filter at our wills for managing compound concepts as well.

First, a compound concept can or cannot have synonyms.

Concept with Synonym(s)

If the concept has one or more synonyms, we are within a regular context of the synonym filter. Here’s a sample content of the synonyms.txt file:

Multimedia Messaging Service,Multimedia Text Message,MMS
USA,United States of America

Here’s the corresponding configuration:

<fieldtype name="txt" 
       <analyzer type="index">
           <tokenizer class="solr.StandardTokenizerFactory"/>
           <filter class="solr.LowerCaseFilterFactory"/>
       <analyzer type="query">
           <tokenizer class="solr.StandardTokenizerFactory"/>
           <filter class="solr.LowerCaseFilterFactory"/>
<field name="title" type="txt" indexed="true" stored="true"/>

"analysis": {
    "filter": {
      "english_synonyms": {
         "type": "synonym_graph",
         "synonyms_path": "synonyms.txt",
         "expand": true
    "analyzer": {
      "text_index_analyzer": {
        "tokenizer": "standard",
        "filter": [ "lowercase" ]
      "text_query_analyzer": {
        "tokenizer": "standard",
        "filter": [
"properties": {
   "title": {
      "type": "text",
      "analyzer": "text_index_analyzer",
      "search_analyzer": "text_query_analyzer"

Issuing the following queries:

q={!lucene}multimedia messaging service&sow=false&df=title

  "query": {
     "match": {
       "title": "Multimedia messaging service"

   "query": {
     "query_string": {
       "query": "Multimedia messaging service",
       "default_field": "title",

produces (use debug=true in Solr or _validate/query?explain=true in Elasticsearch) the following query:

(PhraseQuery(title:"multimedia text message") 
 PhraseQuery(title:"multimedia messaging service"))

which is exactly what we want: the concept expressed in the query has been detected and the output query is looking for its expanded (united states of america) or contracted (usa) form. So far, so good.

Concept without Synonyms

What about if a multi-term concept doesn’t have any variant/synonym but we still want to manage it as a whole? Can we use the synonym filter without synonyms? Let’s try to see what happens.

The first obvious idea is to declare something like this, in our synonyms.txt:

Multimedia Messaging Service  

That is, a line containing our concept without any synonym. Unfortunately this is not working, the engine detects there are no synonyms and the resulting query is something like this:

title:multimedia title:messaging title:service

The same happens if you put a dummy synonym which is removed later, in the indexing chain, by a stopword filter. Something like:

Multimedia Messaging Service, something_that_will_be_configured_as_stopword

So the last chance is to duplicate the entry: something like this:

Multimedia Messaging Service,Multimedia Messaging Service

Here things start to be interesting. Running the same queries above we got the following explain:

(PhraseQuery(title:"multimedia messaging service") 
 PhraseQuery(title:"multimedia messaging service"))

title:"multimedia messaging service"^2.0

The phrase query is doubled, which means the score assigned to matches will reflect this, as illustrated in the example explain below (see the 2.0 boost):

4.544185 = sum of:
  4.544185 = weight(title:"multimedia messaging service" in 1) [SchemaSimilarity], result of:
    4.544185 = score(doc=1,freq=1.0 = phraseFreq=1.0
), product of:
      2.0 = boost
      2.7725887 = idf(), sum of:
        0.6931472 = idf, computed as ... from:
          1.0 = docFreq
          2.0 = docCount
        0.6931472 = idf, computed as ... from:
          1.0 = docFreq
          2.0 = docCount
        0.6931472 = idf, computed as ... from:
          1.0 = docFreq

That is, the score of the match above would have been 2.2720926 (4.544185 / 2) but since we have that artificial boost it has been doubled; at the same time it’s important to underline our concept has been correctly managed as a whole, and the queries above won’t return items related with messages, multimedia or services, which are broader concepts.

Is that boost factor a problem? That actually depends on your application: you should

  • have a representative number of search cases
  • try a plain term-centric search
  • try the approach suggested above
  • use a search quality evaluation tool
  • compare and choose


  • You have an Elasticsearch or Apache Solr cluster
  • You cannot install custom plugins
  • You want to manage compound concepts
  • You don’t want to reindex your corpus when adding / removing / updating the concepts list
  • If a concept has one or more synonyms, this is quite straightforward: use the synonym (graph) filter at query time*
  • if a concept doesn’t have any synonym, you can still use the synonym (graph) filter: just double the concept definition, but keep in mind the double (2.0) boost applied to the corresponding phrase query*

* Apache Solr users: make sure the target field type has autoGeneratePhraseQueries set to true, and the sow parameter (defined in the RequestHandler settings or as a request parameter) set to true as well.






Apache Solr: orchestrating Known item and Full-text search


You’re working as a search engineer for XYZ Ltd, a company which sells electric components. XYZ provided you the application logs of the last six months, and some business requirements.

Two kinds of customers, two kinds of requirements, two kinds of search

The log analysis shows that XYZ has mainly two kinds of customers: a first group, the “expert” users (e.g. electricians, resellers, shops) whose members are querying the system by product identifiers, codes (e.g. SKU, model codes, thinks like Y-M8GB, 140-213/A and ABD9881); it’s clear, at least it seems so, they already know what they want and what they are looking for. However, you noticed a lot of such queries produce no results. After investigating, the problem seems to be that codes and identifiers are definitely hard to remember: queries use a lot of disparate forms for pointing to the same product. For example:

  • y-m8gb (lowercase)
  • YM8GB (no delimiters)
  • YM-8GB (delimiter in a wrong place)
  • Y/M8GB (wrong delimiter)
  • Y M8GB (whitespace instead of delimiter)
  • y M8/gb (a combination of cases above)

This kind of scenario, where there’s only one relevant document in the collection, is usually referred to as “Known Item Search”: our first requirement is to make sure this “product identifier intent” is satisfied.

The other group of customers are end-users, like me and you. Being not so familiar with product specs like codes or model codes, the behaviour here is different: they use a plain keyword search, trying to match products by entering terms which represents names, brands, manufacturer. An here it comes the second requirement which can be summarized as follows: people must be able to find products by entering plain free-text queries.

As you can imagine, in this case search requirements are different from the other scenario: the focus here is more “term-centric”, therefore involving different considerations about the text analysis we’d need to apply.

While the expert group query is supposed to point to one and only one product (we are in a black / white scenario: match or not), the needs on the other side require the system to provide a list of “relevant” documents, according to the terms entered.

An important thing / assumption before proceeding: for illustration purposes we will consider those two queries / user groups as disjoint: that is, a given user belongs only to one of the mentioned groups, not both. Better, a given user query will contain product identifiers or terms, not both. 

Schema & configuration notes

The expert group, and the “Known Item Search”

The “product identifier” intent, which is assumed to be implicit in the query behaviour of this group, can be captured, both at index and query time, by applying the following analyzer, which basically treats the incoming value as a whole, normalizes it to lower case, removes all delimiters and finally collapses everything in a single output token.

<fieldtype name="identifier" class="solr.TextField" omitNorms="true">
        <tokenizer class="solr.KeywordTokenizerFactory" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.WordDelimiterGraphFilterFactory"
                splitOnCaseChange="0" />
<field name="product_id" type="identifier" indexed="true" ... />

In the following table you can see the analyzer in action with some example:

As you can see, the analyzer doesn’t declare a type attribute because it is supposed to be applied both at index and query time. However, there’s a difference in the incoming value: at index time the analyzer is dealing with a field content (i.e. the value of a field of an incoming document), while at query time the value which flows through the pipeline is composed by one or more terms entered by the user (a query, briefly).

While at index time everything works as expected, at query time the analyzer above requires a feature that has been introduced in Solr 6.5: the “Split On Whitespace” flag [1]. When it is set to “false” (as we need here in this context), it causes the incoming query text to be kept as a single whole unit, when sent to the analyzer.

Prior to Solr 6.5 we didn’t have such control, and the analyzers were receiving a “pre-tokenized-by-whitespaces” tokens; in other words, the unit of work of the query-time analysis was the single term: the analyzer chain (including the tokenizer itself) was invoked for each term outputted by that pre-whitespace-tokenization. As consequence of that our analyzer, at query time, couldn’t work as expected: if we take the example #5 and #6 from the table above, you can see the user entered a whitespace. With the “Split on Whitespace” flag set to true (explicitly, or using a Solr < 6.5), the pre-tokenization described above produces two tokens:

  • #5 = {“Y”, ”M8GB”}
  • #6 = {“y”, “M8/gb”}

so our analyzer would receive 2 tokens (for each case) and there won’t be any match with the single term ym8gb stored in the index. So, prior to Solr 6.5 we had two ways for dealing with this requirement:

  • client side: wrapping the whole query with double quotes, escaping whitespaces with “\”, or replacing them with a delimiter like “-“. Easy, but it requires a control on the client code, and this is not always possible.
  • Solr side: applying to the incoming query the same transformations as above but this time at query parser level. Easy, if you know some Lucene / Solr internals. In addition it requires a context where you have permissions for installing custom plugins in Solr. A similar effect could be obtained also using an UpdateRequestProcessor which would create a new field with the same value of the original field but without any whitespace.

The end-users group, and the full-text search query

In this case we are within a “plain” full-text search context, where the analysis identified a couple of target fields: product names and brands.

Differently from the previous scenario, here we don’t have a unique and deterministic way to satisfy the search requirement. It depends on a lot of factors: the catalog, the terms distribution, the implementor experience, the customer expectations in terms of user search experience. All these things can lead to different answers. Just for example, here’s a possible option:

<fieldType name="brand" class="solr.TextField" omitNorms="true">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory" 

<fieldType name="name" class="solr.TextField">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPossessiveFilterFactory"/>
        <filter class="solr.EnglishMinimalStemFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
        <filter class="solr.LengthFilterFactory" min="2" max="50" />

The focus here is not on the schema design itself: the important thing to underline is that this requirement needs a completely different configuration from the “Known Item Search” previously described.

Specifically, let’s assume we ended up following a “term-centric” approach for satisfying the second requirement. The approach requires a different value for the “Split on Whitespace” parameter, which has to be set to true, in this case.

The “sow” parameter can be set at SearchHandler level, so it is applied at query time. It can be declared within the solrconfig.xml and, depending on the configuration, it can be overridden using a named (HTTP) query parameter.

A “split on whitespace” pre-tokenisation leads us on a scenario which is really different from the “Known Item Search”, where instead we “should” be in a field-centric search; “should” is double-quoted because if, from one side, we are actually using a field-centric search, on the other side we are on an edge case where we’re querying one single field with one single query term (the first analyzer in this post always outputs one term).

The implementation


Although one could think the first thing is about how to combine those two different query strategies, prior to that, the question we need to answer is where to implement the solution? Clearly, regardless the way we will decide to follow, we will have to implement a (search) workflow, which can be summarised in the following diagram:

Known Item Search in Apache Solr

On Solr side, each “search” task needs to be executed in a different SearchHandler, so returning to our question: where do we want to implement such workflow? We have three options: outside, between or inside Solr.

#1: Client-side implementation

The first option is to implement the flow depicted above in the client application. That assumes you have the required control and programming skills on that side. If this assumption is true, then it’s relatively easy to code the workflow: you can choose one of the client API binding available for your language and then implement the double + conditional search illustrated above.

  • Pros: easy to implement. It requires a minimal Solr (functional) knowledge.
  • Cons: the search workflow / logic is moved on the client side. Programming is required, so you must be in a context where this can be done and where the client application code is under your control.

#2: Man-in-the-middle

Moving things outside the client sphere, another popular option, which can be still seen as a client-side alternative (from the Solr perspective), is a proxy / adapter / facade. Whatever is the name you want to give to this stuff, this is a new module which sits between the client application and Solr; it would intercept all requests and it would implement the custom logic by orchestrating the search endpoints exposed in Solr.

Being a new module, it has several advantages:

  • it can be coded using your preferred language
  • it is completely decoupled from the client application, and from Solr as well

but for the same reason, it has also some disadvantages:

  • it must be created: designed, implemented, tested, installed and maintained
  • it is a new piece in your system, which necessarily increases the overall complexity of the architecture
  • Solr exposes a lot of (index & search) services. With this option, all those services should be proxied, therefore resulting in a lot of unnecessary delegations (i.e. delegate services that don’t add any value to the execution chain).

#3: In Solr

The last option moves the workflow implementation (and the search logic) in the place where, in my opinion, it should be: in Solr.

Note that this option is usually not only a “philosophical” choice: if you are a search engineer, most probably you will be hired for designing, implementing and tuning the “search-side of the cake”. That means it’s perfectly possible that, for a lot of reasons, you must think to the client application as an external (sub)system, where you don’t have any kind of control.

The main drawback of this approach is that, as you can imagine, it requires programming skills plus a knowledge about the Solr internals.

In Solr, a search request is consumed by a SearchHandler, a component which is in charge of executing the logic associated with a given search endpoint. In our example, we would have the following search handlers matching the two requirements:

<!-- Known Item search -->
<requestHandler name="/known_item_search" class="solr.SearchHandler">
   <lst name="invariants">
        <str name="defType">lucene</str>
        <bool name="sow">false</bool> <!-- No whitespace split -->
        <str name="df">product_id</str>

<!-- Full-text search -->
<requestHandler name="/full-text-search" class="solr.SearchHandler">
    <lst name="invariants">
         <bool name="sow">true</bool> <!--Whitespace split -->
         <str name="defType">edismax</str>
         <str name="df">product_name</str>
         <str name="qf">

On top of that, we would need a third component, which would be in charge to orchestrate the two search handlers above. I’ll call this component a “Composite Request Handler”.

The composite handler would also provide the public search endpoint called by clients. Once a request is received, the composite request handler implements the search workflow: it invokes all the handlers that compose its chain, and it will stop when one the invocation target produces the expected result.

The composite handler configuration looks like this:

<requestHandler name="/search" class=".....">
    <str name="chain">/know_item_search,/full_text_search</str>

On the client side, that would require only one request because the entire workflow will be implemented in Solr, by means of the composite request handler. In other words, imagining a GUI with a search bar, the client application, when the search button is pressed, would have to retrieve the term(s) entered by the user and send just one request (to the composite handler endpoint), regardless the intent of the user (i.e. regardless the group the user belongs to).

The composite request handler introduced in this section has been already implemented, you can find it in our Github account, here[2].

Enjoy and, as usual, any feedback is warmly welcome!



Give the height the right weight: quantities detection in Apache Solr

Quantity detection? What is a quantity? And why do we need to detect it?

A quantity, as described by Martin Fowler in his “Analysis Patterns” [1] is defined as a pair which combines an amount and unit (such as 30 litres, 0.25 cl, or 140 cm). In search-based applications, there are many cases where you may want to classify your searchable dataset using dimensioned attributes, because such quantities have a special meaning within the business context you are working on. The first example that comes in my mind?

Apache Solr Quantity Detection Plugin

Beer is offered in several containers (e.g. cans, bottles); each of them is available in multiple sizes (e.g. 25 cl, 50 cl, 75 cl or 0.25 lt, 0.50 lt, 0.75 lt). A good catalog would capture these information in dedicated fields, like “container” (bottle, can) and “capacity” (25cl, 50cl, 75cl in the example above): in this way the search logic can properly make use of them. Faceting (and subsequent filtering) is a good example of what the user can do after a first search has been executed: he can filter and refine results, hopefully finding what he was looking for.

But if we start from the beginning of a user interaction, there’s no result at all: only the blank textfield where the user is going to type something. “Something” could be whatever, anything (in his mind) related with the product he wants to find: a brand, a container type, a model name, a quantity. In few words: anything which represents one or more relevant features of the product he’s looking for.

So one of the main challenge, when implementing a search logic, is to get the point about the meaning of the entered terms. This is in general a very hard topic, often involving complicated stuff (e.g. machine learning), but sometimes things move on an easier side, especially when concepts, we want to detect, follow a common and regular pattern: like a quantity.

The main idea behind the quantity detection plugin [2] we developed at Sease is the following: starting from the user entered query, first it detects the quantities (i.e. the amounts and the corresponding units); then, these information will be isolated from the main query and they will be used for boosting up all products relevant to those quantities. Relevancy here can be meant in different ways:

  • exact match: all bottles with a capacity of 25cl
  • range match: all bottles with a capacity between 50cl and 75cl.
  • equivalence exact match: all bottles with a capacity of 0.5 litre (1lt = 100cl)
  • equivalence range match: all bottles with a capacity between 0.5 and 1 litre (1lt = 100cl)

The following is a short list with a brief description of all supported features:

  • variants: a unit can have a preferred form and (optionally) several variants. This can include different forms of the same unit (e.g. mt, meter) or an equivalent unit in a different metric system (e.g. cl, once)
  • equivalences: it’s possible to define an equivalence table so units can be converted at runtime (“beer 0.25 lt” will have the same meaning of “beer 25cl”). An equivalence table maps a unit with a conversion factor.
  • boost: each unit can have a dedicated boost, especially useful for weighting multiple matching units.
  • ranges: each unit can have a configured gap, which triggers a range query where the detected amount can be in the middle (PIVOT), at the beginning (MIN) or at the end (MAX) of the generated range
  • multi-fields: in case we have more than one attribute using the same unit (e.g. height, width, depth)
  • assumptions: in case an “orphan” amount (i.e an amount without a unit) is detected, it’s possible to define an assumption table and let Solr guess the unit.

Feel free to have a try, and if you think it could be useful, please share with us your idea and / or your feedback.



Exploring Solr Internals : The Lucene Inverted Index


This blog post is about the Lucene Inverted Index and how Apache Solr internally works.

When playing with Solr systems, understanding and properly configuring the underline Lucene Index is fundamental to deeply control your search.
With a better knowledge of how the index looks like and how each component is used, you can build a more performant, lightweight and efficient solution.
Scope of this blog post is to explore the different components in the Inverted Index.
This will be more about data structures and how they contribute to provide Search related functionalities.
For low level approaches to store the data structures please refer to the official Lucene documentation [1]

Lucene Inverted Index

The Inverted Index is the basic data structure used by Lucene to provide Search in a corpus of documents.
It’s pretty much quite similar to the index in the end of a book.
From wikipedia :

“In computer science, an inverted index (also referred to as postings file or inverted file) is an index data structure storing a mapping from content, such as words or numbers, to its locations in a database file, or in a document or a set of documents.”

In Memory / On Disk

The Inverted index is the core data structure that is used to provide Search.
We are going to see in details all the components involved.
It’s important to know where the Inverted index will be stored.
Assuming we are using a FileSystem Lucene Directory, the index will be stored on the disk for durability ( we will not cover here the Commit concept and policies, so if curious [2]) .
Modern implementation of the FileSystem Directory will leverage the OS Memory Mapping feature to actually load into the memory ( RAM ) chunk of the index ( or possibly all the index) when necessary.
The index in the file system will look like a collection of immutable segments.
Each segment is a fully working Inverted Index, built from a set of documents.
The segment is a partition of the full index, it represents a part of it and it is fully searchable.
Each segment is composed by a number of binary files, each of them storing a particular data structure relevant to the index, compressed [1] .
To simplify, in the life of our index, while we are indexing data, we build segments, which are merged from time to time ( depending of the configured Merge Policy).
But the scope of this post is not the Indexing process but the structure of the Index produced.


Hands on !

Let’s assume in input 3 documents, each of them with 2 simple fields, and see how a full inverted Index will look like :

    {  “id”:”c”,
        “title”:”video game history”

    {  “id”:”a”,
        “title”:”game video review game”

    {  “id”:”b”,
        “title”:”game store”

Depending of the configuration in the schema.xml, at indexing time we generate the related data structure.

Let’s see the inverted index in the complete form, then let’s explain how each component can be used , and when to omit part of it.

.tg {border-collapse:collapse;border-spacing:0;} .tg td{font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;} .tg th{font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;} .tg .tg-vyw9{font-family:”Courier New”, Courier, monospace !important;} .tg .tg-s8ba{font-weight:bold;font-family:”Courier New”, Courier, monospace !important;;background-color:#efefef} .tg .tg-vc88{font-weight:bold;font-family:”Courier New”, Courier, monospace !important;}

Field id
Ordinal Term Document Frequency Posting List
0 a 1 : 1 : [1] : [0-1]
1 b 1 : 1 : [1] : [0-1]
2 c 1 : 1 : [1] : [0-1]
Field title
Ordinal Term Document Frequency Posting List
0 game 3 0 : 1 : [2] : [6-10],
1 : 2 : [1, 4] : [0-4, 18-22],
: 1 : [1] : [0-4]
1 history 1 : 1 : [3] : [11-18]
2 review 1 : 1 : [3] : [11-17]
3 store 1 2 : 1 : [2] : [5-10]
4 video 2 0 : 1 : [1] : [0-5],
1 : 1 : [2] : [5-10],

This sounds scary at the beginning let’s analyse the different component of the data structure.

Term Dictionary

The term dictionary is a sorted skip list containing all the unique terms for the specific field.
Two operations are permitted, starting from a pointer in the dictionary :
next() -> to iterate one by one on the terms
advance(ByteRef b) -> to jump to an entry >= than the input  ( this operation is O(n) = log n where n= number of unique terms).
An auxiliary Automaton is stored in memory, accepting a set of smart calculated prefixes of the terms in the dictionary.
It is a weighted Automaton, and a weight will be associated to each prefix ( i.e. the offset to look into the Term Dictionary) .
This automaton is used at query time to identify a starting point to look into the dictionary.
When we run a query ( a TermQuery for example) :
1) we give in input the query to the In Memory Automaton, an Offset is returned
2) we access the location associated to the Offset in the Term Dictionary
3) we advance to the ByteRef representation of the TermQuery
4) if the term is a match for the TermQuery we return the Posting List associated

Document Frequency

This is simply the number of Documents in the corpus containing the term  t in the field f .
For the term game we have 3 documents in our corpus that contain the term in the field title

Posting List

The posting list is the sorted skip list of DocIds that contains the related term.
It’s used to return the documents for the searched term.
Let’s have a deep look to a complete Posting List for the term game in the field title :
0 : 1 : [2] : [6-10],
1 : 2 : [1, 4] : [0-4, 18-22],
: 1 : [1] : [0-4]

Each element of this posting list is :
Document Ordinal : Term Frequency : [array of Term Positions] : [array of Term Offset] .

Document Ordinal -> The DocN Ordinal (Lucene ID) for the DocN document in the corpus containing the related term.
Never relies at application level on this ordinal, as it may change over time ( during segments merge for example).
According to the starting ordinals of each Posting list element :
Doc0 0,
Doc1 1,
Doc2 2
contain the term game in the field title .

Term Frequency -> The number of occurrences of the term in the Posting List element .
0 : 1 
Doc0 contains 1 occurrence of the term game in the field title.
1 : 2
Doc1 contains 2 occurrence of the term game in the field title.
: 1
Doc2 contains 1 occurrence of the term game in the field title.

Term Positions Array -> For each occurrence of the term in the Posting List element, it contains the related position in the field content .
0 : [2]
Doc0 1st occurrence of the term game in the field title occupies the 2nd position in the field content.
“title”:”video(1) game(2) history(3)” )
1 : [1, 4] 
Doc1 1st occurrence of the term game in the field title occupies the 1st position in the field content.
Doc1 2nd occurrence of the term game in the field title occupies the 4th position in the field content.
“title”:”game(1) video(2) review(3) game(4) )
: [1] 
Doc0 1st occurrence of the term game in the field title occupies the 1st position in the field content.
“title”:”game(1) store(2)” )

Term Offsets Array ->  For each occurrence of the term in the Posting List element, it contains the related character offset in the field content .
0 : [6-10]
Doc0 1st occurrence of the term game in the field title starts at 6th char in the field content till 10th char ( excluded ).
“title”:”video(1) game(2) …” )
“title”:”01234 5 6789 10  …” )
1 : [0-4, 18-22]
Doc1 1st occurrence of the term game in the field title starts at 0th char in the field content till 4th char ( excluded )..
Doc1 2nd occurrence of the term game in the field title starts at 18th char in the field content till 22nd char ( excluded ).
“title”:”game(1) video(2) review(3) game(4) )
“title”:”0123 4   video(2) review(3) 18192021 22″ )
Doc0 1st occurrence of the term game in the field title occupies the 1st position in the field content.
“title”:”game(1) store(2)” )
“title”:”0123 4 store(2)” )

Live Documents

Live Documents is a lightweight data structure that keep the alive documents at the current status of the index.
It simply associates a bit  1 ( alive) , 0 ( deleted) to a document.
This is used to keep the queries aligned to the status of the Index, avoiding to return deleted documents.
Note : Deleted documents in index data structures are removed ( and disk space released) only when a segment merge happens.
This means that you can delete a set of documents, and the space in the index will be claimed back only after the first merge involving the related segments will happen.
Assuming we delete the Doc2, this is how the Live Documents data structure looks like :
Ordinal Alive
0 1
1 1
2 0
Ordinal -> The Internal Lucene document ID
Alive -> A bit that specifies if the document is alive or deleted.


Norms is a data structure that provides length normalisation and boost factor per Field per Document.
The numeric values are associated per FIeld and per Document.
This value is associated to the length of the content value and possibly an Indexing time boost factor as well.
Norms are used when scoring Documents retrieved for a query.
In details it’s a way to improve the relevancy of Documents containing the term in short fields .
A boost factor can be associated at indexing time as well.
Short field contents will win over long field field contents when matching a query with Norms enabled.

Field title
Doc Ordinal Norm
0 0.78
1 0.56
2 0.98


Schema Configuration

When configuring a field in Solr ( or directly in Lucene) it is possible to specify a set of field attributes to control which data structures are going to be produced .
Let’s take a look to the different Lucene Index Options
Lucene Index Option Solr schema Description To Use When …
NONE indexed=”false” The inverted index will not be built. You don’t need to search in your corpus of documents.
DOCS omitTermFreqAnd
The posting list for each term will simply contain the document Ids ( ordinal) and nothing else.

game -> 0,1,2 
– You don’t need to search in your corpus with phrase or positional queries. 
-You don’t need score to be affected by the number of occurrences of a term in a document field.
DOCS_AND_FREQS omitPositions=”true” The posting list for each term will simply contain the document Ids ( ordinal) and term frequency in the document.

game -> 0 : 11 : 2 : 1 
– You don’t need to search in your corpus with phrase or positional queries.
– You do need scoring to take Term Frequencies in consideration
Default when indexed=”true” The posting list for each term will contain the term positions in addition.

0 : 1 : [2], 1 : 2 : [1, 4], : 1 : [1]
– You do need to search in your corpus with phrase or positional queries.
– You do need scoring to take Term Frequencies in consideration
storeOffsetsWithPositions =”true” The posting list for each term will contain the term offsets in addition. ->0 : 1 : [2] : [6-10],
1 : 2 : [1, 4] : [0-4, 18-22],
: 1 : [1] : [0-4]

– You want to use the Posting Highlighter.
A fast version of highlighting that uses the posting list instead of the term vector.
omitNorms omitNorms=”true” The norms data structure will not be built – You don’t need to boost short field contents
– You don’t need Indexing time boosting per field

[1] Lucene File Formats
[2] Understanding commits and Tlog in Solr