Give the height the right weight: quantities detection in Apache Solr

Quantity detection? What is a quantity? And why do we need to detect it?

A quantity, as described by Martin Fowler in his “Analysis Patterns” [1] is defined as a pair which combines an amount and unit (such as 30 litres, 0.25 cl, or 140 cm). In search-based applications, there are many cases where you may want to classify your searchable dataset using dimensioned attributes, because such quantities have a special meaning within the business context you are working on. The first example that comes in my mind?

Apache Solr Quantity Detection Plugin

Beer is offered in several containers (e.g. cans, bottles); each of them is available in multiple sizes (e.g. 25 cl, 50 cl, 75 cl or 0.25 lt, 0.50 lt, 0.75 lt). A good catalog would capture these information in dedicated fields, like “container” (bottle, can) and “capacity” (25cl, 50cl, 75cl in the example above): in this way the search logic can properly make use of them. Faceting (and subsequent filtering) is a good example of what the user can do after a first search has been executed: he can filter and refine results, hopefully finding what he was looking for.

But if we start from the beginning of a user interaction, there’s no result at all: only the blank textfield where the user is going to type something. “Something” could be whatever, anything (in his mind) related with the product he wants to find: a brand, a container type, a model name, a quantity. In few words: anything which represents one or more relevant features of the product he’s looking for.

So one of the main challenge, when implementing a search logic, is to get the point about the meaning of the entered terms. This is in general a very hard topic, often involving complicated stuff (e.g. machine learning), but sometimes things move on an easier side, especially when concepts, we want to detect, follow a common and regular pattern: like a quantity.

The main idea behind the quantity detection plugin [2] we developed at Sease is the following: starting from the user entered query, first it detects the quantities (i.e. the amounts and the corresponding units); then, these information will be isolated from the main query and they will be used for boosting up all products relevant to those quantities. Relevancy here can be meant in different ways:

  • exact match: all bottles with a capacity of 25cl
  • range match: all bottles with a capacity between 50cl and 75cl.
  • equivalence exact match: all bottles with a capacity of 0.5 litre (1lt = 100cl)
  • equivalence range match: all bottles with a capacity between 0.5 and 1 litre (1lt = 100cl)

The following is a short list with a brief description of all supported features:

  • variants: a unit can have a preferred form and (optionally) several variants. This can include different forms of the same unit (e.g. mt, meter) or an equivalent unit in a different metric system (e.g. cl, once)
  • equivalences: it’s possible to define an equivalence table so units can be converted at runtime (“beer 0.25 lt” will have the same meaning of “beer 25cl”). An equivalence table maps a unit with a conversion factor.
  • boost: each unit can have a dedicated boost, especially useful for weighting multiple matching units.
  • ranges: each unit can have a configured gap, which triggers a range query where the detected amount can be in the middle (PIVOT), at the beginning (MIN) or at the end (MAX) of the generated range
  • multi-fields: in case we have more than one attribute using the same unit (e.g. height, width, depth)
  • assumptions: in case an “orphan” amount (i.e an amount without a unit) is detected, it’s possible to define an assumption table and let Solr guess the unit.

Feel free to have a try, and if you think it could be useful, please share with us your idea and / or your feedback.

[1] https://martinfowler.com/books/ap.html

[2] https://github.com/SeaseLtd/solr-quantities-detection-qparsers

ECIR 2018 Experience

Logo.png

This blog is a quick summary of my (subjective) experience at ECIR 2018 : the 40th European Conference on Information Retrieval, hosted in Grenoble (France) from 26/03/2018 to 29/03/2018.

Deep Learning and Explicability

Eight long papers accepted were about Deep Learning.
The topics “Neural Network” and “Word Embedding” were the most occurring in the accepted full papers (and rejected) at the conference. It is clear that deep learning technologies are strongly advancing as a de facto standard in Artificial Intelligence and this can be noticed also in Information Retrieval where these technologies can be used to better model the user behaviour and extract topics and semantic from documents.
But if with deep learning and the advanced capabilities of complex models you gain performance, on the other hand you  lose the ability of explaining and debugging why such output corresponds to a given input.
A recurring topic in the Deep Learning track ( and the Industry Day actually) was to find a balance in the performance gain of new techniques and the control over them.
It is an interesting topic and I believe it is just the other face of the coin, it is good though that Academia is noting the importance of such aspect : in the industry a technology requires control and maintenance much more than in the academic environment and most of the times the “debuggability” can affect the decision between a technology and another.

From Academic Papers to Production

The 2018 European Conference on Information Retrieval ended the 29/03 with a brilliant Industry Day focused on the perilous path from Research to Production.
This is the topic that most permeated the conference across different keynotes, sessions and informal discussions.
In Information Retrieval it is still difficult to converge successful researches into successful live systems : most of the times it is not clear which party should be interested in this process.
Okapi BM25 was first published and implemented in the 1980s and 1990s; it became the default similarity in Apache Lucene 6.0 in 2016 .
Academia is focused on finding new interesting problems to solve with inventive techniques and Industry is focused on finding the quickest solution that works (usually).
This brings a gap :  new solutions are not easily reproducible from academic papers and they are far from being practically ready to use outside the experimental controlled environment.
Brilliant researchers crave new interesting problems they can reason on and solve: their focus is to build a rough implementation and validate it with few metrics ( ideally with an accepted related publication ).
After that, their job is done, problem solved, the challenge is not interesting anymore and they can pass to the next problem.
Researchers get bored easily but they risk to never see their researches fulfilled, applied and used in real life.
Often Academia creates its own problems to solve them and get a publication : Publish or Perish -> no publications bring less funds.
This may be or may be not a personal problem, depending on the individual.
Industry is usually seen by academics as a boring place where you just apply consolidated techniques to just get the best result with a minimum effort.
Sometimes industry is just where you end up when you want to make some money and actually see your effort to bring benefits to some population.
And this was(is) true most of the times but with the IT explosion we are living and the boom of competition the situation nowadays is open to change, where a stronger connection between Academia and Industry can ( and should !) happen and conferences such as ECIR is a perfect ground to settle the basis.
So, building on the introduction let’s see a quick summary of the keynotes, topics and sessions that impressed me most from the conference!

From Academic Papers to Production : A Learning To Rank Story[1]

Let’s start from Sease contribution to the conference : a story about the adoption of Learning To Rank in a real world e-commerce scenario.
The session took place at the Industry Day (29/03) and focused on the challenges and pitfalls of moving from the research papers to production.
The entire journey was possible through Open Source software : Apache Solr and RankLib as main actors.
Learning To Rank is becoming extremely popular in the industry and it is a very promising and useful technology to improve the relevancy function leveraging the user behaviour.
But from an open source perspective is still quite a young technology and effort is required to get it right.
The main message that I wanted to transmit is : don’t be scared to fail, if something doesn’t work immediately out of the box, it doesn’t mean it’s not a valid technology, No Pain No Gain, Learning To Rank open source implementations are valid but require tuning and care to bring them to production with success ( and the improvement these technologies can bring is extremely valuable).

The Harsh Reality of Production Information Access Systems[2]

Recipient of the “Karen Spärck Jones Award” Fernando Diaz in this talk focused on the problems derived from the adoption of Information Retrieval technologies in production environments where a deep understanding of individuals, groups and society is required.
Building on the technical aspect involved in applying research in real world systems, the focus switched to the ethical side : are current IR systems moving in the direction of providing the user with a fair and equally accessible information or the monetisation process behind is producing just addicted users who see (and buy) ad hoc tailored information ?

Statistical Stemmers : A Reproducibility Study[3]

This paper is a sort of symbol of the conference trend : reproducibility is as important as the innovation that a research brings.
Winner of the Best Paper Award, it may have caused some perplexity among the audience ( why to reward a paper that is not innovating but just reproducing past techniques ?), but the message that transmits is clear : a research needs to be easily reproducible and effort is required in that direction for an healthy Research & Development flow that doesn’t target just the publication but a real world application.

Entity-centric Topic Extraction and Exploration: A Network-based Approach[4]

Interesting talk, it explores topic modelling over time in a network based fashion :
Instead of modelling a topic as ranked list of terms it uses a network (weighted graph) representation.
This may be interesting for an advanced More Like This implementation, it’s definitely worth an investigation.

Information Scent, Searching and Stopping : Modelling SERP Level Stopping Behaviour[5]

This talk focused on the entire Search Result Page as a possible signal that affects user stopping behaviour i.e. when a search result page is returned, the overall page quality affects the user perception of relevancy and may drive an immediate query reformulation OR a good abandonment ( when the information need is satisfied).
This something that I personally experimented : sometimes from a quick look to the result page you may realise if the search engine understood ( or misunderstood) your information need.
Different factors are involved in what the author calls “Information Scent” but definitely the perceived relevance (modelled through different User Experience approaches) is definitely an interesting topic that sits along the real relevance.
Further studies in this area may affect the way Search Results pages are rendered, to maximise the fruition of information.

Employing Document Embeddings to Solve the “New Catalog” Problem in User Targeting, and provide Explanations to the Users[6]

The new catalog problem is a practical problem for modern recommender systems and platforms. There are a lot of use cases where you have collection of items that you would like to recommend and this ranges over Music Streaming Platforms ( Playlists, Albums, ect), Video Streaming Platform ( Tv Series genres, To View lists ect) and many other domains.
This paper explores both the algorithm behind such recommendations and the explanation need : explaining to the user why a catalog may be relevant for his/her taste is as important as providing a relevant catalog of items.

Anatomy of an Idea: Mixing Open Source, research and Business

This keynote summarises the cornerstone of Sease culture :
Open Source as a bridge between Academia and Industry.
If every research paper should be implemented in a production ready open source platform as part of the publication process, the community would get a direct and immense benefit out of it.
Iterative improvement would get a great traction and generally speaking the entire scientific community will get a strong boost with a better accessibility .
Implementing researches in state of the art production ready open source systems ( where possible) would cut the adoption time at industry level, triggering an healthy process of utilisation and bug fixing.

Industry Day[7]

The industry day was the coronation of the overall trend that was permeating the conference : there is a strong need to build a better connection between the Academic and Industrial world.
And the good audience reception shown ( the organisers had to move the track to the main venue) is a proof that there is an increasingly stronger need of seeing interesting researches applied ( with success or failure) to the real world with the related lessons learned.
Plenty of talks in this session were brilliant, my favourites :

  • Fabrizio Silvestri (Facebook)
    Query Embeddings: From Research to Production and Back!
  • Manos Tsagkias (904Labs)
    A.I. for Search: Lessons Learned
  • Marc Bron (Schibsted Media Group)
    Managment of Industry Research: Experiences of a Research Scientist

In conclusion, the conference was perfectly organised in an excellent venue, the balance in topics and talks was fairly good( both academic and industrial) and I really enjoyed my time in Grenoble, see you next year in Cologne !

[1] https://www.slideshare.net/AlessandroBenedetti/ecir-2018-alessandro
[2] https://www.ecir2018.org/programme/keynote-speakers/
[3] http://www.dei.unipd.it/~silvello/papers/ECIR2018_SA.pdf
[4] https://dbs.ifi.uni-heidelberg.de/files/Team/aspitz/publications/Spitz_Gertz_2018_Entity-centric_Topic_Extraction.pdf
[5] https://strathprints.strath.ac.uk/62856/
[6] https://link.springer.com/chapter/10.1007%2F978-3-319-76941-7_28
[7] https://www.ecir2018.org/industry-day/

Solr Is Learning To Rank Better – Part 4 – Solr Integration

Last Stage Of The Journey

This blog post is about the Apache Solr Learning To Rank ( LTR ) integration.

We modelled our dataset, we collected the data and refined it in Part 1 .
Trained the model in Part 2 .
Analysed and evaluate the model and training set in Part 3 .
We are ready to rock and deploy the model and feature definitions to Solr.
I will focus in this blog post on the Apache Solr Learning To Rank ( LTR ) integration from Bloomberg [1] .
The contribution is completed and available from Apache Solr 6.4.
This blog is heavily based on the Learning To Rank ( LTR ) Bloomberg contribution readme [2].

Apache Solr Learning To Rank ( LTR ) integration

The Apache Solr Learning To Rank ( LTR ) integration allows Solr to rerank the search results evaluating a provided Learning To Rank model.
Main responsabilties of the plugin are :

– storage of feature definitions
– storage of models
– feature extraction and caching
– search result rerank

Features Definition

As we learnt from the previous posts, the feature vector is the mathematical representation of each document/query pair and the model will score each search result according to that vector.
Of course we need to tell Solr how to generate the feature vector for each document in the search results.
Here comes the Feature Definition file.
A Json array describing all the relevant features necessary to score our documents through the machine learned LTR model.

e.g.

[{ "name": "isBook",
  "class": "org.apache.solr.ltr.feature.SolrFeature",
  "params":{ "fq": ["{!terms f=category}book"] }
},
{
  "name":  "documentRecency",
  "class": "org.apache.solr.ltr.feature.SolrFeature",
  "params": {
      "q": "{!func}recip( ms(NOW,publish_date), 3.16e-11, 1, 1)"
  }
},
{
  "name" : "userTextTitleMatch",
  "class" : "org.apache.solr.ltr.feature.SolrFeature",
  "params" : { "q" : "{!field f=title}${user_text}" }
},
{
  "name":"book_price",
  "class":"org.apache.solr.ltr.feature.FieldValueFeature",
  "params":{"field":"book_price"}
},
{
  "name":"originalScore",
  "class":"org.apache.solr.ltr.feature.OriginalScoreFeature",
  "params":{}
},
{
   "name" : "userFromMobile",
   "class" : "org.apache.solr.ltr.feature.ValueFeature",
   "params" : { "value" : "${userFromMobile:}", "required":true }
}]  
SolrFeature
– Query Dependent
– Query Independent
A Solr feature is defined by a Solr query following the Solr sintax.
The value of the Solr feature is calculated as the return value of the query run against the document we are scoring.
This feature can depend from query time parameters or can be query independent ( see examples)
e.g.
“params”:{“fq”: [“{!terms f=category}book”] }
– Query Independent
– Boolean feature
If the document match the term ‘book’ in the field ‘category’ the feature value will be 1.
It is query independent as no query param affects this calculation.
“params”:{“q”: “{!func}recip( ms(NOW,publish_date), 3.16e-11, 1, 1)”}
– Query Dependent
– Ordinal feature
The feature value will be calculated as the result of the function query, more recent the document, closer to 1 the value.
It is query dependent as ‘NOW’ affects the feature value.
“params”:{“q”: “{!field f=title}${user_text}” }
– Query Dependent
– Ordinal feature
The feature value will be calculated as the result of the query, more relevant the title content for the user query, higher the value.
It is query dependent as the ‘user_text’ query param affects the calculation.
FieldValueFeature
– Query Independent
A Fiel Value feature is defined by a Solr field.
The value of the feature is calculated as the content of the field for the document we are scoring.
The field must be STORED or DOC-VALUED . This feature is query independent ( see examples)
e.g.
“params”:{“field”:”book_price”}
– Query Independent
– Ordinal feature
The value of the feature will be the content of the ‘book_price’ field for a given document.
It is query independent as no query param affects this calculation.
ValueFeature
– Query Level
– Constant
A Value feature is defined by a constant or an external query parameter.
The value of the feature is calculated as the value passed in the solr request as an efi(External Feature Information) parameter or as a constant.
This feature depends only on the param configured( see examples)
e.g.
“params” : { “value” : “${user_from_mobile:}”, “required”:false }
– Query Level
– Boolean feature
The user will pass the ‘userFromMobile’ request param as an efi
The value of the feature will be the value of the parameter
The default value will be assigned if the parameter is missing in the request
If it is required an exception will be thrown if the parameter is missing in the request“params” : { “value” : “5“, “required”:false }
– Constant
– Ordinal feature
The feature value will be calculated as the constant value of ‘5’ .Except the constant, nothing affect the calculation.
OriginalScoreFeature
– Query Dependent
An Original Score feature is defined with no additional parameters.
The value of the feature is calculated as the original lucene score of the document given the input query.
This feature depends from query time parameters ( see examples)
e.g.
“params”:{}
— Query Dependent
— Ordinal feature
The feature value will be the original lucene score given the input query.
It is query dependent as the entire input query affect this calculation.

EFI ( External Feature Information )

As you noticed in the feature definition json, external request parameters can affect the feature extraction calculation.
When running a rerank query it is possible to pass additional request parameters that will be used at feature extraction time.
We see this in details in the related section.

e.g.
rq={!ltr reRankDocs=3 model=externalmodel efi.user_from_mobile=1}

 

Deploy Features definition

Good, we defined all the features we require for our model, we can now send them to Solr :

curl -XPUT 'http://localhost:8983/solr/collection1/schema/feature-store' --data-binary @/path/features.json -H 'Content-type:application/json'  
 

View Features Definition

To visualise the features just sent, we can access the feature store:

curl -XGET 'http://localhost:8983/solr/collection1/schema/feature-store'  
 

Models Definition

We extensively explored how to train models and how models look like in the format the Solr plugin is expecting.
For details I suggest you reading : Part 2
Let’s have a quick summary anyway  :

 

Linear Model (Ranking SVM, Pranking)

e.g.

 {
    "class":"org.apache.solr.ltr.model.LinearModel",
    "name":"myModelName",
    "features":[
        { "name": "userTextTitleMatch"},
        { "name": "originalScore"},
        { "name": "isBook"}
    ],
    "params":{
        "weights": {
            "userTextTitleMatch": 1.0,
            "originalScore": 0.5,
            "isBook": 0.1
        }
    }
} 


Multiple Additive Trees (LambdaMART, Gradient Boosted Regression Trees )

e.g.

{
    "class":"org.apache.solr.ltr.model.MultipleAdditiveTreesModel",
    "name":"lambdamartmodel",
    "features":[
        { "name": "userTextTitleMatch"},
        { "name": "originalScore"}
    ],
    "params":{
        "trees": [
            {
                "weight" : 1,
                "root": {
                    "feature": "userTextTitleMatch",
                    "threshold": 0.5,
                    "left" : {
                        "value" : -100
                    },
                    "right": {
                        "feature" : "originalScore",
                        "threshold": 10.0,
                        "left" : {
                            "value" : 50
                        },
                        "right" : {
                            "value" : 75
                        }
                    }
                }
            },
            {
                "weight" : 2,
                "root": {
                    "value" : -10
                }
            }
        ]
    }
}  

Heuristic Boosted Model (experimental)

The Heuristic Boosted Model is an experimental model that combines linear boosting to any model.
It is currently available in the experimental branch [3].
This capability is currently supported only by the : org.apache.solr.ltr.ranking.HeuristicBoostedLambdaMARTModel .
The reason behind this approach is that sometimes, at training time we don’t have available all the features we want to use at query time.
e.g.
Your training set is not built on clicks of the search results and contains legacy data, but you want to include the original score as a boosting factor
Let’s see the configuration in details :
Given :

"features":[ { "name": "userTextTitleMatch"}, { "name": "originalScoreFeature"} ]
"boost":{ "feature":"originalScoreFeature", "weight":0.1, "type":"SUM" }  

The original score feature value, weighted by a factor of 0.1, will be added to the score produced by the LambdaMART trees.

 "boost":{ "feature":"originalScoreFeature", "weight":0.1, "type":"PRODUCT" }  
 

The original score feature value, weighted by a factor of 0.1, will be multiplied to the score produced by the LambdaMART trees.

N.B. Take extra care when using this approach. This introduces a manual boosting to the score calculation, which adds flexibility when you don’t have much data for training. However, you will loose some of the benefits of a machine learned model, which was optimized to rerank your results. As you get more data and your model becomes better, you should shift off the manual boosting.

 

e.g

{
    "class":"org.apache.solr.ltr.ranking.HeuristicBoostedLambdaMARTModel",
    "name":"lambdamartmodel",
    "features":[
        { "name": "userTextTitleMatch"},
        { "name": "originalScoreFeature"}
    ],
    "params":{
    "boost": {
          "feature": "originalScoreFeature",
          "weight": 0.5,
          "type": "SUM"
        },
        "trees": [
            {
                "weight" : 1,
                "root": {
                    "feature": "userTextTitleMatch",
                    "threshold": 0.5,
                    "left" : {
                        "value" : -100
                    },
                    "right": {
                        "value" : 10}
 }
 },
 {
 "weight" : 2,
 "root": {
 "value" : -10
 }
 }
 ]
 }
 }

Deploy Model

As we saw for the features definition, deploying the model is quite straightforward :

curl -XPUT 'http://localhost:8983/solr/collection1/schema/model-store' --data-binary @/path/model.json -H 'Content-type:application/json' 
 

View Model

The model will be stored in an easily accessible json store:

curl -XGET 'http://localhost:8983/solr/collection1/schema/model-store'
 

Rerank query

To rerank your search results using a machine learned LTR model it is required to call the rerank component using the Apache Solr Learning To Rank ( LTR ) query parser.

Query Re-Ranking allows you to run an initial query(A) for matching documents and then re-rank the top N documents re-scoring them based on a second query (B).
Since the more costly ranking from query B is only applied to the top N documents it will have less impact on performance then just using the complex query B by itself – the trade off is that documents which score very low using the simple query A may not be considered during the re-ranking phase, even if they would score very highly using query B.  Solr Wiki

The Apache Solr Learning To Rank ( LTR ) integration defines an additional query parser that can be used to define the rerank strategy.
In particular, when rescoring a document in the search results :

  • Features are extracted from the document
  • Score is calculated evaluating the model against the extracted feature vector
  • Final search results are reranked according to the new score
rq={!ltr model=myModelName reRankDocs=25}

!ltr – will use the ltr query parser
model=myModelName – specifies which model in the model-store to use to score the documents
reRankDocs=25 – specifies that only the top 25 search results from the original ranking, will be scored and reranked

When passing external feature information (EFI) that will be used to extract the feature vector, the syntax is pretty similar :

rq={!ltr reRankDocs=3 model=externalmodel efi.parameter1=’value1′ efi.parameter2=’value2′}

e.g.

rq={!ltr reRankDocs=3 model=externalModel efi.user_input_query=’Casablanca’ efi.user_from_mobile=1}

Sharding

When using sharding, each shard will rerank, so the reRankDocs will be considered per shard.

e.g.
10 shards
You run distributed query with :
rq={!ltr reRankDocs=10 …
You will get a total of 100 documents re-ranked .

Pagination

Pagination is delicate[4].

Let’s explore the scenario on a single Solr node and on a sharded architecture.

 

Single Solr node

reRankDocs=15
rows=10

This means each page is composed by 10 results.
What happens when we hit the page 2 ?
The first 5 documents in the search results will have been rescored and affected by the reranking.
The latter 5 documents will preserve the original score and original ranking.

e.g.
Doc 11 – score= 1.2
Doc 12 – score= 1.1
Doc 13 – score= 1.0
Doc 14 – score= 0.9
Doc 15 – score= 0.8
Doc 16 – score= 5.7
Doc 17 – score= 5.6
Doc 18 – score= 5.5
Doc 19 – score= 4.6
Doc 20 – score= 2.4

This means that score(15) could be < score(16), but document 15 and 16 are still in the expected order.
The reason is that the top 15 documents are rescored and reranked and the rest is left unchanged.

Sharded architecture

reRankDocs=15
rows=10
Shards number=2

When looking for the page 2, Solr will trigger queries to she shards to collect 2 pages per shard :
Shard1 : 10 ReRanked docs (page1) + 10 OriginalScored docs (page2)
Shard2 : 10 ReRanked docs (page1) + 10 OriginalScored docs (page2)

The the results will be merged, and possibly, original scored search results can top up reranked docs.
A possible solution could be to normalise the scores to prevent any possibility that a reranked result is surpassed by original scored ones.

Note: The problem is going to happen after you reach rows * page > reRankDocs. In situations when reRankDocs is quite high , the problem will occur only in deep paging.

Feature Extraction And Caching

Extracting the features from the search results document is the most onerous task while reranking using LTR.
The LTRScoringQuery will take care of computing the feature values in the feature vector and then delegate the final score generation to the LTRScoringModel.
For each document the definitions in the feature-store are applied to generate the vector.
The vector can be generate in parallel, leveraging a multi-threaded approach.
Extra care must be taken into account when configuring the number of threads in the game.
The feature vector is currently cached in toto in the QUERY_DOC_FV cache.

This means that given the query and EFIs, we cache the entire feature vector for the document.

Simply giving in input a different efi request parameter will imply a different hashcode for the feature vector and consequentially invalidate the cached one.
This bit can be potentially improved, managing separately caches for the query independent, query dependent and query level features[5].

Solr Is Learning To Rank Better – Part 3 – Ltr tools

Apache Solr Learning to Rank – Things Get Serious

This blog post is about the Apache Solr Learning to Rank Tools : a set of tools to ease the utilisation of the Apache Solr Learning To Rank integration.

The model has been trained in Part 2, we are ready to deploy it to Solr, but first it would be useful to have a better understanding of what we just created.
A LambdaMART model in a real world scenario is a massive ensemble of regression trees, not the most readable structure for a human.
More we understand the model, easier will be to find anomalies and to fix/improve it.
But the most important benefit of having a clearer picture of the training set and the model is the fact that it can dramatically improves the communication with the business layer :
  • What are the most important features in our domain ?
  • What kind of document should score high according to the model ?
  • Why this document (feature vector) is scoring that high ?
These are only examples, but a lot of similar questions can rise, and we need the tools to answer.

Apache Solr Learning to Rank Tools

This is how the Ltr tools project [1] was born ( LTR stands for Learning To Rank ).
Target of the project is to use the power of Apache Solr to visualise and understand a Learning To Rank model.
It is a set of simple tools specifically thought for LambdaMart models, represented in the Json format supported by the Bloomberg Apache Solr Learning To Rank integration.
Of course it is open source so feel free to extend it introducing additional models and functionalities.
All the tools provided are meant to work with a Solr backend in order to index data that we can later search easily.
The tools currently available provide the support to :
– index the model  in a Solr collection
– index the training set in a Solr collection
– print the top scoring leaves from a LambdaMART model

Preparation

To use the Learning To Rank ( LTR ) tools you must proceed with these simple steps :

  • set up the Solr backend – this will be a fresh Solr instance with 2 collections : models, trainingSet,  the simple configuration is available in : ltr-tools/configuration
  • gradle build – this will package the executable jar in : ltr-tools/ltr-tools/build/libs

Usage

Let’s briefly take a look to the parameters of the executable command line interface :
Parameter Description
-help Print the help message
-tool 
The tool to execute (possible values):
– modelIndexer
– trainingSetIndexer
– topScoringLeavesViewer
-solrURL
The Solr base URL to use for the search backend
-model 
The path to the model.json file
-topK
The number of top scoring leaves to return ( sorted by score descendant)
-trainingSet
The path to the training set file
-features
The path to the feature-mapping.json.
A file containing a mapping between the feature Id and the feature name.
-categoricalFeatures
The path to a file containing the list of categorical feature names.
 

N.B. all the following examples will assume the model in input is a LambdaMART model, in the json format the Bloomberg Solr Plugin expects.

 

Model Indexer

Requirement : Backend Solr collection <models> must be UP & RUNNING

The Model Indexer is a tool that indexes a lambdaMART model in Solr to better visualize the structure of the trees ensemble.
In particular the tool will index each branch split of the trees belonging to the lambdaMART ensemble as Solr documents.
Let’s take a look the solr schema:

configuration/solr/models/conf
 <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>  
 <field name="modelName" type="string" indexed="true" stored="true"/>  
 <field name="feature" type="string" indexed="true" stored="true" docValues="true"/>  
 <field name="threshold" type="double" indexed="true" stored="true" docValues="true"/>  
 ...  
 

So giving in input a lambdaMART model :

 

e.g. lambdaMARTModel1.json

 {  
   "class":"org.apache.solr.ltr.ranking.LambdaMARTModel",  
   "name":"lambdaMARTModel1",  
   "features":[  
    {  
      "name":"feature1"  
    },  
    {  
      "name":"feature2"  
    }  
   ],  
   "params":{  
    "trees":[  
      {  
       "weight":1,  
       "root":{  
         "feature":"feature1",  
         "threshold":0.5,  
         "left":{  
          "value":80  
         },  
         "right":{  
          "feature":"feature2",  
          "threshold":10.0,  
          "left":{  
            "value":50  
          },  
          "right":{  
            "value":75  
          }  
         }  
       }  
      }  
    ]  
   }  
 }  
 

N.B. a branching split is where the tree split in 2 branches:

 "feature":"feature2",   
      "threshold":10.0,   
      "left":{   
       "value":50   
      },   
      "right":{   
       "value":75   
      }   

A split happens on a threshold of the feature value.
We can use the tool to start the indexing process :

 java -jar ltr-tools-1.0.jar -tool modelIndexer -model /models/lambdaMARTModel1.json  -solrURL http://localhost:8983/solr/models

After the indexing process has finished we can access Solr and start searching !
e.g.
This query will return in response for each feature :

  • the number of times the feature appears at a branch split
  • the top 10 occurring thresholds for that feature
  • the number of unique thresholds that appear in the model for that feature
 http://localhost:8983/solr/models/select?indent=on&q=*:*&wt=json&facet=true&json.facet={  
      Features: {  
           type: terms,  
           field: feature,  
           limit: -1,  
           facet: {  
                Popular_Thresholds: {  
                     type: terms,  
                     field: threshold,  
                     limit: 10  
                },  
                uniques: "unique(threshold)"  
           }  
      }  
 }&rows=0&fq=modelName:lambdaMARTModel1  


Let’s see how it is possible to interprete the Solr response :

 facets": {  
   "count": 3479, //number of branch splits in the entire model  
   "Features": {  
     "buckets": [  
       {  
         "val": "product_price",  
         "count": 317, //the feature "product_price" is occurring in the model in 317 splits  
         "uniques": 28, //the feature "product_price" is occurring in the splits with 28 unique threshold values  
         "Popular_Thresholds": {  
           "buckets": [  
             {  
               "val": "250.0", //threshold value  
               "count": 45 //the feature "product_price" is occurring in the splits 45 times with threshold "250.0"  
             },  
             {  
               "val": "350.0",  
               "count": 45  
             },  
             ...  


TrainingSet Indexer

Requirement : Backend Solr collection <trainingSet> must be UP & RUNNING

The Training set Indexer is a tool that indexes a Learning To Rank traning set (in RankLib format) in Solr to better visualize the data.
In particular the tool will index each training sample of the trainign set as a Solr document.
Let’s see the Solr schema :

configuration/solr/models/conf
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
<field name="relevancy" type="tdouble" indexed="true" stored="true" docValues="true"/> 
<dynamicField name="cat_*" type="string" indexed="true" stored="true" docValues="true"/>
<dynamicField name="*" type="tdouble" indexed="true" stored="true" docValues="true"/> 

As you can notice the main point here is definition of dynamic fields.
Indeed we don’t know beforehand the names of the features, but we can distinguish between categorical features ( which we can index as strings) and ordinal features (which we can index as double).

We require now 3 inputs :

 

1) the training set in the RankLib format: 

 

e.g. training1.txt

1 qid:419267 1:300 2:4.0 3:1 6:1
4 qid:419267 1:250 2:4.5 4:1 7:1
5 qid:419267 1:450 2:5.0 5:1 6:1
2 qid:419267 1:200 2:3.5 3:1 8:1 
 

2) the feature mapping to translate the feature Id to a human readable feature name

e.g. features-mapping1.json

 {"1":"product_price","2":"product_rating","3":"product_colour_red","4":"product_colour_green","5":"product_colour_blue","6":"product_size_S","7":"product_size_M","8":"product_size_L"}  

N.B. the mapping must be a json object on a single line

This input file is optional, it is possible to index directly the feature Ids as names.

3) the list of categorical features

e.g. categoricalFeatures1.txt

product_colour
product_size 

This list ( one feature per line) will clarify to the tool which features are categorical, to index the category as a string value for the feature.
This input file is optional, it is possible to index the categorical features as binary one hot encoded features.

To start the indexing process :

 java -jar ltr-tools-1.0.jar -tool trainingSetIndexer -trainingSet /trainingSets/training1.txt -features /featureMappings/feature-mapping1.json -categoricalFeatures /feature/categoricalFeatures1.txt -solrURL http://localhost:8983/solr/trainingSet

After the indexing process has finished we can access Solr and start searching !
e.g.
This query will return in response all the training samples filtered and then faceted on the relevancy field.

This can be an indication of the distribution of the relevancy score in specific subsets of the training set

 http://localhost:8983/solr/trainingSet/select?indent=on&q=*:*&wt=json&fq=cat_product_colour:red&rows=0&facet=true&facet.field=relevancy


N.B. this is a quick and dirty way to explore the training set. I deeply suggest you to use it as a quick resource. Advance data plotting is more suitable to visualize big data and identify patterns.

Top Scoring Leaves Viewer

The top scoring leaves viewer is a tool to print the path of the top scoring leaves in the model.
Thanks to this tool will be easier to answer to questions like :
” How a document (feature vector) should look like to get an high score?”

The tool will simply visit the ensemble of trees in the model and keep track of the scores of each leaf.

So giving in input a lambdaMART model : 

 

e.g. lambdaMARTModel1.json

 {  
   "class":"org.apache.solr.ltr.ranking.LambdaMARTModel",  
   "name":"lambdaMARTModel1",  
   "features":[  
    {  
      "name":"feature1"  
    },  
    {  
      "name":"feature2"  
    }  
   ],  
   "params":{  
    "trees":[  
      {  
       "weight":1,  
       "root":{  
         "feature":"feature1",  
         "threshold":0.5,  
         "left":{  
          "value":80  
         },  
         "right":{  
          "feature":"feature2",  
          "threshold":10.0,  
          "left":{  
            "value":50  
          },  
          "right":{  
            "value":75  
          }  
         }  
       }  
      }, ...  
    ]  
   }  
 }  

To start the process :

 java -jar ltr-tools-1.0.jar -tool topScoringLeavesViewer -model /models/lambdaMARTModel1.json -topK 10  
This will print the top scoring 10 leaves (with related path in the tree):
1000.0 -> feature2 > 0.8, feature1 <= 100.0
200.0 -> feature2 <= 0.8, 
80.0 -> feature1 <= 0.5, 
75.0 -> feature1 > 0.5, feature2 > 10.0, 
60.0 -> feature2 > 0.8, feature1 > 100.0, 
50.0 -> feature1 > 0.5, feature2 <= 10.0,  

Conclusion

The Apache Solr Learning To Rank tools are quick and dirty solutions to help people understanding better and working better with Learning To Rank models.
They are far from being optimal but I hope they will be helpful for people working on similar problems.
Any contribution, improvement, bugfix is welcome !

[1] Learning To Rank tools

Solr Is Learning To Rank Better – Part 2 – Model Training

Model Training For Apache Solr Learning To Rank 

If you want to train a model for Apache Solr Learning To Rank , you are in the right place.
This blog post is about the model training phase for the Apache Solr Learning To Rank integration.

We modelled our dataset, we collected the data and refined it in Part 1 .
We have now a shiny lot-of-rows training set ready to be used to train the mathematical model that will re-rank the resulting documents coming out from our queries.
The very first activity to carry on is to decide the model that fits best our requirements.
I will focus in this blog post on two type of models, the ones currently supported by the Apache Solr Learning To Rank integration [1] .

Ranking SVM

Ranking SVM is a linear model based on Support Vector Machines.

The Ranking SVM algorithm is a pair-wise ranking method that adaptively sorts results based on how ‘relevant’  they are for a specific query ( we saw example of relevancy rating for the pair document-query in Part 1).

The Ranking SVM function maps each search query to the features of each sample.
This mapping function projects each data pair onto a feature space.

Each sample ( query-document ) of the training data will be used for the Ranking SVM algorithm to refine the mapping.

Generally, Ranking SVM includes three steps at training time:

  • It maps the similarities between queries and the clicked results onto a certain feature space.
  • It calculates the distances between any two of the vectors obtained in step 1.
  • It forms an optimization problem which is similar to a standard SVM classification and solves this problem with the regular SVM solver
Given the list of the features that describe our problem, an SVM model will assign a numerical weight to each of them, the resulting function will assign a score to a document given in input the feature vector that describes it.
Let’s see an example SVM Model in the Json format expected by the Solr plugin :
e.g.
{
    "class":"org.apache.solr.ltr.ranking.RankSVMModel",
    "name":"myModelName",
    "features":[
        { "name": "userTextTitleMatch"},
        { "name": "originalScore"},
        { "name": "isBook"}
    ],
    "params":{
        "weights": {
            "userTextTitleMatch": 1.0,
            "originalScore": 0.5,
            "isBook": 0.1
        }

}
}

Given 3 example features :
userTextTitleMatch – (query dependant feature) – (binary)
originalScore –  (query dependant feature) – (ordinal)
isBook – (document dependant feature) – (binary)

The ranking SVM model in the example builds a linear function of the variables assigning a different weight to each of them.

Given the documents (expressed with their feature vector):
D1 [1.0, 100, 1]
D2 [0.0, 80, 1]

Applying the model we get the scores :

Score(D1) = 1.0 * userTextTitleMatch(1.0) + 0.5 * originalScore(100) + 0.1 * isBook(1.0) = 51.1
Score(D2) = 1.0 * userTextTitleMatch(0.0) + 0.5 * originalScore(80) + 0.1 * isBook(1.0) = 40.1
The D1 documents is more relevant according the model.

Key points :

  • easy to debug and understand
  • linear function

Here you can find some good Open Source libraries to train SVM models [2].

LambdaMART

LambdaMART is a tree ensemble based model.
Each tree of the ensemble is a weighted regression tree and the final predicted score is the weighted sum of the prediction of each regression tree.
A regression tree is a decision tree that takes in input a feature vector and returns a scalar numerical value in output.
At a high level, LambdaMART is an algorithm that uses gradient boosting to directly optimize Learning to Rank specific cost functions such as NDCG.

To understand LambdaMART let’s explore the main two aspects: Lambda and MART.

MART
LambdaMART is a specific instance of Gradient Boosted Regression Trees, also referred to as Multiple Additive Regression Trees (MART).
Gradient Boosting is a technique for forming a model that is a weighted combination of an ensemble of “weak learners”. In our case, each “weak learner” is a decision tree.
Lambda
At each training point we need to provide a gradient that will allow us to minimize the cost function ( whatever we selected, NDCG for example).
To solve the problem LambdaMART uses an idea coming from lambdaRank:
at each certain point we calculate a value that acts as the gradient we require, this component will effectively modify the ranking, pushing up or down groups of documents and affecting effectively the rules for the leaf values, which cover the point used to calculate lambda.
For additional details these resources has been useful [3] ,[4] .
LambdaMART is currently considered one of the most effective model and it has been proved by a number of winning contests.

Let’s see how a lambdaMART model looks like :

{
    "class":"org.apache.solr.ltr.ranking.LambdaMARTModel",
    "name":"lambdamartmodel",
    "features":[
        { "name": "userTextTitleMatch"},
        { "name": "originalScore"}
    ],
    "params":{
        "trees": [
            {
                "weight" : 1,
                "root": {
                    "feature": "userTextTitleMatch",
                    "threshold": 0.5,
                    "left" : {
                        "value" : -100
                    },
                    "right": {
                        "feature" : "originalScore",
                        "threshold": 10.0,
                        "left" : {
                            "value" : 50
                        },
                        "right" : {
                            "value" : 75
                        }
                    }
                }
            },
            {
                "weight" : 2,
                "root": {
                    "value" : -10
                }
            }
        ]
    }
}

This example model is composed by two trees, each branch split evaluates a conditional expression on a feature value, if the feature value is  :
<=  threshold we visit the left branch
>  threshold we visit the right branch

Reaching the leaf of a tree will produce a score, this will be weighted according to the tree weight and then summed to the other scores produced ( the model is an ensemble of trees).
Given the documents :

D1 [1, 9]
D2 [0, 10]

Applying the model we get the scores :

Score(D1) = 1* (userTextTitleMatch (1) > 0.5 go right , originalScore (9) < 10 = 50) +
2 * -10 = 30

Score(D2) = 1 *(userTextTitleMatch(0) <= 0.5 = -100) +
2 * -10 = -120

The D1 document is more relevant according the model.

Key points :

  • ensemble of trees are effective
  • difficult to debug
  • non linear function

There are good open source libraries to train LambdaMART models, the one that we are going to use in this case of study is RankLib [5]

Evaluation Metric

Important phase of the journey is to validate the model :
How good is our model in re-ranking our dataset ?
How good is the model in re-ranking unknown data  ?
It is really important to carefully select the evaluation metric that best suites your algorithm and your needs.
Evaluating the model will be vital for the validation phase during the training, for testing the model against new data and to consistently being able to assess the predicting quality of the model trained.
N.B. this is particularly important: having a good understanding of the evaluation metric for the algorithm selected, can help a lot in tuning,  improving the model and finding anomalies in the training data.

NDCG@K

Normalised Discounted Cumulative Gain @k is a natural fit when choosing lambdaMART.
This metric evaluates the performance of a ranking model, it varies from 0.0 to 1.0, with 1.0
representing the ideal ranking model.
It takes K as parameter :
K : maximum number of documents that will be returned .

K specifies, for each query,  after the re-ranking, the top K results to evaluate.
N.B. set K to be the number of top documents you want to optimise.
Generally it is the number of documents you show in the first page of results.

Given in input the test set, this is grouped by queryId and for each query the ideal ranking (obtained sorting the documents by descendant relevancy) is compared with the ranking generated by the ranking model.
An average for all the queryIds is calculated.
Detailed explanation can be found in Wikipedia [6].

Knowing how it works, 2 factors sound immediately important when reasoning about NDCG :
1) distribution of relevancy scores in the test set, per queryId
2) number of samples per queryId

War Story 1 : too many highly relevant samples -> high NDCG
Given the test set A :
5 qid:1 1:1 …
5 qid:1 1:1 …
4 qid:1 1:1 …
5 qid:2 1:1 …
4 qid:2 1:1 …
4 qid:2 1:1 …

Even if the ranking model does not a good job, the high distribution of relevant samples increase the probability of hitting high score in the metric.

War Story 2 : small RankLists -> high NDCG
Having few samples (<K) per queryId means that we will not evaluate the performance of our ranking model with accuracy.

In the edge case of 1 sample per queryId, our NDCG@10 will be perfect, but actually this reflect simply a really poor training/test set.

N.B. be careful on how you generate your queryId as you can end up in this edge case and have a wrong perception of the quality of your ranking model.

Model Training

Focus of this section will be how to train a LambdaMART model using RankLib[5],
Let’s see an example and analyse all the parameters:
java  -jar RankLib-2.7.jar 
train /var/trainingSets/trainingSet_2016-07-20.txt 
-feature /var/ltr/trainingSets/trainingSet_2016-07-20_features.txt 
-ranker
-leaf 75
mls 5000 
-metric2t NDCG@20 
kcv 5 
-tvs 0.8 
-kcvmd /var/models 
-kcvmn model-name.xml 
-sparse 
Parameter Description
train
Path to the training set file
feature Feature description file, list feature Ids to be considered by the learner, each on a separate line. If not specified, all features will be used.
ranker Specify which ranking algorithm to use e.g. 6: LambdaMART
metric2t Metric to optimize on the training data. Supported: MAP, NDCG@k, DCG@k, P@k, RR@k, ERR@k (default=ERR@10)
tvs
Set train-validation split to be (x)(1.0-x).
x * (size of the training set) will be used for training
(1.0 -x) * (size of the training set) will be used for validation
kcv
Stands for k Cross Validation
Specify if you want to perform k-fold cross validation using ONLY the specified training data (default=NoCV).
This means we split the training data in k subsets and we run k training executions.
In each execution 1 subset will be used as the test set and k-1 subsets will be used for training.
N.B. in the case that tvs and kcv are both defined, first we split for the cross validation, than the training set produced will be split in training/validation.
e.g.
Initial Training Set size : 100 rows

-kcv 5 
-tvs 0.8 
Test Set : 20 rows
Training Set :  64 ( 0.8 * 80)
Validation Set : 16 (0.2 * 80)
kcvmd
Directory where to save models trained via cross-validation (default=not-save).
kcvmn Name for model learned in each fold. It will be prefix-ed with the fold-number (default=empty).
sparse Allow sparse data processing for the training set ( which means that you don’t need to specify all the features for each sample where the feature has no value)
tree
Number of trees of the ensemble(default=1000).
More complex is the learning problem, more trees we need, but it’s important to be careful and not overfit the training data
leaf
Number of leaves for each tree (default=100).
As the number of trees, it is important to tune carefully this value to avoid overfitting trees.
shrinkage This is the learning rate of the algorithm(default=0.1)
If this is too aggressive the Ranking model will quickly fit the training set, but will not react properly to the validation set evaluation, which means an overall poorer model.
tc
Number of threshold candidates for tree spliting.
-1 to use all feature values (default=256)
Increasing this value we increase the complexity and possible overfitting of the model.
It is suggested to start with a low value and then increase it according the general cardinality of your features.
mls
Min leaf support – minimum #samples each leaf has to contain (default=1).
This is quite an important parameter if we want to take care of outliers.
We can tune this parameter to include only leaves with an high number of samples, discarding pattern validated by a weak support.

Tuning the model training parameters is not an easy task.
Apart the general suggestions, trial & error with a careful comparison of the evaluation metric score is the path to go.
Comparing different models on a common Test Set can help when we have produced a number of models with different tuning configuration.

Converting Model To Json

We finally produced a model which is performing quite well on our Test set.
Cool, it’s the time to push it to Solr and see the magic in action.
But before we can do that, we need to convert the model generated by the Machine Learning libraries into the Json format Solr expects ( you remember the section about the models supported ? )
Taking as example a LambdaMART model, Ranklib will produce an xml model.
So you will need to parse it and convert it to the Json format.
Would be interesting to implement directly in RankLib the possibility of selecting in output the Json format expected by Solr.
Any contribution is welcome [7]!
In the next part we’ll see how to visualise and understand better the model generated.
This activity can be vital to debug the model, see the most popular features, find out some anomaly in the training set and actually assess the validity of the model itself.
 

Solr Is Learning To Rank Better – Part 1 – Data Collection

Learning To Rank In Apache Solr

Introduction

This blog post is about the journey necessary to bring Learning To Rank In Apache Solr search engines.

Learning to Rank[1] is the application of Machine Learning in the construction of ranking models for Information Retrieval systems.
Introducing supervised learning from user behaviour and signals can improve the relevancy of the documents retrieved bringing a new approach in ranking them.
Can be helpful in countless domains, refining free text search results or building a ranking algorithm where only filtering is happening and no initial scoring is available.

This series of blog posts will explore a full journey from the Signals Collection through the re-ranking in Solr.
Part 1  will explore the data collection, data modelling and refining phase
Part 2 will explore the training phase
Part 3 will describe a set of utilities to analyse your models
Part 4 will cover the Solr integration

Requirement : Minimal knowledge and interest in Machine Learning

Special thanks to Diego Ceccarelli, that helped me a lot in the last months introducing me to this amazing topic.
Second special thanks to Juan Luis Andrades, for the passion we shared during this journey and Jesse McLaughlin for the careful technical java insights.
Final special thanks to David Bunbury who contributed with interest, passion and very thoughtful ideas to the cause.

Collecting Signals

The start of the journey is the signals collection, it is a key phase and involves the modelling of the supervised training set that will be used to train the model.
A training set can be Explicit or Implicit.

Implicit

An Implicit training set is collected from user behaviours and interactions.
e.g.
Historical sales and transactions of an E-commerce website
User Clicks in the search result page
Time spent on each document accessed
 
A training set of this type is quite noisy but allows to collect great numbers of signals with small effort.
More the user was engaged with a particular document, stronger the signal of relevancy.
e.g.
A sale of a product is a stronger signal of relevancy than adding it to the basket
User Clicks in the search result page in comparison with documents shown but not clicked
Longer you read a document, stronger the relevancy
 
+ Pros : Cheap to build
Cons : Noisy

 

Explicit

An Explicit training set is collected directly from the interaction with Human Experts.
Given a query and a set of documents, the Human expert will rate the relevancy of each document in the result set.
The score assigned to the document will rate how relevant the document was for the query.
To remove the subjective bias is suggested to use a team of experts to build the training set.
A training set of this type is highly accurate but it is really expensive to build as you need a huge team of experts to produce thousands of rating for all the queries of interest.
+ Pros : Accuracy
– Cons : Expensive to build
 

Training Set

The training set can have different forms, for the sake of this post we will focus on the pairwise approach.
In particular the syntax exposed will be the RankLib[2] syntax.
Each sample of the training set is a signal / event that describes a pair (Query-Document) .
Let’s take as an example a query (1) and a document (1A)

3 qid:1 1:1 2:1 3:0 # 1A

The document can be represented as a feature vector which is a vector of scalar numeric values.
Each element in the vector represents a specific aspect of the document.
We’ll see this in the feature part in details.
Let’s now focus in understanding the sample :

3
How relevant the document is for the query

qid:1
Identify the query

1:1 2:1 3:0
The document, represented as a vector of numerical features

# 1A
A comment, to make more readable your training data

Feature Engineering

For convenience of Machine Learning algorithms, query-document pairs are represented by numerical vectors.
Components of such vectors are called features and can be divided into different groups (depending if they depend on the document, the query or both) :
Document Features (query independent)
This kind of feature depends only on the document and not on the query.
e.g.
Document length
Price of the product
User Rating of the product
Interesting aspect of these features is that they can be potentially precomputed in off-line mode during indexing. They may be used to compute document’s static quality score (or static rank), which is often used to speed up search query evaluation.

 

Query Dependent Features 
This kind of features depends on the query and on the document
e.g.
Is the document containing the query text in the title ?
Is the document (product) of the same brand as expressed in the query
 
Query Level Features 
This kind of features depends only on the query.
e.g.
Number of words in the query
Cuisine Type Selected( e.g. “Italian”, “Sushi” when searching for Restaurants)
Date Selected ( e.g. when searching in an Hotel Booking system)
Department Selected ( e.g. “electonics”, “kitchen”, “DIY” … in an E-commerce website)
User Dependent Features
Also in this case this kind of feature does not depend on the document.
It only depends on the user running the query.
e.g.
Device 
Age of the user
Gender
As described , for convenience of the mathematical algorithms each high level feature must be modelled as a numeric feature.
In the real world a feature describes an aspect of the object (document) and must be represented accordingly:
Ordinal Features
An ordinal feature represents a numerical value with a certain position in a sequence of numbers.
e.g.
Star Rating  ( for a document describing an Hotel)
Price  ( for a document describing an e-commerce product)
 
For the Star Rating feature, stands an order for the different values:
1<2<3<4<5  is logically correct.
For the Price feature, the same observation applies .
100$ < 200$ <300$
A feature is Ordinal when it is possible to compare different values and decide the ranking of these.
 
Categorical Features
A categorical feature represents an attribute of an object that have a set of distinct possible values.
In computer science it is common to call the possible values of a categorical features Enumerations.
e.g.
Colour ( for a document describing a dress)
Country ( for a document describing a location)
It easy to observe that to give an order to the values of a categorical feature does not make any sense.
For the Colour feature :
red < blue < black has no general meaning.
 
Binary Features
A binary feature represents an attribute of an object that can have only two possible values.
Traditionally 0 / 1 in accordance with the binary numeral system.
e.g.
Is the product available ? yes/no ( for a document describing an e-commerce product)
Is the colour Red ? ( for a document describing a dress)
Is the country Italy ? ( for a document describing a location)

One Hot Encoding

When a categorical feature describes your Document, it is tempting to represent each category as an Integer Id :
e.g.
Categorical Feature : colour
Distinct Values : red, green, blue
Representation : colour:1, colour:2 colour:3 
 
With this strategy, the Machine learning algorithm will be able to manage the feature values…
But is the information we pass, the same as the original one ?
Representing a categorical feature as an ordinal feature is introducing an additional ordinal relationship :
1(red) < 2(green) < 3(blue)
which doesn’t reflect the original information.
There are different ways to encode categorical features to make them understandable by the training algorithm. We need basically to encode the original information the feature provides in an numeric form, without any loss or addition if possible.
One possible approach is called One Hot Encoding [3]:
Given a categorical feature with N distinct values, encode it in N binary features, each feature will state if the category applies to the Document.
e.g.
Categorical Feature : colour
Distinct Values : red, green, blue
Encoded Features : colour_red, colour_green, colour_blue 
A document representing a blue shirt will be described by the following feature vector :
… colour_red:0 colour_green:0 colour_blue:1 …
One Hot Encoding is really useful to properly model your information, but take care of the cardinality of your categorical feature as this will be reflected in the number of final features that will describe your signal.
War Story 1 : High Cardinality Categorical Feature
A signal describing a document with a high level categorical feature (with N distinct values) can produce a Feature vector of length N.
This can deteriorate the performance of your trainer as it will need to manage many more features per signal.
It actually happened to me, that simply adding one categorical feature was bringing in thousands of binary features, exhausting the hardware my trainer was using,  killing the training process.
To mitigate this , can be useful to limit the encoded distinct values only to a subset :
  • with white list / black list approach business driven
  • keeping only the top occurring values
  • keeping only the values occurring more than a threshold
  • encode the rest as a scpecial feature :colour_misc
  • Hash the distinct values into a reduced set of hashes

Feature Normalization

Feature Normalisation is a method used to standardize the range of values across different features, a technique quite useful in the data pre-processing phase.
As the majority of machine learning algorithms use the Euclidean distance to calculate the distance between two different points (training vector signals), if a feature has a widely different scales, the distance can be governed by this particular feature.
Normalizing can simplify the problem and give the same weight to each of the features involved.
There are different type of normalization, some of them :
  • Linear Normalization ( min/max based)
  • Sum Normalization ( based on the sum of all the values of the feature )
  • Z Score ( based on the mean/standard deviation of the feature )

Feature Value Quantisation

Another approach to simplify the job of the training algorithm is to quantise the feature values, in order to reduce the cardinality of distinct values per feature.
It is basically the simple concept of rounding, whenever we realise that it does not make any difference for the domain to model the value with an high precision, it is suggested to simplify it and round it to the acceptable level.
e.g.
Domain: ospitality
Ranking problem : Rank restaurants documents
Assuming a feature is the trip_advisor_reviews_count, is it really necessary to model the value as the precise amount of reviews ? Normally would be simpler to round to the nearest k ( 250 or whatever sensible to the business)

Note : Extra care must be taken into account if following this approach.
The reason is that adding an artificial rounding to the data can be dangerous, we can basically compromise the feature itself setting up hard thresholds.
It is always better if the algorithm decides the thresholds with freedom.
It is suggested were possible to not quantise, or quantise only after a deep statistical analysis on our data.
If the target is to simplify the model for any reason, it is possible to evaluate at training time less threshold candidates, depending on the training algorithm.

Missing Values

Some of the signals we are collecting could miss some of the features ( data corruption, bug in the signal collection or simply the information was not available at the time ) .
Modelling our signals with a sparse feature vector will imply that a missing feature will actually be modelled as feature with value 0.
This should generally be ok, but we must be careful in the case that 0 is a valid value for the feature.
e.g.
Given a user_rating feature
A rating of 0 means the product has a very bad rating.
A missing rating means we don’t have a rating for the product ( the product can still be really good) 
A first approach could be to model the 0 rating as slightly greater than 0 (i.e. 0 + ε ) and keep the sparse representation.
In this way we are differentiating the information but we are still modelling the wrong ordinal relationship : 
Missing User Rating  (0) < User Rating 0 (0 + ε)
 
Unfortunately, at least for the RankLib implementation, a missing feature will always be modelled with a value 0, this of course will vary from algorithm to algorithm.
But we can enforce the learning a bit, adding an additional binary feature that states that the User Rating is actually missing :
user_rating:0 , user_rating_missing:1 .
This should help the learning process to actually understand better the difference.
Furthermore, if possible we can help the algorithm, avoiding the sparse representation if necessary and setting for the missing feature a value which is the avg of the feature itself across the different samples.

Outliers

Some of the signals we are collecting could have some outliers ( some signal with an unlikely extremely different value for a specific feature).
This can be caused by bugs in the signal collection process or simply the anomaly can be a real instance of a really rare signal.
Outliers can complicate the job of the model training and can end up in overfitting models that have difficulties in adaptation for unknown datasets.
Identify and resolve anomalies can be vitally important if your dataset is quite fragile.

Tool for data visualisation can help in visualising outliers, but for a deep analysis I suggest to have a read of this interesting blog post [4] .

Feature Definition

Defining the proper set of features to describe the document of our domain is an hard task.
It is not easy to identify in the first place all the relevant features even if we are domain experts, this procedure will take time and a lot of trial and error.
Let’s see a guideline to try to build a feature vector as best as possible :
  • Keep it simple : start from a limited set of features which are really fundamental to describe your problem, the model produced will be really poor, but at least you have the baseline.
  • Iteratively train the model : removing or adding a feature at each execution. this is time consuming but will allow you to identify clearly which features really  matter.
  • Visualise Important Features : After you trained the model, use a visualisation tool to verify which feature is appearing more
  • Meet the Business : Have meetings with the business, to compare what they would expect to see re-ranked and what actually the model re-ranks. When there is discordance let’s have the humans explain why, this should drive to identify missing features or feature that were used in wrong meaning .

Data Preparation

We have carefully designed our vectorial representation of the domain documents, we identified the source of our signals and built our training Set.
So far, so good…
But the model is still performing really poor.
In this scenario the reasons can be countless :

  •  Poor signal quality (noise)
  •  Incomplete feature vector representation
  •  Not uniform distribution of relevant-not relevant documents per queries

Let’s explore some guideline to overcome some of these difficulties :

Noise Removal

In the scenario of implicit signals, it is likely we model the relevancy rating based on an evaluation metric of the user engagement with the document given a certain query.
Depending of the domain we can measure the user engagement in different ways.
Let’s see an example for a specific domain :  E-commerce
We can assign the relevancy rating of each signal depending on the user interaction :
Given a scale 1 to 3 :
1 – User clicked the product
2 – User added the product to the basket
3 – User bought the product
The simplest approach would be to store 1 signal per user interaction.
User behavioural signals are noisy by nature, but this approach introduces even more noise, as for the same feature vector we introduce discordant signals, specifically we are telling the training algorithm that given that feature vector and that query, the document is at the same time :
vaguely relevant – relevant – strongly relevant .
This doesn’t help the training algorithm at all, so we need to find a strategy to avoid that.
One possible way is to keep only the strongest signal per document-query per user episode .
In the case of a user buying a product, we avoid storing in the training set 3 signals, but we keep only the most relevant one.
In this way we transmit to the training algorithm the only the important information for the user interaction with no confusion.

Unbalanced Dataset

In some domain, would be quite common to have a very unbalanced dataset.
A dataset is unbalanced when the relevancy classes are not represented equally in the dataset i.e. we have many more samples of a relevancy class than another.
Taking again the E-commerce example, the number of relevant signals (sales) will be much less than the number of weak signals (clicks).
This unbalance can make the life harder to the training algorithm, as each relevant signal can be covered by many more weakly relevant ones.
Let’s see how we can manipulate the dataset to partially mitigate this problem :

Collect more data

This sounds simple, but collecting more data is generally likely to help.
Of course there are domain when collecting more data is not actually beneficial ( for example when the market change quite dinamically and the previous years dataset becomes almost irrelevant for predicting the current behaviours ) .

Resample the dataset

You can manipulate the data you collected to have more balanced data.
This change is called sampling your dataset and there are two main methods that you can use to even-up the classes: Oversampling and Undersampling[5].
You can add copies of instances from the under-represented relevancy class, this is called over-sampling, or
you can delete instances from the over-represented class, this technique is called under-sampling.
These approaches are often very easy to implement and fast to run. They are an excellent starting point.
Some ideas and suggestion :
  • Consider testing under-sampling when you have an a lot data (tens- or hundreds of thousands of instances or more), you can undersample randomly or following any business logic available
  • Consider testing different resampled ratios (it is not necessary to target a 1:1 ratio)
  • When oversampling consider advanced approaches, instead of simple duplication of already existent samples, could be good to artificially generate new ones [6]
Be careful, resampling is not always helping your training algorithm, so experiment in details your use case.
War Story 2 : Oversampling by duplication
Given a dataset highly unbalanced, the model trained was struggling to predict accurately the desired relevancy class for document-query test samples.
Oversampling was a tempting approach, so here we go!
 
As I am using cross validation, the first approach has been to oversample the dataset by duplication.
I took each relevancy class and duplicate the samples until I built a balanced dataset.
Then started the training in Cross Validation and I trained a model which was immense and almost perfectly able to predict the relevancy of validation and test samples.
Cool ! I got it !
Actually was not an amazing result at all, because of course applying cross validation on an oversampled dataset built validation and test sets oversampled as well.
This means that it was really likely that a sample in the training set was appearing exactly the same in the validation set and in the test set.
The resulting model was basically highly overfitted and not that good to predict unknown test sets.
 
So I moved to a manual training set- validation set – test set split and oversampled only the training set.
This was definitely better and built a model that was much more suitable.
It was not able to perfectly predict validation and test sets but this was a good point as the model was able to predict unknown data sets better.
 
Then I trained again, this time the original dataset, manually split as before but not oversampled.
The resulting model was actually better than the oversampled one.
One of the possible reasons is that the training algorithm and model I was using (LambdaMART) didn’t get any specific help from the resampling, actually the model lost the capability of discovering which samples were converting better ( strong relevant signals : weak relevant signals ratio).
Practically I favoured the volume over the conversion ratio, increasing the recall but losing the precision of the ranker.
 
Conclusion : Experiment, evaluate the approach with your algorithm, compare, don’t assume it is going to be better without checking

Query Id hashing

As we have seen in the initial part of the blog, each sample is a document-query pair, represented in a vectorial format.
The query is represented by an Id, this Id is used to group samples for the same query, and evaluate the ranker performance over each samples group.
This can give us an evaluation of how good the ranker is performing on average on all the queries of interest.
This brings to carefully decide how we generate the query identifier.
If we generate a too specific hash, we risk to build small groups of samples, this small groups can get an high score when ranking them, biased by their small size.
Extreme case
e.g.
Really specific hash, brings many groups to be 1 sample groups.
This brings up the evaluation metric score, as we are averaging and a lot of groups, being of size 1, are perfectly easy to rank.
If we generate an hash that is not specific enough we can end up in immense groups, not that helpful to evaluate our ranking model on the different real world scenarios.
The ideal scenario is to have one query Id per query category of interest, with a good number of samples related, this would be the perfect dataset, in this way we can validate both :
  • the intra-group relevancy ( as the group are composed by sufficient samples)
  • the average across the queries ( as we have a valid set of different queries available)
The query category could depend on a set of Query Dependent Features, this means that we can calculate the hash using the values of these features.
Being careful we maintain a balance between the group sizes and the granularity of the hash.
It is important to have the query categories across the training/validation/test set :
e.g.
We have 3 different query categories , based on the value of a user dependent feature ( user_age_segment) .
These age segments represents three very different market segments, that require very different ranking models.
When building our training set we want enough samples for each category and we want them to be split across the training/validation/test sets to be able to validate how good we are in predicting the different market segment.
This can potentially drive to build separate models and separate data sets if it is the case.

Exploring Solr Internals : The Lucene Inverted Index

Introduction

This blog post is about the Lucene Inverted Index and how Apache Solr internally works.

When playing with Solr systems, understanding and properly configuring the underline Lucene Index is fundamental to deeply control your search.
With a better knowledge of how the index looks like and how each component is used, you can build a more performant, lightweight and efficient solution.
Scope of this blog post is to explore the different components in the Inverted Index.
This will be more about data structures and how they contribute to provide Search related functionalities.
For low level approaches to store the data structures please refer to the official Lucene documentation [1]

Lucene Inverted Index

The Inverted Index is the basic data structure used by Lucene to provide Search in a corpus of documents.
It’s pretty much quite similar to the index in the end of a book.
From wikipedia :

“In computer science, an inverted index (also referred to as postings file or inverted file) is an index data structure storing a mapping from content, such as words or numbers, to its locations in a database file, or in a document or a set of documents.”

In Memory / On Disk

The Inverted index is the core data structure that is used to provide Search.
We are going to see in details all the components involved.
It’s important to know where the Inverted index will be stored.
Assuming we are using a FileSystem Lucene Directory, the index will be stored on the disk for durability ( we will not cover here the Commit concept and policies, so if curious [2]) .
Modern implementation of the FileSystem Directory will leverage the OS Memory Mapping feature to actually load into the memory ( RAM ) chunk of the index ( or possibly all the index) when necessary.
The index in the file system will look like a collection of immutable segments.
Each segment is a fully working Inverted Index, built from a set of documents.
The segment is a partition of the full index, it represents a part of it and it is fully searchable.
Each segment is composed by a number of binary files, each of them storing a particular data structure relevant to the index, compressed [1] .
To simplify, in the life of our index, while we are indexing data, we build segments, which are merged from time to time ( depending of the configured Merge Policy).
But the scope of this post is not the Indexing process but the structure of the Index produced.

 

Hands on !

Let’s assume in input 3 documents, each of them with 2 simple fields, and see how a full inverted Index will look like :

Doc0
    {  “id”:”c”,
        “title”:”video game history”
},

Doc1
    {  “id”:”a”,
        “title”:”game video review game”
},

Doc2
    {  “id”:”b”,
        “title”:”game store”
},

Depending of the configuration in the schema.xml, at indexing time we generate the related data structure.

Let’s see the inverted index in the complete form, then let’s explain how each component can be used , and when to omit part of it.

.tg {border-collapse:collapse;border-spacing:0;} .tg td{font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;} .tg th{font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;} .tg .tg-vyw9{font-family:”Courier New”, Courier, monospace !important;} .tg .tg-s8ba{font-weight:bold;font-family:”Courier New”, Courier, monospace !important;;background-color:#efefef} .tg .tg-vc88{font-weight:bold;font-family:”Courier New”, Courier, monospace !important;}

Field id
Ordinal Term Document Frequency Posting List
0 a 1 : 1 : [1] : [0-1]
1 b 1 : 1 : [1] : [0-1]
2 c 1 : 1 : [1] : [0-1]
Field title
Ordinal Term Document Frequency Posting List
0 game 3 0 : 1 : [2] : [6-10],
1 : 2 : [1, 4] : [0-4, 18-22],
: 1 : [1] : [0-4]
1 history 1 : 1 : [3] : [11-18]
2 review 1 : 1 : [3] : [11-17]
3 store 1 2 : 1 : [2] : [5-10]
4 video 2 0 : 1 : [1] : [0-5],
1 : 1 : [2] : [5-10],

This sounds scary at the beginning let’s analyse the different component of the data structure.

Term Dictionary

The term dictionary is a sorted skip list containing all the unique terms for the specific field.
Two operations are permitted, starting from a pointer in the dictionary :
next() -> to iterate one by one on the terms
advance(ByteRef b) -> to jump to an entry >= than the input  ( this operation is O(n) = log n where n= number of unique terms).
An auxiliary Automaton is stored in memory, accepting a set of smart calculated prefixes of the terms in the dictionary.
It is a weighted Automaton, and a weight will be associated to each prefix ( i.e. the offset to look into the Term Dictionary) .
This automaton is used at query time to identify a starting point to look into the dictionary.
When we run a query ( a TermQuery for example) :
1) we give in input the query to the In Memory Automaton, an Offset is returned
2) we access the location associated to the Offset in the Term Dictionary
3) we advance to the ByteRef representation of the TermQuery
4) if the term is a match for the TermQuery we return the Posting List associated

Document Frequency

This is simply the number of Documents in the corpus containing the term  t in the field f .
For the term game we have 3 documents in our corpus that contain the term in the field title
 

Posting List

The posting list is the sorted skip list of DocIds that contains the related term.
It’s used to return the documents for the searched term.
Let’s have a deep look to a complete Posting List for the term game in the field title :
0 : 1 : [2] : [6-10],
1 : 2 : [1, 4] : [0-4, 18-22],
: 1 : [1] : [0-4]

Each element of this posting list is :
Document Ordinal : Term Frequency : [array of Term Positions] : [array of Term Offset] .

Document Ordinal -> The DocN Ordinal (Lucene ID) for the DocN document in the corpus containing the related term.
Never relies at application level on this ordinal, as it may change over time ( during segments merge for example).
e.g.
According to the starting ordinals of each Posting list element :
Doc0 0,
Doc1 1,
Doc2 2
contain the term game in the field title .

Term Frequency -> The number of occurrences of the term in the Posting List element .
e.g.
0 : 1 
Doc0 contains 1 occurrence of the term game in the field title.
1 : 2
Doc1 contains 2 occurrence of the term game in the field title.
: 1
Doc2 contains 1 occurrence of the term game in the field title.

Term Positions Array -> For each occurrence of the term in the Posting List element, it contains the related position in the field content .
e.g.
0 : [2]
Doc0 1st occurrence of the term game in the field title occupies the 2nd position in the field content.
“title”:”video(1) game(2) history(3)” )
1 : [1, 4] 
Doc1 1st occurrence of the term game in the field title occupies the 1st position in the field content.
Doc1 2nd occurrence of the term game in the field title occupies the 4th position in the field content.
“title”:”game(1) video(2) review(3) game(4) )
: [1] 
Doc0 1st occurrence of the term game in the field title occupies the 1st position in the field content.
“title”:”game(1) store(2)” )

Term Offsets Array ->  For each occurrence of the term in the Posting List element, it contains the related character offset in the field content .
e.g.
0 : [6-10]
Doc0 1st occurrence of the term game in the field title starts at 6th char in the field content till 10th char ( excluded ).
“title”:”video(1) game(2) …” )
“title”:”01234 5 6789 10  …” )
1 : [0-4, 18-22]
Doc1 1st occurrence of the term game in the field title starts at 0th char in the field content till 4th char ( excluded )..
Doc1 2nd occurrence of the term game in the field title starts at 18th char in the field content till 22nd char ( excluded ).
“title”:”game(1) video(2) review(3) game(4) )
“title”:”0123 4   video(2) review(3) 18192021 22″ )
[0-4]
Doc0 1st occurrence of the term game in the field title occupies the 1st position in the field content.
“title”:”game(1) store(2)” )
“title”:”0123 4 store(2)” )

Live Documents

Live Documents is a lightweight data structure that keep the alive documents at the current status of the index.
It simply associates a bit  1 ( alive) , 0 ( deleted) to a document.
This is used to keep the queries aligned to the status of the Index, avoiding to return deleted documents.
Note : Deleted documents in index data structures are removed ( and disk space released) only when a segment merge happens.
This means that you can delete a set of documents, and the space in the index will be claimed back only after the first merge involving the related segments will happen.
Assuming we delete the Doc2, this is how the Live Documents data structure looks like :
Ordinal Alive
0 1
1 1
2 0
Ordinal -> The Internal Lucene document ID
Alive -> A bit that specifies if the document is alive or deleted.

Norms

Norms is a data structure that provides length normalisation and boost factor per Field per Document.
The numeric values are associated per FIeld and per Document.
This value is associated to the length of the content value and possibly an Indexing time boost factor as well.
Norms are used when scoring Documents retrieved for a query.
In details it’s a way to improve the relevancy of Documents containing the term in short fields .
A boost factor can be associated at indexing time as well.
Short field contents will win over long field field contents when matching a query with Norms enabled.

Field title
Doc Ordinal Norm
0 0.78
1 0.56
2 0.98

 

Schema Configuration

When configuring a field in Solr ( or directly in Lucene) it is possible to specify a set of field attributes to control which data structures are going to be produced .
Let’s take a look to the different Lucene Index Options
Lucene Index Option Solr schema Description To Use When …
NONE indexed=”false” The inverted index will not be built. You don’t need to search in your corpus of documents.
DOCS omitTermFreqAnd
Positions=”true”
The posting list for each term will simply contain the document Ids ( ordinal) and nothing else.

e.g.
game -> 0,1,2 
– You don’t need to search in your corpus with phrase or positional queries. 
-You don’t need score to be affected by the number of occurrences of a term in a document field.
DOCS_AND_FREQS omitPositions=”true” The posting list for each term will simply contain the document Ids ( ordinal) and term frequency in the document.

e.g.
game -> 0 : 11 : 2 : 1 
– You don’t need to search in your corpus with phrase or positional queries.
– You do need scoring to take Term Frequencies in consideration
DOCS_AND_FREQS_AND_
POSITIONS
Default when indexed=”true” The posting list for each term will contain the term positions in addition.

e.g.
0 : 1 : [2], 1 : 2 : [1, 4], : 1 : [1]
– You do need to search in your corpus with phrase or positional queries.
– You do need scoring to take Term Frequencies in consideration
DOCS_AND_FREQS_AND_
POSITIONS_AND_OFFSETS
storeOffsetsWithPositions =”true” The posting list for each term will contain the term offsets in addition.

e.g.game ->0 : 1 : [2] : [6-10],
1 : 2 : [1, 4] : [0-4, 18-22],
: 1 : [1] : [0-4]

– You want to use the Posting Highlighter.
A fast version of highlighting that uses the posting list instead of the term vector.
omitNorms omitNorms=”true” The norms data structure will not be built – You don’t need to boost short field contents
– You don’t need Indexing time boosting per field


[1] Lucene File Formats
[2] Understanding commits and Tlog in Solr

Solr : " You complete me! " : The Apache Solr Suggester

This blog post is about the Apache Solr Autocomplete feature.
It is clear that the current documentation available on the wiki is not enough to fully understand the Solr Suggester : this blog post will describe all the available implementations with examples and tricks and tips.

Introduction

If there’s one thing that months of Solr-user mailing list have taught me is that the autocomplete feature in a Search Engine is vital and around Apache Solr Autocomplete there’s as much hype as confusion.

In this blog I am going to try to clarify as much as possible all the kind of Suggesters that can be used in Solr, exploring in details how they work and showing some real world example.

It’s not scope of this blog post to explore in details the configurations.

Please use the official wiki [1] and this really interesting blog post [2] to integrate this resource.

Let’s start with the definition of the Apache Solr Suggester component.

The Apache Solr Suggester

From the official Solr wiki [1]:
” The SuggestComponent in Solr provides users with automatic suggestions for query terms. You can use this to implement a powerful auto-suggest feature in your search application.
This approach utilizes Lucene’s Suggester implementation and supports all of the lookup implementations available in Lucene.
The main features of this Suggester are:
  • Lookup implementation pluggability
  • Term dictionary pluggability, giving you the flexibility to choose the dictionary implementation
  • Distributed support “

For the details of the configuration parameter I suggest you the official wiki as a reference.

Our focus will be the practical use of the different Lookup Implementation , with clear examples.

Term Dictionary

The Term Dictionary defines the way the terms (source for the suggestions) are retrieved for the Solr autocomplete.
There are different ways of retrieving the terms, we are going to focus on the DocumentDictionary ( the most common and simple to use).
For details about the other Dictionaries implementation please refer to the official documentation as usual.

The DocumentDictionary uses the Lucene Index to provide the list of possible suggestions, and specifically a field is set to be the source for these terms.

Suggester Building

Building a suggester is the process of :
  • retrieving the terms (source for the suggestions) from the dictionary
  • build the data structures that the Suggester requires for the lookup at query time
  • Store the data structures in memory/disk

The produced data structure will be stored in memory in first place.

It is suggested to additionally store on disk the built data structures, in this way it will available without rebuilding, when it is not in memory anymore.

For example when you start up Solr, the data will be loaded from disk to the memory without any rebuilding to be necessary.

This parameter is:

storeDir” for the FuzzyLookup

indexPath” for theAnalyzingInfixLookup

The built data structures will be later used by the suggester lookup strategy, at query time.
In details, for the DocumentDictionary during the building process, for ALL the documents in the index :
  • the stored content of the configured field is read from the disk ( stored=”true” is required for the field to have the Suggester working)
  • the compressed content is decompressed ( remember that Solr stores the plain content of a field applying a compression algorithm [3] )
  • the suggester data structure is built
We must be really careful here to this sentence :
“for ALL the documents” -> no delta dictionary building is happening
So extra care every time you decide to build the Suggester !
Two suggester configuration are strictly related to this observation :
Parameter Description
buildOnCommit or buildOnOptimize If true then the lookup data structure will be rebuilt after each soft-commit. If false, the default, then the lookup data will be built only when requested by query parameter suggest.build=true.

Because of the previous observation is quite easy to understand that the buildOnCommit is highly discouraged.

buildOnStartup If true then the lookup data structure will be built when Solr starts or when the core is reloaded. If this parameter is not specified, the suggester will check if the lookup data structure is present on disk and build it if not found.

Again, is highly discouraged to set this to true, or our Solr cores could take really long time to start up.

A good consideration at this point would be to introduce a delta approach in the dictionary building.

Could be a good improvement, making more sense out of the “buildOnCommit” feature.

I will follow up verifying the technical feasibility of this solution.

Now let’s step to the description of the various lookup implementations with related examples.Note: when using the field type “text_en” we refer to a simple English analyser with soft stemming and stop filter enabled.

The simple corpus of document for the examples will be the following :

[
      {
        "id":"44",
        "title":"Video gaming: the history"},
      {
        "id":"11",
        "title":"Video games are an economic business"},
      {
        "id":"55",
        "title":"The new generation of PC and Console Video games"},
      {
        "id":"33",
        "title":"Video games: multiplayer gaming"}]

And a simple synonym mapping : multiplayer, online

AnalyzingLookupFactory


<lst name="suggester">

  <str name=”name”>AnalyzingSuggester</str>

  <str name=”lookupImpl”>AnalyzingLookupFactory</str>

  <str name=”dictionaryImpl”>DocumentDictionaryFactory</str>

  <str name=”field”>title</str>

  <str name=”weightField”>price</str>

  <str name=”suggestAnalyzerFieldType”>text_en</str>

</lst>




 

Description
Data Structure FST
Building For each Document, the stored content from the field is analyzed according to the suggestAnalyzerFieldType.

The tokens produced are added to the Index FST.

Lookup strategy The query is analysed,  the tokens produced are added to the query FST.

An intersection happens between the Index FST and the query FST.

The suggestions are identified starting at the beginning of the field content.

Suggestions returned The entire content of the field .

This suggester is quite powerful as it allows to provide suggestions at the beginning of a field content, taking advantage of the analysis chain provided with the field.

It will be possible in this way to provide suggestions considering synonyms, stop words, stemming and any other token filter used in the analysis.Let’s see some example:

Query to autocomplete Suggestions Explanation
“Video gam”
  • Video gaming: the history”
  • Video games are an economic business”
  • Video game: multiplayer gaming”
The suggestions coming are simply the result of the prefix match. No surprises so far.
“Video Games”
  • Video gaming: the history”
  • Video games are an economic business”
  • Video game: multiplayer gaming”
The input query is analysed, and the tokens produced are the following : “video” “game”.

The analysis was applied at building time as well, producing the same stemmed terms for the beginning of the titles.

“video gaming” -> “video” “game”

“video games” -> “video” “game”

So the prefix match applies.


“Video game econ”  

  • Video games are an economic business”
In this case we can see that the stop words were not considered when building the index FST. Note :

position increments MUST NOT be preserved for this example to work, see the configuration details.

“Video games online ga”
  • Video game: multiplayer gaming”
Synonym expansion has happened and the match is returned as online and multiplayer are considered synonyms by the suggester, based on the analysis applied.

FuzzyLookupFactory


<lst name="suggester">

  <str name=”name”>FuzzySuggester</str>

  <str name=”lookupImpl”>FuzzyLookupFactory</str> 

  <str name=”dictionaryImpl”>DocumentDictionaryFactory</str>

  <str name=”field”>title</str>

  <str name=”weightField”>price</str>

  <str name=”suggestAnalyzerFieldType”>text_en</str>

</lst>

Description
Data Structure FST
Building For each Document, the stored content from the field is analyzed according to the suggestAnalyzerFieldType.

The tokens produced are added to the Index FST.

Lookup strategy The query is analysed,  the tokens produced are then expanded producing for each token all the variations accordingly to the max edit configured for the String distance function configured ( default is Levestein Distance[4]).

The finally produced tokens are added to the query FST keeping the variations.

An intersection happens between the Index FST and the query FST.

The suggestions are identified starting at the beginning of the field content.

Suggestions returned The entire content of the field .

This suggester is quite powerful as it allows to provide suggestions at the beginning of a field content, taking advantage of a fuzzy search on top of the analysis chain provided with the field.

It will be possible in this way to provide suggestions considering synonymsstop words, stemming and any other token filter used in the analysis and support also misspelled terms by the user.

It is an extension of the Analysis lookup.IMPORTANT : Remember the proper order of processing happening at query time :

  • FIRST, the query is analysed, and tokens produced
  • THEN, the tokens are expanded with the inflections based on the Edit distance and distance algorithm configured

Let’s see some example:

Query to autocomplete Suggestions Explanation
“Video gmaes”
  • Video gaming: the history”
  • Video games are an economic business”
  • Video game: multiplayer gaming”
The input query is analysed, and the tokens produced are the following : “video” “gmae”.

Then the FST associated is expanded with new statuses containing the inflections for each token.

For example “game” will be added to the query FST because it has a distance of 1 from the original token.

And the prefix matching is working fine returning the expected suggestions.

Video gmaing
  • Video gaming: the history”
  • Video games are an economic business”
  • Video game: multiplayer gaming”
The input query is analysed, and the tokens produced are the following : “video” “gma”.

Then the FST associated is expanded with new statuses containing the inflections for each token.

For example “gam” will be added to the query FST because it has a distance of 1 from the original token.

So the prefix match applies.


Video gamign
  • No suggestion returned
This can seem odd at first, but it is coherent with the Look up implementation.

The input query is analysed, and the tokens produced are the following : “video” “gamign”.

Then the FST associated is expanded with new statuses containing the inflections for each token.

For example “gaming” will be added to the query FST because it has a distance of 1 from the original token.

But no prefix matching will apply because in the Index FST we have “game”, the stemmed token for “gaming”

AnalyzingInfixLookupFactory


<lst name="suggester">

  <str name=”name”>AnalyzingInfixSuggester</str>

  <str name=”lookupImpl”>AnalyzingInfixLookupFactory</str> 

  <str name=”dictionaryImpl”>DocumentDictionaryFactory</str>

  <str name=”field”>title</str>

  <str name=”weightField”>price</str>

  <str name=”suggestAnalyzerFieldType”>text_en</str>

</lst>



 

Description
Data Structure Auxiliary Lucene Index
Building For each Document, the stored content from the field is analyzed according to the suggestAnalyzerFieldType and then additionally EdgeNgram token filtered.

Finally an auxiliary index is built with those tokens.

Lookup strategy The query is analysed according to the suggestAnalyzerFieldType.

Than a phrase search is triggered against the Auxiliary Lucene index

The suggestions are identified starting at the beginning of each token in the field content.

Suggestions returned The entire content of the field .

This suggester is really common nowadays as it allows to provide suggestions in the middle of a field content, taking advantage of the analysis chain provided with the field.

It will be possible in this way to provide suggestions considering synonymsstop words, stemming and any other token filter used in the analysis and match the suggestion based on internal tokens.Let’s see some example:

Query to autocomplete Suggestions Explanation
“gaming”
  • “Video gaming: the history”
  • “Video games are an economic business”
  • “Video game: multiplayer gaming”
The input query is analysed, and the tokens produced are the following : “game” .

In the Auxiliary Index , for each of the field content we have the EdgeNgram tokens:

“v”,”vi”,”vid”… , “g”,”ga”,”gam”,“game” .

So the match happens and the suggestion are returned

“ga”
  • “Video gaming: the history”
  • “Video games are an economic business”
  • “Video game: multiplayer gaming”
The input query is analysed, and the tokens produced are the following : “ga” .

In the Auxiliary Index , for each of the field content we have the EdgeNgram tokens:

“v”,”vi”,”vid”… , “g”,“ga”,”gam”,”game” .

So the match happens and the suggestion are returned


“game econ”  

  • “Video games are an economic business”
Stop words will not appear in the Auxiliary Index.

Both “game” and “econ” will be, so the match applies.

BlendedInfixLookupFactory

We are not going to describe the details  of this lookup strategy as it’s pretty much the same of the AnalyzingInfix.

The only difference appears scoring the suggestions, to weight prefix matches across the matched documents. The score will be higher if a hit is closer to the start of the suggestion or vice versa.

 

FSTLookupFactory


<lst name="suggester">

  <str name=”name”>FSTSuggester</str>

  <str name=”lookupImpl”>FSTLookupFactory</str> 

  <str name=”dictionaryImpl”>DocumentDictionaryFactory</str>

  <str name=”field”>title</str>

</lst>



Description
Data Structure FST
Building For each Document, the stored content is added to the Index FST.
Lookup strategy The query is added to the query FST.

An intersection happens between the Index FST and the query FST.

The suggestions are identified starting at the beginning of the field content.

Suggestions returned The entire content of the field .

This suggester is quite simple as it allows to provide suggestions at the beginning of a field content, with an exact prefix match.Let’s see some example:

Query to autocomplete Suggestions Explanation
“Video gam”
  • Video gaming: the history”
  • Video games are an economic business”
  • Video game: multiplayer gaming”
The suggestions coming are simply the result of the prefix match. No surprises so far.
“Video Games”
  • No Suggestions
The input query is not analysed,  and no field content in the documents starts with that exact prefix


“video gam”  

  • No Suggestions
The input query is not analysed,  and no field content in the documents starts with that exact prefix
“game”
  • No Suggestions
This lookup strategy works only at the beginning of the field content. So no suggestion is returned.

For the following lookup strategy we are going to use a slightly modified corpus of documents :

[
      {
        "id":"44",
        "title":"Video games: the history"},
      {
        "id":"11",
        "title":"Video games the historical background"},
      {
        "id":"55",
        "title":"Superman, hero of the modern time"},
      {
        "id":"33",
        "title":"the study of the hierarchical faceting"}]

FreeTextLookupFactory

<lst name=”suggester”>

  <str name=”name”>FreeTextSuggester</str>

  <str name=”lookupImpl”>FreeTextLookupFactory</str> 

  <str name=”dictionaryImpl”>DocumentDictionaryFactory</str>

  <str name=”field”>title</str>

  <str name=”ngrams”>3</str>

  <str name=”separator”> </str>

  <str name=”suggestFreeTextAnalyzerFieldType”>text_general</str>

</lst>



Description
Data Structure FST
Building For each Document, the stored content from the field is analyzed according to the suggestFreeTextAnalyzerFieldType.

As a last token filter is added a ShingleFilter with a minShingle=2 and maxShingle=.

The final tokens produced are added to the Index FST.

Lookup strategy The query is analysed according to the suggestFreeTextAnalyzerFieldType.

As a last token filter is added a ShingleFilter with a minShingle=2 and maxShingle=.

Only the latest “ngrams” tokens will be evaluated to produce

Suggestions returned ngram tokens suggestions

This lookup strategy is completely different from the others seen so far, its main difference is that the suggestions are ngram tokens ( and NOT the full content of the field).

We must take extra care in using this suggester as it is quite easily prone to errors, some guidelines :

  • Don’t use an heavy Analyzers, the suggested terms will come from the index, so be sure they are meaningful tokens. A really basic analyser is suggested, stop words and stemming are not 
  • Be sure you use the proper separator(‘ ‘ is suggested), the default will be encoded in “#30;”
  • ngrams parameter will set the last n tokens to be considered from the query

Let’s see some example:

Query to autocomplete Suggestions Explanation
“video g”
  • video gaming”
  • video games”
  • generation”
The input query is analysed, and the tokens produced are the following : “video g” “g” 

The analysis was applied at building time as well, producing 2-3 shingles.

“video g” matches by prefix 2 shingles from the index FST .

“g” matches by prefix 1 shingle from the index FST.

“games the h”  

  • games the history”
  • games the historical”
  • the hierarchical”
  • hero”
The input query is analysed, and the tokens produced are the following : “games the h” “the h””h” 

The analysis was applied at building time as well, producing 2-3 shingles.

“games the h” matches by prefix 2 shingles from the index FST .

“the h” matches by prefix 1 shingle from the index FST.

“h” matches by prefix 1 shingle from the index FST.

[1] Suggester Solr wiki

[2] Solr suggester

[3] Lucene Storing Compression

[4] Levenstein Distance

Solr Document Classification – Part 1 – Indexing Time

Introduction

This blog post is about the Solr classification module and the way Lucene classification has been integrated at indexing time.

In the previous blog [1] we have explored the world of Lucene Classification and the extension to use it for Document Classification .
It comes natural to integrate Solr with the Classification module and allow Solr users to easily manage the Classification out of the box .

N.B.  This is supported from Solr 6.1

Solr Classification

Taking inspiration from the work of a dear friend [2] , integrating the classification in Solr can happen 2 sides :
  • Indexing time – through an Update Request Processor
  • Query Time – through a Request handler ( similar to the More like This )
In this first article we are going to explore the Indexing time integration :
The Classification Update Request Processor.

Classification Update Request Processor

First of all let’s describe some basic concepts :
An Update Request Processor Chain, associated to an Update handler, is a pipeline of Update processors, that will be executed in sequence.
It takes in input the added Document (to be indexed) and return the document after it has been processed by all the processors in the chain in sequence.
Finally the document is indexed.
An Update Request Processor is the unit of processing of a chain, it takes in input a Document and operates some processing before it is passed to the following processor in the chain if any.
The main reason for the Update processor is to add intermediate processing steps that can enrich, modify and possibly filter documents , before they are indexed.
It is important because the processor has a view of the entire Document, so it can operate on all the fields the Document is composed.
For further details, follow the official documentation [3].

Description

The Classification Update Request Processor is a simple processor that will automatically classify a document ( the classification will be based on the latest index available) adding a new field  containing the class, before the document is indexed.
After an initial valuable index has been built with human assigned labels to the documents, thanks to this Update Request Processor will be possible to ingest documents with automatically assigned classes.
The processing steps are quite simple :
When a document to be indexed enters the Update Processor Chain, and arrives to the Classification step, this sequence of operations will be executed :
  • The latest Index Reader is retrieved from the latest opened Searcher
  • A Lucene Document Classifier is instantiated with the config parameters in the solrconfig.xml
  • A Class is assigned by the classifier taking in consideration all the relevant fields from the input document
  • A new field is added to the original Document, with the class
  • The Document goes through the next processing step

Configuration

 

e.g. K Nearest Neighbours Classifier
<updateRequestProcessorChain name=”classification”>
<processor class=”solr.ClassificationUpdateProcessorFactory”>
<str name=”inputFields”>title^1.5,content,author</str>
<str name=”classField”>cat</str>
<str name=”algorithm”>knn</str>
<str name=”knn.k”>20</str>
<str name=”knn.minTf”>1</str>
<str name=”knn.minDf”>5</str>
</processor>
</updateRequestProcessorChain>

e.g. Simple Naive Bayes Classifier
<updateRequestProcessorChain name=”classification”>
<processor class=”solr.ClassificationUpdateProcessorFactory”>
<str name=”inputFields”>title^1.5,content,author</str>
<str name=”classField”>cat</str>
<str name=”algorithm”>bayes</str>
</processor>
</updateRequestProcessorChain>

e.g. Update Handler Configuration
<requestHandler name=”/update” >
<lst name=”defaults”>
<str name=”update.chain”>classification</str>
</lst>
</requestHandler>

Parameter Default Description
inputFields
This config param is mandatory The list of fields (comma separated) to be taken in consideration for doing the classification.
Boosting syntax is supported for each field.
classField
This config param is mandatory The field that contains the class of the document. It must appear in the indexed documents .
If knn algorithm it must be stored .
If bayes algorithm it must be indexed and ideally not heavily analysed.
algorithm
knn The algorithm to use for the classification:
– knn ( K Nearest neighbours )
– bayes ( Simple Naive Bayes )
knn.k
10 Advanced – the no. of top docs to select in the MLT results to find the nearest neighbor
knn.minDf
1 Advanced – A term (from the input text) will be taken in consideration by the algorithm only if it appears at least in this minimum number of docs in the index
knn.minTf
1 Advanced – A term (from the input text) will be taken in consideration by the algorithm only if it appears at least this minimum number of times in the input
 

Usage

Indexing News Documents ? we can use the already indexed news with category,  to automatically tag upcoming stories with no human intervention.
E-commerce Search System ? Category assignation will require few human interaction after a valid initial corpus of products has been indexed with manually assigned category.
The possible usage for this Update Request Processor are countless.
In any scenario where we have documents with a class or category manually assigned in our Search System, the automatic Classification can be a perfect fit.
Leveraging the existent Index , the overhead for the Classification processing will be minimal.
After an initial human effort to have a good corpus of classified Documents, the Search System will be able to automatically index the class for the upcoming Documents.
Of course we must remember that for advanced classification scenarios that require in deep tuning, this solution can be not optimal.

Code

The patch is attached to this Jira Issue :
This has been officially merged to Apache Solr starting with 6.1. version.

 

 

Lucene Document Classification

Introduction

This blog post describes the approach used in the Lucene Classification module to adapt text classification to document ( multi field ) classification.

Machine Learning and Search have been always strictly associated.
Machine Learning can help to improve the Search Experience in a lot of ways, extracting more information from the corpus of documents, auto classifying them and clustering them.
On the other hand , Search related data structures are quite similar in content to the ML models used to solve the problems, so we can use them to build up our own algorithms.
But let’s go in order…
In this article we are going to explore how much Auto Classification can be important to easy the user experience in Search, and how will be possible to have an easy, out of the box painless classification directly from Lucene, from our already existent Index, without any external dependency or trained model.

Classification

From Wikipedia :

In machine learning and statisticsclassification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known. An example would be assigning a given email into “spam” or “non-spam” classes or assigning a diagnosis to a given patient as described by observed characteristics of the patient (gender, blood pressure, presence or absence of certain symptoms, etc.).

Classification is a very well known problem, and it is really easy to verify the utility of classifiers on a  daily basis.
When dealing with Search Engines , it is so common that our Documents have a category ( generally assigned by a human ). Wouldn’t be really useful to be able to automatic extract the category for the documents that are missing it ? Wouldn’t be really interesting to automate that processing step to have document automatically categorised, based on the experience ? ( which means using all the documents already categorised in our Search System)
Wouldn’t be amazing to do that without any external tool and without any additional expensive model training activity ?
We already have a lot of information in our Search System, let’s use it !

Lucene Classification Module

To be able to provide a Classification capability, our system generally needs a trained model.
Classification is a problem solved by Supervised Machine learning algorithms which means humans need to provide a training set.
A classification training set is a set of documents, manually tagged with the correct class.
It is the “experience” that the Classification system will use to classify upcoming unseen documents.
Building a trained model is expensive and requires to be stored in specific data structures.
Is it really necessary to build an external trained model when we already have in our index million of documents, already classified by humans and ready to be searched ?
A Lucene index has already really interesting data structures that relates terms to documents ( Inverted Index), fields to terms ( Term Vectors) and all the information we need about Term Frequency and Document Frequency.
A corpus of documents, already tagged with the proper class in a specific field, can be a really useful resource to build our classification inside our search engine without any external tool or data structure.
Based on this observations, the Apache Lucene [1] Open Source project introduced with version 4.2  a document classification module to manage the text classification using the Index data structures.
The facade for this module is the text Classifier, a simple interface ( with 3 implementations available ) :

public interface Classifier {/** * Assign a class (with score) to the given text String *
* @param text a String containing text to be classified *
@return a {@link ClassificationResult} holding assigned class of type T and score *

ClassificationResult assignClass(String text) throws IOException;

/** * Get all the classes (sorted by score, descending) assigned to the given text String.
@param text a String containing text to be classified *
@return the whole list of {@link ClassificationResult}, the classes and scores. Returns null if the classifier can’t make lists. */

List getClasses(String text) throws IOException;

/** * Get the first max classes (sorted by score, descending) assigned to the given text String.
@param text a String containing text to be classified *
@param max the number of return list elements *
@return the whole list of {@link ClassificationResult}, the classes and scores. Cut for “max” number of elements. Returns null if the classifier can’t make lists. */

List getClasses(String text, int max) throws IOException;
}

The available implementations are :

  • KNearestNeighborClassifier – A k-Nearest Neighbor classifier [2] based on MoreLikeThis
  • SimpleNaiveBayesClassifier – A simplistic Lucene based NaiveBayes classifier [3]
  • BooleanPerceptronClassifier – A perceptron [4] based Boolean Classifier. The weights are calculated using TermsEnum.totalTermFreq both on a per field and a per document basis and then a corresponding FST is used for class assignment.

Let’s see the first 2 in details :

KNearestNeighborClassifier

This classifier is based on the Lucene More like This [5]. 
A feature able to retrieve similar documents to a seed one ( or a seed text) calculating the similarity between the docs on a field base.
It takes in input the list of fields to take in consideration ( with relative boost factor to increase the importance of some field over the others).
 
The idea behind this algorithm is simple :
 – given a set of relevant fields for our classification
 – given a field containing the class of the document
It retrieves all the similar documents to the Text in input using the MLT.
 
Only the documents with the class field valorised, are taken in consideration.
The top k documents in result are evaluated, and the class extracted from all of them.
Then a ranking of the retrieved classes is made, based on the frequency of the class in the top K documents.
Currently the algorithm takes in consideration only the frequency of the class to calculate its score.
One limitation is that is not taking in consideration the ranking of the class.
This means that if in the top K, we have the first k/2 documents of class C1 and the second k/2 document of class C2, both the classes will have the same score [6] .
This classifier works on top of the Index, let’s quickly have an overview of the constructor parameters :  
Parameter Description
leafReader The Index Reader that will be used to read the index data structures and classify the document
analyzer The Analyzer to be used to analyse the input unseen text
query A filter to apply on the indexed documents. Only the ones that satisfy the query will be used for the classification
k the no. of top docs to select in the MLT results to find the nearest neighbor
minDocsFreq A term (from the input text) will be taken in consideration by the algorithm only if it appears at least in this minimum number of docs in the index
minTermFreq A term (from the input text) will be taken in consideration by the algorithm only if it appears at least this minimum number of times in the input
classFieldName The field that contains the class of the document. It must appear in the indexed documents. MUST BE STORED
textFieldNames The list of fields to be taken in consideration for doing the classification. They must appear in the indexed documents
Note :  MoreLikeThis constructs a Lucene query based on terms in a document. It does this by pulling terms from the defined list of fields (the textField  
Names parameter, below). 

You usually read that for best results, the fields should have stored term vectors.
In our case, the text to classify is unseen, it is not and indexed document.
So the TermVector is not used at all.

SimpleNaiveBayesClassifier

This Classifier is based on a simplistic implementation of a Naive Bayes Classifier [3].
It uses the Index to get the Term frequencies of the terms, the Doc frequencies and the unique terms.
First of all it extract all the possible classes, this is obtained getting all the terms for the class field.
Then , each class is scored based on
– how frequent is the class in all the classified documents in the index
– how much likely the tokenized text is to belong to the class
The details of the algorithm go beyond the scope of this blog post.
This classifier works on top of the Index, let’s quickly have an overview of the constructor parameters :

Note :  The NaiveBayes Classifier works on terms from the index. This means it pulls from the index the tokens for the class field. Each token will be considered as class, and will see a score associated.

This means that you must be careful in the analysis you choose for the classField and ideally use a not tokenizer field containing the class ( a copyField if necessary)

Document Classification Extension

Parameter Description
leafReader The Index Reader that will be used to read the index data structures and classify the document
analyzer The Analyzer to be used to analyse the input unseen text
query A filter to apply on the indexed documents. Only the ones that satisfy the query will be used for the classification
classFieldName The field that contains the class of the document. It must appear in the indexed documents
textFieldNames The list of fields to be taken in consideration for doing the classification. They must appear in the indexed documents
The original Text Classifiers are perfectly fine, but what about Document classification ?
In a lot of cases the information we want to classify is actually composed of a set of fields with one or more value each one.
Each field content contributes to the classification ( with a different weight to be precise).
Given a simple News article, the title is much more important for the classification in comparison with the text and the author.
But even if not so relevant even the author can have a little part in the class assignation.
Lucene atomic unit of information is the Document. 
 

Documents are the unit of indexing and search. A Document is a set of fields. Each field has a name and a textual value.

This data structure is a perfect input for a new generation of Classifiers that will benefit of augmented information to assign the relevant class(es).
Whenever a simple text is ambiguous, including other fields in the analysis is improving dramatically the precision of the classification.
In details :

  • the field content from the Input Document will be compared with the data structures in the index, only related to that field
  • an input field will be analysed accordingly to its own analysis chain, allowing greater flexibility and precision
  • an input field can be boosted, to affect more the classification. In this way different portions of the input document will have a different relevancy to discover the class of the document.
A new interface is provided :

public interface DocumentClassifier<T> {

/** * Assign a class (with score) to the given {@link org.apache.lucene.document.Document} * * @param document a {@link org.apache.lucene.document.Document} to be classified. Fields are considered features for the classification. * @return a {@link org.apache.lucene.classification.ClassificationResult} holding assigned class of type T and score */

ClassificationResult<T> assignClass(Document document) throws IOException;

/** * Get all the classes (sorted by score, descending) assigned to the given {@link org.apache.lucene.document.Document}. * * @param document a {@link org.apache.lucene.document.Document} to be classified. Fields are considered features for the classification. * @return the whole list of {@link org.apache.lucene.classification.ClassificationResult}, the classes and scores. Returns null if the classifier can’t make lists.*/

List<ClassificationResult<T>> getClasses(Document document) throws IOException;

/** * Get the first max classes (sorted by score, descending) assigned to the given text String. * * @param document a {@link org.apache.lucene.document.Document} to be classified. Fields are considered features for the classification. * @param max the number of return list elements * @return the whole list of {@link org.apache.lucene.classification.ClassificationResult}, the classes and scores. Cut for “max” number of elements. Returns null if the classifier can’t make lists.*/ 

List<ClassificationResult<T>> getClasses(Document document, int max) throws IOException;

And 2 Classifier extended to provide the new functionality :

  • KNearestNeighborDocumentClassifier 
  • SimpleNaiveBayesDocumentClassifier 

The first implementation is available as a contribution for this Jira issue :

LUCENE-6631

Now that a new interface is available, it will be much easier to integrate it with Apache Solr.

Stay tuned for the Solr Classification Integration.

[1] Apache Lucene
[2] K-nearest_neighbors
[3] Naive Bayes
[4] Perceptron
[5] More Like This
[6] LUCENE-6654 – Fix proposed