SolrCloud Leader Election Failing

At the time we speak ( Solr 7.3.0 ) SolrCloud is a reliable and stable distributed architecture for Apache Solr.
But it is not perfect and failures happen.
This lightening blog post will present some practical tips to follow when a specific shard of a collection is down with no leader and the situation is stuck.
The following problem has been experienced with the following Solr versions :

  • 4.10.2
  • 5.4.0

Steps to solve the problem may involve manual interaction with the Zookeeper Ensemble[1].
The following steps are extracted from an interesting thread of the Solr User mailing list[2] and practical experience on the field.
In particular, thanks to Jeff Wartes for the suggestions, that proved useful for me in a couple of occasions.

Problem

  • All nodes for a Shard in a Collection are up and running
  • There is no leader for the shard
  • All the nodes are in a “Recovering” / “Recovery Failed” state
  • Search is down and the situation persist after many minutes (> 5)

Solution

A possible explanation for this problem to occur is when the node-local version of the Zookeeper clusterstate has diverged from the centralized Zookeeper cluster state.
One possible cause for the leader election to break is a Zookeeper failure : for example you lose >=50% of the ensemble nodes or the connectivity among the ensemble nodes for a certain period of time ( this is the scenario I experimented directly)
This failure, even if resolved later, can bring a corruption to the Zookeeper file system.
Some of the SolrCloud collections may remain in a not consistent status.

It may be necessary to manually delete corrupted files from Zookeeper :
Let’s start from :

collections/<collection>/leader_elect/shard<x>/election
An healthy SolrCloud cluster presents as many core_nodeX as the total replicas for the shard.
You don’t want duplicates or missing nodes here.
If you’re having trouble getting a sane election, you can try deleting the lowest-numbered entries (as well as any lower-numbered duplicates) and try to foce the election again. Possibly followed by restarting the node with that lowest-numbered entry.

collections/<collection>/leader/shard<x>
Make sure that this folder exists and has the expected replica as a leader.

collections/<collection>/leader_initiated_recovery
This folder can be informative too, this represents replicas that the *leader* thinks are out of sync, usually due to a failed update request.

After having completed the verification above, there a couple of Collection API endpoints that may be useful :

Force Leader Election
/admin/collections?action=FORCELEADER&collection=<collectionName>&shard=<shardName>

Force Leader Rebalance
/admin/collections?action=REBALANCELEADERS&collection=collectionName

N.B. rebalancing all the leader will affect all the shards

 

[1] Apache Zookeeper Solr Cli

[2] Solr Mailing List Thread

[3] Solr Collection API

 

Give the height the right weight: quantities detection in Apache Solr

Quantity detection? What is a quantity? And why do we need to detect it?

A quantity, as described by Martin Fowler in his “Analysis Patterns” [1] is defined as a pair which combines an amount and unit (such as 30 litres, 0.25 cl, or 140 cm). In search-based applications, there are many cases where you may want to classify your searchable dataset using dimensioned attributes, because such quantities have a special meaning within the business context you are working on. The first example that comes in my mind?

Apache Solr Quantity Detection Plugin

Beer is offered in several containers (e.g. cans, bottles); each of them is available in multiple sizes (e.g. 25 cl, 50 cl, 75 cl or 0.25 lt, 0.50 lt, 0.75 lt). A good catalog would capture these information in dedicated fields, like “container” (bottle, can) and “capacity” (25cl, 50cl, 75cl in the example above): in this way the search logic can properly make use of them. Faceting (and subsequent filtering) is a good example of what the user can do after a first search has been executed: he can filter and refine results, hopefully finding what he was looking for.

But if we start from the beginning of a user interaction, there’s no result at all: only the blank textfield where the user is going to type something. “Something” could be whatever, anything (in his mind) related with the product he wants to find: a brand, a container type, a model name, a quantity. In few words: anything which represents one or more relevant features of the product he’s looking for.

So one of the main challenge, when implementing a search logic, is to get the point about the meaning of the entered terms. This is in general a very hard topic, often involving complicated stuff (e.g. machine learning), but sometimes things move on an easier side, especially when concepts, we want to detect, follow a common and regular pattern: like a quantity.

The main idea behind the quantity detection plugin [2] we developed at Sease is the following: starting from the user entered query, first it detects the quantities (i.e. the amounts and the corresponding units); then, these information will be isolated from the main query and they will be used for boosting up all products relevant to those quantities. Relevancy here can be meant in different ways:

  • exact match: all bottles with a capacity of 25cl
  • range match: all bottles with a capacity between 50cl and 75cl.
  • equivalence exact match: all bottles with a capacity of 0.5 litre (1lt = 100cl)
  • equivalence range match: all bottles with a capacity between 0.5 and 1 litre (1lt = 100cl)

The following is a short list with a brief description of all supported features:

  • variants: a unit can have a preferred form and (optionally) several variants. This can include different forms of the same unit (e.g. mt, meter) or an equivalent unit in a different metric system (e.g. cl, once)
  • equivalences: it’s possible to define an equivalence table so units can be converted at runtime (“beer 0.25 lt” will have the same meaning of “beer 25cl”). An equivalence table maps a unit with a conversion factor.
  • boost: each unit can have a dedicated boost, especially useful for weighting multiple matching units.
  • ranges: each unit can have a configured gap, which triggers a range query where the detected amount can be in the middle (PIVOT), at the beginning (MIN) or at the end (MAX) of the generated range
  • multi-fields: in case we have more than one attribute using the same unit (e.g. height, width, depth)
  • assumptions: in case an “orphan” amount (i.e an amount without a unit) is detected, it’s possible to define an assumption table and let Solr guess the unit.

Feel free to have a try, and if you think it could be useful, please share with us your idea and / or your feedback.

[1] https://martinfowler.com/books/ap.html

[2] https://github.com/SeaseLtd/solr-quantities-detection-qparsers

ECIR 2018 Experience

Logo.png

This blog is a quick summary of my (subjective) experience at ECIR 2018 : the 40th European Conference on Information Retrieval, hosted in Grenoble (France) from 26/03/2018 to 29/03/2018.

Deep Learning and Explicability

Eight long papers accepted were about Deep Learning.
The topics “Neural Network” and “Word Embedding” were the most occurring in the accepted full papers (and rejected) at the conference. It is clear that deep learning technologies are strongly advancing as a de facto standard in Artificial Intelligence and this can be noticed also in Information Retrieval where these technologies can be used to better model the user behaviour and extract topics and semantic from documents.
But if with deep learning and the advanced capabilities of complex models you gain performance, on the other hand you  lose the ability of explaining and debugging why such output corresponds to a given input.
A recurring topic in the Deep Learning track ( and the Industry Day actually) was to find a balance in the performance gain of new techniques and the control over them.
It is an interesting topic and I believe it is just the other face of the coin, it is good though that Academia is noting the importance of such aspect : in the industry a technology requires control and maintenance much more than in the academic environment and most of the times the “debuggability” can affect the decision between a technology and another.

From Academic Papers to Production

The 2018 European Conference on Information Retrieval ended the 29/03 with a brilliant Industry Day focused on the perilous path from Research to Production.
This is the topic that most permeated the conference across different keynotes, sessions and informal discussions.
In Information Retrieval it is still difficult to converge successful researches into successful live systems : most of the times it is not clear which party should be interested in this process.
Okapi BM25 was first published and implemented in the 1980s and 1990s; it became the default similarity in Apache Lucene 6.0 in 2016 .
Academia is focused on finding new interesting problems to solve with inventive techniques and Industry is focused on finding the quickest solution that works (usually).
This brings a gap :  new solutions are not easily reproducible from academic papers and they are far from being practically ready to use outside the experimental controlled environment.
Brilliant researchers crave new interesting problems they can reason on and solve: their focus is to build a rough implementation and validate it with few metrics ( ideally with an accepted related publication ).
After that, their job is done, problem solved, the challenge is not interesting anymore and they can pass to the next problem.
Researchers get bored easily but they risk to never see their researches fulfilled, applied and used in real life.
Often Academia creates its own problems to solve them and get a publication : Publish or Perish -> no publications bring less funds.
This may be or may be not a personal problem, depending on the individual.
Industry is usually seen by academics as a boring place where you just apply consolidated techniques to just get the best result with a minimum effort.
Sometimes industry is just where you end up when you want to make some money and actually see your effort to bring benefits to some population.
And this was(is) true most of the times but with the IT explosion we are living and the boom of competition the situation nowadays is open to change, where a stronger connection between Academia and Industry can ( and should !) happen and conferences such as ECIR is a perfect ground to settle the basis.
So, building on the introduction let’s see a quick summary of the keynotes, topics and sessions that impressed me most from the conference!

From Academic Papers to Production : A Learning To Rank Story[1]

Let’s start from Sease contribution to the conference : a story about the adoption of Learning To Rank in a real world e-commerce scenario.
The session took place at the Industry Day (29/03) and focused on the challenges and pitfalls of moving from the research papers to production.
The entire journey was possible through Open Source software : Apache Solr and RankLib as main actors.
Learning To Rank is becoming extremely popular in the industry and it is a very promising and useful technology to improve the relevancy function leveraging the user behaviour.
But from an open source perspective is still quite a young technology and effort is required to get it right.
The main message that I wanted to transmit is : don’t be scared to fail, if something doesn’t work immediately out of the box, it doesn’t mean it’s not a valid technology, No Pain No Gain, Learning To Rank open source implementations are valid but require tuning and care to bring them to production with success ( and the improvement these technologies can bring is extremely valuable).

The Harsh Reality of Production Information Access Systems[2]

Recipient of the “Karen Spärck Jones Award” Fernando Diaz in this talk focused on the problems derived from the adoption of Information Retrieval technologies in production environments where a deep understanding of individuals, groups and society is required.
Building on the technical aspect involved in applying research in real world systems, the focus switched to the ethical side : are current IR systems moving in the direction of providing the user with a fair and equally accessible information or the monetisation process behind is producing just addicted users who see (and buy) ad hoc tailored information ?

Statistical Stemmers : A Reproducibility Study[3]

This paper is a sort of symbol of the conference trend : reproducibility is as important as the innovation that a research brings.
Winner of the Best Paper Award, it may have caused some perplexity among the audience ( why to reward a paper that is not innovating but just reproducing past techniques ?), but the message that transmits is clear : a research needs to be easily reproducible and effort is required in that direction for an healthy Research & Development flow that doesn’t target just the publication but a real world application.

Entity-centric Topic Extraction and Exploration: A Network-based Approach[4]

Interesting talk, it explores topic modelling over time in a network based fashion :
Instead of modelling a topic as ranked list of terms it uses a network (weighted graph) representation.
This may be interesting for an advanced More Like This implementation, it’s definitely worth an investigation.

Information Scent, Searching and Stopping : Modelling SERP Level Stopping Behaviour[5]

This talk focused on the entire Search Result Page as a possible signal that affects user stopping behaviour i.e. when a search result page is returned, the overall page quality affects the user perception of relevancy and may drive an immediate query reformulation OR a good abandonment ( when the information need is satisfied).
This something that I personally experimented : sometimes from a quick look to the result page you may realise if the search engine understood ( or misunderstood) your information need.
Different factors are involved in what the author calls “Information Scent” but definitely the perceived relevance (modelled through different User Experience approaches) is definitely an interesting topic that sits along the real relevance.
Further studies in this area may affect the way Search Results pages are rendered, to maximise the fruition of information.

Employing Document Embeddings to Solve the “New Catalog” Problem in User Targeting, and provide Explanations to the Users[6]

The new catalog problem is a practical problem for modern recommender systems and platforms. There are a lot of use cases where you have collection of items that you would like to recommend and this ranges over Music Streaming Platforms ( Playlists, Albums, ect), Video Streaming Platform ( Tv Series genres, To View lists ect) and many other domains.
This paper explores both the algorithm behind such recommendations and the explanation need : explaining to the user why a catalog may be relevant for his/her taste is as important as providing a relevant catalog of items.

Anatomy of an Idea: Mixing Open Source, research and Business

This keynote summarises the cornerstone of Sease culture :
Open Source as a bridge between Academia and Industry.
If every research paper should be implemented in a production ready open source platform as part of the publication process, the community would get a direct and immense benefit out of it.
Iterative improvement would get a great traction and generally speaking the entire scientific community will get a strong boost with a better accessibility .
Implementing researches in state of the art production ready open source systems ( where possible) would cut the adoption time at industry level, triggering an healthy process of utilisation and bug fixing.

Industry Day[7]

The industry day was the coronation of the overall trend that was permeating the conference : there is a strong need to build a better connection between the Academic and Industrial world.
And the good audience reception shown ( the organisers had to move the track to the main venue) is a proof that there is an increasingly stronger need of seeing interesting researches applied ( with success or failure) to the real world with the related lessons learned.
Plenty of talks in this session were brilliant, my favourites :

  • Fabrizio Silvestri (Facebook)
    Query Embeddings: From Research to Production and Back!
  • Manos Tsagkias (904Labs)
    A.I. for Search: Lessons Learned
  • Marc Bron (Schibsted Media Group)
    Managment of Industry Research: Experiences of a Research Scientist

In conclusion, the conference was perfectly organised in an excellent venue, the balance in topics and talks was fairly good( both academic and industrial) and I really enjoyed my time in Grenoble, see you next year in Cologne !

[1] https://www.slideshare.net/AlessandroBenedetti/ecir-2018-alessandro
[2] https://www.ecir2018.org/programme/keynote-speakers/
[3] http://www.dei.unipd.it/~silvello/papers/ECIR2018_SA.pdf
[4] https://dbs.ifi.uni-heidelberg.de/files/Team/aspitz/publications/Spitz_Gertz_2018_Entity-centric_Topic_Extraction.pdf
[5] https://strathprints.strath.ac.uk/62856/
[6] https://link.springer.com/chapter/10.1007%2F978-3-319-76941-7_28
[7] https://www.ecir2018.org/industry-day/

Distributed Search Tips for Apache Solr

Distributed search is the foundation for Apache Solr Scalability :

It’s possible to distributed search across different Apache Solr nodes of the same collection ( both in a  legacy[1] or SolrCloud[2] architecture), but it is also possible to distribute search across different collections in a SolrCloud cluster.
Aggregating results from different collections may be useful when you put in place different systems ( that were meant to be separate ) and you later realize that aggregating the results may be an additional useful use case.
This blog will focus on some tricky situations that can happen when running distributed search ( for configuration or details you can refer to the Solr wiki ).

IDF

Inverse Document Frequency affects the score.
This means that a document coming from a big collection can obtain a boost from IDF, in comparison to a similar document from a smaller collection.
This is because the maxDoc count is taken into account as corpus size, so even if a term has the same document frequency, IDF will be strongly affected by the collection size.
Distributed IDF[3] partially solved the problem :

When distributing the search across different shards of the same collection, it works quite well.
But using the ExactStatCache and alternating single collection distribution and multi collection distribution in the same SolrCloud cluster will create some caching conflict.

Specifically if we first execute the inter collection query, the global stats cached will be the inter collection global stats,  so if we then execute a single collection distributed search, the preview global stats will remain cached ( viceversa applies).

Debug Scoring

Real score and debug score is not aligned with the distributed IDF, this means that the debug query will not show the correct distributed IDF and correct scoring calculus for distributed searches[4].

Relevancy tuning

Lucene/Solr score is not probabilistic or normalised.
For the same collection we can have completely different score scales just with different queries.
The situation becomes more complicated when we tune our relevancy adding multiplicative or additive boosts.
Different collections may imply completely different boosting logic that could cause the score of a collection to be on a completely different scale in comparison to another.
We need to be extra careful when tuning relevancy for searches across different collections and try to configure the distributed request handler in the most compatible way as possible.

Request handler

It is important to carefully specify the request handler to be used when using distributed search.
The request will hit one collection in one node and then when distributing the same request handler will be called on the other collections across the other nodes.
If necessary it is possible to configure the aggregator request handler and local request handlers ( this may be useful if we want to use a different scoring formulas per collection, using local parameters) :

Aggregator Request Handler

It is executed on the first node receiving the request.
It will distribute the request and then aggregate the results.
It must describe parameters that are in common across all the collections involved in the search.
It is the one specified in the main request.
e.g.
http://localhost:8983/solr/collection1/select?

Local Request Handler

It is specified passing the parameter : shards.qt=
It is execute on each node that receive the distributed query AFTER the first one.
This can be used to use specific fields or parameter on a per collection basis.
A local request handler may use fields and search components that are local to the collection interested.

e.g.
http://localhost:8983/solr/collection1/select?q=*:*&collection=collection1,collection2&shards.qt=localSelect

N.B. the use of local request handler may be useful in case you want to define local query parser rules, such as local edismax configuration to affect the score.

Unique Key

The unique key field must be the same across the different collections.
Furthermore the value should be unique across the different collections to guarantee a proper behaviour.
If we don’t comply with this rule, Solr will fail in aggregating the results and raise an exception.

[1] https://lucene.apache.org/solr/guide/6_6/distributed-search-with-index-sharding.html
[2] https://lucene.apache.org/solr/guide/6_6/solrcloud.html
[3] https://lucene.apache.org/solr/guide/6_6/distributed-requests.html#DistributedRequests-ConfiguringstatsCache_DistributedIDF_
[4] https://issues.apache.org/jira/browse/SOLR-7759

Solr Is Learning To Rank Better – Part 4 – Solr Integration

Last Stage Of The Journey

This blog post is about the Apache Solr Learning To Rank ( LTR ) integration.

We modelled our dataset, we collected the data and refined it in Part 1 .
Trained the model in Part 2 .
Analysed and evaluate the model and training set in Part 3 .
We are ready to rock and deploy the model and feature definitions to Solr.
I will focus in this blog post on the Apache Solr Learning To Rank ( LTR ) integration from Bloomberg [1] .
The contribution is completed and available from Apache Solr 6.4.
This blog is heavily based on the Learning To Rank ( LTR ) Bloomberg contribution readme [2].

Apache Solr Learning To Rank ( LTR ) integration

The Apache Solr Learning To Rank ( LTR ) integration allows Solr to rerank the search results evaluating a provided Learning To Rank model.
Main responsabilties of the plugin are :

– storage of feature definitions
– storage of models
– feature extraction and caching
– search result rerank

Features Definition

As we learnt from the previous posts, the feature vector is the mathematical representation of each document/query pair and the model will score each search result according to that vector.
Of course we need to tell Solr how to generate the feature vector for each document in the search results.
Here comes the Feature Definition file.
A Json array describing all the relevant features necessary to score our documents through the machine learned LTR model.

e.g.

[{ "name": "isBook",
  "class": "org.apache.solr.ltr.feature.SolrFeature",
  "params":{ "fq": ["{!terms f=category}book"] }
},
{
  "name":  "documentRecency",
  "class": "org.apache.solr.ltr.feature.SolrFeature",
  "params": {
      "q": "{!func}recip( ms(NOW,publish_date), 3.16e-11, 1, 1)"
  }
},
{
  "name" : "userTextTitleMatch",
  "class" : "org.apache.solr.ltr.feature.SolrFeature",
  "params" : { "q" : "{!field f=title}${user_text}" }
},
{
  "name":"book_price",
  "class":"org.apache.solr.ltr.feature.FieldValueFeature",
  "params":{"field":"book_price"}
},
{
  "name":"originalScore",
  "class":"org.apache.solr.ltr.feature.OriginalScoreFeature",
  "params":{}
},
{
   "name" : "userFromMobile",
   "class" : "org.apache.solr.ltr.feature.ValueFeature",
   "params" : { "value" : "${userFromMobile:}", "required":true }
}]  
SolrFeature
– Query Dependent
– Query Independent
A Solr feature is defined by a Solr query following the Solr sintax.
The value of the Solr feature is calculated as the return value of the query run against the document we are scoring.
This feature can depend from query time parameters or can be query independent ( see examples)
e.g.
“params”:{“fq”: [“{!terms f=category}book”] }
– Query Independent
– Boolean feature
If the document match the term ‘book’ in the field ‘category’ the feature value will be 1.
It is query independent as no query param affects this calculation.
“params”:{“q”: “{!func}recip( ms(NOW,publish_date), 3.16e-11, 1, 1)”}
– Query Dependent
– Ordinal feature
The feature value will be calculated as the result of the function query, more recent the document, closer to 1 the value.
It is query dependent as ‘NOW’ affects the feature value.
“params”:{“q”: “{!field f=title}${user_text}” }
– Query Dependent
– Ordinal feature
The feature value will be calculated as the result of the query, more relevant the title content for the user query, higher the value.
It is query dependent as the ‘user_text’ query param affects the calculation.
FieldValueFeature
– Query Independent
A Fiel Value feature is defined by a Solr field.
The value of the feature is calculated as the content of the field for the document we are scoring.
The field must be STORED or DOC-VALUED . This feature is query independent ( see examples)
e.g.
“params”:{“field”:”book_price”}
– Query Independent
– Ordinal feature
The value of the feature will be the content of the ‘book_price’ field for a given document.
It is query independent as no query param affects this calculation.
ValueFeature
– Query Level
– Constant
A Value feature is defined by a constant or an external query parameter.
The value of the feature is calculated as the value passed in the solr request as an efi(External Feature Information) parameter or as a constant.
This feature depends only on the param configured( see examples)
e.g.
“params” : { “value” : “${user_from_mobile:}”, “required”:false }
– Query Level
– Boolean feature
The user will pass the ‘userFromMobile’ request param as an efi
The value of the feature will be the value of the parameter
The default value will be assigned if the parameter is missing in the request
If it is required an exception will be thrown if the parameter is missing in the request“params” : { “value” : “5“, “required”:false }
– Constant
– Ordinal feature
The feature value will be calculated as the constant value of ‘5’ .Except the constant, nothing affect the calculation.
OriginalScoreFeature
– Query Dependent
An Original Score feature is defined with no additional parameters.
The value of the feature is calculated as the original lucene score of the document given the input query.
This feature depends from query time parameters ( see examples)
e.g.
“params”:{}
— Query Dependent
— Ordinal feature
The feature value will be the original lucene score given the input query.
It is query dependent as the entire input query affect this calculation.

EFI ( External Feature Information )

As you noticed in the feature definition json, external request parameters can affect the feature extraction calculation.
When running a rerank query it is possible to pass additional request parameters that will be used at feature extraction time.
We see this in details in the related section.

e.g.
rq={!ltr reRankDocs=3 model=externalmodel efi.user_from_mobile=1}

 

Deploy Features definition

Good, we defined all the features we require for our model, we can now send them to Solr :

curl -XPUT 'http://localhost:8983/solr/collection1/schema/feature-store' --data-binary @/path/features.json -H 'Content-type:application/json'  
 

View Features Definition

To visualise the features just sent, we can access the feature store:

curl -XGET 'http://localhost:8983/solr/collection1/schema/feature-store'  
 

Models Definition

We extensively explored how to train models and how models look like in the format the Solr plugin is expecting.
For details I suggest you reading : Part 2
Let’s have a quick summary anyway  :

 

Linear Model (Ranking SVM, Pranking)

e.g.

 {
    "class":"org.apache.solr.ltr.model.LinearModel",
    "name":"myModelName",
    "features":[
        { "name": "userTextTitleMatch"},
        { "name": "originalScore"},
        { "name": "isBook"}
    ],
    "params":{
        "weights": {
            "userTextTitleMatch": 1.0,
            "originalScore": 0.5,
            "isBook": 0.1
        }
    }
} 


Multiple Additive Trees (LambdaMART, Gradient Boosted Regression Trees )

e.g.

{
    "class":"org.apache.solr.ltr.model.MultipleAdditiveTreesModel",
    "name":"lambdamartmodel",
    "features":[
        { "name": "userTextTitleMatch"},
        { "name": "originalScore"}
    ],
    "params":{
        "trees": [
            {
                "weight" : 1,
                "root": {
                    "feature": "userTextTitleMatch",
                    "threshold": 0.5,
                    "left" : {
                        "value" : -100
                    },
                    "right": {
                        "feature" : "originalScore",
                        "threshold": 10.0,
                        "left" : {
                            "value" : 50
                        },
                        "right" : {
                            "value" : 75
                        }
                    }
                }
            },
            {
                "weight" : 2,
                "root": {
                    "value" : -10
                }
            }
        ]
    }
}  

Heuristic Boosted Model (experimental)

The Heuristic Boosted Model is an experimental model that combines linear boosting to any model.
It is currently available in the experimental branch [3].
This capability is currently supported only by the : org.apache.solr.ltr.ranking.HeuristicBoostedLambdaMARTModel .
The reason behind this approach is that sometimes, at training time we don’t have available all the features we want to use at query time.
e.g.
Your training set is not built on clicks of the search results and contains legacy data, but you want to include the original score as a boosting factor
Let’s see the configuration in details :
Given :

"features":[ { "name": "userTextTitleMatch"}, { "name": "originalScoreFeature"} ]
"boost":{ "feature":"originalScoreFeature", "weight":0.1, "type":"SUM" }  

The original score feature value, weighted by a factor of 0.1, will be added to the score produced by the LambdaMART trees.

 "boost":{ "feature":"originalScoreFeature", "weight":0.1, "type":"PRODUCT" }  
 

The original score feature value, weighted by a factor of 0.1, will be multiplied to the score produced by the LambdaMART trees.

N.B. Take extra care when using this approach. This introduces a manual boosting to the score calculation, which adds flexibility when you don’t have much data for training. However, you will loose some of the benefits of a machine learned model, which was optimized to rerank your results. As you get more data and your model becomes better, you should shift off the manual boosting.

 

e.g

{
    "class":"org.apache.solr.ltr.ranking.HeuristicBoostedLambdaMARTModel",
    "name":"lambdamartmodel",
    "features":[
        { "name": "userTextTitleMatch"},
        { "name": "originalScoreFeature"}
    ],
    "params":{
    "boost": {
          "feature": "originalScoreFeature",
          "weight": 0.5,
          "type": "SUM"
        },
        "trees": [
            {
                "weight" : 1,
                "root": {
                    "feature": "userTextTitleMatch",
                    "threshold": 0.5,
                    "left" : {
                        "value" : -100
                    },
                    "right": {
                        "value" : 10}
 }
 },
 {
 "weight" : 2,
 "root": {
 "value" : -10
 }
 }
 ]
 }
 }

Deploy Model

As we saw for the features definition, deploying the model is quite straightforward :

curl -XPUT 'http://localhost:8983/solr/collection1/schema/model-store' --data-binary @/path/model.json -H 'Content-type:application/json' 
 

View Model

The model will be stored in an easily accessible json store:

curl -XGET 'http://localhost:8983/solr/collection1/schema/model-store'
 

Rerank query

To rerank your search results using a machine learned LTR model it is required to call the rerank component using the Apache Solr Learning To Rank ( LTR ) query parser.

Query Re-Ranking allows you to run an initial query(A) for matching documents and then re-rank the top N documents re-scoring them based on a second query (B).
Since the more costly ranking from query B is only applied to the top N documents it will have less impact on performance then just using the complex query B by itself – the trade off is that documents which score very low using the simple query A may not be considered during the re-ranking phase, even if they would score very highly using query B.  Solr Wiki

The Apache Solr Learning To Rank ( LTR ) integration defines an additional query parser that can be used to define the rerank strategy.
In particular, when rescoring a document in the search results :

  • Features are extracted from the document
  • Score is calculated evaluating the model against the extracted feature vector
  • Final search results are reranked according to the new score
rq={!ltr model=myModelName reRankDocs=25}

!ltr – will use the ltr query parser
model=myModelName – specifies which model in the model-store to use to score the documents
reRankDocs=25 – specifies that only the top 25 search results from the original ranking, will be scored and reranked

When passing external feature information (EFI) that will be used to extract the feature vector, the syntax is pretty similar :

rq={!ltr reRankDocs=3 model=externalmodel efi.parameter1=’value1′ efi.parameter2=’value2′}

e.g.

rq={!ltr reRankDocs=3 model=externalModel efi.user_input_query=’Casablanca’ efi.user_from_mobile=1}

Sharding

When using sharding, each shard will rerank, so the reRankDocs will be considered per shard.

e.g.
10 shards
You run distributed query with :
rq={!ltr reRankDocs=10 …
You will get a total of 100 documents re-ranked .

Pagination

Pagination is delicate[4].

Let’s explore the scenario on a single Solr node and on a sharded architecture.

 

Single Solr node

reRankDocs=15
rows=10

This means each page is composed by 10 results.
What happens when we hit the page 2 ?
The first 5 documents in the search results will have been rescored and affected by the reranking.
The latter 5 documents will preserve the original score and original ranking.

e.g.
Doc 11 – score= 1.2
Doc 12 – score= 1.1
Doc 13 – score= 1.0
Doc 14 – score= 0.9
Doc 15 – score= 0.8
Doc 16 – score= 5.7
Doc 17 – score= 5.6
Doc 18 – score= 5.5
Doc 19 – score= 4.6
Doc 20 – score= 2.4

This means that score(15) could be < score(16), but document 15 and 16 are still in the expected order.
The reason is that the top 15 documents are rescored and reranked and the rest is left unchanged.

Sharded architecture

reRankDocs=15
rows=10
Shards number=2

When looking for the page 2, Solr will trigger queries to she shards to collect 2 pages per shard :
Shard1 : 10 ReRanked docs (page1) + 10 OriginalScored docs (page2)
Shard2 : 10 ReRanked docs (page1) + 10 OriginalScored docs (page2)

The the results will be merged, and possibly, original scored search results can top up reranked docs.
A possible solution could be to normalise the scores to prevent any possibility that a reranked result is surpassed by original scored ones.

Note: The problem is going to happen after you reach rows * page > reRankDocs. In situations when reRankDocs is quite high , the problem will occur only in deep paging.

Feature Extraction And Caching

Extracting the features from the search results document is the most onerous task while reranking using LTR.
The LTRScoringQuery will take care of computing the feature values in the feature vector and then delegate the final score generation to the LTRScoringModel.
For each document the definitions in the feature-store are applied to generate the vector.
The vector can be generate in parallel, leveraging a multi-threaded approach.
Extra care must be taken into account when configuring the number of threads in the game.
The feature vector is currently cached in toto in the QUERY_DOC_FV cache.

This means that given the query and EFIs, we cache the entire feature vector for the document.

Simply giving in input a different efi request parameter will imply a different hashcode for the feature vector and consequentially invalidate the cached one.
This bit can be potentially improved, managing separately caches for the query independent, query dependent and query level features[5].

Solr Is Learning To Rank Better – Part 2 – Model Training

Model Training For Apache Solr Learning To Rank 

If you want to train a model for Apache Solr Learning To Rank , you are in the right place.
This blog post is about the model training phase for the Apache Solr Learning To Rank integration.

We modelled our dataset, we collected the data and refined it in Part 1 .
We have now a shiny lot-of-rows training set ready to be used to train the mathematical model that will re-rank the resulting documents coming out from our queries.
The very first activity to carry on is to decide the model that fits best our requirements.
I will focus in this blog post on two type of models, the ones currently supported by the Apache Solr Learning To Rank integration [1] .

Ranking SVM

Ranking SVM is a linear model based on Support Vector Machines.

The Ranking SVM algorithm is a pair-wise ranking method that adaptively sorts results based on how ‘relevant’  they are for a specific query ( we saw example of relevancy rating for the pair document-query in Part 1).

The Ranking SVM function maps each search query to the features of each sample.
This mapping function projects each data pair onto a feature space.

Each sample ( query-document ) of the training data will be used for the Ranking SVM algorithm to refine the mapping.

Generally, Ranking SVM includes three steps at training time:

  • It maps the similarities between queries and the clicked results onto a certain feature space.
  • It calculates the distances between any two of the vectors obtained in step 1.
  • It forms an optimization problem which is similar to a standard SVM classification and solves this problem with the regular SVM solver
Given the list of the features that describe our problem, an SVM model will assign a numerical weight to each of them, the resulting function will assign a score to a document given in input the feature vector that describes it.
Let’s see an example SVM Model in the Json format expected by the Solr plugin :
e.g.
{
    "class":"org.apache.solr.ltr.ranking.RankSVMModel",
    "name":"myModelName",
    "features":[
        { "name": "userTextTitleMatch"},
        { "name": "originalScore"},
        { "name": "isBook"}
    ],
    "params":{
        "weights": {
            "userTextTitleMatch": 1.0,
            "originalScore": 0.5,
            "isBook": 0.1
        }

}
}

Given 3 example features :
userTextTitleMatch – (query dependant feature) – (binary)
originalScore –  (query dependant feature) – (ordinal)
isBook – (document dependant feature) – (binary)

The ranking SVM model in the example builds a linear function of the variables assigning a different weight to each of them.

Given the documents (expressed with their feature vector):
D1 [1.0, 100, 1]
D2 [0.0, 80, 1]

Applying the model we get the scores :

Score(D1) = 1.0 * userTextTitleMatch(1.0) + 0.5 * originalScore(100) + 0.1 * isBook(1.0) = 51.1
Score(D2) = 1.0 * userTextTitleMatch(0.0) + 0.5 * originalScore(80) + 0.1 * isBook(1.0) = 40.1
The D1 documents is more relevant according the model.

Key points :

  • easy to debug and understand
  • linear function

Here you can find some good Open Source libraries to train SVM models [2].

LambdaMART

LambdaMART is a tree ensemble based model.
Each tree of the ensemble is a weighted regression tree and the final predicted score is the weighted sum of the prediction of each regression tree.
A regression tree is a decision tree that takes in input a feature vector and returns a scalar numerical value in output.
At a high level, LambdaMART is an algorithm that uses gradient boosting to directly optimize Learning to Rank specific cost functions such as NDCG.

To understand LambdaMART let’s explore the main two aspects: Lambda and MART.

MART
LambdaMART is a specific instance of Gradient Boosted Regression Trees, also referred to as Multiple Additive Regression Trees (MART).
Gradient Boosting is a technique for forming a model that is a weighted combination of an ensemble of “weak learners”. In our case, each “weak learner” is a decision tree.
Lambda
At each training point we need to provide a gradient that will allow us to minimize the cost function ( whatever we selected, NDCG for example).
To solve the problem LambdaMART uses an idea coming from lambdaRank:
at each certain point we calculate a value that acts as the gradient we require, this component will effectively modify the ranking, pushing up or down groups of documents and affecting effectively the rules for the leaf values, which cover the point used to calculate lambda.
For additional details these resources has been useful [3] ,[4] .
LambdaMART is currently considered one of the most effective model and it has been proved by a number of winning contests.

Let’s see how a lambdaMART model looks like :

{
    "class":"org.apache.solr.ltr.ranking.LambdaMARTModel",
    "name":"lambdamartmodel",
    "features":[
        { "name": "userTextTitleMatch"},
        { "name": "originalScore"}
    ],
    "params":{
        "trees": [
            {
                "weight" : 1,
                "root": {
                    "feature": "userTextTitleMatch",
                    "threshold": 0.5,
                    "left" : {
                        "value" : -100
                    },
                    "right": {
                        "feature" : "originalScore",
                        "threshold": 10.0,
                        "left" : {
                            "value" : 50
                        },
                        "right" : {
                            "value" : 75
                        }
                    }
                }
            },
            {
                "weight" : 2,
                "root": {
                    "value" : -10
                }
            }
        ]
    }
}

This example model is composed by two trees, each branch split evaluates a conditional expression on a feature value, if the feature value is  :
<=  threshold we visit the left branch
>  threshold we visit the right branch

Reaching the leaf of a tree will produce a score, this will be weighted according to the tree weight and then summed to the other scores produced ( the model is an ensemble of trees).
Given the documents :

D1 [1, 9]
D2 [0, 10]

Applying the model we get the scores :

Score(D1) = 1* (userTextTitleMatch (1) > 0.5 go right , originalScore (9) < 10 = 50) +
2 * -10 = 30

Score(D2) = 1 *(userTextTitleMatch(0) <= 0.5 = -100) +
2 * -10 = -120

The D1 document is more relevant according the model.

Key points :

  • ensemble of trees are effective
  • difficult to debug
  • non linear function

There are good open source libraries to train LambdaMART models, the one that we are going to use in this case of study is RankLib [5]

Evaluation Metric

Important phase of the journey is to validate the model :
How good is our model in re-ranking our dataset ?
How good is the model in re-ranking unknown data  ?
It is really important to carefully select the evaluation metric that best suites your algorithm and your needs.
Evaluating the model will be vital for the validation phase during the training, for testing the model against new data and to consistently being able to assess the predicting quality of the model trained.
N.B. this is particularly important: having a good understanding of the evaluation metric for the algorithm selected, can help a lot in tuning,  improving the model and finding anomalies in the training data.

NDCG@K

Normalised Discounted Cumulative Gain @k is a natural fit when choosing lambdaMART.
This metric evaluates the performance of a ranking model, it varies from 0.0 to 1.0, with 1.0
representing the ideal ranking model.
It takes K as parameter :
K : maximum number of documents that will be returned .

K specifies, for each query,  after the re-ranking, the top K results to evaluate.
N.B. set K to be the number of top documents you want to optimise.
Generally it is the number of documents you show in the first page of results.

Given in input the test set, this is grouped by queryId and for each query the ideal ranking (obtained sorting the documents by descendant relevancy) is compared with the ranking generated by the ranking model.
An average for all the queryIds is calculated.
Detailed explanation can be found in Wikipedia [6].

Knowing how it works, 2 factors sound immediately important when reasoning about NDCG :
1) distribution of relevancy scores in the test set, per queryId
2) number of samples per queryId

War Story 1 : too many highly relevant samples -> high NDCG
Given the test set A :
5 qid:1 1:1 …
5 qid:1 1:1 …
4 qid:1 1:1 …
5 qid:2 1:1 …
4 qid:2 1:1 …
4 qid:2 1:1 …

Even if the ranking model does not a good job, the high distribution of relevant samples increase the probability of hitting high score in the metric.

War Story 2 : small RankLists -> high NDCG
Having few samples (<K) per queryId means that we will not evaluate the performance of our ranking model with accuracy.

In the edge case of 1 sample per queryId, our NDCG@10 will be perfect, but actually this reflect simply a really poor training/test set.

N.B. be careful on how you generate your queryId as you can end up in this edge case and have a wrong perception of the quality of your ranking model.

Model Training

Focus of this section will be how to train a LambdaMART model using RankLib[5],
Let’s see an example and analyse all the parameters:
java  -jar RankLib-2.7.jar 
train /var/trainingSets/trainingSet_2016-07-20.txt 
-feature /var/ltr/trainingSets/trainingSet_2016-07-20_features.txt 
-ranker
-leaf 75
mls 5000 
-metric2t NDCG@20 
kcv 5 
-tvs 0.8 
-kcvmd /var/models 
-kcvmn model-name.xml 
-sparse 
Parameter Description
train
Path to the training set file
feature Feature description file, list feature Ids to be considered by the learner, each on a separate line. If not specified, all features will be used.
ranker Specify which ranking algorithm to use e.g. 6: LambdaMART
metric2t Metric to optimize on the training data. Supported: MAP, NDCG@k, DCG@k, P@k, RR@k, ERR@k (default=ERR@10)
tvs
Set train-validation split to be (x)(1.0-x).
x * (size of the training set) will be used for training
(1.0 -x) * (size of the training set) will be used for validation
kcv
Stands for k Cross Validation
Specify if you want to perform k-fold cross validation using ONLY the specified training data (default=NoCV).
This means we split the training data in k subsets and we run k training executions.
In each execution 1 subset will be used as the test set and k-1 subsets will be used for training.
N.B. in the case that tvs and kcv are both defined, first we split for the cross validation, than the training set produced will be split in training/validation.
e.g.
Initial Training Set size : 100 rows

-kcv 5 
-tvs 0.8 
Test Set : 20 rows
Training Set :  64 ( 0.8 * 80)
Validation Set : 16 (0.2 * 80)
kcvmd
Directory where to save models trained via cross-validation (default=not-save).
kcvmn Name for model learned in each fold. It will be prefix-ed with the fold-number (default=empty).
sparse Allow sparse data processing for the training set ( which means that you don’t need to specify all the features for each sample where the feature has no value)
tree
Number of trees of the ensemble(default=1000).
More complex is the learning problem, more trees we need, but it’s important to be careful and not overfit the training data
leaf
Number of leaves for each tree (default=100).
As the number of trees, it is important to tune carefully this value to avoid overfitting trees.
shrinkage This is the learning rate of the algorithm(default=0.1)
If this is too aggressive the Ranking model will quickly fit the training set, but will not react properly to the validation set evaluation, which means an overall poorer model.
tc
Number of threshold candidates for tree spliting.
-1 to use all feature values (default=256)
Increasing this value we increase the complexity and possible overfitting of the model.
It is suggested to start with a low value and then increase it according the general cardinality of your features.
mls
Min leaf support – minimum #samples each leaf has to contain (default=1).
This is quite an important parameter if we want to take care of outliers.
We can tune this parameter to include only leaves with an high number of samples, discarding pattern validated by a weak support.

Tuning the model training parameters is not an easy task.
Apart the general suggestions, trial & error with a careful comparison of the evaluation metric score is the path to go.
Comparing different models on a common Test Set can help when we have produced a number of models with different tuning configuration.

Converting Model To Json

We finally produced a model which is performing quite well on our Test set.
Cool, it’s the time to push it to Solr and see the magic in action.
But before we can do that, we need to convert the model generated by the Machine Learning libraries into the Json format Solr expects ( you remember the section about the models supported ? )
Taking as example a LambdaMART model, Ranklib will produce an xml model.
So you will need to parse it and convert it to the Json format.
Would be interesting to implement directly in RankLib the possibility of selecting in output the Json format expected by Solr.
Any contribution is welcome [7]!
In the next part we’ll see how to visualise and understand better the model generated.
This activity can be vital to debug the model, see the most popular features, find out some anomaly in the training set and actually assess the validity of the model itself.
 

Solr Is Learning To Rank Better – Part 1 – Data Collection

Learning To Rank In Apache Solr

Introduction

This blog post is about the journey necessary to bring Learning To Rank In Apache Solr search engines.

Learning to Rank[1] is the application of Machine Learning in the construction of ranking models for Information Retrieval systems.
Introducing supervised learning from user behaviour and signals can improve the relevancy of the documents retrieved bringing a new approach in ranking them.
Can be helpful in countless domains, refining free text search results or building a ranking algorithm where only filtering is happening and no initial scoring is available.

This series of blog posts will explore a full journey from the Signals Collection through the re-ranking in Solr.
Part 1  will explore the data collection, data modelling and refining phase
Part 2 will explore the training phase
Part 3 will describe a set of utilities to analyse your models
Part 4 will cover the Solr integration

Requirement : Minimal knowledge and interest in Machine Learning

Special thanks to Diego Ceccarelli, that helped me a lot in the last months introducing me to this amazing topic.
Second special thanks to Juan Luis Andrades, for the passion we shared during this journey and Jesse McLaughlin for the careful technical java insights.
Final special thanks to David Bunbury who contributed with interest, passion and very thoughtful ideas to the cause.

Collecting Signals

The start of the journey is the signals collection, it is a key phase and involves the modelling of the supervised training set that will be used to train the model.
A training set can be Explicit or Implicit.

Implicit

An Implicit training set is collected from user behaviours and interactions.
e.g.
Historical sales and transactions of an E-commerce website
User Clicks in the search result page
Time spent on each document accessed
 
A training set of this type is quite noisy but allows to collect great numbers of signals with small effort.
More the user was engaged with a particular document, stronger the signal of relevancy.
e.g.
A sale of a product is a stronger signal of relevancy than adding it to the basket
User Clicks in the search result page in comparison with documents shown but not clicked
Longer you read a document, stronger the relevancy
 
+ Pros : Cheap to build
Cons : Noisy

 

Explicit

An Explicit training set is collected directly from the interaction with Human Experts.
Given a query and a set of documents, the Human expert will rate the relevancy of each document in the result set.
The score assigned to the document will rate how relevant the document was for the query.
To remove the subjective bias is suggested to use a team of experts to build the training set.
A training set of this type is highly accurate but it is really expensive to build as you need a huge team of experts to produce thousands of rating for all the queries of interest.
+ Pros : Accuracy
– Cons : Expensive to build
 

Training Set

The training set can have different forms, for the sake of this post we will focus on the pairwise approach.
In particular the syntax exposed will be the RankLib[2] syntax.
Each sample of the training set is a signal / event that describes a pair (Query-Document) .
Let’s take as an example a query (1) and a document (1A)

3 qid:1 1:1 2:1 3:0 # 1A

The document can be represented as a feature vector which is a vector of scalar numeric values.
Each element in the vector represents a specific aspect of the document.
We’ll see this in the feature part in details.
Let’s now focus in understanding the sample :

3
How relevant the document is for the query

qid:1
Identify the query

1:1 2:1 3:0
The document, represented as a vector of numerical features

# 1A
A comment, to make more readable your training data

Feature Engineering

For convenience of Machine Learning algorithms, query-document pairs are represented by numerical vectors.
Components of such vectors are called features and can be divided into different groups (depending if they depend on the document, the query or both) :
Document Features (query independent)
This kind of feature depends only on the document and not on the query.
e.g.
Document length
Price of the product
User Rating of the product
Interesting aspect of these features is that they can be potentially precomputed in off-line mode during indexing. They may be used to compute document’s static quality score (or static rank), which is often used to speed up search query evaluation.

 

Query Dependent Features 
This kind of features depends on the query and on the document
e.g.
Is the document containing the query text in the title ?
Is the document (product) of the same brand as expressed in the query
 
Query Level Features 
This kind of features depends only on the query.
e.g.
Number of words in the query
Cuisine Type Selected( e.g. “Italian”, “Sushi” when searching for Restaurants)
Date Selected ( e.g. when searching in an Hotel Booking system)
Department Selected ( e.g. “electonics”, “kitchen”, “DIY” … in an E-commerce website)
User Dependent Features
Also in this case this kind of feature does not depend on the document.
It only depends on the user running the query.
e.g.
Device 
Age of the user
Gender
As described , for convenience of the mathematical algorithms each high level feature must be modelled as a numeric feature.
In the real world a feature describes an aspect of the object (document) and must be represented accordingly:
Ordinal Features
An ordinal feature represents a numerical value with a certain position in a sequence of numbers.
e.g.
Star Rating  ( for a document describing an Hotel)
Price  ( for a document describing an e-commerce product)
 
For the Star Rating feature, stands an order for the different values:
1<2<3<4<5  is logically correct.
For the Price feature, the same observation applies .
100$ < 200$ <300$
A feature is Ordinal when it is possible to compare different values and decide the ranking of these.
 
Categorical Features
A categorical feature represents an attribute of an object that have a set of distinct possible values.
In computer science it is common to call the possible values of a categorical features Enumerations.
e.g.
Colour ( for a document describing a dress)
Country ( for a document describing a location)
It easy to observe that to give an order to the values of a categorical feature does not make any sense.
For the Colour feature :
red < blue < black has no general meaning.
 
Binary Features
A binary feature represents an attribute of an object that can have only two possible values.
Traditionally 0 / 1 in accordance with the binary numeral system.
e.g.
Is the product available ? yes/no ( for a document describing an e-commerce product)
Is the colour Red ? ( for a document describing a dress)
Is the country Italy ? ( for a document describing a location)

One Hot Encoding

When a categorical feature describes your Document, it is tempting to represent each category as an Integer Id :
e.g.
Categorical Feature : colour
Distinct Values : red, green, blue
Representation : colour:1, colour:2 colour:3 
 
With this strategy, the Machine learning algorithm will be able to manage the feature values…
But is the information we pass, the same as the original one ?
Representing a categorical feature as an ordinal feature is introducing an additional ordinal relationship :
1(red) < 2(green) < 3(blue)
which doesn’t reflect the original information.
There are different ways to encode categorical features to make them understandable by the training algorithm. We need basically to encode the original information the feature provides in an numeric form, without any loss or addition if possible.
One possible approach is called One Hot Encoding [3]:
Given a categorical feature with N distinct values, encode it in N binary features, each feature will state if the category applies to the Document.
e.g.
Categorical Feature : colour
Distinct Values : red, green, blue
Encoded Features : colour_red, colour_green, colour_blue 
A document representing a blue shirt will be described by the following feature vector :
… colour_red:0 colour_green:0 colour_blue:1 …
One Hot Encoding is really useful to properly model your information, but take care of the cardinality of your categorical feature as this will be reflected in the number of final features that will describe your signal.
War Story 1 : High Cardinality Categorical Feature
A signal describing a document with a high level categorical feature (with N distinct values) can produce a Feature vector of length N.
This can deteriorate the performance of your trainer as it will need to manage many more features per signal.
It actually happened to me, that simply adding one categorical feature was bringing in thousands of binary features, exhausting the hardware my trainer was using,  killing the training process.
To mitigate this , can be useful to limit the encoded distinct values only to a subset :
  • with white list / black list approach business driven
  • keeping only the top occurring values
  • keeping only the values occurring more than a threshold
  • encode the rest as a scpecial feature :colour_misc
  • Hash the distinct values into a reduced set of hashes

Feature Normalization

Feature Normalisation is a method used to standardize the range of values across different features, a technique quite useful in the data pre-processing phase.
As the majority of machine learning algorithms use the Euclidean distance to calculate the distance between two different points (training vector signals), if a feature has a widely different scales, the distance can be governed by this particular feature.
Normalizing can simplify the problem and give the same weight to each of the features involved.
There are different type of normalization, some of them :
  • Linear Normalization ( min/max based)
  • Sum Normalization ( based on the sum of all the values of the feature )
  • Z Score ( based on the mean/standard deviation of the feature )

Feature Value Quantisation

Another approach to simplify the job of the training algorithm is to quantise the feature values, in order to reduce the cardinality of distinct values per feature.
It is basically the simple concept of rounding, whenever we realise that it does not make any difference for the domain to model the value with an high precision, it is suggested to simplify it and round it to the acceptable level.
e.g.
Domain: ospitality
Ranking problem : Rank restaurants documents
Assuming a feature is the trip_advisor_reviews_count, is it really necessary to model the value as the precise amount of reviews ? Normally would be simpler to round to the nearest k ( 250 or whatever sensible to the business)

Note : Extra care must be taken into account if following this approach.
The reason is that adding an artificial rounding to the data can be dangerous, we can basically compromise the feature itself setting up hard thresholds.
It is always better if the algorithm decides the thresholds with freedom.
It is suggested were possible to not quantise, or quantise only after a deep statistical analysis on our data.
If the target is to simplify the model for any reason, it is possible to evaluate at training time less threshold candidates, depending on the training algorithm.

Missing Values

Some of the signals we are collecting could miss some of the features ( data corruption, bug in the signal collection or simply the information was not available at the time ) .
Modelling our signals with a sparse feature vector will imply that a missing feature will actually be modelled as feature with value 0.
This should generally be ok, but we must be careful in the case that 0 is a valid value for the feature.
e.g.
Given a user_rating feature
A rating of 0 means the product has a very bad rating.
A missing rating means we don’t have a rating for the product ( the product can still be really good) 
A first approach could be to model the 0 rating as slightly greater than 0 (i.e. 0 + ε ) and keep the sparse representation.
In this way we are differentiating the information but we are still modelling the wrong ordinal relationship : 
Missing User Rating  (0) < User Rating 0 (0 + ε)
 
Unfortunately, at least for the RankLib implementation, a missing feature will always be modelled with a value 0, this of course will vary from algorithm to algorithm.
But we can enforce the learning a bit, adding an additional binary feature that states that the User Rating is actually missing :
user_rating:0 , user_rating_missing:1 .
This should help the learning process to actually understand better the difference.
Furthermore, if possible we can help the algorithm, avoiding the sparse representation if necessary and setting for the missing feature a value which is the avg of the feature itself across the different samples.

Outliers

Some of the signals we are collecting could have some outliers ( some signal with an unlikely extremely different value for a specific feature).
This can be caused by bugs in the signal collection process or simply the anomaly can be a real instance of a really rare signal.
Outliers can complicate the job of the model training and can end up in overfitting models that have difficulties in adaptation for unknown datasets.
Identify and resolve anomalies can be vitally important if your dataset is quite fragile.

Tool for data visualisation can help in visualising outliers, but for a deep analysis I suggest to have a read of this interesting blog post [4] .

Feature Definition

Defining the proper set of features to describe the document of our domain is an hard task.
It is not easy to identify in the first place all the relevant features even if we are domain experts, this procedure will take time and a lot of trial and error.
Let’s see a guideline to try to build a feature vector as best as possible :
  • Keep it simple : start from a limited set of features which are really fundamental to describe your problem, the model produced will be really poor, but at least you have the baseline.
  • Iteratively train the model : removing or adding a feature at each execution. this is time consuming but will allow you to identify clearly which features really  matter.
  • Visualise Important Features : After you trained the model, use a visualisation tool to verify which feature is appearing more
  • Meet the Business : Have meetings with the business, to compare what they would expect to see re-ranked and what actually the model re-ranks. When there is discordance let’s have the humans explain why, this should drive to identify missing features or feature that were used in wrong meaning .

Data Preparation

We have carefully designed our vectorial representation of the domain documents, we identified the source of our signals and built our training Set.
So far, so good…
But the model is still performing really poor.
In this scenario the reasons can be countless :

  •  Poor signal quality (noise)
  •  Incomplete feature vector representation
  •  Not uniform distribution of relevant-not relevant documents per queries

Let’s explore some guideline to overcome some of these difficulties :

Noise Removal

In the scenario of implicit signals, it is likely we model the relevancy rating based on an evaluation metric of the user engagement with the document given a certain query.
Depending of the domain we can measure the user engagement in different ways.
Let’s see an example for a specific domain :  E-commerce
We can assign the relevancy rating of each signal depending on the user interaction :
Given a scale 1 to 3 :
1 – User clicked the product
2 – User added the product to the basket
3 – User bought the product
The simplest approach would be to store 1 signal per user interaction.
User behavioural signals are noisy by nature, but this approach introduces even more noise, as for the same feature vector we introduce discordant signals, specifically we are telling the training algorithm that given that feature vector and that query, the document is at the same time :
vaguely relevant – relevant – strongly relevant .
This doesn’t help the training algorithm at all, so we need to find a strategy to avoid that.
One possible way is to keep only the strongest signal per document-query per user episode .
In the case of a user buying a product, we avoid storing in the training set 3 signals, but we keep only the most relevant one.
In this way we transmit to the training algorithm the only the important information for the user interaction with no confusion.

Unbalanced Dataset

In some domain, would be quite common to have a very unbalanced dataset.
A dataset is unbalanced when the relevancy classes are not represented equally in the dataset i.e. we have many more samples of a relevancy class than another.
Taking again the E-commerce example, the number of relevant signals (sales) will be much less than the number of weak signals (clicks).
This unbalance can make the life harder to the training algorithm, as each relevant signal can be covered by many more weakly relevant ones.
Let’s see how we can manipulate the dataset to partially mitigate this problem :

Collect more data

This sounds simple, but collecting more data is generally likely to help.
Of course there are domain when collecting more data is not actually beneficial ( for example when the market change quite dinamically and the previous years dataset becomes almost irrelevant for predicting the current behaviours ) .

Resample the dataset

You can manipulate the data you collected to have more balanced data.
This change is called sampling your dataset and there are two main methods that you can use to even-up the classes: Oversampling and Undersampling[5].
You can add copies of instances from the under-represented relevancy class, this is called over-sampling, or
you can delete instances from the over-represented class, this technique is called under-sampling.
These approaches are often very easy to implement and fast to run. They are an excellent starting point.
Some ideas and suggestion :
  • Consider testing under-sampling when you have an a lot data (tens- or hundreds of thousands of instances or more), you can undersample randomly or following any business logic available
  • Consider testing different resampled ratios (it is not necessary to target a 1:1 ratio)
  • When oversampling consider advanced approaches, instead of simple duplication of already existent samples, could be good to artificially generate new ones [6]
Be careful, resampling is not always helping your training algorithm, so experiment in details your use case.
War Story 2 : Oversampling by duplication
Given a dataset highly unbalanced, the model trained was struggling to predict accurately the desired relevancy class for document-query test samples.
Oversampling was a tempting approach, so here we go!
 
As I am using cross validation, the first approach has been to oversample the dataset by duplication.
I took each relevancy class and duplicate the samples until I built a balanced dataset.
Then started the training in Cross Validation and I trained a model which was immense and almost perfectly able to predict the relevancy of validation and test samples.
Cool ! I got it !
Actually was not an amazing result at all, because of course applying cross validation on an oversampled dataset built validation and test sets oversampled as well.
This means that it was really likely that a sample in the training set was appearing exactly the same in the validation set and in the test set.
The resulting model was basically highly overfitted and not that good to predict unknown test sets.
 
So I moved to a manual training set- validation set – test set split and oversampled only the training set.
This was definitely better and built a model that was much more suitable.
It was not able to perfectly predict validation and test sets but this was a good point as the model was able to predict unknown data sets better.
 
Then I trained again, this time the original dataset, manually split as before but not oversampled.
The resulting model was actually better than the oversampled one.
One of the possible reasons is that the training algorithm and model I was using (LambdaMART) didn’t get any specific help from the resampling, actually the model lost the capability of discovering which samples were converting better ( strong relevant signals : weak relevant signals ratio).
Practically I favoured the volume over the conversion ratio, increasing the recall but losing the precision of the ranker.
 
Conclusion : Experiment, evaluate the approach with your algorithm, compare, don’t assume it is going to be better without checking

Query Id hashing

As we have seen in the initial part of the blog, each sample is a document-query pair, represented in a vectorial format.
The query is represented by an Id, this Id is used to group samples for the same query, and evaluate the ranker performance over each samples group.
This can give us an evaluation of how good the ranker is performing on average on all the queries of interest.
This brings to carefully decide how we generate the query identifier.
If we generate a too specific hash, we risk to build small groups of samples, this small groups can get an high score when ranking them, biased by their small size.
Extreme case
e.g.
Really specific hash, brings many groups to be 1 sample groups.
This brings up the evaluation metric score, as we are averaging and a lot of groups, being of size 1, are perfectly easy to rank.
If we generate an hash that is not specific enough we can end up in immense groups, not that helpful to evaluate our ranking model on the different real world scenarios.
The ideal scenario is to have one query Id per query category of interest, with a good number of samples related, this would be the perfect dataset, in this way we can validate both :
  • the intra-group relevancy ( as the group are composed by sufficient samples)
  • the average across the queries ( as we have a valid set of different queries available)
The query category could depend on a set of Query Dependent Features, this means that we can calculate the hash using the values of these features.
Being careful we maintain a balance between the group sizes and the granularity of the hash.
It is important to have the query categories across the training/validation/test set :
e.g.
We have 3 different query categories , based on the value of a user dependent feature ( user_age_segment) .
These age segments represents three very different market segments, that require very different ranking models.
When building our training set we want enough samples for each category and we want them to be split across the training/validation/test sets to be able to validate how good we are in predicting the different market segment.
This can potentially drive to build separate models and separate data sets if it is the case.

Solr Document Classification – Part 1 – Indexing Time

Introduction

This blog post is about the Solr classification module and the way Lucene classification has been integrated at indexing time.

In the previous blog [1] we have explored the world of Lucene Classification and the extension to use it for Document Classification .
It comes natural to integrate Solr with the Classification module and allow Solr users to easily manage the Classification out of the box .

N.B.  This is supported from Solr 6.1

Solr Classification

Taking inspiration from the work of a dear friend [2] , integrating the classification in Solr can happen 2 sides :
  • Indexing time – through an Update Request Processor
  • Query Time – through a Request handler ( similar to the More like This )
In this first article we are going to explore the Indexing time integration :
The Classification Update Request Processor.

Classification Update Request Processor

First of all let’s describe some basic concepts :
An Update Request Processor Chain, associated to an Update handler, is a pipeline of Update processors, that will be executed in sequence.
It takes in input the added Document (to be indexed) and return the document after it has been processed by all the processors in the chain in sequence.
Finally the document is indexed.
An Update Request Processor is the unit of processing of a chain, it takes in input a Document and operates some processing before it is passed to the following processor in the chain if any.
The main reason for the Update processor is to add intermediate processing steps that can enrich, modify and possibly filter documents , before they are indexed.
It is important because the processor has a view of the entire Document, so it can operate on all the fields the Document is composed.
For further details, follow the official documentation [3].

Description

The Classification Update Request Processor is a simple processor that will automatically classify a document ( the classification will be based on the latest index available) adding a new field  containing the class, before the document is indexed.
After an initial valuable index has been built with human assigned labels to the documents, thanks to this Update Request Processor will be possible to ingest documents with automatically assigned classes.
The processing steps are quite simple :
When a document to be indexed enters the Update Processor Chain, and arrives to the Classification step, this sequence of operations will be executed :
  • The latest Index Reader is retrieved from the latest opened Searcher
  • A Lucene Document Classifier is instantiated with the config parameters in the solrconfig.xml
  • A Class is assigned by the classifier taking in consideration all the relevant fields from the input document
  • A new field is added to the original Document, with the class
  • The Document goes through the next processing step

Configuration

 

e.g. K Nearest Neighbours Classifier
<updateRequestProcessorChain name=”classification”>
<processor class=”solr.ClassificationUpdateProcessorFactory”>
<str name=”inputFields”>title^1.5,content,author</str>
<str name=”classField”>cat</str>
<str name=”algorithm”>knn</str>
<str name=”knn.k”>20</str>
<str name=”knn.minTf”>1</str>
<str name=”knn.minDf”>5</str>
</processor>
</updateRequestProcessorChain>

e.g. Simple Naive Bayes Classifier
<updateRequestProcessorChain name=”classification”>
<processor class=”solr.ClassificationUpdateProcessorFactory”>
<str name=”inputFields”>title^1.5,content,author</str>
<str name=”classField”>cat</str>
<str name=”algorithm”>bayes</str>
</processor>
</updateRequestProcessorChain>

e.g. Update Handler Configuration
<requestHandler name=”/update” >
<lst name=”defaults”>
<str name=”update.chain”>classification</str>
</lst>
</requestHandler>

Parameter Default Description
inputFields
This config param is mandatory The list of fields (comma separated) to be taken in consideration for doing the classification.
Boosting syntax is supported for each field.
classField
This config param is mandatory The field that contains the class of the document. It must appear in the indexed documents .
If knn algorithm it must be stored .
If bayes algorithm it must be indexed and ideally not heavily analysed.
algorithm
knn The algorithm to use for the classification:
– knn ( K Nearest neighbours )
– bayes ( Simple Naive Bayes )
knn.k
10 Advanced – the no. of top docs to select in the MLT results to find the nearest neighbor
knn.minDf
1 Advanced – A term (from the input text) will be taken in consideration by the algorithm only if it appears at least in this minimum number of docs in the index
knn.minTf
1 Advanced – A term (from the input text) will be taken in consideration by the algorithm only if it appears at least this minimum number of times in the input
 

Usage

Indexing News Documents ? we can use the already indexed news with category,  to automatically tag upcoming stories with no human intervention.
E-commerce Search System ? Category assignation will require few human interaction after a valid initial corpus of products has been indexed with manually assigned category.
The possible usage for this Update Request Processor are countless.
In any scenario where we have documents with a class or category manually assigned in our Search System, the automatic Classification can be a perfect fit.
Leveraging the existent Index , the overhead for the Classification processing will be minimal.
After an initial human effort to have a good corpus of classified Documents, the Search System will be able to automatically index the class for the upcoming Documents.
Of course we must remember that for advanced classification scenarios that require in deep tuning, this solution can be not optimal.

Code

The patch is attached to this Jira Issue :
This has been officially merged to Apache Solr starting with 6.1. version.

 

 

Lucene Document Classification

Introduction

This blog post describes the approach used in the Lucene Classification module to adapt text classification to document ( multi field ) classification.

Machine Learning and Search have been always strictly associated.
Machine Learning can help to improve the Search Experience in a lot of ways, extracting more information from the corpus of documents, auto classifying them and clustering them.
On the other hand , Search related data structures are quite similar in content to the ML models used to solve the problems, so we can use them to build up our own algorithms.
But let’s go in order…
In this article we are going to explore how much Auto Classification can be important to easy the user experience in Search, and how will be possible to have an easy, out of the box painless classification directly from Lucene, from our already existent Index, without any external dependency or trained model.

Classification

From Wikipedia :

In machine learning and statisticsclassification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known. An example would be assigning a given email into “spam” or “non-spam” classes or assigning a diagnosis to a given patient as described by observed characteristics of the patient (gender, blood pressure, presence or absence of certain symptoms, etc.).

Classification is a very well known problem, and it is really easy to verify the utility of classifiers on a  daily basis.
When dealing with Search Engines , it is so common that our Documents have a category ( generally assigned by a human ). Wouldn’t be really useful to be able to automatic extract the category for the documents that are missing it ? Wouldn’t be really interesting to automate that processing step to have document automatically categorised, based on the experience ? ( which means using all the documents already categorised in our Search System)
Wouldn’t be amazing to do that without any external tool and without any additional expensive model training activity ?
We already have a lot of information in our Search System, let’s use it !

Lucene Classification Module

To be able to provide a Classification capability, our system generally needs a trained model.
Classification is a problem solved by Supervised Machine learning algorithms which means humans need to provide a training set.
A classification training set is a set of documents, manually tagged with the correct class.
It is the “experience” that the Classification system will use to classify upcoming unseen documents.
Building a trained model is expensive and requires to be stored in specific data structures.
Is it really necessary to build an external trained model when we already have in our index million of documents, already classified by humans and ready to be searched ?
A Lucene index has already really interesting data structures that relates terms to documents ( Inverted Index), fields to terms ( Term Vectors) and all the information we need about Term Frequency and Document Frequency.
A corpus of documents, already tagged with the proper class in a specific field, can be a really useful resource to build our classification inside our search engine without any external tool or data structure.
Based on this observations, the Apache Lucene [1] Open Source project introduced with version 4.2  a document classification module to manage the text classification using the Index data structures.
The facade for this module is the text Classifier, a simple interface ( with 3 implementations available ) :

public interface Classifier {/** * Assign a class (with score) to the given text String *
* @param text a String containing text to be classified *
@return a {@link ClassificationResult} holding assigned class of type T and score *

ClassificationResult assignClass(String text) throws IOException;

/** * Get all the classes (sorted by score, descending) assigned to the given text String.
@param text a String containing text to be classified *
@return the whole list of {@link ClassificationResult}, the classes and scores. Returns null if the classifier can’t make lists. */

List getClasses(String text) throws IOException;

/** * Get the first max classes (sorted by score, descending) assigned to the given text String.
@param text a String containing text to be classified *
@param max the number of return list elements *
@return the whole list of {@link ClassificationResult}, the classes and scores. Cut for “max” number of elements. Returns null if the classifier can’t make lists. */

List getClasses(String text, int max) throws IOException;
}

The available implementations are :

  • KNearestNeighborClassifier – A k-Nearest Neighbor classifier [2] based on MoreLikeThis
  • SimpleNaiveBayesClassifier – A simplistic Lucene based NaiveBayes classifier [3]
  • BooleanPerceptronClassifier – A perceptron [4] based Boolean Classifier. The weights are calculated using TermsEnum.totalTermFreq both on a per field and a per document basis and then a corresponding FST is used for class assignment.

Let’s see the first 2 in details :

KNearestNeighborClassifier

This classifier is based on the Lucene More like This [5]. 
A feature able to retrieve similar documents to a seed one ( or a seed text) calculating the similarity between the docs on a field base.
It takes in input the list of fields to take in consideration ( with relative boost factor to increase the importance of some field over the others).
 
The idea behind this algorithm is simple :
 – given a set of relevant fields for our classification
 – given a field containing the class of the document
It retrieves all the similar documents to the Text in input using the MLT.
 
Only the documents with the class field valorised, are taken in consideration.
The top k documents in result are evaluated, and the class extracted from all of them.
Then a ranking of the retrieved classes is made, based on the frequency of the class in the top K documents.
Currently the algorithm takes in consideration only the frequency of the class to calculate its score.
One limitation is that is not taking in consideration the ranking of the class.
This means that if in the top K, we have the first k/2 documents of class C1 and the second k/2 document of class C2, both the classes will have the same score [6] .
This classifier works on top of the Index, let’s quickly have an overview of the constructor parameters :  
Parameter Description
leafReader The Index Reader that will be used to read the index data structures and classify the document
analyzer The Analyzer to be used to analyse the input unseen text
query A filter to apply on the indexed documents. Only the ones that satisfy the query will be used for the classification
k the no. of top docs to select in the MLT results to find the nearest neighbor
minDocsFreq A term (from the input text) will be taken in consideration by the algorithm only if it appears at least in this minimum number of docs in the index
minTermFreq A term (from the input text) will be taken in consideration by the algorithm only if it appears at least this minimum number of times in the input
classFieldName The field that contains the class of the document. It must appear in the indexed documents. MUST BE STORED
textFieldNames The list of fields to be taken in consideration for doing the classification. They must appear in the indexed documents
Note :  MoreLikeThis constructs a Lucene query based on terms in a document. It does this by pulling terms from the defined list of fields (the textField  
Names parameter, below). 

You usually read that for best results, the fields should have stored term vectors.
In our case, the text to classify is unseen, it is not and indexed document.
So the TermVector is not used at all.

SimpleNaiveBayesClassifier

This Classifier is based on a simplistic implementation of a Naive Bayes Classifier [3].
It uses the Index to get the Term frequencies of the terms, the Doc frequencies and the unique terms.
First of all it extract all the possible classes, this is obtained getting all the terms for the class field.
Then , each class is scored based on
– how frequent is the class in all the classified documents in the index
– how much likely the tokenized text is to belong to the class
The details of the algorithm go beyond the scope of this blog post.
This classifier works on top of the Index, let’s quickly have an overview of the constructor parameters :

Note :  The NaiveBayes Classifier works on terms from the index. This means it pulls from the index the tokens for the class field. Each token will be considered as class, and will see a score associated.

This means that you must be careful in the analysis you choose for the classField and ideally use a not tokenizer field containing the class ( a copyField if necessary)

Document Classification Extension

Parameter Description
leafReader The Index Reader that will be used to read the index data structures and classify the document
analyzer The Analyzer to be used to analyse the input unseen text
query A filter to apply on the indexed documents. Only the ones that satisfy the query will be used for the classification
classFieldName The field that contains the class of the document. It must appear in the indexed documents
textFieldNames The list of fields to be taken in consideration for doing the classification. They must appear in the indexed documents
The original Text Classifiers are perfectly fine, but what about Document classification ?
In a lot of cases the information we want to classify is actually composed of a set of fields with one or more value each one.
Each field content contributes to the classification ( with a different weight to be precise).
Given a simple News article, the title is much more important for the classification in comparison with the text and the author.
But even if not so relevant even the author can have a little part in the class assignation.
Lucene atomic unit of information is the Document. 
 

Documents are the unit of indexing and search. A Document is a set of fields. Each field has a name and a textual value.

This data structure is a perfect input for a new generation of Classifiers that will benefit of augmented information to assign the relevant class(es).
Whenever a simple text is ambiguous, including other fields in the analysis is improving dramatically the precision of the classification.
In details :

  • the field content from the Input Document will be compared with the data structures in the index, only related to that field
  • an input field will be analysed accordingly to its own analysis chain, allowing greater flexibility and precision
  • an input field can be boosted, to affect more the classification. In this way different portions of the input document will have a different relevancy to discover the class of the document.
A new interface is provided :

public interface DocumentClassifier<T> {

/** * Assign a class (with score) to the given {@link org.apache.lucene.document.Document} * * @param document a {@link org.apache.lucene.document.Document} to be classified. Fields are considered features for the classification. * @return a {@link org.apache.lucene.classification.ClassificationResult} holding assigned class of type T and score */

ClassificationResult<T> assignClass(Document document) throws IOException;

/** * Get all the classes (sorted by score, descending) assigned to the given {@link org.apache.lucene.document.Document}. * * @param document a {@link org.apache.lucene.document.Document} to be classified. Fields are considered features for the classification. * @return the whole list of {@link org.apache.lucene.classification.ClassificationResult}, the classes and scores. Returns null if the classifier can’t make lists.*/

List<ClassificationResult<T>> getClasses(Document document) throws IOException;

/** * Get the first max classes (sorted by score, descending) assigned to the given text String. * * @param document a {@link org.apache.lucene.document.Document} to be classified. Fields are considered features for the classification. * @param max the number of return list elements * @return the whole list of {@link org.apache.lucene.classification.ClassificationResult}, the classes and scores. Cut for “max” number of elements. Returns null if the classifier can’t make lists.*/ 

List<ClassificationResult<T>> getClasses(Document document, int max) throws IOException;

And 2 Classifier extended to provide the new functionality :

  • KNearestNeighborDocumentClassifier 
  • SimpleNaiveBayesDocumentClassifier 

The first implementation is available as a contribution for this Jira issue :

LUCENE-6631

Now that a new interface is available, it will be much easier to integrate it with Apache Solr.

Stay tuned for the Solr Classification Integration.

[1] Apache Lucene
[2] K-nearest_neighbors
[3] Naive Bayes
[4] Perceptron
[5] More Like This
[6] LUCENE-6654 – Fix proposed