Solr Is Learning To Rank Better – Part 4 – Solr Integration

Last Stage Of The Journey

This blog post is about the Apache Solr Learning To Rank ( LTR ) integration.

We modelled our dataset, we collected the data and refined it in Part 1 .
Trained the model in Part 2 .
Analysed and evaluate the model and training set in Part 3 .
We are ready to rock and deploy the model and feature definitions to Solr.
I will focus in this blog post on the Apache Solr Learning To Rank ( LTR ) integration from Bloomberg [1] .
The contribution is completed and available from Apache Solr 6.4.
This blog is heavily based on the Learning To Rank ( LTR ) Bloomberg contribution readme [2].

Apache Solr Learning To Rank ( LTR ) integration

The Apache Solr Learning To Rank ( LTR ) integration allows Solr to rerank the search results evaluating a provided Learning To Rank model.
Main responsabilties of the plugin are :

– storage of feature definitions
– storage of models
– feature extraction and caching
– search result rerank

Features Definition

As we learnt from the previous posts, the feature vector is the mathematical representation of each document/query pair and the model will score each search result according to that vector.
Of course we need to tell Solr how to generate the feature vector for each document in the search results.
Here comes the Feature Definition file.
A Json array describing all the relevant features necessary to score our documents through the machine learned LTR model.

e.g.

[{ "name": "isBook",
  "class": "org.apache.solr.ltr.feature.SolrFeature",
  "params":{ "fq": ["{!terms f=category}book"] }
},
{
  "name":  "documentRecency",
  "class": "org.apache.solr.ltr.feature.SolrFeature",
  "params": {
      "q": "{!func}recip( ms(NOW,publish_date), 3.16e-11, 1, 1)"
  }
},
{
  "name" : "userTextTitleMatch",
  "class" : "org.apache.solr.ltr.feature.SolrFeature",
  "params" : { "q" : "{!field f=title}${user_text}" }
},
{
  "name":"book_price",
  "class":"org.apache.solr.ltr.feature.FieldValueFeature",
  "params":{"field":"book_price"}
},
{
  "name":"originalScore",
  "class":"org.apache.solr.ltr.feature.OriginalScoreFeature",
  "params":{}
},
{
   "name" : "userFromMobile",
   "class" : "org.apache.solr.ltr.feature.ValueFeature",
   "params" : { "value" : "${userFromMobile:}", "required":true }
}]  
SolrFeature
– Query Dependent
– Query Independent
A Solr feature is defined by a Solr query following the Solr sintax.
The value of the Solr feature is calculated as the return value of the query run against the document we are scoring.
This feature can depend from query time parameters or can be query independent ( see examples)
e.g.
“params”:{“fq”: [“{!terms f=category}book”] }
– Query Independent
– Boolean feature
If the document match the term ‘book’ in the field ‘category’ the feature value will be 1.
It is query independent as no query param affects this calculation.
“params”:{“q”: “{!func}recip( ms(NOW,publish_date), 3.16e-11, 1, 1)”}
– Query Dependent
– Ordinal feature
The feature value will be calculated as the result of the function query, more recent the document, closer to 1 the value.
It is query dependent as ‘NOW’ affects the feature value.
“params”:{“q”: “{!field f=title}${user_text}” }
– Query Dependent
– Ordinal feature
The feature value will be calculated as the result of the query, more relevant the title content for the user query, higher the value.
It is query dependent as the ‘user_text’ query param affects the calculation.
FieldValueFeature
– Query Independent
A Fiel Value feature is defined by a Solr field.
The value of the feature is calculated as the content of the field for the document we are scoring.
The field must be STORED or DOC-VALUED . This feature is query independent ( see examples)
e.g.
“params”:{“field”:”book_price”}
– Query Independent
– Ordinal feature
The value of the feature will be the content of the ‘book_price’ field for a given document.
It is query independent as no query param affects this calculation.
ValueFeature
– Query Level
– Constant
A Value feature is defined by a constant or an external query parameter.
The value of the feature is calculated as the value passed in the solr request as an efi(External Feature Information) parameter or as a constant.
This feature depends only on the param configured( see examples)
e.g.
“params” : { “value” : “${user_from_mobile:}”, “required”:false }
– Query Level
– Boolean feature
The user will pass the ‘userFromMobile’ request param as an efi
The value of the feature will be the value of the parameter
The default value will be assigned if the parameter is missing in the request
If it is required an exception will be thrown if the parameter is missing in the request“params” : { “value” : “5“, “required”:false }
– Constant
– Ordinal feature
The feature value will be calculated as the constant value of ‘5’ .Except the constant, nothing affect the calculation.
OriginalScoreFeature
– Query Dependent
An Original Score feature is defined with no additional parameters.
The value of the feature is calculated as the original lucene score of the document given the input query.
This feature depends from query time parameters ( see examples)
e.g.
“params”:{}
— Query Dependent
— Ordinal feature
The feature value will be the original lucene score given the input query.
It is query dependent as the entire input query affect this calculation.

EFI ( External Feature Information )

As you noticed in the feature definition json, external request parameters can affect the feature extraction calculation.
When running a rerank query it is possible to pass additional request parameters that will be used at feature extraction time.
We see this in details in the related section.

e.g.
rq={!ltr reRankDocs=3 model=externalmodel efi.user_from_mobile=1}

 

Deploy Features definition

Good, we defined all the features we require for our model, we can now send them to Solr :

curl -XPUT 'http://localhost:8983/solr/collection1/schema/feature-store' --data-binary @/path/features.json -H 'Content-type:application/json'  
 

View Features Definition

To visualise the features just sent, we can access the feature store:

curl -XGET 'http://localhost:8983/solr/collection1/schema/feature-store'  
 

Models Definition

We extensively explored how to train models and how models look like in the format the Solr plugin is expecting.
For details I suggest you reading : Part 2
Let’s have a quick summary anyway  :

 

Linear Model (Ranking SVM, Pranking)

e.g.

 {
    "class":"org.apache.solr.ltr.model.LinearModel",
    "name":"myModelName",
    "features":[
        { "name": "userTextTitleMatch"},
        { "name": "originalScore"},
        { "name": "isBook"}
    ],
    "params":{
        "weights": {
            "userTextTitleMatch": 1.0,
            "originalScore": 0.5,
            "isBook": 0.1
        }
    }
} 

 

Multiple Additive Trees (LambdaMART, Gradient Boosted Regression Trees )

e.g.

{
    "class":"org.apache.solr.ltr.model.MultipleAdditiveTreesModel",
    "name":"lambdamartmodel",
    "features":[
        { "name": "userTextTitleMatch"},
        { "name": "originalScore"}
    ],
    "params":{
        "trees": [
            {
                "weight" : 1,
                "root": {
                    "feature": "userTextTitleMatch",
                    "threshold": 0.5,
                    "left" : {
                        "value" : -100
                    },
                    "right": {
                        "feature" : "originalScore",
                        "threshold": 10.0,
                        "left" : {
                            "value" : 50
                        },
                        "right" : {
                            "value" : 75
                        }
                    }
                }
            },
            {
                "weight" : 2,
                "root": {
                    "value" : -10
                }
            }
        ]
    }
}  

Heuristic Boosted Model (experimental)

The Heuristic Boosted Model is an experimental model that combines linear boosting to any model.
It is currently available in the experimental branch [3].
This capability is currently supported only by the : org.apache.solr.ltr.ranking.HeuristicBoostedLambdaMARTModel .
The reason behind this approach is that sometimes, at training time we don’t have available all the features we want to use at query time.
e.g.
Your training set is not built on clicks of the search results and contains legacy data, but you want to include the original score as a boosting factor
Let’s see the configuration in details :
Given :

"features":[ { "name": "userTextTitleMatch"}, { "name": "originalScoreFeature"} ]
"boost":{ "feature":"originalScoreFeature", "weight":0.1, "type":"SUM" }  

The original score feature value, weighted by a factor of 0.1, will be added to the score produced by the LambdaMART trees.

 "boost":{ "feature":"originalScoreFeature", "weight":0.1, "type":"PRODUCT" }  
 

The original score feature value, weighted by a factor of 0.1, will be multiplied to the score produced by the LambdaMART trees.

N.B. Take extra care when using this approach. This introduces a manual boosting to the score calculation, which adds flexibility when you don’t have much data for training. However, you will loose some of the benefits of a machine learned model, which was optimized to rerank your results. As you get more data and your model becomes better, you should shift off the manual boosting.

 

e.g

{
    "class":"org.apache.solr.ltr.ranking.HeuristicBoostedLambdaMARTModel",
    "name":"lambdamartmodel",
    "features":[
        { "name": "userTextTitleMatch"},
        { "name": "originalScoreFeature"}
    ],
    "params":{
    "boost": {
          "feature": "originalScoreFeature",
          "weight": 0.5,
          "type": "SUM"
        },
        "trees": [
            {
                "weight" : 1,
                "root": {
                    "feature": "userTextTitleMatch",
                    "threshold": 0.5,
                    "left" : {
                        "value" : -100
                    },
                    "right": {
                        "value" : 10}
 }
 },
 {
 "weight" : 2,
 "root": {
 "value" : -10
 }
 }
 ]
 }
 }

Deploy Model

As we saw for the features definition, deploying the model is quite straightforward :

curl -XPUT 'http://localhost:8983/solr/collection1/schema/model-store' --data-binary @/path/model.json -H 'Content-type:application/json' 
 

View Model

The model will be stored in an easily accessible json store:

curl -XGET 'http://localhost:8983/solr/collection1/schema/model-store'
 

Rerank query

To rerank your search results using a machine learned LTR model it is required to call the rerank component using the Apache Solr Learning To Rank ( LTR ) query parser.

Query Re-Ranking allows you to run an initial query(A) for matching documents and then re-rank the top N documents re-scoring them based on a second query (B).
Since the more costly ranking from query B is only applied to the top N documents it will have less impact on performance then just using the complex query B by itself – the trade off is that documents which score very low using the simple query A may not be considered during the re-ranking phase, even if they would score very highly using query B.  Solr Wiki

The Apache Solr Learning To Rank ( LTR ) integration defines an additional query parser that can be used to define the rerank strategy.
In particular, when rescoring a document in the search results :

  • Features are extracted from the document
  • Score is calculated evaluating the model against the extracted feature vector
  • Final search results are reranked according to the new score
rq={!ltr model=myModelName reRankDocs=25}

!ltr – will use the ltr query parser
model=myModelName – specifies which model in the model-store to use to score the documents
reRankDocs=25 – specifies that only the top 25 search results from the original ranking, will be scored and reranked

When passing external feature information (EFI) that will be used to extract the feature vector, the syntax is pretty similar :

rq={!ltr reRankDocs=3 model=externalmodel efi.parameter1=’value1′ efi.parameter2=’value2′}

e.g.

rq={!ltr reRankDocs=3 model=externalModel efi.user_input_query=’Casablanca’ efi.user_from_mobile=1}

Sharding

When using sharding, each shard will rerank, so the reRankDocs will be considered per shard.

e.g.
10 shards
You run distributed query with :
rq={!ltr reRankDocs=10 …
You will get a total of 100 documents re-ranked .

Pagination

Pagination is delicate[4].

Let’s explore the scenario on a single Solr node and on a sharded architecture.

 

Single Solr node

reRankDocs=15
rows=10

This means each page is composed by 10 results.
What happens when we hit the page 2 ?
The first 5 documents in the search results will have been rescored and affected by the reranking.
The latter 5 documents will preserve the original score and original ranking.

e.g.
Doc 11 – score= 1.2
Doc 12 – score= 1.1
Doc 13 – score= 1.0
Doc 14 – score= 0.9
Doc 15 – score= 0.8
Doc 16 – score= 5.7
Doc 17 – score= 5.6
Doc 18 – score= 5.5
Doc 19 – score= 4.6
Doc 20 – score= 2.4

This means that score(15) could be < score(16), but document 15 and 16 are still in the expected order.
The reason is that the top 15 documents are rescored and reranked and the rest is left unchanged.

Sharded architecture

reRankDocs=15
rows=10
Shards number=2

When looking for the page 2, Solr will trigger queries to she shards to collect 2 pages per shard :
Shard1 : 10 ReRanked docs (page1) + 10 OriginalScored docs (page2)
Shard2 : 10 ReRanked docs (page1) + 10 OriginalScored docs (page2)

The the results will be merged, and possibly, original scored search results can top up reranked docs.
A possible solution could be to normalise the scores to prevent any possibility that a reranked result is surpassed by original scored ones.

Note: The problem is going to happen after you reach rows * page > reRankDocs. In situations when reRankDocs is quite high , the problem will occur only in deep paging.

Feature Extraction And Caching

Extracting the features from the search results document is the most onerous task while reranking using LTR.
The LTRScoringQuery will take care of computing the feature values in the feature vector and then delegate the final score generation to the LTRScoringModel.
For each document the definitions in the feature-store are applied to generate the vector.
The vector can be generate in parallel, leveraging a multi-threaded approach.
Extra care must be taken into account when configuring the number of threads in the game.
The feature vector is currently cached in toto in the QUERY_DOC_FV cache.

This means that given the query and EFIs, we cache the entire feature vector for the document.

Simply giving in input a different efi request parameter will imply a different hashcode for the feature vector and consequentially invalidate the cached one.
This bit can be potentially improved, managing separately caches for the query independent, query dependent and query level features[5].

Solr Is Learning To Rank Better – Part 2 – Model Training

Model Training For Apache Solr Learning To Rank 

If you want to train a model for Apache Solr Learning To Rank , you are in the right place.
This blog post is about the model training phase for the Apache Solr Learning To Rank integration.

We modelled our dataset, we collected the data and refined it in Part 1 .
We have now a shiny lot-of-rows training set ready to be used to train the mathematical model that will re-rank the resulting documents coming out from our queries.
The very first activity to carry on is to decide the model that fits best our requirements.
I will focus in this blog post on two type of models, the ones currently supported by the Apache Solr Learning To Rank integration [1] .

Ranking SVM

Ranking SVM is a linear model based on Support Vector Machines.

The Ranking SVM algorithm is a pair-wise ranking method that adaptively sorts results based on how ‘relevant’  they are for a specific query ( we saw example of relevancy rating for the pair document-query in Part 1).

The Ranking SVM function maps each search query to the features of each sample.
This mapping function projects each data pair onto a feature space.

Each sample ( query-document ) of the training data will be used for the Ranking SVM algorithm to refine the mapping.

Generally, Ranking SVM includes three steps at training time:

  • It maps the similarities between queries and the clicked results onto a certain feature space.
  • It calculates the distances between any two of the vectors obtained in step 1.
  • It forms an optimization problem which is similar to a standard SVM classification and solves this problem with the regular SVM solver
Given the list of the features that describe our problem, an SVM model will assign a numerical weight to each of them, the resulting function will assign a score to a document given in input the feature vector that describes it.
Let’s see an example SVM Model in the Json format expected by the Solr plugin :
e.g.
{
    "class":"org.apache.solr.ltr.ranking.RankSVMModel",
    "name":"myModelName",
    "features":[
        { "name": "userTextTitleMatch"},
        { "name": "originalScore"},
        { "name": "isBook"}
    ],
    "params":{
        "weights": {
            "userTextTitleMatch": 1.0,
            "originalScore": 0.5,
            "isBook": 0.1
        }

}
}

Given 3 example features :
userTextTitleMatch – (query dependant feature) – (binary)
originalScore –  (query dependant feature) – (ordinal)
isBook – (document dependant feature) – (binary)

The ranking SVM model in the example builds a linear function of the variables assigning a different weight to each of them.

Given the documents (expressed with their feature vector):
D1 [1.0, 100, 1]
D2 [0.0, 80, 1]

Applying the model we get the scores :

Score(D1) = 1.0 * userTextTitleMatch(1.0) + 0.5 * originalScore(100) + 0.1 * isBook(1.0) = 51.1
Score(D2) = 1.0 * userTextTitleMatch(0.0) + 0.5 * originalScore(80) + 0.1 * isBook(1.0) = 40.1
The D1 documents is more relevant according the model.

Key points :

  • easy to debug and understand
  • linear function

Here you can find some good Open Source libraries to train SVM models [2].

LambdaMART

LambdaMART is a tree ensemble based model.
Each tree of the ensemble is a weighted regression tree and the final predicted score is the weighted sum of the prediction of each regression tree.
A regression tree is a decision tree that takes in input a feature vector and returns a scalar numerical value in output.
At a high level, LambdaMART is an algorithm that uses gradient boosting to directly optimize Learning to Rank specific cost functions such as NDCG.

To understand LambdaMART let’s explore the main two aspects: Lambda and MART.

MART
LambdaMART is a specific instance of Gradient Boosted Regression Trees, also referred to as Multiple Additive Regression Trees (MART).
Gradient Boosting is a technique for forming a model that is a weighted combination of an ensemble of “weak learners”. In our case, each “weak learner” is a decision tree.
Lambda
At each training point we need to provide a gradient that will allow us to minimize the cost function ( whatever we selected, NDCG for example).
To solve the problem LambdaMART uses an idea coming from lambdaRank:
at each certain point we calculate a value that acts as the gradient we require, this component will effectively modify the ranking, pushing up or down groups of documents and affecting effectively the rules for the leaf values, which cover the point used to calculate lambda.
For additional details these resources has been useful [3] ,[4] .
LambdaMART is currently considered one of the most effective model and it has been proved by a number of winning contests.

Let’s see how a lambdaMART model looks like :

{
    "class":"org.apache.solr.ltr.ranking.LambdaMARTModel",
    "name":"lambdamartmodel",
    "features":[
        { "name": "userTextTitleMatch"},
        { "name": "originalScore"}
    ],
    "params":{
        "trees": [
            {
                "weight" : 1,
                "root": {
                    "feature": "userTextTitleMatch",
                    "threshold": 0.5,
                    "left" : {
                        "value" : -100
                    },
                    "right": {
                        "feature" : "originalScore",
                        "threshold": 10.0,
                        "left" : {
                            "value" : 50
                        },
                        "right" : {
                            "value" : 75
                        }
                    }
                }
            },
            {
                "weight" : 2,
                "root": {
                    "value" : -10
                }
            }
        ]
    }
}

This example model is composed by two trees, each branch split evaluates a conditional expression on a feature value, if the feature value is  :
<=  threshold we visit the left branch
>  threshold we visit the right branch

Reaching the leaf of a tree will produce a score, this will be weighted according to the tree weight and then summed to the other scores produced ( the model is an ensemble of trees).
Given the documents :

D1 [1, 9]
D2 [0, 10]

Applying the model we get the scores :

Score(D1) = 1* (userTextTitleMatch (1) > 0.5 go right , originalScore (9) < 10 = 50) +
2 * -10 = 30

Score(D2) = 1 *(userTextTitleMatch(0) <= 0.5 = -100) +
2 * -10 = -120

The D1 document is more relevant according the model.

Key points :

  • ensemble of trees are effective
  • difficult to debug
  • non linear function

There are good open source libraries to train LambdaMART models, the one that we are going to use in this case of study is RankLib [5]

Evaluation Metric

Important phase of the journey is to validate the model :
How good is our model in re-ranking our dataset ?
How good is the model in re-ranking unknown data  ?
It is really important to carefully select the evaluation metric that best suites your algorithm and your needs.
Evaluating the model will be vital for the validation phase during the training, for testing the model against new data and to consistently being able to assess the predicting quality of the model trained.
N.B. this is particularly important: having a good understanding of the evaluation metric for the algorithm selected, can help a lot in tuning,  improving the model and finding anomalies in the training data.

NDCG@K

Normalised Discounted Cumulative Gain @k is a natural fit when choosing lambdaMART.
This metric evaluates the performance of a ranking model, it varies from 0.0 to 1.0, with 1.0
representing the ideal ranking model.
It takes K as parameter :
K : maximum number of documents that will be returned .

K specifies, for each query,  after the re-ranking, the top K results to evaluate.
N.B. set K to be the number of top documents you want to optimise.
Generally it is the number of documents you show in the first page of results.

Given in input the test set, this is grouped by queryId and for each query the ideal ranking (obtained sorting the documents by descendant relevancy) is compared with the ranking generated by the ranking model.
An average for all the queryIds is calculated.
Detailed explanation can be found in Wikipedia [6].

Knowing how it works, 2 factors sound immediately important when reasoning about NDCG :
1) distribution of relevancy scores in the test set, per queryId
2) number of samples per queryId

War Story 1 : too many highly relevant samples -> high NDCG
Given the test set A :
5 qid:1 1:1 …
5 qid:1 1:1 …
4 qid:1 1:1 …
5 qid:2 1:1 …
4 qid:2 1:1 …
4 qid:2 1:1 …

Even if the ranking model does not a good job, the high distribution of relevant samples increase the probability of hitting high score in the metric.

War Story 2 : small RankLists -> high NDCG
Having few samples (<K) per queryId means that we will not evaluate the performance of our ranking model with accuracy.

In the edge case of 1 sample per queryId, our NDCG@10 will be perfect, but actually this reflect simply a really poor training/test set.

N.B. be careful on how you generate your queryId as you can end up in this edge case and have a wrong perception of the quality of your ranking model.

Model Training

Focus of this section will be how to train a LambdaMART model using RankLib[5],
Let’s see an example and analyse all the parameters:
java  -jar RankLib-2.7.jar 
train /var/trainingSets/trainingSet_2016-07-20.txt 
-feature /var/ltr/trainingSets/trainingSet_2016-07-20_features.txt 
-ranker
-leaf 75
mls 5000 
-metric2t NDCG@20 
kcv 5 
-tvs 0.8 
-kcvmd /var/models 
-kcvmn model-name.xml 
-sparse 
Parameter Description
train
Path to the training set file
feature Feature description file, list feature Ids to be considered by the learner, each on a separate line. If not specified, all features will be used.
ranker Specify which ranking algorithm to use e.g. 6: LambdaMART
metric2t Metric to optimize on the training data. Supported: MAP, NDCG@k, DCG@k, P@k, RR@k, ERR@k (default=ERR@10)
tvs
Set train-validation split to be (x)(1.0-x).
x * (size of the training set) will be used for training
(1.0 -x) * (size of the training set) will be used for validation
kcv
Stands for k Cross Validation
Specify if you want to perform k-fold cross validation using ONLY the specified training data (default=NoCV).
This means we split the training data in k subsets and we run k training executions.
In each execution 1 subset will be used as the test set and k-1 subsets will be used for training.
N.B. in the case that tvs and kcv are both defined, first we split for the cross validation, than the training set produced will be split in training/validation.
e.g.
Initial Training Set size : 100 rows

-kcv 5 
-tvs 0.8 
Test Set : 20 rows
Training Set :  64 ( 0.8 * 80)
Validation Set : 16 (0.2 * 80)
kcvmd
Directory where to save models trained via cross-validation (default=not-save).
kcvmn Name for model learned in each fold. It will be prefix-ed with the fold-number (default=empty).
sparse Allow sparse data processing for the training set ( which means that you don’t need to specify all the features for each sample where the feature has no value)
tree
Number of trees of the ensemble(default=1000).
More complex is the learning problem, more trees we need, but it’s important to be careful and not overfit the training data
leaf
Number of leaves for each tree (default=100).
As the number of trees, it is important to tune carefully this value to avoid overfitting trees.
shrinkage This is the learning rate of the algorithm(default=0.1)
If this is too aggressive the Ranking model will quickly fit the training set, but will not react properly to the validation set evaluation, which means an overall poorer model.
tc
Number of threshold candidates for tree spliting.
-1 to use all feature values (default=256)
Increasing this value we increase the complexity and possible overfitting of the model.
It is suggested to start with a low value and then increase it according the general cardinality of your features.
mls
Min leaf support – minimum #samples each leaf has to contain (default=1).
This is quite an important parameter if we want to take care of outliers.
We can tune this parameter to include only leaves with an high number of samples, discarding pattern validated by a weak support.

Tuning the model training parameters is not an easy task.
Apart the general suggestions, trial & error with a careful comparison of the evaluation metric score is the path to go.
Comparing different models on a common Test Set can help when we have produced a number of models with different tuning configuration.

Converting Model To Json

We finally produced a model which is performing quite well on our Test set.
Cool, it’s the time to push it to Solr and see the magic in action.
But before we can do that, we need to convert the model generated by the Machine Learning libraries into the Json format Solr expects ( you remember the section about the models supported ? )
Taking as example a LambdaMART model, Ranklib will produce an xml model.
So you will need to parse it and convert it to the Json format.
Would be interesting to implement directly in RankLib the possibility of selecting in output the Json format expected by Solr.
Any contribution is welcome [7]!
In the next part we’ll see how to visualise and understand better the model generated.
This activity can be vital to debug the model, see the most popular features, find out some anomaly in the training set and actually assess the validity of the model itself.