
Model Training For Apache Solr Learning To Rank 

If you want to train a model for Apache Solr Learning To Rank, you are in the right place.
This blog post is about the model training phase for the Apache Solr Learning To Rank integration.

We modelled our dataset, collected the data and refined it in Part 1.
We now have a large training set, ready to be used to train the mathematical model that will re-rank the documents returned by our queries.
The very first activity is to decide which model best fits our requirements.

In this blog post I will focus on the two types of models currently supported by the Apache Solr Learning To Rank integration [1].

Ranking SVM

Ranking SVM is a linear model based on Support Vector Machines.

The Ranking SVM algorithm is a pair-wise ranking method that adaptively sorts results based on how ‘relevant’ they are for a specific query (we saw examples of relevancy ratings for query-document pairs in Part 1).

The Ranking SVM function maps each search query to the features of each sample.
This mapping function projects each query-document pair onto a feature space.

Each sample (query-document pair) of the training data is used by the Ranking SVM algorithm to refine the mapping.

Generally, Ranking SVM includes three steps at training time:

    • It maps the similarities between queries and the clicked results onto a certain feature space.
    • It calculates the distances between any two of the vectors obtained in step 1.
    • It forms an optimization problem similar to a standard SVM classification and solves it with a regular SVM solver (a minimal sketch of this pairwise construction follows below).
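
Just to make the pairwise idea concrete, here is a minimal Python sketch of the three steps, using scikit-learn’s LinearSVC as the “regular SVM solver”. The queries, relevance labels and feature values are invented for illustration, and this is not the exact algorithm of any specific Learning To Rank library:

import numpy as np
from itertools import combinations
from sklearn.svm import LinearSVC

# Toy training set: one row per query-document sample (invented values).
# Each entry: (queryId, relevance label, feature vector).
samples = [
    ("q1", 3, [1.0, 100.0, 1.0]),
    ("q1", 1, [0.0,  80.0, 1.0]),
    ("q1", 0, [0.0,  20.0, 0.0]),
    ("q2", 2, [1.0,  50.0, 0.0]),
    ("q2", 0, [0.0,  60.0, 1.0]),
]

# Steps 1 and 2: within each query, build difference vectors for every pair
# of documents with different relevance labels (the pairwise transform).
X_pairs, y_pairs = [], []
for qid in {s[0] for s in samples}:
    group = [s for s in samples if s[0] == qid]
    for (_, rel_a, feat_a), (_, rel_b, feat_b) in combinations(group, 2):
        if rel_a == rel_b:
            continue
        diff = np.array(feat_a) - np.array(feat_b)
        label = 1 if rel_a > rel_b else -1
        # Add both orientations of the pair so the two classes stay balanced.
        X_pairs.extend([diff, -diff])
        y_pairs.extend([label, -label])

# Step 3: solve a standard linear SVM classification problem on the pairs.
svm = LinearSVC(C=1.0, fit_intercept=False)
svm.fit(np.array(X_pairs), np.array(y_pairs))

# The learned coefficients are the per-feature weights of the ranking function.
print(dict(zip(["userTextTitleMatch", "originalScore", "isBook"], svm.coef_[0])))
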
Given the list of features that describe our problem, an SVM model will assign a numerical weight to each of them; the resulting function will assign a score to a document, given as input the feature vector that describes it.
Let’s see an example SVM model in the JSON format expected by the Solr plugin:
{
    "class":"org.apache.solr.ltr.ranking.RankSVMModel",
    "name":"myModelName",
    "features":[
        { "name": "userTextTitleMatch"},
        { "name": "originalScore"},
        { "name": "isBook"}
    ],
    "params":{
        "weights": {
            "userTextTitleMatch": 1.0,
            "originalScore": 0.5,
            "isBook": 0.1
        }

Given 3 example features:
userTextTitleMatch – (query dependent feature) – (binary)
originalScore – (query dependent feature) – (ordinal)
isBook – (document dependent feature) – (binary)

The Ranking SVM model in the example builds a linear function of the features, assigning a different weight to each of them.

Given the documents (expressed with their feature vector):
D1 [1.0, 100, 1]
D2 [0.0, 80, 1]

Applying the model we get the scores :

Score(D1) = 1.0 * userTextTitleMatch(1.0) + 0.5 * originalScore(100) + 0.1 * isBook(1.0) = 51.1
Score(D2) = 1.0 * userTextTitleMatch(0.0) + 0.5 * originalScore(80) + 0.1 * isBook(1.0) = 40.1
The document D1 is more relevant according to the model.
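
As a sanity check, here is a tiny Python sketch that reproduces the arithmetic above (the weights and feature vectors simply mirror the example model):

# Weights taken from the example model above.
weights = {"userTextTitleMatch": 1.0, "originalScore": 0.5, "isBook": 0.1}

# Documents expressed as feature vectors.
documents = {
    "D1": {"userTextTitleMatch": 1.0, "originalScore": 100.0, "isBook": 1.0},
    "D2": {"userTextTitleMatch": 0.0, "originalScore": 80.0,  "isBook": 1.0},
}

def linear_score(features):
    # The model score is simply the weighted sum of the feature values.
    return sum(weights[name] * value for name, value in features.items())

for doc_id, features in documents.items():
    print(doc_id, linear_score(features))   # D1 -> 51.1, D2 -> 40.1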

Key points :

    • easy to debug and understand
    • linear function

Here you can find some good Open Source libraries to train SVM models [2], [2.1].

LambdaMART

LambdaMART is a tree ensemble based model.
Each tree of the ensemble is a weighted regression tree and the final predicted score is the weighted sum of the prediction of each regression tree.
A regression tree is a decision tree that takes a feature vector as input and returns a scalar numerical value as output.
At a high level, LambdaMART is an algorithm that uses gradient boosting to directly optimize Learning to Rank specific cost functions such as NDCG.

To understand LambdaMART let’s explore the main two aspects: Lambda and MART.

MART
LambdaMART is a specific instance of Gradient Boosted Regression Trees, also referred to as Multiple Additive Regression Trees (MART).
Gradient Boosting is a technique for forming a model that is a weighted combination of an ensemble of “weak learners”. In our case, each “weak learner” is a decision tree.
Lambda
At each training point we need to provide a gradient that will allow us to optimise the cost function we selected (NDCG, for example).
To solve the problem LambdaMART uses an idea coming from LambdaRank:
at each training point we calculate a value (the lambda) that acts as the gradient we require; these lambdas effectively modify the ranking, pushing groups of documents up or down, and they in turn determine the values assigned to the leaves that cover the points used to calculate them.
For additional details, these resources have been useful [3][4].
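
To give a more concrete feel for the lambdas, here is a rough Python sketch of the pairwise lambda in the usual LambdaRank formulation. Sigma and the toy scores/relevance labels are invented, and real implementations accumulate these contributions per document (and use second-order information when setting the leaf values), so treat this purely as an illustration:

import math

def delta_ndcg(relevances, i, j, ideal_dcg):
    # |change in NDCG| obtained by swapping the documents at 0-based positions i and j.
    gain = lambda rel: 2 ** rel - 1
    discount = lambda pos: 1.0 / math.log2(pos + 2)
    before = gain(relevances[i]) * discount(i) + gain(relevances[j]) * discount(j)
    after  = gain(relevances[j]) * discount(i) + gain(relevances[i]) * discount(j)
    return abs(before - after) / ideal_dcg

def pairwise_lambda(score_i, score_j, dndcg, sigma=1.0):
    # Lambda for a pair where document i is more relevant than document j:
    # its magnitude grows when the pair is ranked in the wrong order and is
    # scaled by how much NDCG would change if the two documents were swapped.
    return -sigma * dndcg / (1.0 + math.exp(sigma * (score_i - score_j)))

# Toy example for a single query: current model scores and graded relevance,
# documents listed in their current ranking order (invented values).
scores     = [2.5, 1.0, 0.3]
relevances = [0, 2, 1]
ideal_dcg  = sum((2 ** r - 1) / math.log2(p + 2)
                 for p, r in enumerate(sorted(relevances, reverse=True)))

# The document at position 1 is more relevant than the one at position 0 but
# is ranked below it, so the lambda for this pair is relatively large.
print(pairwise_lambda(scores[1], scores[0], delta_ndcg(relevances, 1, 0, ideal_dcg)))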

LambdaMART is currently considered one of the most effective models, as demonstrated by a number of contest wins.

Let’s see what a LambdaMART model looks like:

{
    "class":"org.apache.solr.ltr.ranking.LambdaMARTModel",
    "name":"lambdamartmodel",
    "features":[
        { "name": "userTextTitleMatch"},
        { "name": "originalScore"}
    ],
    "params":{
        "trees": [
            {
                "weight" : 1,
                "root": {
                    "feature": "userTextTitleMatch",
                    "threshold": 0.5,
                    "left" : {
                        "value" : -100
                    },
                    "right": {
                        "feature" : "originalScore",
                        "threshold": 10.0,
                        "left" : {
                            "value" : 50
                        },
                        "right" : {
                            "value" : 75
                        }
                    }
                }
            },
            {
                "weight" : 2,
                "root": {
                    "value" : -10
                }
            }
        ]
    }
}

This example model is composed of two trees. Each branch split evaluates a conditional expression on a feature value. If the feature value is:
<= threshold we visit the left branch
> threshold we visit the right branch

Reaching a leaf of a tree produces a score; this score is weighted according to the tree weight and then summed with the scores produced by the other trees (the model is an ensemble of trees).
Given the documents :

D1 [1, 9]
D2 [0, 10]

Applying the model we get the scores :

Score(D1) = 1 * (userTextTitleMatch(1) > 0.5, go right; originalScore(9) <= 10, go left = 50) + 2 * (-10) = 30

Score(D2) = 1 * (userTextTitleMatch(0) <= 0.5, go left = -100) + 2 * (-10) = -120

The document D1 is more relevant according to the model.
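
The scoring logic just described fits in a few lines of Python; the nested dictionaries below simply mirror the example JSON model:

# The two trees of the example model, expressed as nested dictionaries
# that mirror the JSON above.
trees = [
    {"weight": 1, "root": {"feature": "userTextTitleMatch", "threshold": 0.5,
                           "left": {"value": -100},
                           "right": {"feature": "originalScore", "threshold": 10.0,
                                     "left": {"value": 50},
                                     "right": {"value": 75}}}},
    {"weight": 2, "root": {"value": -10}},
]

def score_tree(node, features):
    # A leaf simply returns its value.
    if "value" in node:
        return node["value"]
    # A branch compares the feature with the threshold:
    # <= threshold -> left subtree, > threshold -> right subtree.
    branch = "left" if features[node["feature"]] <= node["threshold"] else "right"
    return score_tree(node[branch], features)

def score_ensemble(trees, features):
    # The final score is the weighted sum of the individual tree scores.
    return sum(t["weight"] * score_tree(t["root"], features) for t in trees)

d1 = {"userTextTitleMatch": 1, "originalScore": 9}
d2 = {"userTextTitleMatch": 0, "originalScore": 10}
print(score_ensemble(trees, d1))   # 30
print(score_ensemble(trees, d2))   # -120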

Key points :

    • ensembles of trees are effective
    • difficult to debug
    • non-linear function

There are good open source libraries to train LambdaMART models; the one we are going to use in this case study is RankLib [5].

Evaluation Metric

An important phase of the journey is validating the model:
How good is our model at re-ranking our dataset?
How good is the model at re-ranking unknown data?
It is really important to carefully select the evaluation metric that best suits your algorithm and your needs.
Evaluating the model is vital for the validation phase during training, for testing the model against new data and for consistently assessing the predictive quality of the trained model.

N.B. this is particularly important: having a good understanding of the evaluation metric for the selected algorithm can help a lot in tuning and improving the model and in finding anomalies in the training data.

NDCG@K

Normalised Discounted Cumulative Gain @K is a natural fit when choosing LambdaMART.
This metric evaluates the performance of a ranking model; it varies from 0.0 to 1.0, with 1.0
representing the ideal ranking model.
It takes K as a parameter:
K : the number of top documents to take into account when computing the metric.

K specifies, for each query, the top K results to evaluate after the re-ranking.
N.B. set K to the number of top documents you want to optimise.
Generally it is the number of documents you show on the first page of results.

Given the test set as input, it is grouped by queryId and, for each query, the ideal ranking (obtained by sorting the documents by descending relevancy) is compared with the ranking generated by the ranking model.
The average over all the queryIds is then calculated.
A detailed explanation can be found on Wikipedia [6].
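
For reference, here is a compact Python sketch of the computation described above; it uses the exponential gain 2^rel - 1 (one common variant) and invented relevance lists:

import math

def dcg_at_k(relevances, k):
    # Discounted Cumulative Gain over the top k positions.
    return sum((2 ** rel - 1) / math.log2(pos + 2)
               for pos, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances_in_model_order, k):
    # Compare the model ranking with the ideal ranking (descending relevance).
    ideal_dcg = dcg_at_k(sorted(relevances_in_model_order, reverse=True), k)
    return dcg_at_k(relevances_in_model_order, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Relevance labels of the documents for each queryId,
# listed in the order produced by the ranking model.
ranked_relevance_per_query = {
    "q1": [3, 2, 3, 0, 1],
    "q2": [1, 3, 2],
}

# NDCG@K of the whole test set: the average over the queryIds.
k = 10
per_query = [ndcg_at_k(rels, k) for rels in ranked_relevance_per_query.values()]
print(sum(per_query) / len(per_query))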

Knowing how it works, two factors immediately stand out when reasoning about NDCG:
1) the distribution of relevancy scores in the test set, per queryId
2) the number of samples per queryId

War Story 1 : too many highly relevant samples -> high NDCG
Given the test set A :
5 qid:1 1:1 …
5 qid:1 1:1 …
4 qid:1 1:1 …
5 qid:2 1:1 …
4 qid:2 1:1 …
4 qid:2 1:1 …

Even if the ranking model does not do a good job, the high proportion of relevant samples increases the probability of reaching a high metric score.

War Story 2: small RankLists -> high NDCG
Having few samples (<K) per queryId means that we will not evaluate the performance of our ranking model with accuracy.

In the edge case of 1 sample per queryId, our NDCG@10 will be perfect, but this simply reflects a really poor training/test set.

N.B. be careful about how you generate your queryIds, as you can end up in this edge case and get a wrong perception of the quality of your ranking model.
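
Using the ndcg_at_k sketch above, the edge case is easy to reproduce (assuming each single sample has a non-zero relevance label):

# Each 'query' has exactly one sample: the model ranking and the ideal
# ranking are trivially identical, so every per-query NDCG@10 is 1.0.
single_sample_queries = {"q1": [1], "q2": [4], "q3": [2]}
scores = [ndcg_at_k(rels, 10) for rels in single_sample_queries.values()]
print(sum(scores) / len(scores))   # 1.0, even though the model ranked nothing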

Model Training

The focus of this section is how to train a LambdaMART model using RankLib [5].
Let’s see an example and analyse all the parameters:

java -jar RankLib-2.7.jar
-train /var/trainingSets/trainingSet_2016-07-20.txt
-feature /var/ltr/trainingSets/trainingSet_2016-07-20_features.txt
-ranker 6
-leaf 75
-mls 5000
-metric2t NDCG@20
-kcv 5
-tvs 0.8
-kcvmd /var/models
-kcvmn model-name.xml
-sparse

-train : Path to the training set file.

-feature : Feature description file; lists the feature ids to be considered by the learner, one per line. If not specified, all features will be used.

-ranker : Specify which ranking algorithm to use, e.g. 6: LambdaMART.

-metric2t : Metric to optimise on the training data. Supported: MAP, NDCG@k, DCG@k, P@k, RR@k, ERR@k (default=ERR@10).

-tvs : Set the train-validation split to be (x)(1.0-x):
x * (size of the training set) will be used for training,
(1.0-x) * (size of the training set) will be used for validation.

-kcv : Stands for k-fold Cross Validation. Specify if you want to perform k-fold cross validation using ONLY the specified training data (default=NoCV).
This means we split the training data into k subsets and run k training executions.
In each execution 1 subset is used as the test set and k-1 subsets are used for training.
N.B. if tvs and kcv are both defined, first we split for the cross validation, then the training set produced is split into training/validation, e.g. with an initial training set of 100 rows and -kcv 5 -tvs 0.8 :
Test Set : 20 rows
Training Set : 64 rows (0.8 * 80)
Validation Set : 16 rows (0.2 * 80)

-kcvmd : Directory where to save the models trained via cross-validation (default=not-save).

-kcvmn : Name for the model learned in each fold; it will be prefixed with the fold number (default=empty).

-sparse : Allow sparse data processing for the training set (which means you don’t need to specify all the features for each sample when a feature has no value).

-tree : Number of trees in the ensemble (default=1000).
The more complex the learning problem, the more trees we need, but it is important to be careful not to overfit the training data.

-leaf : Number of leaves for each tree (default=100).
As with the number of trees, it is important to tune this value carefully to avoid overfitting.

-shrinkage : The learning rate of the algorithm (default=0.1).
If this is too aggressive, the ranking model will quickly fit the training set but will not react properly to the validation set evaluation, which means an overall poorer model.

-tc : Number of threshold candidates for tree splitting; -1 to use all feature values (default=256).
Increasing this value increases the complexity and the risk of overfitting the model. It is suggested to start with a low value and then increase it according to the general cardinality of your features.

-mls : Min leaf support, the minimum number of samples each leaf has to contain (default=1).
This is quite an important parameter if we want to take care of outliers: we can tune it to keep only leaves supported by a high number of samples, discarding patterns backed by weak support.

Tuning the model training parameters is not an easy task.
Apart from the general suggestions, trial & error with a careful comparison of the evaluation metric scores is the way to go.
Comparing different models on a common Test Set can help when we have produced a number of models with different tuning configurations.

Converting Model To Json

We finally produced a model which is performing quite well on our Test set.
Cool, it’s time to push it to Solr and see the magic in action.
But before we can do that, we need to convert the model generated by the machine learning library into the JSON format Solr expects (remember the section about the supported models?).

Taking a LambdaMART model as an example, RankLib will produce an XML model,
so you will need to parse it and convert it to the JSON format.
It would be interesting to implement in RankLib the possibility of directly outputting the JSON format expected by Solr.
Any contribution is welcome: a RankLib extension for JSON models!
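
As a rough starting point, here is a Python sketch of such a conversion. It assumes that the RankLib LambdaMART model file consists of "##" comment header lines followed by an <ensemble> of <tree> elements with nested <split>/<feature>/<threshold>/<output> elements (please verify this against the file produced by your RankLib version), and the feature-id-to-name mapping is a hypothetical example you would replace with your own:

import json
import xml.etree.ElementTree as ET

# Hypothetical mapping from the RankLib feature ids to the Solr feature names.
FEATURE_NAMES = {1: "userTextTitleMatch", 2: "originalScore"}

def split_to_node(split):
    # A <split> with an <output> child is a leaf; otherwise it is a branch
    # with a feature, a threshold and a left/right child split.
    output = split.find("output")
    if output is not None:
        return {"value": float(output.text)}
    children = {child.get("pos"): child for child in split.findall("split")}
    return {
        "feature": FEATURE_NAMES[int(split.find("feature").text)],
        "threshold": float(split.find("threshold").text),
        "left": split_to_node(children["left"]),
        "right": split_to_node(children["right"]),
    }

def ranklib_to_solr_json(model_path, model_name):
    with open(model_path) as model_file:
        # Drop the "##" comment header lines and keep the XML ensemble.
        xml_text = "".join(line for line in model_file if not line.startswith("##"))
    ensemble = ET.fromstring(xml_text)
    trees = [{"weight": float(tree.get("weight")),
              "root": split_to_node(tree.find("split"))}
             for tree in ensemble.findall("tree")]
    return json.dumps({
        "class": "org.apache.solr.ltr.ranking.LambdaMARTModel",
        "name": model_name,
        "features": [{"name": name} for name in FEATURE_NAMES.values()],
        "params": {"trees": trees}}, indent=4)

# Example usage (hypothetical path):
# print(ranklib_to_solr_json("/var/models/model-name.xml", "lambdamartmodel"))
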
In the next part, we’ll see how to visualise and understand better the model generated.
This activity can be vital to debug the model, see the most popular features, find out some anomalies in the training set and actually assess the validity of the model itself.

