
Model Training For Apache Solr Learning To Rank 

If you want to train a model for Apache Solr Learning To Rank, you are in the right place.
This blog post is about the model training phase for the Apache Solr Learning To Rank integration.

We modelled our dataset, collected the data and refined it in Part 1.
We now have a large training set, ready to be used to train the mathematical model that will re-rank the documents returned by our queries.
The very first activity is to decide which model best fits our requirements.

In this blog post I will focus on the two types of models currently supported by the Apache Solr Learning To Rank integration [1].

Ranking SVM

Ranking SVM is a linear model based on Support Vector Machines.

The Ranking SVM algorithm is a pair-wise ranking method that adaptively sorts results based on how ‘relevant’ they are for a specific query (we saw examples of relevancy ratings for query-document pairs in Part 1).

The Ranking SVM function maps each search query to the features of each sample.
This mapping function projects each query-document pair onto a feature space.

Each sample (query-document pair) of the training data is used by the Ranking SVM algorithm to refine the mapping.

Generally, Ranking SVM includes three steps at training time:

    • It maps the similarities between queries and the clicked results onto a certain feature space.
    • It calculates the distances between any two of the vectors obtained in step 1.
    • It forms an optimization problem similar to a standard SVM classification and solves it with a regular SVM solver (a minimal sketch of this pairwise construction follows below).
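
Just to make the pairwise idea concrete, here is a minimal Python sketch of the three steps, using scikit-learn’s LinearSVC as the “regular SVM solver”. The queries, relevance labels and feature values are invented for illustration, and this is not the exact algorithm of any specific Learning To Rank library:

import numpy as np
from itertools import combinations
from sklearn.svm import LinearSVC

# Toy training set: one row per query-document sample (invented values).
# Each entry: (queryId, relevance label, feature vector).
samples = [
    ("q1", 3, [1.0, 100.0, 1.0]),
    ("q1", 1, [0.0,  80.0, 1.0]),
    ("q1", 0, [0.0,  20.0, 0.0]),
    ("q2", 2, [1.0,  50.0, 0.0]),
    ("q2", 0, [0.0,  60.0, 1.0]),
]

# Steps 1 and 2: within each query, build difference vectors for every pair
# of documents with different relevance labels (the pairwise transform).
X_pairs, y_pairs = [], []
for qid in {s[0] for s in samples}:
    group = [s for s in samples if s[0] == qid]
    for (_, rel_a, feat_a), (_, rel_b, feat_b) in combinations(group, 2):
        if rel_a == rel_b:
            continue
        diff = np.array(feat_a) - np.array(feat_b)
        label = 1 if rel_a > rel_b else -1
        # Add both orientations of the pair so the two classes stay balanced.
        X_pairs.extend([diff, -diff])
        y_pairs.extend([label, -label])

# Step 3: solve a standard linear SVM classification problem on the pairs.
svm = LinearSVC(C=1.0, fit_intercept=False)
svm.fit(np.array(X_pairs), np.array(y_pairs))

# The learned coefficients are the per-feature weights of the ranking function.
print(dict(zip(["userTextTitleMatch", "originalScore", "isBook"], svm.coef_[0])))
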
Given the list of features that describe our problem, an SVM model will assign a numerical weight to each of them; the resulting function will assign a score to a document, given as input the feature vector that describes it.
Let’s see an example SVM model in the JSON format expected by the Solr plugin:
{
    "class":"org.apache.solr.ltr.ranking.RankSVMModel",
    "name":"myModelName",
    "features":[
        { "name": "userTextTitleMatch"},
        { "name": "originalScore"},
        { "name": "isBook"}
    ],
    "params":{
        "weights": {
            "userTextTitleMatch": 1.0,
            "originalScore": 0.5,
            "isBook": 0.1
        }

Given 3 example features:
userTextTitleMatch – (query dependent feature) – (binary)
originalScore – (query dependent feature) – (ordinal)
isBook – (document dependent feature) – (binary)

The Ranking SVM model in the example builds a linear function of the features, assigning a different weight to each of them.

Given the documents (expressed with their feature vector):
D1 [1.0, 100, 1]
D2 [0.0, 80, 1]

Applying the model we get the scores :

Score(D1) = 1.0 * userTextTitleMatch(1.0) + 0.5 * originalScore(100) + 0.1 * isBook(1.0) = 51.1
Score(D2) = 1.0 * userTextTitleMatch(0.0) + 0.5 * originalScore(80) + 0.1 * isBook(1.0) = 40.1
The document D1 is more relevant according to the model.
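
As a sanity check, here is a tiny Python sketch that reproduces the arithmetic above (the weights and feature vectors simply mirror the example model):

# Weights taken from the example model above.
weights = {"userTextTitleMatch": 1.0, "originalScore": 0.5, "isBook": 0.1}

# Documents expressed as feature vectors.
documents = {
    "D1": {"userTextTitleMatch": 1.0, "originalScore": 100.0, "isBook": 1.0},
    "D2": {"userTextTitleMatch": 0.0, "originalScore": 80.0,  "isBook": 1.0},
}

def linear_score(features):
    # The model score is simply the weighted sum of the feature values.
    return sum(weights[name] * value for name, value in features.items())

for doc_id, features in documents.items():
    print(doc_id, linear_score(features))   # D1 -> 51.1, D2 -> 40.1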

Key points :

    • easy to debug and understand
    • linear function

Here you can find some good Open Source libraries to train SVM models [2], [2.1].

LambdaMART

LambdaMART is a tree ensemble based model.
Each tree of the ensemble is a weighted regression tree and the final predicted score is the weighted sum of the prediction of each regression tree.
A regression tree is a decision tree that takes a feature vector as input and returns a scalar numerical value as output.
At a high level, LambdaMART is an algorithm that uses gradient boosting to directly optimize Learning to Rank specific cost functions such as NDCG.

To understand LambdaMART let’s explore the main two aspects: Lambda and MART.

MART
LambdaMART is a specific instance of Gradient Boosted Regression Trees, also referred to as Multiple Additive Regression Trees (MART).
Gradient Boosting is a technique for forming a model that is a weighted combination of an ensemble of “weak learners”. In our case, each “weak learner” is a decision tree.
Lambda
At each training point we need to provide a gradient that will allow us to optimise the cost function we selected (NDCG, for example).
To solve the problem LambdaMART uses an idea coming from LambdaRank:
at each training point we calculate a value (the lambda) that acts as the gradient we require; these lambdas effectively modify the ranking, pushing groups of documents up or down, and they in turn determine the values assigned to the leaves that cover the points used to calculate them.
For additional details, these resources have been useful [3][4].
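
To give a more concrete feel for the lambdas, here is a rough Python sketch of the pairwise lambda in the usual LambdaRank formulation. Sigma and the toy scores/relevance labels are invented, and real implementations accumulate these contributions per document (and use second-order information when setting the leaf values), so treat this purely as an illustration:

import math

def delta_ndcg(relevances, i, j, ideal_dcg):
    # |change in NDCG| obtained by swapping the documents at 0-based positions i and j.
    gain = lambda rel: 2 ** rel - 1
    discount = lambda pos: 1.0 / math.log2(pos + 2)
    before = gain(relevances[i]) * discount(i) + gain(relevances[j]) * discount(j)
    after  = gain(relevances[j]) * discount(i) + gain(relevances[i]) * discount(j)
    return abs(before - after) / ideal_dcg

def pairwise_lambda(score_i, score_j, dndcg, sigma=1.0):
    # Lambda for a pair where document i is more relevant than document j:
    # its magnitude grows when the pair is ranked in the wrong order and is
    # scaled by how much NDCG would change if the two documents were swapped.
    return -sigma * dndcg / (1.0 + math.exp(sigma * (score_i - score_j)))

# Toy example for a single query: current model scores and graded relevance,
# documents listed in their current ranking order (invented values).
scores     = [2.5, 1.0, 0.3]
relevances = [0, 2, 1]
ideal_dcg  = sum((2 ** r - 1) / math.log2(p + 2)
                 for p, r in enumerate(sorted(relevances, reverse=True)))

# The document at position 1 is more relevant than the one at position 0 but
# is ranked below it, so the lambda for this pair is relatively large.
print(pairwise_lambda(scores[1], scores[0], delta_ndcg(relevances, 1, 0, ideal_dcg)))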

LambdaMART is currently considered one of the most effective models, as demonstrated by a number of contest wins.

Let’s see what a LambdaMART model looks like:

{
    "class":"org.apache.solr.ltr.ranking.LambdaMARTModel",
    "name":"lambdamartmodel",
    "features":[
        { "name": "userTextTitleMatch"},
        { "name": "originalScore"}
    ],
    "params":{
        "trees": [
            {
                "weight" : 1,
                "root": {
                    "feature": "userTextTitleMatch",
                    "threshold": 0.5,
                    "left" : {
                        "value" : -100
                    },
                    "right": {
                        "feature" : "originalScore",
                        "threshold": 10.0,
                        "left" : {
                            "value" : 50
                        },
                        "right" : {
                            "value" : 75
                        }
                    }
                }
            },
            {
                "weight" : 2,
                "root": {
                    "value" : -10
                }
            }
        ]
    }
}

This example model is composed of two trees. Each branch split evaluates a conditional expression on a feature value. If the feature value is:
<= threshold we visit the left branch
> threshold we visit the right branch

Reaching a leaf of a tree produces a score; this score is weighted according to the tree weight and then summed with the scores produced by the other trees (the model is an ensemble of trees).
Given the documents :

D1 [1, 9]
D2 [0, 10]

Applying the model we get the scores :

Score(D1) = 1 * (userTextTitleMatch(1) > 0.5, go right; originalScore(9) <= 10, go left = 50) + 2 * (-10) = 30

Score(D2) = 1 * (userTextTitleMatch(0) <= 0.5, go left = -100) + 2 * (-10) = -120

The document D1 is more relevant according to the model.
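
The scoring logic just described fits in a few lines of Python; the nested dictionaries below simply mirror the example JSON model:

# The two trees of the example model, expressed as nested dictionaries
# that mirror the JSON above.
trees = [
    {"weight": 1, "root": {"feature": "userTextTitleMatch", "threshold": 0.5,
                           "left": {"value": -100},
                           "right": {"feature": "originalScore", "threshold": 10.0,
                                     "left": {"value": 50},
                                     "right": {"value": 75}}}},
    {"weight": 2, "root": {"value": -10}},
]

def score_tree(node, features):
    # A leaf simply returns its value.
    if "value" in node:
        return node["value"]
    # A branch compares the feature with the threshold:
    # <= threshold -> left subtree, > threshold -> right subtree.
    branch = "left" if features[node["feature"]] <= node["threshold"] else "right"
    return score_tree(node[branch], features)

def score_ensemble(trees, features):
    # The final score is the weighted sum of the individual tree scores.
    return sum(t["weight"] * score_tree(t["root"], features) for t in trees)

d1 = {"userTextTitleMatch": 1, "originalScore": 9}
d2 = {"userTextTitleMatch": 0, "originalScore": 10}
print(score_ensemble(trees, d1))   # 30
print(score_ensemble(trees, d2))   # -120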

Key points :

    • ensembles of trees are effective
    • difficult to debug
    • non-linear function

There are good open source libraries to train LambdaMART models; the one we are going to use in this case study is RankLib [5].

Evaluation Metric

An important phase of the journey is validating the model:
How good is our model at re-ranking our dataset?
How good is the model at re-ranking unknown data?
It is really important to carefully select the evaluation metric that best suits your algorithm and your needs.
Evaluating the model is vital for the validation phase during training, for testing the model against new data and for consistently assessing the predictive quality of the trained model.

N.B. this is particularly important: having a good understanding of the evaluation metric for the selected algorithm can help a lot in tuning and improving the model and in finding anomalies in the training data.

NDCG@K

Normalised Discounted Cumulative Gain @K is a natural fit when choosing LambdaMART.
This metric evaluates the performance of a ranking model; it varies from 0.0 to 1.0, with 1.0
representing the ideal ranking model.
It takes K as a parameter:
K : the number of top documents to take into account when computing the metric.

K specifies, for each query, the top K results to evaluate after the re-ranking.
N.B. set K to the number of top documents you want to optimise.
Generally it is the number of documents you show on the first page of results.

Given the test set as input, it is grouped by queryId and, for each query, the ideal ranking (obtained by sorting the documents by descending relevancy) is compared with the ranking generated by the ranking model.
The average over all the queryIds is then calculated.
A detailed explanation can be found on Wikipedia [6].
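
For reference, here is a compact Python sketch of the computation described above; it uses the exponential gain 2^rel - 1 (one common variant) and invented relevance lists:

import math

def dcg_at_k(relevances, k):
    # Discounted Cumulative Gain over the top k positions.
    return sum((2 ** rel - 1) / math.log2(pos + 2)
               for pos, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances_in_model_order, k):
    # Compare the model ranking with the ideal ranking (descending relevance).
    ideal_dcg = dcg_at_k(sorted(relevances_in_model_order, reverse=True), k)
    return dcg_at_k(relevances_in_model_order, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Relevance labels of the documents for each queryId,
# listed in the order produced by the ranking model.
ranked_relevance_per_query = {
    "q1": [3, 2, 3, 0, 1],
    "q2": [1, 3, 2],
}

# NDCG@K of the whole test set: the average over the queryIds.
k = 10
per_query = [ndcg_at_k(rels, k) for rels in ranked_relevance_per_query.values()]
print(sum(per_query) / len(per_query))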

Knowing how it works, two factors immediately stand out when reasoning about NDCG:
1) the distribution of relevancy scores in the test set, per queryId
2) the number of samples per queryId

War Story 1 : too many highly relevant samples -> high NDCG
Given the test set A :
5 qid:1 1:1 …
5 qid:1 1:1 …
4 qid:1 1:1 …
5 qid:2 1:1 …
4 qid:2 1:1 …
4 qid:2 1:1 …

Even if the ranking model does not do a good job, the high proportion of relevant samples increases the probability of reaching a high metric score.

War Story 2: small RankLists -> high NDCG
Having few samples (<K) per queryId means that we will not evaluate the performance of our ranking model with accuracy.

In the edge case of 1 sample per queryId, our NDCG@10 will be perfect, but this simply reflects a really poor training/test set.

N.B. be careful about how you generate your queryIds, as you can end up in this edge case and get a wrong perception of the quality of your ranking model.
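
Using the ndcg_at_k sketch above, the edge case is easy to reproduce (assuming each single sample has a non-zero relevance label):

# Each 'query' has exactly one sample: the model ranking and the ideal
# ranking are trivially identical, so every per-query NDCG@10 is 1.0.
single_sample_queries = {"q1": [1], "q2": [4], "q3": [2]}
scores = [ndcg_at_k(rels, 10) for rels in single_sample_queries.values()]
print(sum(scores) / len(scores))   # 1.0, even though the model ranked nothing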

Model Training

The focus of this section is how to train a LambdaMART model using RankLib [5].
Let’s see an example and analyse all the parameters:

java -jar RankLib-2.7.jar
-train /var/trainingSets/trainingSet_2016-07-20.txt
-feature /var/ltr/trainingSets/trainingSet_2016-07-20_features.txt
-ranker 6
-leaf 75
-mls 5000
-metric2t NDCG@20
-kcv 5
-tvs 0.8
-kcvmd /var/models
-kcvmn model-name.xml
-sparse

-train : Path to the training set file.

-feature : Feature description file; lists the feature ids to be considered by the learner, one per line. If not specified, all features will be used.

-ranker : Specify which ranking algorithm to use, e.g. 6: LambdaMART.

-metric2t : Metric to optimise on the training data. Supported: MAP, NDCG@k, DCG@k, P@k, RR@k, ERR@k (default=ERR@10).

-tvs : Set the train-validation split to be (x)(1.0-x):
x * (size of the training set) will be used for training,
(1.0-x) * (size of the training set) will be used for validation.

-kcv : Stands for k-fold Cross Validation. Specify if you want to perform k-fold cross validation using ONLY the specified training data (default=NoCV).
This means we split the training data into k subsets and run k training executions.
In each execution 1 subset is used as the test set and k-1 subsets are used for training.
N.B. if tvs and kcv are both defined, first we split for the cross validation, then the training set produced is split into training/validation, e.g. with an initial training set of 100 rows and -kcv 5 -tvs 0.8 :
Test Set : 20 rows
Training Set : 64 rows (0.8 * 80)
Validation Set : 16 rows (0.2 * 80)

-kcvmd : Directory where to save the models trained via cross-validation (default=not-save).

-kcvmn : Name for the model learned in each fold; it will be prefixed with the fold number (default=empty).

-sparse : Allow sparse data processing for the training set (which means you don’t need to specify all the features for each sample when a feature has no value).

-tree : Number of trees in the ensemble (default=1000).
The more complex the learning problem, the more trees we need, but it is important to be careful not to overfit the training data.

-leaf : Number of leaves for each tree (default=100).
As with the number of trees, it is important to tune this value carefully to avoid overfitting.

-shrinkage : The learning rate of the algorithm (default=0.1).
If this is too aggressive, the ranking model will quickly fit the training set but will not react properly to the validation set evaluation, which means an overall poorer model.

-tc : Number of threshold candidates for tree splitting; -1 to use all feature values (default=256).
Increasing this value increases the complexity and the risk of overfitting the model. It is suggested to start with a low value and then increase it according to the general cardinality of your features.

-mls : Min leaf support, the minimum number of samples each leaf has to contain (default=1).
This is quite an important parameter if we want to take care of outliers: we can tune it to keep only leaves supported by a high number of samples, discarding patterns backed by weak support.

Tuning the model training parameters is not an easy task.
Apart from the general suggestions, trial & error with a careful comparison of the evaluation metric scores is the way to go.
Comparing different models on a common Test Set can help when we have produced a number of models with different tuning configurations.

Converting Model To Json

We finally produced a model which is performing quite well on our Test set.
Cool, it’s time to push it to Solr and see the magic in action.
But before we can do that, we need to convert the model generated by the machine learning library into the JSON format Solr expects (remember the section about the supported models?).

Taking a LambdaMART model as an example, RankLib will produce an XML model,
so you will need to parse it and convert it to the JSON format.
It would be interesting to implement in RankLib the possibility of directly outputting the JSON format expected by Solr.
Any contribution is welcome: a RankLib extension for JSON models!
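
As a rough starting point, here is a Python sketch of such a conversion. It assumes that the RankLib LambdaMART model file consists of "##" comment header lines followed by an <ensemble> of <tree> elements with nested <split>/<feature>/<threshold>/<output> elements (please verify this against the file produced by your RankLib version), and the feature-id-to-name mapping is a hypothetical example you would replace with your own:

import json
import xml.etree.ElementTree as ET

# Hypothetical mapping from the RankLib feature ids to the Solr feature names.
FEATURE_NAMES = {1: "userTextTitleMatch", 2: "originalScore"}

def split_to_node(split):
    # A <split> with an <output> child is a leaf; otherwise it is a branch
    # with a feature, a threshold and a left/right child split.
    output = split.find("output")
    if output is not None:
        return {"value": float(output.text)}
    children = {child.get("pos"): child for child in split.findall("split")}
    return {
        "feature": FEATURE_NAMES[int(split.find("feature").text)],
        "threshold": float(split.find("threshold").text),
        "left": split_to_node(children["left"]),
        "right": split_to_node(children["right"]),
    }

def ranklib_to_solr_json(model_path, model_name):
    with open(model_path) as model_file:
        # Drop the "##" comment header lines and keep the XML ensemble.
        xml_text = "".join(line for line in model_file if not line.startswith("##"))
    ensemble = ET.fromstring(xml_text)
    trees = [{"weight": float(tree.get("weight")),
              "root": split_to_node(tree.find("split"))}
             for tree in ensemble.findall("tree")]
    return json.dumps({
        "class": "org.apache.solr.ltr.ranking.LambdaMARTModel",
        "name": model_name,
        "features": [{"name": name} for name in FEATURE_NAMES.values()],
        "params": {"trees": trees}}, indent=4)

# Example usage (hypothetical path):
# print(ranklib_to_solr_json("/var/models/model-name.xml", "lambdamartmodel"))
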
In the next part, we’ll see how to visualise and understand better the model generated.
This activity can be vital to debug the model, see the most popular features, find out some anomalies in the training set and actually assess the validity of the model itself.

