Last Stage Of The Journey
This blog post is about the Apache Solr Learning To Rank (LTR) integration.
We modelled our dataset and collected and refined the training data in Part 1.
We trained the model in Part 2.
We analysed and evaluated the model and the training set in Part 3.
We are now ready to rock and deploy the model and feature definitions to Solr.
This blog post focuses on the Apache Solr Learning To Rank (LTR) integration from Bloomberg [1].
The contribution is complete and available from Apache Solr 6.4.
This blog is heavily based on the Learning To Rank (LTR) Bloomberg contribution README [2].
Apache Solr Learning To Rank ( LTR ) integration
The Apache Solr Learning To Rank (LTR) integration allows Solr to rerank search results by evaluating a provided Learning To Rank model.
The main responsibilities of the plugin are:
– storage of feature definitions
– storage of models
– feature extraction and caching
– search result reranking
Features Definition
As we learnt from the previous posts, the feature vector is the mathematical representation of each document/query pair and the model will score each search result according to that vector.
Of course we need to tell Solr how to generate the feature vector for each document in the search results.
Here comes the Feature Definition file.
It is a JSON array describing all the relevant features necessary to score our documents through the machine learned LTR model.
e.g.
[{ "name": "isBook",
"class": "org.apache.solr.ltr.feature.SolrFeature",
"params":{ "fq": ["{!terms f=category}book"] }
},
{
"name": "documentRecency",
"class": "org.apache.solr.ltr.feature.SolrFeature",
"params": {
"q": "{!func}recip( ms(NOW,publish_date), 3.16e-11, 1, 1)"
}
},
{
"name" : "userTextTitleMatch",
"class" : "org.apache.solr.ltr.feature.SolrFeature",
"params" : { "q" : "{!field f=title}${user_text}" }
},
{
"name":"book_price",
"class":"org.apache.solr.ltr.feature.FieldValueFeature",
"params":{"field":"book_price"}
},
{
"name":"originalScore",
"class":"org.apache.solr.ltr.feature.OriginalScoreFeature",
"params":{}
},
{
"name" : "userFromMobile",
"class" : "org.apache.solr.ltr.feature.ValueFeature",
"params" : { "value" : "${userFromMobile:}", "required":true }
}]
SolrFeature
– Query Dependent
– Query Independent

A Solr feature is defined by a Solr query, following the Solr syntax. The value of the feature is calculated as the value returned by that query when run against the document we are scoring. This feature can depend on query-time parameters or can be query independent (see the examples).

e.g.
"params":{"fq": ["{!terms f=category}book"]}
– Query Independent – Boolean feature
If the document matches the term 'book' in the field 'category', the feature value will be 1. It is query independent, as no query parameter affects this calculation.

"params":{"q": "{!func}recip( ms(NOW,publish_date), 3.16e-11, 1, 1)"}
– Query Dependent – Ordinal feature
The feature value is calculated as the result of the function query: the more recent the document, the closer the value is to 1. It is query dependent, as 'NOW' affects the feature value.

"params":{"q": "{!field f=title}${user_text}"}
– Query Dependent – Ordinal feature
The feature value is calculated as the score of the query: the more relevant the title content is to the user query, the higher the value. It is query dependent, as the 'user_text' query parameter affects the calculation.
FieldValueFeature
– Query Independent

A Field Value feature is defined by a Solr field. The value of the feature is the content of that field for the document we are scoring. The field must be STORED or DOC-VALUED. This feature is query independent (see the example).

e.g.
"params":{"field":"book_price"}
– Query Independent – Ordinal feature
The value of the feature will be the content of the 'book_price' field for a given document. It is query independent, as no query parameter affects this calculation.
ValueFeature
– Query Level
– Constant

A Value feature is defined by a constant or an external query parameter. The value of the feature is calculated as the value passed in the Solr request as an EFI (External Feature Information) parameter, or as a constant. This feature depends only on the configured parameter (see the examples).

e.g.
"params":{"value":"${user_from_mobile:}", "required":false}
– Query Level – Boolean feature
The user passes the 'user_from_mobile' request parameter as an EFI. The value of the feature will be the value of that parameter. If the parameter is missing from the request, the default value (after the ':') is assigned. If the feature is required, an exception is thrown when the parameter is missing from the request.

"params":{"value":"5", "required":false}
– Constant – Ordinal feature
The feature value is calculated as the constant value 5. Apart from the constant, nothing affects the calculation.
OriginalScoreFeature
– Query Dependent

An Original Score feature is defined with no additional parameters. The value of the feature is calculated as the original Lucene score of the document given the input query. This feature depends on query-time parameters (see the example).

e.g.
"params":{}
– Query Dependent – Ordinal feature
The feature value will be the original Lucene score given the input query. It is query dependent, as the entire input query affects the calculation.
EFI ( External Feature Information )
As you noticed in the feature definition JSON, external request parameters can affect the feature extraction calculation.
When running a rerank query it is possible to pass additional request parameters that will be used at feature extraction time.
We will see this in detail in the Rerank query section below.
Deploy Features definition
Good: we have defined all the features our model requires, so we can now send them to Solr:
curl -XPUT 'http://localhost:8983/solr/collection1/schema/feature-store' --data-binary @/path/features.json -H 'Content-type:application/json'
View Features Definition
To visualise the features just sent, we can access the feature store:
curl -XGET 'http://localhost:8983/solr/collection1/schema/feature-store'
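If you want to inspect a single feature store, you can append its name to the endpoint. A minimal sketch, assuming the features above were uploaded without an explicit store name and therefore landed in the default store (named _DEFAULT_ in recent Solr versions):
curl -XGET 'http://localhost:8983/solr/collection1/schema/feature-store/_DEFAULT_'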
Models Definition
Linear Model (Ranking SVM, Pranking)
e.g.
{
  "class": "org.apache.solr.ltr.model.LinearModel",
  "name": "myModelName",
  "features": [
    { "name": "userTextTitleMatch" },
    { "name": "originalScore" },
    { "name": "isBook" }
  ],
  "params": {
    "weights": {
      "userTextTitleMatch": 1.0,
      "originalScore": 0.5,
      "isBook": 0.1
    }
  }
}
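For intuition, the linear model scores each document as the weighted sum of its feature values. With purely illustrative feature values userTextTitleMatch = 1.0, originalScore = 20.0 and isBook = 1, the model above would score the document 1.0 * 1.0 + 0.5 * 20.0 + 0.1 * 1 = 11.1.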
Multiple Additive Trees (LambdaMART, Gradient Boosted Regression Trees)
e.g.
{
  "class": "org.apache.solr.ltr.model.MultipleAdditiveTreesModel",
  "name": "lambdamartmodel",
  "features": [
    { "name": "userTextTitleMatch" },
    { "name": "originalScore" }
  ],
  "params": {
    "trees": [
      {
        "weight": 1,
        "root": {
          "feature": "userTextTitleMatch",
          "threshold": 0.5,
          "left": { "value": -100 },
          "right": {
            "feature": "originalScore",
            "threshold": 10.0,
            "left": { "value": 50 },
            "right": { "value": 75 }
          }
        }
      },
      {
        "weight": 2,
        "root": { "value": -10 }
      }
    ]
  }
}
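To make the scoring concrete: each tree is traversed from its root, going left when the feature value is at or below the threshold and right otherwise, and the resulting leaf values are summed, weighted by the tree weights. With purely illustrative values userTextTitleMatch = 1.0 and originalScore = 12.0, the first tree reaches the leaf 75 (1.0 > 0.5, then 12.0 > 10.0), so the final score is 1 * 75 + 2 * (-10) = 55.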
Heuristic Boosted Model (experimental)
The Heuristic Boosted Model is an experimental model that adds a linear boost on top of another model.
It is currently available in the experimental branch [3].
This capability is currently supported only by org.apache.solr.ltr.ranking.HeuristicBoostedLambdaMARTModel.
The reason behind this approach is that sometimes, at training time, we don't have all the features we want to use at query time.
e.g.
Your training set is not built on clicks of the search results and contains legacy data, but you want to include the original score as a boosting factor.
Let's see the configuration in detail:
Given:
"features":[ { "name": "userTextTitleMatch"}, { "name": "originalScoreFeature"} ]
"boost":{ "feature":"originalScoreFeature", "weight":0.1, "type":"SUM" }
The original score feature value, weighted by a factor of 0.1, will be added to the score produced by the LambdaMART trees.
"boost":{ "feature":"originalScoreFeature", "weight":0.1, "type":"PRODUCT" }
The original score feature value, weighted by a factor of 0.1, will be multiplied with the score produced by the LambdaMART trees.
N.B. Take extra care when using this approach. It introduces a manual boost into the score calculation, which adds flexibility when you don't have much training data. However, you will lose some of the benefits of a machine learned model, which is optimised to rerank your results. As you get more data and your model improves, you should phase out the manual boost.
e.g.
{
  "class": "org.apache.solr.ltr.ranking.HeuristicBoostedLambdaMARTModel",
  "name": "lambdamartmodel",
  "features": [
    { "name": "userTextTitleMatch" },
    { "name": "originalScoreFeature" }
  ],
  "params": {
    "boost": {
      "feature": "originalScoreFeature",
      "weight": 0.5,
      "type": "SUM"
    },
    "trees": [
      {
        "weight": 1,
        "root": {
          "feature": "userTextTitleMatch",
          "threshold": 0.5,
          "left": { "value": -100 },
          "right": { "value": 10 }
        }
      },
      {
        "weight": 2,
        "root": { "value": -10 }
      }
    ]
  }
}
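Assuming the SUM boost semantics described above, and purely illustrative values userTextTitleMatch = 1.0 and originalScoreFeature = 30.0: the trees produce 1 * 10 + 2 * (-10) = -10, the boost adds 0.5 * 30.0 = 15.0, and the final score is 5.0.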
Deploy Model
As we saw for the features definition, deploying the model is quite straightforward:
curl -XPUT 'http://localhost:8983/solr/collection1/schema/model-store' --data-binary @/path/model.json -H 'Content-type:application/json'
View Model
The model will be stored in an easily accessible JSON store:
curl -XGET 'http://localhost:8983/solr/collection1/schema/model-store'
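Models (and feature stores) can also be removed through the same REST endpoints. A minimal sketch, assuming the model name used above:
curl -XDELETE 'http://localhost:8983/solr/collection1/schema/model-store/myModelName'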
Rerank query
To rerank your search results using a machine learned LTR model, you need to call the rerank component using the Apache Solr Learning To Rank (LTR) query parser.
Query Re-Ranking allows you to run an initial query (A) for matching documents and then re-rank the top N documents, re-scoring them based on a second query (B).
Since the more costly ranking from query B is only applied to the top N documents, it will have less impact on performance than just using the complex query B by itself – the trade-off is that documents which score very low using the simple query A may not be considered during the re-ranking phase, even if they would score very highly using query B. (from the Solr Wiki)
The Apache Solr Learning To Rank ( LTR ) integration defines an additional query parser that can be used to define the rerank strategy.
In particular, when rescoring a document in the search results:
- Features are extracted from the document
- Score is calculated evaluating the model against the extracted feature vector
- Final search results are reranked according to the new score
e.g.
rq={!ltr model=myModelName reRankDocs=25}

!ltr – will use the ltr query parser
model=myModelName – specifies which model in the model-store to use to score the documents
reRankDocs=25 – specifies that only the top 25 search results from the original ranking will be scored and reranked
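Putting it all together, a complete request could look like the following sketch (the collection name, query and field list are placeholders reusing the earlier examples):
curl 'http://localhost:8983/solr/collection1/select' --data-urlencode 'q=test' --data-urlencode 'rq={!ltr model=myModelName reRankDocs=25}' --data-urlencode 'fl=id,score'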
When passing external feature information (EFI) that will be used to extract the feature vector, the syntax is pretty similar; each efi.* parameter is substituted into the corresponding ${...} placeholder in the feature definitions:
rq={!ltr reRankDocs=3 model=externalmodel efi.parameter1='value1' efi.parameter2='value2'}
e.g.
rq={!ltr reRankDocs=3 model=externalModel efi.user_input_query='Casablanca' efi.user_from_mobile=1}
Sharding
When using sharding, each shard reranks its own results, so reRankDocs is applied per shard.
e.g.
10 shards
You run a distributed query with:
rq={!ltr reRankDocs=10 …
You will get a total of 100 documents reranked.
Pagination
Pagination is delicate [4].
Single Solr node
e.g. rows=10, reRankDocs=15
This means each page is composed of 10 results.
What happens when we hit page 2?
The first 5 documents in the search results will have been rescored and affected by the reranking.
The last 5 documents will preserve their original score and original ranking.
e.g. the tail of page 2 keeps its original scores:
Doc 18 – score = 5.5
Doc 19 – score = 4.6
Doc 20 – score = 2.4
This means that score(15) could be < score(16), even though document 15 still appears before document 16.
The reason is that the top 15 documents are rescored and reranked, while the rest are left unchanged.
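As a reference, the single-node page 2 request for this scenario could look like the following sketch (collection, query and model names are placeholders reusing the earlier examples):
curl 'http://localhost:8983/solr/collection1/select' --data-urlencode 'q=test' --data-urlencode 'start=10' --data-urlencode 'rows=10' --data-urlencode 'rq={!ltr model=myModelName reRankDocs=15}'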
Sharded architecture
Shards number = 2, rows = 10
When looking for page 2, Solr will trigger queries to the shards to collect 2 pages per shard:
Shard1: 10 reranked docs (page 1) + 10 originally scored docs (page 2)
Shard2: 10 reranked docs (page 1) + 10 originally scored docs (page 2)
Then the results will be merged, and originally scored search results can possibly rank above reranked docs.
A possible solution could be to normalise the scores, to prevent any possibility that a reranked result is surpassed by originally scored ones.
Note: the problem appears once rows * page > reRankDocs. When reRankDocs is quite high, the problem will occur only in deep paging.
Feature Extraction And Caching
For each document, the definitions in the feature-store are applied to generate the feature vector.
This means that, given the query and the EFIs, we can cache the entire feature vector for the document [5].
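Extracted feature vectors can also be returned in the response through the feature logging transformer, which is handy for debugging and for building new training sets. A minimal sketch, assuming the [features] transformer has been registered in solrconfig.xml as described in the Solr LTR README, and reusing the placeholder names from the earlier examples:
curl 'http://localhost:8983/solr/collection1/select' --data-urlencode 'q=test' --data-urlencode 'rq={!ltr model=myModelName reRankDocs=25 efi.user_text=test}' --data-urlencode 'fl=id,score,[features]'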
[1] Solr LTR Plugin
[2] Solr LTR Plugin Official README
[3] Solr LTR Plugin Experimental Branch
[4] Pagination Issue
[5] Caching Improvements