ECIR 2018 Experience

Logo.png

This blog is a quick summary of my (subjective) experience at ECIR 2018 : the 40th European Conference on Information Retrieval, hosted in Grenoble (France) from 26/03/2018 to 29/03/2018.

Deep Learning and Explicability

Eight long papers accepted were about Deep Learning.
The topics “Neural Network” and “Word Embedding” were the most occurring in the accepted full papers (and rejected) at the conference. It is clear that deep learning technologies are strongly advancing as a de facto standard in Artificial Intelligence and this can be noticed also in Information Retrieval where these technologies can be used to better model the user behaviour and extract topics and semantic from documents.
But if with deep learning and the advanced capabilities of complex models you gain performance, on the other hand you  lose the ability of explaining and debugging why such output corresponds to a given input.
A recurring topic in the Deep Learning track ( and the Industry Day actually) was to find a balance in the performance gain of new techniques and the control over them.
It is an interesting topic and I believe it is just the other face of the coin, it is good though that Academia is noting the importance of such aspect : in the industry a technology requires control and maintenance much more than in the academic environment and most of the times the “debuggability” can affect the decision between a technology and another.

From Academic Papers to Production

The 2018 European Conference on Information Retrieval ended the 29/03 with a brilliant Industry Day focused on the perilous path from Research to Production.
This is the topic that most permeated the conference across different keynotes, sessions and informal discussions.
In Information Retrieval it is still difficult to converge successful researches into successful live systems : most of the times it is not clear which party should be interested in this process.
Okapi BM25 was first published and implemented in the 1980s and 1990s; it became the default similarity in Apache Lucene 6.0 in 2016 .
Academia is focused on finding new interesting problems to solve with inventive techniques and Industry is focused on finding the quickest solution that works (usually).
This brings a gap :  new solutions are not easily reproducible from academic papers and they are far from being practically ready to use outside the experimental controlled environment.
Brilliant researchers crave new interesting problems they can reason on and solve: their focus is to build a rough implementation and validate it with few metrics ( ideally with an accepted related publication ).
After that, their job is done, problem solved, the challenge is not interesting anymore and they can pass to the next problem.
Researchers get bored easily but they risk to never see their researches fulfilled, applied and used in real life.
Often Academia creates its own problems to solve them and get a publication : Publish or Perish -> no publications bring less funds.
This may be or may be not a personal problem, depending on the individual.
Industry is usually seen by academics as a boring place where you just apply consolidated techniques to just get the best result with a minimum effort.
Sometimes industry is just where you end up when you want to make some money and actually see your effort to bring benefits to some population.
And this was(is) true most of the times but with the IT explosion we are living and the boom of competition the situation nowadays is open to change, where a stronger connection between Academia and Industry can ( and should !) happen and conferences such as ECIR is a perfect ground to settle the basis.
So, building on the introduction let’s see a quick summary of the keynotes, topics and sessions that impressed me most from the conference!

From Academic Papers to Production : A Learning To Rank Story[1]

Let’s start from Sease contribution to the conference : a story about the adoption of Learning To Rank in a real world e-commerce scenario.
The session took place at the Industry Day (29/03) and focused on the challenges and pitfalls of moving from the research papers to production.
The entire journey was possible through Open Source software : Apache Solr and RankLib as main actors.
Learning To Rank is becoming extremely popular in the industry and it is a very promising and useful technology to improve the relevancy function leveraging the user behaviour.
But from an open source perspective is still quite a young technology and effort is required to get it right.
The main message that I wanted to transmit is : don’t be scared to fail, if something doesn’t work immediately out of the box, it doesn’t mean it’s not a valid technology, No Pain No Gain, Learning To Rank open source implementations are valid but require tuning and care to bring them to production with success ( and the improvement these technologies can bring is extremely valuable).

The Harsh Reality of Production Information Access Systems[2]

Recipient of the “Karen Spärck Jones Award” Fernando Diaz in this talk focused on the problems derived from the adoption of Information Retrieval technologies in production environments where a deep understanding of individuals, groups and society is required.
Building on the technical aspect involved in applying research in real world systems, the focus switched to the ethical side : are current IR systems moving in the direction of providing the user with a fair and equally accessible information or the monetisation process behind is producing just addicted users who see (and buy) ad hoc tailored information ?

Statistical Stemmers : A Reproducibility Study[3]

This paper is a sort of symbol of the conference trend : reproducibility is as important as the innovation that a research brings.
Winner of the Best Paper Award, it may have caused some perplexity among the audience ( why to reward a paper that is not innovating but just reproducing past techniques ?), but the message that transmits is clear : a research needs to be easily reproducible and effort is required in that direction for an healthy Research & Development flow that doesn’t target just the publication but a real world application.

Entity-centric Topic Extraction and Exploration: A Network-based Approach[4]

Interesting talk, it explores topic modelling over time in a network based fashion :
Instead of modelling a topic as ranked list of terms it uses a network (weighted graph) representation.
This may be interesting for an advanced More Like This implementation, it’s definitely worth an investigation.

Information Scent, Searching and Stopping : Modelling SERP Level Stopping Behaviour[5]

This talk focused on the entire Search Result Page as a possible signal that affects user stopping behaviour i.e. when a search result page is returned, the overall page quality affects the user perception of relevancy and may drive an immediate query reformulation OR a good abandonment ( when the information need is satisfied).
This something that I personally experimented : sometimes from a quick look to the result page you may realise if the search engine understood ( or misunderstood) your information need.
Different factors are involved in what the author calls “Information Scent” but definitely the perceived relevance (modelled through different User Experience approaches) is definitely an interesting topic that sits along the real relevance.
Further studies in this area may affect the way Search Results pages are rendered, to maximise the fruition of information.

Employing Document Embeddings to Solve the “New Catalog” Problem in User Targeting, and provide Explanations to the Users[6]

The new catalog problem is a practical problem for modern recommender systems and platforms. There are a lot of use cases where you have collection of items that you would like to recommend and this ranges over Music Streaming Platforms ( Playlists, Albums, ect), Video Streaming Platform ( Tv Series genres, To View lists ect) and many other domains.
This paper explores both the algorithm behind such recommendations and the explanation need : explaining to the user why a catalog may be relevant for his/her taste is as important as providing a relevant catalog of items.

Anatomy of an Idea: Mixing Open Source, research and Business

This keynote summarises the cornerstone of Sease culture :
Open Source as a bridge between Academia and Industry.
If every research paper should be implemented in a production ready open source platform as part of the publication process, the community would get a direct and immense benefit out of it.
Iterative improvement would get a great traction and generally speaking the entire scientific community will get a strong boost with a better accessibility .
Implementing researches in state of the art production ready open source systems ( where possible) would cut the adoption time at industry level, triggering an healthy process of utilisation and bug fixing.

Industry Day[7]

The industry day was the coronation of the overall trend that was permeating the conference : there is a strong need to build a better connection between the Academic and Industrial world.
And the good audience reception shown ( the organisers had to move the track to the main venue) is a proof that there is an increasingly stronger need of seeing interesting researches applied ( with success or failure) to the real world with the related lessons learned.
Plenty of talks in this session were brilliant, my favourites :

  • Fabrizio Silvestri (Facebook)
    Query Embeddings: From Research to Production and Back!
  • Manos Tsagkias (904Labs)
    A.I. for Search: Lessons Learned
  • Marc Bron (Schibsted Media Group)
    Managment of Industry Research: Experiences of a Research Scientist

In conclusion, the conference was perfectly organised in an excellent venue, the balance in topics and talks was fairly good( both academic and industrial) and I really enjoyed my time in Grenoble, see you next year in Cologne !

[1] https://www.slideshare.net/AlessandroBenedetti/ecir-2018-alessandro
[2] https://www.ecir2018.org/programme/keynote-speakers/
[3] http://www.dei.unipd.it/~silvello/papers/ECIR2018_SA.pdf
[4] https://dbs.ifi.uni-heidelberg.de/files/Team/aspitz/publications/Spitz_Gertz_2018_Entity-centric_Topic_Extraction.pdf
[5] https://strathprints.strath.ac.uk/62856/
[6] https://link.springer.com/chapter/10.1007%2F978-3-319-76941-7_28
[7] https://www.ecir2018.org/industry-day/

Solr Is Learning To Rank Better – Part 2 – Model Training

Model Training For Apache Solr Learning To Rank 

If you want to train a model for Apache Solr Learning To Rank , you are in the right place.
This blog post is about the model training phase for the Apache Solr Learning To Rank integration.

We modelled our dataset, we collected the data and refined it in Part 1 .
We have now a shiny lot-of-rows training set ready to be used to train the mathematical model that will re-rank the resulting documents coming out from our queries.
The very first activity to carry on is to decide the model that fits best our requirements.
I will focus in this blog post on two type of models, the ones currently supported by the Apache Solr Learning To Rank integration [1] .

Ranking SVM

Ranking SVM is a linear model based on Support Vector Machines.

The Ranking SVM algorithm is a pair-wise ranking method that adaptively sorts results based on how ‘relevant’  they are for a specific query ( we saw example of relevancy rating for the pair document-query in Part 1).

The Ranking SVM function maps each search query to the features of each sample.
This mapping function projects each data pair onto a feature space.

Each sample ( query-document ) of the training data will be used for the Ranking SVM algorithm to refine the mapping.

Generally, Ranking SVM includes three steps at training time:

  • It maps the similarities between queries and the clicked results onto a certain feature space.
  • It calculates the distances between any two of the vectors obtained in step 1.
  • It forms an optimization problem which is similar to a standard SVM classification and solves this problem with the regular SVM solver
Given the list of the features that describe our problem, an SVM model will assign a numerical weight to each of them, the resulting function will assign a score to a document given in input the feature vector that describes it.
Let’s see an example SVM Model in the Json format expected by the Solr plugin :
e.g.
{
    "class":"org.apache.solr.ltr.ranking.RankSVMModel",
    "name":"myModelName",
    "features":[
        { "name": "userTextTitleMatch"},
        { "name": "originalScore"},
        { "name": "isBook"}
    ],
    "params":{
        "weights": {
            "userTextTitleMatch": 1.0,
            "originalScore": 0.5,
            "isBook": 0.1
        }

}
}

Given 3 example features :
userTextTitleMatch – (query dependant feature) – (binary)
originalScore –  (query dependant feature) – (ordinal)
isBook – (document dependant feature) – (binary)

The ranking SVM model in the example builds a linear function of the variables assigning a different weight to each of them.

Given the documents (expressed with their feature vector):
D1 [1.0, 100, 1]
D2 [0.0, 80, 1]

Applying the model we get the scores :

Score(D1) = 1.0 * userTextTitleMatch(1.0) + 0.5 * originalScore(100) + 0.1 * isBook(1.0) = 51.1
Score(D2) = 1.0 * userTextTitleMatch(0.0) + 0.5 * originalScore(80) + 0.1 * isBook(1.0) = 40.1
The D1 documents is more relevant according the model.

Key points :

  • easy to debug and understand
  • linear function

Here you can find some good Open Source libraries to train SVM models [2].

LambdaMART

LambdaMART is a tree ensemble based model.
Each tree of the ensemble is a weighted regression tree and the final predicted score is the weighted sum of the prediction of each regression tree.
A regression tree is a decision tree that takes in input a feature vector and returns a scalar numerical value in output.
At a high level, LambdaMART is an algorithm that uses gradient boosting to directly optimize Learning to Rank specific cost functions such as NDCG.

To understand LambdaMART let’s explore the main two aspects: Lambda and MART.

MART
LambdaMART is a specific instance of Gradient Boosted Regression Trees, also referred to as Multiple Additive Regression Trees (MART).
Gradient Boosting is a technique for forming a model that is a weighted combination of an ensemble of “weak learners”. In our case, each “weak learner” is a decision tree.
Lambda
At each training point we need to provide a gradient that will allow us to minimize the cost function ( whatever we selected, NDCG for example).
To solve the problem LambdaMART uses an idea coming from lambdaRank:
at each certain point we calculate a value that acts as the gradient we require, this component will effectively modify the ranking, pushing up or down groups of documents and affecting effectively the rules for the leaf values, which cover the point used to calculate lambda.
For additional details these resources has been useful [3] ,[4] .
LambdaMART is currently considered one of the most effective model and it has been proved by a number of winning contests.

Let’s see how a lambdaMART model looks like :

{
    "class":"org.apache.solr.ltr.ranking.LambdaMARTModel",
    "name":"lambdamartmodel",
    "features":[
        { "name": "userTextTitleMatch"},
        { "name": "originalScore"}
    ],
    "params":{
        "trees": [
            {
                "weight" : 1,
                "root": {
                    "feature": "userTextTitleMatch",
                    "threshold": 0.5,
                    "left" : {
                        "value" : -100
                    },
                    "right": {
                        "feature" : "originalScore",
                        "threshold": 10.0,
                        "left" : {
                            "value" : 50
                        },
                        "right" : {
                            "value" : 75
                        }
                    }
                }
            },
            {
                "weight" : 2,
                "root": {
                    "value" : -10
                }
            }
        ]
    }
}

This example model is composed by two trees, each branch split evaluates a conditional expression on a feature value, if the feature value is  :
<=  threshold we visit the left branch
>  threshold we visit the right branch

Reaching the leaf of a tree will produce a score, this will be weighted according to the tree weight and then summed to the other scores produced ( the model is an ensemble of trees).
Given the documents :

D1 [1, 9]
D2 [0, 10]

Applying the model we get the scores :

Score(D1) = 1* (userTextTitleMatch (1) > 0.5 go right , originalScore (9) < 10 = 50) +
2 * -10 = 30

Score(D2) = 1 *(userTextTitleMatch(0) <= 0.5 = -100) +
2 * -10 = -120

The D1 document is more relevant according the model.

Key points :

  • ensemble of trees are effective
  • difficult to debug
  • non linear function

There are good open source libraries to train LambdaMART models, the one that we are going to use in this case of study is RankLib [5]

Evaluation Metric

Important phase of the journey is to validate the model :
How good is our model in re-ranking our dataset ?
How good is the model in re-ranking unknown data  ?
It is really important to carefully select the evaluation metric that best suites your algorithm and your needs.
Evaluating the model will be vital for the validation phase during the training, for testing the model against new data and to consistently being able to assess the predicting quality of the model trained.
N.B. this is particularly important: having a good understanding of the evaluation metric for the algorithm selected, can help a lot in tuning,  improving the model and finding anomalies in the training data.

NDCG@K

Normalised Discounted Cumulative Gain @k is a natural fit when choosing lambdaMART.
This metric evaluates the performance of a ranking model, it varies from 0.0 to 1.0, with 1.0
representing the ideal ranking model.
It takes K as parameter :
K : maximum number of documents that will be returned .

K specifies, for each query,  after the re-ranking, the top K results to evaluate.
N.B. set K to be the number of top documents you want to optimise.
Generally it is the number of documents you show in the first page of results.

Given in input the test set, this is grouped by queryId and for each query the ideal ranking (obtained sorting the documents by descendant relevancy) is compared with the ranking generated by the ranking model.
An average for all the queryIds is calculated.
Detailed explanation can be found in Wikipedia [6].

Knowing how it works, 2 factors sound immediately important when reasoning about NDCG :
1) distribution of relevancy scores in the test set, per queryId
2) number of samples per queryId

War Story 1 : too many highly relevant samples -> high NDCG
Given the test set A :
5 qid:1 1:1 …
5 qid:1 1:1 …
4 qid:1 1:1 …
5 qid:2 1:1 …
4 qid:2 1:1 …
4 qid:2 1:1 …

Even if the ranking model does not a good job, the high distribution of relevant samples increase the probability of hitting high score in the metric.

War Story 2 : small RankLists -> high NDCG
Having few samples (<K) per queryId means that we will not evaluate the performance of our ranking model with accuracy.

In the edge case of 1 sample per queryId, our NDCG@10 will be perfect, but actually this reflect simply a really poor training/test set.

N.B. be careful on how you generate your queryId as you can end up in this edge case and have a wrong perception of the quality of your ranking model.

Model Training

Focus of this section will be how to train a LambdaMART model using RankLib[5],
Let’s see an example and analyse all the parameters:
java  -jar RankLib-2.7.jar 
train /var/trainingSets/trainingSet_2016-07-20.txt 
-feature /var/ltr/trainingSets/trainingSet_2016-07-20_features.txt 
-ranker
-leaf 75
mls 5000 
-metric2t NDCG@20 
kcv 5 
-tvs 0.8 
-kcvmd /var/models 
-kcvmn model-name.xml 
-sparse 
Parameter Description
train
Path to the training set file
feature Feature description file, list feature Ids to be considered by the learner, each on a separate line. If not specified, all features will be used.
ranker Specify which ranking algorithm to use e.g. 6: LambdaMART
metric2t Metric to optimize on the training data. Supported: MAP, NDCG@k, DCG@k, P@k, RR@k, ERR@k (default=ERR@10)
tvs
Set train-validation split to be (x)(1.0-x).
x * (size of the training set) will be used for training
(1.0 -x) * (size of the training set) will be used for validation
kcv
Stands for k Cross Validation
Specify if you want to perform k-fold cross validation using ONLY the specified training data (default=NoCV).
This means we split the training data in k subsets and we run k training executions.
In each execution 1 subset will be used as the test set and k-1 subsets will be used for training.
N.B. in the case that tvs and kcv are both defined, first we split for the cross validation, than the training set produced will be split in training/validation.
e.g.
Initial Training Set size : 100 rows

-kcv 5 
-tvs 0.8 
Test Set : 20 rows
Training Set :  64 ( 0.8 * 80)
Validation Set : 16 (0.2 * 80)
kcvmd
Directory where to save models trained via cross-validation (default=not-save).
kcvmn Name for model learned in each fold. It will be prefix-ed with the fold-number (default=empty).
sparse Allow sparse data processing for the training set ( which means that you don’t need to specify all the features for each sample where the feature has no value)
tree
Number of trees of the ensemble(default=1000).
More complex is the learning problem, more trees we need, but it’s important to be careful and not overfit the training data
leaf
Number of leaves for each tree (default=100).
As the number of trees, it is important to tune carefully this value to avoid overfitting trees.
shrinkage This is the learning rate of the algorithm(default=0.1)
If this is too aggressive the Ranking model will quickly fit the training set, but will not react properly to the validation set evaluation, which means an overall poorer model.
tc
Number of threshold candidates for tree spliting.
-1 to use all feature values (default=256)
Increasing this value we increase the complexity and possible overfitting of the model.
It is suggested to start with a low value and then increase it according the general cardinality of your features.
mls
Min leaf support – minimum #samples each leaf has to contain (default=1).
This is quite an important parameter if we want to take care of outliers.
We can tune this parameter to include only leaves with an high number of samples, discarding pattern validated by a weak support.

Tuning the model training parameters is not an easy task.
Apart the general suggestions, trial & error with a careful comparison of the evaluation metric score is the path to go.
Comparing different models on a common Test Set can help when we have produced a number of models with different tuning configuration.

Converting Model To Json

We finally produced a model which is performing quite well on our Test set.
Cool, it’s the time to push it to Solr and see the magic in action.
But before we can do that, we need to convert the model generated by the Machine Learning libraries into the Json format Solr expects ( you remember the section about the models supported ? )
Taking as example a LambdaMART model, Ranklib will produce an xml model.
So you will need to parse it and convert it to the Json format.
Would be interesting to implement directly in RankLib the possibility of selecting in output the Json format expected by Solr.
Any contribution is welcome [7]!
In the next part we’ll see how to visualise and understand better the model generated.
This activity can be vital to debug the model, see the most popular features, find out some anomaly in the training set and actually assess the validity of the model itself.
 

Solr Is Learning To Rank Better – Part 1 – Data Collection

Learning To Rank In Apache Solr

Introduction

This blog post is about the journey necessary to bring Learning To Rank In Apache Solr search engines.

Learning to Rank[1] is the application of Machine Learning in the construction of ranking models for Information Retrieval systems.
Introducing supervised learning from user behaviour and signals can improve the relevancy of the documents retrieved bringing a new approach in ranking them.
Can be helpful in countless domains, refining free text search results or building a ranking algorithm where only filtering is happening and no initial scoring is available.

This series of blog posts will explore a full journey from the Signals Collection through the re-ranking in Solr.
Part 1  will explore the data collection, data modelling and refining phase
Part 2 will explore the training phase
Part 3 will describe a set of utilities to analyse your models
Part 4 will cover the Solr integration

Requirement : Minimal knowledge and interest in Machine Learning

Special thanks to Diego Ceccarelli, that helped me a lot in the last months introducing me to this amazing topic.
Second special thanks to Juan Luis Andrades, for the passion we shared during this journey and Jesse McLaughlin for the careful technical java insights.
Final special thanks to David Bunbury who contributed with interest, passion and very thoughtful ideas to the cause.

Collecting Signals

The start of the journey is the signals collection, it is a key phase and involves the modelling of the supervised training set that will be used to train the model.
A training set can be Explicit or Implicit.

Implicit

An Implicit training set is collected from user behaviours and interactions.
e.g.
Historical sales and transactions of an E-commerce website
User Clicks in the search result page
Time spent on each document accessed
 
A training set of this type is quite noisy but allows to collect great numbers of signals with small effort.
More the user was engaged with a particular document, stronger the signal of relevancy.
e.g.
A sale of a product is a stronger signal of relevancy than adding it to the basket
User Clicks in the search result page in comparison with documents shown but not clicked
Longer you read a document, stronger the relevancy
 
+ Pros : Cheap to build
Cons : Noisy

 

Explicit

An Explicit training set is collected directly from the interaction with Human Experts.
Given a query and a set of documents, the Human expert will rate the relevancy of each document in the result set.
The score assigned to the document will rate how relevant the document was for the query.
To remove the subjective bias is suggested to use a team of experts to build the training set.
A training set of this type is highly accurate but it is really expensive to build as you need a huge team of experts to produce thousands of rating for all the queries of interest.
+ Pros : Accuracy
– Cons : Expensive to build
 

Training Set

The training set can have different forms, for the sake of this post we will focus on the pairwise approach.
In particular the syntax exposed will be the RankLib[2] syntax.
Each sample of the training set is a signal / event that describes a pair (Query-Document) .
Let’s take as an example a query (1) and a document (1A)

3 qid:1 1:1 2:1 3:0 # 1A

The document can be represented as a feature vector which is a vector of scalar numeric values.
Each element in the vector represents a specific aspect of the document.
We’ll see this in the feature part in details.
Let’s now focus in understanding the sample :

3
How relevant the document is for the query

qid:1
Identify the query

1:1 2:1 3:0
The document, represented as a vector of numerical features

# 1A
A comment, to make more readable your training data

Feature Engineering

For convenience of Machine Learning algorithms, query-document pairs are represented by numerical vectors.
Components of such vectors are called features and can be divided into different groups (depending if they depend on the document, the query or both) :
Document Features (query independent)
This kind of feature depends only on the document and not on the query.
e.g.
Document length
Price of the product
User Rating of the product
Interesting aspect of these features is that they can be potentially precomputed in off-line mode during indexing. They may be used to compute document’s static quality score (or static rank), which is often used to speed up search query evaluation.

 

Query Dependent Features 
This kind of features depends on the query and on the document
e.g.
Is the document containing the query text in the title ?
Is the document (product) of the same brand as expressed in the query
 
Query Level Features 
This kind of features depends only on the query.
e.g.
Number of words in the query
Cuisine Type Selected( e.g. “Italian”, “Sushi” when searching for Restaurants)
Date Selected ( e.g. when searching in an Hotel Booking system)
Department Selected ( e.g. “electonics”, “kitchen”, “DIY” … in an E-commerce website)
User Dependent Features
Also in this case this kind of feature does not depend on the document.
It only depends on the user running the query.
e.g.
Device 
Age of the user
Gender
As described , for convenience of the mathematical algorithms each high level feature must be modelled as a numeric feature.
In the real world a feature describes an aspect of the object (document) and must be represented accordingly:
Ordinal Features
An ordinal feature represents a numerical value with a certain position in a sequence of numbers.
e.g.
Star Rating  ( for a document describing an Hotel)
Price  ( for a document describing an e-commerce product)
 
For the Star Rating feature, stands an order for the different values:
1<2<3<4<5  is logically correct.
For the Price feature, the same observation applies .
100$ < 200$ <300$
A feature is Ordinal when it is possible to compare different values and decide the ranking of these.
 
Categorical Features
A categorical feature represents an attribute of an object that have a set of distinct possible values.
In computer science it is common to call the possible values of a categorical features Enumerations.
e.g.
Colour ( for a document describing a dress)
Country ( for a document describing a location)
It easy to observe that to give an order to the values of a categorical feature does not make any sense.
For the Colour feature :
red < blue < black has no general meaning.
 
Binary Features
A binary feature represents an attribute of an object that can have only two possible values.
Traditionally 0 / 1 in accordance with the binary numeral system.
e.g.
Is the product available ? yes/no ( for a document describing an e-commerce product)
Is the colour Red ? ( for a document describing a dress)
Is the country Italy ? ( for a document describing a location)

One Hot Encoding

When a categorical feature describes your Document, it is tempting to represent each category as an Integer Id :
e.g.
Categorical Feature : colour
Distinct Values : red, green, blue
Representation : colour:1, colour:2 colour:3 
 
With this strategy, the Machine learning algorithm will be able to manage the feature values…
But is the information we pass, the same as the original one ?
Representing a categorical feature as an ordinal feature is introducing an additional ordinal relationship :
1(red) < 2(green) < 3(blue)
which doesn’t reflect the original information.
There are different ways to encode categorical features to make them understandable by the training algorithm. We need basically to encode the original information the feature provides in an numeric form, without any loss or addition if possible.
One possible approach is called One Hot Encoding [3]:
Given a categorical feature with N distinct values, encode it in N binary features, each feature will state if the category applies to the Document.
e.g.
Categorical Feature : colour
Distinct Values : red, green, blue
Encoded Features : colour_red, colour_green, colour_blue 
A document representing a blue shirt will be described by the following feature vector :
… colour_red:0 colour_green:0 colour_blue:1 …
One Hot Encoding is really useful to properly model your information, but take care of the cardinality of your categorical feature as this will be reflected in the number of final features that will describe your signal.
War Story 1 : High Cardinality Categorical Feature
A signal describing a document with a high level categorical feature (with N distinct values) can produce a Feature vector of length N.
This can deteriorate the performance of your trainer as it will need to manage many more features per signal.
It actually happened to me, that simply adding one categorical feature was bringing in thousands of binary features, exhausting the hardware my trainer was using,  killing the training process.
To mitigate this , can be useful to limit the encoded distinct values only to a subset :
  • with white list / black list approach business driven
  • keeping only the top occurring values
  • keeping only the values occurring more than a threshold
  • encode the rest as a scpecial feature :colour_misc
  • Hash the distinct values into a reduced set of hashes

Feature Normalization

Feature Normalisation is a method used to standardize the range of values across different features, a technique quite useful in the data pre-processing phase.
As the majority of machine learning algorithms use the Euclidean distance to calculate the distance between two different points (training vector signals), if a feature has a widely different scales, the distance can be governed by this particular feature.
Normalizing can simplify the problem and give the same weight to each of the features involved.
There are different type of normalization, some of them :
  • Linear Normalization ( min/max based)
  • Sum Normalization ( based on the sum of all the values of the feature )
  • Z Score ( based on the mean/standard deviation of the feature )

Feature Value Quantisation

Another approach to simplify the job of the training algorithm is to quantise the feature values, in order to reduce the cardinality of distinct values per feature.
It is basically the simple concept of rounding, whenever we realise that it does not make any difference for the domain to model the value with an high precision, it is suggested to simplify it and round it to the acceptable level.
e.g.
Domain: ospitality
Ranking problem : Rank restaurants documents
Assuming a feature is the trip_advisor_reviews_count, is it really necessary to model the value as the precise amount of reviews ? Normally would be simpler to round to the nearest k ( 250 or whatever sensible to the business)

Note : Extra care must be taken into account if following this approach.
The reason is that adding an artificial rounding to the data can be dangerous, we can basically compromise the feature itself setting up hard thresholds.
It is always better if the algorithm decides the thresholds with freedom.
It is suggested were possible to not quantise, or quantise only after a deep statistical analysis on our data.
If the target is to simplify the model for any reason, it is possible to evaluate at training time less threshold candidates, depending on the training algorithm.

Missing Values

Some of the signals we are collecting could miss some of the features ( data corruption, bug in the signal collection or simply the information was not available at the time ) .
Modelling our signals with a sparse feature vector will imply that a missing feature will actually be modelled as feature with value 0.
This should generally be ok, but we must be careful in the case that 0 is a valid value for the feature.
e.g.
Given a user_rating feature
A rating of 0 means the product has a very bad rating.
A missing rating means we don’t have a rating for the product ( the product can still be really good) 
A first approach could be to model the 0 rating as slightly greater than 0 (i.e. 0 + ε ) and keep the sparse representation.
In this way we are differentiating the information but we are still modelling the wrong ordinal relationship : 
Missing User Rating  (0) < User Rating 0 (0 + ε)
 
Unfortunately, at least for the RankLib implementation, a missing feature will always be modelled with a value 0, this of course will vary from algorithm to algorithm.
But we can enforce the learning a bit, adding an additional binary feature that states that the User Rating is actually missing :
user_rating:0 , user_rating_missing:1 .
This should help the learning process to actually understand better the difference.
Furthermore, if possible we can help the algorithm, avoiding the sparse representation if necessary and setting for the missing feature a value which is the avg of the feature itself across the different samples.

Outliers

Some of the signals we are collecting could have some outliers ( some signal with an unlikely extremely different value for a specific feature).
This can be caused by bugs in the signal collection process or simply the anomaly can be a real instance of a really rare signal.
Outliers can complicate the job of the model training and can end up in overfitting models that have difficulties in adaptation for unknown datasets.
Identify and resolve anomalies can be vitally important if your dataset is quite fragile.

Tool for data visualisation can help in visualising outliers, but for a deep analysis I suggest to have a read of this interesting blog post [4] .

Feature Definition

Defining the proper set of features to describe the document of our domain is an hard task.
It is not easy to identify in the first place all the relevant features even if we are domain experts, this procedure will take time and a lot of trial and error.
Let’s see a guideline to try to build a feature vector as best as possible :
  • Keep it simple : start from a limited set of features which are really fundamental to describe your problem, the model produced will be really poor, but at least you have the baseline.
  • Iteratively train the model : removing or adding a feature at each execution. this is time consuming but will allow you to identify clearly which features really  matter.
  • Visualise Important Features : After you trained the model, use a visualisation tool to verify which feature is appearing more
  • Meet the Business : Have meetings with the business, to compare what they would expect to see re-ranked and what actually the model re-ranks. When there is discordance let’s have the humans explain why, this should drive to identify missing features or feature that were used in wrong meaning .

Data Preparation

We have carefully designed our vectorial representation of the domain documents, we identified the source of our signals and built our training Set.
So far, so good…
But the model is still performing really poor.
In this scenario the reasons can be countless :

  •  Poor signal quality (noise)
  •  Incomplete feature vector representation
  •  Not uniform distribution of relevant-not relevant documents per queries

Let’s explore some guideline to overcome some of these difficulties :

Noise Removal

In the scenario of implicit signals, it is likely we model the relevancy rating based on an evaluation metric of the user engagement with the document given a certain query.
Depending of the domain we can measure the user engagement in different ways.
Let’s see an example for a specific domain :  E-commerce
We can assign the relevancy rating of each signal depending on the user interaction :
Given a scale 1 to 3 :
1 – User clicked the product
2 – User added the product to the basket
3 – User bought the product
The simplest approach would be to store 1 signal per user interaction.
User behavioural signals are noisy by nature, but this approach introduces even more noise, as for the same feature vector we introduce discordant signals, specifically we are telling the training algorithm that given that feature vector and that query, the document is at the same time :
vaguely relevant – relevant – strongly relevant .
This doesn’t help the training algorithm at all, so we need to find a strategy to avoid that.
One possible way is to keep only the strongest signal per document-query per user episode .
In the case of a user buying a product, we avoid storing in the training set 3 signals, but we keep only the most relevant one.
In this way we transmit to the training algorithm the only the important information for the user interaction with no confusion.

Unbalanced Dataset

In some domain, would be quite common to have a very unbalanced dataset.
A dataset is unbalanced when the relevancy classes are not represented equally in the dataset i.e. we have many more samples of a relevancy class than another.
Taking again the E-commerce example, the number of relevant signals (sales) will be much less than the number of weak signals (clicks).
This unbalance can make the life harder to the training algorithm, as each relevant signal can be covered by many more weakly relevant ones.
Let’s see how we can manipulate the dataset to partially mitigate this problem :

Collect more data

This sounds simple, but collecting more data is generally likely to help.
Of course there are domain when collecting more data is not actually beneficial ( for example when the market change quite dinamically and the previous years dataset becomes almost irrelevant for predicting the current behaviours ) .

Resample the dataset

You can manipulate the data you collected to have more balanced data.
This change is called sampling your dataset and there are two main methods that you can use to even-up the classes: Oversampling and Undersampling[5].
You can add copies of instances from the under-represented relevancy class, this is called over-sampling, or
you can delete instances from the over-represented class, this technique is called under-sampling.
These approaches are often very easy to implement and fast to run. They are an excellent starting point.
Some ideas and suggestion :
  • Consider testing under-sampling when you have an a lot data (tens- or hundreds of thousands of instances or more), you can undersample randomly or following any business logic available
  • Consider testing different resampled ratios (it is not necessary to target a 1:1 ratio)
  • When oversampling consider advanced approaches, instead of simple duplication of already existent samples, could be good to artificially generate new ones [6]
Be careful, resampling is not always helping your training algorithm, so experiment in details your use case.
War Story 2 : Oversampling by duplication
Given a dataset highly unbalanced, the model trained was struggling to predict accurately the desired relevancy class for document-query test samples.
Oversampling was a tempting approach, so here we go!
 
As I am using cross validation, the first approach has been to oversample the dataset by duplication.
I took each relevancy class and duplicate the samples until I built a balanced dataset.
Then started the training in Cross Validation and I trained a model which was immense and almost perfectly able to predict the relevancy of validation and test samples.
Cool ! I got it !
Actually was not an amazing result at all, because of course applying cross validation on an oversampled dataset built validation and test sets oversampled as well.
This means that it was really likely that a sample in the training set was appearing exactly the same in the validation set and in the test set.
The resulting model was basically highly overfitted and not that good to predict unknown test sets.
 
So I moved to a manual training set- validation set – test set split and oversampled only the training set.
This was definitely better and built a model that was much more suitable.
It was not able to perfectly predict validation and test sets but this was a good point as the model was able to predict unknown data sets better.
 
Then I trained again, this time the original dataset, manually split as before but not oversampled.
The resulting model was actually better than the oversampled one.
One of the possible reasons is that the training algorithm and model I was using (LambdaMART) didn’t get any specific help from the resampling, actually the model lost the capability of discovering which samples were converting better ( strong relevant signals : weak relevant signals ratio).
Practically I favoured the volume over the conversion ratio, increasing the recall but losing the precision of the ranker.
 
Conclusion : Experiment, evaluate the approach with your algorithm, compare, don’t assume it is going to be better without checking

Query Id hashing

As we have seen in the initial part of the blog, each sample is a document-query pair, represented in a vectorial format.
The query is represented by an Id, this Id is used to group samples for the same query, and evaluate the ranker performance over each samples group.
This can give us an evaluation of how good the ranker is performing on average on all the queries of interest.
This brings to carefully decide how we generate the query identifier.
If we generate a too specific hash, we risk to build small groups of samples, this small groups can get an high score when ranking them, biased by their small size.
Extreme case
e.g.
Really specific hash, brings many groups to be 1 sample groups.
This brings up the evaluation metric score, as we are averaging and a lot of groups, being of size 1, are perfectly easy to rank.
If we generate an hash that is not specific enough we can end up in immense groups, not that helpful to evaluate our ranking model on the different real world scenarios.
The ideal scenario is to have one query Id per query category of interest, with a good number of samples related, this would be the perfect dataset, in this way we can validate both :
  • the intra-group relevancy ( as the group are composed by sufficient samples)
  • the average across the queries ( as we have a valid set of different queries available)
The query category could depend on a set of Query Dependent Features, this means that we can calculate the hash using the values of these features.
Being careful we maintain a balance between the group sizes and the granularity of the hash.
It is important to have the query categories across the training/validation/test set :
e.g.
We have 3 different query categories , based on the value of a user dependent feature ( user_age_segment) .
These age segments represents three very different market segments, that require very different ranking models.
When building our training set we want enough samples for each category and we want them to be split across the training/validation/test sets to be able to validate how good we are in predicting the different market segment.
This can potentially drive to build separate models and separate data sets if it is the case.