This blog post is about the journey required to bring Learning to Rank to the Apache Solr search engine.
This series of blog posts explores the full journey, from signals collection through re-ranking in Apache Solr.
Requirement: minimal knowledge of, and interest in, Machine Learning.
Special thanks to Diego Ceccarelli, who helped me a lot in 2015 by introducing me to this amazing topic.
Second special thanks to Juan Luis Andrades, for the passion we shared during this journey, and to Jesse McLaughlin for his careful technical Java insights.
Final special thanks to David Bunbury who contributed with interest, passion and very thoughtful ideas to the cause.
Collecting Signals
IMPLICIT TRAINING SET
EXPLICIT TRAINING SET
The Training Set
The training set can take different forms; for the sake of this post we will focus on an algorithm that internally uses a pairwise/listwise approach.
3 qid:1 1:1 2:1 3:0 # 1A

- 3 : how relevant the document is for the query
- qid:1 : identifies the query
- 1:1 2:1 3:0 : the document, represented as a vector of numerical features
- # 1A : a comment, to make your training data more readable
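As a sketch, a line in this format can be produced with a small helper function (the function name and the feature ids below are illustrative, not part of any library):

```python
# Sketch: serializing one <query, document> judgement into the
# "label qid:N id:value ... # comment" training format shown above.

def to_training_line(relevance, query_id, features, comment=""):
    """Encode one training sample as a single text line.

    relevance : graded relevance label (e.g. 0-4)
    query_id  : integer grouping all samples of the same query
    features  : dict mapping 1-based feature id -> numeric value
    comment   : optional human-readable note
    """
    feats = " ".join(f"{fid}:{val}" for fid, val in sorted(features.items()))
    line = f"{relevance} qid:{query_id} {feats}"
    if comment:
        line += f" # {comment}"
    return line

print(to_training_line(3, 1, {1: 1, 2: 1, 3: 0}, "1A"))
# -> 3 qid:1 1:1 2:1 3:0 # 1A
```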
Feature Engineering
Document Features (query independent)
Query Dependent Features
Is the document (product) of the same brand as expressed in the query?
Query Level Features
Date Selected (e.g. when searching in a Hotel Booking system)
User Dependent Features
Ordinal Features
Categorical Features
Binary Features
One Hot Encoding
When a categorical feature describes your Document, it is tempting to represent each category as an Integer Id (e.g. colour:1 for red, colour:2 for green). This implies an ordering between the categories that does not actually exist; a better approach is to represent each category as a separate binary feature:
... colour_red:0 colour_green:0 colour_blue:1 ...
One Hot Encoding is useful to properly model your information, but take care of the cardinality of your categorical feature, as it will be reflected in the number of final features that describe your signal.
War Story 1: High Cardinality Categorical Feature
- keeping only a business-driven white list / black list of values
- keeping only the top occurring values
- keeping only the values occurring more than a threshold
- encoding the rest as a special feature: colour_misc
- hashing the distinct values into a reduced set of buckets
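The hashing option from the list above can be sketched as follows (the bucket count and the use of MD5 are illustrative assumptions; any stable hash works):

```python
# Sketch of the hashing approach for high-cardinality categorical features:
# map each distinct value into a fixed, reduced set of feature buckets.
import hashlib

def hash_bucket(value, num_buckets=16):
    """Deterministically map a categorical value to one of num_buckets features."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

bucket = hash_bucket("magenta")  # always the same bucket for the same value
```

Distinct values may collide in the same bucket; that is the price paid for bounding the feature space.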
Feature Normalization
- Linear Normalization (min/max based)
- Sum Normalization (based on the sum of all the values of the feature)
- Z Score (based on the mean/standard deviation of the feature)
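The three normalization strategies above can be sketched in plain Python as follows (assuming non-degenerate inputs, as noted in the comments):

```python
# Sketch of the three normalization strategies, each applied to the raw
# values of a single feature across all samples.

def linear_norm(values):
    """Min/max normalization into [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]  # assumes hi != lo

def sum_norm(values):
    """Each value divided by the sum of all values of the feature."""
    total = sum(values)
    return [v / total for v in values]  # assumes total != 0

def z_score(values):
    """Center on the mean, scale by the standard deviation."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std for v in values]  # assumes std != 0
```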
Feature Value Quantization
Note: extra care must be taken when following this approach.
The reason is that adding an artificial rounding to the data can be dangerous: by setting up hard thresholds we can compromise the feature itself.
It is always better to let the algorithm decide the thresholds freely.
Where possible, do not quantize, or quantize only after a deep statistical analysis of your data.
If the target is to simplify the model, it is possible to evaluate fewer threshold candidates at training time, depending on the training algorithm.
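If quantization is applied anyway, a minimal sketch could look like this (the bin width is an arbitrary illustrative choice, and per the caveats above it should be picked only after analysing the data):

```python
# Sketch of feature value quantization: snap a continuous value
# to the nearest multiple of a fixed bin width.

def quantize(value, bin_width=0.25):
    """Round a continuous feature value to the nearest bin boundary."""
    return round(value / bin_width) * bin_width

print(quantize(0.37))  # -> 0.25
```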
Missing Values
Furthermore, where possible we can help the algorithm by avoiding a sparse representation and setting the missing feature to the average of that feature across the different samples.
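A minimal sketch of this mean-imputation idea, assuming samples are dicts and missing values appear as None:

```python
# Sketch: fill a missing feature value with the feature's average
# across the samples where it was observed.

def impute_mean(samples, feature):
    """Replace None values of `feature` with the mean of the observed values."""
    observed = [s[feature] for s in samples if s[feature] is not None]
    mean = sum(observed) / len(observed)  # assumes at least one observed value
    for s in samples:
        if s[feature] is None:
            s[feature] = mean
    return samples

data = [{"price": 10.0}, {"price": None}, {"price": 20.0}]
print(impute_mean(data, "price"))
# -> [{'price': 10.0}, {'price': 15.0}, {'price': 20.0}]
```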
Outliers
Feature Definition
- Keep it simple: start from a limited set of features that are fundamental to describe your problem; the model produced will be really poor, but at least you have a baseline.
- Iteratively train the model: remove or add a feature at each execution. This is time-consuming, but it will allow you to identify clearly which features really matter.
- Visualize Important Features: after you have trained the model, use a visualisation tool to verify which features appear most often.
- Meet the Business: have meetings with the business to compare what they would expect to see re-ranked with what the model re-ranks. When there is discordance, have the humans explain why; this should help identify missing features or features that were used with the wrong meaning.
Data Preparation
We have carefully designed our vectorial representation of the domain documents, identified the source of our signals and built our training set.
So far, so good…
However, the model is still performing poorly.
In this scenario, the reasons can be countless:
- Poor signal quality (noise)
- Incomplete feature vector representation
- Non-uniform distribution of relevant/not relevant documents per query
Let’s explore some guidelines to overcome some of these difficulties:
Noise Removal
Unbalanced Dataset
Collect more data
Resample the dataset
You can manipulate the data you collected to have more balanced data.
- Consider testing under-sampling when you have a lot of data (tens or hundreds of thousands of instances or more); you can under-sample randomly or follow any business logic available.
- Consider testing different resampling ratios (it is not necessary to target a 1:1 ratio).
- When oversampling, consider advanced approaches: instead of simply duplicating existing samples, it can be better to artificially generate new ones.
Be careful: resampling does not always help your training algorithm, so experiment in detail with your use case.
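Random under-sampling of the majority class can be sketched as follows (the label key, target ratio and seed are illustrative assumptions):

```python
# Sketch of random under-sampling: keep all minority-class samples and
# a random subset of the majority class, down to `ratio` x minority size.
import random

def undersample(samples, label_key="relevant", ratio=1.0, seed=42):
    """Return a more balanced dataset by downsampling the majority class."""
    pos = [s for s in samples if s[label_key]]
    neg = [s for s in samples if not s[label_key]]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    rng = random.Random(seed)
    kept = rng.sample(majority, min(len(majority), int(len(minority) * ratio)))
    return minority + kept
```

As noted above, a 1:1 ratio is not mandatory; passing e.g. `ratio=2.0` keeps twice as many majority samples as minority ones.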
War Story 2: Oversampling by duplication
Query Id hashing
- the intra-group relevancy (as the groups are composed of sufficient samples).
- the average across the queries (as we have a valid set of different queries available).
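Query Id hashing can be sketched as follows: the query-level features are concatenated in a stable order and hashed, so identical queries always end up in the same group (the feature names and the MD5/modulo choice are illustrative assumptions):

```python
# Sketch of query id hashing: derive a stable qid from all the
# query-level features, so equal queries share one sample group.
import hashlib

def query_id(query_level_features):
    """Derive a stable qid from the ordered query-level feature values."""
    key = "|".join(f"{k}={query_level_features[k]}"
                   for k in sorted(query_level_features))
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16) % (10 ** 8)

qid_a = query_id({"free_text": "shower curtains", "department": "showers"})
qid_b = query_id({"free_text": "shower curtains", "department": "showers"})
assert qid_a == qid_b  # same query-level features -> same group
```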
3 Responses
This is one of the most informative articles I have ever read. Thanks for the detailed explanation of the problems & solutions in data collection.
Could you please explain more about how to hash the query? With my e-commerce client we have various departments (bedding, showers, etc.) and we get queries like "curtains" and "shower curtains". In such a case, how should I hash the queries?
Thanks!
Hi Aman,
thanks for your kind feedback!
If your query is just composed of a free-text component and a categorical query-level feature (department), you should use those two strings as part of the hashing process.
You generally need to hash all the query-level features associated with the query.