This blog post is about the journey necessary to bring Learning to Rank to Apache Solr search engines.
This series of blog posts explores the full journey, from signals collection through to re-ranking in Apache Solr.
Requirement: minimal knowledge of, and interest in, Machine Learning
Special thanks to Diego Ceccarelli, who helped me a lot in 2015 by introducing me to this amazing topic.
Second special thanks to Juan Luis Andrades, for the passion we shared during this journey, and to Jesse McLaughlin for the careful technical Java insights.
Final special thanks to David Bunbury, who contributed interest, passion and very thoughtful ideas to the cause.
3 qid:1 1:1 2:1 3:0 # 1A
- 3 : the relevance label, i.e. how relevant the document is for the query
- qid:1 : identifies the query
- 1:1 2:1 3:0 : the document, represented as a vector of numerical features
- # 1A : a comment, to make your training data more readable
Examples of features:
- Is the document (product) of the same brand as expressed in the query?
- Date selected (e.g. when searching in a hotel booking system)
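As a sketch, a training sample in this format can be parsed with a few lines of Python (the function name and return shape are illustrative, not part of any Solr or RankLib API):

```python
def parse_ranklib_line(line):
    """Parse one training sample in the RankLib/LibSVM ranking format."""
    body, _, comment = line.partition("#")
    tokens = body.split()
    relevance = int(tokens[0])            # graded relevance label
    qid = tokens[1].split(":")[1]         # query identifier
    features = {}                         # feature id -> feature value
    for token in tokens[2:]:
        fid, value = token.split(":")
        features[int(fid)] = float(value)
    return relevance, qid, features, comment.strip()

# e.g. parse_ranklib_line("3 qid:1 1:1 2:1 3:0 # 1A")
# returns (3, "1", {1: 1.0, 2: 1.0, 3: 0.0}, "1A")
```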
One Hot Encoding
When a categorical feature describes your document, it is tempting to represent each category as an integer id. However, an integer id imposes an artificial ordering among categories that does not exist in reality; One Hot Encoding avoids this by representing each category as its own binary feature. If the feature has many distinct values, it is advisable to reduce them first:
- with a business-driven white list / black list approach
- keeping only the top occurring values
- keeping only the values occurring more than a threshold
- encoding the rest as a special feature: colour_misc
- hashing the distinct values into a reduced set of hashes
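A minimal sketch of the encoding, assuming the top occurring values have already been selected (the function name and the misc bucket are illustrative):

```python
def one_hot(value, vocabulary):
    """Encode a categorical value as a binary vector: one slot per kept
    category, plus a final slot acting as the colour_misc-style bucket."""
    vector = [0] * (len(vocabulary) + 1)
    if value in vocabulary:
        vector[vocabulary.index(value)] = 1
    else:
        vector[-1] = 1  # rare / unseen values fall into the misc bucket
    return vector

top_colours = ["red", "blue", "green"]   # e.g. the top occurring values
# one_hot("blue", top_colours) -> [0, 1, 0, 0]
# one_hot("puce", top_colours) -> [0, 0, 0, 1]
```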
- Linear Normalization (min/max based)
- Sum Normalization (based on the sum of all the values of the feature)
- Z Score (based on the mean/standard deviation of the feature)
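The three normalization approaches can be sketched in plain Python (illustrative helper names; a real pipeline would typically use a library such as scikit-learn):

```python
def min_max(values):
    """Linear normalization: rescale into [0, 1] using min and max."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def sum_norm(values):
    """Sum normalization: divide each value by the sum of all values."""
    total = sum(values)
    return [v / total for v in values]

def z_score(values):
    """Z score: center on the mean, scale by the standard deviation."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std for v in values]

# e.g. min_max([0, 5, 10])  -> [0.0, 0.5, 1.0]
#      sum_norm([1, 1, 2])  -> [0.25, 0.25, 0.5]
```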
Feature Value Quantisation
Note: extra care must be taken if following this approach.
Adding an artificial rounding to the data can be dangerous: by setting up hard thresholds we can effectively compromise the feature itself.
It is always better to let the algorithm decide the thresholds freely.
Where possible, it is suggested not to quantise, or to quantise only after a deep statistical analysis of the data.
If the target is to simplify the model, it is possible to evaluate fewer threshold candidates at training time, depending on the training algorithm.
Furthermore, where possible we can help the algorithm by avoiding the sparse representation and setting each missing feature to the average value of that feature across the samples.
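The mean-imputation idea for missing features can be sketched as follows, assuming each sample is a dictionary mapping feature id to value (the names are illustrative):

```python
def impute_with_mean(samples, feature_id):
    """Fill in a missing feature with the average of that feature
    across the samples where it is present."""
    present = [s[feature_id] for s in samples if feature_id in s]
    mean = sum(present) / len(present)
    for s in samples:
        s.setdefault(feature_id, mean)   # only fills the missing ones
    return samples

samples = [{1: 2.0}, {1: 4.0}, {2: 1.0}]  # third sample misses feature 1
impute_with_mean(samples, 1)              # third sample now has 1 -> 3.0
```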
- Keep it simple: start from a limited set of features which are really fundamental to describe your problem. The model produced will be quite poor, but at least you have a baseline.
- Iteratively train the model, removing or adding one feature per execution. This is time-consuming but will let you clearly identify which features really matter.
- Visualise important features: after training the model, use a visualisation tool to verify which features appear most often.
- Meet the business: have meetings with the business to compare what they would expect to see re-ranked with what the model actually re-ranks. Where there is discordance, have the humans explain why; this should help identify missing features, or features that were used with the wrong meaning.
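The iterative feature evaluation above can be sketched as a leave-one-out ablation loop; here `train_and_evaluate` is a hypothetical callback that trains a model on the given features and returns a quality score (e.g. NDCG on a validation set):

```python
def ablation_study(feature_ids, train_and_evaluate):
    """Score the full feature set, then each leave-one-out subset.
    Features whose removal barely hurts the score matter little."""
    baseline = train_and_evaluate(feature_ids)
    impact = {}
    for fid in feature_ids:
        subset = [f for f in feature_ids if f != fid]
        impact[fid] = baseline - train_and_evaluate(subset)
    return impact
```

Running it repeatedly, dropping the lowest-impact feature each time, gives the iterative procedure described above at the cost of one training run per feature per round.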
We have carefully designed our vectorial representation of the domain documents, identified the source of our signals and built our training set.
So far, so good…
But the model is still performing really poorly.
In this scenario the reasons can be countless:
- Poor signal quality (noise)
- Incomplete feature vector representation
- Non-uniform distribution of relevant / not relevant documents across queries
Let’s explore some guidelines to overcome some of these difficulties:
Collect more data
Resample the dataset
- Consider testing under-sampling when you have a lot of data (tens or hundreds of thousands of instances or more); you can undersample randomly or follow any business logic available
- Consider testing different resampling ratios (it is not necessary to target a 1:1 ratio)
- When oversampling, consider advanced approaches: instead of simply duplicating already existing samples, it can be better to artificially generate new ones
Be careful: resampling does not always help your training algorithm, so experiment in detail with your use case.
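A sketch of random under-sampling with a configurable minority:majority target ratio (the function name and the `label_of` accessor are illustrative):

```python
import random

def undersample(samples, label_of, majority_label, ratio=1.0, seed=42):
    """Randomly drop majority-class samples until the requested
    minority:majority ratio is reached (1.0 means balanced, but a
    1:1 ratio is not mandatory)."""
    rng = random.Random(seed)              # seeded for reproducibility
    majority = [s for s in samples if label_of(s) == majority_label]
    minority = [s for s in samples if label_of(s) != majority_label]
    keep = min(len(majority), int(len(minority) / ratio))
    return minority + rng.sample(majority, keep)

# e.g. with 10 "neg" and 3 "pos" samples, ratio=1.0 keeps 3 of each
```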
Query Id hashing
Hashing the query-level features into a single query id groups the training samples per query. With well-formed groups, the ranking metrics used at training time can reliably measure:
- the intra-group relevancy (as the groups are composed of sufficient samples)
- the average across the queries (as we have a valid set of different queries available)
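A sketch of the idea: hash the query-level features (query text plus any filters) into a stable query id, so that identical query/filter combinations always fall into the same group (the helper is illustrative, not a Solr API):

```python
import hashlib

def query_id(query_text, filters=()):
    """Derive a stable query id from the query-level features, so the
    same query + filters combination always maps to the same group."""
    key = query_text.lower().strip() + "|" + "|".join(sorted(filters))
    return hashlib.md5(key.encode("utf-8")).hexdigest()[:8]

# identical query-level features -> identical qid, regardless of order
# query_id("red shoes", ("brand:acme", "size:42")) ==
# query_id("red shoes", ("size:42", "brand:acme"))
```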
Subscribe to our newsletter
Did you like this post about Learning to Rank in Apache Solr? Don’t forget to subscribe to our newsletter to stay up to date with the Information Retrieval world!