Solr Is Learning To Rank Better – Part 1 – Data Collection
Introduction
This blog post is about the journey necessary to bring Learning To Rank to the Apache Solr search engine.
This series of blog posts explores the full journey, from collecting the signals through to re-ranking in Apache Solr.
Requirement : minimal knowledge of and interest in Machine Learning.
Special thanks to Diego Ceccarelli, who helped me a lot in 2015 by introducing me to this amazing topic.
Second special thanks to Juan Luis Andrades, for the passion we shared during this journey, and to Jesse McLaughlin for the careful technical Java insights.
Final special thanks to David Bunbury, who contributed interest, passion and very thoughtful ideas to the cause.
Collecting Signals
- Implicit : signals inferred from user behaviour ( e.g. clicks, add to basket, purchases )
- Explicit : relevance judgements gathered directly from users or domain experts ( e.g. ratings )
Training Set
3 qid:1 1:1 2:1 3:0 # 1A

- 3 : how relevant the document is for the query
- qid:1 : identifies the query
- 1:1 2:1 3:0 : the document, represented as a vector of numerical features
- # 1A : a comment, to make your training data more readable
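A minimal sketch in Python of how a judgement could be serialised into this format ( the to_ranklib_line helper is a hypothetical name ) :

def to_ranklib_line(relevance, query_id, features, comment=""):
    # features is a dict {feature_id: value}; ids are emitted in ascending order
    feature_part = " ".join(f"{fid}:{features[fid]}" for fid in sorted(features))
    line = f"{relevance} qid:{query_id} {feature_part}"
    return f"{line} # {comment}" if comment else line

# Relevance 3 for query 1, document "1A"
print(to_ranklib_line(3, 1, {1: 1, 2: 1, 3: 0}, "1A"))
# -> 3 qid:1 1:1 2:1 3:0 # 1A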
Feature Engineering
- Is the document (product) of the same brand as the one expressed in the query ?
- Date selected ( e.g. when searching in a Hotel Booking system )
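For example, a brand-match feature could be sketched like this ( the brand field names are assumptions, adapt them to your schema ) :

def brand_match_feature(query, document):
    # 1 if the brand expressed in the query matches the document brand, else 0
    query_brand = query.get("brand")
    doc_brand = document.get("brand")
    if query_brand is None or doc_brand is None:
        return 0
    return int(query_brand.lower() == doc_brand.lower())

print(brand_match_feature({"brand": "Acme"}, {"brand": "ACME"}))  # -> 1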
One Hot Encoding
When a categorical feature describes your document, it is tempting to represent each category as an integer id. A better approach is to model each category as a separate binary feature ( one hot encoding ), reducing the cardinality of the feature if needed :
- with a business-driven white list / black list approach
- keeping only the top occurring values
- keeping only the values occurring more than a threshold
- encoding the rest as a special feature : colour_misc
- hashing the distinct values into a reduced set of hashes
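A minimal sketch of one hot encoding that keeps only the top occurring values and collapses the rest into a misc bucket ( the one_hot_top_k helper is a hypothetical name ) :

from collections import Counter

def one_hot_top_k(values, k=3):
    # Keep the k most frequent categories, encode everything else as 'misc'
    top = {v for v, _ in Counter(values).most_common(k)}
    categories = sorted(top) + ["misc"]
    index = {c: i for i, c in enumerate(categories)}
    vectors = []
    for v in values:
        vector = [0] * len(categories)
        vector[index[v if v in top else "misc"]] = 1
        vectors.append(vector)
    return categories, vectors

colours = ["red", "red", "blue", "green", "blue", "orange", "red"]
names, encoded = one_hot_top_k(colours, k=3)
print(names)       # e.g. ['blue', 'green', 'red', 'misc']
print(encoded[5])  # 'orange' falls into the 'misc' bucket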
Feature Normalization
- Linear Normalization ( min/max based)
- Sum Normalization ( based on the sum of all the values of the feature )
- Z Score ( based on the mean/standard deviation of the feature )
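Minimal sketches of the three normalisations above, assuming non-degenerate inputs ( the helper names are illustrative ) :

import statistics

def min_max(values):
    # Linear normalisation into [0, 1]; assumes max > min
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def sum_norm(values):
    # Normalisation by the sum of all the values; assumes a non-zero sum
    total = sum(values)
    return [v / total for v in values]

def z_score(values):
    # Normalisation by mean / standard deviation; assumes a non-zero deviation
    mean, stdev = statistics.mean(values), statistics.stdev(values)
    return [(v - mean) / stdev for v in values]

prices = [10.0, 20.0, 40.0, 130.0]
print(min_max(prices))   # scaled into [0, 1]
print(sum_norm(prices))  # values summing to 1
print(z_score(prices))   # zero mean, unit standard deviation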
Feature Value Quantisation
Note : extra care must be taken if following this approach.
Adding an artificial rounding to the data can be dangerous : setting up hard thresholds can basically compromise the feature itself.
It is always better if the algorithm decides the thresholds with freedom.
Where possible it is suggested not to quantise, or to quantise only after a deep statistical analysis of our data.
If the target is to simplify the model for any reason, it is possible to evaluate fewer threshold candidates at training time, depending on the training algorithm.
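If, after such an analysis, you still decide to quantise, a minimal sketch with explicit thresholds could look like this ( the thresholds are purely illustrative ) :

def quantise(values, thresholds):
    # Map each value to the index of the first threshold it does not exceed
    def bucket(value):
        for i, threshold in enumerate(thresholds):
            if value <= threshold:
                return i
        return len(thresholds)
    return [bucket(v) for v in values]

prices = [3.0, 12.5, 47.0, 210.0]
print(quantise(prices, [10, 50, 100]))  # -> [0, 1, 1, 3]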
Missing Values
Furthermore, where possible we can help the algorithm by avoiding a sparse representation and setting the missing feature to the average value of that feature across the different samples.
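A minimal sketch of this average-based imputation ( the impute_with_mean helper and the use of None as a missing-value marker are illustrative choices ) :

def impute_with_mean(rows, feature_id):
    # rows is a list of {feature_id: value} dicts; None marks a missing value
    observed = [r[feature_id] for r in rows if r.get(feature_id) is not None]
    mean = sum(observed) / len(observed)  # assumes at least one observed value
    for r in rows:
        if r.get(feature_id) is None:
            r[feature_id] = mean
    return rows

samples = [{1: 4.0}, {1: None}, {1: 6.0}]
print(impute_with_mean(samples, 1))  # the missing value becomes 5.0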
Outliers
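One common way to limit the impact of outlier values, sketched here as one option among many, is to clip each feature to percentile boundaries before normalising ( the clip_to_percentiles helper and the chosen percentiles are illustrative ) :

def clip_to_percentiles(values, low_pct=10, high_pct=90):
    # Cap extreme values at the chosen nearest-rank percentile boundaries
    ordered = sorted(values)
    def percentile(p):
        idx = round(p / 100 * (len(ordered) - 1))
        return ordered[min(len(ordered) - 1, max(0, idx))]
    lo, hi = percentile(low_pct), percentile(high_pct)
    return [min(max(v, lo), hi) for v in values]

values = [1.0, 2.0, 2.5, 3.0, 1000.0]  # 1000.0 is an outlier
print(clip_to_percentiles(values, low_pct=10, high_pct=80))  # 1000.0 capped to 3.0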
Feature Definition
- Keep it simple : start from a limited set of features which are really fundamental to describe your problem. The model produced will be really poor, but at least you have a baseline.
- Iteratively train the model : remove or add one feature at each execution. This is time-consuming, but will allow you to identify clearly which features really matter ( see the ablation sketch after this list ).
- Visualise important features : after you have trained the model, use a visualisation tool to verify which features appear the most.
- Meet the business : hold meetings with the business to compare what they would expect to see re-ranked with what the model actually re-ranks. When there is discordance, have the humans explain why; this should help identify missing features or features that were used with the wrong meaning.
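A minimal sketch of the iterative approach ( train_and_evaluate is a hypothetical callback that trains a model on a feature subset and returns a quality metric such as NDCG@10, higher being better ) :

def ablation_study(feature_ids, train_and_evaluate):
    # Measure how much the evaluation metric drops when each feature is removed
    baseline = train_and_evaluate(feature_ids)
    impacts = {}
    for fid in feature_ids:
        reduced = [f for f in feature_ids if f != fid]
        impacts[fid] = baseline - train_and_evaluate(reduced)
    # Features whose removal hurts the metric the most matter the most
    return sorted(impacts.items(), key=lambda kv: kv[1], reverse=True)

# Toy usage with a fake evaluator, just to show the shape of the loop
fake_scores = {frozenset([1, 2, 3]): 0.80, frozenset([2, 3]): 0.60,
               frozenset([1, 3]): 0.78, frozenset([1, 2]): 0.79}
print(ablation_study([1, 2, 3], lambda fs: fake_scores[frozenset(fs)]))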
Data Preparation
We have carefully designed our vectorial representation of the domain documents, we have identified the source of our signals and we have built our training set.
So far, so good…
But the model is still performing really poorly.
In this scenario the reasons can be countless :
- Poor signal quality (noise)
- Incomplete feature vector representation
- Non-uniform distribution of relevant / not relevant documents across queries
Let’s explore some guidelines to overcome some of these difficulties :
Noise Removal
Unbalanced Dataset
Collect more data
Resample the dataset
- Consider testing under-sampling when you have a lot of data (tens or hundreds of thousands of instances or more) : you can under-sample randomly or following any business logic available
- Consider testing different resampling ratios (it is not necessary to target a 1:1 ratio)
- When over-sampling, consider advanced approaches : instead of simply duplicating existing samples, it can be good to artificially generate new ones
Be careful : resampling does not always help your training algorithm, so experiment in detail with your use case.
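A minimal random under-sampling sketch ( the undersample helper and the toy labels are illustrative; the ratio parameter reflects the point above about not forcing a 1:1 ratio ) :

import random

def undersample(samples, label_of, ratio=1.0, seed=42):
    # Randomly drop negatives until negatives ~= ratio * positives
    positives = [s for s in samples if label_of(s) > 0]
    negatives = [s for s in samples if label_of(s) == 0]
    random.Random(seed).shuffle(negatives)
    return positives + negatives[:int(len(positives) * ratio)]

# Toy dataset : (relevance, doc_id) tuples, mostly not relevant
data = [(0, d) for d in range(98)] + [(3, 98), (2, 99)]
balanced = undersample(data, label_of=lambda s: s[0], ratio=2.0)
print(len(balanced))  # 2 positives + 4 sampled negatives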
Query Id hashing
Hashing all the query level features into a synthetic query id groups the training samples per query, preserving :
- the intra-group relevancy ( as each group is composed of sufficient samples )
- the average across the queries ( as we have a valid set of different queries available )
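A minimal sketch of hashing the query level features into a query id ( md5 and the truncation to 8 hex digits are illustrative choices; any stable hash works ) :

import hashlib

def query_id(query_level_features):
    # Hash all the query level features into a stable, compact query id,
    # so that identical queries group under the same qid
    key = "|".join(str(f) for f in query_level_features)
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest[:8], 16)

# e.g. free text + department as the query level features
print(query_id(["shower curtains", "showers"]))
print(query_id(["curtains", "bedding"]))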
Shameless plug for our training and services!
Did I mention we do Apache Solr Beginner and Learning to Rank training?
We also provide consulting on these topics, get in touch if you want to bring your search engine to the next level!
Author
Alessandro Benedetti
Alessandro Benedetti is the founder of Sease Ltd. and a Senior Search Software Engineer; his focus is on R&D in information retrieval, information extraction, natural language processing, and machine learning.
Comments
Aman Tandon
June 13, 2018
This is the most informative article I have read on the topic. Thanks for the detailed explanation of the problems & solutions in data collection.
Could you please explain more about how to hash the query ? With my e-commerce client we have various departments ( bedding, showers, etc. ) and we get queries like "curtains" and "shower curtains". In such cases, how should I hash the queries ?
Thanks!
Alessandro Benedetti
June 13, 2018
Hi Aman,
thanks for your kind feedback!
If your query is just composed of a free text component and a categorical query level feature (the department), you should use those two strings as part of the hashing process.
You generally need to hash all the query level features associated with the query.