A Learning to Rank Project on a Daily Song Ranking Problem – Part 4

Did you really think it was over? Good news!
The Learning to Rank project on a Daily Song Ranking problem continues with a fourth episode. If you are interested in integrating Machine Learning with Search, in this blog post we will keep exploring Learning to Rank techniques on a top-K song retrieval by country and date problem.

As always, in order to have the overall picture, I suggest you take a look at the previous episodes, where you will learn:
Part 1 – how to set up and build a Learning to Rank (LTR) system starting from available data and using open source libraries
Part 2 – how to explain the behaviors of the models through the use of the powerful library called SHAP
Part 3 – how to create the query ID and be careful when splitting the training and test sets

In this blog post, I will try to answer the following questions:

    1. Does it make sense to remove the query-level features after using them to create the query Id?
    2. What happens if I drop the under-sampled queries?
    3. What happens if I keep the under-sampled queries, grouping them under a less fine-grained query Id?

Let’s get started!

As already explained in the previous blog post, each row in the training set is a query-document pair; it is composed of the query-level, the document-level, and potentially the query-document level features. We specify the “query_ID” as the unique query Id for each query, by concatenating all the query-level features (if possible).

In our project, the query-level features are Region, Day, Month, and Weekday. We used them to create the query Id hashing (query_ID), as shown in these tables:
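As a rough sketch of how such a hashing can be obtained with pandas (an illustrative snippet, not necessarily the exact code used in the project; the small data frame below just stands in for the Spotify daily ranking dataset), the query-level columns can be concatenated into a single key and then factorized into an integer query Id:

import pandas as pd

# Illustrative data frame standing in for the Spotify daily ranking dataset
dataset = pd.DataFrame({
    'Region': ['it', 'it', 'nz'],
    'Day': [1, 1, 2],
    'Month': [1, 1, 1],
    'Weekday': [0, 0, 1],
    'Streams': [1000, 900, 500],
})

# Concatenate the query-level features into a single key and map each
# distinct key to an integer query Id
query_level_features = ['Region', 'Day', 'Month', 'Weekday']
key = dataset[query_level_features].astype(str).agg('_'.join, axis=1)
dataset['query_ID'] = pd.factorize(key)[0]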

It is natural to think that we could remove the query-level features, since that information is already encoded in the query Id.
Each query Id has its own ranked list of documents, all sharing the same values for the query-level features; so what is the reason for keeping them?

On the other hand, by removing all the query-level features, at re-ranking time the model will rely only on the document-level and query-document level features to decide how to re-order the documents; is that correct, or are we underestimating the contribution of the query-level features? Can we really remove them as superfluous?

We trained a Learning to Rank model using XGBoost (LambdaMART) and ran two different tests (a minimal training sketch follows the list):

    • dropping all the query-level features (13 features in total) -> DROP
    • keeping all the query-level features (17 features in total) -> KEEP
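For reference, here is a minimal training sketch with the XGBoost ranking objective; the data, the parameter values and the number of boosting rounds are purely illustrative and do not reflect the project’s actual configuration:

import numpy as np
import xgboost as xgb

# Placeholder data: in the project, the matrix holds the query-document pairs
# (17 features in the KEEP scenario) and 'group_sizes' lists the number of
# documents belonging to each query_ID, in order
X_train = np.random.rand(100, 17)
y_train = np.random.randint(0, 5, size=100)
group_sizes = [25, 25, 25, 25]

train_dmatrix = xgb.DMatrix(X_train, label=y_train)
train_dmatrix.set_group(group_sizes)

params = {
    'objective': 'rank:ndcg',   # LambdaMART-style ranking objective
    'eval_metric': 'ndcg@10',
    'eta': 0.1,
    'max_depth': 6,
}
model = xgb.train(params, train_dmatrix, num_boost_round=100,
                  evals=[(train_dmatrix, 'train')])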

Here are the results obtained (Table 1):

We can see that there isn’t a big difference, but keeping the query-level features seems to yield slightly better performance.

For a better comparison, we tested both models on the same test sets; to do this, the features must match in both number and name between the training and test sets, otherwise we get the error message: ‘ValueError: feature_names mismatch’.

So in this case, instead of removing the query-level features, we kept them and filled them with NaN values; here is an example:
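A minimal sketch of this trick in pandas (an illustrative snippet, not necessarily the project’s exact code; ‘train_set’ is a placeholder for the encoded training data frame):

import numpy as np

# 'train_set' is a placeholder for the project's encoded training data frame.
# Keep the query-level columns but blank out their values, so that the DROP
# model exposes the same 17 feature names as the KEEP model.
query_level_features = ['Region', 'Day', 'Month', 'Weekday']
train_set_drop = train_set.copy()
train_set_drop[query_level_features] = np.nan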

Only then can the model be trained. The results obtained are as follows (Table 2):

Removing the query-level features (DROP – Table 1) or filling them with NaN values makes absolutely no difference. We can see from the table above that, whether or not we provide them in the test set, they are not considered at all; in fact, the result always remains the same.

On the other hand, we can notice that if we keep the query-level features at training time but do not provide them at test time, the eval-ndcg@10 is negatively affected, which confirms their usefulness (Table 3):

One reason might be that keeping the query-level features gives the model the ability to adjust the weights of the other features, influencing how the forest of trees is built.
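A quick way to sanity-check this intuition is to look at the gain-based feature importance of the trained booster; a sketch, assuming ‘model’ is the Booster trained in the earlier snippet:

# Gain-based importance: how much each feature contributes, on average,
# to the splits in which it is used
importance = model.get_score(importance_type='gain')
for feature, gain in sorted(importance.items(), key=lambda item: -item[1]):
    print(feature, round(gain, 2))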

It should also be said that when working with real-world data you may end up with too many features, and this can become a problem when encoding techniques are applied to the categorical features. For example, if you are using the one-hot encoding technique, the more high-cardinality categorical features you have, the more encoded features will be created. For this reason, we may compromise and keep only the most meaningful query-level features (or opt for alternative encoding techniques).
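As a quick illustration of the blow-up, one-hot encoding a single categorical column already produces one new column per distinct value:

import pandas as pd

# One-hot encoding turns one categorical column into as many columns
# as it has distinct values
df = pd.DataFrame({'Region': ['it', 'nz', 'us', 'fr', 'de'], 'Streams': [5, 3, 8, 2, 7]})
encoded = pd.get_dummies(df, columns=['Region'])
print(encoded.columns.tolist())
# ['Streams', 'Region_de', 'Region_fr', 'Region_it', 'Region_nz', 'Region_us']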

It is extremely important to assess the distribution of training samples per query Id, since we may otherwise end up with results that are not truthful: too few observations do not give us sufficient confidence in the documents’ relevance. Also, two common mistakes to avoid are the following (a quick check is sketched after this list):
– If we have a limited number of observations, during the split it might happen that we get some queries with only one training sample. In this case, the NDCG@K for the query group will be 1, independently of the model;
– During the split, we could put all the samples with a single relevance label in the test set.
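A quick way to spot both situations before splitting is to inspect the per-query counts; a sketch, assuming a data frame with a ‘query_ID’ column as in the earlier snippet:

# Distribution of observations per query Id: single-sample queries and
# heavily skewed groups can be spotted here, before the split
counts = dataset['query_ID'].value_counts()
print(counts.describe())
print('queries with a single sample:', (counts == 1).sum())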

That’s why it is generally recommended to drop the under-sampled queries. However, as already said, this must be contextualized: if we have reliable data and we are confident that we have properly managed the conditions (described above) that can potentially skyrocket the NDCG, we can try to keep those observations.

In fact, this is what we did in the previous implementations: we set a threshold and, for the query Ids with a number of observations below that threshold, we saved those observations in a new data frame called under_sampled_only_train. We kept these observations in the training set, and moved some of them (the query Ids with the highest number of observations) to the test set only if the test set amounted to less than 20% of the entire dataset. Why this approach? In most situations, training a model on more data (since we have it!) is usually better; with this experiment, however, we investigated whether this really brought advantages or not.
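A minimal sketch of this filtering step could look as follows; the threshold value is purely illustrative and ‘dataset’ stands for the full data frame with the query_ID column:

# Query Ids with fewer observations than the threshold are set aside
# in the under_sampled_only_train data frame
threshold = 10   # illustrative value; tune it on your data distribution
counts = dataset['query_ID'].value_counts()
under_sampled_ids = counts[counts < threshold].index

under_sampled_only_train = dataset[dataset['query_ID'].isin(under_sampled_ids)]
well_sampled = dataset[~dataset['query_ID'].isin(under_sampled_ids)]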

Indeed, we wondered: what happens if we completely remove the ‘under_sampled_only_train‘ observations instead of keeping them in the training set?

We checked the differences between training a model by keeping the under-sampled query Ids (KEEP) and by removing them (DROP), then we tested these models on the same test set (which does not contain under-sampled queries at all!). Both models have the same features (query-level features included) so they can be compared; the only thing that changes is the total number of observations, and consequently the number of query Ids. The results are summarized below (Table 4):

We can notice that there isn’t a big difference; in our case, keeping the under-sampled queries in the training set seems to produce equal or slightly better results, so we can be fairly sure that we have reliable data and that the performances are truthful.

It should also be said that these results are influenced by the threshold set; by raising or lowering the threshold, you can include more or fewer observations in the under_sampled_only_train data frame and potentially obtain different performances.
The threshold should be set according to the distribution of your data – try different values to see what happens!

Let’s see the results of another experiment.

As described so far, in our dataset there are 4 query-level features (Region, Day, Month, and Weekday) that were used to create the query Id hashing. The under-sampled queries are stored in a data frame called under_sampled_only_train and will only be present in the training set; this data frame contains 252423 observations and 3333 different query Ids.
To make these query Id groups more “populous”, what happens if we assign a less detailed query Id to these observations only?

We tried two different approaches, progressively “loosening the grain” of the query Id of the under-sampled queries:

    1. group all the observations that have the same “Region” and “Month”
    2. group all the observations that have the same “Region”

We then checked the differences between training a model by grouping the under-sampled queries, leaving them as they are, and dropping them. We tested these models on the same test set (which does not contain under-sampled queries at all!). The models have the same features (query-level features included), so they can be compared; the only thing that changes is the query Id assigned to the under-sampled queries.

How did we group and assign the new “query_ID” to the under-sampled queries?

1. query_ID based on Region and Month

In the under_sampled_only_train data frame, we have 16 distinct “Region” values and 12 distinct ‘Month’ values. We decided to group all the observations that have the same value for the “Region” and “Month” features, assigning them a new query Id: we then obtained 144 unique query Ids in the under_sampled_only_train (instead of 3333):

As you can see from the table, the query Id 2226 has been assigned to all the observations (i.e. 1793) that have the “Region” value equal to “it” and the “Month” value equal to “1”, while the query Id 2227 has been assigned to all the observations (i.e. 1408) that have the “Region” value equal to “it” and the “Month” value equal to “2”, and so on.
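A compact way to express this coarser grouping in pandas is groupby(...).ngroup(), offset so that the new query Ids do not clash with the existing ones; this is just an alternative sketch, the method we actually used to assign the new Ids is reported further below:

# One coarser query Id per (Region, Month) pair, offset by the current
# maximum so the new Ids do not clash with the existing ones
offset = dataset['query_ID'].max() + 1
group_ids = under_sampled_only_train.groupby(['Region', 'Month']).ngroup()
under_sampled_only_train = under_sampled_only_train.assign(query_ID=group_ids + offset)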

2. query_ID based on Region only

In the under_sampled_only_train data frame, we have 16 distinct “Region” values. We decided to group all the observations that have the same value for the “Region” feature, assigning them a new query Id: we then obtained 16 unique query Ids in the under_sampled_only_train (instead of 3333):

As you can see from the table, the query Id 2226 has been assigned to all the observations (i.e. 16799) that have the “Region” value equal to “it”, while the query Id 2227 has been assigned to all the observations (i.e. 12923) that have the “Region” value equal to “nz”, and so on.

N.B. To avoid duplicates and/or errors, the new query Id values were not randomly assigned to the groups; we used the following method:

import pandas as pd

# Unique query Ids and unique 'Region' values of the under-sampled observations
unique_query_id = under_sampled_only_train['query_ID'].unique().tolist()
unique_region = under_sampled_only_train['Region'].unique().tolist()

# Empty data frame with the same columns, used to collect the re-grouped observations
column_names = under_sampled_only_train.columns
under_sampled_only_train_grouped = pd.DataFrame(columns=column_names, dtype=int)

# Assign one new query Id per 'Region' group, starting from the first unique query Id
i = unique_query_id[0]
for value in unique_region:
    new_dataframe = under_sampled_only_train[under_sampled_only_train['Region'] == value].copy()
    new_dataframe['query_ID'] = i
    under_sampled_only_train_grouped = pd.concat([under_sampled_only_train_grouped, new_dataframe],
                                                 ignore_index=True, sort=False)
    i = i + 1

We obtained the unique ‘Region’ values and the unique ‘query_ID’ values of the under_sampled_only_train data frame in two lists, and then assigned a new ‘query_ID’ to each specific ‘Region’ group, starting from the first unique query Id value and incrementing it by one for each of the 16 groups.

This choice was made to group rows that still have a part of the user’s request in common (i.e. Region OR Region and Month), creating groups with a more generic query and more observations than before.

Here are the model training performances (Table 5):

The first and fourth “scenarios” are the same as described in Question 2, while the second and the third are the results obtained after grouping the under-sampled queries in the training set.
The results are more or less the same, but it seems that keeping or dropping the under-sampled queries works better than grouping them under a less fine-grained query Id.
However, if we do decide to group those observations to make the query Id groups more “populated”, it is better to use as many query-level features as possible, which could lead to small improvements.

To sum up, here are the answers to the three initial questions:

    1. If possible, it is better to keep all the query-level features in the dataset, even though they were already used to create the query Id. Although it was not reported above, this was also verified during the implementation of questions 2 and 3, and we always obtained the best model performance when keeping them.
    2. If you have reliable data and are confident that you have handled the conditions that can potentially skyrocket the NDCG, you can try to keep the under-sampled queries in the training set and see whether it brings improvements or not. Remember to set a threshold in line with your data distribution!
    3. It is better to train a model using query Ids that are as accurate as possible; if you have queries with few observations, you can try to group them by “relaxing the grain” of the query Id a little bit, i.e. including fewer query-level features.

Did I mention we do Learning To Rank and Search Relevance training?
We also provide consulting on these topics, get in touch if you want to bring your search engine to the next level!

Did you like this post about A Learning to Rank Project on a Daily Song Ranking Problem? Don’t forget to subscribe to our Newsletter to stay up to date with the Information Retrieval world!

We are Sease, an Information Retrieval Company based in London, focused on providing R&D project guidance and implementation, Search consulting services, Training, and Search solutions using open source software like Apache Lucene/Solr, Elasticsearch, OpenSearch and Vespa.
