Learning To Rank Tips And Tricks

This “tips and tricks” post has one simple piece of advice: double-check your dataset's features at the end of the pipeline, before training a Learning to Rank (LTR) model.

Feature selection is an essential part of data preprocessing, which is considered one of the most time-consuming stages of any machine learning pipeline. It is the process of removing redundant features and keeping only those that are relevant and necessary for model construction.

Columns in a dataset that hold the same value in every observation/row are known as constant features and are considered redundant data. Will their removal affect your model's performance? No! Features with constant values cannot provide any useful information for predicting your response, so it is good practice to remove them from the dataset: not only to speed up training, but also because, if not handled carefully, they can lead to numerical problems and errors (for example during the encoding phase).

This might seem obvious, but after all the data pre-processing steps (i.e. the data collection, data modeling, and refining phases), you may end up with some constant features again (without even realizing it!) that should be ignored.

That’s why I want to show you one possible scenario that could occur during a real-world implementation.

Case Study Example

Let’s imagine a Learning to Rank scenario.

We have to manage an e-commerce product catalog. We collect all the interactions users have with the website's products (e.g. views, clicks, add-to-cart events, sales…) and create a dataset consisting of <query, document> pairs (e.g. the filters selected by the users (query-level) and all the features of the product viewed/clicked/sold (document-level)). To keep things simple, here is an example with just a few features shown:

To train our LTR model, we carefully design the vectorial representation of the domain documents and accurately build our training set, so that the model can perform as well as possible.

Most of the features have to be manipulated; in particular, to make categorical features understandable by machine learning algorithms, we basically need to encode the original information each feature provides in numeric form, without any loss if possible. There are different ways to encode categorical features, and one possible approach is called One Hot Encoding:
given a categorical feature with N distinct values, encode it as N binary features, each one stating whether that category applies to the document.

Imagine you have multi-valued categorical features like the following one:

This feature, named ‘productCategories‘, is an array of integers; for each interaction, it contains all the categories the product belongs to. Using one-hot encoding, each of the N distinct values is then encoded as its own binary column, adding N columns to the dataset, as you can see in the table below. This happens for all the categorical features in your dataset. In addition, we use all the query-level features to generate the query_ID (check out this blog post about it).
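As a sketch of this encoding step (the query_ID and category values below are hypothetical; real pipelines might also use sklearn's MultiLabelBinarizer), a multi-valued category column can be one-hot encoded with pandas by exploding the list column:

```python
import pandas as pd

# Toy interactions: 'productCategories' is an array of category IDs per row
df = pd.DataFrame({
    "query_ID": [1, 2, 3],
    "productCategories": [[3483, 100], [100], [4639]],
})

# Explode the lists, one-hot encode the values, then collapse
# back to one row per original interaction
one_hot = (
    pd.get_dummies(df["productCategories"].explode())
    .groupby(level=0).max()
    .add_prefix("productCategories_")
)
df = df.drop(columns="productCategories").join(one_hot)
```

Each distinct category ID becomes its own binary column (productCategories_100, productCategories_3483, …), exactly one column per distinct value.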

Before splitting the dataset into training and test sets, we use a method called clean_data_frame_from_single_label, which removes every query_ID with only one relevance label (read the Part 3 and Part 4 blog posts to understand why).
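The body of that method is not shown here; a minimal sketch of what it does could look like the following (the column names query_ID and relevance_label, and the toy data, are assumptions for illustration):

```python
import pandas as pd

def clean_data_frame_from_single_label(dataframe,
                                       query_col="query_ID",
                                       label_col="relevance_label"):
    # Count the distinct relevance labels inside each query group
    label_counts = dataframe.groupby(query_col)[label_col].nunique()
    # Keep only the queries whose labels take more than one value
    valid_queries = label_counts[label_counts > 1].index
    return dataframe[dataframe[query_col].isin(valid_queries)]

interactions = pd.DataFrame({
    "query_ID": [1, 1, 2, 2],
    "relevance_label": [0, 1, 1, 1],
})
cleaned = clean_data_frame_from_single_label(interactions)
# query_ID = 2 is dropped: all of its observations share the same label
```

A query group where every document has the same relevance label gives the pairwise/listwise loss nothing to learn from, which is why such groups are filtered out.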

It can happen that query_ID = 3 (the third row of the table above) gets removed during this phase, since all the observations belonging to this query group have the same relevance label (e.g. equal to 1). We end up with the following data frame:

You can easily see that we have added two unnecessary features to the dataset: all the rows for these features are False, since the row (or rows) containing True have been deleted by the clean_data_frame_from_single_label method.

So the features productCategories_3483 and productCategories_4639 have unintentionally become constant columns and, as already said, columns from which no information can be gained should be ignored before training the model.

Drop constant columns method

We need to add a method that checks for and deletes all these features before the splitting phase:

import logging

def drop_constant_columns(dataframe):
    # A column is constant when it holds a single distinct value across all rows
    constant_columns = [col for col in dataframe.columns if len(dataframe[col].unique()) == 1]
    dataframe_after_drop = dataframe.drop(constant_columns, axis=1)
    logging.debug('- - - - Number of constant features found: {:d}'.format(len(constant_columns)))
    logging.debug('- - - - Constant features removed: ' + str(constant_columns))
    percentage = len(constant_columns) / len(dataframe.columns)
    if percentage > 0.50:
        logging.warning('**** WARNING - more than 50% of the features have been removed')
    return dataframe_after_drop
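As a quick sanity check of the same logic (the toy frame below is hypothetical, mimicking the scenario above), the constant-column test can also be expressed with pandas' nunique; passing dropna=False means a column that is entirely NaN still counts as constant:

```python
import pandas as pd

df = pd.DataFrame({
    "f1": [1, 2, 3],
    "productCategories_3483": [False, False, False],
    "productCategories_4639": [False, False, False],
})

# nunique() == 1 flags columns with a single distinct value
constant = df.columns[df.nunique(dropna=False) == 1]
cleaned = df.drop(columns=constant)
# only 'f1' survives
```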

Model Training

To give you proof of what has been explained so far, we trained a Learning to Rank model twice: once keeping the constant features and once dropping them from the dataset. For the experiment, we used interaction log data collected from a generic e-commerce site. After the pre-processing part, the dataset contained a total of 6035 features.

Our method found 825 constant features (around 14% of the total). We then trained two models using LambdaMART: one keeping and one removing the constant features. Here are the results obtained:

We can confirm there is no difference in ranking quality: the presence (or absence) of constant features has no effect on the target, and in fact the ndcg@10 metric is the same for both models. Dropping the constant features, however, makes model training 6 minutes faster. Their removal can only bring benefits in terms of time, memory, and other technical issues (e.g. encoding), especially when there are many of them.


If you need to drop constant features, you may also be interested in other methods; check out the following links:

1) Sklearn: sklearn.feature_selection.VarianceThreshold
2) H2O: ‘ignore_const_cols‘ parameter
3) R: drop_constant_cols
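For instance, scikit-learn's VarianceThreshold with its default threshold of 0.0 removes exactly the zero-variance (constant) columns (the toy matrix below is hypothetical):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([
    [1.0, 0.0, 3.0],
    [2.0, 0.0, 3.0],
    [4.0, 0.0, 3.0],
])

# threshold=0.0 (the default) keeps only features with non-zero variance
selector = VarianceThreshold(threshold=0.0)
X_reduced = selector.fit_transform(X)
# the two constant columns are dropped; only the first one remains
```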


Author

Ilaria Petreti

Ilaria is a Data Scientist passionate about the world of Artificial Intelligence. She loves applying Data Mining and Machine Learning techniques, strongly believing in the power of Big Data and Digital Transformation.
