From Training to Ranking: Using BERT to Improve Search Relevance
If you have attended our Artificial Intelligence in Search Training, you should now be familiar with the use of Natural Language Processing and Deep Learning applied to search. If you have not, do not worry: we are planning to arrange another date and we will keep you posted through our newsletter, so make sure you subscribe. In the meantime, you can start learning about text ranking with Deep Learning (DL) by reading this blog post.
More and more frequently, we hear about how Artificial Intelligence (AI) permeates every aspect of our daily lives. When we talk about AI, we are referring to a broad set of techniques that enable machines to learn and reason like humans. Artificial Intelligence has been around since the ’50s, but it has exploded only recently with the advent of Deep Learning. Deep Learning is a subset of AI that uses deep neural networks to solve complex problems that cannot easily be solved with hand-crafted algorithms.
When it comes to search engines, Deep Learning can help with several tasks, such as query understanding, personalization, or recommendation. In this article, we will be focusing on text ranking.
The problem we will be solving is usually referred to as “ad hoc retrieval”: a standard retrieval task in which the user specifies their information need through a query, which initiates a search for documents likely to be relevant to the user. Basically, given a query and a collection of documents, we need to return a ranked list of results that maximizes a metric of interest.
Search is the most common practical implementation of text ranking, where a search engine is adopted as the retrieval system to produce a ranked list of web pages, PDFs, news articles, tweets, or any other form of text ordered by estimated relevance with respect to the user’s intent.
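For instance, one common metric of interest is the reciprocal rank of the first relevant result (its mean over a set of queries gives MRR). A minimal sketch, with a made-up ranked list and relevance judgments:
def reciprocal_rank(ranked_doc_ids, relevant_ids):
    # Position (1-based) of the first relevant document, inverted.
    for position, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / position
    return 0.0

# Hypothetical ranking and judgments: the first relevant document is at rank 2.
print(reciprocal_rank(["d3", "d7", "d1"], {"d7"}))  # 0.5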

BERT
Recent advances in information retrieval have shown promising performance gains from utilizing large-scale pre-trained transformer-based language models like BERT [1].
BERT, an invention of Google, is the best-known example of these pre-trained models, a family of neural networks built on the mechanism of self-attention [2], a technique that mimics cognitive attention. The attention mechanism relates different positions of a single sequence in order to compute its representation, enhancing the important parts of the input and fading out the rest. Transformers, the networks built on this attention mechanism, gather information about the relevant context of a given word and encode that context in a rich vector that smartly represents the word.
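As a rough sketch of the idea (not BERT’s actual implementation), the scaled dot-product attention described in [2] can be written in a few lines of PyTorch:
import math
import torch

def scaled_dot_product_attention(queries, keys, values):
    # softmax(Q K^T / sqrt(d_k)) V: each output row is a context-aware
    # mixture of the value vectors, weighted by query-key similarity.
    d_k = queries.size(-1)
    scores = queries @ keys.transpose(-2, -1) / math.sqrt(d_k)
    weights = torch.softmax(scores, dim=-1)
    return weights @ values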
The following figure depicts the internal architecture of the BERT model as well as the input/output representations used. The input is first tokenized using an algorithm called WordPiece [3] and special tokens are added:
- [CLS] -> is called the classification token and it is used at the beginning of the sequence.
- [SEP] -> indicates the separation of two sequences, acting as a delimiter.
- [MASK] -> used to indicate a masked token in the Masked Language Model (MLM) task, described below.
The tokenized input is then transformed into Token Embeddings, to which Segment Embeddings and Position Embeddings are added. Token embeddings are the vectors looked up by each token’s vocabulary ID. Segment embeddings are just a numeric class used to distinguish between sentences when two or more sentences are provided using the separator token. Position embeddings simply indicate the position of each token in the sequence. Finally, the embedding fed to the model architecture is the sum of the token, segment, and position embeddings.
The resulting token embeddings then go through the BERT model, which is composed of 12 layers of transformer encoders (at least in the base version). The output of BERT is a hidden state vector of a pre-defined size for each token in the input sequence. In the case of BERT base, these output embeddings have size 768.
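As a quick sanity check, the tokenization and the size of the output vectors can be inspected with the HuggingFace Transformers library (used again later in this post); the example sentence is arbitrary:
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# WordPiece tokenization plus the [CLS] and [SEP] special tokens.
encoded = tokenizer.encode_plus("neural networks improve search", return_tensors="pt")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"][0].tolist()))

with torch.no_grad():
    outputs = model(**encoded)
# One hidden state vector of size 768 per input token (BERT base).
print(outputs[0].shape)  # (1, number_of_tokens, 768)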

The use of BERT in commercial web search engines has been publicly confirmed by large companies like Google [4] and Microsoft [5]. As they explain, longer and more conversational queries are harder for traditional approaches, while contextualized language models can better understand the meaning of prepositions like “for” and “to”, capturing the context of the words in the query. Transformer-based models understand the context and the relationship between each word and all the words around it in a sentence.

Google’s example query “2019 brazil traveler to usa need a visa” [4] is a clear demonstration that the word “to” and its relationship to the other words in the query are particularly important to understanding the meaning. This query shows BERT’s ability to understand the intent behind the user’s query, which is about a Brazilian traveling to the U.S., and not the other way around.
Neural Ranking
One of the simplest yet very effective applications of pre-trained contextualized language models is the “vanilla BERT” setting, where the query and the document are jointly encoded and the model’s classification component is tuned to provide a ranking score. This approach is sometimes also referred to as CLS, monoBERT [3], BERTcat, or simply a cross-encoder.

Cross-encoders are expensive because the query and the document need to be processed simultaneously by the BERT model at query time. This makes them better suited as re-rankers of a pool of candidates previously generated using traditional inverted indexes and fast scoring functions, such as BM25.
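The overall pattern looks roughly like the sketch below; it assumes the rank_bm25 package for candidate generation and a hypothetical cross_encoder_score(query, doc) function standing in for the BERT re-ranker described next:
from rank_bm25 import BM25Okapi

query = "example query"
corpus = ["first example document", "second example document", "third example document"]

# Stage 1: cheap candidate generation (any inverted-index engine would do here).
bm25 = BM25Okapi([doc.split() for doc in corpus])
candidates = bm25.get_top_n(query.split(), corpus, n=100)

# Stage 2: expensive cross-encoder re-ranking of the small candidate pool.
# cross_encoder_score is a hypothetical stand-in for the model defined below.
reranked = sorted(candidates, key=lambda doc: cross_encoder_score(query, doc), reverse=True)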
The model takes as input a sequence made up of the query concatenated with the document, the two being separated by a [SEP] token. The input is also prepended with a [CLS] token and terminated with another [SEP] token.
[CLS] QUERY [SEP] DOCUMENT [SEP]
The above sequence, appropriately tokenized, is passed to BERT as input, which produces a contextualized representation vector for each token in the input sequence. Basically, every word of the query and the document, as well as the special tokens, gets a vector representation.
In this relevance classification approach, the vector representation corresponding to the [CLS] token is the one used to infer a query-document relevance score. To project the vector of the [CLS] token into a scalar value, a single-layer, fully connected neural network is used. This single layer is initially untrained, as we initialize it from scratch, and its weights are adjusted while fine-tuning the whole model.
To be precise, to estimate the score of a query-document pair, the [CLS] vector representation is generated first, discarding the vectors of all the other tokens. We limit ourselves to the [CLS] token because it is an aggregate representation of the sequence (the query concatenated with the document): it captures the global context and can be seen as a weighted average of the words such that the representation of the whole sequence is retained.
In the BERT base model, the [CLS] vector is just an array of 768 floats. The linear layer on top of the BERT model then needs to be of size (768, 1), such that the dot product of the two results in a single scalar value. This scalar is the estimated relevance of the document for the query. By collecting the score of each document in a pool of candidates for a given query, it is possible to rank them in decreasing order of relevance to the user’s intent.
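Putting these pieces together, scoring a single query-document pair can be sketched as follows; note that the (768, 1) linear layer here is randomly initialized, so the score is meaningless until the model is fine-tuned as described in the next section:
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
cls_projection = torch.nn.Linear(768, 1)  # untrained single layer on top of [CLS]

# Builds [CLS] QUERY [SEP] DOCUMENT [SEP] by passing the two texts as a pair.
encoded = tokenizer.encode_plus("example query", "example document text", return_tensors="pt")
with torch.no_grad():
    cls_vector = bert(**encoded)[0][:, 0, :]  # hidden state of the [CLS] token, size 768
score = cls_projection(cls_vector)  # a single scalar estimating query-document relevance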
Fine-tuning
Contextualized language models usually come pre-trained.
Pre-training is performed on very large datasets, without supervision. In the case of BERT, the model has been pre-trained on the Wikipedia and BookCorpus datasets using two tasks: Masked Language Model (MLM) and Next Sentence Prediction (NSP).
- In Masked Language Model, 15% of the tokens of each sequence are randomly masked (replaced with the [MASK] token) and the model is trained to predict these tokens using all the other tokens of the sequence (see the quick illustration after this list).
- In Next Sentence Prediction, the model is provided with two sentences as input and has to predict whether the second sentence follows the first in the corpus.
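As a quick illustration of the MLM objective, a pre-trained BERT can already fill in masked tokens out of the box; the sketch below uses the HuggingFace fill-mask pipeline on an arbitrary sentence:
from transformers import pipeline

# Illustrative only: the pre-trained model predicts the most likely [MASK] replacements.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("the user typed a [MASK] into the search engine."))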
A common way to improve the effectiveness of the adopted model is to leverage a fine-tuning step that transfers to it better knowledge of the task to be solved. One way to fine-tune a model used for re-ranking documents is to set the goal of determining whether or not a document is relevant to a query, with a pairwise strategy.

BERT can be used to produce a score for each document individually and be optimized via a pairwise softmax cross-entropy loss over the computed scores.
Cross-entropy loss:
L = − ∑_{j ∈ Jpos} log(s_j) − ∑_{j ∈ Jneg} log(1 − s_j)
where Jpos is the set of indexes of the relevant candidates, Jneg is the set of indexes of the non-relevant candidates, and s_j is the score predicted for candidate j. In pairwise training, we take into account only two documents at a time, and only one of the two is relevant.
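As a toy numerical example of this loss in the pairwise setting, assuming the two scores have already been normalized into probabilities of relevance:
import math

# Hypothetical probabilities of relevance for the relevant and non-relevant document.
s_pos, s_neg = 0.8, 0.3
loss = -math.log(s_pos) - math.log(1.0 - s_neg)
print(loss)  # about 0.58; the loss shrinks as s_pos -> 1 and s_neg -> 0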
Implementation
In this section, we are going to show an implementation of the concepts we have just described. In particular, we are going to see how to define the model and what a pairwise training step looks like. We are going to use PyTorch [6] and HuggingFace Transformers [7], a library that implements Transformer-based architectures and provides an API for them as well as the pre-trained models.
MODEL
In the following snippet, we are going to define our model. We rely on the BertForSequenceClassification model provided by the transformers library.
BertForSequenceClassification is just a BERT model with a sequence classification/regression head on top, that is, a linear layer on top of the pooled output. In this example, we set the number of labels to one, which results in a model that learns a single score for each query-document pair.
from transformers import BertPreTrainedModel, BertForSequenceClassification

class MonoBERT(BertPreTrainedModel):
    def __init__(self, config):
        # A single label turns the classification head into a scoring head.
        config.num_labels = 1
        super(MonoBERT, self).__init__(config)
        self.bert = BertForSequenceClassification(config)
        self.init_weights()

    def forward(self, input_ids, attention_mask, token_type_ids):
        outputs = self.bert(input_ids, attention_mask, token_type_ids)
        logits = outputs[0]  # one relevance score per input sequence
        return logits
TRAINING STEP
To fine-tune our model we need to leverage some training data in the form of triples. Each triple is made of a query and two documents: the first document has been marked as relevant to the query (positive example), while the second has not (negative example). The following snippet of code is an example of the steps needed to perform a training step. Please note that it does not take into consideration several components that are fundamental to make the training effective, such as larger batch sizes or gradient clipping.
import torch
from torch.nn.functional import cross_entropy
from transformers import AdamW, BertTokenizer

model = MonoBERT.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
optimizer = AdamW(model.parameters(), lr=1e-5, eps=1e-8)
optimizer.zero_grad()

# query, pos_doc and neg_doc can be retrieved from the training triples
pos_text = "{} [SEP] {}".format(query, pos_doc)
neg_text = "{} [SEP] {}".format(query, neg_doc)
pos_encoded = tokenizer.encode_plus(pos_text, return_tensors="pt")
neg_encoded = tokenizer.encode_plus(neg_text, return_tensors="pt")

pos_output = model.forward(**pos_encoded).squeeze(1)
neg_output = model.forward(**neg_encoded).squeeze(1)

# Label 0 marks the first of the two stacked scores (the positive one) as the target class.
labels = torch.zeros(1, dtype=torch.long)
loss = cross_entropy(torch.stack((pos_output, neg_output), dim=1), labels)
loss.backward()
optimizer.step()
After loading the model and the tokenizer, the training triple is formatted as two query-document pairs, where the same query is concatenated with each of the two documents using the separator token. The two formatted strings, after being tokenized and converted into tensors, are used as input to the model, which returns a score for each of the pairs.
The two inferred scores, in conjunction with the label indicating which document is relevant, are then used to compute the training loss. From the loss, the gradients can be computed, and the weights are finally updated by the optimizer’s step.
Summary
In this post, we have looked at how we can leverage BERT to perform more accurate document ranking. In particular, we have studied a cross-encoder technique able to infer a relevance score for a query-document pair. Furthermore, we have quickly glanced at model fine-tuning for the relevance classification task using a pairwise approach.
Despite the simplicity of the implementation, transformer-based models are extremely powerful. If you are planning to give them a try, be aware that training a transformer model is very sensitive to mistakes, and small errors can easily lead to failure. Also, interpreting what went wrong is not easy until you become very familiar with these tools. If your model does not work as expected on the first try, do not give up and conclude that the approach does not apply to your data; reach out to the experts for help.
Finally, if you are interested in hearing more about the use of Deep Learning applied to Search stay tuned for our next episode on document expansion.
Shameless plug for our training and services!
Did I mention we do Learning To Rank and Search Relevance training?
We also provide consulting on these topics, get in touch if you want to bring your search engine to the next level!
Subscribe to our newsletter
Did you like this post about using BERT to improve Search Relevance? Don’t forget to subscribe to our Newsletter to stay up to date with the Information Retrieval world!
Author
Antonio Mallia
Antonio explores new techniques and architectures for large-scale search engines, including indexing, compression, and retrieval. He contributes to the Academic and Open Source community.