From Training to Ranking: Using BERT to Improve Search Relevance
If you have attended our Artificial Intelligence in Search Training, you should now be familiar with the use of Natural Language Processing and Deep Learning applied to search. If you have not, do not worry: we are planning to arrange another date and we will keep you posted through our newsletter, so make sure you subscribe. In the meantime, you can start learning about text ranking with Deep Learning (DL) by reading this blog post.
More and more frequently, we hear about how Artificial Intelligence (AI) permeates every aspect of our daily lives. When we talk about AI, we are referring to a broad set of techniques that enable machines to learn and reason like humans. Artificial Intelligence has been around since the ’50s, but it has exploded only recently with the advent of Deep Learning. Deep Learning is a subset of AI that uses deep neural networks to solve complex problems that cannot easily be solved with hand-crafted algorithms.
When it comes to search engines, Deep Learning can help with several tasks, such as query understanding, personalization, or recommendation. In this article, we will be focusing on text ranking.
The problem we will be solving is usually referred to as “ad hoc retrieval”: a standard retrieval task in which the user specifies their information need through a query, which initiates a search for documents likely to be relevant to the user. Basically, given a query and a collection of documents, we need to return a ranked list of results that maximizes a metric of interest.
Search is the most common practical implementation of text ranking, where a search engine is adopted as the retrieval system to produce a ranked list of web pages, PDFs, news articles, tweets, or any other form of text ordered by estimated relevance with respect to the user’s intent.
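For instance, one common metric of interest is the reciprocal rank of the first relevant result (its mean over a set of queries gives MRR). A minimal sketch, with a made-up ranked list and relevance judgments:
def reciprocal_rank(ranked_doc_ids, relevant_ids):
    # Position (1-based) of the first relevant document, inverted.
    for position, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / position
    return 0.0

# Hypothetical ranking and judgments: the first relevant document is at rank 2.
print(reciprocal_rank(["d3", "d7", "d1"], {"d7"}))  # 0.5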

BERT
Recent advances in information retrieval have shown promising performance gains from utilizing large-scale pre-trained transformer-based language models like BERT [1].
BERT, an invention of Google, is the best-known example of these pre-trained models, a family of neural networks built on the mechanism of self-attention [2], a technique that mimics cognitive attention. The attention mechanism relates different positions of a single sequence in order to compute its representation, enhancing the important parts of the input and fading out the rest. Transformers, the networks built on this attention mechanism, gather information about the relevant context of a given word and encode that context in a rich vector that smartly represents the word.
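As a rough sketch of the idea (not BERT’s actual implementation), the scaled dot-product attention described in [2] can be written in a few lines of PyTorch:
import math
import torch

def scaled_dot_product_attention(queries, keys, values):
    # softmax(Q K^T / sqrt(d_k)) V: each output row is a context-aware
    # mixture of the value vectors, weighted by query-key similarity.
    d_k = queries.size(-1)
    scores = queries @ keys.transpose(-2, -1) / math.sqrt(d_k)
    weights = torch.softmax(scores, dim=-1)
    return weights @ values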
The following figure depicts the internal architecture of the BERT model as well as the input/output representations used. The input is first tokenized using an algorithm called WordPiece [3] and special tokens are added:
- [CLS] -> is called the classification token and it is used at the beginning of the sequence.
- [SEP] -> indicates the separation of two sequences, acting as a delimiter.
- [MASK] -> used to indicate a masked token in the Masked Language Model (MLM) task, described below.
The tokenized input is then transformed into Token Embeddings, to which Segment Embeddings and Position Embeddings are added. Token embeddings are the vectors looked up by each token’s vocabulary ID. Segment embeddings are just a numeric class used to distinguish between sentences when two or more sentences are provided using the separator token. Position embeddings simply indicate the position of each token in the sequence. Finally, the embedding fed to the model architecture is the sum of the token, segment, and position embeddings.
The resulting token embeddings then go through the BERT model, which is composed of 12 layers of transformer encoders (at least in the base version). The output of BERT is a hidden state vector of a pre-defined size for each token in the input sequence. In the case of BERT base, these output embeddings have size 768.
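As a quick sanity check, the tokenization and the size of the output vectors can be inspected with the HuggingFace Transformers library (used again later in this post); the example sentence is arbitrary:
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# WordPiece tokenization plus the [CLS] and [SEP] special tokens.
encoded = tokenizer.encode_plus("neural networks improve search", return_tensors="pt")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"][0].tolist()))

with torch.no_grad():
    outputs = model(**encoded)
# One hidden state vector of size 768 per input token (BERT base).
print(outputs[0].shape)  # (1, number_of_tokens, 768)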

The use of BERT in commercial web search engines has been publicly confirmed by large companies like Google [4] and Microsoft [5]. As they explain, longer and more conversational queries are harder for traditional approaches, while contextualized language models can better understand the meaning of prepositions like “for” and “to”, capturing the context of the words in the query. Transformer-based models understand the context and the relationship between each word and all the words around it in a sentence.

Google’s example query “2019 brazil traveler to usa need a visa” [4] is a clear demonstration that the word “to” and its relationship to the other words in the query are particularly important to understanding the meaning. This query shows BERT’s ability to understand the intent behind the user’s query, which is about a Brazilian traveling to the U.S., and not the other way around.
Neural Ranking
One of the simplest yet very effective applications of pre-trained contextualized language models is the “vanilla BERT” setting, where the query and the document are jointly encoded and the model’s classification component is tuned to provide a ranking score. This approach is sometimes also referred to as CLS, monoBERT [3], BERTcat, or simply a cross-encoder.

Cross-encoders are expensive because the query and the document need to be processed simultaneously by the BERT model at query time. This makes them better suited as re-rankers of a pool of candidates previously generated using traditional inverted indexes and fast scoring functions, such as BM25.
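The overall pattern looks roughly like the sketch below; it assumes the rank_bm25 package for candidate generation and a hypothetical cross_encoder_score(query, doc) function standing in for the BERT re-ranker described next:
from rank_bm25 import BM25Okapi

query = "example query"
corpus = ["first example document", "second example document", "third example document"]

# Stage 1: cheap candidate generation (any inverted-index engine would do here).
bm25 = BM25Okapi([doc.split() for doc in corpus])
candidates = bm25.get_top_n(query.split(), corpus, n=100)

# Stage 2: expensive cross-encoder re-ranking of the small candidate pool.
# cross_encoder_score is a hypothetical stand-in for the model defined below.
reranked = sorted(candidates, key=lambda doc: cross_encoder_score(query, doc), reverse=True)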
The model takes as input a sequence made up of the query concatenated with the document, the two being separated by a [SEP] token. The input is also prepended with a [CLS] token and terminated with another [SEP] token.
[CLS] QUERY [SEP] DOCUMENT [SEP]
The above sequence, appropriately tokenized, is passed to BERT as input, which produces a contextualized representation vector for each token in the input sequence. Basically, every word of the query and the document, as well as the special tokens, gets a vector representation.
In this relevance classification approach, the vector representation corresponding to the [CLS] token is the one used to infer a query-document relevance score. To project the vector of the [CLS] token into a scalar value, a single-layer, fully connected neural network is used. This single layer is initially untrained, as we initialize it from scratch, and its weights are adjusted while fine-tuning the whole model.
To be precise, to estimate the score of a query-document pair, the [CLS] vector representation is generated first, discarding the vectors of all the other tokens. We limit ourselves to the [CLS] token because it is an aggregate representation of the sequence (the query concatenated with the document): it captures the global context and can be seen as a weighted average of the words such that the representation of the whole sequence is retained.
In the BERT base model, the [CLS] vector is just an array of 768 floats. The linear layer on top of the BERT model then needs to be of size (768, 1), such that the dot product of the two results in a single scalar value. This scalar is the estimated relevance of the document for the query. By collecting the score of each document in a pool of candidates for a given query, it is possible to rank them in decreasing order of relevance to the user’s intent.
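Putting these pieces together, scoring a single query-document pair can be sketched as follows; note that the (768, 1) linear layer here is randomly initialized, so the score is meaningless until the model is fine-tuned as described in the next section:
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
cls_projection = torch.nn.Linear(768, 1)  # untrained single layer on top of [CLS]

# Builds [CLS] QUERY [SEP] DOCUMENT [SEP] by passing the two texts as a pair.
encoded = tokenizer.encode_plus("example query", "example document text", return_tensors="pt")
with torch.no_grad():
    cls_vector = bert(**encoded)[0][:, 0, :]  # hidden state of the [CLS] token, size 768
score = cls_projection(cls_vector)  # a single scalar estimating query-document relevance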
Fine-tuning
Contextualized language models usually come pre-trained.
Pre-training is performed on very large datasets, without supervision. In the case of BERT, the model has been pre-trained on the Wikipedia and BookCorpus datasets using two tasks: Masked Language Model (MLM) and Next Sentence Prediction (NSP).
- In Masked Language Model, 15% of the tokens of each sequence are randomly masked (replaced with the [MASK] token) and the model is trained to predict these tokens using all the other tokens of the sequence (see the quick illustration after this list).
- In Next Sentence Prediction, the model is provided with two sentences as input and has to predict whether the second sentence follows the first in the corpus.
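As a quick illustration of the MLM objective, a pre-trained BERT can already fill in masked tokens out of the box; the sketch below uses the HuggingFace fill-mask pipeline on an arbitrary sentence:
from transformers import pipeline

# Illustrative only: the pre-trained model predicts the most likely [MASK] replacements.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("the user typed a [MASK] into the search engine."))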
A common way to improve the effectiveness of the adopted model is to leverage a fine-tuning step that transfers to it better knowledge of the task to be solved. One way to fine-tune a model used for re-ranking documents is to set the goal of determining whether or not a document is relevant to a query, with a pairwise strategy.

BERT can be used to produce a score for each document individually and be optimized via a pairwise softmax cross-entropy loss over the computed scores.
Cross-entropy loss:
L = − ∑_{j ∈ Jpos} log(s_j) − ∑_{j ∈ Jneg} log(1 − s_j)
where Jpos is the set of indexes of the relevant candidates, Jneg is the set of indexes of the non-relevant candidates, and s_j is the score predicted for candidate j. In pairwise training, we take into account only two documents at a time, and only one of the two is relevant.
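As a toy numerical example of this loss in the pairwise setting, assuming the two scores have already been normalized into probabilities of relevance:
import math

# Hypothetical probabilities of relevance for the relevant and non-relevant document.
s_pos, s_neg = 0.8, 0.3
loss = -math.log(s_pos) - math.log(1.0 - s_neg)
print(loss)  # about 0.58; the loss shrinks as s_pos -> 1 and s_neg -> 0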
Implementation
In this section, we are going to show an implementation of the concepts we have just described. In particular, we are going to see how to define the model and what a pairwise training step looks like. We are going to use PyTorch [6] and HuggingFace Transformers [7], a library that implements Transformer-based architectures and provides an API for them as well as the pre-trained models.
MODEL
In the following snippet, we are going to define our model. We rely on the BertForSequenceClassification model provided by the transformers library.
BertForSequenceClassification is just a BERT model with a sequence classification/regression head on top, that is, a linear layer on top of the pooled output. In this example, we set the number of labels to one, which results in a model that learns a single score for each query-document pair.
from transformers import BertPreTrainedModel, BertForSequenceClassification

class MonoBERT(BertPreTrainedModel):
    def __init__(self, config):
        # A single label turns the classification head into a scoring head.
        config.num_labels = 1
        super(MonoBERT, self).__init__(config)
        self.bert = BertForSequenceClassification(config)
        self.init_weights()

    def forward(self, input_ids, attention_mask, token_type_ids):
        outputs = self.bert(input_ids, attention_mask, token_type_ids)
        logits = outputs[0]  # one relevance score per input sequence
        return logits
TRAINING STEP
To fine-tune our model we need to leverage some training data in the form of triples. Each triple is made of a query and two documents: the first document has been marked as relevant to the query (positive example), while the second has not (negative example). The following snippet of code is an example of the steps needed to perform a training step. Please note that it does not take into consideration several components that are fundamental to make the training effective, such as larger batch sizes or gradient clipping.
import torch
from torch.nn.functional import cross_entropy
from transformers import AdamW, BertTokenizer

model = MonoBERT.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
optimizer = AdamW(model.parameters(), lr=1e-5, eps=1e-8)
optimizer.zero_grad()

# query, pos_doc and neg_doc can be retrieved from the training triples
pos_text = "{} [SEP] {}".format(query, pos_doc)
neg_text = "{} [SEP] {}".format(query, neg_doc)
pos_encoded = tokenizer.encode_plus(pos_text, return_tensors="pt")
neg_encoded = tokenizer.encode_plus(neg_text, return_tensors="pt")

pos_output = model.forward(**pos_encoded).squeeze(1)
neg_output = model.forward(**neg_encoded).squeeze(1)

# Label 0 marks the first of the two stacked scores (the positive one) as the target class.
labels = torch.zeros(1, dtype=torch.long)
loss = cross_entropy(torch.stack((pos_output, neg_output), dim=1), labels)
loss.backward()
optimizer.step()
After loading the model and the tokenizer, the training triple is formatted as two query-document pairs, where the same query is concatenated with each of the two documents using the separator token. The two formatted strings, after being tokenized and converted into tensors, are used as input to the model, which returns a score for each of the pairs.
The two inferred scores, in conjunction with the label indicating which document is relevant, are then used to compute the training loss. From the loss, the gradients can be computed, and the weights are finally updated by the optimizer’s step.
Summary
In this post, we have looked at how we can leverage BERT to perform more accurate document ranking. In particular, we have studied a cross-encoder technique able to infer a relevance score for a query-document pair. Furthermore, we have quickly glanced at model fine-tuning for the relevance classification task using a pairwise approach.
Despite the simplicity of the implementation, transformer-based models are extremely powerful. If you are planning to give them a try, be aware that training a transformer model is very sensitive to mistakes, and small errors can easily lead to failure. Also, interpreting what went wrong is not easy until you become very familiar with these tools. If your model does not work as expected on the first try, do not give up and conclude that the approach does not apply to your data; reach out to the experts for help.
Finally, if you are interested in hearing more about the use of Deep Learning applied to Search stay tuned for our next episode on document expansion.
Shameless plug for our training and services!
Did I mention we do Learning To Rank and Search Relevance training?
We also provide consulting on these topics, get in touch if you want to bring your search engine to the next level!
Subscribe to our newsletter
Did you like this post about using BERT to improve Search Relevance? Don’t forget to subscribe to our Newsletter to stay up to date with the Information Retrieval world!
Author
Antonio Mallia
Antonio explores new techniques and architectures for large-scale search engines, including indexing, compression, and retrieval. He contributes to the Academic and Open Source community.