If you have read Part 1 of this blog post, you should know by now how many fantastic things can be done with online testing! In particular, the advantages that interleaving brings with respect to A/B testing, but you are still waiting for the answer to a question: how to implement it?
Let’s see together some of these interleaving implementations. First of all, let’s describe the first implemented interleaving approach, then I will present to you the two most used ones.
Balanced Interleaving
The first implementation is: balanced interleaving.
Balanced interleaving is the simplest method for interleaving. It always puts in the top k results of the result list l_{I}, documents from the top k results of both lists l_{A} and l_{B} .
Suppose to have two pointers k_{a} and k_{b} , for the two ranked lists l_{A} and l_{B} , always pointing to the higher result of each list.
The implemented algorithm is the following:
Let’s make an example to explain everything more clearly.
Suppose to have l_{A} = (a_{1}, a_{2}, …) and l_{B} = (b_{1}, b_{2}, …).
We give a priority to one of the two models with the randomBit() function (line 2). The chosen model will decide first which documents to add to the final list. Suppose we select model_A: Afirst = 1.
Then we will repeat this process unless we have at least one document in both l_{A} and l_{B} :

 If (k_{a} < k_{b}) or (k_{a} = k_{b} and model_A is the favorite one): if the document pointed by k_{a} isn’t already in l_{I} , we insert it.
 Else: if the document pointed by k_{b} isn’t already in l_{I} , we insert it.
At the end of this loop, we will have added all the documents from l_{A} and l_{B} (with at most one discard) in l_{I} .
We can, therefore, see that in this way we create a unique list l_{I} containing documents from both the model, respecting the relative rank of the documents inside their original list ( l_{A} and l_{B}) .
How to evaluate the results?
The obtained result list l_{I} = (i_{1}, i_{2}, …) is now shown to the user, who clicks those documents that mostly reflect his information need. Suppose to call c_{1}, c_{2}, … the ranks of the clicked documents and c_{max} = max_{i}c_{i} .
In order to derive a preference between the two models, we count the number of clicked documents from each model in the top k results of l_{I} , where:
k = min{ j: (i_{cmax}= a_{j}) or (i_{cmax}= b_{j})} .
In particular the number of clicked documents for each model is:
h_{a} = {c_{j}: i_{cj} ∈ (a_{1}, …, a_{k})} for model_A
h_{b} = {c_{j}: i_{cj} ∈ (b_{1}, …, b_{k})} for model_B
If h_{a} > h_{b} model_A will be the favorite one, if h_{a} < h_{b} model_B will be the favorite one, if h_{a} = h_{b} we have a tie.
This is a preference derived for the query q, but many other queries can be done during the online evaluation. How can we derive a preference for all the queries?
Let wins(A) be the number of comparisons in which model_A won, wins(B) the number of comparisons in which model_B won and ties(A,B) the number of times we have a tie.
To compute the overall preference we use another statistical measure:
A positive value of Δ_{AB} means that model_A is the favorite one, a negative value of Δ_{AB} means that model_B is the favorite one.
Drawbacks
One drawback of this approach arises when comparing two models that produce very similar ranking lists. If for example the ranking list of model_A is identical to the one of model_B with the exception of the first document: l_{A} = (a, b, c, d) and l_{B} = (b, c, d, a). The comparison phase will bring the model_B to win more often than model_A. This happens regardless of the model chosen by RandomBit().
This drawback arises due to the way in which the k parameter is chosen (the min rank that includes all the clicked documents) and to the fact that model_B rank higher than model_A all documents with the exception of a.
TeamDraft Interleaving
The teamdraft interleaving approach overcomes the issue described for balanced interleaving with a new way of constructing the interleaved result list l_{I} . In particular, the algorithm relies on the widely used method of team captains, in teammatches, for the selection of the players.
Suppose to have two captains A and B. Suppose to choose one of the two by chance at each round. Then the selected captain will choose his favorite player still available, add him to the team and then leave the turn to the other captain that will do the same. This will repeat unless there are players available.
The implemented algorithm is the following:
Let’s make also here an example to explain everything more clearly.
Suppose to have l_{A} = (a_{1}, a_{2}, …) and l_{B} = (b_{1}, b_{2}, …).
Here, like in balanced interleaving, the choice of the starting model is given by randomBit(), but this choice is done at every round and not only at the beginning of the algorithm.
We will repeat this process unless we have at least one document in both l_{A} and l_{B} :
 If (the size of TeamA is smaller than the one of TeamB) or (the two teams have the same size and model_A has the priority): k will take the rank of the top document in l_{A} not yet in l_{I} ; this document will be added to the interleaved list l_{I} and to the TeamA set (this set record all the documents that are taken from model_A).
 Else: k will take the rank of the top document in l_{B} not yet in l_{I} ; this document will be added to the interleaved list l_{I} and to the TeamB set (this set record all the documents that are taken from model_B).
At the end of this loop, we have added all the documents from l_{A} and l_{B} (with at most one discard) in l_{I} . We will also obtain the list of documents belonging to TeamA and TeamB.
How to evaluate the results?
Again, the obtained result list l_{I} = (i_{1}, i_{2}, …) is shown to the user, who clicks those documents that mostly reflect his information need. Suppose to call c_{1}, c_{2}, … the ranks of the clicked documents.
In order to derive a preference between the two models, we count the number of clicked documents in each Team:
h_{a} = {c_{j}: i_{cj} ∈ TeamA for model_A
h_{b} = {c_{j}: i_{cj} ∈ TeamB} for model_B
If h_{a} > h_{b} model_A will be the favorite one, if h_{a} < h_{b} model_B will be the favorite one, if h_{a} = h_{b} we have a tie.
In the same way as the balanced interleaving, we can derive a preference for all the queries.
We compute wins(A), wins(B), ties(A,B) and then Δ_{AB} .
Drawbacks
Also for teamdraft interleaving there is a drawback.
Suppose to have two ranked lists: l_{A} = (a, b, c, d) and l_{B} = (b, c, d, a). Suppose c to be the only relevant document.
With this approach we can obtain four different interleaved lists:
l_{I1} = (a_{A}, b_{B}, c_{A}, d_{B})
l_{I2} = (b_{B}, a_{A}, c_{B}, d_{A})
l_{I3} = (b_{B}, a_{A}, c_{A}, d_{B})
l_{I4} = (a_{A}, b_{B}, c_{B}, d_{A})
All of them putting c at the same rank.
Applying our preference computation, it will result in a tie between the two models, when actually l_{B} should be chosen as the best one.
Probabilistic Interleaving
The probabilistic interleaving approach overcomes the issue described for balanced interleaving and teamdraft with a new way of constructing the interleaved result list l_{I} . In particular, the algorithm relies on the creation of a probability distribution over the ranking lists. This distribution allows every document to have a nonzero probability to be added in the interleaved result list.
The implemented algorithm is the following:
The equation for computing the probability distribution, given a ranked list of documents, is:
Let’s explain a bit more how it works.
First of all, we have to spend some more words on the computation of the probability distribution (line 3). Suppose to start with model_A, in line (3) we have the initialization of s(lA). What does it mean?
The idea behind this step is to associate to each document of l_{A} (and after of l_{B}) a probability. This probability is computed using a softmax function on the rank of each document. This allows us to obtain a higher probability for documents with higher ranks and lower probability for documents with a lower rank. In this way, documents at the top of the list will have a higher probability to be chosen as we expected, since we ideally would like to maintain the relative rank as much as possible. Moreover, all the documents of l_{A} have a nonzeroprobability to be chosen for l_{I}, increasing the fairness of the algorithm.
Once obtained the distribution for both the models (model_A and model_B), we start to create the interleaved list l_{I} .
At every loop:

 A random model is selected (as in teamdraft interleaving), the choice is stored both in the assignment variable and in the assignment vector. The last one will report the final sequence of assignments that happened during the interleaved list creation.
 The not_assignment variable is upgraded with the opposite value of the assignment variable. This variable is needed by the remove_and_renormalize() method in order to upgrade the probability distribution after the documents selection.
 A document is selected by the s(l_{assignment}) model distribution. The way in which the choice happen is through a sample without replacement from the probability distribution of the chosen model. This document is then added to l_{I} .
 The selected documents are removed from the ranked list of the not chosen model. The probability distribution is then recomputed due to this removal.
After the realization of l_{I} an evaluation similar to the one of the previous methods can be applied. Alternatively, a probabilistic comparison with marginalization can be done. I’m not going to explain it, but if you are interested in this topic I recommend you to read [1, 2].
This implementation, as said in [1], ensure no bias. First because, in expectation, each ranker contributes the same number of documents to l_{I} .
Second, because the softmax function, constructed for each ranker, has the same shape and thus, the probability allocated to each document, depends only on its rank.
Third, because the use of assignments guarantees that each click is attributed to only one list, as in the teamdraft method.
Drawbacks
The use of a probability distribution for the document selection increases the fairness of the algorithm, but it could also lead to a worse user experience. It may indeed happen that documents with a low rank are selected and put higher in the final list l_{I} .
Summary
Let’s recap the main points of this blog post.
Make online testing is very important because it allows us to have direct feedback on how our model is performing. All the obtained results are computed on user interactions, which are the most reliable form of the relevance expression.
There are several advantages for interleaving with respect to A/B testing:

 Reduces variance: in the interleaving approach, unlike A/B testing, there isn’t a separation of the users in groups. Here the same user evaluates at the same time both the models, making it possible to execute a direct comparison between them. This approach overcomes the problem of the users’ variance that can come out with the query traffic separation.
 More sensitive: due to the high user variance, the A/B testing is also less sensitive to differences between models. The small is the difference between models, the more difficult it is to evaluate the variation.
 Less traffic: due to less sensitivity, the A/B testing requires much more traffic than interleaving to achieve the same result. We need to analyze many user interactions to understand if the obtained results are reliable.
 Less time: due to the much traffic required, A/B testing takes longer to execute than interleaving. The more time we run the test, the more interactions we collect.
There are three main implementations in interleaving:

 Balanced Interleaving: based on a simple alternation of documents coming from both the models we are evaluating. It allows us to present results from both the models, but it’s not able to correctly choose the best one when they generate very similar rankings.
 TeamDraft Interleaving: based on the player’s selection done from captains in a teammatch. As for the balanced interleaving, also here there is a specific case in which the method isn’t able to select the correct method as the winner.
 Probabilistic Interleaving: based on the generation of a probability distribution associated with each model we want to evaluate. This method can lead to a worse interleaved list in terms of ranking, but it is more fair, sensitive, and present less bias than the other two approaches.
REFERENCES
[1] Hofmann, Katja, Shimon Whiteson, and Maarten De Rijke. “A probabilistic method for inferring preferences from clicks.” Proceedings of the 20th ACM international conference on Information and knowledge management. 2011.
[2] Hofmann, Katja. “Fast and reliable online learning to rank for information retrieval.” SIGIR Forum. Vol. 47. No. 2. 2013.
[3] Chapelle, Olivier, et al. “Largescale validation and analysis of interleaved search evaluation.” ACM Transactions on Information Systems (TOIS) 30.1 (2012): 141.
[4] Hofmann, Katja, Shimon Whiteson, and Maarten De Rijke. “Fidelity, soundness, and efficiency of interleaved comparison methods.” ACM Transactions on Information Systems (TOIS) 31.4 (2013): 143.
Shameless plug for our training and services!
Did I mention we do Learning To Rank and Search Quality Evaluation training?
We also provide consulting on these topics, get in touch if you want to bring your search engine to the next level!
Subscribe to our newsletter
Did you like this post about Online Testing in Learning to Rank? Don’t forget to subscribe to our Newsletter to stay always updated from the Information Retrieval world!
Related
Author
Anna Ruggero
Anna Ruggero is a software engineer passionate about Information Retrieval and Data Mining. She loves to find new solutions to problems, suggesting and testing new ideas, especially those that concern the integration of machine learning techniques into information retrieval systems.