If you have read Part 1 of this blog post, you should know by now how many fantastic things can be done with online testing! In particular, the advantages that interleaving brings with respect to A/B testing, but you are still waiting for the answer to a question: *how to implement it?*

Let’s see together some of these interleaving implementations. First of all, let’s describe the first implemented interleaving approach, then I will present to you the two most used ones.

## BALANCED INTERLEAVING

The first implementation is: **balanced interleaving**.

Balanced interleaving is the simplest method for interleaving. It always puts in the top *k *results of the result list *l _{I}*, documents from the top

*k*results of both lists

*l*and

_{A}*l*

_{B}.Suppose to have two pointers

*k*and

_{a}*k*, for the two ranked lists

_{b}*l*and

_{A}*l*, always pointing to the higher result of each list.

_{B}The implemented algorithm is the following:

Let’s make an example to explain everything more clearly.

Suppose to have *l _{A}* = (

*a*,

_{1}*a*, …) and

_{2}*l*= (

_{B}*b*,

_{1}*b*, …).

_{2}We give a priority to one of the two models with the randomBit() function (line 2). The chosen model will decide first which documents to add to the final list. Suppose we select

*model_A*:

*Afirst = 1*.

Then we will repeat this process unless we have at least one document in both

*l*and

_{A}*l*:

_{B}- If (
*k*<_{a}*k*) or (_{b}*k*=_{a}*k*and_{b}*model_A*is the favorite one): if the document pointed by*k*isn’t already in_{a}*l*, we insert it._{I} - Else: if the document pointed by
*k*isn’t already in_{b}*l*, we insert it._{I}

At the end of this loop, we will have added all the documents from *l _{A}* and

*l*(with at most one discard) in

_{B}*l*.

_{I}We can, therefore, see that in this way we create a unique list *l _{I}* containing documents from both the model, respecting the relative rank of the documents inside their original list (

*l*and

_{A}*l*) .

_{B}**How to evaluate the results?**

The obtained result list *l _{I}* = (

*i*,

_{1}*i*, …) is now shown to the user, who clicks those documents that mostly reflect his information need. Suppose to call

_{2}*c*,

_{1}*c*, … the ranks of the clicked documents and

_{2}*c*.

_{max}= max_{i}c_{i}In order to derive a preference between the two models, we count the number of clicked documents from each model in the top

*k*results of

*l*, where:

_{I}*k = min*{

*j*: (

*i*=

_{cmax}*a*) or (

_{j}*i*=

_{cmax}*b*)} .

_{j}In particular the number of clicked documents for each model is:

*h*= |{

_{a}*c*:

_{j}*i*∈ (

_{cj}*a*, …,

_{1}*a*)}| for

_{k}*model_*A

*h*= |{

_{b}*c*:

_{j}*i*∈ (

_{cj}*b*, …,

_{1}*b*)}| for

_{k}*model_B*

If

*h*>

_{a}*h*

_{b}*model_A*will be the favorite one, if

*h*<

_{a}*h*

_{b}*model_B*will be the favorite one, if

*h*=

_{a}*h*we have a tie.

_{b}This is a preference derived for the query *q*, but many other queries can be done during the online evaluation. How can we derive a preference for all the queries?

Let *wins*(A) be the number of comparisons in which model_A won, wins(B) the number of comparisons in which *model_B* won and *ties*(A,B) the number of times we have a tie.

To compute the overall preference we use another statistical measure:

A positive value of Δ_{AB} means that *model_A* is the favorite one, a negative value of Δ_{AB} means that *model_B* is the favorite one.

**Drawbacks**

One drawback of this approach arises when comparing two models that produce very similar ranking lists. If for example the ranking list of *model_A *is identical to the one of *model_B* with the exception of the first document: *l _{A}* = (

*a*,

*b*, c, d) and

*l*= (

_{B}*b*,

*c*, d, a). The comparison phase will bring the

*model_B*to win more often than

*model_A*. This happens regardless of the model chosen by

*RandomBit()*.

This drawback arises due to the way in which the

*k*parameter is chosen (the min rank that includes all the clicked documents) and to the fact that

*model_B*rank higher than

*model_A*all documents with the exception of

*a*.

## TEAM-DRAFT INTERLEAVING

The **team-draft interleaving** approach overcomes the issue described for balanced interleaving with a new way of constructing the interleaved result list *l _{I}* . In particular, the algorithm relies on the widely used method of team captains, in team-matches, for the selection of the players.

Suppose to have two captains *A* and *B*. Suppose to choose one of the two by chance at each round. Then the selected captain will choose his favorite player still available, add him to the team and then leave the turn to the other captain that will do the same. This will repeat unless there are players available.

The implemented algorithm is the following:

Let’s make also here an example to explain everything more clearly.

Suppose to have *l _{A}* = (

*a*,

_{1}*a*, …) and

_{2}*l*= (

_{B}*b*,

_{1}*b*, …).

_{2}Here, like in balanced interleaving, the choice of the starting model is given by

*randomBit()*, but this choice is done at every round and not only at the beginning of the algorithm.

We will repeat this process unless we have at least one document in both

*l*and

_{A}*l*:

_{B}- If (the size of
*TeamA*is smaller than the one of*TeamB*) or (the two teams have the same size and*model_A*has the priority):*k*will take the rank of the top document in*l*not yet in_{A}*l*; this document will be added to the interleaved list_{I}*l*and to the_{I}*TeamA*set (this set record all the documents that are taken from*model_A*). - Else:
*k*will take the rank of the top document in*l*not yet in_{B}*l*; this document will be added to the interleaved list_{I}*l*and to the_{I}*TeamB*set (this set record all the documents that are taken from*model_B*).

At the end of this loop, we have added all the documents from *l _{A}* and

*l*(with at most one discard) in

_{B}*l*. We will also obtain the list of documents belonging to

_{I}*TeamA*and

*TeamB*.

**How to evaluate the results?**

Again, the obtained result list *l _{I}* = (

*i*,

_{1}*i*, …) is shown to the user, who clicks those documents that mostly reflect his information need. Suppose to call

_{2}*c*,

_{1}*c*, … the ranks of the clicked documents.

_{2}In order to derive a preference between the two models, we count the number of clicked documents in each

*Team*:

*h*= |{

_{a}*c*:

_{j}*i*∈

_{cj}*TeamA*| for

*model_*A

*h*= |{

_{b}*c*:

_{j}*i*∈

_{cj}*TeamB*}| for

*model_B*

If

*h*>

_{a}*h*

_{b}*model_A*will be the favorite one, if

*h*<

_{a}*h*

_{b}*model_B*will be the favorite one, if

*h*=

_{a}*h*we have a tie.

_{b}In the same way as the balanced interleaving, we can derive a preference for all the queries.

We compute *wins(A)*, *wins(B)*, *ties(A,B)* and then Δ_{AB} .

**Drawbacks**

**Drawbacks**

Also for team-draft interleaving there is a drawback.

Suppose to have two ranked lists: *l _{A}* = (

*a*,

*b*, c, d) and

*l*= (

_{B}*b*,

*c*, d, a). Suppose

*c*to be the only relevant document.

With this approach we can obtain four different interleaved lists:

*l*= (

_{I1}*a*,

_{A}*b*,

_{B}*c*,

_{A}*d*)

_{B}*l*= (

_{I2}*b*,

_{B}*a*,

_{A}*c*,

_{B}*d*)

_{A}*l*= (

_{I3}*b*,

_{B}*a*,

_{A}*c*,

_{A}*d*)

_{B}*l*= (

_{I4}*a*,

_{A}*b*,

_{B}*c*,

_{B}*d*)

_{A}All of them putting

*c*at the same rank.

Applying our preference computation, it will result in a tie between the two models, when actually

*l*should be chosen as the best one.

_{B}## PROBABILISTIC INTERLEAVING

The **probabilistic interleaving** approach overcomes the issue described for balanced interleaving and team-draft with a new way of constructing the interleaved result list *l _{I}* . In particular, the algorithm relies on the creation of a probability distribution over the ranking lists. This distribution allows every document to have a non-zero probability to be added in the interleaved result list.

The implemented algorithm is the following:

The equation for computing the probability distribution, given a ranked list of documents, is:

Let’s explain a bit more how it works.

First of all, we have to spend some more words on the computation of the probability distribution (line 3). Suppose to start with *model_A, in line (3) we have the initialization of s(lA). What does it mean*?

The idea behind this step is to associate to each document of *l _{A}* (and after of

*l*) a probability. This probability is computed using a

_{B}*softmax*function on the rank of each document. This allows us to obtain a higher probability for documents with higher ranks and lower probability for documents with a lower rank. In this way, documents at the top of the list will have a higher probability to be chosen as we expected, since we ideally would like to maintain the relative rank as much as possible. Moreover, all the documents of

*l*have a non-zero-probability to be chosen for

_{A}*l*, increasing the fairness of the algorithm.

_{I}Once obtained the distribution for both the models (*model_A* and *model_B*), we start to create the interleaved list *l _{I}* .

At every loop:

- A random model is selected (as in team-draft interleaving), the choice is stored both in the
*assignment*variable and in thevector. The last one will report the final sequence of assignments that happened during the interleaved list creation.**assignment** - The
*not_assignment*variable is upgraded with the opposite value of the*assignment*variable. This variable is needed by the*remove_and_renormalize()*method in order to upgrade the probability distribution after the documents selection. - A document is selected by the
*s(l*model distribution. The way in which the choice happen is through a sample without replacement from the probability distribution of the chosen model. This document is then added to_{assignment})*l*._{I} - The selected documents are removed from the ranked list of the not chosen model. The probability distribution is then recomputed due to this removal.

After the realization of *l _{I}* an evaluation similar to the one of the previous methods can be applied. Alternatively, a probabilistic comparison with marginalization can be done. I’m not going to explain it, but if you are interested in this topic I recommend you to read [1, 2].

This implementation, as said in [1], ensure no bias. First because, in expectation, each ranker contributes the same number of documents to *l _{I}* .

Second, because the

*softmax*function, constructed for each ranker, has the same shape and thus, the probability allocated to each document, depends only on its rank.

Third, because the use of assignments guarantees that each click is attributed to only one list, as in the team-draft method.

**Drawbacks**

**Drawbacks**

The use of a probability distribution for the document selection increases the fairness of the algorithm, but it could also lead to a worse user experience. It may indeed happen that documents with a low rank are selected and put higher in the final list* l _{I} .*

## Summary

Let’s recap the main points of this blog post.

Make online testing is very important because it allows us to have direct feedback on how our model is performing. All the obtained results are computed on user interactions, which are the most reliable form of the relevance expression.

There are several advantages for interleaving with respect to A/B testing:

**Reduces variance:**in the interleaving approach, unlike A/B testing, there isn’t a separation of the users in groups. Here the same user evaluates at the same time both the models, making it possible to execute a direct comparison between them. This approach overcomes the problem of the users’ variance that can come out with the query traffic separation.**More sensitive:**due to the high user variance, the A/B testing is also less sensitive to differences between models. The small is the difference between models, the more difficult it is to evaluate the variation.**Less traffic:**due to less sensitivity, the A/B testing requires much more traffic than interleaving to achieve the same result. We need to analyze many user interactions to understand if the obtained results are reliable.**Less time:**due to the much traffic required, A/B testing takes longer to execute than interleaving. The more time we run the test, the more interactions we collect.

There are three main implementations in interleaving:

**Balanced Interleaving:**based on a simple alternation of documents coming from both the models we are evaluating. It allows us to present results from both the models, but it’s not able to correctly choose the best one when they generate very similar rankings.**Team-Draft Interleaving:**based on the player’s selection done from captains in a team-match. As for the balanced interleaving, also here there is a specific case in which the method isn’t able to select the correct method as the winner.**Probabilistic Interleaving:**based on the generation of a probability distribution associated with each model we want to evaluate. This method can lead to a worse interleaved list in terms of ranking, but it is more fair, sensitive, and present less bias than the other two approaches.

#### REFERENCES

[1] Hofmann, Katja, Shimon Whiteson, and Maarten De Rijke. “A probabilistic method for inferring preferences from clicks.” *Proceedings of the 20th ACM international conference on Information and knowledge management*. 2011.

[2] Hofmann, Katja. “Fast and reliable online learning to rank for information retrieval.” *SIGIR Forum*. Vol. 47. No. 2. 2013.

[3] Chapelle, Olivier, et al. “Large-scale validation and analysis of interleaved search evaluation.” *ACM Transactions on Information Systems (TOIS)* 30.1 (2012): 1-41.

[4] Hofmann, Katja, Shimon Whiteson, and Maarten De Rijke. “Fidelity, soundness, and efficiency of interleaved comparison methods.” *ACM Transactions on Information Systems (TOIS)* 31.4 (2013): 1-43.