If you have read Part 1 of this blog post, you should know by now how many fantastic things can be done with online testing! In particular, the advantages that interleaving brings with respect to A/B testing, but you are still waiting for the answer to a question: how to implement it?

Let’s see together some of these interleaving implementations. First of all, let’s describe the first implemented interleaving approach, then I will present to you the two most used ones.

BALANCED INTERLEAVING

The first implementation is: balanced interleaving.

Balanced interleaving is the simplest method for interleaving. It always puts in the top k results of the result list lI, documents from the top k results of both lists lA and lB .
Suppose to have two pointers ka and kb , for the two ranked lists lA and lB , always pointing to the higher result of each list.

The implemented algorithm is the following:

This image has an empty alt attribute; its file name is balanced-1.jpg

Let’s make an example to explain everything more clearly.

Suppose to have lA = (a1, a2, …) and lB = (b1, b2, …).
We give a priority to one of the two models with the randomBit() function (line 2). The chosen model will decide first which documents to add to the final list. Suppose we select model_A: Afirst = 1.
Then we will repeat this process unless we have at least one document in both lA and lB :

  1. If (ka < kb) or (ka = kb and model_A is the favorite one): if the document pointed by ka isn’t already in lI , we insert it.
  2. Else: if the document pointed by kb isn’t already in lI , we insert it.

At the end of this loop, we will have added all the documents from lA and lB (with at most one discard) in lI .

We can, therefore, see that in this way we create a unique list lI containing documents from both the model, respecting the relative rank of the documents inside their original list ( lA and lB) .

How to evaluate the results?

The obtained result list lI = (i1, i2, …) is now shown to the user, who clicks those documents that mostly reflect his information need. Suppose to call c1, c2, … the ranks of the clicked documents and cmax = maxici .
In order to derive a preference between the two models, we count the number of clicked documents from each model in the top k results of lI , where:
k = min{ j: (icmax= aj) or (icmax= bj)} .
In particular the number of clicked documents for each model is:
ha = |{cj: icj ∈ (a1, …, ak)}| for model_A
hb = |{cj: icj ∈ (b1, …, bk)}| for model_B
If ha > hb model_A will be the favorite one, if ha < hb model_B will be the favorite one, if ha = hb we have a tie.

This is a preference derived for the query q, but many other queries can be done during the online evaluation. How can we derive a preference for all the queries?
Let wins(A) be the number of comparisons in which model_A won, wins(B) the number of comparisons in which model_B won and ties(A,B) the number of times we have a tie.
To compute the overall preference we use another statistical measure:

This image has an empty alt attribute; its file name is delta.jpg

A positive value of ΔAB means that model_A is the favorite one, a negative value of ΔAB means that model_B is the favorite one.

Drawbacks

One drawback of this approach arises when comparing two models that produce very similar ranking lists. If for example the ranking list of model_A is identical to the one of model_B with the exception of the first document: lA = (a, b, c, d) and lB = (b, c, d, a). The comparison phase will bring the model_B to win more often than model_A. This happens regardless of the model chosen by RandomBit().
This drawback arises due to the way in which the k parameter is chosen (the min rank that includes all the clicked documents) and to the fact that model_B rank higher than model_A all documents with the exception of a.

TEAM-DRAFT INTERLEAVING

The team-draft interleaving approach overcomes the issue described for balanced interleaving with a new way of constructing the interleaved result list lI . In particular, the algorithm relies on the widely used method of team captains, in team-matches, for the selection of the players.

Suppose to have two captains A and B. Suppose to choose one of the two by chance at each round. Then the selected captain will choose his favorite player still available, add him to the team and then leave the turn to the other captain that will do the same. This will repeat unless there are players available.

The implemented algorithm is the following:

This image has an empty alt attribute; its file name is team-1.jpg

Let’s make also here an example to explain everything more clearly.

Suppose to have lA = (a1, a2, …) and lB = (b1, b2, …).
Here, like in balanced interleaving, the choice of the starting model is given by randomBit(), but this choice is done at every round and not only at the beginning of the algorithm.
We will repeat this process unless we have at least one document in both lA and lB :

  1. If (the size of TeamA is smaller than the one of TeamB) or (the two teams have the same size and model_A has the priority): k will take the rank of the top document in lA not yet in lI ; this document will be added to the interleaved list lI and to the TeamA set (this set record all the documents that are taken from model_A).
  2. Else: k will take the rank of the top document in lB not yet in lI ; this document will be added to the interleaved list lI and to the TeamB set (this set record all the documents that are taken from model_B).

At the end of this loop, we have added all the documents from lA and lB (with at most one discard) in lI . We will also obtain the list of documents belonging to TeamA and TeamB.

How to evaluate the results?

Again, the obtained result list lI = (i1, i2, …) is shown to the user, who clicks those documents that mostly reflect his information need. Suppose to call c1, c2, … the ranks of the clicked documents.
In order to derive a preference between the two models, we count the number of clicked documents in each Team:
ha = |{cj: icjTeamA| for model_A
hb = |{cj: icjTeamB}| for model_B
If ha > hb model_A will be the favorite one, if ha < hb model_B will be the favorite one, if ha = hb we have a tie.

In the same way as the balanced interleaving, we can derive a preference for all the queries.
We compute wins(A), wins(B), ties(A,B) and then ΔAB .

Drawbacks

Also for team-draft interleaving there is a drawback.
Suppose to have two ranked lists: lA = (a, b, c, d) and lB = (b, c, d, a). Suppose c to be the only relevant document.
With this approach we can obtain four different interleaved lists:
lI1 = (aA, bB, cA, dB)
lI2 = (bB, aA, cB, dA)
lI3 = (bB, aA, cA, dB)
lI4 = (aA, bB, cB, dA)
All of them putting c at the same rank.
Applying our preference computation, it will result in a tie between the two models, when actually lB should be chosen as the best one.

PROBABILISTIC INTERLEAVING

The probabilistic interleaving approach overcomes the issue described for balanced interleaving and team-draft with a new way of constructing the interleaved result list lI . In particular, the algorithm relies on the creation of a probability distribution over the ranking lists. This distribution allows every document to have a non-zero probability to be added in the interleaved result list.

The implemented algorithm is the following:

This image has an empty alt attribute; its file name is probabilistic-2.jpg

The equation for computing the probability distribution, given a ranked list of documents, is:

This image has an empty alt attribute; its file name is softmax.jpg

Let’s explain a bit more how it works.

First of all, we have to spend some more words on the computation of the probability distribution (line 3). Suppose to start with model_A, in line (3) we have the initialization of s(lA). What does it mean?
The idea behind this step is to associate to each document of lA (and after of lB) a probability. This probability is computed using a softmax function on the rank of each document. This allows us to obtain a higher probability for documents with higher ranks and lower probability for documents with a lower rank. In this way, documents at the top of the list will have a higher probability to be chosen as we expected, since we ideally would like to maintain the relative rank as much as possible. Moreover, all the documents of lA have a non-zero-probability to be chosen for lI, increasing the fairness of the algorithm.

Once obtained the distribution for both the models (model_A and model_B), we start to create the interleaved list lI .
At every loop:

  1. A random model is selected (as in team-draft interleaving), the choice is stored both in the assignment variable and in the assignment vector. The last one will report the final sequence of assignments that happened during the interleaved list creation.
  2. The not_assignment variable is upgraded with the opposite value of the assignment variable. This variable is needed by the remove_and_renormalize() method in order to upgrade the probability distribution after the documents selection.
  3. A document is selected by the s(lassignment) model distribution. The way in which the choice happen is through a sample without replacement from the probability distribution of the chosen model. This document is then added to lI .
  4. The selected documents are removed from the ranked list of the not chosen model. The probability distribution is then recomputed due to this removal.

After the realization of lI an evaluation similar to the one of the previous methods can be applied. Alternatively, a probabilistic comparison with marginalization can be done. I’m not going to explain it, but if you are interested in this topic I recommend you to read [1, 2].

This implementation, as said in [1], ensure no bias. First because, in expectation, each ranker contributes the same number of documents to lI .
Second, because the softmax function, constructed for each ranker, has the same shape and thus, the probability allocated to each document, depends only on its rank.
Third, because the use of assignments guarantees that each click is attributed to only one list, as in the team-draft method.

Drawbacks

The use of a probability distribution for the document selection increases the fairness of the algorithm, but it could also lead to a worse user experience. It may indeed happen that documents with a low rank are selected and put higher in the final list lI .

Summary

Let’s recap the main points of this blog post.

Make online testing is very important because it allows us to have direct feedback on how our model is performing. All the obtained results are computed on user interactions, which are the most reliable form of the relevance expression.

There are several advantages for interleaving with respect to A/B testing:

  • Reduces variance: in the interleaving approach, unlike A/B testing, there isn’t a separation of the users in groups. Here the same user evaluates at the same time both the models, making it possible to execute a direct comparison between them. This approach overcomes the problem of the users’ variance that can come out with the query traffic separation.
  • More sensitive: due to the high user variance, the A/B testing is also less sensitive to differences between models. The small is the difference between models, the more difficult it is to evaluate the variation.
  • Less traffic: due to less sensitivity, the A/B testing requires much more traffic than interleaving to achieve the same result. We need to analyze many user interactions to understand if the obtained results are reliable.
  • Less time: due to the much traffic required, A/B testing takes longer to execute than interleaving. The more time we run the test, the more interactions we collect.

There are three main implementations in interleaving:

  • Balanced Interleaving: based on a simple alternation of documents coming from both the models we are evaluating. It allows us to present results from both the models, but it’s not able to correctly choose the best one when they generate very similar rankings.
  • Team-Draft Interleaving: based on the player’s selection done from captains in a team-match. As for the balanced interleaving, also here there is a specific case in which the method isn’t able to select the correct method as the winner.
  • Probabilistic Interleaving: based on the generation of a probability distribution associated with each model we want to evaluate. This method can lead to a worse interleaved list in terms of ranking, but it is more fair, sensitive, and present less bias than the other two approaches.

REFERENCES

[1] Hofmann, Katja, Shimon Whiteson, and Maarten De Rijke. “A probabilistic method for inferring preferences from clicks.” Proceedings of the 20th ACM international conference on Information and knowledge management. 2011.
[2] Hofmann, Katja. “Fast and reliable online learning to rank for information retrieval.” SIGIR Forum. Vol. 47. No. 2. 2013.
[3] Chapelle, Olivier, et al. “Large-scale validation and analysis of interleaved search evaluation.” ACM Transactions on Information Systems (TOIS) 30.1 (2012): 1-41.
[4] Hofmann, Katja, Shimon Whiteson, and Maarten De Rijke. “Fidelity, soundness, and efficiency of interleaved comparison methods.” ACM Transactions on Information Systems (TOIS) 31.4 (2013): 1-43.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.