If you got here, it is most likely because you want to find the answer to the question we left open in the first part of this blog post. Welcome back!
Not sure what we are talking about? We invite you to read Part 1, where we cover different techniques for partitioning data into subsets used for training and evaluating ‘Learning to Rank’ models.
The goal of this blog post is to focus on ‘Learning to Rank’ datasets, where documents are divided into several groups depending on the query; as a result, data splitting must be handled differently than in other supervised machine learning techniques, and that is what we will cover here.
Let’s start again with the same question:
What is important to consider in a 'Learning to Rank' scenario?
In a ‘Learning to Rank’ scenario, we have to take into consideration two important features: the query and the relevance label.
Each sample in the dataset is a query-document pair, represented in vector format. The query column contains an Id used to group samples coming from the same user search; the relevance label, the target column, contains the relevance score that indicates how relevant each sample is within its query group.
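To make this concrete, here is a minimal sketch of what such a dataset might look like, using pandas (the column names and values are purely illustrative):

```python
import pandas as pd

# Toy 'Learning to Rank' dataset: each row is a query-document pair.
# 'query_id' groups samples coming from the same user search;
# 'relevance' is the target column (graded from 0 to 4, for example);
# the remaining columns form the feature vector of the pair.
data = pd.DataFrame({
    "query_id":  [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "feature_1": [0.2, 0.5, 0.1, 0.9, 0.3, 0.7, 0.4, 0.8, 0.6],
    "feature_2": [1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0],
    "relevance": [3, 0, 1, 4, 2, 0, 1, 2, 4],
})

# Samples are grouped by query Id; within each group the relevance
# label says how relevant the document is for that query.
print(data.groupby("query_id").size())
```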
What is the most appropriate approach to handle queries (and relevance labels) when splitting data?
In our old approach (covered in our previous ‘Learning to Rank’ blog posts), we adopted the hold-out method (described in Part 1), splitting each query Id between the training and test sets according to the following “rules”:
- We kept the query Ids with a number of observations under a certain threshold in the training set, unless the test set would contain less than 20% of the entire dataset. In that case, since the test set could be too small, we moved some of these undersampled queries (the query Ids with the highest number of observations) to the test set. The reason is that queries with few search results are easier to rank, so if present in the test set they make the overall re-ranking task somewhat easier, producing higher metric scores that may not be indicative or reflect reality.
- We tried to keep the relevance labels equally distributed per query, manually selecting and moving to the test set 20% of all the observations for each relevance label. In doing this, we made sure all the relevance labels were present in both sets.
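The second rule can be sketched in code roughly as follows. This is a hypothetical reconstruction, not our actual implementation: the function name, column names, and the handling of single-observation labels are all assumptions made for illustration.

```python
import pandas as pd

def split_by_relevance(df, test_fraction=0.2, seed=42):
    """Hypothetical sketch of the old rule: for each query and each
    relevance label, move ~20% of the observations to the test set,
    so that every label present in a query appears in both sets
    whenever it has enough observations."""
    test_parts = []
    for _, grp in df.groupby(["query_id", "relevance"]):
        # Move at least one observation when the label has more than
        # one, otherwise keep the single observation in training.
        n_test = max(1, int(len(grp) * test_fraction)) if len(grp) > 1 else 0
        test_parts.append(grp.sample(n=n_test, random_state=seed))
    test = pd.concat(test_parts)
    train = df.drop(test.index)
    return train, test

# Illustrative usage on a toy dataset.
toy = pd.DataFrame({
    "query_id":  [1, 1, 1, 1, 2, 2, 2, 2, 2],
    "relevance": [0, 0, 1, 1, 0, 0, 0, 1, 1],
    "feature_1": range(9),
})
train, test = split_by_relevance(toy)
```

Note how each query Id ends up split across both sets, which is exactly the property the new approach below moves away from.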
Train-test split example, splitting the query Ids
If you are curious, you will find older posts in our blog describing this approach in more detail.
In our new approach, we implemented a different train-test split that keeps samples from the same query together when creating the two sets.
In particular, we exploited the scikit-learn StratifiedGroupKFold method, which attempts to create training and test sets that preserve the sample distribution for each relevance label as much as possible, given the constraint of non-overlapping groups between splits. This means that it tries to preserve the distribution of relevance labels (in the range from 0 to 4, for example) while keeping each query intact. This allows us to obtain more balanced training and test sets.
Train-test split example, keeping together the query Ids
In Part 1, we included several code examples that use this approach, showing different data-splitting methods.
Here you can find another example implementation that uses a different sklearn.model_selection method to split data according to a provided group, but does not take the relevance labels into account.
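We are not reproducing that implementation here, but one sklearn.model_selection method that splits purely by group while ignoring the labels is GroupShuffleSplit; a minimal sketch on the same kind of synthetic data:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Illustrative data, as before: y is ignored by this splitter.
rng = np.random.RandomState(0)
X = rng.rand(100, 5)                   # feature vectors
y = rng.randint(0, 5, size=100)        # relevance labels (unused for splitting)
groups = rng.randint(0, 20, size=100)  # query Ids

# GroupShuffleSplit keeps each query Id entirely in one set, but
# makes no attempt to balance the relevance-label distribution.
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(gss.split(X, y, groups))

assert set(groups[train_idx]).isdisjoint(groups[test_idx])
```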
A valid and trustworthy test shows the model data it has never seen before! In fact, the effectiveness of a model must be evaluated in the worst-case scenario, that is, on data not used at training time.
Based on the above, both approaches appear correct, since each ensures the test set does not contain any documents/products present in the training set.
However, we chose the query integrity approach because we think it is more conceptually correct than the query splitting when talking about ranking. If a system has to learn how to rank a list of documents, how could it best do it if some documents were removed from the list?!
In fact, even though query splitting ensures all the relevance labels (related to a specific query) appear in both sets, we are still removing from the training set some information about the overall ranking of each query, so the model may be less able to generalize.
Furthermore, some queries may have very few observations with high relevance, which will inevitably end up in only one of the two sets (either in training or in test).
Query splitting is not entirely wrong, but again, we cannot be sure we are not introducing an evaluation ‘bias’. It is true that the test set contains different data, but it still contains products that are indirectly related to the products in the training set belonging to the same query, and this could distort the performance results.
What is your opinion? Which approach are you using?
Would love your thoughts, please share your opinion in the comments below!
We hope we have properly conveyed our doubts and observations regarding query splitting and query integrity in a ‘Learning to Rank’ scenario, and we look forward to openly discussing this important matter with our fantastic community.
See you in the next blog posts!
Shameless plug for our training and services!
Did I mention we do Learning To Rank training?
We also provide consulting on this topic, get in touch if you want to bring your search engine to the next level with the power of Learning to Rank!
Subscribe to our newsletter
Did you like this post about Train and Test Sets Split for Evaluating Learning To Rank Models? Don’t forget to subscribe to our Newsletter to stay up to date with the Information Retrieval world!