Offline Search Quality Evaluation

Introduction

With Rated Ranking Evaluator Enterprise approaching, we take the opportunity to explain in detail why Offline Search Quality Evaluation is so important nowadays and what you can already do with the Rated Ranking Evaluator open-source libraries. More news will come soon as we approach the V1 release date. Stay tuned!

Search Quality Evaluation

Evaluation is fundamental in every scientific development. Scientists come up with hypotheses to model real-world phenomena, and validate them by comparing their output with observations in nature. Evaluation plays the same key role in the field of information retrieval. Researchers and practitioners develop ranking models to explain the relationship between an information need expressed by a user (query) and the information (search results) contained in available resources (corpus), and test these models by comparing their outcomes with a collection of observations (implicit/explicit user feedback). Search quality has more than one interpretation; this blog focuses on only one of them: the effectiveness of a search system in finding the information relevant to the user (search relevance).
There are two types of observations used for the purpose of evaluation: (a) explicit feedback (relevance annotations), and (b) implicit feedback (observable user interactions).
Most of the time, evaluation of information retrieval systems happens casually, offline and/or online, with one or more stakeholders running a non-reproducible set of queries and judging the search results on the feeling of the moment (which may change later). It’s necessary to bring a scientific structure to the process:
  • collect the ground truth observations (Ratings from implicit/explicit feedback)
  • run a finite and well-defined list of corresponding queries against the search system
  • compare the results with the ground truth, exploring different search quality metrics
This must be reproducible, persistent and explainable. Here comes Rated Ranking Evaluator (hereafter RRE), an open-source library for offline search quality evaluation of Apache Lucene-based search engines (Apache Solr and Elasticsearch).
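The three steps above can be sketched in a few lines of Python. This is only an illustration of the process, not RRE code: the `ratings` data, the `fake_index` stub standing in for the search system, and the choice of precision at k as the metric are all assumptions made for the example.

```python
from typing import Callable, Dict, List

# Ground truth: query -> {document id -> gain}.
# Documents that are absent from the inner dict are considered not relevant.
Ratings = Dict[str, Dict[str, int]]


def precision_at_k(ranked_ids: List[str], relevant: Dict[str, int], k: int) -> float:
    """Fraction of the top-k results that appear in the ground truth."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant)
    return hits / k


def evaluate(ratings: Ratings, search: Callable[[str], List[str]], k: int = 3) -> Dict[str, float]:
    """Run every rated query through `search` and score the returned ranking."""
    return {query: precision_at_k(search(query), relevant, k)
            for query, relevant in ratings.items()}


# Hypothetical data: a dictionary lookup stands in for the real search system.
ratings = {
    "solr tutorial": {"doc1": 3, "doc2": 2},
    "lucene scoring": {"doc7": 3},
}
fake_index = {
    "solr tutorial": ["doc1", "doc9", "doc2"],
    "lucene scoring": ["doc4", "doc5", "doc6"],
}
scores = evaluate(ratings, lambda q: fake_index.get(q, []), k=3)
print(scores)
```

Because the query list and the ratings are fixed data rather than ad hoc manual checks, the same run can be repeated after every change and the numbers compared.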

What is it?

The Rated Ranking Evaluator (RRE) is a search quality evaluation library which evaluates the quality of results coming from a search system. It helps a Search Engineer and the business stakeholders alike:
  • Are you tuning/implementing/changing/configuring a search infrastructure?
  • Do you want evidence of the improvements derived from your latest changes?
RRE helps you with that. RRE formalises how well a search system satisfies the user's information needs at a “technical” level, combining the ability to express ground truth ratings with the assessment of a search system's quality through several evaluation metrics. RRE provides human-readable outputs that are also useful for non-technical stakeholders. It encourages an incremental, iterative and reproducible approach during the development and evolution of a search system: assuming we start from version x.y, when it’s time to apply a relevant change to its configuration, the new version is evaluated in isolation (source control helps here in keeping track of evolving configuration files). RRE executes the evaluation process for all the search system versions given as input and provides the delta/trend between them, so you can immediately spot improvements or regressions in the metrics you are interested in.
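To make the idea of a delta between versions concrete, here is a minimal sketch in Python. It does not reproduce RRE's output format: the version names, metric names and values in `history` are invented for illustration.

```python
from typing import Dict, List, Tuple

# One entry per evaluated version, in chronological order: (version tag, {metric: value}).
History = List[Tuple[str, Dict[str, float]]]


def metric_deltas(history: History) -> List[Tuple[str, str, Dict[str, float]]]:
    """For each consecutive pair of versions, return the per-metric difference (new - old)."""
    deltas = []
    for (old_name, old), (new_name, new) in zip(history, history[1:]):
        shared = old.keys() & new.keys()  # only compare metrics present in both versions
        changes = {metric: round(new[metric] - old[metric], 4) for metric in sorted(shared)}
        deltas.append((old_name, new_name, changes))
    return deltas


# Hypothetical metric values for two versions of a search configuration.
history = [
    ("v1.0", {"P@10": 0.50, "NDCG@10": 0.61}),
    ("v1.1", {"P@10": 0.55, "NDCG@10": 0.58}),
]

for old_name, new_name, changes in metric_deltas(history):
    for metric, change in changes.items():
        label = "improvement" if change > 0 else "regression" if change < 0 else "unchanged"
        print(f"{old_name} -> {new_name} {metric}: {change:+.4f} ({label})")
```

In this made-up example the change from v1.0 to v1.1 improves one metric and degrades the other, which is exactly the kind of trade-off a per-version delta makes visible.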

Input

In order to execute an evaluation, RRE needs the following things:
  • One or more corpora / test collections: representative datasets of a specific domain, used for populating and querying a target search platform
  • One or more configuration sets: the Apache Solr/Elasticsearch configuration(s) that regulate your target search engine
  • One or more ratings sets: the ground truth, a list of <query, document> pairs associated with a relevance rating

Ground Truth Definition (Ratings)

First of all, you need to define your ground truth: <query, document> pairs tagged with a relevance label that states how relevant a document is for the given query. The rating files, which are provided in JSON format, are the core input of RRE. Each rating file is a structured set of <query, document> pairs (i.e. relevant documents for a given query). In the ratings file, we can define all aspects of the information need supported by RRE: corpus, topics, query groups, and queries. The current implementation uses a configurable judgement range, e.g.:
  • 1 => marginally relevant
  • 2 => relevant
  • 3 => very relevant
Within the “relevant_documents” node, you can provide the judgements in one of the following (alternative) ways:
"relevant_documents": {
   "docid1": {
       "gain": 2
   }, 
   "docid2": {
       "gain": 2
   }, 
   "docid3": {
       "gain": 3
   }, 
   "docid5": {
       "gain": 2
   }, 
   "docid99": {
       "gain": 3
   }
}

or, grouping the document IDs by gain value:

"relevant_documents": {
   "2": ["docid1", "docid2", "docid5"],
   "3": ["docid3", "docid99"]   
}
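Since the two notations carry the same information, a consumer of the ratings can reduce both to a single document-to-gain mapping. The following Python sketch shows one way to do that; the function name `normalise_judgements` is hypothetical and this is not how RRE itself parses the node.

```python
from typing import Any, Dict


def normalise_judgements(node: Dict[str, Any]) -> Dict[str, int]:
    """Turn either notation of "relevant_documents" into {document id: gain}."""
    gains: Dict[str, int] = {}
    for key, value in node.items():
        if isinstance(value, dict):
            # Explicit notation: {"docid1": {"gain": 2}}
            gains[key] = int(value["gain"])
        elif isinstance(value, list):
            # Grouped notation: {"2": ["docid1", "docid2"]}
            for doc_id in value:
                gains[doc_id] = int(key)
        else:
            raise ValueError(f"Unsupported judgement entry for key {key!r}")
    return gains


explicit = {
    "docid1": {"gain": 2},
    "docid2": {"gain": 2},
    "docid3": {"gain": 3},
    "docid5": {"gain": 2},
    "docid99": {"gain": 3},
}
grouped = {
    "2": ["docid1", "docid2", "docid5"],
    "3": ["docid3", "docid99"],
}

# Both notations describe the same judgements.
print(normalise_judgements(explicit) == normalise_judgements(grouped))
```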
Note that if a document is not relevant for a given query, it doesn’t appear in the relevant list. In other words, there is no “0” judgement value. A more complete example of a rating file:
{
  "index": "<string>",
  "corpora_field": "<string>",
  "id_field": "<string>",
  "topics": [
    {
      "description": "<string>",
      "query_groups": [
        {
          "name": "<string>",
          "queries": [
            {
              "template": "<string>",
              "placeholders": {
                "$key": "<value>"
              }
            }
          ],
          "relevant_documents": [
            {
              "document_id": {
                "gain": "<number>"
              }
            }
          ]
        }
      ]
    }
  ],
  "query_groups": [
    {
      "name": "<string>",
      "queries": [
        {
          "template": "<string>",
          "placeholders": {
            "$key": "<value>"
          }
        }
      ],
      "relevant_documents": [
        {
          "document_id": {
            "gain": "<number>"
          }
        }
      ]
    }
  ],
  "queries": [
    {
      "template": "<string>",
      "placeholders": {
        "$key": "<value>"
      }
    }
  ]
}
  • index: index name in Elasticsearch / collection name in Solr
  • corpora_field: corresponds to the corpus filename; not required for an external Solr/Elasticsearch instance
  • id_field: field in the schema that represents the document ID
  • topics: optional list of topics and/or query groups
    • description: title of the topic, used in the reporting output
    • query_groups: list of queries that are grouped under the same name and topic
      • name: name used in the reporting output
      • queries: list of queries to execute for the topic
        • template: name of the template this query uses
        • placeholders: object of key-value pairs to substitute in the template
      • relevant_documents: list of objects mapping documents to gain (relevance) values
  • query_groups: optional list of objects with related queries; can exist outside of topics
  • queries: required* list of objects with the template and placeholder substitutions for the evaluation
    • *only if topics and query_groups do not exist
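Because queries can be declared at three levels (inside a topic's query groups, inside top-level query groups, or directly at the top level), code that consumes a ratings file has to walk all three. The Python sketch below illustrates such a walk. The sample document, including the index name and the `only_q.json` template name, is invented for the example, and the traversal is an assumption based on the structure shown above rather than RRE's actual loader.

```python
import json
from typing import Any, Dict, Iterator, Optional, Tuple

# (topic description, query group name, template name, placeholders)
QueryRow = Tuple[Optional[str], Optional[str], str, Dict[str, Any]]


def iter_queries(ratings: Dict[str, Any]) -> Iterator[QueryRow]:
    """Yield every query in a ratings document, whatever level it is declared at."""
    # Queries nested as topics -> query_groups -> queries
    for topic in ratings.get("topics", []):
        for group in topic.get("query_groups", []):
            for query in group.get("queries", []):
                yield (topic.get("description"), group.get("name"),
                       query["template"], query.get("placeholders", {}))
    # Query groups declared outside of any topic
    for group in ratings.get("query_groups", []):
        for query in group.get("queries", []):
            yield None, group.get("name"), query["template"], query.get("placeholders", {})
    # Bare top-level queries
    for query in ratings.get("queries", []):
        yield None, None, query["template"], query.get("placeholders", {})


# Hypothetical ratings document, invented for illustration only.
sample = json.loads("""
{
  "index": "products",
  "id_field": "id",
  "topics": [
    {
      "description": "Brand search",
      "query_groups": [
        {
          "name": "brand by name",
          "queries": [
            {"template": "only_q.json", "placeholders": {"$query": "fender"}},
            {"template": "only_q.json", "placeholders": {"$query": "Fender"}}
          ],
          "relevant_documents": [{"doc1": {"gain": 3}}]
        }
      ]
    }
  ],
  "queries": [
    {"template": "only_q.json", "placeholders": {"$query": "guitar"}}
  ]
}
""")

rows = list(iter_queries(sample))
for row in rows:
    print(row)
```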

Query Templates

For each query (or for each query group) it’s possible to define a query template, which is the definition of the query containing one or more placeholders. Then, in the rating file, you can reference one of those defined templates and you can provide a value for each placeholder. Templates have been introduced in order to:
  • allow common query management between search platforms
  • define complex queries
  • define runtime parameters that cannot be statically determined (e.g. filters)
In the picture below you can see three examples: the first two are examples of Solr queries, while the third uses Elasticsearch.
[Figure: query_templates]
You may also create multiple version folders (v1.0, v1.1, v1.2, etc.) and keep different query template versions in those folders. This will run evaluations for each query template version and allows you to easily compare results across query template changes.
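The relationship between a template and the placeholder values supplied in the rating file can be sketched as a plain string substitution. The Python example below is a simplified illustration, not RRE's implementation: the template content, the `$query` and `$lang` placeholder names and the `render_template` helper are all hypothetical.

```python
import json
from typing import Any, Dict


def render_template(template: str, placeholders: Dict[str, Any]) -> str:
    """Replace each placeholder key found in the template with its value."""
    rendered = template
    # Substitute longer keys first so that "$query_id" is not clobbered by "$query".
    for key in sorted(placeholders, key=len, reverse=True):
        rendered = rendered.replace(key, str(placeholders[key]))
    return rendered


# Hypothetical template: a query definition with two placeholders.
template = '{"q": "$query", "fq": "language:$lang"}'

# Values as they might be given under "placeholders" in a rating file.
rendered = render_template(template, {"$query": "search quality", "$lang": "en"})
params = json.loads(rendered)
print(params)
```

Note that a naive substitution like this does not escape quotes or other special characters in the values, so a real implementation would need to handle that.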

Search Engine Configuration / External Search Engine

Once the ground truth is defined, it’s possible to define a target search system for the evaluation. The open-source version of RRE supports both embedding a search instance (built from config files) and targeting an external running instance.

Configuration Sets

RRE encourages a reproducible approach: keep track of your versions and clearly assign an immutable tag to each of them. In this way, we’ll end up with the historical progression of our system, and RRE will be able to make comparisons and reproduce evaluations. The content of the configuration sets depends on the target search platform. For Apache Solr, each version folder (see the image below) contains one or more Solr cores. For Elasticsearch, each version folder instead contains a JSON file (the “index-shape.json” in the picture below) with the index settings and mappings.
[Figure: configuration_sets]
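As a rough illustration of the one-folder-per-version layout, the Python sketch below builds a throwaway directory with a few version folders and lists them in version order. The folder names are hypothetical and the ordering logic is an assumption for the example; it says nothing about how RRE itself discovers or orders versions.

```python
import re
import tempfile
from pathlib import Path
from typing import List, Tuple


def version_key(name: str) -> Tuple[int, ...]:
    """Sort key that compares the numeric parts of a tag, so v1.10 comes after v1.2."""
    return tuple(int(part) for part in re.findall(r"\d+", name))


def list_versions(root: Path) -> List[str]:
    """Return the names of the version folders under `root`, oldest first."""
    return sorted((p.name for p in root.iterdir() if p.is_dir()), key=version_key)


# Build a temporary configuration-sets folder with hypothetical version tags.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("v1.10", "v1.2", "v1.0"):
        (root / name).mkdir()
    versions = list_versions(root)

print(versions)
```

A plain alphabetical sort would place v1.10 before v1.2, which is why the sketch compares the numeric parts of each tag instead.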

Execution

The best way to run an RRE evaluation is to use the Maven plugin: set up a Maven project that imports the RRE libraries and executes them. A detailed step-by-step guide is at this link: Run your evaluation!

Output

The output evaluation data is available:
  • as a JSON file: for further elaborations
  • as a spreadsheet: for delivering the evaluation results to someone else (e.g. a business stakeholder)
  • in a Web Console where metrics and their values get refreshed in real time (after each build)

Author

Alessandro Benedetti

Alessandro Benedetti is the founder of Sease Ltd. Senior Search Software Engineer, his focus is on R&D in information retrieval, information extraction, natural language processing, and machine learning.
