Haystack 2019 Experience

This post is a quick summary of my (subjective) experience at Haystack 2019, the Search Relevance Conference, held in Charlottesville (Virginia, USA) on 24 and 25 April 2019.
References to the slides will be updated as soon as they become available.

First of all, my feedback on the Haystack Conference is extremely positive.
From my perspective the conference was a success.
Charlottesville is a delightful small city in the heart of Virginia: clean, organised, spacious and definitely relaxing. It was a pleasure to spend my time there.
The venue chosen for the conference was a cinema. Initially I was surprised, but it worked really well: kudos to OpenSource Connections for the idea.
The conference and talks were meticulously organised, on time and with a relaxed pace, which definitely helped both the audience and the speakers to enjoy it more. Thanks to the whole organisation for such quality!
Let’s take a look at the conference itself now: it was 2 days of very interesting talks, exploring the latest trends in the industry with regard to search relevance, with a delightfully tech-agnostic approach.
That was one of my favourite aspects of the conference: no one was trying to sell their product. It was a genuine discussion of interesting problems and practical solutions, with no comparison between Apache Solr and Elasticsearch, just pure reasoning on challenging problems. Brilliant!
Last but not least, the conference allowed amazing search people from all over the world and from different cultures to meet, interact and discuss search problems and technologies. It may sound obvious for a conference, but it’s a great achievement nonetheless!

Keynote: What is Search Relevance?

Max Irwin opened the conference with his keynote on the meaning of Search Relevance. The talk was a smooth and pleasant introduction to the topic, making sure everyone was on the same page and ready for the following talks.
A good part of the opening was dedicated to the problem of collecting ground truth ratings (from explicit to implicit and hybrid approaches).

Rated Ranking Evaluator: An Open Source Approach for Search Quality Evaluation

After the keynote it was our turn: it was an honour to open the track sessions in theatre 5 with our talk “Rated Ranking Evaluator: An Open Source Approach to Search Quality Evaluation”.
Our talk was a revised version of our introduction to RRE, with a focus on the whole picture and on how our software fits industry requirements.
Building on the introduction, we explored what search quality evaluation means for a generic information retrieval system and how you can apply the fundamental concepts of the topic to the real world, with a full journey of assessing your system quality in an open source ecosystem.
The last part of the session was reserved for a quick demo, showing the key components of the RRE framework.
We were really happy with the reception from the audience. I take the occasion to say a big thank you to everyone present in the theatre that day: this really encourages us to continue our work and make RRE even better.

Making the Case for Human Judgement Relevance Testing

After our talk, it was the turn of LexisNexis with an overview of judgement relevance testing, in the talk by Tito Serra and Tara Diedrichsen, “Making the Case for Human Judgement Relevance Testing”.
The talk was quite interesting and explored practical ways to set up a human relevance testing programme.
When dealing with humans, reaching or estimating consensus is not trivial, and it is also quite important to record in as much detail as possible why a document is rated that way (the reason is as important as the rating).

Query Relaxation – a Rewriting Technique between Searching and Recommendations

After the lunch break we were back to business with “Query Relaxation – a Rewriting Technique between Searching and Recommendations” by Rene Kriegler.
This was personally one of my favourites. Starting from a clear definition of the problem (reducing the occurrence of zero-result searches), the speaker illustrated various approaches, from naive techniques (random removal of terms, or removal based on term frequencies) to the final word2vec + neural network system, able to drop words so as to maximise the probability of presenting a query reformulation that appeared in past sessions.
The overview of the entire journey was detailed and direct, especially because all the iterations were described and not only the final successful steps.
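
To make the naive end of that journey concrete, here is a minimal sketch of a frequency-based relaxation step. It is not the speaker's code: the function name, the `doc_freq` dictionary and the choice of dropping the rarest term (on the assumption that it is the one most likely to cause zero results) are all my own illustrative assumptions.

```python
def relax_query(query, doc_freq):
    """Drop one term from a zero-result query.

    Assumption (not from the talk): the term with the lowest document
    frequency is the most likely culprit, so it is removed first.
    `doc_freq` is a hypothetical term -> document count mapping.
    """
    terms = query.split()
    if len(terms) < 2:
        # Nothing sensible to drop from a single-term query.
        return query
    rarest = min(terms, key=lambda t: doc_freq.get(t, 0))
    terms.remove(rarest)  # removes the first occurrence only
    return " ".join(terms)


relaxed = relax_query("red leather sofa", {"red": 500, "leather": 120, "sofa": 300})
```

A heuristic like this ignores which term carries the user's intent, which is exactly the weakness the later, learned approaches in the talk were designed to address.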

Beyond the Search Engine: Improving Relevancy through Query Expansion

To conclude the first day I chose “Beyond the Search Engine: Improving Relevancy through Query Expansion”, a journey to improve relevance in an e-commerce domain, by Taylor Rose and David Mitchell of Ibotta.
The focus of the talk was a successful inter-team collaboration, where a curated knowledge base used by the Machine Learning team proved quite useful for improving the mechanics of synonym matching and product categorisation.

Lightning Talks

After the sessions, the first day ended with lightning talks.
They were very quick and thought-provoking. Some of those that caught my attention:

  • Quaerite – from Tim Allison, a toolkit to optimise search parameters using genetic algorithms
  • Hello LTR – from Doug Turnbull, a set of Jupyter notebooks to quickly spin up LTR experiments
  • HathiTrust – I finally had the chance to hear live about one of the earliest Solr adopters for “big data” (I remember theirs being the first articles I read about scaling up Apache Solr back in 2010)
  • Smui – Search Management UI for Synonyms
  • Querqy – from Rene Kriegler, a framework for query preprocessing in Java-based search engines

Addressing Variance in AB Tests: Interleaved Evaluation of Rankers

The second day opened for me with “Addressing Variance in AB Tests: Interleaved Evaluation of Rankers”, where Erik Bernhardson went through the way the Wikimedia Foundation tackled the need to speed up their AB tests by reducing the amount of data necessary to reach statistical significance.
The concept of interleaving results to assess rankers is well known to the academic community, but it was extremely useful to see a real-life application and a comparison of some of the available techniques.
Especially useful was the description of the 2 approaches they tried:
– Balanced Interleaving
– Team Draft Interleaving
To learn more about the topic, Erik recommended this very interesting blog post by Netflix: Innovating Faster on Personalization Algorithms at Netflix Using Interleaving
In addition to that, for people curious to explore the topic further I would recommend this GitHub project: https://github.com/mpkato/interleaving .
It offers Python implementations of various interleaving algorithms and presents a bibliography of solid publications on the matter.

Solving for Satisfaction: Introduction to Click Models

Then it was Elizabeth Haubert’s turn with “Solving for Satisfaction: Introduction to Click Models”, a very interesting talk, cursed by some technical issues that didn’t prevent Elizabeth from performing brilliantly and detailing to the audience various approaches to modelling the attractiveness and utility of search results from user interactions.
If you are curious to learn more about click models I recommend this interesting survey:
Click Models for Web Search, which explores in detail some of the models introduced by Elizabeth.
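
As a taste of what a click model looks like, here is a minimal sketch of the cascade model, one of the simplest models in the literature (I am not claiming it is one of those covered in the talk). It assumes the user scans results top-down and stops at the first click, so every document at or above the click counts as examined. The session format and function name are my own assumptions.

```python
from collections import defaultdict


def cascade_attractiveness(sessions):
    """Estimate per-document attractiveness under the cascade assumption.

    `sessions` is an iterable of (ranked_doc_ids, clicked_doc_id_or_None).
    Attractiveness = clicks / examinations, where a document is considered
    examined if it sits at or above the clicked position (or anywhere in
    the list when there was no click).
    """
    examined = defaultdict(int)
    clicked = defaultdict(int)
    for ranking, click in sessions:
        has_click = click in ranking
        last = ranking.index(click) if has_click else len(ranking) - 1
        for doc in ranking[: last + 1]:
            examined[doc] += 1
        if has_click:
            clicked[click] += 1
    return {doc: clicked[doc] / examined[doc] for doc in examined}


scores = cascade_attractiveness([
    (["a", "b", "c"], "b"),
    (["a", "b", "c"], "a"),
    (["b", "a"], None),
])
```

Note that document "c" gets no estimate at all here: under this model it was never examined, which is quite different from being examined and ignored.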

Custom Solr Query Parser Design Option, and Pros & Cons

Last in the morning was “Custom Solr Query Parser Design Option, and Pros & Cons”[8] from Bertrand Rigaldies: a live guide to customising Apache Solr query parsing capabilities to your needs, including a bit of coding to show the key components involved in writing a custom query parser. The example illustrated was a slight customisation of proximity search behaviour (parsing the user query and building Lucene Span Queries to satisfy a specific requirement in distance tolerance) plus capitalisation support.
The code and slides used in the presentation are available here: https://github.com/o19s/solr-query-parser-demo

Search Logs + Machine Learning = Auto-Tagging Inventory

After lunch John Berryman (co-author of Relevant Search), with “Search Logs + Machine Learning = Auto-Tagging Inventory”, approached content tagging from a different perspective:
can we use query and click logs to guess tags for documents?
The idea makes sense: when you interact with a document after issuing a query, you are effectively creating a correlation between the two entities, and this can definitely be used to help generate tags!
In the talk John went through a few iterative approaches (one based on a training set of just queries and clicked documents, and one based on queries grouped by session). You can find the Jupyter notebooks here for your reference, try them out!
First implementation
Query collapsing
Second implementation
Third implementation
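
To illustrate the underlying intuition only, here is a minimal counting sketch. It is far simpler than the machine learning approaches in John's notebooks and is not taken from them: the log format, the function name and the thresholds are my own assumptions.

```python
from collections import Counter, defaultdict


def suggest_tags(click_log, top_n=3, min_count=2):
    """Suggest tags per document from (query, clicked_doc_id) pairs.

    A query term becomes a candidate tag for a document when at least
    `min_count` distinct log entries link that term to the document.
    """
    counts = defaultdict(Counter)
    for query, doc_id in click_log:
        # Count each term once per log entry, whatever its repetitions.
        counts[doc_id].update(set(query.lower().split()))
    suggestions = {}
    for doc_id, counter in counts.items():
        frequent = [(term, c) for term, c in counter.items() if c >= min_count]
        # Most frequent first; alphabetical order breaks ties deterministically.
        frequent.sort(key=lambda tc: (-tc[1], tc[0]))
        suggestions[doc_id] = [term for term, _ in frequent[:top_n]]
    return suggestions


tags = suggest_tags([
    ("red running shoes", "p1"),
    ("running shoes", "p1"),
    ("trail shoes", "p1"),
    ("red dress", "p2"),
])
```

Raw counting like this is biased towards popular queries and cannot generalise to unseen documents, which is where a trained model earns its keep.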

Learning To Rank Panel

Following the unfortunate absence of one of the speakers, a panel on industry applications of Learning To Rank took place, with interesting discussions about one of the hottest technologies right now, and one that still presents a lot of challenges.
Various people were involved in the session and it was definitely pleasant to participate in the discussion.
The main takeaway from the panel was that even if LTR is an extremely promising technology, few adopters are really ready right now to proceed with the integration:
garbage in, garbage out is still valid, and extra care is needed when starting an LTR project.

Search with Vectors

Before the conference wrap-up, the last session I attended was “Search with Vectors” from Simon Hughes, a beautiful survey of vectorised similarity calculation strategies and how to use them in search nowadays, in combination with word2vec and similar approaches.
The focus of the talk was to describe how vector-based search can help with synonymy, polysemy, hypernyms/hyponyms and related concepts.
The related code and slides from previous talks are available in the Dice repo: https://github.com/DiceTechJobs/VectorsInSearch
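
The basic building block behind all of this is a similarity measure between vectors. The sketch below ranks documents by cosine similarity to a query vector; the two-dimensional vectors are made-up toy values (real embeddings, e.g. from word2vec, have hundreds of dimensions) and the function names are my own, not taken from the Dice repo.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_by_vector(query_vec, doc_vecs):
    """Return document ids sorted by decreasing similarity to the query."""
    return sorted(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True)


# Toy vectors: "sofa" and "couch" point in nearly the same direction.
doc_vecs = {"sofa": [0.9, 0.1], "couch": [0.8, 0.2], "laptop": [0.1, 0.9]}
ranking = rank_by_vector([1.0, 0.0], doc_vecs)
```

This is how synonymy is handled without a synonym list: "couch" ranks close to "sofa" simply because their vectors are close, even though the two words share no characters.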

Rated Ranking Evaluator: Help the poor (Search Engineer)

A Software Engineer is always required to give their customers concrete evidence of the quality of the deliverables. A Search Engineer deals with a specialisation of that generic Software Quality, which is called Search Quality.

What is Search Quality? And why is it so important in a search infrastructure? After all, “Software Quality” should be all-encompassing and include everything (and actually it does), but when we are dealing with search systems, quality is a very abstract term, which is very hard to define in advance.

The functional correctness of a search infrastructure (assuming correctness is the only factor which influences the system quality, and it isn’t) is naturally associated with human judgments and opinions, and unfortunately we know that opinions can differ among people.

The business stakeholders, who will get value from a search system, can belong to different categories, can have different expectations, and can have in mind different ideas of what a correct system looks like.

In this scenario a Search Engineer faces many challenges in terms of choices and, in the end, has to provide concrete evidence of the functional coverage of those choices.

This is the context where we developed the Rated Ranking Evaluator (hereafter RRE).

What is it?

The Rated Ranking Evaluator (RRE) is a search quality evaluation tool which evaluates the quality of results coming from a search infrastructure.

It helps Search Engineers in their daily job. Are you a Search Engineer? Are you tuning/implementing/changing/configuring a search infrastructure? Do you want something that gives you evidence of the improvements between changes? RRE could give you a hand with that.

RRE formalises how well a search system satisfies the user information needs, both at a “technical” level, combining a rich tree-like domain model with several evaluation measures, and at a “functional” level, providing human-readable outputs that could target the business stakeholders.

It encourages an incremental/iterative/immutable approach during the development and the evolution of a search system. Assume we’re starting our system from version x.y: when it’s time to apply some relevant change to its configuration, instead of applying the changes to x.y, it is better to clone it and apply those changes to the fresh new version.

In this way, RRE will execute the evaluation process on all available versions and provide the delta/trend between subsequent versions, so you can immediately get a fine-grained picture of where the system is going in terms of relevance.
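
The delta/trend idea can be sketched in a few lines. This is only an illustration of the concept, not RRE code: the `v1.0`-style version labels, the function names and the metric values are all hypothetical.

```python
def version_key(label):
    """Sort key for hypothetical labels such as 'v1.0', 'v1.2', 'v1.10'."""
    return tuple(int(part) for part in label.lstrip("v").split("."))


def trend(metric_by_version):
    """Return (previous, current, delta) for each pair of subsequent versions."""
    ordered = sorted(metric_by_version, key=version_key)
    return [
        (prev, curr, round(metric_by_version[curr] - metric_by_version[prev], 4))
        for prev, curr in zip(ordered, ordered[1:])
    ]


# Made-up metric values for three configuration versions.
deltas = trend({"v1.0": 0.61, "v1.2": 0.58, "v1.1": 0.70})
```

A negative delta on the last pair is the kind of signal you want to catch before a configuration change reaches production.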

This post is only a brief summary about RRE. You can find more detailed information in the project Wiki.

In a few words, what can I get from RRE?

You can configure RRE as an integral part of your project build cycle. That means that every time a build is triggered, an evaluation process will be executed.

RRE is not tied to a given search platform: it provides a mini-framework for plugging in different search platforms. At the moment we have two available bindings: Apache Solr and Elasticsearch (see here for supported versions).

The output evaluation data will be available:

  • as a JSON file: for further elaborations
  • as a spreadsheet: for delivering the evaluation results to someone else (e.g. a business stakeholder)
  • in a Web Console where metrics and their values get refreshed in real time (after each build)

How it works

RRE provides a rich, composite, tree-like, domain model, where the evaluation concept can be seen at different levels.

RRE Domain Model

The Evaluation at the top level is just a container for the nested entities. Note that all entity relationships are 1 to many. In this context, a Corpus is defined as a test dataset. RRE will use it for executing the evaluation process; in a single evaluation process you can have multiple datasets.

A Topic is an information need: it defines a functional requirement from the end-user perspective. Within a topic we can have several queries, which express the same need but closer to the technical layer. RRE provides a further abstraction in the middle: query groups. A Query Group is a group of queries which are supposed to produce the same results (and are therefore associated with the same set of judgments).

Queries, which are the technical leaves of the RRE domain model, are further decomposed into several perspectives, one for each available version of our system. A query itself is of course a single entity, but during an evaluation session its concrete execution happens several times, once for each available version. That is because RRE needs to measure the search results (i.e. the query executions) against all versions.

For each version we will finally have one or more metrics, depending on the configuration. Last but not least, even if metrics are computed at query/version level, RRE will aggregate those values at the upper levels (see the dashed vertical lines in the diagram), so each entity/level in the domain model will offer an aggregate perspective of all available metrics (e.g. I could be interested in the NDCG for a given query, or I could just stop my analysis at topic level).
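
To make the metric-plus-aggregation idea tangible, here is a minimal sketch. It is not RRE's implementation: it uses one common NDCG variant (linear gain, log2 discount) and assumes a plain arithmetic mean for rolling values up the tree, which may differ from what RRE actually does. All names and sample values are hypothetical.

```python
import math
from statistics import mean


def dcg(gains):
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))


def ndcg(ranked_doc_ids, judgments, k=10):
    """NDCG@k for one query execution; unjudged documents count as gain 0."""
    gains = [judgments.get(doc, 0) for doc in ranked_doc_ids[:k]]
    ideal = dcg(sorted(judgments.values(), reverse=True)[:k])
    return dcg(gains) / ideal if ideal else 0.0


def aggregate(node):
    """Roll a metric up the tree: a node is a leaf value or a dict of children.

    Assumption: each level is the arithmetic mean of its direct children.
    """
    if isinstance(node, dict):
        return mean(aggregate(child) for child in node.values())
    return node


judgments = {"d1": 3, "d2": 2, "d3": 1}
perfect = ndcg(["d1", "d2", "d3"], judgments)
reversed_order = ndcg(["d3", "d2", "d1"], judgments)

# Topic -> query groups -> queries, with made-up query-level values.
topic_score = aggregate({"group1": {"q1": 1.0, "q2": 0.5}, "group2": {"q3": 0.0}})
```

With this kind of roll-up you can start from a single topic-level number and drill down to the individual query only when that number moves.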

Input

In order to execute an evaluation process, RRE needs the following things:

  • One or more corpora / test collections: these are the representative datasets of a specific domain, which will be used for populating and querying a target search platform
  • One or more configuration sets: although nothing prevents you from having a single configuration, a minimum of two versions is required in order to provide a comparison between evaluation measures
  • One or more ratings sets: this is where judgments are defined, in terms of relevant documents for each query group

Output

The concrete RRE output depends on the runtime container where it is running. The RRE core itself is just a library, so when used programmatically within a project, it outputs a set of objects corresponding to the domain model described above.

When it is used as a Maven plugin, it primarily outputs the same structure in JSON format. This data is then used for producing further outputs, like a spreadsheet. The same payload can be sent to another module called RRE Server, which offers an AngularJS-based web console that gets automatically refreshed.

The RRE console is very useful when we are doing internal iterations / attempts around some issue, which usually requires very short edit-and-immediately-check cycles. Imagine having a couple of monitors on your desk: on the first there’s your favourite IDE, where you change things and run builds. On the second there’s the RRE Console (see below). After each build, just have a look at the console to get immediate feedback on your changes.

Where can I start?

The project repository on GitHub offers all you need: detailed documentation about how it works and how to get started quickly with RRE.

If you need some help, feel free to contact us! We appreciate any feedback, suggestions and, last but not least, contributions.

Future works

As you can imagine, the topic is quite huge. We have a lot of interesting ideas about the platform evolution.

These are some examples:

  • integration with some tool for building the relevance judgments. That could be some UI or a more sophisticated user interaction collector (which will automatically generate the ratings sets on top of computed online metrics like click-through rate or sales rate)
  • Jenkins plugin: for a better integration of RRE into the popular CI tool
  • Gradle plugin
  • Apache Solr Rank Eval API: using the RRE core we could implement a Rank Eval endpoint in Solr, similar to the Rank Eval API provided in Elasticsearch
  • ??? Other? Any suggestion is warmly welcome!

Links