Haystack 2019 Experience

This blog is a quick summary of my (subjective) experience at Haystack 2019, the Search Relevance Conference, hosted in Charlottesville (Virginia, USA) from 24/04/2019 to 25/04/2019.
References to the slides will be updated as soon as they become available.

First of all my feedback on the Haystack Conference is extremely positive.
From my perspective the conference has been a success.
Charlottesville is a delightful small city in the heart of Virginia: clean, organised, spacious and definitely relaxing. It has been a pleasure to spend my time there.
The venue chosen for the conference was a cinema. Initially I was surprised, but it worked really well: kudos to OpenSource Connections for the idea.
The conference and talks were meticulously organised, on time and with a relaxed pace, which definitely helped both the audience and the speakers enjoy it more: thanks to the whole organisation for such quality!
Let's take a look at the conference itself now: it has been 2 days of very interesting talks, exploring the latest trends in the industry in regards to search relevance, with a delightfully tech-agnostic approach.
That's been one of my favourite aspects of the conference: no one was trying to sell their product, it was just a genuine discussion of interesting problems and practical solutions. No comparison between Apache Solr and Elasticsearch, just pure reasoning on challenging problems: that's brilliant!
Last but not least, the conference allowed amazing search people from all over the world and from different cultures to meet, interact and discuss search problems and technologies. It may sound obvious for a conference, but it's a great achievement nonetheless!

Keynote: What is Search Relevance?

Max Irwin opened the conference with his keynote on the meaning of Search Relevance. The talk was a smooth and nice introduction to the topic, making sure everyone was on the same page, ready for the following talks.
A good part of the opening was dedicated to the problem of collecting ground truth ratings (from explicit to implicit and hybrid approaches).

Rated Ranking Evaluation: An Open Source Approach for Search Quality Evaluation

After the keynote it was our turn: it has been an honour to open the track sessions in theatre 5 with our talk "Rated Ranking Evaluator: An Open Source Approach to Search Quality Evaluation".
Our talk was a revised version of the introduction to RRE, with a focus on the whole picture and how our software fits industry requirements.
Building on the introduction, we explored what search quality evaluation means for a generic information retrieval system and how you can apply the fundamental concepts of the topic to the real world, with a full journey of assessing your system quality in an open source ecosystem.
The last part of the session was reserved for a quick demo, showing the key components of the RRE framework.
I'm really happy with the reception from the audience, and I take the occasion to say a big thank you to everyone present in the theatre that day. This really encourages us to continue our work and make RRE even better.

Making the Case for Human Judgement Relevance Testing

After our talk, it was the turn of LexisNexis, with an overview of human judgement relevance testing in the talk by Tito Serra and Tara Diedrichsen, "Making the Case for Human Judgement Relevance Testing".
The talk was quite interesting and explored the ways to practically set up a human relevance testing programme.
When dealing with humans, reaching or estimating consensus is not trivial, and it is also quite important to detail as much as possible why a document is rated a certain way (the reason is as important as the rating).

Query Relaxation – a Rewriting Technique between Searching and Recommendations

After the lunch break we were back to business with "Query Relaxation – a Rewriting Technique between Searching and Recommendations" by Rene Kriegler.
This one has been personally one of my favourites: starting from a clear definition of the problem (reducing the occurrence of zero-results searches), the speaker illustrated various approaches, from naive techniques (based on random removal of terms or on term-frequency-based removal) to the final word2vec + neural network system, able to drop words so as to maximise the probability of presenting a query reformulation that appeared in past sessions.
The overview of the entire journey was detailed and direct, especially because all the iterations were described and not only the final successful steps.
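To give a flavour of the naive end of that spectrum, here is a minimal sketch of a term-dropping relaxation loop (my own illustration; the generic search callable and precomputed document frequencies are assumptions, not Rene's actual implementation):

# A naive relaxation baseline, as a hedged sketch: on zero results,
# drop the rarest term (the most likely culprit) and retry.
# `search` and `doc_freq` are illustrative assumptions.
def relax_query(terms, doc_freq, search):
    results = search(terms)
    while not results and len(terms) > 1:
        rarest = min(terms, key=lambda t: doc_freq.get(t, 0))
        terms = [t for t in terms if t != rarest]
        results = search(terms)
    return terms, results

The word2vec-based system presented in the talk replaces this crude heuristic with a learned model of which reformulations users actually issued.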

Beyond the Search Engine: Improving Relevancy through Query Expansion

To conclude the first day I chose "Beyond the Search Engine: Improving Relevancy through Query Expansion", a journey to improve relevance in an e-commerce domain, by Taylor Rose and David Mitchell from Ibotta.
The focus of the talk was to describe a successful inter-team collaboration, where a curated knowledge base used by the Machine Learning team proved quite useful to improve the mechanics of synonym matching and product categorisation.

Lightning Talks

After the sessions the first day ended with lightning talks.
They were very quick and thought-provoking; some that caught my attention:

  • Quaerite – from Tim Allison, a toolkit to optimise search parameters using genetic algorithms
  • Hello LTR – from Doug Turnbull, a set of Jupyter notebooks to quickly spin up LTR experiments
  • HathiTrust – I finally had the chance to hear live about one of the earliest Solr adopters for "big data" (I remember theirs being among the first articles I read about scaling up Apache Solr, back in 2010)
  • SMUI – a Search Management UI for Synonyms
  • Querqy – from Rene Kriegler, a framework for query preprocessing in Java-based search engines

Addressing Variance in AB Tests: Interleaved Evaluation of Rankers

The second day opened for me with "Addressing Variance in AB Tests: Interleaved Evaluation of Rankers", where Erik Bernhardson went through the way the Wikimedia Foundation faced the necessity of speeding up their AB tests, reducing the data necessary to validate the statistical significance of such tests.
The concept of interleaving results to assess rankers is well known to the academic community, but it was extremely useful to see a real life application and comparison of some of the available techniques.
Especially useful was the description of two of the approaches they experimented with:
  • Balanced Interleaving
  • Team Draft Interleaving
To learn more about the topic, Erik recommended this very interesting blog post by Netflix: Innovating Faster on Personalization Algorithms at Netflix Using Interleaving.
In addition, for people curious to explore the topic further, I would recommend this GitHub project: https://github.com/mpkato/interleaving .
It offers Python implementations of various interleaving algorithms and presents a valid bibliography of solid publications on the matter.
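To make Team Draft Interleaving concrete, here is a minimal Python sketch (my own illustration, not Wikimedia's code): the two rankers alternately draft their best not-yet-picked document, and each slot remembers its team so that clicks can later be credited to decide the winner.

import random

def team_draft_interleave(ranking_a, ranking_b, k=10):
    # Rankers alternately draft their best not-yet-picked document;
    # each slot remembers which team contributed it.
    interleaved, teams, used = [], [], set()
    count_a = count_b = 0
    pool = set(ranking_a) | set(ranking_b)
    while len(interleaved) < k and used != pool:
        # The team with fewer picks drafts next; ties broken by coin flip.
        if count_a < count_b or (count_a == count_b and random.random() < 0.5):
            team, source, other = "a", ranking_a, ranking_b
        else:
            team, source, other = "b", ranking_b, ranking_a
        doc = next((d for d in source if d not in used), None)
        if doc is None:  # this ranking is exhausted, draft from the other
            team = "b" if team == "a" else "a"
            doc = next((d for d in other if d not in used), None)
        used.add(doc)
        interleaved.append(doc)
        teams.append(team)
        count_a += team == "a"
        count_b += team == "b"
    return interleaved, teams

print(team_draft_interleave(["d1", "d2", "d3"], ["d2", "d4", "d1"], k=4))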

Solving for Satisfaction: Introduction to Click Models

Then it was Elizabeth Haubert's turn with "Solving for Satisfaction: Introduction to Click Models", a very interesting talk, cursed by some technical issues that didn't prevent Elizabeth from performing brilliantly and detailing to the audience various approaches to modelling the attractiveness and utility of search results from user interactions.
If you are curious to learn more about click models, I recommend the interesting survey Click Models for Web Search, which explores in detail some of the models introduced by Elizabeth.
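To give a concrete feel for the simplest of these models, here is a toy sketch of the Position-Based Model (my own illustration with made-up parameters, not material from the talk): a click requires the user to both examine the rank and find the snippet attractive.

# Position-Based Model (PBM), a hedged toy illustration:
# P(click) = P(examined | rank) * P(attractive | query, document).
examination = {1: 0.9, 2: 0.6, 3: 0.4, 4: 0.3}   # P(examined | rank)
attractiveness = {("laptop", "doc7"): 0.8}        # P(attractive | q, d)

def p_click(query, doc, rank):
    return examination.get(rank, 0.1) * attractiveness.get((query, doc), 0.0)

print(p_click("laptop", "doc7", 2))  # 0.6 * 0.8 = 0.48

In practice both parameter sets are unknown and are estimated from click logs, typically with Expectation-Maximisation.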

Custom Solr Query Parser Design Option, and Pros & Cons

The last session of the morning was "Custom Solr Query Parser Design Option, and Pros & Cons" from Bertrand Rigaldies: a live manual on customising Apache Solr query parsing capabilities to your needs, including a bit of coding to show the key components involved in writing a custom query parser. The example illustrated was a slight customisation of proximity search behaviour (parsing the user query and building Lucene span queries to satisfy a specific requirement in distance tolerance) and capitalisation support.
The code and slides used in the presentation are available here: https://github.com/o19s/solr-query-parser-demo

Search Logs + Machine Learning = Auto-Tagging Inventory

After lunch, John Berryman (co-author of Relevant Search) faced content tagging from a different perspective with "Search Logs + Machine Learning = Auto-Tagging Inventory":
can we use query and click logs to guess tags for documents?
The idea makes sense: when, given a query, you interact with a document, you are effectively generating a correlation between the two entities, and this can definitely be used to help in the generation of tags!
In the talk John went through a few iterative approaches (one based on a training set of just query-clicked-document pairs, and one based on queries grouped by session); you can find the Jupyter notebooks below for your reference, try them out!
  • First implementation
  • Query collapsing
  • Second implementation
  • Third implementation
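As a taste of the first approach, here is a hedged sketch of the core idea (toy data, tag names and model choice are my assumptions; see John's notebooks above for the real thing): clicked (query, document) pairs act as weak labels, and a classifier learns to predict a document's tag from query text.

# Toy sketch: learn tags from query->click logs with scikit-learn.
# Data and tag names are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

click_log = [                      # (query, clicked document) pairs
    ("red running shoes", "doc1"),
    ("nike sneakers", "doc1"),
    ("winter coat", "doc2"),
    ("warm parka jacket", "doc2"),
]
doc_tags = {"doc1": "footwear", "doc2": "outerwear"}  # known tags

queries = [q for q, d in click_log]
labels = [doc_tags[d] for q, d in click_log]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(queries, labels)

# To tag an untagged document, aggregate the predictions over the
# queries that led users to click on it.
print(model.predict(["mens trail running shoes"]))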

Learning To Rank Panel

Following the unfortunate absence of one of the speakers, a panel on Learning To Rank industry applications took place, with interesting discussions about one of the hottest technologies right now, one that still presents a lot of challenges.
Various people were involved in the session and it was definitely pleasant to participate in the discussion.
The main takeaway from the panel has been that even if LTR is an extremely promising technology, few adopters are right now really ready to proceed with the integration:
garbage in, garbage out is still valid, and extra care is needed when starting an LTR project.

Search with Vectors

Before the conference wrap-up, the last session I attended was Simon Hughes' "Search with Vectors", a beautiful survey of vectorised similarity calculation strategies and how to use them in search nowadays, in combination with word2vec and similar approaches.
The focus of the talk was to describe how vector-based search can help with synonymy, polysemy, hyper/hyponyms and related concepts.
The related code and slides from previous talks are available in the Dice repo: https://github.com/DiceTechJobs/VectorsInSearch
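As a minimal illustration of the operation underpinning these techniques (my own sketch, assuming precomputed document embeddings such as averaged word2vec vectors):

# Brute-force vector search by cosine similarity, as a hedged sketch.
import numpy as np

def cosine_top_k(query_vec, doc_matrix, k=5):
    # Indices of the k documents most similar to the query vector.
    sims = doc_matrix @ query_vec / (
        np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(query_vec) + 1e-9
    )
    return np.argsort(-sims)[:k]

docs = np.random.rand(1000, 300)   # 1000 documents, 300-dim embeddings
query = np.random.rand(300)
print(cosine_top_k(query, docs, k=3))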

London Information Retrieval Meetup

The London Information Retrieval Meetup is approaching (19/02/2019) and we are excited to add more details about the speakers and talks!
The event is free and you are invited to register:
https://www.eventbrite.com/e/information-retrieval-meetup-tickets-54542417840

After Sambhav Kothari, software engineer at Bloomberg, and Elia Porciani, R&D software engineer at Sease, our last speaker is Andrea Gazzarini, founder and software engineer at Sease:

Andrea Gazzarini

Andrea Gazzarini is a curious software engineer, mainly focused on the Java language and Search technologies.
With more than 15 years of experience in various software engineering areas, his adventure with the search domain began in 2010, when he met Apache Solr and later Elasticsearch… and it was love at first sight. 
Since then, he has been involved in many projects across different fields (bibliographic, e-government, e-commerce, geospatial).

In 2015 he wrote “Apache Solr Essentials”, a book about Solr, published by Packt Publishing.
He’s an opensource lover; he’s currently involved in several (too many!) projects, always thinking about a “big” idea that will change his (developer) life.

Introduction to Music Information Retrieval

Music Information Retrieval is about retrieving information from music entities.
This high-level definition relates to a complex discipline with many real-world applications.
Being a former bass player, Andrea will give a high-level overview of Music Information Retrieval and will analyse, from a musician's perspective, a set of challenges that the topic offers.
We will introduce the basic concepts of the music language; then, passing through different kinds of music representations, we will end up describing some useful low-level features that are used when dealing with music entities.

Elia Porciani

Elia is a software engineer passionate about algorithms and data structures for search engines and efficiency.
He is currently involved in many research projects at CNR (the Italian National Research Council) as well as personal ones.
Before joining Sease he worked at Intecs and List, where he experienced different fields and levels of computer science, from low-level programming problems such as embedded systems and networking up to high-level trading algorithms.
He graduated with a dissertation about data compression and query performance on search engines.
He is an active part of the information retrieval research community, attending international conferences such as SIGIR and ECIR.
His most recent publication is "Faster BlockMax WAND with Variable-sized Blocks", in Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2017).

Improving top-k retrieval algorithms using dynamic programming and longer skipping

Modern search engines have to keep up with the enormous growth in the number of documents and queries submitted by users. One of the problems to deal with is finding the best k relevant documents for a given query. This operation has to be fast, and that is possible only by using specialised technologies.
BlockMax WAND is one of the best known algorithms for solving this problem without any effectiveness degradation of its ranking.
After a brief introduction, in this talk I'm going to show a strategy introduced in "Faster BlockMax WAND with Variable-sized Blocks" (SIGIR 2017) that, applied to BlockMax WAND, has made it possible to speed up the algorithm execution by almost 2x.
Then another optimisation of the BlockMax WAND algorithm ("Faster BlockMax WAND with Longer Skipping", ECIR 2019), aimed at reducing the execution time of short queries, will be presented.


Sambhav Kothari

Sambhav is a software engineer at Bloomberg, working in the News Search Experience team.



Learning To Rank: Explained for Dinosaurs

Internet search has long evolved from days when you had to string up your query in just the right way to get the results you were looking for. Search has to be smart and natural, and people expect it to “just work” and read what’s on their minds.

On the other hand, anyone who has worked behind the scenes with a search engine knows exactly how hard it is to get the right result to show up at the right time. Countless hours are spent tuning the boosts before your users can find their favorite two-legged tiny-armed dinosaur on the front page.

When your data is constantly evolving and updating, it's only realistic that your search engine evolves with it. Search teams are thus on a constant pursuit to refine and improve the ranking and relevance of their search results. But working smart is not the same as working hard. There are many techniques we can employ that can help us dynamically improve and automate this process. One such technique is Learning to Rank.

Learning to Rank was initially proposed in academia around 20 years ago, and almost all commercial web search engines utilize it in some form or other. At Bloomberg, we decided that it was time for an open source search engine to support Learning to Rank, so we spent more than a year designing and implementing it. The result of our efforts has been accepted by the Solr community, and our Learning to Rank plugin is now available in Apache Solr.

This talk will serve as an introduction to the LTR (Learning to Rank) module in Solr. No prior knowledge of Learning to Rank is needed, but attendees will be expected to know the basics of Python, Solr, and machine learning techniques. We will be going step-by-step through the process of shipping a machine-learned ranking model in Solr, including:

  • how you can engineer features and build a training data-set as per your needs
  • how you can train ranking models using popular Python ML (machine learning) libraries like scikit-learn (see the sketch after this list)
  • how you can use the above-learned ranking-models in Solr
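As a hedged flavour of the second bullet (features, grades and the model choice are illustrative assumptions, not the talk's actual pipeline), a pointwise model can be trained on features logged for (query, document) pairs:

# Pointwise Learning-to-Rank sketch with scikit-learn.
# Feature values and relevance grades below are made up.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

X = np.array([          # e.g. [titleMatchScore, isRecent, clickRate]
    [12.4, 1.0, 0.30],
    [3.1,  0.0, 0.10],
    [8.9,  1.0, 0.70],
    [0.5,  0.0, 0.05],
])
y = np.array([3, 0, 2, 0])   # human or click-derived relevance grades

model = GradientBoostingRegressor(n_estimators=50).fit(X, y)
print(model.predict([[10.0, 1.0, 0.5]]))
# The trained trees would then be serialised into Solr's LTR JSON model
# format and uploaded to the model store before being used for reranking.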

Get ready for an interactive session where we learn to rank!


Apache Solr Facets and ACL Filters Using Tag and Exclusion

What happens to facet aggregations on fields when documents in the results have been filtered by Access Control Lists?
In such scenarios it is important to use the facet mincount parameter, which specifies the minimum count in the result set for a facet value to appear in the response:

  • mincount=0: all the facet values present in the corpus are returned in the response. This includes the ones related to documents that have been filtered out by the ACLs (zero-count facets). This could cause some nasty side effects, such as a user seeing a facet value that they are not supposed to see (because the ACLs filtered that document out of the result set).
  • mincount=1: only facet values matching at least one document in the result set are returned. This configuration is safe: users are going to see only facet values regulated by the ACLs. They will effectively see only what they are supposed to see.

But what happens if you would like to see zero-count facet values while still preserving the ACLs?
This may help you get a better understanding of the distribution of the values in the entire corpus, while the ACLs remain valid, so that users still see only the values that they are supposed to see.
Tag and Exclusion come in handy in such cases.

Faceting Tag And Exclusion

Tag and Exclusion is an extremely important feature for faceting in Apache Solr, and you would not believe how many times it is misused or completely ignored, causing an erratic experience for the user.
Let’s see how it works:

Tagging

You can tag a filter query using Solr local parameter syntax:

fq={!tag=docTypeFilter}doctype:pdf

The same applies to the main query (with some caveats if you are using an explicit query parser):

q={!tag=mainQuery}I am the main query

q={!edismax qf=text title tag=mainQuery}I am the main query

When assigning a tag, we give Solr the possibility of identifying the various search clauses (such as the main query or the filter queries) separately.
Effectively, it is a way to assign an identifier to a search query or filter.

Excluding in Legacy Faceting

When applying filter queries, Solr is reducing the result space eliminating documents that don’t satisfy the additional filters added.
Let’s assume we want to count the values for a facet on the result set, ignoring the additional filtering that was added by a filter query.
Effectively, this is equivalent to counting the facet values on the result set as it was before the application of the filter that reduced it.
Apache Solr allows you to do that, without affecting the final results returned.

This is called exclusion and can be applied on a facet by facet basis.

fq={!tag=docTypeFilter}doctype:pdf...&facet=true&
facet.field={!ex=docTypeFilter}doctype

This will calculate the 'doctype' field facet on the result set with the exclusion of the tagged filter (so, for the purpose of calculating this aggregation, the "doctype:pdf" filter will not be applied, and the counts will be calculated on the extended result set).
All other facets, aggregations and the result set itself will not be affected.
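For completeness, this is how such a request could be issued from Python (a hedged sketch: the endpoint, collection name and field are illustrative assumptions):

# Issue the tagged/excluded facet request with the 'requests' library.
import requests

params = {
    "q": "*:*",
    "fq": "{!tag=docTypeFilter}doctype:pdf",
    "facet": "true",
    "facet.field": "{!ex=docTypeFilter}doctype",
    "wt": "json",
}
resp = requests.get("http://localhost:8983/solr/docs/select", params=params)
# Counts for ALL doctype values, not just 'pdf', despite the filter.
print(resp.json()["facet_counts"]["facet_fields"]["doctype"])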

1.<Wanted Behaviour - applying tag and exclusion>
=== Document Type ===
[ ] Word (42)
[x] PDF (96)
[ ] Excel (11)
[ ] HTML (63)

This is especially useful for single valued fields:
when selecting a facet value and refreshing the search, if you don't apply tag and exclusion you will get just that value in the facets, defeating the refinement and exploration functionality of faceting for that field.

2.<Unwanted Behaviour - out of the box>
=== Document Type ===
[ ] Word (0)
[x] PDF (96)
[ ] Excel (0)
[ ] HTML (0)
3.<Unwanted Behaviour - mincount=1>
=== Document Type ===
[x] PDF (96)

As you can see in 2 and 3, the facets become barely usable for further exploring the results; this can fragment the user experience, with a lot of back-and-forth selecting and unselecting filters.

Excluding in Json Faceting

After tagging a filter, applying an exclusion with the json.facet approach is quite simple:

visibleValues: {
  type: terms,
  field: cat,
  mincount: 1,
  limit: 100,
  domain: {
    excludeTags: <tag>
  }
}

When defining a JSON facet, applying the exclusion is just a matter of adding the domain node with excludeTags defined.

Tag and Exclusion to Preserve ACL Filtering with 0 Counts

Problem

  • Users are subject to a set of ACLs that limit their result visibility.
  • They would also like to see 0-count facets, to get a better understanding of the result set and the corpus.
  • You don't want to invalidate the ACL control, so you don't expect them to see sensitive facet values.

Tagging the Main Query and Json Faceting

This is achievable with a combination of tagging and exclusion with JSON faceting.
First of all, we want to tag the main query.
We assume the ACL control will be a filter query (and we recommend applying ACL filtering with properly tuned filter queries).
Tagging the main query and excluding it from the facet calculation will allow us to get all the facet values in the ACL-filtered corpus (the main query will be excluded, but the ACL filter query will still be applied).

q={!edismax tag=mainQuery qf=name}query&fq=aclField:user1...
json.facet={
  visibleValues: {
    type: terms,
    field: category,
    mincount: 1,
    limit: 100,
    domain: {
      excludeTags: mainQuery
    }
  }
}

We are almost there: this facet aggregation will give the counts of all the facet values visible to the user in the original corpus (with the ACLs applied).
But what we want is the correct counts based on the current result set, plus all the visible 0-count facets.
To do that we can add a block to the JSON faceting request:

q={!edismax tag=mainQuery qf=name}query&fq=aclField:user1...
json.facet={
  resultSetCounts: {
    type: terms,
    field: category,
    mincount: 1
  },
  visibleValues: {
    type: terms,
    field: category,
    mincount: 1,
    domain: {
      excludeTags: mainQuery
    }
  }
}
  • resultSetCounts – the counts in the current result set, including only non-zero facet values. This is the list of values the user has visibility on in the current result set, with the correct counts.
  • visibleValues – all the facet values in the ACL-filtered corpus that the user is allowed to see.

Then, depending on the user experience we want to provide, we can use these blocks of information to properly render the final response.
For example, we may want to show all the visible values, associating with each one the count from resultSetCounts when available.

=== Document Type - Result Counts ===
[ ] Word (10)
[ ] PDF (7)
[ ] Excel (5)
[ ] HTML (2)
=== Document Type - Visible Values ===
[ ] Word (100)
[ ] PDF (75)
[ ] Excel (54)
[ ] HTML (34)
[ ] Jpeg (31)
[ ] Mp4 (14)
[ ] SecretDocType1 (0) -> not visible, mincount=1 in visibleValues
[ ] SecretDocType2 (0) -> not visible, mincount=1 in visibleValues

=== Document Type - Final Result for Users ===
[ ] Word (10) -> count replaced with the effective result count
[ ] PDF (7) -> count replaced with the effective result count
[ ] Excel (5) -> count replaced with the effective result count
[ ] HTML (2) -> count replaced with the effective result count
[ ] Jpeg (+31)
[ ] Mp4 (+14)
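A small sketch of the client-side merge just described (the counts mirror the example above; the rendering choice is of course up to you):

# Show every visible value, preferring the result-set count when present.
result_set_counts = {"Word": 10, "PDF": 7, "Excel": 5, "HTML": 2}
visible_values = {"Word": 100, "PDF": 75, "Excel": 54, "HTML": 34,
                  "Jpeg": 31, "Mp4": 14}

for value, corpus_count in visible_values.items():
    if value in result_set_counts:
        print(f"[ ] {value} ({result_set_counts[value]})")
    else:
        # No matches in the current result set: show the corpus count.
        print(f"[ ] {value} (+{corpus_count})")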

Bonus: What if I Defined the Query Parser in solrconfig.xml?

This solution is still valid if your query parser is defined in solrconfig.xml.
Extra care is needed to tag the main query.
You can achieve that using local params in the Solr request parameters:

solrconfig.xml
<lst name="defaults">
...
<str name="q">{!type=edismax tag=mainQuery v=$qq}</str>
<str name="qq">*:*</str>
...

Query Time
.../solr/techproducts/browse?qq=ipod mini&fq=acl:user1&json.facet=...

Hope this helps when dealing with ACL or generic filter queries and faceting!

Apache Solr Distributed Facets

The Apache Solr distributed faceting feature was introduced back in 2008, with the first versions of Solr (1.3, according to this Jira issue[1]).
Until now, I always assumed it just worked, without diving too much into the details.
Nowadays distributed search and faceting are extremely popular; you can find them pretty much everywhere (in the legacy and SolrCloud forms alike).
N.B. Although the mechanics are pretty much the same, JSON faceting revisits this approach with some changes, so we will focus here on legacy field faceting.

I think it’s time to get a better understanding of how it works:

Multiple Shard Requests

When dealing with distributed search and distributed aggregation calculations, you are going to see multiple requests going back and forth across the shards.
They have different focuses and are meant to retrieve the different bits of information necessary to build the final response.
We are going to explore the different rounds of requests, focusing just on the faceting side.
N.B. Some of these requests also carry results for the distributed search calculation; this is done to minimise the network traffic.

For the sake of this blog, let's simulate a simple sharded index, with whitespace tokenisation on field1 and facet.field=field1:

Shard 1:
Doc0 { "id":"1", "field1":"a b" }
Doc1 { "id":"2", "field1":"a" }
Doc2 { "id":"3", "field1":"b c" }

Shard 2:
Doc3 { "id":"4", "field1":"b c" }
Doc4 { "id":"5", "field1":"b c" }
Doc5 { "id":"6", "field1":"c" }

Global Facets: b(4), c(4), a(2)

Shard 1 Local Facets: a(2), b(2), c(1)

Shard 2 Local Facets: c(3), b(2)

Collection of Candidate Facet Field Values

The first round of requests is sent to each shard to identify the candidate top K global facet values.
To achieve this, each shard is requested to respond with its local top K+J facet values and counts.
The reason we ask each shard for more values than K is to get better term coverage, avoiding the loss of relevant facet values and minimising the refinement requests.
How many more we request from each shard is regulated by the "overrequest" facet parameters, a factor that gives more accurate facets at the cost of additional computation[2].
Let's assume we configure facet.limit=2&facet.overrequest.count=0&facet.overrequest.ratio=1, to explain when refinement happens and how it works.

Shard 1 Returned Facets: a(2), b(2)

Shard 2 Returned Facets: c(3), b(2)

Global Merge of Collected Counts

The facet value counts collected from each shard are merged and the most occurring global top K is calculated.
These facet field values are the first candidates to be the final ones.
In addition, other candidates are extracted from the terms below the top K, based on the shards that didn't return statistics for those values.
At this point we have a candidate set of values and we are ready to refine their counts where necessary, asking this information back from the shards that didn't include it in the first round.
This happens by including the following facet parameter in the refinement requests:

{!terms=$<field>__terms}<field>&<field>__terms=<values>
e.g.
{!terms=$field1__terms}field1&field1__terms=term1,term2

N.B. This request specifically asks a Solr instance to return the facet counts just for the terms specified[3].

Top 2 candidates: b(4), c(3)
Additional candidates: a(2)

The reason a(2) is added to the potential candidates is that Shard 2 didn't return any count for a: the count potentially missing from Shard 2 could bring a into the top K. So it is worth verifying.
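To make the candidate selection concrete, here is a toy sketch of that bound reasoning on the example above (an illustration of the idea, not Solr's actual code):

# A term missing from a shard's response could still gain up to that
# shard's lowest returned count, so any term whose upper bound reaches
# the current top-K cutoff is worth refining.
shard_responses = {
    "shard1": {"a": 2, "b": 2},   # local top 2
    "shard2": {"c": 3, "b": 2},   # local top 2
}

merged = {}
for counts in shard_responses.values():
    for term, count in counts.items():
        merged[term] = merged.get(term, 0) + count

k = 2
top_k = sorted(merged, key=merged.get, reverse=True)[:k]
cutoff = min(merged[t] for t in top_k)

candidates = set(top_k)
for term, count in merged.items():
    bound = count + sum(min(c.values())
                        for c in shard_responses.values() if term not in c)
    if bound >= cutoff:
        candidates.add(term)

print(top_k)       # ['b', 'c']
print(candidates)  # {'a', 'b', 'c'}: a(2) needs refinement from shard2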

Shard 1 didn’t return any value for the candidate c facet.
So the following request is built and sent to it:
facet.field={!terms=$field1__terms}field1&field1__terms=c

Shard 2 didn’t return any value for the candidate a facet.
So the following request is built and sent to it:
facet.field={!terms=$field1__terms}field1&field1__terms=a

Final Counts Refinement

The refinement counts returned by each shard can be used to finalise the global candidate facet value counts and to identify the final top K to be returned by the distributed request.
We are finally done!

Shard 1 Refinement Facets: c(1)

Shard 2 Refinement Facets: a(0)

Top K candidates updated: b(4), c(4), a(2)

Given facet.limit=2, the final global facets with correct counts returned are:
b(4), c(4)

[1] https://issues.apache.org/jira/browse/SOLR-303

[2] https://lucene.apache.org/solr/guide/6_6/faceting.html#Faceting-Over-RequestParameters

[3] https://lucene.apache.org/solr/guide/7_5/faceting.html#limiting-facet-with-certain-terms