London Information Retrieval Meetup

The London Information Retrieval Meetup is approaching (19/02/2019) and we are excited to share more details about the speakers and talks!
The event is free and you are invited to register:

After Sambhav Kothari, software engineer at Bloomberg, and Elia Porciani, R&D software engineer at Sease, our last speaker is Andrea Gazzarini, founder and software engineer at Sease:

Andrea Gazzarini

Andrea Gazzarini is a curious software engineer, mainly focused on the Java language and Search technologies.
With more than 15 years of experience in various software engineering areas, his adventure with the search domain began in 2010, when he met Apache Solr and later Elasticsearch… and it was love at first sight. 
Since then, he has been involved in many projects across different fields (bibliographic, e-government, e-commerce, geospatial).

In 2015 he wrote “Apache Solr Essentials”, a book about Solr, published by Packt Publishing.
He’s an open-source lover; he’s currently involved in several (too many!) projects, always thinking about a “big” idea that will change his (developer) life.

Introduction to Music Information Retrieval

Music Information Retrieval is about retrieving information from music entities.
This high-level definition relates to a complex discipline with many real-world applications.     
As a former bass player, Andrea will give a high-level overview of Music Information Retrieval and analyse, from a musician’s perspective, a set of challenges that the topic offers.
We will introduce the basic concepts of the music language; then, passing through different kinds of music representations, we will end up describing some useful low-level features that are used when dealing with music entities.

Elia Porciani

Elia is a software engineer passionate about algorithms and data structures for search engines and efficiency.
He is currently involved in many research projects at the CNR (National Research Council, Italy) as well as personal projects.
Before joining Sease he worked at Intecs and List, where he experienced different fields and levels of computer science, from low-level programming problems such as embedded systems and networking up to high-level trading algorithms.
He graduated with a dissertation on data compression and query performance in search engines.
He is an active part of the information retrieval research community, attending international conferences such as SIGIR and ECIR.
His most recent publication is: Faster BlockMax WAND with Variable-sized Blocks, in Proceedings of the 40th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2017).

Improving top-k retrieval algorithms using dynamic programming and longer skipping

Modern search engines have to keep up with the enormous growth in the number of documents and queries submitted by users. One of the problems to deal with is finding the best k relevant documents for a given query. This operation has to be fast, and this is possible only by using specialised technologies.
BlockMax WAND is one of the best-known algorithms for solving this problem without any degradation in ranking effectiveness.
After a brief introduction, in this talk I’m going to show a strategy introduced in “Faster BlockMax WAND with Variable-sized Blocks” (SIGIR 2017) that, applied to the BlockMax WAND data, has made it possible to speed up the algorithm execution by almost 2x.
Then, another optimisation of the BlockMax WAND algorithm (“Faster BlockMax WAND with Longer Skipping”, ECIR 2019) will be presented, which reduces the execution time of short queries.

Sambhav Kothari

Sambhav is a software engineer at Bloomberg, working in the News Search Experience team.

Learning To Rank: Explained for Dinosaurs

Internet search has long evolved from the days when you had to string up your query in just the right way to get the results you were looking for. Search has to be smart and natural, and people expect it to “just work” and read what’s on their minds.

On the other hand, anyone who has worked behind the scenes with a search engine knows exactly how hard it is to get the right result to show up at the right time. Countless hours are spent tuning the boosts before your users can find their favorite two-legged tiny-armed dinosaur on the front page.

When your data is constantly evolving and updating, it’s only realistic that your search engines evolve too. Search teams are thus on a constant pursuit to refine and improve the ranking and relevance of their search results. But working smart is not the same as working hard. There are many techniques we can employ that can help us dynamically improve and automate this process. One such technique is Learning to Rank.

Learning to Rank was initially proposed in academia around 20 years ago and almost all commercial web search-engines utilize it in some form or other. At Bloomberg, we decided that it was time for an open source search-engine to support Learning to Rank, so we spent more than a year designing and implementing it. The result of our efforts has been accepted by the Solr community and our Learning to Rank plugin is now available in Apache Solr.

This talk will serve as an introduction to the LTR (Learning to Rank) module in Solr. No prior knowledge of Learning to Rank is needed, but attendees will be expected to know the basics of Python, Solr, and machine learning techniques. We will go step-by-step through the process of shipping a machine-learned ranking model in Solr, including:

  • how you can engineer features and build a training data-set as per your needs
  • how you can train ranking models using popular Python ML (machine learning) libraries like scikit-learn
  • how you can use the above-learned ranking-models in Solr

Get ready for an interactive session where we learn to rank!

Apache Solr Facets and ACL Filters Using Tag and Exclusion

What happens to facet aggregations on fields when the documents in the results have been filtered by Access Control Lists?
In such scenarios it is important to use the facet mincount parameter,
which specifies the minimum count in the result set for a facet value to appear in the response:

  • mincount=0: all the facet values present in the corpus are returned in the response, including the ones related to documents that have been filtered out by the ACLs (zero-count facets). This can cause a nasty side effect: a user may see a facet value that he/she is not supposed to see (because the ACLs filtered that document out of the result set).
  • mincount=1: only facet values matching at least one document in the result set are returned. This configuration is safe: users are going to see only facet values regulated by the ACLs, so they will effectively see only what they are supposed to see.
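For reference, a minimal request using the parameter could look like the following (the doctype field and the aclField filter are the ones used in the examples later in this post; mincount can also be set per field with the f.&lt;field&gt;.facet.mincount prefix):

```
q=*:*&fq=aclField:user1&facet=true&facet.field=doctype&facet.mincount=1
```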

But what if you would like to see zero-count facet values while still preserving the ACLs?
This may help in having a better understanding of the distribution of the values in the entire corpus; the ACLs remain valid, so users still see only the values they are supposed to see.
Tag and Exclusion comes in handy in such cases.

Faceting Tag And Exclusion

Tag and Exclusion is an extremely important feature for faceting in Apache Solr, and you would not believe how many times it is misused or completely ignored, causing an erratic experience for the user.
Let’s see how it works:


You can tag a filter query using the Solr local parameters syntax:

fq={!tag=docTypeFilter}doctype:pdf

The same applies to the main query (with some caveats if you are using an explicit query parser):

q={!tag=mainQuery}I am the main query

q={!edismax qf='text title' tag=mainQuery}I am the main query

When assigning a tag we give Solr the ability to identify the various search clauses (such as the main query or the filter queries) separately.
Effectively, it is a way to assign an identifier to a search query or filter.

Excluding in Legacy Faceting

When applying filter queries, Solr reduces the result space, eliminating the documents that don’t satisfy the additional filters.
Let’s assume we want to count the values for a facet on the result set, ignoring the additional filtering added by a filter query.
Effectively, this is equivalent to counting the facet values on the result set as it was before the application of the filter that reduced it.
Apache Solr allows you to do that without affecting the final results returned.

This is called exclusion and can be applied on a facet-by-facet basis:

fq={!tag=docTypeFilter}doctype:pdf&facet=true&facet.field={!ex=docTypeFilter}doctype

This will calculate the ‘doctype’ field facet on the result set with the exclusion of the tagged filter (for the purpose of calculating this aggregation, the “doctype:pdf” filter will not be applied, and the counts will be calculated on the extended result set).
All other facets, aggregations and the result set itself will not be affected.

1.<Wanted Behaviour - applying tag and exclusion>
=== Document Type ===
[ ] Word (42)
[x] PDF (96)
[ ] Excel(11)
[ ] HTML (63)

This is especially useful for single-valued fields:
when selecting a facet value and refreshing the search, if you don’t apply tag and exclusion you will get just that value in the facets, defeating the refinement and exploration functionality of the facet for that field.

2.<Unwanted Behaviour - out of the box>
=== Document Type ===
[ ] Word (0)
[x] PDF (96)
[ ] Excel(0)
[ ] HTML (0)
3.<Unwanted Behaviour - mincount=1>
=== Document Type ===
[x] PDF (96)

As you see in 2. and 3., the facet becomes barely usable for further exploring the results; the user experience may become fragmented, with a lot of back-and-forth activity selecting and deselecting filters.

Excluding in Json Faceting

After tagging a filter, applying an exclusion with the json.facet approach is quite simple:

visibleValues: {
  type: terms,
  field: cat,
  mincount: 1,
  limit: 100,
  domain: {
    excludeTags: <tag>
  }
}

When defining a Json facet, applying the exclusion is just a matter of adding the domain node with excludeTags defined.

Tag and Exclusion to Preserve ACL Filtering with 0 Counts


  • Users are subject to a set of ACLs that limit their result visibility.
  • They would like to also see zero-count facets, to have a better understanding of the result set and corpus.
  • You don’t want to invalidate the ACL control, so you don’t expect them to see sensitive facet values.

Tagging the Main Query and Json Faceting

This is achievable with a combination of tagging and exclusion with Json faceting.
First of all, we want to tag the main query.
We assume the ACL control is a filter query (and we recommend applying ACL filtering with properly tuned filter queries).
Tagging the main query and excluding it from the facet calculation will allow us to get all the facet values in the ACL-filtered corpus (the main query will be excluded, but the ACL filter query will still be applied).

q={!edismax tag=mainQuery qf=name}query&fq=aclField:user1...
json.facet={
  visibleValues: {
    type: terms,
    field: cat,
    mincount: 1,
    limit: 100,
    domain: {
      excludeTags: mainQuery
    }
  }
}

We are almost there: this facet aggregation will give the counts of all the facet values visible to the user in the original corpus (with the ACLs applied).
But what we want is to have the correct counts based on the current result set, plus all the visible zero-count facets.
To do that, we can add a block to the Json faceting request:

q={!edismax tag=mainQuery qf=name}query&fq=aclField:user1...
json.facet={
  resultSetCounts: {
    type: terms,
    field: category,
    mincount: 1
  },
  visibleValues: {
    type: terms,
    field: category,
    mincount: 1,
    domain: {
      excludeTags: mainQuery
    }
  }
}

  • resultSetCounts – the counts in the current result set, including only non-zero-count facet values. This is the list of values the user has visibility on in the current result set, with correct counts.
  • visibleValues – all the facet values the user should have visibility on (calculated on the ACL-filtered corpus, ignoring the main query).

Then, depending on the user experience we want to provide, we can use these blocks of information to properly render a final response.
For example, we may want to show all the visible values and associate with them a count from resultSetCounts when available.

=== Document Type - Result Counts ===   
[ ] Word (10)
[ ] PDF (7)
[ ] Excel(5)
[ ] HTML (2)
=== Document Type - Visible Values ===
[ ] Word (100)
[ ] PDF (75)
[ ] Excel(54)
[ ] HTML (34)
[ ] Jpeg (31)
[ ] Mp4 (14)
[ ] SecretDocType1 (0) -> not visible, mincount=1 in visibleValues
[ ] SecretDocType2 (0) -> not visible, mincount=1 in visibleValues

=== Document Type - Final Result for users ===
[ ] Word (10) -> count is replaced with effective result count
[ ] PDF (7) -> count is replaced with effective result count
[ ] Excel(5) -> count is replaced with effective result count
[ ] HTML (2)-> count is replaced with effective result count
[ ] Jpeg (+31)
[ ] Mp4 (+14)
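The rendering step just described (show every visible value, replacing its count with the result-set count when one exists) can be sketched in plain Python; the function name and dictionary shapes are illustrative, not part of Solr:

```python
def merge_facets(result_set_counts, visible_values):
    """Render the final facet entries: every value visible to the user,
    with the count taken from the current result set when present,
    otherwise the corpus-wide count marked with a '+'."""
    merged = []
    for value, corpus_count in visible_values.items():
        if value in result_set_counts:
            merged.append((value, str(result_set_counts[value])))
        else:
            merged.append((value, "+" + str(corpus_count)))
    return merged

# Counts taken from the example above
result_set_counts = {"Word": 10, "PDF": 7, "Excel": 5, "HTML": 2}
visible_values = {"Word": 100, "PDF": 75, "Excel": 54, "HTML": 34,
                  "Jpeg": 31, "Mp4": 14}

for value, count in merge_facets(result_set_counts, visible_values):
    print(f"[ ] {value} ({count})")
```

Values that exist only with mincount=1 in resultSetCounts keep their exact counts; values visible only in the corpus get the corpus count, flagged so the UI can render them differently.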

Bonus: What if I Defined the Query Parser in solrconfig.xml?

This solution is still valid if you are using a query parser defined in solrconfig.xml.
Extra care is needed to tag the main query.
You can achieve that using the local params in Solr request parameters:

<lst name="defaults">
  <str name="q">{!type=edismax tag=mainQuery v=$qq}</str>
  <str name="qq">*:*</str>
</lst>

Query Time
.../solr/techproducts/browse?qq=ipod mini&fq=acl:user1&json.facet=...

Hope this helps when dealing with ACL or generic filter queries and faceting!

Apache Solr Distributed Facets

The Apache Solr distributed faceting feature was introduced back in 2008, with the first versions of Solr (1.3, according to this Jira issue [1]).
Until now, I always assumed it just worked, without diving too much into the details.
Nowadays distributed search and faceting are extremely popular; you can find them pretty much everywhere (in the legacy or SolrCloud form alike).
N.B. Although the mechanics are pretty much the same, Json faceting revisits this approach with some changes, so here we will focus on legacy field faceting.

I think it’s time to get a better understanding of how it works:

Multiple Shard Requests

When dealing with distributed search and distributed aggregation calculations, you are going to see multiple requests going back and forth across the shards.
They have different focuses and are meant to retrieve the different bits of information necessary to build the final response.
We are going to explore the different rounds of requests, focusing just on the faceting purpose.
N.B. Some of these requests also carry results for the distributed search calculation; this is done to minimise the network traffic.

For the sake of this blog, let’s simulate a simple sharded index, with white space tokenization on field1 and facet.field=field1:

Shard 1                          Shard 2
{ "id":"1", "field1":"a b" }     { "id":"4", "field1":"b c" }
{ "id":"2", "field1":"a" }       { "id":"5", "field1":"b c" }
{ "id":"3", "field1":"b c" }     { "id":"6", "field1":"c" }
Global Facets : b(4), c(4), a(2)

Shard 1 Local Facets : a(2), b(2), c(1)

Shard 2 Local Facets : c(3), b(2)

Collection of Candidate Facet Field Values

The first round of requests is sent to each shard to identify the candidate top K global facet values.
To achieve this, each shard is asked to respond with its local top K+J facet values and counts.
The reason we ask each shard for more facet values than K is to get better term coverage, to avoid losing relevant facet values and to minimise the refinement requests.
How many more we request from each shard is regulated by the “overrequest” facet parameters, a factor that gives more accurate facets at the cost of additional computation [2].
Let’s assume we configure facet.limit=2&facet.overrequest.count=0&facet.overrequest.ratio=1 to explain when refinement happens and how it works.
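As a sketch of how the per-shard request size is derived, the effective shard limit roughly follows limit * ratio + count; the default values used below (ratio=1.5 and count=10) are assumptions based on recent Solr versions, so treat the exact formula as an approximation of the implementation:

```python
def shard_facet_limit(facet_limit, ratio=1.5, count=10):
    """Approximate number of facet values requested from each shard:
    the global facet.limit inflated by facet.overrequest.ratio,
    plus facet.overrequest.count extra values."""
    return int(facet_limit * ratio) + count

# With the parameters used in this example, no overrequesting happens:
print(shard_facet_limit(2, ratio=1, count=0))  # 2 values per shard
# With the assumed defaults, facet.limit=10 asks each shard for 25 values:
print(shard_facet_limit(10))  # 25
```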

Shard 1 Returned Facets : a(2), b(2)

Shard 2 Returned Facets : c(3), b(2)

Global Merge of Collected Counts

The facet value counts collected from each shard are merged, and the global top K most occurring values are calculated.
These facet field values are the first candidates to be the final ones.
In addition, other candidates are extracted from the terms below the top K, based on the shards that didn’t return statistics for those values.
At this point we have a candidate set of values, and we are ready to refine their counts where necessary, asking this information back from the shards that didn’t include it in the first round.
This happens by including the following facet parameter in the refinement requests:

facet.field={!terms=$field1__terms}field1&field1__terms=<candidate values>

N.B. This request specifically asks a Solr instance to return the facet counts just for the terms specified [3].

Top 2 candidates = b(4), c(3)
Additional candidates = a(2)

The reason a(2) is added to the potential candidates is that Shard 2 didn’t answer with a count for a: the potential missing count of 1 could bring a into the top K, so it is worth a verification.

Shard 1 didn’t return any count for the candidate value c, so the following refinement request is built and sent to it:

facet.field={!terms=$field1__terms}field1&field1__terms=c

Shard 2 didn’t return any count for the candidate value a, so the following refinement request is built and sent to it:

facet.field={!terms=$field1__terms}field1&field1__terms=a

Final Counts Refinement

The refinement counts returned by each shard are used to finalise the global candidate facet value counts and to identify the final top K to be returned by the distributed request.
We are finally done!

Shard 1 Refinements Facets : c(1)

Shard 2 Refinements Facets : a(0)

Top K candidates updated: b(4), c(4), a(2)

Given facet.limit=2, the final global facets, with correct counts, returned are:
b(4), c(4)
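As a recap, the whole flow (local top K collection, global merge, refinement) can be simulated with plain Python dictionaries. This is a simplified sketch, not Solr’s actual implementation: for brevity it refines every collected value, while Solr only refines the candidates that could still enter the top K:

```python
from collections import Counter

def local_facets(shard_docs):
    """Per-shard facet counts for field1 (white space tokenization)."""
    counts = Counter()
    for doc in shard_docs:
        for term in doc.get("field1", "").split():
            counts[term] += 1
    return counts

def distributed_facet(shards, k, overrequest=0):
    """Simulate the three rounds: collect local top K+J values,
    merge them globally, refine the missing counts, return the top K."""
    shard_counts = [local_facets(s) for s in shards]
    # Round 1: each shard returns its local top K+J values and counts
    returned = [dict(c.most_common(k + overrequest)) for c in shard_counts]
    # Round 2: merge the returned counts and collect the candidate values
    merged = Counter()
    for r in returned:
        merged.update(r)
    candidates = set(merged)
    # Round 3: refinement - ask each shard for the candidates it didn't return
    for r, full in zip(returned, shard_counts):
        for value in candidates - set(r):
            merged[value] += full[value]
    return merged.most_common(k)

shard1 = [{"id": "1", "field1": "a b"}, {"id": "2", "field1": "a"},
          {"id": "3", "field1": "b c"}]
shard2 = [{"id": "4", "field1": "b c"}, {"id": "5", "field1": "b c"},
          {"id": "6", "field1": "c"}]
print(distributed_facet([shard1, shard2], k=2))  # [('b', 4), ('c', 4)]
```

With k=2 and no overrequesting, this reproduces the example above: the first round returns a(2), b(2) and c(3), b(2); refinement adds c(1) from Shard 1 and a(0) from Shard 2; and the final global top 2 is b(4), c(4).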