London Information Retrieval Meetup

The London Information Retrieval Meetup is approaching (19/02/2019) and we are excited to add more details about the speakers and talks!
The event is free and you are invited to register :
https://www.eventbrite.com/e/information-retrieval-meetup-tickets-54542417840

After Sambhav Kothari, software engineer at Bloomberg and Elia Porciani, R&D software engineer at Sease, our last speaker is Andrea Gazzarini, founder and software engineer at Sease :

Andrea Gazzarini

Andrea Gazzarini is a curious software engineer, mainly focused on the Java language and Search technologies.
With more than 15 years of experience in various software engineering areas, his adventure with the search domain began in 2010, when he met Apache Solr and later Elasticsearch… and it was love at first sight. 
Since then, he has been involved in many projects across different fields (bibliographic, e-government, e-commerce, geospatial).

In 2015 he wrote “Apache Solr Essentials”, a book about Solr, published by Packt Publishing.
He’s an opensource lover; he’s currently involved in several (too many!) projects, always thinking about a “big” idea that will change his (developer) life.

Introduction to Music Information Retrieval

Music Information Retrieval is about retrieving information from music entities.
This high-level definition relates to a complex discipline with many real-world applications.     
A former bass player, Andrea will give a high-level overview of Music Information Retrieval and analyse, from a musician's perspective, a set of challenges that the topic offers.
We will introduce the basic concepts of the music language, then, passing through different kinds of music representations, we will end up describing some useful low-level features that are used when dealing with music entities.

Elia Porciani

Elia is a software engineer passionate about algorithms and data structures for search engines and efficiency.
He is currently involved in many research projects at CNR (National Research Council, Italy) as well as in personal projects.
Before joining Sease he worked at Intecs and List, where he experienced different fields and levels of computer science, from low-level programming problems such as embedded systems and networking up to high-level trading algorithms.
He graduated with a dissertation on data compression and query performance in search engines.
He is an active part of the information retrieval research community, attending international conferences such as SIGIR and ECIR.
His most recent publication is : “Faster BlockMax WAND with Variable-sized Blocks”, SIGIR 2017, Proceedings of the 40th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2017.

Improving top-k retrieval algorithms using dynamic programming and longer skipping

Modern search engines have to keep up with the enormous growth in the number of documents and queries submitted by users. One of the problems to deal with is finding the best k relevant documents for a given query. This operation has to be fast, and that is possible only by using specialised technologies.
BlockMaxWand is one of the best-known algorithms for solving this problem without any degradation in ranking effectiveness.
After a brief introduction, in this talk I’m going to show a strategy introduced in “Faster BlockMax WAND with Variable-sized Blocks” (SIGIR 2017) that, applied to the BlockMaxWand data, has made it possible to speed up the algorithm execution by almost 2x.
Then another optimisation of the BlockMaxWand algorithm (“Faster BlockMax WAND with Longer Skipping”, ECIR 2019) will be presented, which reduces the execution time of short queries.


Sambhav Kothari

Sambhav is a software engineer at Bloomberg, working in the News Search Experience team.



Learning To Rank: Explained for Dinosaurs

Internet search has long evolved from the days when you had to string up your query in just the right way to get the results you were looking for. Search has to be smart and natural, and people expect it to “just work” and read what’s on their minds.

On the other hand, anyone who has worked behind-the-scenes with a search engine knows exactly how hard it is to get the right result to show up at the right time. Countless hours are spent tuning the boosts before your user can find his favorite two-legged tiny-armed dinosaur on the front page.

When your data is constantly evolving and updating, it’s only realistic that your search engine evolves with it. Search teams are thus on a constant pursuit to refine and improve the ranking and relevance of their search results. But working smart is not the same as working hard. There are many techniques we can employ that can help us dynamically improve and automate this process. One such technique is Learning to Rank.

Learning to Rank was initially proposed in academia around 20 years ago and almost all commercial web search-engines utilize it in some form or other. At Bloomberg, we decided that it was time for an open source search-engine to support Learning to Rank, so we spent more than a year designing and implementing it. The result of our efforts has been accepted by the Solr community and our Learning to Rank plugin is now available in Apache Solr.

This talk will serve as an introduction to the LTR (Learning to Rank) module in Solr. No prior knowledge of Learning to Rank is needed, but attendees will be expected to know the basics of Python, Solr, and machine learning techniques. We will go step-by-step through the process of shipping a machine-learned ranking model in Solr, including:

  • how you can engineer features and build a training data-set as per your needs
  • how you can train ranking models using popular Python ML(machine learning) libraries like scikit-learn
  • how you can use the above-learned ranking-models in Solr

Get ready for an interactive session where we learn to rank!


Apache Solr Facets and ACL Filters Using Tag and Exclusion

What happens to facet aggregations on fields when the documents in the results have been filtered by Access Control Lists?
In such scenarios it is important to use the facet mincount parameter.
It specifies the minimum count in the result set required for a facet value to appear in the response:

  • mincount=0 : all the facet values present in the corpus are returned in the response, including the ones related to documents that have been filtered out by the ACLs (0-count facets). This can cause a nasty side effect: a user may see a facet value that he/she is not supposed to see (because the ACLs filtered that document out of the result set).
  • mincount=1 : only facet values matching at least one document in the result set are returned. This configuration is safe: users are going to see only the facet values allowed by the ACLs. They will effectively see only what they are supposed to see.

But what if you would like to see 0-count facet values while still preserving the ACLs?
This may help in better understanding the distribution of the values in the entire corpus, while the ACLs remain valid, so that users still see only the values they are supposed to see.
Tag and Exclusion comes in handy in such cases.

Faceting Tag And Exclusion

Tag and Exclusion is an extremely important feature for faceting in Apache Solr and you would not believe how many times it is misused or completely ignored, causing an erratic experience for the user.
Let’s see how it works:

Tagging

You can tag a filter query using Solr local parameter syntax:

fq={!tag=docTypeFilter}doctype:pdf

The same applies to the main query (with some caveats if you are using an explicit query parser) :

q={!tag=mainQuery}I am the main query

q={!edismax qf='text title' tag=mainQuery}I am the main query

When assigning a tag we give Solr the possibility of identifying the various search clauses separately (such as the main query or the filter queries).
Effectively it is a way to assign an identifier to a search query or filter.

Excluding in Legacy Faceting

When applying filter queries, Solr reduces the result space, eliminating the documents that don’t satisfy the additional filters.
Let’s assume we want to count the values for a facet on the result set, ignoring the additional filtering that was added by a filter query.
Effectively this is equivalent to counting the facet values on the result set as it was before the application of the filter that reduced it.
Apache Solr allows you to do that without affecting the final results returned.

This is called exclusion and can be applied on a facet by facet basis.

fq={!tag=docTypeFilter}doctype:pdf...&facet=true&
facet.field={!ex=docTypeFilter}doctype

This will calculate the ‘doctype’ field facet on the result set with the exclusion of the tagged filter (so, for the purpose of calculating this aggregation, the “doctype:pdf” filter will not be applied and the counts will be calculated on the extended result set).
All other facets, aggregations and the result set itself will not be affected.

1.<Wanted Behaviour - applying tag and exclusion>
=== Document Type ===
[ ] Word (42)
[x] PDF (96)
[ ] Excel(11)
[ ] HTML (63)

This is especially useful for single valued fields:
when selecting a facet value and refreshing the search if you don’t apply tag and exclusion you will get just that value in the facets, defeating the refinement and exploration facet functionality for that field.

2.<Unwanted Behaviour - out of the box>
=== Document Type ===
[ ] Word (0)
[x] PDF (96)
[ ] Excel(0)
[ ] HTML (0)
3.<Unwanted Behaviour - mincount=1>
=== Document Type ===
[x] PDF (96)

As you can see in 2. and 3., the facet becomes barely usable to further explore the results; this can fragment the user experience with a lot of back-and-forth activity selecting and unselecting filters.

Excluding in Json Faceting

After the tagging of a filter, applying an exclusion with the json.facet approach is quite simple:

visibleValues: {
type: terms,
field: cat,
mincount: 1,
limit: 100,
domain: {
excludeTags: <tag>
}
}

When defining a json facet, applying exclusion is just adding the domain node with the excludeTags defined.

Tag and Exclusion to Preserve Acl Filtering in 0 counts

Problem

  • Users are subject to a set of ACL that limit their results visibility.
  • They would like to see also 0 count facets to have a better understanding of the result set and corpus.
  • You don’t want to invalidate the ACL control, so you don’t expect them to see sensitive facet values.

Tagging the Main Query and Json Faceting

This is achievable with a combination of tagging and exclusion with Json faceting.
First of all, we want to tag the main query.
We assume the ACL control is a filter query (and we recommend applying ACL filtering with properly tuned filter queries).
Tagging the main query and excluding it from the facet calculation will allow us to get all the facet values in the ACL-filtered corpus (the main query will be excluded but the ACL filter query will still be applied).

q={!edismax tag=mainQuery qf=name}query&fq=aclField:user1...
json.facet={visibleValues: {
type: terms,
field: cat,
mincount: 1,
limit: 100,
domain: {
excludeTags: mainQuery
}
}}

We are almost there: this facet aggregation will give the counts of all the facet values visible to the user in the original corpus (with ACLs applied).
But what we want is to have the correct counts based on the current result set, plus all the visible 0-count facets.
To do that we can add a block to the Json faceting request:

q={!edismax tag=mainQuery qf=name}query&fq=aclField:user1...
json.facet={
resultSetCounts: {
type: terms,
field: category,
mincount: 1
},
visibleValues: {
type: terms,
field: category,
mincount: 1,
domain: {
excludeTags: mainQuery
}
}
}
  • resultSetCounts – the counts in the current result set, including only non-zero facet values. This is the list of values the user has visibility on in the current result set, with the correct counts.
  • visibleValues – all the facet values the user should have visibility on, in the ACL-filtered corpus.

Then, depending on the user experience we want to provide, we could use these blocks of information to properly render a final response.
For example we may want to show all visible values and associate with them a count from the resultSetCounts when available.

=== Document Type - Result Counts ===   
[ ] Word (10)
[ ] PDF (7)
[ ] Excel(5)
[ ] HTML (2)
=== Document Type - Visible Values ===
[ ] Word (100)
[ ] PDF (75)
[ ] Excel(54)
[ ] HTML (34)
[ ] Jpeg (31)
[ ] Mp4 (14)
 [ ] SecretDocType1 (0) -> not visible, mincount=1 in visibleValues
 [ ] SecretDocType2 (0) -> not visible, mincount=1 in visibleValues

=== Document Type - Final Result for users ===
[ ] Word (10) -> count is replaced with effective result count
[ ] PDF (7) -> count is replaced with effective result count
[ ] Excel(5) -> count is replaced with effective result count
[ ] HTML (2)-> count is replaced with effective result count
[ ] Jpeg (+31)
[ ] Mp4 (+14)
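
A minimal client-side sketch of this merge, assuming the json.facet response shape above (class and method names are hypothetical) :

import java.util.LinkedHashMap;
import java.util.Map;

public class FacetRenderer {

    // Every value in visibleValues is allowed by the ACLs.
    // The rendered count comes from resultSetCounts when the value is present in the
    // current result set, otherwise the corpus count is kept, prefixed with '+'.
    static Map<String, String> render(Map<String, Long> visibleValues, Map<String, Long> resultSetCounts) {
        Map<String, String> rendered = new LinkedHashMap<>();
        visibleValues.forEach((value, corpusCount) -> {
            Long resultCount = resultSetCounts.get(value);
            rendered.put(value, resultCount != null ? "(" + resultCount + ")" : "(+" + corpusCount + ")");
        });
        return rendered;
    }
}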

Bonus: What if I Defined the Query Parser in the Solrconfig.xml

This solution is still valid if your query parser is defined in the solrconfig.xml.
Extra care is needed to tag the main query.
You can achieve that using the local params in Solr request parameters:

solrconfig.xml
<lst name="defaults">
...
<str name="q">{!type=edismax tag=mainQuery v=$qq}</str>
<str name="qq">*:*</str>
...

Query Time
.../solr/techproducts/browse?qq=ipod mini&fq=acl:user1&json.facet=...

Hope this helps when dealing with ACL or generic filter queries and faceting!

Apache Solr Distributed Facets

The Apache Solr distributed faceting feature was introduced back in 2008 with one of the first versions of Solr (1.3, according to this Jira[1]).
Until now, I had always assumed it just worked, without diving too much into the details.
Nowadays distributed search and faceting are extremely popular; you can find them pretty much everywhere (in the legacy or SolrCloud form alike).
N.B. Although the mechanics are pretty much the same, Json faceting revisits this approach with some changes, so we will focus here on legacy field faceting.

I think it’s time to get a better understanding of how it works:

Multiple Shard Requests

When dealing with distributed search and distributed aggregation calculations, you are going to see multiple requests going back and forth across the shards.
They have different focus and are meant to retrieve the different bits of information necessary to build the final response.
We are going to explore the different rounds of requests, focusing just for the faceting purpose.
N.B. Some of these requests also carry results for the distributed search calculation; this is done to minimise the network traffic.

For the sake of this blog, let’s simulate a simple sharded index, with whitespace tokenization on field1 and facet.field=field1 :

Shard 1 :
Doc0 { "id":"1", "field1":"a b" }
Doc1 { "id":"2", "field1":"a" }
Doc2 { "id":"3", "field1":"b c" }

Shard 2 :
Doc3 { "id":"4", "field1":"b c" }
Doc4 { "id":"5", "field1":"b c" }
Doc5 { "id":"6", "field1":"c" }

Global Facets : b(4), c(4), a(2)

Shard 1 Local Facets : a(2), b(2), c(1)

Shard 2 Local Facets : c(3), b(2)

Collection of Candidate Facet Field Values

The first round of requests is sent to each shard to identify the candidate top K global facet values.
To achieve this target each shard will be requested to respond with its local top K+J facet values and counts.
The reason we actually ask for more facets from each shard is to have a better term coverage, to avoid losing relevant facet values and to minimise the refinement requests.
How many more we request from each shard is regulated by the “overrequest” facet parameter, a factor that gives more accurate facets at the cost of additional computations[2].
Let’s assume we configure a facet.limit=2&facet.overrequest.count=0&facet.overrequest.ratio=1 to explain when refinement happens and how it works.
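As a sketch, the overall request (host and shard addresses are illustrative) could look like this :

http://localhost:8983/solr/collection/select?q=*:*&facet=true&facet.field=field1&facet.limit=2&facet.overrequest.count=0&facet.overrequest.ratio=1&shards=shard1Host:8983/solr/collection,shard2Host:8983/solr/collection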

Shard 1 Returned Facets : a(2), b(2)

Shard 2 Returned Facets : c(3), b(2)

Global Merge of Collected Counts

The facet value counts collected from each shard are merged and the most occurring global top K is calculated.
These facet field values are the first candidates to be the final ones.
In addition to that, other candidates are extracted from the terms below the top K, based on the shards that didn’t return statistics for those values.
At this point we have a candidate set of values and we are ready to refine their counts where necessary, asking this information back from the shards that didn’t include it in the first round.
This happens by including the following facet parameter in the refinement requests:

{!terms=$<field>__terms}<field>&<field>__terms=<values>
e.g.
{!terms=$field1__terms}field1&field1__terms=term1,term2

N.B. This request specifically asks a Solr instance to return the facet counts just for the terms specified[3].

Top 2 candidates = b(4), c(3)
Additional candidates = a(2)

The reason a(2) is added to the potential candidates is that Shard 2 didn’t answer with a count for a : even a missing count of 1 could bring a into the top K, so it is worth a verification.

Shard 1 didn’t return any value for the candidate c facet.
So the following request is built and sent to it:
facet.field={!terms=$field1__terms}field1&field1__terms=c

Shard 2 didn’t return any value for the candidate a facet.
So the following request is built and sent to it:
facet.field={!terms=$field1__terms}field1&field1__terms=a

Final Counts Refinement

The refinement counts returned by each shard can be used to finalise the global candidate facet value counts and to identify the final top K to be returned by the distributed request.
We are finally done!

Shard 1 Refinements Facets : c(1)

Shard 2 Refinements Facets : a(0)

Top K candidates updated : b(4), c(4), a(2)

Given a facet.limit=2, the final global facets with correct counts returned are :
b(4), c(4)

 

[1] https://issues.apache.org/jira/browse/SOLR-303

[2] https://lucene.apache.org/solr/guide/6_6/faceting.html#Faceting-Over-RequestParameters

[3] https://lucene.apache.org/solr/guide/7_5/faceting.html#limiting-facet-with-certain-terms

Apache Lucene BlendedInfixSuggester : How It Works, Bugs And Improvements

The Apache Lucene/Solr suggesters are important to Sease : we explored the topic in the past[1] and we strongly believe the autocomplete feature to be vital for a lot of search applications.
This blog post explores in detail the current status of the Lucene BlendedInfixSuggester, some bugs in the most recent version (with the solutions attached) and some possible improvements.

BlendedInfixSuggester

The BlendedInfixSuggester is an extension of the AnalyzingInfixSuggester with the additional functionality to weight prefix matches of your query across the matched documents.
It scores higher if a hit is closer to the start of the suggestion.
N.B. at the current stage only the first term in your query will affect the suggestion score

Let’s see some of the configuration parameters from the official wiki:

  • blenderType: used to calculate the positional weight coefficient using the position of the first matching word. Can be one of:
    • position_linear: weightFieldValue*(1 – 0.10*position): Matches to the start will be given a higher score (Default)
    • position_reciprocal: weightFieldValue/(1+position): Matches to the start will be given a score which decays faster than linear
    • position_exponential_reciprocal: weightFieldValue/pow(1+position,exponent): Matches to the start will be given a score which decays faster than reciprocal
      • exponent: an optional configuration variable for the position_exponential_reciprocal blenderType used to control how fast the score will increase or decrease. Default 2.0.
Description :

  • Data Structure : an auxiliary Lucene index.
  • Building : for each document, the stored content of the field is analysed according to the suggestAnalyzerFieldType and then additionally EdgeNgram token filtered. Finally, an auxiliary index is built with those tokens.
  • Lookup strategy : the query is analysed according to the suggestAnalyzerFieldType, then a phrase search is triggered against the auxiliary Lucene index. The suggestions are identified starting at the beginning of each token in the field content.
  • Suggestions returned : the entire content of the field.
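
For reference, a minimal solrconfig.xml sketch of such a suggester (field names and the analyzer type are illustrative assumptions) :

<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">blendedSuggester</str>
    <str name="lookupImpl">BlendedInfixLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="blenderType">position_linear</str>
    <str name="field">title</str>
    <str name="weightField">price</str>
    <str name="suggestAnalyzerFieldType">text_en</str>
  </lst>
</searchComponent>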

This suggester is really common nowadays as it allows providing suggestions from the middle of a field’s content, taking advantage of the analysis chain configured for the field.

In this way it is possible to provide suggestions that take into account synonyms, stop words, stemming and any other token filter used in the analysis, and to match the suggestion on internal tokens.
Finally the suggestion is scored, based on the position of the match.

The simple corpus of documents for the examples will be the following :

[
      {
        "id":"44",
        "title":"Video gaming: the history"},
      {
        "id":"11",
        "title":"Nowadays Video games are a phenomenal economic business"},
      {
        "id":"55",
        "title":"The new generation of PC and Console Video games"},
      {
        "id":"33",
        "title":"Video games: multiplayer gaming"}]

And a simple synonym mapping : multiplayer, online

Let’s see some examples :

Query to autocomplete : “gaming”

Suggestions :
  • “Video gaming: the history”
  • “Video games: multiplayer gaming”
  • “Nowadays Video games are a phenomenal economic business”

Explanation :
The input query is analysed and the token produced is : “game”.
In the auxiliary index, for each field content we have the EdgeNgram tokens : “v”, ”vi”, ”vid”, … , “g”, ”ga”, ”gam”, “game”.
So the match happens and the suggestions are returned.
N.B. the first two suggestions are ranked higher as the matched term happens to be closer to the start of the suggestion.

Let’s explore the positional coefficient of each suggestion given the various blender types (query : “gaming”) :

Video gaming: the history (first matching position : 1)
  Position Linear : 1 - 0.1*position = 0.9
  Position Reciprocal : 1/(1+position) = 1/2 = 0.5
  Position Exponential Reciprocal : 1/(1+position)^2 = 1/4 = 0.25

Video games: multiplayer gaming (first matching position : 1)
  Position Linear : 1 - 0.1*position = 0.9
  Position Reciprocal : 1/(1+position) = 1/2 = 0.5
  Position Exponential Reciprocal : 1/(1+position)^2 = 1/4 = 0.25

Nowadays Video games are a phenomenal economic business (first matching position : 2)
  Position Linear : 1 - 0.1*position = 0.8
  Position Reciprocal : 1/(1+position) = 1/3 = 0.33
  Position Exponential Reciprocal : 1/(1+position)^2 = 1/9 = 0.11

The final score of the suggestion will be :

long score = (long) (weight * coefficient)

N.B. the reason I highlighted the data type is that it directly affects the bugs we are going to discuss.

Suggestion Score Approximation

The optional weightField parameter is extremely important for the BlendedInfixSuggester.
It assigns the value of the suggestion weight (extracted from the configured field).
e.g.
The suggestion may come from the product name field, but the suggestion weight depends on how profitable the suggested product is.

<str name="field">productName</str>
<str name="weightField">profit</str>

So far, so good, but unfortunately there are two problems with that.

Bug 1 – WeightField Not Defined -> Zero suggestion score

How To Reproduce It : Don’t define any weightField in the suggester config
Effect : the suggestion ranking is lost; all the suggestions have a score of 0 and the position of the match doesn’t matter anymore.
The weightField is not a mandatory configuration for the BlendedInfixSuggester.
Your use case may not involve any weight for the suggestions: you may just be interested in the positional scoring (the main reason the BlendedInfixSuggester exists in the first place).
Unfortunately, this is not possible at the moment :
if the weightField is not defined, each suggestion will have a weight of 0.
This is because the weight associated with each document in the document dictionary is a long: if the field to extract the weight from is not defined (null), the weight returned will just be 0.
This makes it impossible to differentiate between a weight that should be 0 (value extracted from the field) and a missing weight (no value at all).
A solution has been proposed here[3].

Bug 2 – Bad Approximation Of Suggestion Score For Small Weights

There is a misleading data type casting in the score calculation for the suggestion :

long score = (long) (weight * coefficient)

This apparently innocent cast actually brings very nasty effects if the weight associated with a suggestion is unitary or small enough.

Weight = 1, suggestion “Video gaming: the history” :
  Position Linear : 1 - 0.1*position = 0.9 ; 0.9 * 1 = 0.9 ; cast to long = 0
  Position Reciprocal : 1/(1+position) = 1/2 = 0.5 ; 0.5 * 1 = 0.5 ; cast to long = 0
  Position Exponential Reciprocal : 1/(1+position)^2 = 1/4 = 0.25 ; 0.25 * 1 = 0.25 ; cast to long = 0

Weight = 2, suggestion “Video gaming: the history” :
  Position Linear : 1 - 0.1*position = 0.9 ; 0.9 * 2 = 1.8 ; cast to long = 1
  Position Reciprocal : 1/(1+position) = 1/2 = 0.5 ; 0.5 * 2 = 1.0 ; cast to long = 1
  Position Exponential Reciprocal : 1/(1+position)^2 = 1/4 = 0.25 ; 0.25 * 2 = 0.5 ; cast to long = 0

Basically you risk losing the ranking of your suggestions, with the score reduced to only a few possible values : 0 or 1 (in edge cases).

A solution has been proposed here[3].

Multi Term Matches Handling

It is quite common to have multiple terms in the autocomplete query, so your suggester should be able to manage multiple matches in the suggestion accordingly.

Given a simple corpus (composed of just the following suggestions) and the query :
“Mini Bar Frid”

you currently see these suggestions:

  • 1000 | Mini Bar something Fridge
  • 1000 | Mini Bar something else Fridge
  • 1000 | Mini Bar Fridge something
  • 1000 | Mini Bar Fridge something else
  • 1000 | Mini something Bar Fridge

This is because, at the moment, the first matching term wins it all (the other positions are ignored).
This brings a lot of possible ties (1000) that should be broken to give the user a nice and intuitive ranking.

But intuitively I would expect the results to look something like this (note that allTermsRequired=true and the schema weight field always returns 1000) :

  • Mini Bar Fridge something
  • Mini Bar Fridge something else
  • Mini Bar something Fridge
  • Mini Bar something else Fridge
  • Mini something Bar Fridge

Let’s see a proposed Solution [4] :

Positional Coefficient

Instead of taking into account just the first term position in the suggestion, it’s possible to use all the matching positions from the matched terms [“mini”,”bar”,”fridge”].
Each position match will affect the score with :

  • How distant the matched term position is from the ideal position match
    • Query : Mini Bar Fri, Ideal Positions : [0,1,2]
    • Suggestion 1 : Mini Bar something Fridge, Matched Positions : [0,1,3]
    • Suggestion 2 : Mini Bar something else Fridge, Matched Positions : [0,1,4]
    • Suggestion 2 will be penalised as the “Fri” match happens farther (4 > 3) from the ideal position 2
  • The earlier the positional mismatch happens, the stronger the penalty for the score to pay
    • Query : Mini Bar Fri, Ideal Positions : [0,1,2]
    • Suggestion 1 : Mini Bar something Fridge, Matched Positions : [0,1,3]
    • Suggestion 2 : Mini something Bar Fridge, Matched Positions : [0,2,3]
    • Suggestion 2 will be additionally penalised as the first positional mismatch (Bar) happens closer to the beginning of the suggestion

Considering only the discontinuous positions proved useful :

Query 1 : Bar some
Query 2 : some
Suggestion : Mini Bar something Fridge
Query 1 matched term positions in the suggestion : [1,2]
Query 2 matched term positions in the suggestion : [2]

If we compare the suggestion score for both these queries, it would seem unfair to penalise the first one just because it matches 2 (consecutive) terms, while the second query has just one match (positioned worse than the first match in query 1).

Introducing this advanced positional coefficient calculation helped in improving the overall behaviour for the experimental tests created.
The results obtained were quite promising :

Query : Mini Bar Fri
100 |Mini Bar Fridge something
100 |Mini Bar Fridge something else
100 |Mini Bar Fridge a a a a a a a a a a a a a a a a a a a a a a
26 |Mini Bar something Fridge
22 |Mini Bar something else Fridge
17 |Mini something Bar Fridge
8 |something Mini Bar Fridge
7 |something else Mini Bar Fridge

There is still a tie for the exact prefix matches, but let’s see if we can finalise that improvement as well.

Token Count Coefficient

Let’s focus on the first three ranking suggestions we just saw :

Query : Mini Bar Fri
100 |Mini Bar Fridge something
100 |Mini Bar Fridge something else
100 |Mini Bar Fridge a a a a a a a a a a a a a a a a a a a a a a

Intuitively we want this order to break the ties.
The closer the number of matched terms is to the total number of terms in the suggestion, the better.
Ideally we want our top scoring suggestion to contain just the matched terms, if possible.
We also don’t want to introduce strong inconsistencies for the other suggestions: ideally we should only affect the ties.
This is achievable by calculating an additional coefficient, dependent on the term counts :
Token Count Coefficient = matched terms count / total terms count

Then we can scale this value accordingly :
90% of the final score will derive from the positional coefficient
10% of the final score will derive from the token count coefficient

Query : Mini Bar Fri
90 * 1.0 + 10*3/4 = 97|Mini Bar Fridge something
90 * 1.0 + 10*3/5 = 96|Mini Bar Fridge something else
90 * 1.0 + 10*3/25 = 91|Mini Bar Fridge a a a a a a a a a a a a a a a a a a a a a a
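
A toy sketch of this blended scoring under the 90/10 split above (not the actual patch in [4]; names are illustrative, and the positional coefficient is assumed to be already normalised between 0 and 1) :

public class BlendedScoreSketch {

    // 90% positional coefficient, 10% token count coefficient, as described above.
    static double blend(double positionalCoefficient, int matchedTerms, int totalTerms) {
        double tokenCountCoefficient = (double) matchedTerms / totalTerms;
        return 90 * positionalCoefficient + 10 * tokenCountCoefficient;
    }

    public static void main(String[] args) {
        System.out.println(blend(1.0, 3, 4));  // "Mini Bar Fridge something" -> 97.5
        System.out.println(blend(1.0, 3, 5));  // "Mini Bar Fridge something else" -> 96.0
    }
}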

It will require some additional tuning, but the overall idea should bring a better ranking function to the BlendedInfixSuggester when multiple term matches are involved!
If you have any suggestion, feel free to leave a comment below!
The code is available in the Github pull request attached to the Lucene Jira issue[4].

[1] Solr Autocomplete
[2] Blended Infix Suggester Solr Wiki
[3] LUCENE-8343
[4] LUCENE-8347

SolrCloud exceptions with Apache Zookeeper

At the time of writing (Solr 7.3.1), SolrCloud is a reliable and stable distributed architecture for Apache Solr.
But it is not perfect and failures happen.

Apache Zookeeper[1] is the system responsible for managing the communications across the SolrCloud cluster.
It contains the shared collection configurations and it has the view of the cluster status.
It is part of the brain of the cluster, a keeper that keeps the cluster healthy and functional.

It is able to answer questions such as :

• Who is the leader for this shard and collection?
• Is this node down ?
• Is this node recovering ?

The Solr nodes communicate with Zookeeper to understand who to contact when running SolrCloud operations.

This lightning blog post will present some practical tips to follow when your client application encounters some classic exceptions while dealing with SolrCloud and Apache Zookeeper.
Special thanks to the Apache Solr user mailing list contributors and the Apache Solr community: this post is an aggregation of recommendations from there and from the official code and documentation.

org.apache.solr.common.SolrException: Could not load collection from ZK: <collection name>

If you landed here with just that exception, I assume it is accompanied by something like :
“Caused by: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /collections/<collection name>/state.json”

Solr’s zkClientTimeout is used to set ZooKeeper’s sessionTimeout, and that’s what is exceeded when a session expires.
When this kind of exception happens, it means something has gone VERY wrong in the Solr-Zookeeper communication: 30 seconds (the current default[2]) is a REALLY long time when applications are trying to communicate.

Recommendation : take care of the different timeouts around, and don’t keep them too small !
e.g. for zkClientTimeout assign a value >= 30 seconds.

maxSessionTimeout (Zookeeper)
New in 3.3.0: the maximum session timeout in milliseconds that the server will allow the client to negotiate. Defaults to 20 times the tickTime.

zkClientTimeout (Solr)
Controls your client timeout.
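
For example (values are illustrative), in Solr’s solr.xml and in Zookeeper’s zoo.cfg :

<!-- solr.xml -->
<solrcloud>
  ...
  <int name="zkClientTimeout">${zkClientTimeout:30000}</int>
  ...
</solrcloud>

# zoo.cfg
tickTime=2000
# maximum session timeout (ms) the server will allow the client to negotiate
maxSessionTimeout=60000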

Once the timeouts are checked, let’s explore some possible root causes.
A session expiry can be caused by:
1. Garbage collection on the Solr node/Zookeeper – extreme GC pauses can happen with a heap that is too small or VERY large
2. Slow IO on disk
3. Network latency

Recommendations

  1. Set up a JVM profiler to monitor your Solr and Zookeeper nodes closely, paying particular attention to the garbage collection cycles and the memory usage in general : you don’t want Zookeeper to swap too much! (GCViewer[3] could be a nice tool for this)
  2. Verify that the Zookeeper node has fast write access to the disk : Zookeeper needs fast writes and ideally a separate disk allocated.
  3. Monitor your network and make sure the Solr nodes can talk effectively to the Zookeeper nodes.

In case these suggestions are not solving your problem, you may be experiencing a Solr bug.
One of them is [4], which unfortunately has not been fixed yet.

org.apache.solr.client.solrj.SolrServerException: No live SolrServers available to handle this

From the official JavaDoc :

org/apache/solr/client/solrj/impl/LBHttpSolrClient.java:369
“Tries to query a live server from the list provided in Req. Servers in the dead pool are skipped.
* If a request fails due to an IOException, the server is moved to the dead pool for a certain period of
* time, or until a test request on that server succeeds.
*
* Servers are queried in the exact order given (except servers currently in the dead pool are skipped).
* If no live servers from the provided list remain to be tried, a number of previously skipped dead servers will be tried.
* Req.getNumDeadServersToTry() controls how many dead servers will be tried.
*
* If no live servers are found a SolrServerException is thrown.”

What was the status of the cluster at the moment the exception happened ?
Was any Solr server UP and running according to Zookeeper knowledge ?

The recommendation is to check the clusterstate.json when the exception happens.
From the Solr admin UI you can open Cloud->Tree and verify which nodes are up and running.

It could very much be related to a node failure (which could have any possible cause, including GC).
I’ve seen situations where it was caused by a specific query and the real exception got hidden by a “No live SolrServers…” client exception.
Solr logs should help to identify the inner Solr problem, and JVM monitoring could rule out any memory/GC problem.
Some people saw this with wildcard queries, when every shard reported a “too many expansions…” type error but the exception in the client response was “No live SolrServers…”.

org.apache.solr.common.SolrException: Could not find a healthy node to handle the request

Pretty much the same considerations as for “No live SolrServers” apply.
This happens when the SolrJ-side load balancer is unable to retrieve a live node from the cluster (based on the Zookeeper state).
This happens before the previous exception, so the request doesn’t even reach the load-balancing HTTP Solr client.

[1] https://zookeeper.apache.org
[2] SOLR-5565
[3] https://github.com/chewiebug/GCViewer
[4] SOLR-8868

SolrCloud Leader Election Failing

At the time of writing (Solr 7.3.0), SolrCloud is a reliable and stable distributed architecture for Apache Solr.
But it is not perfect and failures happen.
This lightning blog post will present some practical tips to follow when a specific shard of a collection is down with no leader and the situation is stuck.
The following problem has been experienced with the following Solr versions :

  • 4.10.2
  • 5.4.0

Steps to solve the problem may involve manual interaction with the Zookeeper ensemble[1].
The following steps are extracted from an interesting thread on the Solr user mailing list[2] and from practical experience in the field.
In particular, thanks to Jeff Wartes for the suggestions, which proved useful for me on a couple of occasions.

Problem

  • All nodes for a Shard in a Collection are up and running
  • There is no leader for the shard
  • All the nodes are in a “Recovering” / “Recovery Failed” state
  • Search is down and the situation persists after many minutes (> 5)

Solution

A possible explanation for this problem is that the node-local version of the Zookeeper cluster state has diverged from the centralized Zookeeper cluster state.
One possible cause for the leader election to break is a Zookeeper failure: for example, you lose >= 50% of the ensemble nodes or the connectivity among the ensemble nodes for a certain period of time (this is the scenario I experienced directly).
This failure, even if resolved later, can corrupt the Zookeeper file system.
Some of the SolrCloud collections may remain in an inconsistent state.

It may be necessary to manually delete corrupted files from Zookeeper.
Let’s start from :

collections/<collection>/leader_elect/shard<x>/election
A healthy SolrCloud cluster presents as many core_nodeX entries as the total number of replicas for the shard.
You don’t want duplicates or missing nodes here.
If you’re having trouble getting a sane election, you can try deleting the lowest-numbered entries (as well as any lower-numbered duplicates) and force the election again, possibly followed by restarting the node with that lowest-numbered entry.

collections/<collection>/leader/shard<x>
Make sure that this folder exists and has the expected replica as a leader.

collections/<collection>/leader_initiated_recovery
This folder can be informative too: it represents replicas that the *leader* thinks are out of sync, usually due to a failed update request.
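
To inspect (and, only if really needed, clean up) these paths you can use the Solr Zookeeper CLI[1]; a sketch, with host and paths being illustrative :

# list the znodes to verify the election entries
server/scripts/cloud-scripts/zkcli.sh -zkhost localhost:2181 -cmd list
# delete a corrupted entry (use with extreme care, after backing up the Zookeeper data)
server/scripts/cloud-scripts/zkcli.sh -zkhost localhost:2181 -cmd clear /collections/<collection>/leader_elect/shard1/election/<entry>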

After having completed the verification above, there are a couple of Collection API endpoints that may be useful :

Force Leader Election
/admin/collections?action=FORCELEADER&collection=<collectionName>&shard=<shardName>

Force Leader Rebalance
/admin/collections?action=REBALANCELEADERS&collection=collectionName

N.B. rebalancing the leaders will affect all the shards.

 

[1] Apache Zookeeper Solr Cli

[2] Solr Mailing List Thread

[3] Solr Collection API

 

ECIR 2018 Experience


This blog is a quick summary of my (subjective) experience at ECIR 2018 : the 40th European Conference on Information Retrieval, hosted in Grenoble (France) from 26/03/2018 to 29/03/2018.

Deep Learning and Explicability

Eight of the accepted long papers were about Deep Learning.
The topics “Neural Network” and “Word Embedding” were the most frequent among the accepted (and rejected) full papers at the conference. It is clear that deep learning technologies are strongly advancing as a de facto standard in Artificial Intelligence, and this can be noticed also in Information Retrieval, where these technologies can be used to better model user behaviour and to extract topics and semantics from documents.
But if with deep learning and the advanced capabilities of complex models you gain performance, on the other hand you lose the ability to explain and debug why a given output corresponds to a given input.
A recurring topic in the Deep Learning track (and the Industry Day, actually) was finding a balance between the performance gain of new techniques and the control over them.
It is an interesting topic and I believe it is just the other face of the coin. It is good, though, that Academia is noting the importance of this aspect: in industry a technology requires control and maintenance much more than in the academic environment, and most of the time “debuggability” can affect the decision between one technology and another.

From Academic Papers to Production

The 2018 European Conference on Information Retrieval ended the 29/03 with a brilliant Industry Day focused on the perilous path from Research to Production.
This is the topic that most permeated the conference across different keynotes, sessions and informal discussions.
In Information Retrieval it is still difficult to turn successful research into successful live systems : most of the time it is not clear which party should be interested in this process.
Okapi BM25 was first published and implemented in the 1980s and 1990s; it became the default similarity in Apache Lucene 6.0 in 2016.
Academia is focused on finding new interesting problems to solve with inventive techniques, and Industry is focused on finding the quickest solution that works (usually).
This creates a gap : new solutions are not easily reproducible from academic papers and they are far from being practically ready to use outside the controlled experimental environment.
Brilliant researchers crave new interesting problems they can reason on and solve: their focus is to build a rough implementation and validate it with a few metrics (ideally with an accepted related publication).
After that, their job is done, the problem is solved, the challenge is not interesting anymore and they can move on to the next problem.
Researchers get bored easily, but they risk never seeing their research fulfilled, applied and used in real life.
Often Academia creates its own problems in order to solve them and get a publication : publish or perish -> fewer publications bring less funding.
This may be or may be not a personal problem, depending on the individual.
Industry is usually seen by academics as a boring place where you just apply consolidated techniques to get the best result with minimum effort.
Sometimes industry is just where you end up when you want to make some money and actually see your effort bring benefits to some population.
And this was (and is) true most of the time, but with the IT explosion we are living through and the boom of competition, the situation nowadays is open to change: a stronger connection between Academia and Industry can (and should!) happen, and conferences such as ECIR are a perfect ground to settle the basis.
So, building on this introduction, let’s see a quick summary of the keynotes, topics and sessions that impressed me most at the conference!

From Academic Papers to Production : A Learning To Rank Story[1]

Let’s start from Sease contribution to the conference : a story about the adoption of Learning To Rank in a real world e-commerce scenario.
The session took place at the Industry Day (29/03) and focused on the challenges and pitfalls of moving from the research papers to production.
The entire journey was possible through Open Source software : Apache Solr and RankLib as main actors.
Learning To Rank is becoming extremely popular in the industry and it is a very promising and useful technology to improve the relevancy function leveraging the user behaviour.
But from an open source perspective it is still quite a young technology, and effort is required to get it right.
The main message I wanted to transmit is : don’t be scared to fail. If something doesn’t work immediately out of the box, it doesn’t mean it’s not a valid technology. No pain, no gain: Learning To Rank open source implementations are valid but require tuning and care to bring them to production with success (and the improvement these technologies can bring is extremely valuable).

The Harsh Reality of Production Information Access Systems[2]

In this talk Fernando Diaz, recipient of the “Karen Spärck Jones Award”, focused on the problems deriving from the adoption of Information Retrieval technologies in production environments, where a deep understanding of individuals, groups and society is required.
Building on the technical aspects involved in applying research in real-world systems, the focus switched to the ethical side : are current IR systems moving in the direction of providing the user with fair and equally accessible information, or is the monetisation process behind them just producing addicted users who see (and buy) ad hoc tailored information?

Statistical Stemmers : A Reproducibility Study[3]

This paper is a sort of symbol of the conference trend : reproducibility is as important as the innovation that a piece of research brings.
Winner of the Best Paper Award, it may have caused some perplexity in the audience (why reward a paper that is not innovating but just reproducing past techniques?), but the message it transmits is clear : research needs to be easily reproducible, and effort is required in that direction for a healthy Research & Development flow that targets not just the publication but a real-world application.

Entity-centric Topic Extraction and Exploration: A Network-based Approach[4]

An interesting talk, it explores topic modelling over time in a network-based fashion :
instead of modelling a topic as a ranked list of terms, it uses a network (weighted graph) representation.
This may be interesting for an advanced More Like This implementation; it’s definitely worth an investigation.

Information Scent, Searching and Stopping : Modelling SERP Level Stopping Behaviour[5]

This talk focused on the entire Search Result Page as a possible signal that affects user stopping behaviour, i.e. when a search result page is returned, the overall page quality affects the user’s perception of relevancy and may drive an immediate query reformulation OR a good abandonment (when the information need is satisfied).
This is something that I have personally experienced : sometimes from a quick look at the result page you can realise whether the search engine understood (or misunderstood) your information need.
Different factors are involved in what the author calls “Information Scent”, but the perceived relevance (modelled through different User Experience approaches) is definitely an interesting topic that sits alongside the real relevance.
Further studies in this area may affect the way search result pages are rendered, to maximise the fruition of information.

Employing Document Embeddings to Solve the “New Catalog” Problem in User Targeting, and provide Explanations to the Users[6]

The new catalog problem is a practical problem for modern recommender systems and platforms. There are a lot of use cases where you have collections of items that you would like to recommend, and this ranges over music streaming platforms (playlists, albums, etc.), video streaming platforms (TV series genres, to-view lists, etc.) and many other domains.
This paper explores both the algorithm behind such recommendations and the need for explanation : explaining to the user why a catalog may be relevant to his/her taste is as important as providing a relevant catalog of items.

Anatomy of an Idea: Mixing Open Source, research and Business

This keynote summarises the cornerstone of Sease culture :
Open Source as a bridge between Academia and Industry.
If every research paper were implemented in a production-ready open source platform as part of the publication process, the community would get a direct and immense benefit out of it.
Iterative improvement would gain great traction and, generally speaking, the entire scientific community would get a strong boost from better accessibility.
Implementing research in state-of-the-art, production-ready open source systems (where possible) would cut the adoption time at the industry level, triggering a healthy process of utilisation and bug fixing.

Industry Day[7]

The industry day was the coronation of the overall trend permeating the conference : there is a strong need to build a better connection between the academic and industrial worlds.
And the good audience reception (the organisers had to move the track to the main venue) is proof that there is an increasingly strong need to see interesting research applied (with success or failure) to the real world, with the related lessons learned.
Plenty of talks in this session were brilliant, my favourites :

  • Fabrizio Silvestri (Facebook)
    Query Embeddings: From Research to Production and Back!
  • Manos Tsagkias (904Labs)
    A.I. for Search: Lessons Learned
  • Marc Bron (Schibsted Media Group)
    Management of Industry Research: Experiences of a Research Scientist

In conclusion, the conference was perfectly organised in an excellent venue, the balance of topics and talks was fairly good (both academic and industrial) and I really enjoyed my time in Grenoble. See you next year in Cologne !

[1] https://www.slideshare.net/AlessandroBenedetti/ecir-2018-alessandro
[2] https://www.ecir2018.org/programme/keynote-speakers/
[3] http://www.dei.unipd.it/~silvello/papers/ECIR2018_SA.pdf
[4] https://dbs.ifi.uni-heidelberg.de/files/Team/aspitz/publications/Spitz_Gertz_2018_Entity-centric_Topic_Extraction.pdf
[5] https://strathprints.strath.ac.uk/62856/
[6] https://link.springer.com/chapter/10.1007%2F978-3-319-76941-7_28
[7] https://www.ecir2018.org/industry-day/

Distributed Search Tips for Apache Solr

Distributed search is the foundation for Apache Solr Scalability :

It’s possible to distribute search across different Apache Solr nodes of the same collection (in both the legacy[1] and SolrCloud[2] architectures), but it is also possible to distribute search across different collections in a SolrCloud cluster.
Aggregating results from different collections may be useful when you put in place different systems (that were meant to be separate) and you later realise that aggregating the results is an additional useful use case.
This blog will focus on some tricky situations that can happen when running distributed search (for configuration details you can refer to the Solr wiki).

IDF

Inverse Document Frequency affects the score.
This means that a document coming from a big collection can obtain a boost from IDF in comparison to a similar document from a smaller collection.
This is because the maxDoc count is taken into account as the corpus size, so even if a term has the same document frequency, the IDF will be strongly affected by the collection size.
Distributed IDF[3] partially solves the problem :

when distributing the search across different shards of the same collection, it works quite well.
But using the ExactStatsCache and alternating single-collection distribution and multi-collection distribution in the same SolrCloud cluster will create some caching conflicts.

Specifically, if we first execute the inter-collection query, the global stats cached will be the inter-collection global stats, so if we then execute a single-collection distributed search, the previous global stats will remain cached (and vice versa).
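
For reference, distributed IDF is enabled by declaring a statsCache in solrconfig.xml, e.g. :

<statsCache class="org.apache.solr.search.stats.ExactStatsCache"/>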

Debug Scoring

The real score and the debug score are not aligned when using distributed IDF; this means that the debug query will not show the correct distributed IDF and the correct scoring calculation for distributed searches[4].

Relevancy tuning

Lucene/Solr score is not probabilistic or normalised.
For the same collection we can have completely different score scales just with different queries.
The situation becomes more complicated when we tune our relevancy adding multiplicative or additive boosts.
Different collections may imply completely different boosting logic that could cause the score of a collection to be on a completely different scale in comparison to another.
We need to be extra careful when tuning relevancy for searches across different collections, and try to configure the distributed request handler in the most compatible way possible.

Request handler

It is important to carefully specify the request handler to be used when using distributed search.
The request will hit one collection on one node, and then, when distributing, the same request handler will be called on the other collections across the other nodes.
If necessary it is possible to configure an aggregator request handler and local request handlers (this may be useful if we want to use a different scoring formula per collection, using local parameters) :

Aggregator Request Handler

It is executed on the first node receiving the request.
It will distribute the request and then aggregate the results.
It must describe parameters that are in common across all the collections involved in the search.
It is the one specified in the main request.
e.g.
http://localhost:8983/solr/collection1/select?

Local Request Handler

It is specified by passing the parameter : shards.qt=
It is executed on each node that receives the distributed query AFTER the first one.
This can be used to apply specific fields or parameters on a per-collection basis.
A local request handler may use fields and search components that are local to the collection involved.

e.g.
http://localhost:8983/solr/collection1/select?q=*:*&collection=collection1,collection2&shards.qt=localSelect

N.B. the use of a local request handler may be useful in case you want to define local query parser rules, such as a local edismax configuration, to affect the score.
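
A sketch of how such a local request handler could be defined in the collection’s solrconfig.xml (handler name matches the example above; fields are illustrative) :

<requestHandler name="localSelect" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">localTitleField^10 localDescriptionField</str>
  </lst>
</requestHandler>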

Unique Key

The unique key field must be the same across the different collections.
Furthermore the value should be unique across the different collections to guarantee a proper behaviour.
If we don’t comply with this rule, Solr will fail in aggregating the results and raise an exception.

[1] https://lucene.apache.org/solr/guide/6_6/distributed-search-with-index-sharding.html
[2] https://lucene.apache.org/solr/guide/6_6/solrcloud.html
[3] https://lucene.apache.org/solr/guide/6_6/distributed-requests.html#DistributedRequests-ConfiguringstatsCache_DistributedIDF_
[4] https://issues.apache.org/jira/browse/SOLR-7759

Solr Is Learning To Rank Better – Part 4 – Solr Integration

Last Stage Of The Journey

This blog post is about the Apache Solr Learning To Rank ( LTR ) integration.

We modelled our dataset, collected the data and refined it in Part 1.
Trained the model in Part 2.
Analysed and evaluated the model and training set in Part 3.
We are ready to rock and deploy the model and feature definitions to Solr.
In this blog post I will focus on the Apache Solr Learning To Rank (LTR) integration from Bloomberg[1].
The contribution is complete and available from Apache Solr 6.4.
This blog is heavily based on the Learning To Rank ( LTR ) Bloomberg contribution readme [2].

Apache Solr Learning To Rank ( LTR ) integration

The Apache Solr Learning To Rank ( LTR ) integration allows Solr to rerank the search results evaluating a provided Learning To Rank model.
The main responsibilities of the plugin are :

– storage of feature definitions
– storage of models
– feature extraction and caching
– search result rerank

Features Definition

As we learnt from the previous posts, the feature vector is the mathematical representation of each document/query pair and the model will score each search result according to that vector.
Of course we need to tell Solr how to generate the feature vector for each document in the search results.
Here comes the Feature Definition file.
A Json array describing all the relevant features necessary to score our documents through the machine learned LTR model.

e.g.

[{ "name": "isBook",
  "class": "org.apache.solr.ltr.feature.SolrFeature",
  "params":{ "fq": ["{!terms f=category}book"] }
},
{
  "name":  "documentRecency",
  "class": "org.apache.solr.ltr.feature.SolrFeature",
  "params": {
      "q": "{!func}recip( ms(NOW,publish_date), 3.16e-11, 1, 1)"
  }
},
{
  "name" : "userTextTitleMatch",
  "class" : "org.apache.solr.ltr.feature.SolrFeature",
  "params" : { "q" : "{!field f=title}${user_text}" }
},
{
  "name":"book_price",
  "class":"org.apache.solr.ltr.feature.FieldValueFeature",
  "params":{"field":"book_price"}
},
{
  "name":"originalScore",
  "class":"org.apache.solr.ltr.feature.OriginalScoreFeature",
  "params":{}
},
{
   "name" : "userFromMobile",
   "class" : "org.apache.solr.ltr.feature.ValueFeature",
   "params" : { "value" : "${userFromMobile:}", "required":true }
}]  
SolrFeature
– Query Dependent
– Query Independent
A Solr feature is defined by a Solr query following the Solr sintax.
The value of the Solr feature is calculated as the return value of the query run against the document we are scoring.
This feature can depend from query time parameters or can be query independent ( see examples)
e.g.
“params”:{“fq”: [“{!terms f=category}book”] }
– Query Independent
– Boolean feature
If the document match the term ‘book’ in the field ‘category’ the feature value will be 1.
It is query independent as no query param affects this calculation.
“params”:{“q”: “{!func}recip( ms(NOW,publish_date), 3.16e-11, 1, 1)”}
– Query Dependent
– Ordinal feature
The feature value will be calculated as the result of the function query, more recent the document, closer to 1 the value.
It is query dependent as ‘NOW’ affects the feature value.
“params”:{“q”: “{!field f=title}${user_text}” }
– Query Dependent
– Ordinal feature
The feature value will be calculated as the result of the query, more relevant the title content for the user query, higher the value.
It is query dependent as the ‘user_text’ query param affects the calculation.
FieldValueFeature
– Query Independent
A Fiel Value feature is defined by a Solr field.
The value of the feature is calculated as the content of the field for the document we are scoring.
The field must be STORED or DOC-VALUED . This feature is query independent ( see examples)
e.g.
"params":{ "field":"book_price" }
– Query Independent
– Ordinal feature
The value of the feature will be the content of the ‘book_price’ field for a given document.
It is query independent as no query param affects this calculation.
ValueFeature
– Query Level
– Constant
A Value feature is defined by a constant or an external query parameter.
The value of the feature is calculated as the value passed in the Solr request as an EFI (External Feature Information) parameter, or as a constant.
This feature depends only on the configured param (see the examples below).
e.g.
"params" : { "value" : "${user_from_mobile:}", "required":false }
– Query Level
– Boolean feature
The user will pass the ‘user_from_mobile’ request param as an EFI.
The value of the feature will be the value of the parameter.
The default value will be assigned if the parameter is missing in the request.
If the feature is required, an exception will be thrown when the parameter is missing in the request.
"params" : { "value" : "5", "required":false }
– Constant
– Ordinal feature
The feature value will be calculated as the constant value of ‘5’. Apart from the constant, nothing affects the calculation.
OriginalScoreFeature
– Query Dependent
An Original Score feature is defined with no additional parameters.
The value of the feature is calculated as the original Lucene score of the document given the input query.
This feature depends on query-time parameters (see the examples below).
e.g.
“params”:{}
– Query Dependent
– Ordinal feature
The feature value will be the original Lucene score given the input query.
It is query dependent as the entire input query affects this calculation.

EFI ( External Feature Information )

As you noticed in the feature definition json, external request parameters can affect the feature extraction calculation.
When running a rerank query it is possible to pass additional request parameters that will be used at feature extraction time.
We will see this in detail in the related section.

e.g.
rq={!ltr reRankDocs=3 model=externalmodel efi.user_from_mobile=1}

 

Deploy Features definition

Good, we have defined all the features we require for our model; we can now send them to Solr:

curl -XPUT 'http://localhost:8983/solr/collection1/schema/feature-store' --data-binary @/path/features.json -H 'Content-type:application/json'  
 

View Features Definition

To visualise the features just sent, we can access the feature store:

curl -XGET 'http://localhost:8983/solr/collection1/schema/feature-store'  
 

Models Definition

We extensively explored how to train models and what they look like in the format the Solr plugin expects.
For details I suggest reading Part 2.
Let’s have a quick summary anyway:

 

Linear Model (Ranking SVM, Pranking)

e.g.

 {
    "class":"org.apache.solr.ltr.model.LinearModel",
    "name":"myModelName",
    "features":[
        { "name": "userTextTitleMatch"},
        { "name": "originalScore"},
        { "name": "isBook"}
    ],
    "params":{
        "weights": {
            "userTextTitleMatch": 1.0,
            "originalScore": 0.5,
            "isBook": 0.1
        }
    }
} 
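To make the scoring concrete, here is a minimal sketch of how a linear model like myModelName combines an extracted feature vector. It is my own illustration with made-up feature values, not the plugin’s code:

import java.util.Map;

public class LinearModelSketch {
    public static void main(String[] args) {
        // Weights taken from the myModelName example above
        Map<String, Double> weights = Map.of(
                "userTextTitleMatch", 1.0,
                "originalScore", 0.5,
                "isBook", 0.1);
        // Hypothetical feature vector extracted for one document
        Map<String, Double> featureVector = Map.of(
                "userTextTitleMatch", 7.2,
                "originalScore", 12.5,
                "isBook", 1.0);
        // Weighted sum of the feature values
        double score = 0.0;
        for (Map.Entry<String, Double> w : weights.entrySet()) {
            score += w.getValue() * featureVector.getOrDefault(w.getKey(), 0.0);
        }
        // 1.0 * 7.2 + 0.5 * 12.5 + 0.1 * 1.0 = 13.55
        System.out.println("reranked score = " + score);
    }
}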

 

Multiple Additive Trees (LambdaMART, Gradient Boosted Regression Trees )

e.g.

{
    "class":"org.apache.solr.ltr.model.MultipleAdditiveTreesModel",
    "name":"lambdamartmodel",
    "features":[
        { "name": "userTextTitleMatch"},
        { "name": "originalScore"}
    ],
    "params":{
        "trees": [
            {
                "weight" : 1,
                "root": {
                    "feature": "userTextTitleMatch",
                    "threshold": 0.5,
                    "left" : {
                        "value" : -100
                    },
                    "right": {
                        "feature" : "originalScore",
                        "threshold": 10.0,
                        "left" : {
                            "value" : 50
                        },
                        "right" : {
                            "value" : 75
                        }
                    }
                }
            },
            {
                "weight" : 2,
                "root": {
                    "value" : -10
                }
            }
        ]
    }
}  
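At scoring time each tree is traversed by comparing the extracted feature values against the thresholds; the reached leaf values, multiplied by the tree weights, are summed up. Below is a minimal sketch of that evaluation, using a hypothetical Node/Tree representation of my own (an illustration, not the plugin’s code):

import java.util.List;
import java.util.Map;

public class TreesModelSketch {

    // Hypothetical node: either a split (feature + threshold) or a leaf (value)
    record Node(String feature, Double threshold, Double value, Node left, Node right) {
        static Node leaf(double value) { return new Node(null, null, value, null, null); }
        static Node split(String f, double t, Node left, Node right) { return new Node(f, t, null, left, right); }
        boolean isLeaf() { return value != null; }
    }

    record Tree(double weight, Node root) {}

    static double score(List<Tree> trees, Map<String, Double> featureVector) {
        double score = 0.0;
        for (Tree tree : trees) {
            Node node = tree.root();
            while (!node.isLeaf()) {
                double featureValue = featureVector.getOrDefault(node.feature(), 0.0);
                node = (featureValue <= node.threshold()) ? node.left() : node.right();
            }
            score += tree.weight() * node.value();
        }
        return score;
    }

    public static void main(String[] args) {
        // The lambdamartmodel example above
        List<Tree> trees = List.of(
                new Tree(1, Node.split("userTextTitleMatch", 0.5,
                        Node.leaf(-100),
                        Node.split("originalScore", 10.0, Node.leaf(50), Node.leaf(75)))),
                new Tree(2, Node.leaf(-10)));
        // Hypothetical feature vector: the title matches (1.0) and the original score is 12.5
        double finalScore = score(trees, Map.of("userTextTitleMatch", 1.0, "originalScore", 12.5));
        System.out.println(finalScore); // 1 * 75 + 2 * (-10) = 55.0
    }
}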

Heuristic Boosted Model (experimental)

The Heuristic Boosted Model is an experimental model that combines a linear boost with any other model.
It is currently available in the experimental branch [3].
This capability is currently supported only by org.apache.solr.ltr.ranking.HeuristicBoostedLambdaMARTModel.
The reason behind this approach is that sometimes, at training time, we don’t have available all the features we want to use at query time.
e.g.
Your training set is not built on clicks on the search results and contains legacy data, but you want to include the original score as a boosting factor.
Let’s see the configuration in detail:
Given :

"features":[ { "name": "userTextTitleMatch"}, { "name": "originalScoreFeature"} ]
"boost":{ "feature":"originalScoreFeature", "weight":0.1, "type":"SUM" }  

The original score feature value, weighted by a factor of 0.1, will be added to the score produced by the LambdaMART trees.

 "boost":{ "feature":"originalScoreFeature", "weight":0.1, "type":"PRODUCT" }  
 

The original score feature value, weighted by a factor of 0.1, will be multiplied with the score produced by the LambdaMART trees.
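e.g.
Assuming the trees alone produce a score of 40 and the original score feature value is 20, with a weight of 0.1: "SUM" would give 40 + 0.1 * 20 = 42, while "PRODUCT" would give 40 * 0.1 * 20 = 80 (this is my reading of the two combination types described above, so double check it against the experimental branch).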

N.B. Take extra care when using this approach. This introduces a manual boost into the score calculation, which adds flexibility when you don’t have much data for training. However, you will lose some of the benefits of a machine learned model, which was optimised to rerank your results. As you get more data and your model becomes better, you should phase out the manual boosting.

 

e.g.

{
    "class":"org.apache.solr.ltr.ranking.HeuristicBoostedLambdaMARTModel",
    "name":"lambdamartmodel",
    "features":[
        { "name": "userTextTitleMatch"},
        { "name": "originalScoreFeature"}
    ],
    "params":{
        "boost": {
            "feature": "originalScoreFeature",
            "weight": 0.5,
            "type": "SUM"
        },
        "trees": [
            {
                "weight" : 1,
                "root": {
                    "feature": "userTextTitleMatch",
                    "threshold": 0.5,
                    "left" : {
                        "value" : -100
                    },
                    "right": {
                        "value" : 10
                    }
                }
            },
            {
                "weight" : 2,
                "root": {
                    "value" : -10
                }
            }
        ]
    }
}

Deploy Model

As we saw for the features definition, deploying the model is quite straightforward :

curl -XPUT 'http://localhost:8983/solr/collection1/schema/model-store' --data-binary @/path/model.json -H 'Content-type:application/json' 
 

View Model

The model will be stored in an easily accessible json store:

curl -XGET 'http://localhost:8983/solr/collection1/schema/model-store'
 

Rerank query

To rerank your search results using a machine learned LTR model it is required to call the rerank component using the Apache Solr Learning To Rank ( LTR ) query parser.

Query Re-Ranking allows you to run an initial query(A) for matching documents and then re-rank the top N documents re-scoring them based on a second query (B).
Since the more costly ranking from query B is only applied to the top N documents it will have less impact on performance than just using the complex query B by itself – the trade off is that documents which score very low using the simple query A may not be considered during the re-ranking phase, even if they would score very highly using query B.  Solr Wiki

The Apache Solr Learning To Rank ( LTR ) integration defines an additional query parser that can be used to define the rerank strategy.
In particular, when rescoring a document in the search results :

  • Features are extracted from the document
  • Score is calculated evaluating the model against the extracted feature vector
  • Final search results are reranked according to the new score
rq={!ltr model=myModelName reRankDocs=25}

!ltr – will use the ltr query parser
model=myModelName – specifies which model in the model-store to use to score the documents
reRankDocs=25 – specifies that only the top 25 search results from the original ranking will be scored and reranked

When passing external feature information (EFI) that will be used to extract the feature vector, the syntax is pretty similar :

rq={!ltr reRankDocs=3 model=externalmodel efi.parameter1='value1' efi.parameter2='value2'}

e.g.

rq={!ltr reRankDocs=3 model=externalModel efi.user_input_query='Casablanca' efi.user_from_mobile=1}
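For clarity, a complete request could look like the following (collection name, main query and field list are just placeholders; the rq parameter must be URL encoded, which curl's --data-urlencode takes care of):

curl -G 'http://localhost:8983/solr/collection1/select' \
  --data-urlencode 'q=title:casablanca' \
  --data-urlencode 'rq={!ltr reRankDocs=3 model=externalModel efi.user_input_query=Casablanca efi.user_from_mobile=1}' \
  --data-urlencode 'fl=id,title,score'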

Sharding

When using sharding, each shard will rerank its own results, so reRankDocs will be applied per shard.

e.g.
10 shards
You run a distributed query with:
rq={!ltr reRankDocs=10 …
You will get a total of 100 documents reranked.

Pagination

Pagination is delicate[4].

Let’s explore the scenario on a single Solr node and on a sharded architecture.

 

Single Solr node

reRankDocs=15
rows=10

This means each page is composed of 10 results.
What happens when we hit page 2?
The first 5 documents on the page (positions 11-15) will have been rescored and affected by the reranking.
The remaining 5 documents (positions 16-20) will preserve their original score and original ranking.

e.g.
Doc 11 – score= 1.2
Doc 12 – score= 1.1
Doc 13 – score= 1.0
Doc 14 – score= 0.9
Doc 15 – score= 0.8
Doc 16 – score= 5.7
Doc 17 – score= 5.6
Doc 18 – score= 5.5
Doc 19 – score= 4.6
Doc 20 – score= 2.4

This means that score(15) could be < score(16), but documents 15 and 16 are still in the expected order.
The reason is that the top 15 documents are rescored and reranked while the rest are left unchanged.

Sharded architecture

reRankDocs=15
rows=10
Shards number=2

When looking for page 2, Solr will trigger queries to the shards to collect 2 pages per shard:
Shard1 : 10 ReRanked docs (page1) + 10 OriginalScored docs (page2)
Shard2 : 10 ReRanked docs (page1) + 10 OriginalScored docs (page2)

The results will then be merged and, possibly, originally scored search results can end up above reranked docs.
A possible solution could be to normalise the scores, to prevent any possibility of a reranked result being surpassed by originally scored ones.

Note: the problem happens once rows * page > reRankDocs. When reRankDocs is quite high, the problem will occur only with deep paging.
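For illustration only, here is a naive sketch of such a normalisation (my own idea, not something the plugin provides): squash the original scores into a range strictly below the lowest reranked score before merging.

import java.util.List;

public class ScoreNormalisationSketch {

    // Rescale the original scores so they always stay below the lowest reranked score
    static double[] normaliseOriginalScores(List<Double> originalScores, double minRerankedScore) {
        double maxOriginal = originalScores.stream().mapToDouble(Double::doubleValue).max().orElse(0.0);
        return originalScores.stream()
                .mapToDouble(s -> minRerankedScore * (s / (maxOriginal + 1.0)))
                .toArray();
    }

    public static void main(String[] args) {
        // Original scores from the single node example above (documents 16-20)
        List<Double> originalScores = List.of(5.7, 5.6, 5.5, 4.6, 2.4);
        double minRerankedScore = 0.8; // reranked score of document 15
        for (double s : normaliseOriginalScores(originalScores, minRerankedScore)) {
            System.out.println(s); // every value is now strictly below 0.8
        }
    }
}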

Feature Extraction And Caching

Extracting the features from the search result documents is the most onerous task when reranking using LTR.
The LTRScoringQuery will take care of computing the feature values in the feature vector and then delegate the final score generation to the LTRScoringModel.
For each document the definitions in the feature-store are applied to generate the vector.
The vector can be generated in parallel, leveraging a multi-threaded approach.
Extra care must be taken when configuring the number of threads involved.
The feature vector is currently cached in toto in the QUERY_DOC_FV cache.

This means that given the query and EFIs, we cache the entire feature vector for the document.

Simply passing a different EFI request parameter will imply a different hashcode for the feature vector and consequently invalidate the cached one.
This bit can potentially be improved by managing separate caches for the query-independent, query-dependent and query-level features [5].

Solr Is Learning To Rank Better – Part 3 – Ltr tools

Apache Solr Learning to Rank – Things Get Serious

This blog post is about the Apache Solr Learning to Rank Tools : a set of tools to ease the utilisation of the Apache Solr Learning To Rank integration.

The model has been trained in Part 2 and we are ready to deploy it to Solr, but first it would be useful to have a better understanding of what we just created.
A LambdaMART model in a real world scenario is a massive ensemble of regression trees, not the most readable structure for a human.
The more we understand the model, the easier it will be to find anomalies and to fix/improve it.
But the most important benefit of having a clearer picture of the training set and the model is that it can dramatically improve the communication with the business layer:
  • What are the most important features in our domain ?
  • What kind of document should score high according to the model ?
  • Why is this document (feature vector) scoring that high?
These are only examples, but a lot of similar questions can arise, and we need the tools to answer them.

Apache Solr Learning to Rank Tools

This is how the Ltr tools project [1] was born ( LTR stands for Learning To Rank ).
The goal of the project is to use the power of Apache Solr to visualise and understand a Learning To Rank model.
It is a set of simple tools specifically designed for LambdaMART models, represented in the JSON format supported by the Bloomberg Apache Solr Learning To Rank integration.
Of course it is open source, so feel free to extend it by introducing additional models and functionalities.
All the tools provided are meant to work with a Solr backend in order to index data that we can later search easily.
The tools currently available provide support to:
– index the model  in a Solr collection
– index the training set in a Solr collection
– print the top scoring leaves from a LambdaMART model

Preparation

To use the Learning To Rank ( LTR ) tools you must proceed with these simple steps :

  • set up the Solr backend – this will be a fresh Solr instance with 2 collections: models and trainingSet; the simple configuration is available in ltr-tools/configuration
  • gradle build – this will package the executable jar in : ltr-tools/ltr-tools/build/libs
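e.g. (assuming the build is run from the project root and the jar name used in the examples below):

gradle build
java -jar ltr-tools/ltr-tools/build/libs/ltr-tools-1.0.jar -help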

Usage

Let’s briefly take a look at the parameters of the executable command line interface:
-help : prints the help message
-tool : the tool to execute; possible values: modelIndexer, trainingSetIndexer, topScoringLeavesViewer
-solrURL : the Solr base URL to use for the search backend
-model : the path to the model.json file
-topK : the number of top scoring leaves to return (sorted by descending score)
-trainingSet : the path to the training set file
-features : the path to the feature-mapping.json, a file containing a mapping between the feature Id and the feature name
-categoricalFeatures : the path to a file containing the list of categorical feature names
 

N.B. all the following examples assume the input model is a LambdaMART model, in the JSON format the Bloomberg Solr plugin expects.

 

Model Indexer

Requirement : Backend Solr collection <models> must be UP & RUNNING

The Model Indexer is a tool that indexes a LambdaMART model in Solr to better visualise the structure of the tree ensemble.
In particular, the tool will index each branching split of the trees belonging to the LambdaMART ensemble as a Solr document.
Let’s take a look at the Solr schema:

configuration/solr/models/conf
 <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>  
 <field name="modelName" type="string" indexed="true" stored="true"/>  
 <field name="feature" type="string" indexed="true" stored="true" docValues="true"/>  
 <field name="threshold" type="double" indexed="true" stored="true" docValues="true"/>  
 ...  
 

So, given an input LambdaMART model:

 

e.g. lambdaMARTModel1.json

 {  
   "class":"org.apache.solr.ltr.ranking.LambdaMARTModel",  
   "name":"lambdaMARTModel1",  
   "features":[  
    {  
      "name":"feature1"  
    },  
    {  
      "name":"feature2"  
    }  
   ],  
   "params":{  
    "trees":[  
      {  
       "weight":1,  
       "root":{  
         "feature":"feature1",  
         "threshold":0.5,  
         "left":{  
          "value":80  
         },  
         "right":{  
          "feature":"feature2",  
          "threshold":10.0,  
          "left":{  
            "value":50  
          },  
          "right":{  
            "value":75  
          }  
         }  
       }  
      }  
    ]  
   }  
 }  
 

N.B. a branching split is where the tree splits into 2 branches:

 "feature":"feature2",   
      "threshold":10.0,   
      "left":{   
       "value":50   
      },   
      "right":{   
       "value":75   
      }   

A split happens on a threshold of the feature value.
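e.g. for lambdaMARTModel1 above, the two branching splits would roughly translate into two Solr documents like these (ids omitted, field names taken from the schema shown before):

{ "modelName":"lambdaMARTModel1", "feature":"feature1", "threshold":0.5 }
{ "modelName":"lambdaMARTModel1", "feature":"feature2", "threshold":10.0 }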
We can use the tool to start the indexing process :

 java -jar ltr-tools-1.0.jar -tool modelIndexer -model /models/lambdaMARTModel1.json  -solrURL http://localhost:8983/solr/models

After the indexing process has finished we can access Solr and start searching !
e.g.
This query will return in the response, for each feature:

  • the number of times the feature appears at a branch split
  • the top 10 occurring thresholds for that feature
  • the number of unique thresholds that appear in the model for that feature
 http://localhost:8983/solr/models/select?indent=on&q=*:*&wt=json&facet=true&json.facet={  
      Features: {  
           type: terms,  
           field: feature,  
           limit: -1,  
           facet: {  
                Popular_Thresholds: {  
                     type: terms,  
                     field: threshold,  
                     limit: 10  
                },  
                uniques: "unique(threshold)"  
           }  
      }  
 }&rows=0&fq=modelName:lambdaMARTModel1  


Let’s see how it is possible to interpret the Solr response:

 facets": {  
   "count": 3479, //number of branch splits in the entire model  
   "Features": {  
     "buckets": [  
       {  
         "val": "product_price",  
         "count": 317, //the feature "product_price" is occurring in the model in 317 splits  
         "uniques": 28, //the feature "product_price" is occurring in the splits with 28 unique threshold values  
         "Popular_Thresholds": {  
           "buckets": [  
             {  
               "val": "250.0", //threshold value  
               "count": 45 //the feature "product_price" is occurring in the splits 45 times with threshold "250.0"  
             },  
             {  
               "val": "350.0",  
               "count": 45  
             },  
             ...  

 

TrainingSet Indexer

Requirement : Backend Solr collection <trainingSet> must be UP & RUNNING

The Training Set Indexer is a tool that indexes a Learning To Rank training set (in RankLib format) in Solr to better visualise the data.
In particular, the tool will index each training sample of the training set as a Solr document.
Let’s see the Solr schema:

configuration/solr/trainingSet/conf
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
<field name="relevancy" type="tdouble" indexed="true" stored="true" docValues="true"/> 
<dynamicField name="cat_*" type="string" indexed="true" stored="true" docValues="true"/>
<dynamicField name="*" type="tdouble" indexed="true" stored="true" docValues="true"/> 

As you can notice, the main point here is the definition of dynamic fields.
Indeed, we don’t know the names of the features beforehand, but we can distinguish between categorical features (which we can index as strings) and ordinal features (which we can index as doubles).

We now require 3 inputs:

 

1) the training set in the RankLib format: 

 

e.g. training1.txt

1 qid:419267 1:300 2:4.0 3:1 6:1
4 qid:419267 1:250 2:4.5 4:1 7:1
5 qid:419267 1:450 2:5.0 5:1 6:1
2 qid:419267 1:200 2:3.5 3:1 8:1 
 

2) the feature mapping to translate the feature Id to a human readable feature name

e.g. features-mapping1.json

 {"1":"product_price","2":"product_rating","3":"product_colour_red","4":"product_colour_green","5":"product_colour_blue","6":"product_size_S","7":"product_size_M","8":"product_size_L"}  

N.B. the mapping must be a JSON object on a single line

This input file is optional; it is possible to index the feature Ids directly as names.

3) the list of categorical features

e.g. categoricalFeatures1.txt

product_colour
product_size 

This list (one feature per line) tells the tool which features are categorical, so that the category can be indexed as a string value for the feature.
This input file is optional; otherwise the categorical features are simply indexed as binary one hot encoded features.
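Putting the three inputs together, the first training sample above would roughly become a Solr document like this (field layout inferred from the schema and the mappings, so treat it just as an illustration):

{ "relevancy":1.0, "product_price":300.0, "product_rating":4.0, "cat_product_colour":"red", "cat_product_size":"S" }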

To start the indexing process :

 java -jar ltr-tools-1.0.jar -tool trainingSetIndexer -trainingSet /trainingSets/training1.txt -features /featureMappings/feature-mapping1.json -categoricalFeatures /feature/categoricalFeatures1.txt -solrURL http://localhost:8983/solr/trainingSet

After the indexing process has finished we can access Solr and start searching !
e.g.
This query will return in the response all the training samples, filtered and then faceted on the relevancy field.

This can give an indication of the distribution of the relevancy score in specific subsets of the training set.

 http://localhost:8983/solr/trainingSet/select?indent=on&q=*:*&wt=json&fq=cat_product_colour:red&rows=0&facet=true&facet.field=relevancy


N.B. this is a quick and dirty way to explore the training set, so I suggest using it only as a quick resource. Advanced data plotting is more suitable for visualising big data and identifying patterns.

Top Scoring Leaves Viewer

The top scoring leaves viewer is a tool to print the path of the top scoring leaves in the model.
Thanks to this tool it will be easier to answer questions like:
“What should a document (feature vector) look like to get a high score?”

The tool will simply visit the ensemble of trees in the model and keep track of the scores of each leaf.

So, given an input LambdaMART model:

 

e.g. lambdaMARTModel1.json

 {  
   "class":"org.apache.solr.ltr.ranking.LambdaMARTModel",  
   "name":"lambdaMARTModel1",  
   "features":[  
    {  
      "name":"feature1"  
    },  
    {  
      "name":"feature2"  
    }  
   ],  
   "params":{  
    "trees":[  
      {  
       "weight":1,  
       "root":{  
         "feature":"feature1",  
         "threshold":0.5,  
         "left":{  
          "value":80  
         },  
         "right":{  
          "feature":"feature2",  
          "threshold":10.0,  
          "left":{  
            "value":50  
          },  
          "right":{  
            "value":75  
          }  
         }  
       }  
      }, ...  
    ]  
   }  
 }  

To start the process :

 java -jar ltr-tools-1.0.jar -tool topScoringLeavesViewer -model /models/lambdaMARTModel1.json -topK 10  
This will print the top 10 scoring leaves (with the related path in the tree):
1000.0 -> feature2 > 0.8, feature1 <= 100.0
200.0 -> feature2 <= 0.8, 
80.0 -> feature1 <= 0.5, 
75.0 -> feature1 > 0.5, feature2 > 10.0, 
60.0 -> feature2 > 0.8, feature1 > 100.0, 
50.0 -> feature1 > 0.5, feature2 <= 10.0,  
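Below is a minimal sketch of the traversal behind the viewer, using a hypothetical node representation of my own (an illustration, not the tool’s code): walk every tree, accumulate the split conditions along the path, and collect the leaf scores.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class TopScoringLeavesSketch {

    // Hypothetical tree node: a leaf carries a value, a split carries feature + threshold
    record Node(String feature, Double threshold, Double value, Node left, Node right) {}

    record Leaf(double score, String path) {}

    static void collectLeaves(Node node, String path, double treeWeight, List<Leaf> out) {
        if (node.value() != null) { // leaf reached: store its weighted score and the path leading to it
            out.add(new Leaf(treeWeight * node.value(), path));
            return;
        }
        collectLeaves(node.left(), path + node.feature() + " <= " + node.threshold() + ", ", treeWeight, out);
        collectLeaves(node.right(), path + node.feature() + " > " + node.threshold() + ", ", treeWeight, out);
    }

    public static void main(String[] args) {
        // First tree of lambdaMARTModel1 above, weight 1
        Node tree = new Node("feature1", 0.5, null,
                new Node(null, null, 80.0, null, null),
                new Node("feature2", 10.0, null,
                        new Node(null, null, 50.0, null, null),
                        new Node(null, null, 75.0, null, null)));
        List<Leaf> leaves = new ArrayList<>();
        collectLeaves(tree, "", 1.0, leaves);
        leaves.sort(Comparator.comparingDouble(Leaf::score).reversed());
        // topK = 2 prints 80.0 and 75.0 with their paths
        leaves.stream().limit(2).forEach(l -> System.out.println(l.score() + " -> " + l.path()));
    }
}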

Conclusion

The Apache Solr Learning To Rank tools are quick and dirty solutions to help people better understand and work with Learning To Rank models.
They are far from optimal, but I hope they will be helpful for people working on similar problems.
Any contribution, improvement or bugfix is welcome!

[1] Learning To Rank tools