Synonyms and Stopwords: Vademecum

In this post we’ll cover two additional synonyms scenarios and we’ll try to summarise all the previous tips in a concise form. Following the approach of the previous posts [1] [2] [3], everything applies both to Apache Solr and Elasticsearch.

Preconditions

  • Synonyms and stopwords at query time: this is not just a “theoretical” constraint; imagine having to manage a deployment context, belonging to the same customer, with a lot of small / medium indexes: you cannot rebuild everything from scratch each time a synonym or a stopword changes.
  • Synonyms, not hypernyms or hyponyms: or better, we aren’t talking about what a thesaurus calls broader, narrower or related terms. Although some of the things below could also be valid in those contexts, the broader or narrower scope introduced with hypernyms, hyponyms or related concepts can have some weird side-effects on the scoring phase.

Test data

Let’s start with the test data.

  • synonyms = [“out of warranty, oow”, “transfer phone number, port number”]
  • stopwords = [“of”, “my”]
  • query analyzer = [ “standard_tokenizer”, “lowercase filter”, “synonyms (graph) filter”, “stopwords filter”]

#1: How can I define Multi-terms Concepts?

If you want to manage a multi-terms concept as a whole, regardless of whether it has synonyms or not, you can use the synonyms file. Here’s a couple of examples: the first is a concept with one synonym, the second one doesn’t have any synonym:

Multimedia Messaging Service,Multimedia Text Message,MMS
Apache Cassandra, Apache Cassandra

As you can see, when a concept doesn’t have any available synonym, we can just repeat it.

Solr users only: don’t forget the following things:

  • the request handler should use an edismax or lucene query parser, and the SplitOnWhiteSpace flag (sow) must be set to false, so that the whole query string reaches the analyzer
  • the field type which includes the synonyms graph filter must have the autoGeneratePhraseQueries set to true

You can read more here [1] about this approach.

Note: this will work as long as the Lucene SynonymMap uses a List/Array for collecting the synonyms associated with a given concept. If and when the implementation switches to a Set-like approach, there’s a high chance this trick will stop working.

#2: What if the query contains multi-terms concepts with stopwords?

Imagine a query like this:

q=my car is out of warranty. What can I do?

Well, with the configuration above, the stopwords removal after the synonyms detection causes a weird effect on the generated query: the “what” term is wrongly added to the synonym phrase query: “out ? warranty what”.

While the issue affects the FilteringTokenFilter (the superclass of StopFilter) and therefore has a wider scope, for this specific problem we proposed a solution [2], consisting of a specialised StopFilter which is aware of synonym tokens. The result is that terms which are part of a previously detected synonym are not removed, even if they are stopwords. The query analyzer of our field becomes something like this:

<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SynonymGraphFilterFactory" 
        synonyms="synonyms.txt" 
        ignoreCase="false" 
        expand="true"/>
<filter class="io.sease.SynonymAwareStopFilterFactory" 
        words="stopwords.txt" 
        ignoreCase="true"/>

#3: What if the document contains multi-terms concepts with “intruder” stopwords?

We have a document like this:

{
  "id": 1,
  "title": "how do I transfer my phone number?"
}

and the query:

q=transfer phone number procedure

at query time, the synonym is correctly detected and phrase clauses are generated, but unfortunately it doesn’t match the document above because of the intermediate “my” stopword.

You can read here [3] the proposed solution for this scenario, which basically consists of a two-step query plan: in the first step, the detected synonyms generate phrase clauses, while in the second they are destructured into term clauses.
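
Conceptually, using the test data above (synonyms “transfer phone number, port number”), the two steps would generate queries shaped like these (a sketch in the same notation used for the other generated queries in this post):

(PhraseQuery(title:"transfer phone number") PhraseQuery(title:"port number")) title:procedure
((+title:transfer +title:phone +title:number) (+title:port +title:number)) title:procedure

The second, destructured query matches the document above, because all three terms appear in the title despite the “my” intruder.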

#4: What if the query contains multi-terms concepts with “intruder” stopwords?

And here we are in the opposite case. We have a document like this:

{ 
  "id": 1, 
  "title": "transfer phone number procedure" 
}

and the query:

q=how do I transfer my phone number?

As you can see, at query time the synonym is not detected because of the “my” stopword between the terms. While the document above could still be part of the response of the generated query, here we are focusing on the missing synonym detection.

A possible solution is to double the synonym filter before and after the stopwords filter:

<fieldtype 
       name="text_with_synonyms_phrases" 
       class="solr.TextField" autoGeneratePhraseQueries="true">
       
       <analyzer type="index">
           <tokenizer class="solr.StandardTokenizerFactory"/>
           <filter class="solr.LowerCaseFilterFactory"/>
       </analyzer>
       <analyzer type="query">
           <tokenizer class="solr.StandardTokenizerFactory"/>
           <filter class="solr.LowerCaseFilterFactory"/>
           <filter class="solr.SynonymGraphFilterFactory" 
                   synonyms="synonyms.txt" 
                   ignoreCase="true" 
                   expand="true"/>
           <filter class="io.sease.SynonymAwareStopFilterFactory" 
                   words="stopwords.txt" 
                   ignoreCase="true"/>
           <filter class="solr.SynonymGraphFilterFactory" 
                   synonyms="synonyms.txt" 
                   ignoreCase="true" 
                   expand="true"/>
       </analyzer>
</fieldtype>

In the first iteration the synonym is not detected; then the StopFilter removes the “my” stopword, so in the second iteration the synonym is correctly recognized. Note the StopFilter is still the custom class we introduced in #2, because we also want to cover that scenario.

What is the drawback of this approach? It worked in my specific case, but be aware that the SynonymGraphFilter documentation states this explicit warning:

NOTE: this cannot consume an incoming graph; results will be undefined.

#5 (UNSOLVED) What if the query contains multi-terms concepts with more than one “intruder” stopword?

This is the worst case, where we have a query like this:

q=out of my warranty

That is: we have a couple of terms which have been declared as stopwords, but the first (of) is potentially part of a synonym (out of warranty) while the second (my) isn’t.

We’re still working on this case, so unfortunately there’s no proposal here; if you have any idea or feedback, it is warmly welcome.


[1] Multi-terms concepts in Apache Solr / Elasticsearch
[2] SynonymAwareStopFilter
[3] https://sease.io/2018/08/still-synonyms-stopwords-mamma-mia.html

Still Synonyms + Stopwords?? Mamma mia!

The Context

Brief recap of where we arrived in the preceding article: we had the following synonyms and stopwords settings:

  • synonyms = {“out of warranty”,”oow”}
  • stopwords = {“of”}

Both of those filters were configured exclusively at query-time; the synonym filter first and then the stopwords filter.

Using the built-in StopFilter we had a synonym detection issue caused by the removal of the “of” term in the query string (e.g. “my device ran out of warranty“). For that reason, we introduced a custom StopFilter subclass which was aware of stopwords in synonyms.

The other scenario we are going to describe is a little bit different: let’s suppose we have the following data:

  • synonyms = {test code, tdd, testing}
  • stopwords = {my, your, how, to, in}

Here as well, we want to manage synonyms and stopwords only at query time.
We have this document indexed:

   {
      "id": 1,
      "title": "Java programmer: do you want to test your code?"
   }

And a query like this:

"how to test code in Java?"

The Problem: missing synonym match

The query parser matches the “test code” synonym in the query and produces a query like this:

(title:tdd title:testing PhraseQuery(title:"test code")) title:java

unfortunately there’s no match, because the document title contains an intruder: the “your” term between “test” and “code”.

A Solution: invisible queries with and without synonym phrases

In the preceding article we underlined the role of the autoGeneratePhraseQueries flag. It is responsible for creating phrase clauses for all detected multi-terms synonyms. In case this flag is set to false (or is missing), the generated query won’t have any phrase, even if a multi-term synonym is detected.

While ordinarily this is not what you would expect, in this specific case it could be a valid alternative for dealing with such mismatching: a first request would require the “synonym phrasing” behaviour, a second one wouldn’t. The first query would be:

(title:tdd title:testing PhraseQuery(title:"test code")) title:java

After receiving an empty response, a second query will be sent, targeting another (similar) field whose field type has the autoGeneratePhraseQueries parameter set to false. That would generate the following query:

(title:testing title:tdd (+title:test +title:code)) title:java

and here we would get a match!

A couple of notes:

  • in the second try we are requiring the presence of those two terms (“test” and “code”) in whatever order, with whatever proximity, so the increased recall could produce some unexpected results. In case we are using the edismax query parser, a “pf” parameter would be helpful for moving up those results which adhere better to the entered query, in terms of proximity and terms order (see the example right after this list).
  • we could put the stop filter at index time, but that violates the precondition: we want a pure query-time management.
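
For example, the second request could look like this (a sketch; the pf boost value is purely illustrative):

/search?q=how to test code in Java&defType=edismax&df=title_without_synonyms_phrases&pf=title_without_synonyms_phrases^10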

How do we implement such a search workflow? In Solr, we need a couple of fields: the first one is exactly the field + field type we described in the preceding article; the second is similar, the only difference being the autoGeneratePhraseQueries parameter, which is set to false:

<fieldtype 
       name="text_with_synonyms_phrases" 
       class="solr.TextField" autoGeneratePhraseQueries="true">
       
       <analyzer type="index">
           <tokenizer class="solr.StandardTokenizerFactory"/>
           <filter class="solr.LowerCaseFilterFactory"/>
       </analyzer>
       <analyzer type="query">
           <tokenizer class="solr.StandardTokenizerFactory"/>
           <filter class="solr.LowerCaseFilterFactory"/>
           <filter class="solr.SynonymGraphFilterFactory" 
                   synonyms="synonyms.txt" 
                   ignoreCase="false" 
                   expand="true"/>
           <filter class="sc.SynonymAwareStopFilterFactory" 
                   words="stopwords.txt" 
                   ignoreCase="true"/>
       </analyzer>
</fieldtype>
<fieldtype 
       name="text_without_synonyms_phrases" 
       class="solr.TextField" autoGeneratePhraseQueries="false">
       
       <analyzer type="index">
           <tokenizer class="solr.StandardTokenizerFactory"/>
           <filter class="solr.LowerCaseFilterFactory"/>
       </analyzer>
       <analyzer type="query">
           <tokenizer class="solr.StandardTokenizerFactory"/>
           <filter class="solr.LowerCaseFilterFactory"/>
           <filter class="solr.SynonymGraphFilterFactory" 
                   synonyms="synonyms.txt" 
                   ignoreCase="false" 
                   expand="true"/>
           <filter class="sc.SynonymAwareStopFilterFactory" 
                   words="stopwords.txt" 
                   ignoreCase="true"/>
       </analyzer>
</fieldtype>

<field 
      name="title_with_synonyms_phrases" 
      type="text_with_synonyms_phrases" .../>
<field 
      name="title_without_synonyms_phrases" 
      type="text_without_synonyms_phrases" .../>

then, here is the minimal request handler:

<requestHandler name="/search" class="solr.SearchHandler" default="true">
    <lst name="defaults">
        <bool name="sow">false</bool>
        <str name="df">title_with_synonyms_phrases</str>
        <str name="defType">lucene</str>
    </lst>
</requestHandler>

A client would send first a request like this:

/search?q=how to test code in Java

And, after receiving an empty response, it will send a second query:

/search?q=how to test code in Java&df=title_without_synonyms_phrases

Another option, which moves the search workflow to the Solr side, is our CompositeRequestHandler [1], a Solr component which invokes a chain of RequestHandler instances: a first request handler, targeting title_with_synonyms_phrases, would be invoked and, in case of zero results, the same query would be sent to another request handler, targeting title_without_synonyms_phrases.

Note for Elasticsearch users: you will find some differences in applying what is described above. Although the auto_generate_phrase_queries attribute is also present in Elasticsearch, it doesn’t have the same effect. What you’re looking for is an attribute which is not related to field types: it is a query attribute [2] [3] called auto_generate_synonyms_phrase_query.
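
For example, the second, “no synonym phrases” step could be expressed with a match query like this (a sketch; the field name is illustrative):

{
  "query": {
    "match": {
      "title": {
        "query": "how to test code in Java",
        "auto_generate_synonyms_phrase_query": false
      }
    }
  }
}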


[1] https://github.com/SeaseLtd/composite-request-handler
[2] Match Query / Synonyms
[3] Query String Query / Synonyms

Synonyms + Stopwords?? OMG!

The Context

The scenario description is quite simple: we want to use synonyms and stopwords.

Following the path of our previous article, we will introduce an additional component in the analysis chain: a StopFilter, which, as the name suggests, removes a set of words from an incoming token stream.

We will use the following data through the examples:

  • synonyms = [“out of warranty”,”oow”]
  • stopwords = [“of”]

Token filters can be configured at index and/or query time. In this context we are focused on the query side: both synonyms and stopwords will be configured only in the query analyzer.

Working exclusively at query time has a great benefit: we can change things at runtime without any reindex need. At the same time, no stopwords filtering will be executed at index time so those terms will be uselessly part of the dictionary.

The Problem: synonyms followed by stopwords

We have the following analyzers:

  • index analyzer
    • standard-tokenizer
    • lowercase
  • query analyzer
    • standard-tokenizer
    • lowercase + synonyms + stopwords

Theoretically, in the query analyzer we would have two options: the stopwords filter could be defined before or after the synonym filter. However, the first way (before) doesn’t make much sense, because terms that are stopwords and that are, at the same time, part of a synonym will be removed before the synonym detection. As a consequence, those synonyms won’t be detected: in the example data, issuing a query like

q=out of warranty

the “of” term will be removed by the StopFilter, and the subsequent filter would receive [“out”, “warranty”], which doesn’t match the configured synonym (“out of warranty”).

Elasticsearch users: Elasticsearch doesn’t allow this scenario at all; if you try to use the PUT Settings API with a chain defined as above (first stopwords, then synonyms with some term intersection), it will throw an illegal argument exception saying “term: out of warranty analyzed to a token (warranty) with position increment != 1 (got: 2)”.

Apache Solr instead uses a lenient approach: no errors at index creation, but the problem remains (personally, I prefer the Elasticsearch approach).

So the obvious choice is to postpone the stopwords management until after the synonym filter. Unfortunately, here there’s an issue: the stopword(s) removal has some unwanted side-effects on the generated token graph, and the query parser generates a wrong query because it consumes the token stream at the end of the chain.

Let’s imagine we have the following query:

q=tv went out of warranty something of

it will generate the following:

title:tv title:went (title:oow PhraseQuery(title:"out ? warranty something"))

As you can see, the synonym (out of warranty -> oow) is correctly detected, but the stopwords filter removes all the “of” tokens, even if the first occurrence is part of a synonym. In the generated query you can see the sneaky effect: the “hole” created by the removal of the first “of” occurrence produces the inclusion, in the phrase query, of the next available token in the stream (“something”, in the example).

In other words, the oow synonym token is marked with a positionLength = 3, which correctly means it spans three tokens (1=out, 2=of, 3=warranty); later, the query parser will include the next three available terms for generating a synonym phrase query, but since we no longer have the 2nd token (of), that count also includes “something”, which is the 3rd available token in the stream.

Before proceeding: this is a known problem, a long-standing issue [1] in Lucene which has a broader domain, because it is related to the FilteringTokenFilter, the superclass of StopFilter.

The problem we will try to solve is: how can we manage synonyms and stopwords at query time without generating the conflict above?

A Solution

A note first: the token filter we are going to create deals only with Lucene classes. However, when things need to be plugged into a runtime container (e.g. Apache Solr or Elasticsearch), the deployment procedure depends on the target platform: we won’t cover this part here.

The proposed solution is to create a StopFilter subclass which is “synonym-aware”: it will check the token type and positionLength attributes before deciding if a token needs to be removed from the stream. The goal is to avoid removing those terms which have been defined in the stopwords list but are part of a synonym definition.

The class that we are going to extend is org.apache.lucene.analysis.core.StopFilter. This is an empty class, because all the filtering logic is in the superclasses (org.apache.lucene.analysis.StopFilter and the more generic org.apache.lucene.analysis.FilteringTokenFilter). The stopwords logic resides in the accept() method, which as you can see is very simple:

protected boolean accept() {
  // a token survives only if it is not part of the stopwords set
  return !stopWords.contains(termAtt.buffer(), 0, termAtt.length());
}

If the stopwords list contains the current term, it will be removed. So far, so good. We need to extend (actually we could also decorate) the StopFilter class for doing something else before calling the logic above.

First we need to check the token type: if a token has been marked as a SYNONYM, then our filter doesn’t have to remove it. Then we need to check the positionLength attribute because, within a synonym detection context, a position length greater than 1 means we are traversing a multi-term synonym:

public class SynonymAwareStopFilter extends StopFilter {

  private TypeAttribute tAtt = 
                             addAttribute(TypeAttribute.class);
  private PositionLengthAttribute plAtt = 
                             addAttribute(PositionLengthAttribute.class);

  // how many positions of the current multi-term synonym still have to be consumed
  private int synonymSpans;

  protected SynonymAwareStopFilter(
                         TokenStream in, CharArraySet stopwords) {
    super(in, stopwords);
  }

  @Override
  protected boolean accept() {
    if (isSynonymToken()) {
      // remember the span of the multi-term synonym we are entering
      synonymSpans = plAtt.getPositionLength() > 1 
                             ? plAtt.getPositionLength() 
                             : 0;
      return true;
    }

    // keep the token if we are still within a synonym span,
    // otherwise fall back to the standard stopwords logic
    return (--synonymSpans > 0) || super.accept();
  }

  private boolean isSynonymToken() {
    return "SYNONYM".equals(tAtt.type());
  }
}
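
To plug the filter into Solr we also need a corresponding factory; a minimal sketch could look like the following (assuming the filter and the factory live in the same package, so that the protected constructor above is visible; the package name just mirrors the configuration below):

package sc;

import java.io.IOException;
import java.util.Map;

import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.util.ResourceLoader;
import org.apache.lucene.analysis.util.ResourceLoaderAware;
import org.apache.lucene.analysis.util.TokenFilterFactory;

public class SynonymAwareStopFilterFactory extends TokenFilterFactory implements ResourceLoaderAware {

  private final String stopWordFiles;
  private final boolean ignoreCase;
  private CharArraySet stopWords;

  public SynonymAwareStopFilterFactory(Map<String, String> args) {
    super(args);
    stopWordFiles = require(args, "words");
    ignoreCase = getBoolean(args, "ignoreCase", false);
    if (!args.isEmpty()) {
      throw new IllegalArgumentException("Unknown parameters: " + args);
    }
  }

  @Override
  public void inform(ResourceLoader loader) throws IOException {
    // loads the stopwords file declared in the "words" attribute
    stopWords = getWordSet(loader, stopWordFiles, ignoreCase);
  }

  @Override
  public TokenStream create(TokenStream input) {
    return new SynonymAwareStopFilter(input, stopWords);
  }
}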

Let’s do some tests. We will use Apache Solr 7.4.0 for checking the results. Here is the field type definition, where you can see our SynonymAwareStopFilter:

<fieldtype name="text" class="solr.TextField" autoGeneratePhraseQueries="true">
       <analyzer type="index">
           <tokenizer class="solr.StandardTokenizerFactory"/>
           <filter class="solr.LowerCaseFilterFactory"/>
       </analyzer>
       <analyzer type="query">
           <tokenizer class="solr.StandardTokenizerFactory"/>
           <filter class="solr.LowerCaseFilterFactory"/>
           <filter class="solr.SynonymGraphFilterFactory" 
                   synonyms="synonyms.txt" 
                   ignoreCase="false" 
                   expand="true"/>
           <filter class="sc.SynonymAwareStopFilterFactory" 
                   words="stopwords.txt" 
                   ignoreCase="true"/>
       </analyzer>
</fieldtype>

and this is a minimal request handler:

<requestHandler name="/def" class="solr.SearchHandler" default="true">
    <lst name="defaults">
        <bool name="sow">false</bool>
        <str name="df">title</str>
        <str name="defType">lucene</str>
        <bool name="debug">true</bool>
    </lst>
</requestHandler>

Running the previous query:

q=tv went out of warranty something of

we have the following:

title:tv title:went (title:oow PhraseQuery(title:"out of warranty")) title:something

if we use instead the other synonym variant:

q=tv went oow something of

we have the following:

title:tv title:went (PhraseQuery(title:"out of warranty") title:oow) title:something

Everything seems to be working as expected! This is probably just one specific scenario among those addressed by LUCENE-4065; however, it helped me a lot, because this is (at least in my experience) a frequent use case.

As usual, any feedback is warmly welcome. See you next time!

 


[1] https://issues.apache.org/jira/browse/LUCENE-4065

Rated Ranking Evaluator: Help the poor (Search Engineer)

A Software Engineer is always required to give his customers concrete evidence about deliverable quality. A Search Engineer deals with a specialisation of such generic Software Quality, which is called Search Quality.

What is Search Quality? And why is it so important in a search infrastructure? After all, “Software Quality” should be all-encompassing, it should always include everything (and actually it is), but when we are dealing with search systems, quality is a very abstract term, which is very hard to define in advance.

The functional correctness of a search infrastructure (assuming correctness is the only factor which influences the system quality – and it isn’t) is naturally associated with human judgments, with opinions, and unfortunately we know opinions can differ among people.

The business stakeholders, who will get value from a search system, can belong to different categories, can have different expectations, and can have in mind different ideas about the expected system correctness.

In this scenario a Search Engineer faces many challenges in terms of choices, and at the end, he has to provide concrete evidence about the functional coverage of those choices.

This is the context where we developed the Rated Ranking Evaluator (hereafter RRE).

What is it?

The Rated Ranking Evaluator (RRE) is a search quality evaluation tool which evaluates the quality of results coming from a search infrastructure.

It helps a Search Engineer in his daily job. Are you a Search Engineer? Are you tuning/implementing/changing/configuring a search infrastructure? Do you want to have something that gives you evidence of the improvements between changes? RRE could give you a hand with that.

RRE formalises how well a search system satisfies the user information needs, at “technical” level, combining a rich tree-like domain model with several evaluation measures, but also at “functional” level, providing human-readable outputs that could target the business stakeholders.

It encourages an incremental/iterative/immutable approach during the development and the evolution of a search system: assuming we’re starting our system from version x.y, when it’s time to apply some relevant change to its configuration, instead of applying changes to x.y it’s better to clone it and apply those changes to the new fresh version.

In this way, RRE will execute the evaluation process on all available versions and will provide the delta/trend between subsequent versions, so you can immediately get a fine-grained picture of where the system is going, in terms of relevance.

This post is only a brief summary about RRE. You can find more detailed information in the project Wiki.

In a few words, what can I get from RRE?

You can configure RRE as a compounding part of your project build cycle. That means, every time a build is triggered, an evaluation process will be executed.

RRE is not tied to a given search platform: it provides a mini-framework for plugging in different search platforms. At the moment we have two available bindings: Apache Solr and Elasticsearch (see here for supported versions).

The output evaluation data will be available:

  • as a JSON file: for further elaborations
  • as a spreadsheet: for delivering the evaluation results to someone else (e.g. a business stakeholder)
  • in a Web Console where metrics and their values get refreshed in real time (after each build)

How it works

RRE provides a rich, composite, tree-like, domain model, where the evaluation concept can be seen at different levels.

RRE Domain Model

The Evaluation at the top level is just a container for the nested entities. Note that all entity relationships are one-to-many. In this context, a Corpus is defined as a test dataset. RRE will use it for executing the evaluation process; in a single evaluation process you can have multiple datasets.

A Topic is an information need: it defines a functional requirement from the end-user perspective. Within a topic we can have several queries, which express the same need but are closer to the technical layer. RRE provides a further abstraction in the middle: query groups. A Query Group is a group of queries which are supposed to produce the same results (and therefore are associated with the same judgments set).

Queries, which are the technical leaves of the RRE domain model, are further decomposed into several perspectives, one for each available version of our system. A query itself is of course a single entity, but during an evaluation session its concrete execution happens several times, once for each available version. That’s because RRE needs to measure the search results (i.e. the query executions) against all versions.

For each version we will finally have one or more metrics, depending on the configuration. Last but not least, even if metrics are computed at query/version level, RRE will aggregate those values at the upper levels (see the dashed vertical lines in the diagram), so each entity/level in the domain model will offer an aggregated perspective of all available metrics (i.e. I could be interested in the NDCG for a given query, or I could just stop my analysis at the topic level).

Input

In order to execute an evaluation process, RRE needs the following things:

  • One or more corpora / test collections: these are the representative datasets of a specific domain, which will be used for populating and querying a target search platform
  • One or more configuration sets: although there’s nothing against having one single configuration, a minimum of two versions are required in order to provide a comparison between evaluation measures.
  • One or more ratings sets: this is where judgments are defined, in terms of relevant documents for each query group.

Output

The RRE concrete output depends on the runtime container where it is running. The RRE core itself is just a library, so when used programmatically within a project, it outputs a set of objects corresponding to the domain model described above.

When it is used as a Maven plugin, it primarily outputs the same structure in JSON format. This data is then used for producing further outputs, like a spreadsheet. The same payload can be sent to another module called RRE Server, which offers an AngularJS based web console that gets automatically refreshed.

The RRE console is very useful when we are doing internal iterations / tries around some issue, which usually requires very short edit-and-immediately-check cycles. Imagine if you can have a couple of monitors on your desk: in the first there’s your favourite IDE, where you change things, run builds. In the second there’s the RRE Console (see below). After each build, just have a look on the console in order to get an immediate feedback of your changes.

Where can I start?

The project repository on Github offers everything you need: detailed documentation about how it works and how to quickly start with RRE.

If you need some help, feel free to contact us! We appreciate any feedback, suggestion and, last but not least, contribution.

Future works

As you can imagine, the topic is quite huge. We have a lot of interesting ideas about the platform evolution.

These are some examples:

  • integration with some tool for building the relevance judgments. That could be some UI or a more sophisticated user interaction collector (which would automatically generate the ratings sets on top of computed online metrics like click-through rate, sales rate)
  • Jenkins plugin: for a better integration of RRE into the popular CI tool
  • Gradle plugin
  • Apache Solr Rank Eval API: using the RRE core we could implement a Rank Eval endpoint in Solr, similar to the Rank Eval API provided in Elasticsearch
  • ??? Other? Any suggestion is warmly welcome!


Apache Lucene BlendedInfixSuggester : How It Works, Bugs And Improvements

The Apache Lucene/Solr suggesters are important to Sease: we explored the topic in the past [1] and we strongly believe the autocomplete feature to be vital for a lot of search applications.
This blog post explores in detail the current status of the Lucene BlendedInfixSuggester, some bugs of the most recent version (with the solution attached) and some possible improvements.

BlendedInfixSuggester

The BlendedInfixSuggester is an extension of the AnalyzingInfixSuggester with the additional functionality of weighting prefix matches of your query across the matched documents.
It scores a hit higher if it is closer to the start of the suggestion.
N.B. at the current stage, only the first term in your query will affect the suggestion score.

Let’s see some of the configuration parameters from the official wiki [2]:

  • blenderType: used to calculate the positional weight coefficient using the position of the first matching word. Can be one of:
    • position_linear: weightFieldValue*(1 - 0.10*position): matches closer to the start will be given a higher score (default)
    • position_reciprocal: weightFieldValue/(1+position): matches closer to the start will be given a score which decays faster than linear
    • position_exponential_reciprocal: weightFieldValue/pow(1+position,exponent): matches closer to the start will be given a score which decays faster than reciprocal
      • exponent: an optional configuration variable for the position_exponential_reciprocal blenderType, used to control how fast the score will increase or decrease. Default 2.0.
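
As a quick reference, the three formulas above can be sketched in Java like this (a simplified illustration, not the actual Lucene code; the final suggestion score is then weightFieldValue * coefficient):

public class BlenderCoefficients {

    // position = position of the first matching term in the suggestion (0-based)
    static double coefficient(String blenderType, int position, double exponent) {
        switch (blenderType) {
            case "position_linear":
                return 1 - 0.10 * position;
            case "position_reciprocal":
                return 1.0 / (1 + position);
            case "position_exponential_reciprocal":
                return 1.0 / Math.pow(1 + position, exponent);
            default:
                throw new IllegalArgumentException("Unknown blenderType: " + blenderType);
        }
    }

    public static void main(String[] args) {
        // e.g. the query "gaming" matching "Video gaming: the history" at position 1
        System.out.println(coefficient("position_linear", 1, 2.0));                 // 0.9
        System.out.println(coefficient("position_reciprocal", 1, 2.0));             // 0.5
        System.out.println(coefficient("position_exponential_reciprocal", 1, 2.0)); // 0.25
    }
}
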
Description:

  • Data structure: an auxiliary Lucene index.
  • Building: for each document, the stored content from the field is analyzed according to the suggestAnalyzerFieldType and then additionally EdgeNgram token filtered. Finally, an auxiliary index is built with those tokens.
  • Lookup strategy: the query is analyzed according to the suggestAnalyzerFieldType. Then a phrase search is triggered against the auxiliary Lucene index. The suggestions are identified starting at the beginning of each token in the field content.
  • Suggestions returned: the entire content of the field.

This suggester is really common nowadays, as it allows providing suggestions in the middle of a field content, taking advantage of the analysis chain provided with the field.

In this way it is possible to provide suggestions considering synonyms, stop words, stemming and any other token filter used in the analysis, and to match the suggestion based on internal tokens.
Finally, the suggestion is scored based on the position match.

The simple corpus of documents for the examples will be the following:

[
      {
        "id":"44",
        "title":"Video gaming: the history"},
      {
        "id":"11",
        "title":"Nowadays Video games are a phenomenal economic business"},
      {
        "id":"55",
        "title":"The new generation of PC and Console Video games"},
      {
        "id":"33",
        "title":"Video games: multiplayer gaming"}]

And a simple synonym mapping: multiplayer, online

Let’s see an example:

Query to autocomplete: “gaming”

Suggestions:

  • “Video gaming: the history”
  • “Video games: multiplayer gaming”
  • “Nowadays Video games are a phenomenal economic business”

Explanation: the input query is analysed, and the token produced is “game”.
In the auxiliary index, for each field content we have the EdgeNgram tokens: “v”, ”vi”, ”vid”…, “g”, ”ga”, ”gam”, “game”.
So the match happens and the suggestions are returned.
N.B. the first two suggestions are ranked higher as the matched term happens to be closer to the start of the suggestion.

Let’s explore the score of each suggestion given the various blender types, for the query “gaming”:

  • “Video gaming: the history” (first position match: 1)
    • position_linear: 1 - 0.1*1 = 0.9
    • position_reciprocal: 1/(1+1) = 1/2 = 0.5
    • position_exponential_reciprocal: 1/(1+1)^2 = 1/4 = 0.25
  • “Video games: multiplayer gaming” (first position match: 1)
    • position_linear: 1 - 0.1*1 = 0.9
    • position_reciprocal: 1/(1+1) = 1/2 = 0.5
    • position_exponential_reciprocal: 1/(1+1)^2 = 1/4 = 0.25
  • “Nowadays Video games are a phenomenal economic business” (first position match: 2)
    • position_linear: 1 - 0.1*2 = 0.8
    • position_reciprocal: 1/(1+2) = 1/3 ≈ 0.33
    • position_exponential_reciprocal: 1/(1+2)^2 = 1/9 ≈ 0.11

The final score of the suggestion will be:

long score = (long) (weight * coefficient)

N.B. the reason I highlighted the data type is that it directly causes the first bug we discuss.

Suggestion Score Approximation

The optional weightField parameter is extremely important for the BlendedInfixSuggester.
It assigns the value of the suggestion weight (extracted from the mentioned field).
E.g. the suggestion may come from the product name field, but the suggestion weight depends on how profitable the suggested product is.

<str name="field">productName</str>
<str name="weightField">profit</str>

So far, so good, but unfortunately there are two problems with that.

Bug 1 – WeightField Not Defined -> Zero suggestion score

How To Reproduce It: don’t define any weightField in the suggester config.
Effect: the suggestion ranking is lost, all the suggestions have a score of 0, and the position of the match doesn’t matter anymore.
The weightField is not a mandatory configuration for the BlendedInfixSuggester.
Your use case might not involve any weight for your suggestions: you may just be interested in the positional scoring (the main reason the BlendedInfixSuggester exists in the first place).
Unfortunately, this is not possible at the moment:
if the weightField is not defined, each suggestion will have a weight of 0.
This is because the weight associated with each document in the document dictionary is a long. If the field to extract the weight from is not defined (null), the weight returned will just be 0.
This doesn’t allow differentiating between a weight that should be 0 (value extracted from the field) and null (no value at all).
A solution has been proposed here [3].

Bug 2 – Bad Approximation Of Suggestion Score For Small Weights

There is a misleading data type cast in the score calculation for the suggestion:

long score = (long) (weight * coefficient)

This apparently innocent cast actually brings very nasty effects if the weight associated with a suggestion is unitary or small enough.

Weight = 1, suggestion “Video gaming: the history”:

  • position_linear: (1 - 0.1*position) * 1 = 0.9 * 1 → cast → 0
  • position_reciprocal: 1/(1+position) * 1 = 0.5 * 1 → cast → 0
  • position_exponential_reciprocal: 1/(1+position)^2 * 1 = 0.25 * 1 → cast → 0

Weight = 2, suggestion “Video gaming: the history”:

  • position_linear: (1 - 0.1*position) * 2 = 0.9 * 2 → cast → 1
  • position_reciprocal: 1/(1+position) * 2 = 0.5 * 2 → cast → 1
  • position_exponential_reciprocal: 1/(1+position)^2 * 2 = 0.25 * 2 → cast → 0

Basically, you risk losing the ranking of your suggestions, reducing the score to only a few possible values: 0 or 1 (in edge cases).

A solution has been proposed here [3].
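
In code, the issue and a possible direction for a fix look like this (a sketch; the actual patch proposed in [3] may differ):

public class CastingDemo {

    public static void main(String[] args) {
        long weight = 1;
        double coefficient = 0.9; // position_linear, first match at position 1

        // current behaviour: the decimal part is truncated away by the cast
        long score = (long) (weight * coefficient);
        System.out.println(score); // 0: the positional ranking signal is lost

        // one possible direction: keep the full precision for ranking purposes
        double preciseScore = weight * coefficient;
        System.out.println(preciseScore); // 0.9
    }
}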

Multi Term Matches Handling

It is quite common to have multiple terms in the autocomplete query, so your suggester should be able to manage multiple matches in the suggestion accordingly.

Given a simple corpus (composed just of the following suggestions) and the query:
“Mini Bar Frid”

You see these suggestions:

  • 1000 | Mini Bar something Fridge
  • 1000 | Mini Bar something else Fridge
  • 1000 | Mini Bar Fridge something
  • 1000 | Mini Bar Fridge something else
  • 1000 | Mini something Bar Fridge

This is because, at the moment, the first matching term wins it all (and the other positions are ignored).
This brings a lot of possible ties (all scoring 1000), which should be broken to give the user a nice and intuitive ranking.

But intuitively I would expect something like this in the results (note that allTermsRequired=true and the schema weight field always returns 1000):

  • Mini Bar Fridge something
  • Mini Bar Fridge something else
  • Mini Bar something Fridge
  • Mini Bar something else Fridge
  • Mini something Bar Fridge

Let’s see a proposed solution [4]:

Positional Coefficient

Instead of taking into account just the first term position in the suggestion, it’s possible to use all the matching positions of the matched terms [“mini”, ”bar”, ”fridge”].
Each position match will affect the score according to:

  • How far the matched term position is from the ideal position match
    • Query: Mini Bar Fri, ideal positions: [0,1,2]
    • Suggestion 1: Mini Bar something Fridge, matched positions: [0,1,3]
    • Suggestion 2: Mini Bar something else Fridge, matched positions: [0,1,4]
    • Suggestion 2 will be penalised, as the “Fri” match happens farther (4 > 3) from the ideal position 2
  • How early the position mismatch happens: the earlier the mismatch, the stronger the penalty the score pays
    • Query: Mini Bar Fri, ideal positions: [0,1,2]
    • Suggestion 1: Mini Bar something Fridge, matched positions: [0,1,3]
    • Suggestion 2: Mini something Bar Fridge, matched positions: [0,2,3]
    • Suggestion 2 will be additionally penalised, as the first position mismatch (Bar) happens closer to the beginning of the suggestion

Considering only the discontinuous positions proved useful:

Query 1: Bar some
Query 2: some
Suggestion: Mini Bar something Fridge
Query 1 suggestion matched term positions: [1,2]
Query 2 suggestion matched term positions: [2]

If we compare the suggestion scores for these two queries, it would seem unfair to penalise the first one just because it matches 2 (consecutive) terms, while the second query has just one match (positioned worse than the first match in query 1).

Introducing this advanced positional coefficient calculation helped in improving the overall behaviour in the experimental tests created.
The results obtained were quite promising:

Query: Mini Bar Fri
100 | Mini Bar Fridge something
100 | Mini Bar Fridge something else
100 | Mini Bar Fridge a a a a a a a a a a a a a a a a a a a a a a
26 | Mini Bar something Fridge
22 | Mini Bar something else Fridge
17 | Mini something Bar Fridge
8 | something Mini Bar Fridge
7 | something else Mini Bar Fridge

There is still a tie for the exact prefix matches, but let’s see if we can finalise that improvement as well.

Token Count Coefficient

Let’s focus on the first three ranked suggestions we just saw:

Query: Mini Bar Fri
100 | Mini Bar Fridge something
100 | Mini Bar Fridge something else
100 | Mini Bar Fridge a a a a a a a a a a a a a a a a a a a a a a

Intuitively we want this order to break the ties.
The closer the number of matched terms is to the total number of terms in the suggestion, the better.
Ideally we want our top scoring suggestion to contain just the matched terms, if possible.
We also don’t want to introduce strong inconsistencies for the other suggestions; ideally we should only affect the ties.
This is achievable by calculating an additional coefficient, dependent on the term counts:
Token Count Coefficient = matched terms count / total terms count

Then we can scale this value accordingly:
90% of the final score will derive from the positional coefficient,
10% of the final score will derive from the token count coefficient.

Query: Mini Bar Fri
90 * 1.0 + 10 * 3/4 = 97 | Mini Bar Fridge something
90 * 1.0 + 10 * 3/5 = 96 | Mini Bar Fridge something else
90 * 1.0 + 10 * 3/25 = 91 | Mini Bar Fridge a a a a a a a a a a a a a a a a a a a a a a

(the token count term uses integer arithmetic here, hence the truncated values)

It will require some additional tuning, but the overall idea should bring a better ranking function to the BlendedInfix when multiple term matches are involved!
If you have any suggestion, feel free to leave a comment below!
The code is available in the Github pull request attached to the Lucene Jira issue [4].

[1] Solr Autocomplete
[2] Blended Infix Suggester Solr Wiki
[3] LUCENE-8343
[4] LUCENE-8347

Apache Solr: orchestrating Known item and Full-text search

Scenario

You’re working as a search engineer for XYZ Ltd, a company which sells electric components. XYZ provided you with the application logs of the last six months, and some business requirements.

Two kinds of customers, two kinds of requirements, two kinds of search

The log analysis shows that XYZ has mainly two kinds of customers: a first group, the “expert” users (e.g. electricians, resellers, shops), whose members query the system by product identifiers and codes (e.g. SKU, model codes, things like Y-M8GB, 140-213/A and ABD9881); it’s clear, or at least it seems so, that they already know what they want and what they are looking for. However, you noticed a lot of such queries produce no results. After investigating, the problem seems to be that codes and identifiers are definitely hard to remember: queries use a lot of disparate forms for pointing to the same product. For example:

  • y-m8gb (lowercase)
  • YM8GB (no delimiters)
  • YM-8GB (delimiter in a wrong place)
  • Y/M8GB (wrong delimiter)
  • Y M8GB (whitespace instead of delimiter)
  • y M8/gb (a combination of cases above)

This kind of scenario, where there’s only one relevant document in the collection, is usually referred to as “Known Item Search”: our first requirement is to make sure this “product identifier intent” is satisfied.

The other group of customers are end-users, like me and you. Not being so familiar with product specs like codes or model codes, the behaviour here is different: they use a plain keyword search, trying to match products by entering terms which represent names, brands, manufacturers. And here comes the second requirement, which can be summarised as follows: people must be able to find products by entering plain free-text queries.

As you can imagine, in this case search requirements are different from the other scenario: the focus here is more “term-centric”, therefore involving different considerations about the text analysis we’d need to apply.

While the expert group query is supposed to point to one and only one product (we are in a black / white scenario: match or not), the needs on the other side require the system to provide a list of “relevant” documents, according to the terms entered.

An important thing / assumption before proceeding: for illustration purposes we will consider those two queries / user groups as disjoint: that is, a given user belongs to only one of the mentioned groups, not both. Better: a given user query will contain product identifiers or terms, not both.

Schema & configuration notes

The expert group, and the “Known Item Search”

The “product identifier” intent, which is assumed to be implicit in the query behaviour of this group, can be captured, both at index and query time, by applying the following analyzer, which basically treats the incoming value as a whole, normalizes it to lower case, removes all delimiters and finally collapses everything into a single output token.

<fieldtype name="identifier" class="solr.TextField" omitNorms="true">
    <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.WordDelimiterGraphFilterFactory"
                generateWordParts="0"
                generateNumberParts="0"
                catenateWords="0"
                catenateNumbers="0"
                catenateAll="1"
                splitOnCaseChange="0" />
    </analyzer>
</fieldtype>
<field name="product_id" type="identifier" indexed="true" ... />

In the following examples you can see the analyzer in action: each of the disparate forms listed earlier collapses to the same single token.
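
  1. y-m8gb -> ym8gb
  2. YM8GB -> ym8gb
  3. YM-8GB -> ym8gb
  4. Y/M8GB -> ym8gb
  5. Y M8GB -> ym8gb
  6. y M8/gb -> ym8gb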

As you can see, the analyzer doesn’t declare a type attribute, because it is supposed to be applied both at index and query time. However, there’s a difference in the incoming value: at index time the analyzer deals with a field content (i.e. the value of a field of an incoming document), while at query time the value which flows through the pipeline is composed of one or more terms entered by the user (a query, briefly).

While at index time everything works as expected, at query time the analyzer above requires a feature that has been introduced in Solr 6.5: the “Split On Whitespace” flag [1]. When it is set to “false” (as we need here in this context), it causes the incoming query text to be kept as a single whole unit, when sent to the analyzer.

Prior to Solr 6.5 we didn’t have such control, and the analyzers received pre-tokenized-by-whitespace tokens; in other words, the unit of work of the query-time analysis was the single term: the analyzer chain (including the tokenizer itself) was invoked for each term output by that pre-whitespace-tokenization. As a consequence, our analyzer, at query time, couldn’t work as expected: if we take examples #5 and #6 above, you can see the user entered a whitespace. With the “Split on Whitespace” flag set to true (explicitly, or using Solr < 6.5), the pre-tokenization described above produces two tokens:

  • #5 = {“Y”, ”M8GB”}
  • #6 = {“y”, “M8/gb”}

so our analyzer would receive 2 tokens (for each case) and there wouldn’t be any match with the single term ym8gb stored in the index. So, prior to Solr 6.5, we had two ways of dealing with this requirement:

  • client side: wrapping the whole query in double quotes, escaping whitespaces with “\”, or replacing them with a delimiter like “-“. Easy, but it requires control of the client code, and this is not always possible.
  • Solr side: applying to the incoming query the same transformations as above, but this time at query parser level. Easy, if you know some Lucene / Solr internals. In addition, it requires a context where you have permissions for installing custom plugins in Solr. A similar effect could also be obtained using an UpdateRequestProcessor, which would create a new field with the same value as the original field but without any whitespace (see the sketch below).
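
For illustration, the UpdateRequestProcessor route could combine the stock CloneField and RegexReplace processors; a minimal sketch (field and chain names are purely illustrative):

<updateRequestProcessorChain name="normalize-product-id">
    <processor class="solr.CloneFieldUpdateProcessorFactory">
        <str name="source">product_id</str>
        <str name="dest">product_id_normalized</str>
    </processor>
    <processor class="solr.RegexReplaceProcessorFactory">
        <str name="fieldName">product_id_normalized</str>
        <str name="pattern">\s+</str>
        <str name="replacement"></str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>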

The end-users group, and the full-text search query

In this case we are within a “plain” full-text search context, where the analysis identified a couple of target fields: product names and brands.

Differently from the previous scenario, here we don’t have a unique and deterministic way to satisfy the search requirement. It depends on a lot of factors: the catalog, the terms distribution, the implementor experience, the customer expectations in terms of user search experience. All these things can lead to different answers. Just for example, here’s a possible option:

<fieldType name="brand" class="solr.TextField" omitNorms="true">
    <analyzer>
        <charFilter 
                class="solr.MappingCharFilterFactory" 
                mapping="mapping-FoldToASCII.txt"/>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory" 
                ignoreCase="true" 
                words="lang/en/brand_stopwords.txt"/>
    </analyzer>
</fieldType>

<fieldType name="name" class="solr.TextField">
    <analyzer>
        <charFilter 
                  class="solr.MappingCharFilterFactory" 
                  mapping="mapping-FoldToASCII.txt"/>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter 
                class="solr.StopFilterFactory" 
                ignoreCase="false" 
                words="lang/en/product_name_stopwords.txt"/>
        <filter class="solr.EnglishPossessiveFilterFactory"/>
        <filter class="solr.EnglishMinimalStemFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
        <filter class="solr.LengthFilterFactory" min="2" max="50" />
    </analyzer>
</fieldType>

The focus here is not on the schema design itself: the important thing to underline is that this requirement needs a completely different configuration from the “Known Item Search” previously described.

Specifically, let’s assume we ended up following a “term-centric” approach for satisfying the second requirement. The approach requires a different value for the “Split on Whitespace” parameter, which has to be set to true, in this case.

The “sow” parameter can be set at SearchHandler level, so it is applied at query time. It can be declared within the solrconfig.xml and, depending on the configuration, it can be overridden using a named (HTTP) query parameter.

A “split on whitespace” pre-tokenisation leads us on a scenario which is really different from the “Known Item Search”, where instead we “should” be in a field-centric search; “should” is double-quoted because if, from one side, we are actually using a field-centric search, on the other side we are on an edge case where we’re querying one single field with one single query term (the first analyzer in this post always outputs one term).

The implementation

Where?

Although one could think the first thing is how to combine those two different query strategies, prior to that, the question we need to answer is: where do we implement the solution? Clearly, regardless of the way we decide to follow, we will have to implement a (search) workflow, which can be summarised in the following diagram:

Known Item Search in Apache Solr

On Solr side, each “search” task needs to be executed in a different SearchHandler, so returning to our question: where do we want to implement such workflow? We have three options: outside, between or inside Solr.

#1: Client-side implementation

The first option is to implement the flow depicted above in the client application. That assumes you have the required control and programming skills on that side. If this assumption is true, then it’s relatively easy to code the workflow: you can choose one of the client API bindings available for your language and then implement the double + conditional search illustrated above.
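
For illustration, a minimal sketch with SolrJ could look like this (the handler names match the ones defined later in this post; the base URL, the collection name and the printed field are just examples):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class SearchWorkflow {

    public static void main(String[] args) throws Exception {
        try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/products").build()) {
            String q = "Y M8GB";

            // first attempt: known item search
            SolrQuery knownItem = new SolrQuery(q);
            knownItem.setRequestHandler("/known_item_search");
            QueryResponse response = solr.query(knownItem);

            // fallback: full-text search, only if the first search returns no results
            if (response.getResults().getNumFound() == 0) {
                SolrQuery fullText = new SolrQuery(q);
                fullText.setRequestHandler("/full-text-search");
                response = solr.query(fullText);
            }

            response.getResults().forEach(doc -> System.out.println(doc.getFieldValue("product_id")));
        }
    }
}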

  • Pros: easy to implement. It requires a minimal Solr (functional) knowledge.
  • Cons: the search workflow / logic is moved on the client side. Programming is required, so you must be in a context where this can be done and where the client application code is under your control.

#2: Man-in-the-middle

Moving things outside the client sphere, another popular option, which can still be seen as a client-side alternative (from the Solr perspective), is a proxy / adapter / facade. Whatever name you want to give to this module, it sits between the client application and Solr; it intercepts all requests and implements the custom logic by orchestrating the search endpoints exposed in Solr.

Being a new module, it has several advantages:

  • it can be coded using your preferred language
  • it is completely decoupled from the client application, and from Solr as well

but for the same reason, it has also some disadvantages:

  • it must be created: designed, implemented, tested, installed and maintained
  • it is a new piece in your system, which necessarily increases the overall complexity of the architecture
  • Solr exposes a lot of (index & search) services. With this option, all those services should be proxied, therefore resulting in a lot of unnecessary delegations (i.e. delegate services that don’t add any value to the execution chain).

#3: In Solr

The last option moves the workflow implementation (and the search logic) in the place where, in my opinion, it should be: in Solr.

Note that this option is usually not only a “philosophical” choice: if you are a search engineer, most probably you will be hired for designing, implementing and tuning the “search side of the cake”. That means it’s perfectly possible that, for a lot of reasons, you must think of the client application as an external (sub)system, where you don’t have any kind of control.

The main drawback of this approach is that, as you can imagine, it requires programming skills plus a knowledge about the Solr internals.

In Solr, a search request is consumed by a SearchHandler, a component which is in charge of executing the logic associated with a given search endpoint. In our example, we would have the following search handlers matching the two requirements:

<!-- Known Item search -->
<requestHandler name="/known_item_search" class="solr.SearchHandler">
   <lst name="invariants">
        <str name="defType">lucene</str>
        <bool name="sow">false</bool> <!-- No whitespace split -->
        <str name="df">product_id</str>
   </lst>
</requestHandler>

<!-- Full-text search -->
<requestHandler name="/full-text-search" class="solr.SearchHandler">
    <lst name="invariants">
         <bool name="sow">true</bool> <!-- Whitespace split -->
         <str name="defType">edismax</str>
         <str name="df">product_name</str>
         <str name="qf">
            product^0.7
            brand^1.5
         </str>
    </lst>
</requestHandler>

On top of that, we would need a third component, which would be in charge of orchestrating the two search handlers above. I’ll call this component a “Composite Request Handler”.

The composite handler would also provide the public search endpoint called by clients. Once a request is received, the composite request handler implements the search workflow: it invokes all the handlers that compose its chain, and it stops when one of the invocation targets produces the expected result.

The composite handler configuration looks like this:

<requestHandler name="/search" class=".....">
    <str name="chain">/known_item_search,/full-text-search</str>
</requestHandler>

On the client side, that would require only one request, because the entire workflow is implemented in Solr by means of the composite request handler. In other words, imagining a GUI with a search bar, when the search button is pressed the client application has to retrieve the term(s) entered by the user and send just one request (to the composite handler endpoint), regardless of the intent of the user (i.e. regardless of the group the user belongs to).

The composite request handler introduced in this section has already been implemented: you can find it in our Github account, here [2].

Enjoy and, as usual, any feedback is warmly welcome!

[1] https://lucidworks.com/2017/04/18/multi-word-synonyms-solr-adds-query-time-support

[2] https://github.com/SeaseLtd/invisible-queries-request-handler

Give the height the right weight: quantities detection in Apache Solr

Quantity detection? What is a quantity? And why do we need to detect it?

A quantity, as described by Martin Fowler in his “Analysis Patterns” [1], is defined as a pair which combines an amount and a unit (such as 30 litres, 0.25 cl, or 140 cm). In search-based applications, there are many cases where you may want to classify your searchable dataset using dimensioned attributes, because such quantities have a special meaning within the business context you are working in. The first example that comes to my mind?

Apache Solr Quantity Detection Plugin

Beer is offered in several containers (e.g. cans, bottles); each of them is available in multiple sizes (e.g. 25 cl, 50 cl, 75 cl or 0.25 lt, 0.50 lt, 0.75 lt). A good catalog would capture this information in dedicated fields, like “container” (bottle, can) and “capacity” (25cl, 50cl, 75cl in the example above): in this way the search logic can properly make use of them. Faceting (and subsequent filtering) is a good example of what the user can do after a first search has been executed: he can filter and refine results, hopefully finding what he was looking for.

But if we start from the beginning of a user interaction, there’s no result at all: only the blank textfield where the user is going to type something. “Something” could be whatever, anything (in his mind) related to the product he wants to find: a brand, a container type, a model name, a quantity. In a few words: anything which represents one or more relevant features of the product he’s looking for.

So one of the main challenges, when implementing a search logic, is to understand the meaning of the entered terms. This is in general a very hard topic, often involving complicated stuff (e.g. machine learning), but sometimes things move to an easier side, especially when the concepts we want to detect follow a common and regular pattern: like a quantity.
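
Just to give an intuition of what “regular pattern” means here, a toy detector could be as simple as the following sketch (this is not the plugin’s implementation, and the unit list is deliberately minimal):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ToyQuantityDetector {

    // an amount (integer or decimal), optionally separated by spaces from a known unit
    private static final Pattern QUANTITY =
            Pattern.compile("(\\d+(?:\\.\\d+)?)\\s*(cl|lt|cm|mt)\\b", Pattern.CASE_INSENSITIVE);

    public static void main(String[] args) {
        Matcher matcher = QUANTITY.matcher("beer 0.25 lt");
        while (matcher.find()) {
            // prints: amount = 0.25, unit = lt
            System.out.println("amount = " + matcher.group(1) + ", unit = " + matcher.group(2));
        }
    }
}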

The main idea behind the quantity detection plugin [2] we developed at Sease is the following: starting from the query entered by the user, it first detects the quantities (i.e. the amounts and the corresponding units); then this information is isolated from the main query and used for boosting up all products relevant to those quantities. Relevancy here can be meant in different ways (a rough sketch of the corresponding boost queries follows the list):

  • exact match: all bottles with a capacity of 25cl
  • range match: all bottles with a capacity between 50cl and 75cl.
  • equivalence exact match: all bottles with a capacity of 0.5 litre (1lt = 100cl)
  • equivalence range match: all bottles with a capacity between 0.5 and 1 litre (1lt = 100cl)
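As a rough illustration, assuming capacities are indexed in a numeric field normalised to centilitres (the capacity_cl field name is hypothetical, not the plugin’s actual output), the match types above could translate into boost queries like these:

bq=capacity_cl:25              (exact match)
bq=capacity_cl:[50 TO 75]      (range match)
bq=capacity_cl:50              (equivalence: 0.5 lt * 100 = 50 cl)
bq=capacity_cl:[50 TO 100]     (equivalence range)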

The following is a short list with a brief description of all supported features:

  • variants: a unit can have a preferred form and (optionally) several variants. These can include different forms of the same unit (e.g. mt, meter) or an equivalent unit in a different metric system (e.g. cl, ounce)
  • equivalences: it’s possible to define an equivalence table so units can be converted at runtime (“beer 0.25 lt” will have the same meaning as “beer 25cl”). An equivalence table maps a unit to a conversion factor.
  • boost: each unit can have a dedicated boost, especially useful for weighting multiple matching units.
  • ranges: each unit can have a configured gap, which triggers a range query where the detected amount can be in the middle (PIVOT), at the beginning (MIN) or at the end (MAX) of the generated range
  • multi-fields: in case we have more than one attribute using the same unit (e.g. height, width, depth)
  • assumptions: in case an “orphan” amount (i.e. an amount without a unit) is detected, it’s possible to define an assumption table and let Solr guess the unit.

Feel free to have a try, and if you think it could be useful, please share your ideas and/or your feedback with us.

[1] https://martinfowler.com/books/ap.html

[2] https://github.com/SeaseLtd/solr-quantities-detection-qparsers

ECIR 2018 Experience


This blog is a quick summary of my (subjective) experience at ECIR 2018: the 40th European Conference on Information Retrieval, hosted in Grenoble (France) from 26/03/2018 to 29/03/2018.

Deep Learning and Explicability

Eight of the accepted long papers were about Deep Learning.
The topics “Neural Network” and “Word Embedding” were the most frequent in the full papers (both accepted and rejected) at the conference. It is clear that deep learning technologies are strongly advancing as a de facto standard in Artificial Intelligence, and this can be noticed also in Information Retrieval, where these technologies can be used to better model user behaviour and to extract topics and semantics from documents.
But if with deep learning and the advanced capabilities of complex models you gain performance, on the other hand you lose the ability to explain and debug why a given output corresponds to a given input.
A recurring topic in the Deep Learning track (and in the Industry Day, actually) was how to find a balance between the performance gain of new techniques and the control over them.
It is an interesting topic, and I believe it is just the other face of the coin; it is good, though, that Academia is noting the importance of this aspect: in industry a technology requires control and maintenance much more than in the academic environment, and most of the time “debuggability” can drive the decision between one technology and another.

From Academic Papers to Production

The 2018 European Conference on Information Retrieval ended on 29/03 with a brilliant Industry Day focused on the perilous path from Research to Production.
This is the topic that most permeated the conference across different keynotes, sessions and informal discussions.
In Information Retrieval it is still difficult to turn successful research into successful live systems: most of the time it is not clear which party should be interested in this process.
Okapi BM25 was first published and implemented in the 1980s and 1990s; it became the default similarity in Apache Lucene 6.0 in 2016.
Academia is focused on finding new, interesting problems to solve with inventive techniques, and Industry is focused on finding the quickest solution that works (usually).
This creates a gap: new solutions are not easily reproducible from academic papers, and they are far from being practically ready to use outside the controlled experimental environment.
Brilliant researchers crave new, interesting problems they can reason about and solve: their focus is to build a rough implementation and validate it with a few metrics (ideally with an accepted related publication).
After that, their job is done, problem solved; the challenge is not interesting anymore and they can move on to the next problem.
Researchers get bored easily, but they risk never seeing their research fulfilled, applied and used in real life.
Often Academia creates its own problems just to solve them and get a publication: Publish or Perish -> fewer publications bring less funding.
This may or may not be a problem, depending on the individual.
Industry is usually seen by academics as a boring place where you just apply consolidated techniques to get the best result with minimum effort.
Sometimes industry is just where you end up when you want to make some money and actually see your effort bring benefits to some population.
And this was (and is) true most of the time, but with the IT explosion we are living through and the boom in competition, the situation nowadays is open to change: a stronger connection between Academia and Industry can (and should!) happen, and conferences such as ECIR are the perfect ground to settle the basis.
So, building on this introduction, let’s see a quick summary of the keynotes, topics and sessions that impressed me most at the conference!

From Academic Papers to Production: A Learning To Rank Story [1]

Let’s start with Sease’s contribution to the conference: a story about the adoption of Learning To Rank in a real-world e-commerce scenario.
The session took place at the Industry Day (29/03) and focused on the challenges and pitfalls of moving from research papers to production.
The entire journey was made possible by Open Source software, with Apache Solr and RankLib as main actors.
Learning To Rank is becoming extremely popular in the industry, and it is a very promising and useful technology to improve the relevancy function by leveraging the user behaviour.
But from an open source perspective it is still quite a young technology, and effort is required to get it right.
The main message I wanted to transmit is: don’t be scared to fail. If something doesn’t work immediately out of the box, it doesn’t mean it’s not a valid technology. No Pain No Gain: Learning To Rank open source implementations are valid, but they require tuning and care to be brought to production with success (and the improvement these technologies can bring is extremely valuable).

The Harsh Reality of Production Information Access Systems [2]

Recipient of the “Karen Spärck Jones Award”, Fernando Diaz focused in this talk on the problems deriving from the adoption of Information Retrieval technologies in production environments, where a deep understanding of individuals, groups and society is required.
Building on the technical aspects involved in applying research to real-world systems, the focus switched to the ethical side: are current IR systems moving in the direction of providing the user with fair and equally accessible information, or is the monetisation process behind them just producing addicted users who see (and buy) ad hoc tailored information?

Statistical Stemmers: A Reproducibility Study [3]

This paper is a sort of symbol of the conference trend: reproducibility is as important as the innovation that a research work brings.
Winner of the Best Paper Award, it may have caused some perplexity in the audience (why reward a paper that is not innovating but just reproducing past techniques?), but the message it transmits is clear: research needs to be easily reproducible, and effort is required in that direction for a healthy Research & Development flow that doesn’t target just the publication but a real-world application.

Entity-centric Topic Extraction and Exploration: A Network-based Approach [4]

Interesting talk: it explores topic modelling over time in a network-based fashion.
Instead of modelling a topic as a ranked list of terms, it uses a network (weighted graph) representation.
This may be interesting for an advanced More Like This implementation; it’s definitely worth an investigation.

Information Scent, Searching and Stopping: Modelling SERP Level Stopping Behaviour [5]

This talk focused on the entire Search Result Page as a possible signal affecting user stopping behaviour: when a search result page is returned, the overall page quality affects the user’s perception of relevancy and may drive an immediate query reformulation OR a good abandonment (when the information need is satisfied).
This is something I have personally experienced: sometimes a quick look at the result page tells you whether the search engine understood (or misunderstood) your information need.
Different factors are involved in what the authors call “Information Scent”, but perceived relevance (modelled through different User Experience approaches) is definitely an interesting topic that sits alongside real relevance.
Further studies in this area may affect the way Search Result pages are rendered, to maximise the fruition of information.

Employing Document Embeddings to Solve the “New Catalog” Problem in User Targeting, and provide Explanations to the Users [6]

The new catalog problem is a practical problem for modern recommender systems and platforms. There are a lot of use cases where you have collections of items that you would like to recommend, ranging from Music Streaming Platforms (playlists, albums, etc.) to Video Streaming Platforms (TV series genres, To-View lists etc.) and many other domains.
This paper explores both the algorithm behind such recommendations and the need for explanation: explaining to the user why a catalog may be relevant for his/her taste is as important as providing a relevant catalog of items.

Anatomy of an Idea: Mixing Open Source, Research and Business

This keynote summarises the cornerstone of Sease culture:
Open Source as a bridge between Academia and Industry.
If every research paper were implemented in a production-ready open source platform as part of the publication process, the community would get a direct and immense benefit out of it.
Iterative improvement would gain great traction and, generally speaking, the entire scientific community would get a strong boost from better accessibility.
Implementing research in state-of-the-art, production-ready open source systems (where possible) would cut the adoption time at industry level, triggering a healthy process of utilisation and bug fixing.

Industry Day [7]

The Industry Day was the coronation of the overall trend permeating the conference: there is a strong need to build a better connection between the academic and industrial worlds.
And the good audience reception (the organisers had to move the track to the main venue) is proof that there is an increasingly strong need to see interesting research applied (with success or failure) to the real world, with the related lessons learned.
Plenty of talks in this session were brilliant; my favourites:

  • Fabrizio Silvestri (Facebook)
    Query Embeddings: From Research to Production and Back!
  • Manos Tsagkias (904Labs)
    A.I. for Search: Lessons Learned
  • Marc Bron (Schibsted Media Group)
    Management of Industry Research: Experiences of a Research Scientist

In conclusion, the conference was perfectly organised in an excellent venue, the balance of topics and talks was fairly good (both academic and industrial) and I really enjoyed my time in Grenoble. See you next year in Cologne!

[1] https://www.slideshare.net/AlessandroBenedetti/ecir-2018-alessandro
[2] https://www.ecir2018.org/programme/keynote-speakers/
[3] http://www.dei.unipd.it/~silvello/papers/ECIR2018_SA.pdf
[4] https://dbs.ifi.uni-heidelberg.de/files/Team/aspitz/publications/Spitz_Gertz_2018_Entity-centric_Topic_Extraction.pdf
[5] https://strathprints.strath.ac.uk/62856/
[6] https://link.springer.com/chapter/10.1007%2F978-3-319-76941-7_28
[7] https://www.ecir2018.org/industry-day/

Apache Solr: Chaining SearchHandler instances: the CompositeRequestHandler

What are “Invisible Queries”?

This is an extract of an article [1] on Lucidworks.com, by Grant Ingersoll, talking about invisible queries:

“It is often necessary in many applications to execute more than one query for any given user query.  For instance, in applications that require very high precision (only good results, forgoing marginal results), the app. may have several fields, one for exact matches, one for case-insensitve matches and yet another with stemming.  Given a user query, the app may try the query against the exact match field first and if there is a result, return only that set.  If there are no results, then the app would proceed to search the next field, and so on.”

(source: https://lucidworks.com/blog/2009/08/12/fake-and-invisible-queries)

The sentence above assumes a scenario where the (client) application issues several subsequent requests to Solr on top of a user query (i.e. one user query => many search engine queries). What if you don’t have such control? Imagine you’re the search engineer of an e-commerce portal built with Magento, which, in this scenario, acts as the Solr client; someone installed and configured the Solr connector and, ok, everything is working: when the user submits a search, the connector forwards the request to Solr, which in turn executes a (single) query according to the configuration.

The context

Now, imagine that the query above returns no results. The whole request / response interaction is over: the user will see something like “Sorry, no results for your search”. Although this sounds perfectly reasonable, in this post we will focus on a different approach, based on the “invisible queries” idea described in the extract above. The main point here is a precondition: I cannot change the client code; that is because (for example):

  • I don’t want to introduce custom code in my Magento / Drupal instance
  • I don’t know PHP
  • I’m strictly responsible for the search infrastructure and the frontend developer doesn’t want / is not able to properly implement this feature on the client side
  • I want to move as much of the search logic as possible into Solr

What I’d like to do is to provide a single entry point (i.e. one single request handler) to my clients, which is able to execute a workflow like this:
[Diagram: the invisible queries workflow in Apache Solr]

The CompositeRequestHandler

The underlying idea is to provide a Facade which is able to chain several handlers; something like this:
<requestHandler name="/search" class="...CompositeRequestHandler">
    <str name="chain">/rh1,/rh2,/rh3</str>
</requestHandler> 
where /rh1, /rh2 and /rh3 are standard SearchHandler instances you’ve already declared, that you want to chain in the workflow described in the diagram above.
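For example, in the precision-first scenario quoted at the beginning, /rh1 and /rh2 could be plain SearchHandler declarations targeting progressively looser fields (handler and field names here are purely illustrative):

<requestHandler name="/rh1" class="solr.SearchHandler">
    <lst name="defaults">
        <str name="df">title_exact</str>
    </lst>
</requestHandler>
<requestHandler name="/rh2" class="solr.SearchHandler">
    <lst name="defaults">
        <str name="df">title_stemmed</str>
    </lst>
</requestHandler>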

The CompositeRequestHandler implementation is actually simple: its handleRequestBody method executes, sequentially, the configured handler references, and it breaks the chain after receiving the first positive query response (usually a query response with numFound > 0, but the latest version of the component also allows you to configure other predicates). The logic is something like this:

chain.stream()
    // get the request handler associated with a given name
    .map(refName -> requestHandler(request, refName))
    // only SearchHandler instances are allowed in the chain
    .filter(SearchHandler.class::isInstance)
    // execute the handler logic
    .map(handler -> executeQuery(request, response, params, handler))
    // keep only responses with at least one result (numFound > 0)
    .filter(qresponse -> howManyFound(qresponse) > 0)
    // stop the iteration at the first positive response
    .findFirst()
    // or, if no handler produced a positive response, just return an empty response
    .orElse(emptyResponse(request, response));
You can find the source code of CompositeRequestHandler in our Sease GitHub repository. As usual, any feedback is warmly welcome.

Solr Is Learning To Rank Better – Part 4 – Solr Integration

Last Stage Of The Journey

This blog post is about the Apache Solr Learning To Rank ( LTR ) integration.

We modelled our dataset, collected the data and refined it in Part 1.
We trained the model in Part 2.
We analysed and evaluated the model and the training set in Part 3.
We are now ready to rock and deploy the model and feature definitions to Solr.
In this blog post I will focus on the Apache Solr Learning To Rank ( LTR ) integration from Bloomberg [1].
The contribution is complete and available from Apache Solr 6.4.
This blog is heavily based on the Learning To Rank ( LTR ) Bloomberg contribution README [2].

Apache Solr Learning To Rank ( LTR ) integration

The Apache Solr Learning To Rank ( LTR ) integration allows Solr to rerank the search results by evaluating a provided Learning To Rank model.
The main responsibilities of the plugin are:

– storage of feature definitions
– storage of models
– feature extraction and caching
– search result reranking

Features Definition

As we learnt from the previous posts, the feature vector is the mathematical representation of each document/query pair, and the model scores each search result according to that vector.
Of course we need to tell Solr how to generate the feature vector for each document in the search results.
Here comes the Feature Definition file:
a JSON array describing all the relevant features necessary to score our documents through the machine-learned LTR model.

e.g.

[{ "name": "isBook",
  "class": "org.apache.solr.ltr.feature.SolrFeature",
  "params":{ "fq": ["{!terms f=category}book"] }
},
{
  "name":  "documentRecency",
  "class": "org.apache.solr.ltr.feature.SolrFeature",
  "params": {
      "q": "{!func}recip( ms(NOW,publish_date), 3.16e-11, 1, 1)"
  }
},
{
  "name" : "userTextTitleMatch",
  "class" : "org.apache.solr.ltr.feature.SolrFeature",
  "params" : { "q" : "{!field f=title}${user_text}" }
},
{
  "name":"book_price",
  "class":"org.apache.solr.ltr.feature.FieldValueFeature",
  "params":{"field":"book_price"}
},
{
  "name":"originalScore",
  "class":"org.apache.solr.ltr.feature.OriginalScoreFeature",
  "params":{}
},
{
   "name" : "userFromMobile",
   "class" : "org.apache.solr.ltr.feature.ValueFeature",
   "params" : { "value" : "${userFromMobile:}", "required":true }
}]  
SolrFeature
– Query Dependent
– Query Independent
A Solr feature is defined by a Solr query following the Solr syntax.
The value of the Solr feature is calculated as the return value of the query run against the document we are scoring.
This feature can depend on query-time parameters or can be query independent (see examples).
e.g.
"params":{ "fq": ["{!terms f=category}book"] }
– Query Independent
– Boolean feature
If the document matches the term ‘book’ in the field ‘category’, the feature value will be 1.
It is query independent, as no query param affects this calculation.
"params":{ "q": "{!func}recip( ms(NOW,publish_date), 3.16e-11, 1, 1)" }
– Query Dependent
– Ordinal feature
The feature value will be calculated as the result of the function query: the more recent the document, the closer the value is to 1.
It is query dependent, as ‘NOW’ affects the feature value.
"params":{ "q": "{!field f=title}${user_text}" }
– Query Dependent
– Ordinal feature
The feature value will be calculated as the result of the query: the more relevant the title content is for the user query, the higher the value.
It is query dependent, as the ‘user_text’ query param affects the calculation.
FieldValueFeature
– Query Independent
A Field Value feature is defined by a Solr field.
The value of the feature is calculated as the content of the field for the document we are scoring.
The field must be STORED or DOC-VALUED. This feature is query independent (see examples).
e.g.
"params":{ "field": "book_price" }
– Query Independent
– Ordinal feature
The value of the feature will be the content of the ‘book_price’ field for a given document.
It is query independent, as no query param affects this calculation.
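For instance, a schema declaration satisfying the stored / docValues requirement could look like this sketch (the field type name depends on the types available in your schema version):

<!-- illustrative schema field: any numeric type works, as long as it is stored or has docValues -->
<field name="book_price" type="tfloat" indexed="true" stored="true" docValues="true"/>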
ValueFeature
– Query Level
– Constant
A Value feature is defined by a constant or an external query parameter.
The value of the feature is calculated as the value passed in the Solr request as an EFI (External Feature Information) parameter, or as a constant.
This feature depends only on the configured param (see examples).
e.g.
"params" : { "value" : "${user_from_mobile:}", "required": false }
– Query Level
– Boolean feature
The user will pass the ‘user_from_mobile’ request param as an EFI.
The value of the feature will be the value of the parameter.
The default value will be assigned if the parameter is missing in the request.
If the feature is required, an exception will be thrown when the parameter is missing in the request.
"params" : { "value" : "5", "required": false }
– Constant
– Ordinal feature
The feature value will be calculated as the constant value of ‘5’. Except for the constant, nothing affects the calculation.
OriginalScoreFeature
– Query Dependent
An Original Score feature is defined with no additional parameters.
The value of the feature is calculated as the original Lucene score of the document given the input query.
This feature depends on query-time parameters (see examples).
e.g.
"params":{}
– Query Dependent
– Ordinal feature
The feature value will be the original Lucene score given the input query.
It is query dependent, as the entire input query affects the calculation.

EFI ( External Feature Information )

As you noticed in the feature definition JSON, external request parameters can affect the feature extraction calculation.
When running a rerank query it is possible to pass additional request parameters that will be used at feature extraction time.
We will see this in detail in the related section.

e.g.
rq={!ltr reRankDocs=3 model=externalmodel efi.user_from_mobile=1}
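Putting it all together, a complete rerank request could look like the one below; adding [features] to the fl parameter also returns the extracted feature vector with each document, which is handy for debugging (the collection name and EFI values are illustrative):

curl 'http://localhost:8983/solr/collection1/query' \
     --data-urlencode 'q=test' \
     --data-urlencode 'rq={!ltr model=externalmodel reRankDocs=3 efi.user_from_mobile=1}' \
     --data-urlencode 'fl=id,score,[features]'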


Deploy Features definition

Good, we have defined all the features we require for our model; we can now send them to Solr:

curl -XPUT 'http://localhost:8983/solr/collection1/schema/feature-store' --data-binary @/path/features.json -H 'Content-type:application/json'  

View Features Definition

To visualise the features just sent, we can access the feature store:

curl -XGET 'http://localhost:8983/solr/collection1/schema/feature-store'  

Models Definition

We extensively explored how to train models and what models look like in the format the Solr plugin expects.
For details I suggest reading Part 2.
Let’s have a quick summary anyway:


Linear Model (Ranking SVM, Pranking)

e.g.

 {
    "class":"org.apache.solr.ltr.model.LinearModel",
    "name":"myModelName",
    "features":[
        { "name": "userTextTitleMatch"},
        { "name": "originalScore"},
        { "name": "isBook"}
    ],
    "params":{
        "weights": {
            "userTextTitleMatch": 1.0,
            "originalScore": 0.5,
            "isBook": 0.1
        }
    }
} 
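To make the scoring explicit: the linear model simply computes the dot product between the configured weights and the extracted feature values. For instance, with a hypothetical feature vector userTextTitleMatch=1.0, originalScore=4.0, isBook=1.0, the reranking score would be:

score = (1.0 * 1.0) + (0.5 * 4.0) + (0.1 * 1.0) = 3.1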


Multiple Additive Trees (LambdaMART, Gradient Boosted Regression Trees)

e.g.

{
    "class":"org.apache.solr.ltr.model.MultipleAdditiveTreesModel",
    "name":"lambdamartmodel",
    "features":[
        { "name": "userTextTitleMatch"},
        { "name": "originalScore"}
    ],
    "params":{
        "trees": [
            {
                "weight" : 1,
                "root": {
                    "feature": "userTextTitleMatch",
                    "threshold": 0.5,
                    "left" : {
                        "value" : -100
                    },
                    "right": {
                        "feature" : "originalScore",
                        "threshold": 10.0,
                        "left" : {
                            "value" : 50
                        },
                        "right" : {
                            "value" : 75
                        }
                    }
                }
            },
            {
                "weight" : 2,
                "root": {
                    "value" : -10
                }
            }
        ]
    }
}  
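To see how the evaluation works, take a hypothetical document with userTextTitleMatch=1.0 and originalScore=12.0. Values above a node threshold follow the right branch: in the first tree, userTextTitleMatch (1.0) exceeds 0.5 and originalScore (12.0) exceeds 10.0, so we land on the leaf value 75. The second tree is a single leaf worth -10. The final score is the weighted sum of the trees: 1 * 75 + 2 * (-10) = 55.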

Heuristic Boosted Model (experimental)

The Heuristic Boosted Model is an experimental model that combines a linear boost with any other model.
It is currently available in the experimental branch [3].
This capability is currently supported only by org.apache.solr.ltr.ranking.HeuristicBoostedLambdaMARTModel.
The reason behind this approach is that sometimes, at training time, we don’t have all the features we want to use at query time available.
e.g.
Your training set is not built on clicks of the search results and contains legacy data, but you want to include the original score as a boosting factor.
Let’s see the configuration in detail.
Given:

"features":[ { "name": "userTextTitleMatch"}, { "name": "originalScoreFeature"} ]
"boost":{ "feature":"originalScoreFeature", "weight":0.1, "type":"SUM" }  

The original score feature value, weighted by a factor of 0.1, will be added to the score produced by the LambdaMART trees.

 "boost":{ "feature":"originalScoreFeature", "weight":0.1, "type":"PRODUCT" }  

The original score feature value, weighted by a factor of 0.1, will be multiplied with the score produced by the LambdaMART trees.

N.B. Take extra care when using this approach. It introduces a manual boost into the score calculation, which adds flexibility when you don’t have much data for training. However, you will lose some of the benefits of a machine-learned model, which is optimised to rerank your results. As you get more data and your model becomes better, you should phase out the manual boosting.


e.g.

{
    "class":"org.apache.solr.ltr.ranking.HeuristicBoostedLambdaMARTModel",
    "name":"lambdamartmodel",
    "features":[
        { "name": "userTextTitleMatch"},
        { "name": "originalScoreFeature"}
    ],
    "params":{
        "boost": {
            "feature": "originalScoreFeature",
            "weight": 0.5,
            "type": "SUM"
        },
        "trees": [
            {
                "weight" : 1,
                "root": {
                    "feature": "userTextTitleMatch",
                    "threshold": 0.5,
                    "left" : {
                        "value" : -100
                    },
                    "right": {
                        "value" : 10
                    }
                }
            },
            {
                "weight" : 2,
                "root": {
                    "value" : -10
                }
            }
        ]
    }
}

Deploy Model

As we saw for the features definition, deploying the model is quite straightforward:

curl -XPUT 'http://localhost:8983/solr/collection1/schema/model-store' --data-binary @/path/model.json -H 'Content-type:application/json' 

View Model

The model will be stored in an easily accessible JSON store:

curl -XGET 'http://localhost:8983/solr/collection1/schema/model-store'

Rerank query

To rerank your search results using a machine-learned LTR model, you need to call the rerank component using the Apache Solr Learning To Rank ( LTR ) query parser.

"Query Re-Ranking allows you to run an initial query (A) for matching documents and then re-rank the top N documents re-scoring them based on a second query (B).
Since the more costly ranking from query B is only applied to the top N documents, it will have less impact on performance than just using the complex query B by itself – the trade-off is that documents which score very low using the simple query A may not be considered during the re-ranking phase, even if they would score very highly using query B." (source: Solr Wiki)

The Apache Solr Learning To Rank ( LTR ) integration defines an additional query parser that can be used to define the rerank strategy.
In particular, when rescoring a document in the search results:

  • Features are extracted from the document
  • Score is calculated evaluating the model against the extracted feature vector
  • Final search results are reranked according to the new score
rq={!ltr model=myModelName reRankDocs=25}

!ltr – will use the ltr query parser
model=myModelName – specifies which model in the model-store to use to score the documents
reRankDocs=25 – specifies that only the top 25 search results from the original ranking will be scored and reranked

When passing external feature information (EFI) that will be used to extract the feature vector, the syntax is pretty similar:

rq={!ltr reRankDocs=3 model=externalmodel efi.parameter1='value1' efi.parameter2='value2'}

e.g.

rq={!ltr reRankDocs=3 model=externalModel efi.user_input_query='Casablanca' efi.user_from_mobile=1}

Sharding

When using sharding, each shard reranks its own top results, so reRankDocs is applied per shard.

e.g.
10 shards
You run a distributed query with:
rq={!ltr reRankDocs=10 …
You will get a total of 100 documents re-ranked.

Pagination

Pagination is delicate[4].

Let’s explore the scenario on a single Solr node and on a sharded architecture.


Single Solr node

reRankDocs=15
rows=10

This means each page is composed of 10 results.
What happens when we hit page 2?
The first 5 documents on the page (positions 11-15) will have been rescored and affected by the reranking.
The last 5 documents (positions 16-20) will preserve the original score and the original ranking.

e.g.
Doc 11 – score= 1.2
Doc 12 – score= 1.1
Doc 13 – score= 1.0
Doc 14 – score= 0.9
Doc 15 – score= 0.8
Doc 16 – score= 5.7
Doc 17 – score= 5.6
Doc 18 – score= 5.5
Doc 19 – score= 4.6
Doc 20 – score= 2.4

This means that score(15) could be < score(16), but documents 15 and 16 are still in the expected order.
The reason is that the top 15 documents are rescored and reranked, while the rest are left unchanged.

Sharded architecture

reRankDocs=15
rows=10
Shards number=2

When looking for page 2, Solr will trigger queries to the shards to collect 2 pages per shard:
Shard1 : 10 ReRanked docs (page1) + 10 OriginalScored docs (page2)
Shard2 : 10 ReRanked docs (page1) + 10 OriginalScored docs (page2)

Then the results will be merged and, possibly, originally scored search results can end up above reranked docs.
A possible solution could be to normalise the scores, to prevent any possibility that a reranked result is surpassed by originally scored ones.

Note: the problem is going to happen once rows * page > reRankDocs. In situations where reRankDocs is quite high, the problem will occur only in deep paging.

Feature Extraction And Caching

Extracting the features from the search result documents is the most onerous task while reranking using LTR.
The LTRScoringQuery will take care of computing the feature values in the feature vector and then delegate the final score generation to the LTRScoringModel.
For each document, the definitions in the feature-store are applied to generate the vector.
The vector can be generated in parallel, leveraging a multi-threaded approach.
Extra care must be taken when configuring the number of threads involved.
The feature vector is currently cached in toto in the QUERY_DOC_FV cache.

This means that given the query and EFIs, we cache the entire feature vector for the document.
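The cache is declared like any other Solr cache in solrconfig.xml; a declaration along the lines of the LTR contrib examples looks like this (the sizes are illustrative):

<cache name="QUERY_DOC_FV"
       class="solr.search.LRUCache"
       size="4096"
       initialSize="2048"
       autowarmCount="4096"
       regenerator="solr.search.NoOpRegenerator"/>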

Simply passing a different EFI request parameter implies a different hash code for the feature vector, and consequently invalidates the cached entry.
This bit could potentially be improved by managing separate caches for the query-independent, query-dependent and query-level features [5].