Apache Solr: orchestrating Known item and Full-text search

Scenario

You’re working as a search engineer for XYZ Ltd, a company which sells electric components. XYZ provided you the application logs of the last six months, and some business requirements.

Two kinds of customers, two kinds of requirements, two kinds of search

The log analysis shows that XYZ has mainly two kinds of customers: a first group, the “expert” users (e.g. electricians, resellers, shops) whose members are querying the system by product identifiers, codes (e.g. SKU, model codes, thinks like Y-M8GB, 140-213/A and ABD9881); it’s clear, at least it seems so, they already know what they want and what they are looking for. However, you noticed a lot of such queries produce no results. After investigating, the problem seems to be that codes and identifiers are definitely hard to remember: queries use a lot of disparate forms for pointing to the same product. For example:

  • y-m8gb (lowercase)
  • YM8GB (no delimiters)
  • YM-8GB (delimiter in a wrong place)
  • Y/M8GB (wrong delimiter)
  • Y M8GB (whitespace instead of delimiter)
  • y M8/gb (a combination of cases above)

This kind of scenario, where there’s only one relevant document in the collection, is usually referred to as “Known Item Search”: our first requirement is to make sure this “product identifier intent” is satisfied.

The other group of customers are end-users, like me and you. Being not so familiar with product specs like codes or model codes, the behaviour here is different: they use a plain keyword search, trying to match products by entering terms which represents names, brands, manufacturer. An here it comes the second requirement which can be summarized as follows: people must be able to find products by entering plain free-text queries.

As you can imagine, in this case search requirements are different from the other scenario: the focus here is more “term-centric”, therefore involving different considerations about the text analysis we’d need to apply.

While the expert group query is supposed to point to one and only one product (we are in a black / white scenario: match or not), the needs on the other side require the system to provide a list of “relevant” documents, according to the terms entered.

An important thing / assumption before proceeding: for illustration purposes we will consider those two queries / user groups as disjoint: that is, a given user belongs only to one of the mentioned groups, not both. Better, a given user query will contain product identifiers or terms, not both. 

Schema & configuration notes

The expert group, and the “Known Item Search”

The “product identifier” intent, which is assumed to be implicit in the query behaviour of this group, can be captured, both at index and query time, by applying the following analyzer, which basically treats the incoming value as a whole, normalizes it to lower case, removes all delimiters and finally collapses everything in a single output token.

<fieldtype name="identifier" class="solr.TextField" omitNorms="true">
    <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.WordDelimiterGraphFilterFactory"
                generateWordParts="0"
                generateNumberParts="0"
                catenateWords="0"
                catenateNumbers="0"
                catenateAll="1"
                splitOnCaseChange="0" />
    </analyzer>
</fieldtype>
<field name="product_id" type="identifier" indexed="true" ... />

In the following table you can see the analyzer in action with some example:

As you can see, the analyzer doesn’t declare a type attribute because it is supposed to be applied both at index and query time. However, there’s a difference in the incoming value: at index time the analyzer is dealing with a field content (i.e. the value of a field of an incoming document), while at query time the value which flows through the pipeline is composed by one or more terms entered by the user (a query, briefly).

While at index time everything works as expected, at query time the analyzer above requires a feature that has been introduced in Solr 6.5: the “Split On Whitespace” flag [1]. When it is set to “false” (as we need here in this context), it causes the incoming query text to be kept as a single whole unit, when sent to the analyzer.

Prior to Solr 6.5 we didn’t have such control, and the analyzers were receiving a “pre-tokenized-by-whitespaces” tokens; in other words, the unit of work of the query-time analysis was the single term: the analyzer chain (including the tokenizer itself) was invoked for each term outputted by that pre-whitespace-tokenization. As consequence of that our analyzer, at query time, couldn’t work as expected: if we take the example #5 and #6 from the table above, you can see the user entered a whitespace. With the “Split on Whitespace” flag set to true (explicitly, or using a Solr < 6.5), the pre-tokenization described above produces two tokens:

  • #5 = {“Y”, ”M8GB”}
  • #6 = {“y”, “M8/gb”}

so our analyzer would receive 2 tokens (for each case) and there won’t be any match with the single term ym8gb stored in the index. So, prior to Solr 6.5 we had two ways for dealing with this requirement:

  • client side: wrapping the whole query with double quotes, escaping whitespaces with “\”, or replacing them with a delimiter like “-“. Easy, but it requires a control on the client code, and this is not always possible.
  • Solr side: applying to the incoming query the same transformations as above but this time at query parser level. Easy, if you know some Lucene / Solr internals. In addition it requires a context where you have permissions for installing custom plugins in Solr. A similar effect could be obtained also using an UpdateRequestProcessor which would create a new field with the same value of the original field but without any whitespace.

The end-users group, and the full-text search query

In this case we are within a “plain” full-text search context, where the analysis identified a couple of target fields: product names and brands.

Differently from the previous scenario, here we don’t have a unique and deterministic way to satisfy the search requirement. It depends on a lot of factors: the catalog, the terms distribution, the implementor experience, the customer expectations in terms of user search experience. All these things can lead to different answers. Just for example, here’s a possible option:

<fieldType name="brand" class="solr.TextField" omitNorms="true">
    <analyzer>
        <charFilter 
                class="solr.MappingCharFilterFactory" 
                mapping="mapping-FoldToASCII.txt"/>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory" 
                ignoreCase="true" 
                words="lang/en/brand_stopwords.txt"/>
    </analyzer>
</fieldType>

<fieldType name="name" class="solr.TextField">
    <analyzer>
        <charFilter 
                  class="solr.MappingCharFilterFactory" 
                  mapping="mapping-FoldToASCII.txt"/>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter 
                class="solr.StopFilterFactory" 
                ignoreCase="false" 
                words="lang/en/product_name_stopwords.txt"/>
        <filter class="solr.EnglishPossessiveFilterFactory"/>
        <filter class="solr.EnglishMinimalStemFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
        <filter class="solr.LengthFilterFactory" min="2" max="50" />
    </analyzer>
</fieldType>

The focus here is not on the schema design itself: the important thing to underline is that this requirement needs a completely different configuration from the “Known Item Search” previously described.

Specifically, let’s assume we ended up following a “term-centric” approach for satisfying the second requirement. The approach requires a different value for the “Split on Whitespace” parameter, which has to be set to true, in this case.

The “sow” parameter can be set at SearchHandler level, so it is applied at query time. It can be declared within the solrconfig.xml and, depending on the configuration, it can be overridden using a named (HTTP) query parameter.

A “split on whitespace” pre-tokenisation leads us on a scenario which is really different from the “Known Item Search”, where instead we “should” be in a field-centric search; “should” is double-quoted because if, from one side, we are actually using a field-centric search, on the other side we are on an edge case where we’re querying one single field with one single query term (the first analyzer in this post always outputs one term).

The implementation

Where?

Although one could think the first thing is about how to combine those two different query strategies, prior to that, the question we need to answer is where to implement the solution? Clearly, regardless the way we will decide to follow, we will have to implement a (search) workflow, which can be summarised in the following diagram:

Known Item Search in Apache Solr

On Solr side, each “search” task needs to be executed in a different SearchHandler, so returning to our question: where do we want to implement such workflow? We have three options: outside, between or inside Solr.

#1: Client-side implementation

The first option is to implement the flow depicted above in the client application. That assumes you have the required control and programming skills on that side. If this assumption is true, then it’s relatively easy to code the workflow: you can choose one of the client API binding available for your language and then implement the double + conditional search illustrated above.

  • Pros: easy to implement. It requires a minimal Solr (functional) knowledge.
  • Cons: the search workflow / logic is moved on the client side. Programming is required, so you must be in a context where this can be done and where the client application code is under your control.

#2: Man-in-the-middle

Moving things outside the client sphere, another popular option, which can be still seen as a client-side alternative (from the Solr perspective), is a proxy / adapter / facade. Whatever is the name you want to give to this stuff, this is a new module which sits between the client application and Solr; it would intercept all requests and it would implement the custom logic by orchestrating the search endpoints exposed in Solr.

Being a new module, it has several advantages:

  • it can be coded using your preferred language
  • it is completely decoupled from the client application, and from Solr as well

but for the same reason, it has also some disadvantages:

  • it must be created: designed, implemented, tested, installed and maintained
  • it is a new piece in your system, which necessarily increases the overall complexity of the architecture
  • Solr exposes a lot of (index & search) services. With this option, all those services should be proxied, therefore resulting in a lot of unnecessary delegations (i.e. delegate services that don’t add any value to the execution chain).

#3: In Solr

The last option moves the workflow implementation (and the search logic) in the place where, in my opinion, it should be: in Solr.

Note that this option is usually not only a “philosophical” choice: if you are a search engineer, most probably you will be hired for designing, implementing and tuning the “search-side of the cake”. That means it’s perfectly possible that, for a lot of reasons, you must think to the client application as an external (sub)system, where you don’t have any kind of control.

The main drawback of this approach is that, as you can imagine, it requires programming skills plus a knowledge about the Solr internals.

In Solr, a search request is consumed by a SearchHandler, a component which is in charge of executing the logic associated with a given search endpoint. In our example, we would have the following search handlers matching the two requirements:

<!-- Known Item search -->
<requestHandler name="/known_item_search" class="solr.SearchHandler">
   <lst name="invariants">
        <str name="defType">lucene</str>
        <bool name="sow">false</bool> <!-- No whitespace split -->
        <str name="df">product_id</str>
   </lst>
</requestHandler>

<!-- Full-text search -->
<requestHandler name="/full-text-search" class="solr.SearchHandler">
    <lst name="invariants">
         <bool name="sow">true</bool> <!--Whitespace split -->
         <str name="defType">edismax</str>
         <str name="df">product_name</str>
         <str name="qf">
            product^0.7
            brand^1.5

On top of that, we would need a third component, which would be in charge to orchestrate the two search handlers above. I’ll call this component a “Composite Request Handler”.

The composite handler would also provide the public search endpoint called by clients. Once a request is received, the composite request handler implements the search workflow: it invokes all the handlers that compose its chain, and it will stop when one the invocation target produces the expected result.

The composite handler configuration looks like this:

<requestHandler name="/search" class=".....">
    <str name="chain">/know_item_search,/full_text_search</str>
</requestHandler>

On the client side, that would require only one request because the entire workflow will be implemented in Solr, by means of the composite request handler. In other words, imagining a GUI with a search bar, the client application, when the search button is pressed, would have to retrieve the term(s) entered by the user and send just one request (to the composite handler endpoint), regardless the intent of the user (i.e. regardless the group the user belongs to).

The composite request handler introduced in this section has been already implemented, you can find it in our Github account, here[2].

Enjoy and, as usual, any feedback is warmly welcome!

[1] https://lucidworks.com/2017/04/18/multi-word-synonyms-solr-adds-query-time-support

[2] https://github.com/SeaseLtd/invisible-queries-request-handler

Give the height the right weight: quantities detection in Apache Solr

Quantity detection? What is a quantity? And why do we need to detect it?

A quantity, as described by Martin Fowler in his “Analysis Patterns” [1] is defined as a pair which combines an amount and unit (such as 30 litres, 0.25 cl, or 140 cm). In search-based applications, there are many cases where you may want to classify your searchable dataset using dimensioned attributes, because such quantities have a special meaning within the business context you are working on. The first example that comes in my mind?

Apache Solr Quantity Detection Plugin

Beer is offered in several containers (e.g. cans, bottles); each of them is available in multiple sizes (e.g. 25 cl, 50 cl, 75 cl or 0.25 lt, 0.50 lt, 0.75 lt). A good catalog would capture these information in dedicated fields, like “container” (bottle, can) and “capacity” (25cl, 50cl, 75cl in the example above): in this way the search logic can properly make use of them. Faceting (and subsequent filtering) is a good example of what the user can do after a first search has been executed: he can filter and refine results, hopefully finding what he was looking for.

But if we start from the beginning of a user interaction, there’s no result at all: only the blank textfield where the user is going to type something. “Something” could be whatever, anything (in his mind) related with the product he wants to find: a brand, a container type, a model name, a quantity. In few words: anything which represents one or more relevant features of the product he’s looking for.

So one of the main challenge, when implementing a search logic, is to get the point about the meaning of the entered terms. This is in general a very hard topic, often involving complicated stuff (e.g. machine learning), but sometimes things move on an easier side, especially when concepts, we want to detect, follow a common and regular pattern: like a quantity.

The main idea behind the quantity detection plugin [2] we developed at Sease is the following: starting from the user entered query, first it detects the quantities (i.e. the amounts and the corresponding units); then, these information will be isolated from the main query and they will be used for boosting up all products relevant to those quantities. Relevancy here can be meant in different ways:

  • exact match: all bottles with a capacity of 25cl
  • range match: all bottles with a capacity between 50cl and 75cl.
  • equivalence exact match: all bottles with a capacity of 0.5 litre (1lt = 100cl)
  • equivalence range match: all bottles with a capacity between 0.5 and 1 litre (1lt = 100cl)

The following is a short list with a brief description of all supported features:

  • variants: a unit can have a preferred form and (optionally) several variants. This can include different forms of the same unit (e.g. mt, meter) or an equivalent unit in a different metric system (e.g. cl, once)
  • equivalences: it’s possible to define an equivalence table so units can be converted at runtime (“beer 0.25 lt” will have the same meaning of “beer 25cl”). An equivalence table maps a unit with a conversion factor.
  • boost: each unit can have a dedicated boost, especially useful for weighting multiple matching units.
  • ranges: each unit can have a configured gap, which triggers a range query where the detected amount can be in the middle (PIVOT), at the beginning (MIN) or at the end (MAX) of the generated range
  • multi-fields: in case we have more than one attribute using the same unit (e.g. height, width, depth)
  • assumptions: in case an “orphan” amount (i.e an amount without a unit) is detected, it’s possible to define an assumption table and let Solr guess the unit.

Feel free to have a try, and if you think it could be useful, please share with us your idea and / or your feedback.

[1] https://martinfowler.com/books/ap.html

[2] https://github.com/SeaseLtd/solr-quantities-detection-qparsers

ECIR 2018 Experience

Logo.png

This blog is a quick summary of my (subjective) experience at ECIR 2018 : the 40th European Conference on Information Retrieval, hosted in Grenoble (France) from 26/03/2018 to 29/03/2018.

Deep Learning and Explicability

Eight long papers accepted were about Deep Learning.
The topics “Neural Network” and “Word Embedding” were the most occurring in the accepted full papers (and rejected) at the conference. It is clear that deep learning technologies are strongly advancing as a de facto standard in Artificial Intelligence and this can be noticed also in Information Retrieval where these technologies can be used to better model the user behaviour and extract topics and semantic from documents.
But if with deep learning and the advanced capabilities of complex models you gain performance, on the other hand you  lose the ability of explaining and debugging why such output corresponds to a given input.
A recurring topic in the Deep Learning track ( and the Industry Day actually) was to find a balance in the performance gain of new techniques and the control over them.
It is an interesting topic and I believe it is just the other face of the coin, it is good though that Academia is noting the importance of such aspect : in the industry a technology requires control and maintenance much more than in the academic environment and most of the times the “debuggability” can affect the decision between a technology and another.

From Academic Papers to Production

The 2018 European Conference on Information Retrieval ended the 29/03 with a brilliant Industry Day focused on the perilous path from Research to Production.
This is the topic that most permeated the conference across different keynotes, sessions and informal discussions.
In Information Retrieval it is still difficult to converge successful researches into successful live systems : most of the times it is not clear which party should be interested in this process.
Okapi BM25 was first published and implemented in the 1980s and 1990s; it became the default similarity in Apache Lucene 6.0 in 2016 .
Academia is focused on finding new interesting problems to solve with inventive techniques and Industry is focused on finding the quickest solution that works (usually).
This brings a gap :  new solutions are not easily reproducible from academic papers and they are far from being practically ready to use outside the experimental controlled environment.
Brilliant researchers crave new interesting problems they can reason on and solve: their focus is to build a rough implementation and validate it with few metrics ( ideally with an accepted related publication ).
After that, their job is done, problem solved, the challenge is not interesting anymore and they can pass to the next problem.
Researchers get bored easily but they risk to never see their researches fulfilled, applied and used in real life.
Often Academia creates its own problems to solve them and get a publication : Publish or Perish -> no publications bring less funds.
This may be or may be not a personal problem, depending on the individual.
Industry is usually seen by academics as a boring place where you just apply consolidated techniques to just get the best result with a minimum effort.
Sometimes industry is just where you end up when you want to make some money and actually see your effort to bring benefits to some population.
And this was(is) true most of the times but with the IT explosion we are living and the boom of competition the situation nowadays is open to change, where a stronger connection between Academia and Industry can ( and should !) happen and conferences such as ECIR is a perfect ground to settle the basis.
So, building on the introduction let’s see a quick summary of the keynotes, topics and sessions that impressed me most from the conference!

From Academic Papers to Production : A Learning To Rank Story[1]

Let’s start from Sease contribution to the conference : a story about the adoption of Learning To Rank in a real world e-commerce scenario.
The session took place at the Industry Day (29/03) and focused on the challenges and pitfalls of moving from the research papers to production.
The entire journey was possible through Open Source software : Apache Solr and RankLib as main actors.
Learning To Rank is becoming extremely popular in the industry and it is a very promising and useful technology to improve the relevancy function leveraging the user behaviour.
But from an open source perspective is still quite a young technology and effort is required to get it right.
The main message that I wanted to transmit is : don’t be scared to fail, if something doesn’t work immediately out of the box, it doesn’t mean it’s not a valid technology, No Pain No Gain, Learning To Rank open source implementations are valid but require tuning and care to bring them to production with success ( and the improvement these technologies can bring is extremely valuable).

The Harsh Reality of Production Information Access Systems[2]

Recipient of the “Karen Spärck Jones Award” Fernando Diaz in this talk focused on the problems derived from the adoption of Information Retrieval technologies in production environments where a deep understanding of individuals, groups and society is required.
Building on the technical aspect involved in applying research in real world systems, the focus switched to the ethical side : are current IR systems moving in the direction of providing the user with a fair and equally accessible information or the monetisation process behind is producing just addicted users who see (and buy) ad hoc tailored information ?

Statistical Stemmers : A Reproducibility Study[3]

This paper is a sort of symbol of the conference trend : reproducibility is as important as the innovation that a research brings.
Winner of the Best Paper Award, it may have caused some perplexity among the audience ( why to reward a paper that is not innovating but just reproducing past techniques ?), but the message that transmits is clear : a research needs to be easily reproducible and effort is required in that direction for an healthy Research & Development flow that doesn’t target just the publication but a real world application.

Entity-centric Topic Extraction and Exploration: A Network-based Approach[4]

Interesting talk, it explores topic modelling over time in a network based fashion :
Instead of modelling a topic as ranked list of terms it uses a network (weighted graph) representation.
This may be interesting for an advanced More Like This implementation, it’s definitely worth an investigation.

Information Scent, Searching and Stopping : Modelling SERP Level Stopping Behaviour[5]

This talk focused on the entire Search Result Page as a possible signal that affects user stopping behaviour i.e. when a search result page is returned, the overall page quality affects the user perception of relevancy and may drive an immediate query reformulation OR a good abandonment ( when the information need is satisfied).
This something that I personally experimented : sometimes from a quick look to the result page you may realise if the search engine understood ( or misunderstood) your information need.
Different factors are involved in what the author calls “Information Scent” but definitely the perceived relevance (modelled through different User Experience approaches) is definitely an interesting topic that sits along the real relevance.
Further studies in this area may affect the way Search Results pages are rendered, to maximise the fruition of information.

Employing Document Embeddings to Solve the “New Catalog” Problem in User Targeting, and provide Explanations to the Users[6]

The new catalog problem is a practical problem for modern recommender systems and platforms. There are a lot of use cases where you have collection of items that you would like to recommend and this ranges over Music Streaming Platforms ( Playlists, Albums, ect), Video Streaming Platform ( Tv Series genres, To View lists ect) and many other domains.
This paper explores both the algorithm behind such recommendations and the explanation need : explaining to the user why a catalog may be relevant for his/her taste is as important as providing a relevant catalog of items.

Anatomy of an Idea: Mixing Open Source, research and Business

This keynote summarises the cornerstone of Sease culture :
Open Source as a bridge between Academia and Industry.
If every research paper should be implemented in a production ready open source platform as part of the publication process, the community would get a direct and immense benefit out of it.
Iterative improvement would get a great traction and generally speaking the entire scientific community will get a strong boost with a better accessibility .
Implementing researches in state of the art production ready open source systems ( where possible) would cut the adoption time at industry level, triggering an healthy process of utilisation and bug fixing.

Industry Day[7]

The industry day was the coronation of the overall trend that was permeating the conference : there is a strong need to build a better connection between the Academic and Industrial world.
And the good audience reception shown ( the organisers had to move the track to the main venue) is a proof that there is an increasingly stronger need of seeing interesting researches applied ( with success or failure) to the real world with the related lessons learned.
Plenty of talks in this session were brilliant, my favourites :

  • Fabrizio Silvestri (Facebook)
    Query Embeddings: From Research to Production and Back!
  • Manos Tsagkias (904Labs)
    A.I. for Search: Lessons Learned
  • Marc Bron (Schibsted Media Group)
    Managment of Industry Research: Experiences of a Research Scientist

In conclusion, the conference was perfectly organised in an excellent venue, the balance in topics and talks was fairly good( both academic and industrial) and I really enjoyed my time in Grenoble, see you next year in Cologne !

[1] https://www.slideshare.net/AlessandroBenedetti/ecir-2018-alessandro
[2] https://www.ecir2018.org/programme/keynote-speakers/
[3] http://www.dei.unipd.it/~silvello/papers/ECIR2018_SA.pdf
[4] https://dbs.ifi.uni-heidelberg.de/files/Team/aspitz/publications/Spitz_Gertz_2018_Entity-centric_Topic_Extraction.pdf
[5] https://strathprints.strath.ac.uk/62856/
[6] https://link.springer.com/chapter/10.1007%2F978-3-319-76941-7_28
[7] https://www.ecir2018.org/industry-day/