Blog

London Information Retrieval Meetup October

After the very warm reception of the first and second edition, the third London Information Retrieval Meetup is approaching (21/10/2019) and we are excited to add more details about our speakers and talks!
The event is free and you are invited to register :

https://www.eventbrite.com/e/london-information-retrieval-meetup-october-tickets-74403100677

Our second speaker is Andrea Gazzarini, our founder and software engineer:

Andrea Gazzarini

Andrea Gazzarini is a curious software engineer, mainly focused on the Java language and Search technologies.
With more than 15 years of experience in various software engineering areas, his adventure with the search domain began in 2010, when he met Apache Solr and later Elasticsearch… and it was love at first sight. 
Since then, he has been involved in many projects across different fields (bibliographic, e-government, e-commerce, geospatial).

In 2015 he wrote “Apache Solr Essentials”, a book about Solr, published by Packt Publishing.
He’s an opensource lover; he’s currently involved in several (too many!) projects, always thinking about a “big” idea that will change his (developer) life.

Music Information Retrieval Take 2: Interval Hashing Based Ranking

Retrieving musical records from a corpus of Information, using an audio input as a query is not an easy task. Various approaches try to solve the problem modelling the query and the corpus of Information as an array of hashes calculated from the chroma features of the audio input.
Scope of this talk is to introduce a novel approach in calculating such hashes, considering the intervals of the most intense pitches of sequential chroma vectors.
Building on the theoretical introduction, a prototype will show you this approach in action with Apache Solr with a sample dataset and the benefits of positional queries.
Challenges and future works will follow up as conclusive considerations.


Our first speaker is Alessandro Benedetti, our founder, software engineer and director:

Alessandro Benedetti

Alessandro Benedetti is the founder of Sease.
Senior Search Software Engineer, his focus is on R&D in information retrieval, information extraction, natural language processing, and machine learning.
He firmly believes in Open Source as a way to build a bridge between Academia and Industry and facilitate the progress of applied research.
Following his passion he entered the Apache Lucene and Solr world in 2010 becoming an active member of the community.
When he isn’t developing a new search solution he is presenting the applications of leading edge techniques in real world scenarios at conferences such as ECIR,  Lucene/Solr Revolution, Fosdem, Haystack, Apachecon and Open Source Summit.

How to Build your Training Set for a Learning to Rank Project

Learning to rank (LTR from now on) is the application of machine learning techniques, typically supervised, in the formulation of ranking models for information retrieval systems.
With LTR becoming more and more popular (Apache Solr supports it from Jan 2017), organisations struggle with the problem of how to collect and structure relevance signals necessary to train their ranking models.
This talk is a technical guide to explore and master various techniques to generate your training set(s) correctly and efficiently.
Expect to learn how to : 
– model and collect the necessary feedback from the users (implicit or explicit)
– calculate for each training sample a relevance label which is meaningful and not ambiguous (Click Through Rate, Sales Rate …)
– transform the raw data collected in an effective training set (in the numerical vector format most of the LTR training library expect)
Join us as we explore real world scenarios and dos and don’ts from the e-commerce industry.

Apache Solr ChildDocTransformerFactory: How to Build Complex ChildFilter Queries

When using nested documents and the Apache Solr Block Join functionality it is a common requirement to query for an entity (for example the parent entity) and then retrieve for each search result all(or some of) the related children.

Let’s see the most important aspects of such functionality and how to apply complex queries when retrieving children of search results.

How to Index Nested Documents

If we are providing the documents in Json format, the syntax is quite intuitive:

{
    "id": "A", 
    "queryGroup": "group1", 
    "_childDocuments_": [
      {
        "metricScore": "0.86", 
        "metric": "p", 
        "docType": "child", 
        "id": 12894
      }, 
      {
        "metricScore": "0.62", 
        "metric": "r", 
        "docType": "child", 
        "id": 12895
      }
    ], 
    "docType": "parent",
... 

The children documents are passed as an array of Json nodes, each one with a specific Id
N.B. if you rely on Apache Solr to assign the ID for you, using the UUIDUpdateProcessorFactory, this doesn’t work with child documents yet.
In such scenario you should implement your own Update Request Processor, that iterates over the children and assign an id to each one of them (and then contribute it to the community 🙂 )

If you are using SolrJ and you plan to index and retrieve children documents via code, the situation is a little bit more difficult.
First of all, let’s annotate the POJO properly:

public class Parent
{
    @Field
    private String id;
    ...

    @Field(child = true)
    private List<Child> children;

N.B. Parent, Child and children are just fantasy names, the important notation here is the SolrJ annotation @Field(child = true), you can use whatever name you like for your POJO classes and variables

Index Nested Documents in SolrJ

At Indexing time you have 2 options, you can use the Document Binder :

DocumentObjectBinder solrBinder = new DocumentObjectBinder();
Parent sampleParent = new Parent();
Child sampleChild = new Child();

SolrInputDocument parent = binder.toSolrInputDocument(sampleParent);
SolrInputDocument child = binder.toSolrInputDocument(sampleChild);
parent.addChildDocument(child);

solr.add("collection", parent)

Or you can use the plain POJO:

Parent sampleParent = new Parent();
Child sampleChild = new Child();

//you need to implement it in your POJO
sampleParent.addChildDocument(sampleChild);

solr.addBean("collection", sampleParent)

How to Query and Retrieve Nested Documents

Ok, we covered the indexing side, it’s not straightforward but at this point we should have nested documents in the index, nicely in adjacent blocks with the parent, to allow a fast retrieval at query time.
First of all let’s see how we can query parent/children and get an appropriate response.

Query Children and Retrieve Parents

q={!parent which=<allParents>}<someChildren>

e.g.

q={!parent which=docType:"parent"}title:(child title terms)

N.B. allParents is a query that matches all the parents, if you want to filter some parents later on, you can use filter queries or some additional clause:
e.g.
q= +title:join +{!parent which=”content_type:parentDocument“}comments:SolrCloud

The child query must always return only child documents.

Query Parents and Retrieve Children

q={!child of=<allParents>}<someParents>

e.g.

q={!child of="content_type:parentDocument"}title:lucene

N.B. The parameter allParents is a filter that matches only parent documents; here you would define the field and value that you used to identify all parent documents.
The parameter someParents identifies a query that will match some of the parent documents. The output is the children.

How to Retrieve Children Independently of the Query

If you have a query that returns parents, independently if it was a Block Join Query or just a plain query, you may be interested in retrieving child documents as well.
This is possible through the Child Transformer

[child] – ChildDocTransformerFactory

fl=id,[child parentFilter=doc_type:book childFilter=doc_type:chapter]

When using this transformer, the parentFilter parameter must be specified unless the schema declares _nest_path_. It works the same as in all Block Join Queries. Additional optional parameters are:

childFilter: A query to filter which child documents should be included. This can be particularly useful when you have multiple levels of hierarchical documents. The default is all children. This query supports a special syntax to match nested doc patterns so long as _nest_path_ is defined in the schema and the query contains a / preceding the first :. Example: childFilter=/comments/content:recipe 

limit: The maximum number of child documents to be returned per parent document. The default is 10

fl: The field list which the transformer is to return. The default is the top level fl).
There is a further limitation in which the fields here should be a subset of those specified by the top level fl parameter.

Complex childFilter queries

Let’s focus on the childFilter query.
This query must match only child documents.
Then It can be as complex as you like to retrieve only a specific subset of child documents.
Unfortunately is less intuitive than expected to pass complex queries here because by default spaces will work against you.

… childFilter=field:(classic OR boolean AND query)]

… childFilter=field: I am a complex query]

You can certainly try complex approaches in text analysis an debugging the parsed query, but I recommend to use local params placeholders and substitution, this will solve most of your issues:

fl=id,[child parentFilter=doc_type:book childFilter=$childQuery limit=100]
&childQuery=(field:(I am a complex child query OR boolean))

Using the placeholder substitution will solve you the whitespace local params splitting problems and help you in formulating complex queries to retrieve only subsets of children documents out of parent results.

Retrieve Child Documents in SolrJ

Once you have a query that is returning child documents (and potentially also parents) let’s see how you can use it in SolrJ to get back the Java objects.

DocumentObjectBinder solrBinder = new DocumentObjectBinder();

String fields="id,query," +
    "[child parentFilter=docType:parent childFilter=$childQuery]";
String childQuery = "childField:value";
final SolrQuery query = new SolrQuery(GET_ALL_PARENTS_QUERY);
query.add("metricFilter",metricFilter);
query.addFilterQuery("parentField:value");
...
query.setFields(fields);

QueryResponse children = solr.query("collection", query);
List<Parent> parents = binder.getBeans(Parent.class, children.getResults());

In this way you’ll obtain the Parent objects that satisfy your query including all the requested fields and the nested children.

Conclusion

Working with Nested Documents is extremely funny and can solve a lot of problems and tricky user requirements, but they are also not easy to master so I hope this blog can help you to navigate the rough sea of the Block Join and Nested Documents in Apache Solr!

London Information Retrieval Meetup June

After the very warm reception of the first edition, the second London Information Retrieval Meetup is approaching (25/06/2019) and we are excited to add more details about our speakers and talks!
The event is free and you are invited to register :

https://www.eventbrite.com/e/london-information-retrieval-meetup-june-tickets-62261343354

Our first speaker is René Kriegler, freelance search consultant and search engineer :

René Kriegler

René has been working as a freelance search consultant for clients in Germany and abroad for more than ten years. Although he is interested in all aspects of search and NLP, key areas include search relevance consulting and e-commerce search. His technological focus is on Solr/Lucene. René co-organises MICES (Mix-Camp E-Commerce Search, Berlin, 19 June). He maintains the Querqy open source library.

Query Relaxation – a Rewriting Technique between Search and Recommendations

In search quality optimisation, various techniques are used to improve recall, especially in order to avoid empty search result sets. In most of the solutions, such as spelling correction and query expansion, the search query is modified while the original query intent is normally preserved.
In my talk, I shall describe my experiments with different approaches to query relaxation. Query relaxation is a query rewriting technique which removes one or more terms from multi-term queries that would otherwise lead to zero results. In many cases the removal of a query term entails a change of the query intent, making it difficult to judge the quality of the rewritten query and hence to decide which query term should be removed.
I argue that query relaxation might be best understood if it is seen as a technique on the border between search and recommendations. My focus is on a solution in the context of e-commerce search which is based on using Word2Vec embeddings.

Haystack 2019 Experience

This blog is a quick summary of my (subjective) experience at Haystack 2019 : the Search Relevance Conference, hosted in Charlottesville (Virginia, USA) from 24/04/2019 to 25/04/2019.
References to the slides will be updated as soon as they become available.

First of all my feedback on the Haystack Conference is extremely positive.
From my perspective the conference has been a success.
Charlottesville is a delightful small city in the heart of Virginia, clean, organized, spatious and definitely relaxing, it has been a pleasure to spend my time there.
The venue chosen for the conference was a Cinema, initially I was surprised but it worked really well, kudos to OpenSource Connections for the idea.
The conference and talks were meticulously organised, on time and with a relaxed pace, that definitely helped both the audience and the speakers to enjoy it more: thanks to the whole organisation for such quality!
Let’s take a look to the conference itself now: it has been 2 days of very interesting talks, exploring the latest trends in the industry in regards to search relevance with a delightful tech agnostic approach.
That’s been one of my favourite aspects of the conference: no one was trying to sell its product, it was just a genuine discussion of interesting problems and practical solutions, no comparison between Apache Solr and Elasticsearch, just pure reasoning on challenging problems, that’s brilliant!
Last but not least, the conference allowed amazing search people from all over the world and cultures to meet, interact and discuss about search problems and technologies, it may sound obvious for a conference but it’s a great achievement nonetheless!

Keynote: What is Search Relevance?

Max Irwin opened the conference with its keynote on the meaning of Search Relevance, the talk was a smooth and nice introduction to the topic, making sure everyone was on the same page, ready for the following talks.
A good part of the opening was dedicated to the problem of collecting ground truth ratings (from explicit to implicit and hybrid approaches).

Rated Ranking Evaluation: An Open Source Approach for Search Quality Evaluation

After the keynote it was our turn, it has been an honour to open the track sessions in theatre 5 with our talk “Rated Ranking Evaluator: An Open Source Approach to Search Quality Evaluation”.
Our talk was a revised version on the introduction to RRE with a focus on the whole picture and how our software fits industry requirements.
Building on the introduction, we explored what search quality evaluation means for a generic information retrieval system and how you can apply the fundamental concepts of the topic to the real world with a full journey of assessing your system quality in an open source ecosystem.
Last part of the session was reserved for a quick demo, showing the key components in the RRE framework.
Really happy of the reception from the audience, I take the occasion to say a big thank you to everyone present in the theatre that day, this really encourages us to continue our work and make RRE even better.

Making the Case for Human Judgement Relevance Testing

After our talk, it was the turn of LexisNexis with an overview on judgement relevancy testing with the talk by Tito Serra and Tara Diedrichsen “Making the Case for Human Judgement Relevance Testing”.
The talk was quite interesting and explored the ways to practically setup a human relevance testing programme.
When dealing with humans, reaching or estimating consensus is not trivial and it is also quite important to details as much as possible why a document is rated that way (the reason is as important as the rating).

Query Relaxation – a Rewriting Technique between Searching and Recommendations

Lunch break and we’re back to the business with “Query Relaxation – a Rewriting Technique between Searching and Recommendations” by Rene Kriegler.
This one has been personally one of my favourites: from a clear definition of the problem (reducing the occurrence of zero results searches), the speaker illustrated various approaches, starting from just naive techniques (based on random removal of terms or term frequencies based removal) to the final word2vec + neural network system, able to drop words to maximise the probability of presenting a query reformulation that appeared in past sessions.
The overview of the entire journey was detailed and direct, especially because all the iterations were described and not only the final successful steps.

Beyond the Search Engine: Improving Relevancy through Query Expansion

And to conclude the first day I chose “Beyond the Search Engine: Improving Relevancy through Query Expansion”, a journey to improve the relevance in an e-commerce domain, from Taylor Rose and David Mitchell from Ibotta.
Focus of the talk was to describe a successful inter-team collaboration where a curated knowledge base used by the Machine Learning team proved quite useful to improve the mechanics of synonym matching and product categorisation.

Lightning Talks

After the sessions the first day ended with lightning talks.
They were very quick and thoughts provoking, some of them that caught my attention:

  • Quaerite – From Tim Allison, a toolkit to optimise search parameters using genetic algorithms
  • Hello LTR – From Doug Turnbull, a set of Jupiter notebooks to quickly spin up LTR experiments
  • Hathithrust – finally had the chance to hear live about one of the earliest Solr adopters for “big data” (I remember their to be the first articles I read about scaling up Apache Solr back in 2010)
  • Smui – Search Management UI for Synonyms
  • Querqy – from Rene Kriegler, a framework for query preprocessing in Java-based search engines

Addressing Variance in AB Tests: Interleaved Evaluation of Rankers

The second day opened for me with “Addressing Variance in AB Tests: Interleaved Evaluation of Rankers” where Erik Bernhardson went through the way the Wikimedia foundation faced the necessity of speeding up their AB tests, reducing the data necessary to validate the statistical significance of such tests.
The concept of interleaving results to assess rankers is well known to the academic community, but it was extremely useful to see a real life application and comparison of some of the available techniques.
Especially useful was the description of 2 tentative approaches:
– Balanced Interleaving
– Team Draft Interleaving
To learn more about the topic Erik recommended this very interesting blog post by Netflix : Innovating Faster on Personalization Algorithms at Netflix Using Interleaving
In addition to that, for people curious of exploring more the topic I would recommend this github project : https://github.com/mpkato/interleaving .
It offers the python implementations of various interleaving algorithms and present a valid bibliography of solid publications on the matter.

Solving for Satisfaction: Introduction to Click Models

Then was Elizabeth Haubert turn with “Solving for Satisfaction: Introduction to Click Models” a very interesting talk, cursed by some technical issues that didn’t prevent Elizabeth to perform brilliantly and detail to the audience various approaches in modelling the attractiveness and utility of search results from the user interactions.
If you are curious to learn more about click models I recommend this interesting survey:
Click Models for Web Search that explores in details some of the models introduced by Elizabeth.

Custom Solr Query Parser Design Option, and Pros & Cons

Last in the morning was “Custom Solr Query Parser Design Option, and Pros & Cons”[8] from Bertrand Rigaldies:  a live manual to customise Apache Solr query parsing capabilities to your needs, including a bit of coding to show the key components involved in writing a custom query parser.The example illustrated was about a slight customisation of proximity search behaviour (to parse the user query and build Lucene Span Queries to satisfy a specific requirement in distance tolerance) and capitalisation support.
The code and slides used in the presentation are available here : https://github.com/o19s/solr-query-parser-demo

Search Logs + Machine Learning = Auto-Tagging Inventory

After lunch John Berryman (co-author of Relevant Search) with “Search Logs + Machine Learning = Auto-Tagging Inventory” faced content tagging from a different perspective:
can we use query and clicks logs to guess tags for documents?
The idea makes sense, when given a query you interact with a document you are effectively generating a correlation between the two entities and this can definitely be used to help in the generation of tags!
In the talk John went through few iterative approaches (one based on just query-clicked docs training set and one based on query grouped by session), you find the Jupiter notebooks here for your reference, try them out!
First implementation
Query collapsing
Second implementation
Third implementation

Learning To Rank Panel

Following up the unfortunate absence of one of the speakers, a panel on Learning To Rank industry application took place, with interesting discussions about one of the hottest technologies right now that presents a lot of challenges still.
Various people were involved in the session and it was definitely pleasant to partecipate to the discussion.
The main takeaway from the panel has been that even if LTR is an extremely promising technology, few adopters are right now really ready to proceed with the integration:
garbage in, garbage out is still valid and extra care is needed when starting a LTR project.

Search with Vectors

Before the conference wrap up, the last session I attended was from Simon Hughes “Search with Vectors”, a beautiful survey of vectorised similarity calculation strategies and how to use them in search nowadays in correlation with word2vec and similar approaches.
The focus of the talk is to describe how vector based search can help with synonymy, polysemy, hyper/hypo-nyms and related concepts.
The related code and slides from previous talks are available in the Dice repo: https://github.com/DiceTechJobs/VectorsInSearch

London Information Retrieval Meetup

The London Information Retrieval Meetup is approaching (19/02/2019) and we are excited to add more details about the speakers and talks!
The event is free and you are invited to register :
https://www.eventbrite.com/e/information-retrieval-meetup-tickets-54542417840

After Sambhav Kothari, software engineer at Bloomberg and Elia Porciani, R&D software engineer at Sease, our last speaker is Andrea Gazzarini, founder and software engineer at Sease :

Andrea Gazzarini

Andrea Gazzarini is a curious software engineer, mainly focused on the Java language and Search technologies.
With more than 15 years of experience in various software engineering areas, his adventure with the search domain began in 2010, when he met Apache Solr and later Elasticsearch… and it was love at first sight. 
Since then, he has been involved in many projects across different fields (bibliographic, e-government, e-commerce, geospatial).

In 2015 he wrote “Apache Solr Essentials”, a book about Solr, published by Packt Publishing.
He’s an opensource lover; he’s currently involved in several (too many!) projects, always thinking about a “big” idea that will change his (developer) life.

Introduction to Music Information Retrieval

Music Information Retrieval is about retrieving information from music entities.
This high-level definition relates to a complex discipline with many real-world applications.     
Being a former bass player, Andrea will describe a high-level overview about Music Information Retrieval and it will analyse from a musician perspective a set of challenges that the topic offers.
We will introduce the basic concepts of the music language, then passing through different kind of music representations we will end up describing some useful low level features that are used when dealing with music entities. 

Elia Porciani

Elia is a Software Engineer passionate about algorithms and data structures concerning search engines and efficiency.
He is currently involved in many research projects at CNR (National Research Council, Italy ) and for personal purpose.
Before joining Sease he worked in Intecs and List where he could experience different fields and levels of computer science, by handling low level programming problems such as embedded and networking up to high level trading algorithms.
He graduated with a dissertation about data compression and query performance on search engines.
He is active part of the information retrieval research community, attending international conferences such as SIGIR and ECIR.
His most recent pubblication is : FASTER BLOCKMAX WAND WITH VARIABLE-SIZED BLOCKS SIGIR 2017 Proceedings of the 40th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2017

Improving top-k retrieval algorithms using dynamic programming and longer skipping

Modern search engines has to keep up with the enormous growth in the number of documents and queries submitted by users. One of the problem to deal with is finding the best k relevant documents for a given query. This operation has to be fast and this is possible only by using specialised technologies.
Block max wand is one of the best known algorithm for solving this problem without any effectiveness degradation of its ranking.
After a brief introduction, in this talk I’m going to show a strategy introduced in “Faster BlockMax WAND with Variable-sized Blocks” (SIGIR 2017), that applied to BlockMaxWand data has made possible to speed up the algorithm execution by almost 2x.
Then, will be presented another optimisation of the BlockMaxWand algorithm (“Faster BlockMax WAND with Longer Skipping”, ECIR 2019) for reducing the time execution of short queries.


Sambhav Kothari

Sambhav is a software engineer at Bloomberg, working in the News Search Experience team.



Learning To Rank: Explained for Dinosaurs

Internet search has long evolved from days when you had to string up your query in just the right way to get the results you were looking for. Search has to be smart and natural, and people expect it to “just work” and read what’s on their minds.

On the other hand, anyone who has worked behind-the-scenes with a search engine knows exactly how hard it is to get the right result to show up at the right time. Countless hours are spent tuning the boosts before your user can find his favorite two-legged tiny-armed dinosaur on the front page.

When your data is constantly evolving, updating, it’s only realistic that so do your search engines. Search teams thus are on a constant pursuit to refine and improve the ranking and relevance of their search results. But, working smart is not the same as working hard. There are many techniques we can employ, that can help us dynamically improve and automate this process. One such technique is Learning to Rank.

Learning to Rank was initially proposed in academia around 20 years ago and almost all commercial web search-engines utilize it in some form or other. At Bloomberg, we decided that it was time for an open source search-engine to support Learning to Rank, so we spent more than a year designing and implementing it. The result of our efforts has been accepted by the Solr community and our Learning to Rank plugin is now available in Apache Solr.

This talk will serve as an introduction to the LTR(Learning-to-Rank) module in Solr. No prior knowledge about Learning to Rank is needed, but attendees will be expected to know the basics of Python, Solr, and machine learning techniques. We will be going step-by-step through the process of shipping a machine-learned ranking model in Solr, including:

  • how you can engineer features and build a training data-set as per your needs
  • how you can train ranking models using popular Python ML(machine learning) libraries like scikit-learn
  • how you can use the above-learned ranking-models in Solr

Get ready for an interactive session where we learn to rank!


Apache Solr Facets and ACL Filters Using Tag and Exclusion

What happens with facets aggregations on fields when documents in the results have been filtered by Access Control Lists ?
In such scenarios it is important to use the facet mincount parameter.
That specifies the minimum count in the result set for a facet value to appear in the response:

  • mincount=0, all the facet values present in the corpus are returned in the response. This includes the ones related to documents that have been filtered out by the ACLs(0 counts facets). This could cause some nasty side effect: such as a user seeing a facet value that he/she’s not supposed to see(because ACL filtered out that document from the result set).
  • mincount=1, only facet values matching at least one document in the result set are returned. This configuration is safe, users are going to see only facet values regulated by the ACL. They will effectively see only what they are supposed to see.

But what happens if you like to see 0 counting facet values, but preserving ACL?
This may help you in having a better understanding of the distribution of the values in the entire corpus, but ACL are still valid, so that users still see only possible values that they are supposed to see.
Tags and Exclusion comes handy in such case.

Faceting Tag And Exclusion

Tag and Exclusion is an extremely important feature for faceting in Apache Solr and you would not believe how many times it is misused or completely ignored, causing an erratic experience for the user.
Let’s see how it works:

Tagging

You can tag a filter query using Solr local parameter syntax:

fq={!tag=docTypeFilter}doctype:pdf

The same applies to the main query(with some caveats if you are using an explicit query parser) :

q={!tag=mainQuery}I am the main query

q={!edismax qf=text title tag=mainQuery}I am the main query

When assigning a tag we give Solr the possibility of identifying separately the various search clauses (such the main query or filter queries).
Effectively it is a way to assign an identifier to a search query or filter.

Excluding in Legacy Faceting

When applying filter queries, Solr is reducing the result space eliminating documents that don’t satisfy the additional filters added.
Let’s assume we want to count the values for a facet on the result set, ignoring the additional filtering that was added by a filter query.
Effectively can be equivalent to the concept of counting the facet values on a result set status that precedes the application of the filter that reduced the result set.
Apache Solr allows you to do that, without affecting the final results returned.

This is called exclusion and can be applied on a facet by facet basis.

fq={!tag=docTypeFilter}doctype:pdf...&facet=true&
facet.field={!ex=docTypeFilter}doctype

This will calculate the ‘doctype’ field facet on the result set with the exclusion of the tagged filter (so for the matter of calculating such aggregation the “doctype:pdf” filter will not be applied and the counts will be calculated on an extended result set).
All other facets, aggregations and the result set itself will not be affected.

1.<Wanted Behaviour - applying tag and exclusion>
=== Document Type ===
[ ] Word (42)
[x] PDF (96)
[ ] Excel(11)
[ ] HTML (63)

This is especially useful for single valued fields:
when selecting a facet value and refreshing the search if you don’t apply tag and exclusion you will get just that value in the facets, defeating the refinement and exploration facet functionality for that field.

2.<Unwanted Behaviour - out of the box>
=== Document Type ===
[ ] Word (0)
[x] PDF (96)
[ ] Excel(0)
[ ] HTML (0)
3.<Unwanted Behaviour - mincount=1>
=== Document Type ===
[x] PDF (96)

As you see in 2. and 3. the facet become barely usable to further explore the results, this may bring the user experience to be fragmented with a lot of back and forth activity selecting and unselecting filters.

Excluding in Json Faceting

After the tagging of a filter, applying an exclusion with the json.facet approach is quite simple:

visibleValues: {
type: terms,
field: cat,
mincount: 1,
limit: 100,
domain: {
excludeTags: <tag>
}
}

When defining a json facet, applying exclusion is just adding the domain node with the excludeTags defined.

Tag and Exclusion to Preserve Acl Filtering in 0 counts

Problem

  • Users are subject to a set of ACL that limit their results visibility.
  • They would like to see also 0 count facets to have a better understanding of the result set and corpus.
  • You don’t want to invalidate the ACL control, so you don’t expect them to see sensible facet values.

Tagging the Main Query and Json Faceting

This is achievale with a combination of tagging and exclusion with Json faceting.
First of all, we want to tag the main query.
We assume the ACL control will be a filter query(and we recommend to apply ACL filtering with properly tuned filter queries).
Tagging the main query and excluding it from the facet calculation will allow us to get all the facet values in the ACL filtered corpus (the main query will be excluded but the ACL filter query will still be applied).

q={!edismax tag=mainQuery qf=name}query&fq=aclField:user1...
json.facet={visibleValues: {
type: terms,
field: cat,
mincount: 1,
limit: 100,
domain: {
excludeTags: mainQuery
}
}}

We are almost there, this facet aggregation will give the counts of all facet values visible to the user in the original corpus(with ACL applied).
But what we want is to have the correct counts based on the current result set and all the visible 0 count facets.
To do that we can add a block to the Json faceting request:

q={!edismax tag=mainQuery qf=name}query&fq=aclField:user1...
json.facet={
resultSetCounts: {
type: terms,
field: category,
mincount: 1
},
visibleValues: {
type: terms,
field: category,
mincount: 1,
domain: {
excludeTags: mainQuery
}
}
}
  • resultSetCounts –  are the counts in the result set, including only NOT 0 counts facet values. This is the list of values the user has visibility on the current result set with correct counts.
  • visibleValues – are all the facet values in the result set the user should have visibility

Then, depending on the user experience we want to provide, we could use these blocks of information to properly render a final response.
For example we may want to show all visible values and associate with them a count from the resultSetCounts when available.

=== Document Type - Result Counts ===   
[ ] Word (10)
[ ] PDF (7)
[ ] Excel(5)
[ ] HTML (2)
=== Document Type - Visible Values ===
[ ] Word (100)
[ ] PDF (75)
[ ] Excel(54)
[ ] HTML (34)
[ ] Jpeg (31)
[ ] Mp4 (14)
 [ ] SecretDocType1 (0) -> not visible, mincount=1 in visibleValues
 [ ] SecretDocType2 (0) -> not visible, mincount=1 in visibleValues

=== Document Type - Final Result for users ===
[ ] Word (10) -> count is replaced with effective result count
[ ] PDF (7) -> count is replaced with effective result count
[ ] Excel(5) -> count is replaced with effective result count
[ ] HTML (2)-> count is replaced with effective result count
[ ] Jpeg (+31)
[ ] Mp4 (+14)

Bonus: What if I Defined the Query Parser in the Solrconfig.xml

This solution is still valid if you are using your query parser defined in the solrconfig.xml .
Extra care is needed to tag the main query.
You can achieve that using the local params in Solr request parameters:

solrconfig.xml
<lst name="defaults">
...
<str name="q">{!type=edismax tag=mainQuery v=$qq}</str>
<str name="qq">*:*</str>
...

Query Time
.../solr/techproducts/browse?qq=ipod mini&fq=acl:user1&json.facet=...

Hope this helps when dealing with ACL or generic filter queries and faceting!

Apache Solr Distributed Facets

Apache Solr distributed faceting feature has been introduced back in 2008 with the first versions of Solr (1.3 according to this jira[1]) .
Until now, I always assumed it just worked, without diving too much into the details.
Nowadays distributed search and faceting are extremely popular, you can find them pretty much everywhere (in the legacy or SolrCloud form alike).
N.B. Although the mechanics are pretty much the same, Json faceting revisits this approach with some change, so we will now focus on legacy field faceting.

I think it’s time to get a better understanding of how it works:

Multiple Shard Requests

When dealing with distributed search and distributed aggregation calculations, you are going to see multiple requests going back and forth across the shards.
They have different focus and are meant to retrieve the different bits of information necessary to build the final response.
We are going to explore the different rounds of requests, focusing just for the faceting purpose.
N.B. Some of these requests are also carrying results for the distributed search calculation, this is used to minimise the network traffic.

For the sake of this blog let’s simulate a simple sharded index, white space tokenization on field1 and facet.field=field1

Shard 1 Shard 2
Doc0
{  “id”:”1”,
“field1”:”a b”
}
Doc3
{  “id”:”4”,
“field1”:”b c”
}
Doc1
{  “id”:”2”,
“field1”:”a”
}
Doc4
{  “id”:”5”,
“field1”:”b c”
}
Doc2
{  “id”:”3”,
“field1”:”b c”
}
Doc53
{  “id”:”6”,
“field1”:”c”
}

Global Facets : b(4), c(4), a(2)

Shard 1 Local Facets : a(2), b(2), c(1)

Shard 2 Local Facets : c(3), b(2)

Collection of Candidate Facet Field Values

The first round of requests is sent to each shard to identify the candidate top K global facet values.
To achieve this target each shard will be requested to respond with its local top K+J facet values and counts.
The reason we actually ask for more facets from each shard is to have a better term coverage, to avoid losing relevant facet values and to minimise the refinement requests.
How many more we request from each shard is regulated by the “overrequest” facet parameter, a factor that gives more accurate facets at the cost of additional computations[2].
Let’s assume we configure a facet.limit=2&facet.overrequest.count=0&facet.overrequest.ratio=1 to explain when refinement happens and how it works.

Shard 1 Returned Facets : a(2), b(2)

Shard 2 Returned Facets : c(3), b(2)

Global Merge of Collected Counts

The facet value counts collected from each shard are merged and the most occurring global top K is calculated.
These facet field values are the first candidates to be the final ones.
In addition to that, other candidates are extracted from the terms below the top K, based on the shards that didn’t return those values statistics.
At this point we have a candidate set of values and we are ready to refine their counts where necessary, asking back this information to the shards that didn’t include that in the first round.
This happens including the following specific facet parameter to the following refinement requests:

{!terms=$<field>__terms}<field>&<field>__terms=<values>
e.g.
{!terms=$field1__terms}field1&field1__terms=term1,term2

N.B. This request is specifically asking a Solr instance to return back the facet counts just for the terms specified[3]

Top 2 candidates = b(4), c(3)
Additional candidates = a(2)

The reason that a(2) is added to the potential candidates is because Shard 2 didn’t answer with a count for a, the potential missing count of 1 could bring a to the top K. So it is worth a verification.

Shard 1 didn’t return any value for the candidate c facet.
So the following request is built and sent to it:
facet.field={!terms=$field1__terms}field1&field1__terms=c

Shard 2 didn’t return any value for the candidate a facet.
So the following request is built and sent to it:
facet.field={!terms=$field1__terms}field1&field1__terms=a

Final Counts Refinement

The refinements counts returned by each shard can be used to finalise the global candidate facet values counts and to identify the final top K to be returned by the distributed request.
We are finally done!

Shard 1 Refinements Facets : c(1)

Shard 2 Refinements Facets : a(0)

Top K candidates updatedb(4), c(4), a(2)

GIven a facet.limit=2 the final global facets with correct results returned is :
b(4), c(4)

 

[1] https://issues.apache.org/jira/browse/SOLR-303

[2] https://lucene.apache.org/solr/guide/6_6/faceting.html#Faceting-Over-RequestParameters

[3] https://lucene.apache.org/solr/guide/7_5/faceting.html#limiting-facet-with-certain-terms

Synonyms and Stopwords: Vademecum

In this post we’ll cover two additional synonyms scenarios and we’ll try to summarise all previous tips in a coincise form. Following the approach of the previous posts [1] [2] [3], everything can be applied both to Apache Solr and Elasticsearch.

Preconditions

  • Synonyms and stopwords at query time: this is not just a “theoretical” constraint; imagine if you have to manage a deployment context belonging to the same customer with a lot of small / medium indexes: you cannot re-build from scratch everything each time a synonym or a stopword changes.
  • Synonyms, not hypernyms or hyponyms: or better, we aren’t talking about what a thesaurus calls broader, narrower or related terms. Although some of the things below could be also valid in those contexts, the broader or narrower scope introduced with hypernyms, hyponyms or related concepts can have some weird side-effect on the scoring phase.

Test data

Let’s start with the test data.

  • synonyms = [“out of warranty, oow”, “transfer phone number, port number”]
  • stopwords = [“of”, “my”]
  • query analyzer = [ “standard_tokenizer”, “lowercase filter”, “synonyms (graph) filter”, “stopwords filter”]

#1: How can I define Multi-terms Concepts?

If you want to manage a multi-terms concept as a whole, regardless it has synonyms or not, you can use the synonyms file. Here’s a couple of examples: the first is a concept with one synonym, the second one doesn’t have any synonym:

Multimedia Messaging Service,Multimedia Text Message,MMS
Apache Cassandra, Apache Cassandra

As you can see, when a concept doesn’t have any available synonym, we can just repeat it.

Solr users only: don’t forget the following things:

  • the request handler should use an edismax or lucene query parser, and the SplitOnWhiteSpace flag (sow) must be set to true
  • the field type which includes the synonyms graph filter must have the autoGeneratePhraseQueries set to true

You can read more here [1] about this approach.

Note: this will work until the Lucene SynonymMap uses a List/Array for collecting the synonyms associated with a given concept. When and if the implementation will switch to a Set-like approach, there’s a high chance this trick will stop working.

#2: What if the query contains multi-terms concepts with stopwords?

Imagine a query like this

q=my car is out of warranty. What can I do?

Well, with the configuration above the stopwords removal after the synonyms detection causes a weird effect on the generated query: the “what” term is wrongly added to the synonym phrase query: “out ? warranty what”.

While the issue affects the FilteringTokenFilter (the superclass of StopFilter) and therefore it has a wider scope, for this specific problem we proposed a solution [2], consisting of a specialised StopFilter which is aware about synonym tokens. The result is that terms which are part of a previously detected synonym are not removed, even if they are stopwords. The query analyzer of our field becomes something like this:

<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SynonymGraphFilterFactory" 
        synonyms="synonyms.txt" 
        ignoreCase="false" 
        expand="true"/>
<filter class="io.sease.SynonymAwareStopFilterFactory" 
        words="stopwords.txt" 
        ignoreCase="true"/>

#3: What if the document contains multi-terms concepts with “intruder” stopwords?

We have a document like this:

{
  "id": 1,
  "title": "how do I transfer my phone number?"
}

and the query:

q=transfer phone number procedure

at query time, the synonym is correctly detected and phrase clauses are generated, but unfortunately it doesn’t match the document above because the intermediate “my” stopwords:

You can read here [3] the proposed solution for this scenario, which basically consists of a two-steps query plan: in the first, the detected synonyms generate phrase clauses, while in the second they are destructured in term clauses.

#4: What if the query contains multi-terms concepts with “intruder” stopwords?

And here we are in the opposite case. We have a document like this:

{ 
  "id": 1, 
  "title": "transfer phone number procedure" 
}

and the query:

q=how do I transfer my phone number?

As you can see, at query time the synonym is not detected because the “my” stopword between terms. While the document above could be still be part of the response of the generated query, here we are focusing on the missing synonym detection.

A possible solution is to double the synonym filter before and after the stopwords filter:

<fieldtype 
       name="text_with_synonyms_phrases" 
       class="solr.TextField" autoGeneratePhraseQueries="true">
       
       <analyzer type="index">
           <tokenizer class="solr.StandardTokenizerFactory"/>
           <filter class="solr.LowerCaseFilterFactory"/>
       </analyzer>
       <analyzer type="query">
           <tokenizer class="solr.StandardTokenizerFactory"/>
           <filter class="solr.LowerCaseFilterFactory"/>
           <filter class="solr.SynonymGraphFilterFactory" 
                   synonyms="synonyms.txt" 
                   ignoreCase="true" 
                   expand="true"/>
           <filter class="io.sease.SynonymAwareStopFilterFactory" 
                   words="stopwords.txt" 
                   ignoreCase="true"/>
           <filter class="solr.SynonymGraphFilterFactory" 
                   synonyms="synonyms.txt" 
                   ignoreCase="true" 
                   expand="true"/>
       </analyzer>
</fieldtype>

In the first iteration the synonym is not detected, then the StopFilter removes the “my” stopword so in the second iteration the synonym will be correctly recognized. Note the StopFilter is still the custom class we introduced in #2 because we want to cover also that scenario.

What is the drawback of this approach? This is something which worked in my specific case, but be aware that the SynonymGraphFilter documentation states this explicit warning:

NOTE: this cannot consume an incoming graph; results will be undefined.

#5 (UNSOLVED) What if the query contains multi-terms concepts more than one “intruder” stopwords?

This is the worst case, where we have a query like this:

q=out of my warranty

That is: we have a couple of terms which have been declared as stopwords, but the first (of) is potentially part of a synonym (out of warranty) while the second (my) isn’t.

We’re still working on this case so unfortunately there’s no a proposal here, if you got some idea or feedback, it is warmly welcome.


[1] Multi-terms concepts in Apache Solr / Elasticsearch
[2] SynonymAwareStopFilter
[3] https://sease.io/2018/08/still-synonyms-stopwords-mamma-mia.html

Still Synonyms + Stopwords?? Mamma mia!

The Context

Brief recap of where we arrived in the preceding article: we had the following synonyms and stopwords settings:

  • synonyms = {“out of warranty”,”oow”}
  • stopwords = {“of”}

Both of those filters were configured exclusively at query-time; the synonym filter first and then the stopwords filter.

Using the built-in StopFilter we had a synonym detection issue because the removal of the “of” term in the query string (e.g. “my device ran out of warranty“). For that reason, we introduced a custom StopFilter subclass which was aware about stopwords in synonyms.

The other scenario we are going to describe is a little bit different: let’s suppose we have the following data:

  • synonyms = {test code, tdd, testing}
  • stopwords = {my, your, how ,to, in}

Still here, we want to manage synonyms and stopwords only at query time.
We have this document indexed:

   {
      "id": 1,
      "title": "Java programmer: do you want to test your code?"
   }

And a query like this:

"how to test code in Java?"

The Problem: missing synonym match

The query parser matches the “test code” synonym in the query and produces a query like this:

(title:tdd title:testing PhraseQuery(title:"test code")) title:java

unfortunately there’s no match, because the document title contains an intruder: the “your” term between the “test” and “code”.

A Solution: invisible queries with and without synonym phrases

In the preceding article we’ve underlined the role of the autoGeneratePhraseQueries flag. It is the responsible of creating phrase clauses for all detected multi-terms synonyms. In case this flag is set to false (or even missing) the generated query won’t have any phrase, even if a multi-term synonym is detected.

While ordinarily this is not what you would expect, in this specific case it could be a valid alternative for dealing with such mismatching: a first request would require the “synonym phrasing” behaviour, a second one wouldn’t. The first query would be:

(title:tdd title:testing PhraseQuery(title:"test code")) title:java

After receiving an empty response, a second query will be sent, targeting another (similar) field related to a field type which has the autoGeneratePhraseQueries parameter will be set to false. That would generates the following query:

(title:testing title:tdd (+title:test +title:code)) title:java

and here we would get a match!

A couple of notes:

  • in the second try we are requiring the disjoint presence of those two terms (“test” and “code”) in whatever order, with whatever proximity, so the increased recall could produce some unexpected results. In case we are using the edismax query parser, a “pf” parameter would be helpful for moving up those results which adhere better to the entered query, in terms of proximity and terms order.
  • we could put the stop filter at index time, but that violates the precondition: we want a pure query-time management.

How to implement such search workflow? In Solr, we need a couple of fields, the first one is exactly the field + field type we described in the preceding article, the second is similar, the only difference is in the autoGeneratePhraseQueries parameter, which is set to false:

<fieldtype 
       name="text_with_synonyms_phrases" 
       class="solr.TextField" autoGeneratePhraseQueries="true">
       
       <analyzer type="index">
           <tokenizer class="solr.StandardTokenizerFactory"/>
           <filter class="solr.LowerCaseFilterFactory"/>
       </analyzer>
       <analyzer type="query">
           <tokenizer class="solr.StandardTokenizerFactory"/>
           <filter class="solr.LowerCaseFilterFactory"/>
           <filter class="solr.SynonymGraphFilterFactory" 
                   synonyms="synonyms.txt" 
                   ignoreCase="false" 
                   expand="true"/>
           <filter class="sc.SynonymAwareStopFilterFactory" 
                   words="stopwords.txt" 
                   ignoreCase="true"/>
       </analyzer>
</fieldtype>
<fieldtype 
       name="text_without_synonyms_phrases" 
       class="solr.TextField" autoGeneratePhraseQueries="false">
       
       <analyzer type="index">
           <tokenizer class="solr.StandardTokenizerFactory"/>
           <filter class="solr.LowerCaseFilterFactory"/>
       </analyzer>
       <analyzer type="query">
           <tokenizer class="solr.StandardTokenizerFactory"/>
           <filter class="solr.LowerCaseFilterFactory"/>
           <filter class="solr.SynonymGraphFilterFactory" 
                   synonyms="synonyms.txt" 
                   ignoreCase="false" 
                   expand="true"/>
           <filter class="sc.SynonymAwareStopFilterFactory" 
                   words="stopwords.txt" 
                   ignoreCase="true"/>
       </analyzer>
</fieldtype>

<field 
      name="title_with_synonyms_phrases" 
      type="text_with_synonyms_phrases .../>
<field 
      name="title_without_synonyms_phrases" 
      type="text_without_synonyms_phrases .../>

then, here is the minimal request handler:

<requestHandler name="/search" class="solr.SearchHandler" default="true">
       <lst name="defaults">
           <bool name="sow">false</bool>
           <str name="df">title_with_synonyms_phrases</str>
           <str name="defType">lucene</str> 
       </lst>
   </requestHandler>

A client would send first a request like this:

/search?q=how to test code in Java

And, after receiving an empty response, it will send a second query:

/search?q=how to test code in Java&df=text_without_synonyms_phrases

Another option, which moves the search workflow on Solr side, is our CompositeRequestHandler [1], a Solr component which invokes in chain a set of RequestHandler instances: a first request handler, targeting the title_with_synonyms_phrases would be invoked and, in case of zero results, the same query will be sent to another request handler, which would target the title_without_synonyms_phrases.

Note for Elasticsearch users: you will find some difference in applying what is described above. Although the auto_generate_phrase_queries attribute is also present in Elasticsearch, it doesn’t have the same effect. What you’re looking for is an attribute which is not related with field types, it is a query attribute [2] [3]  and it is called auto_generate_synonyms_phrase_query.


[1] https://github.com/SeaseLtd/composite-request-handler
[2] Match Query / Synonyms
[3] Query String Query / Synonyms

Synonyms + Stopwords?? OMG!

The Context

The scenario description is quite simple: we want to use synonyms and stopwords.

Following the path of our previous article, we will introduce an additional component in the analysis chain: a StopFilter, which, as the name suggests, removes a set of words from an incoming token stream.

We will use the following data through the examples:

  • synonyms = [“out of warranty”,”oow”]
  • stopwords = [“of”]

Token filters can be configured at index and/or query time. In this context we are focused on the query side: both synonyms and stopwords will be configured only in the query analyzer.

Working exclusively at query time has a great benefit: we can change things at runtime without any reindex need. At the same time, no stopwords filtering will be executed at index time so those terms will be uselessly part of the dictionary.

The Problem: synonyms followed by stopwords

We have the following analyzers:

  • index analyzer
    • standard-tokenizer
    • lowercase
  • query analyzer
    • standard-tokenizer
    • lowercase + synonyms + stopwords

Theoretically, in the query analyzer we would have two options: the stopwords filter could be defined before or after the synonym filter. However, the first way (before) doesn’t make so much sense, because terms that are stopwords and that are, at the same time, part of a synonym will be removed before the synonym detection. As consequence of that those synonym won’t be detected: in the example data, issuing a query like

q=out of warranty

the “of” term will be removed by the StopFilter, the subsequent filter would receive [“out”, “warranty”], which doesn’t match the configured synonym (“out of warranty”).

Elasticsearch users: Elasticsearch doesn’t allow this scenario at all; if you try to use the PUT Settings API with a chain defined as above (first stopwords then synonyms with some term intersection), it will throw an illegal argument exception saying “term: out of warranty analyzed to a token (warranty) with position increment != 1 (got: 2)” .

Apache Solr instead uses a lenient approach: no errors at index creation, but the problem remains (personally I prefer the Elasticsearch approach)

So the obvious choice is to postpone the stopwords management after the synonym filter. Unfortunately, here there’s an issue: the stopword(s) removal has some unwanted side-effect in the generated token graph and the query parser generates a wrong query because it consumes the token stream at the end of the chain.

Let’s imagine we have the following query:

q=tv went out of warranty something of

it will generate the following:

title:tv title:went (title:oow PhraseQuery(title:"out ? warranty something"))

As you can see, the synonym (out of warranty -> oow) is correctly detected but the stopwords filter removes all the “of” tokens, even if the first occurrence is part of a synonym. In the generated query you can see the sneaky effect: the “hole” created by the first “of” occurrence removal, produces the inclusion, in the phrase query, of the next available token in the stream (“something”, in the example).

In other words, the oow token synonym is marked with a positionLength = 3, which correctly means it spans three tokens (1=out, 2=of, 3=warranty); later, the query parser will include the next three available terms for generating a synonym phrase queries but since we no longer have the 2nd token (of), such count includes also “something”, which is the 3rd available token in the stream.

Before proceeding: this is a known problem, a long-standing issue [1] in Lucene which has a broader domain because it is related with the FilteringTokenFilter, the superclass of StopFilter.

The problem we will try to solve is: how can we manage synonyms and stopwords at query time without generating the conflict above?

A Solution

A note first: the token filter we are going to create is something that deals only with Lucene classes. However, when things need to be plugged in a runtime container (e.g. Apache Solr or Elasticsearch) the deployment procedure depends on the target platform: we won’t cover this part here.

The proposed solution is to create a StopFilter subclass which will be “synonym-aware”; it will check the tokenType and positionLength attributes before deciding if a token needs to be removed from the stream. The goal is to avoid removing those terms which have been defined in the stopwords list but are part of a synonym definition.

The class that we are going to extends is org.apache.lucene.analysis.core.StopFlter. This is an empty class, because all the filtering logic is in the superclasses (org.apache.lucene.analysis.StopFilter and the more generic org.apache.lucene.analysis.FilteringTokenFilter). The stopwords logic resides in the accept() method, which as you can see is very simple:

protected boolean accept() {
  return !stopWords.contains(termAtt.buffer(), 0, termAtt.length());
}

If the stopwords list contains the current term, it will be removed. So far, so good. We need to extend (actually we could also decorate) the StopFilter class for doing something else before calling the logic above.

First we need to check the token type: if a token has been marked as a SYNONYM then our filter doesn’t have to remove it. Then we need to check the positionLength attribute, because, within a synonym detection context, a position length greater than 1 means we have traversing a multi-term synonym:

public class SynonymAwareStopFilter extends StopFilter {

  private TypeAttribute tAtt = 
                             addAttribute(TypeAttribute.class);
  private PositionLengthAttribute plAtt = 
                             addAttribute(PositionLengthAttribute.class);

  private int synonymSpans;

  protected SynonymAwareStopFilter(
                         TokenStream in, CharArraySet stopwords) {
    super(in, stopwords);
  }

  @Override
  protected boolean accept() {
    if (isSynonymToken()) {
      synonymSpans = plAtt.getPositionLength() > 1 
                             ? plAtt.getPositionLength() 
                             : 0;
      return true;
    }

    return (--synonymSpans > 0) || super.accept();
  }

  private boolean isSynonymToken() {
    return "SYNONYM".equals(tAtt.type());
  }

Let’s do some test. We will use Apache Solr 7.4.0 for checking the results. Here is the field type definition, where you can see our SynonymAwareStopFilter:

<fieldtype name="text" class="solr.TextField" autoGeneratePhraseQueries="true">
       <analyzer type="index">
           <tokenizer class="solr.StandardTokenizerFactory"/>
           <filter class="solr.LowerCaseFilterFactory"/>
       </analyzer>
       <analyzer type="query">
           <tokenizer class="solr.StandardTokenizerFactory"/>
           <filter class="solr.LowerCaseFilterFactory"/>
           <filter class="solr.SynonymGraphFilterFactory" 
                   synonyms="synonyms.txt" 
                   ignoreCase="false" 
                   expand="true"/>
           <filter class="sc.SynonymAwareStopFilterFactory" 
                   words="stopwords.txt" 
                   ignoreCase="true"/>
       </analyzer>
</fieldtype>

and this is a minimal request handler:

<requestHandler name="/def" class="solr.SearchHandler" default="true">
       <lst name="defaults">
           <bool name="sow">false</bool>
           <str name="df">title</str>
           <str name="defType">lucene</str>
           <bool name="debug">true</bool>
       </lst>
   </requestHandler>

Running the previous query:

q=tv went out of warranty something of

we have the following:

title:tv title:went (title:oow PhraseQuery(title:"out of warranty")) title:something

if we use instead the other synonym variant:

q=tv went oow something of

we have the following:

title:tv title:went (PhraseQuery(title:"out of warranty") title:oow) title:something

Everything seems working as expected! This is probably just one specific scenario among those addressed by LUCENE-4065; however, it helped me a lot because this is (at least in my experience) a frequent use case.

As usual, any feedback is warmly welcome. See you next time!

 


[1] https://issues.apache.org/jira/browse/LUCENE-4065