It is clear that the current documentation available on the wiki is not enough to fully understand the Solr Suggester: this blog post describes all the available implementations, with examples, tricks and tips.
Introduction
If there’s one thing that months on the Solr-user mailing list have taught me, it is that the autocomplete feature in a search engine is vital, and that around Apache Solr autocomplete there is as much hype as confusion.
In this blog I am going to try to clarify as much as possible all the kinds of Suggesters that can be used in Solr, exploring in detail how they work and showing some real-world examples.
It is not in the scope of this blog post to explore the configurations in detail.
Please use the official wiki [1] and this really interesting blog post [2] to complement this resource.
Let’s start with the definition of the Apache Solr Suggester component.
The Apache Solr Suggester
“The SuggestComponent in Solr provides users with automatic suggestions for query terms. You can use this to implement a powerful auto-suggest feature in your search application. This approach utilizes Lucene’s Suggester implementation and supports all of the lookup implementations available in Lucene. The main features of this Suggester are:
- Lookup implementation pluggability
- Term dictionary pluggability, giving you the flexibility to choose the dictionary implementation
- Distributed support”
For the details of the configuration parameters, I refer you to the official wiki.
Our focus will be the practical use of the different lookup implementations, with clear examples.
Term Dictionary
The DocumentDictionary uses the Lucene index to provide the list of possible suggestions; specifically, one field is configured as the source for the suggestion terms.
Suggester Building
Building a Suggester involves:
- retrieving the terms (the source of the suggestions) from the dictionary
- building the data structures that the Suggester requires for the lookup at query time
- storing the data structures in memory/disk
The produced data structure will be stored in memory in the first place.
It is recommended to additionally store the built data structures on disk: this way they will be available without rebuilding when they are no longer in memory.
For example, when you start up Solr, the data will be loaded from disk into memory without any rebuild being necessary.
The parameter that configures this storage location is:
- “storeDir” for the FuzzyLookup
- “indexPath” for the AnalyzingInfixLookup
When the Suggester build is triggered:
- the stored content of the configured field is read from disk (stored=”true” is required for the field for the Suggester to work)
- the compressed content is decompressed (remember that Solr stores the plain content of a field applying a compression algorithm [3])
- the suggester data structure is built
Parameter | Description |
---|---|
buildOnCommit or buildOnOptimize | If true, the lookup data structure will be rebuilt after each soft commit. If false (the default), the lookup data will be built only when requested by the query parameter suggest.build=true. Given the previous observations, it is quite easy to understand why buildOnCommit is highly discouraged. |
buildOnStartup | If true, the lookup data structure will be built when Solr starts or when the core is reloaded. If this parameter is not specified, the suggester will check whether the lookup data structure is present on disk and build it if not found. Again, setting this to true is highly discouraged, or our Solr cores could take a really long time to start up. |
A good consideration at this point would be to introduce a delta approach to the dictionary building.
It could be a good improvement, making more sense out of the “buildOnCommit” feature.
I will follow up verifying the technical feasibility of this solution.
Now let’s step through the various lookup implementations with related examples.
Note: when using the field type “text_en” we refer to a simple English analyzer with soft stemming and a stop filter enabled.
The simple corpus of documents for the examples will be the following:
[ { "id":"44", "title":"Video gaming: the history"},
  { "id":"11", "title":"Video games are an economic business"},
  { "id":"55", "title":"The new generation of PC and Console Video games"},
  { "id":"33", "title":"Video games: multiplayer gaming"} ]
And a simple synonym mapping: multiplayer, online
AnalyzingLookupFactory
<lst name="suggester">
  <str name="name">AnalyzingSuggester</str>
  <str name="lookupImpl">AnalyzingLookupFactory</str>
  <str name="dictionaryImpl">DocumentDictionaryFactory</str>
  <str name="field">title</str>
  <str name="weightField">price</str>
  <str name="suggestAnalyzerFieldType">text_en</str>
</lst>
Description | |
---|---|
Data Structure | FST |
Building | For each document, the stored content of the field is analyzed according to the suggestAnalyzerFieldType. The tokens produced are added to the index FST. |
Lookup strategy | The query is analyzed; the tokens produced are added to the query FST. An intersection happens between the index FST and the query FST. The suggestions are identified starting at the beginning of the field content. |
Suggestions returned | The entire content of the field. |
This suggester is quite powerful, as it provides suggestions matching the beginning of the field content while taking advantage of the analysis chain configured for the field.
In this way it is possible to provide suggestions that take into account synonyms, stop words, stemming and any other token filter used in the analysis. Let’s see some examples:
Query to autocomplete | Suggestions | Explanation |
---|---|---|
“Video gam” | (screenshot) | The suggestions returned are simply the result of the prefix match. No surprises so far. |
“Video Games” | (screenshot) | The input query is analyzed, and the tokens produced are: “video” “game”. The analysis was applied at building time as well, producing the same stemmed terms for the beginning of the titles: “video gaming” -> “video” “game”; “video games” -> “video” “game”. So the prefix match applies. |
“Video game econ” | (screenshot) | In this case we can see that the stop words were not considered when building the index FST. Note: position increments MUST NOT be preserved for this example to work; see the configuration details. |
“Video games online ga” | (screenshot) | Synonym expansion has happened and the match is returned, as “online” and “multiplayer” are considered synonyms by the suggester, based on the analysis applied. |
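The analyze-then-prefix-match behaviour described above can be simulated with a toy sketch (plain Python, not Lucene code; the stemmer, stop list and synonym map are crude stand-ins for the real analysis chain):

```python
# Toy simulation of the AnalyzingLookup: the indexed titles and the query
# go through the same analysis chain (lowercase, stop-word removal, synonym
# normalization, naive soft stemming), then suggestions are found by prefix
# match on the analyzed token stream.

STOPWORDS = {"the", "of", "an", "are", "and"}
SYNONYMS = {"online": "multiplayer"}  # normalize synonyms to a single form

def stem(token):
    # crude soft stemming stand-in: "gaming" -> "game", "games" -> "game"
    if token.endswith("ing") and len(token) > 5:
        return token[:-3] + "e"
    if token.endswith("s") and len(token) > 3:
        return token[:-1]
    return token

def analyze(text):
    tokens = [t.strip(":,.").lower() for t in text.split()]
    return [stem(SYNONYMS.get(t, t)) for t in tokens if t and t not in STOPWORDS]

corpus = [
    "Video gaming: the history",
    "Video games are an economic business",
    "The new generation of PC and Console Video games",
    "Video games: multiplayer gaming",
]

def suggest(query):
    # the suggestion is the whole stored field content whose analyzed
    # form starts with the analyzed form of the query
    prefix = " ".join(analyze(query))
    return [t for t in corpus if " ".join(analyze(t)).startswith(prefix)]
```

With this sketch, `suggest("Video Games")` and `suggest("Video gam")` return the same three titles, and `suggest("Video games online ga")` matches the multiplayer title thanks to the synonym normalization, mirroring the table above.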
FuzzyLookupFactory
<lst name="suggester">
  <str name="name">FuzzySuggester</str>
  <str name="lookupImpl">FuzzyLookupFactory</str>
  <str name="dictionaryImpl">DocumentDictionaryFactory</str>
  <str name="field">title</str>
  <str name="weightField">price</str>
  <str name="suggestAnalyzerFieldType">text_en</str>
</lst>
Description | |
---|---|
Data Structure | FST |
Building | For each document, the stored content of the field is analyzed according to the suggestAnalyzerFieldType. The tokens produced are added to the index FST. |
Lookup strategy | The query is analyzed; each token produced is then expanded into all its variations within the configured max edit distance, according to the string distance function configured (the default is the Levenshtein distance [4]). The final tokens, variations included, are added to the query FST. An intersection happens between the index FST and the query FST. The suggestions are identified starting at the beginning of the field content. |
Suggestions returned | The entire content of the field. |
This suggester is quite powerful, as it provides suggestions matching the beginning of the field content, adding a fuzzy search on top of the analysis chain configured for the field.
In this way it is possible to provide suggestions that take into account synonyms, stop words, stemming and any other token filter used in the analysis, and to also support terms misspelled by the user.
It is an extension of the Analyzing lookup. IMPORTANT: remember the proper order of processing happening at query time:
- FIRST, the query is analysed, and tokens produced
- THEN, the tokens are expanded with the inflections based on the Edit distance and distance algorithm configured
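This two-step order can be illustrated with a toy sketch (plain Python, not the Lucene implementation). The distance function below is an optimal-string-alignment distance, which counts an adjacent transposition as a single edit, mirroring the suggester's default transpositions behaviour:

```python
def osa_distance(a, b):
    # optimal string alignment distance: insertions, deletions,
    # substitutions, plus adjacent transpositions counted as one edit
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

# "gmae" is one transposition away from "game", so it gets expanded and matches:
assert osa_distance("gmae", "game") == 1
# "gamign" expands to "gaming" (distance 1)...
assert osa_distance("gamign", "gaming") == 1
# ...but the index FST holds the stemmed "game", which is too far away:
assert osa_distance("gamign", "game") > 1
```

The last two assertions are exactly why the "Video gamign" example below fails: the analysis runs first, so the index side holds the stemmed token, while the expansion of the raw query token never reaches it.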
Let’s see some examples:
Query to autocomplete | Suggestions | Explanation |
---|---|---|
“Video gmaes” | (screenshot) | The input query is analyzed, and the tokens produced are: “video” “gmae”. Then the query FST is expanded with new states containing the variations of each token. For example, “game” is added to the query FST because it has a distance of 1 from the original token. The prefix matching then works, returning the expected suggestions. |
“Video gmaing” | (screenshot) | The input query is analyzed, and the tokens produced are: “video” “gma”. Then the query FST is expanded with new states containing the variations of each token. For example, “gam” is added to the query FST because it has a distance of 1 from the original token. So the prefix match applies. |
“Video gamign” | (screenshot) | This can seem odd at first, but it is coherent with the lookup implementation. The input query is analyzed, and the tokens produced are: “video” “gamign”. Then the query FST is expanded with new states containing the variations of each token. For example, “gaming” is added to the query FST because it has a distance of 1 from the original token. But no prefix match applies, because the index FST contains “game”, the stemmed token for “gaming”. |
AnalyzingInfixLookupFactory
<lst name="suggester">
  <str name="name">AnalyzingInfixSuggester</str>
  <str name="lookupImpl">AnalyzingInfixLookupFactory</str>
  <str name="dictionaryImpl">DocumentDictionaryFactory</str>
  <str name="field">title</str>
  <str name="weightField">price</str>
  <str name="suggestAnalyzerFieldType">text_en</str>
</lst>
Description | |
---|---|
Data Structure | Auxiliary Lucene Index |
Building | For each document, the stored content of the field is analyzed according to the suggestAnalyzerFieldType and then additionally EdgeNgram token filtered. Finally, an auxiliary index is built with those tokens. |
Lookup strategy | The query is analyzed according to the suggestAnalyzerFieldType. Then a phrase search is triggered against the auxiliary Lucene index. The suggestions are identified starting at the beginning of each token in the field content. |
Suggestions returned | The entire content of the field. |
This suggester is really common nowadays, as it provides suggestions matching the middle of the field content while taking advantage of the analysis chain configured for the field.
In this way it is possible to provide suggestions that take into account synonyms, stop words, stemming and any other token filter used in the analysis, and to match the suggestion on internal tokens. Let’s see some examples:
Query to autocomplete | Suggestions | Explanation |
---|---|---|
“gaming” | (screenshot) | The input query is analyzed, and the token produced is: “game”. In the auxiliary index, for each field content we have the EdgeNgram tokens: “v”, “vi”, “vid”…, “g”, “ga”, “gam”, “game”. So the match happens and the suggestions are returned. |
“ga” | (screenshot) | The input query is analyzed, and the token produced is: “ga”. In the auxiliary index, for each field content we have the EdgeNgram tokens: “v”, “vi”, “vid”…, “g”, “ga”, “gam”, “game”. So the match happens and the suggestions are returned. |
“game econ” | (screenshot) | Stop words will not appear in the auxiliary index. Both “game” and “econ” will, so the match applies. |
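The EdgeNgram mechanics above can be sketched as follows (a toy Python simulation, not the real implementation; note that every query token is matched as an edge n-gram here, while the real suggester treats only the last token as a prefix):

```python
STOPWORDS = {"the", "of", "an", "are", "and"}

def stem(token):
    # crude soft stemming stand-in: "gaming"/"games" -> "game"
    if token.endswith("ing") and len(token) > 5:
        return token[:-3] + "e"
    if token.endswith("s") and len(token) > 3:
        return token[:-1]
    return token

def analyze(text):
    return [stem(t.strip(":,.").lower()) for t in text.split()
            if t.strip(":,.").lower() not in STOPWORDS]

def edge_ngrams(token):
    # the leading-edge n-grams stored per token: "game" -> g, ga, gam, game
    return [token[:n] for n in range(1, len(token) + 1)]

def infix_suggest(query, titles):
    # a title is suggested if every analyzed query token is an
    # edge n-gram of some analyzed token of the title
    q_tokens = analyze(query)
    results = []
    for title in titles:
        grams = {g for tok in analyze(title) for g in edge_ngrams(tok)}
        if all(q in grams for q in q_tokens):
            results.append(title)
    return results

corpus = [
    "Video gaming: the history",
    "Video games are an economic business",
    "The new generation of PC and Console Video games",
    "Video games: multiplayer gaming",
]
```

With this sketch, `infix_suggest("ga", corpus)` matches all four titles, because the match no longer needs to start at the beginning of the field content.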
BlendedInfixLookupFactory
We are not going to describe the details of this lookup strategy, as it is pretty much the same as the AnalyzingInfix.
The only difference is in scoring the suggestions, weighting the matches by position: the score will be higher if a hit is closer to the start of the suggestion (or the opposite, depending on the configuration).
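As a sketch of the idea, the position-based discount can be expressed as below. The formulas are how I recall the linear and reciprocal blenders in Lucene; treat the exact coefficients as illustrative, not authoritative:

```python
def position_linear(weight, position):
    # linear blender sketch: earlier hits keep more of the original weight
    return weight * (1 - 0.10 * position)

def position_reciprocal(weight, position):
    # reciprocal blender sketch: a steeper discount for hits far from the start
    return weight / (position + 1)
```

A match on the very first token keeps the full weight, while a match a few tokens in is progressively discounted, which is what reorders otherwise identical infix matches.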
FSTLookupFactory
<lst name="suggester">
  <str name="name">FSTSuggester</str>
  <str name="lookupImpl">FSTLookupFactory</str>
  <str name="dictionaryImpl">DocumentDictionaryFactory</str>
  <str name="field">title</str>
</lst>
Description | |
---|---|
Data Structure | FST |
Building | For each document, the stored content is added to the index FST. |
Lookup strategy | The query is added to the query FST. An intersection happens between the index FST and the query FST. The suggestions are identified starting at the beginning of the field content. |
Suggestions returned | The entire content of the field. |
This suggester is quite simple, providing suggestions matching the beginning of the field content with an exact prefix match. Let’s see some examples:
Query to autocomplete | Suggestions | Explanation |
---|---|---|
“Video gam” | (screenshot) | The suggestions returned are simply the result of the prefix match. No surprises so far. |
“Video Games” | (screenshot) | The input query is not analyzed, and no field content in the documents starts with that exact prefix. |
“video gam” | (screenshot) | The input query is not analyzed, and no field content in the documents starts with that exact prefix. |
“game” | (screenshot) | This lookup strategy works only on the beginning of the field content, so no suggestion is returned. |
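The whole strategy boils down to a raw, analysis-free prefix match, which a tiny sketch makes obvious (plain Python mimicking the behaviour, not the FST internals):

```python
corpus = [
    "Video gaming: the history",
    "Video games are an economic business",
    "The new generation of PC and Console Video games",
    "Video games: multiplayer gaming",
]

def fst_suggest(query, titles):
    # no analysis at all: an exact, case-sensitive prefix match
    # against the stored field content from its very beginning
    return [t for t in titles if t.startswith(query)]
```

Here `fst_suggest("Video gam", corpus)` matches three titles, while `"video gam"`, `"Video Games"` and `"game"` match nothing, exactly as in the table above.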
For the following lookup strategy we are going to use a slightly modified corpus of documents:
[ { "id":"44", "title":"Video games: the history"},
  { "id":"11", "title":"Video games the historical background"},
  { "id":"55", "title":"Superman, hero of the modern time"},
  { "id":"33", "title":"the study of the hierarchical faceting"} ]
FreeTextLookupFactory
<lst name="suggester">
  <str name="name">FreeTextSuggester</str>
  <str name="lookupImpl">FreeTextLookupFactory</str>
  <str name="dictionaryImpl">DocumentDictionaryFactory</str>
  <str name="field">title</str>
  <str name="ngrams">3</str>
  <str name="separator"> </str>
  <str name="suggestFreeTextAnalyzerFieldType">text_general</str>
</lst>
Description | |
---|---|
Data Structure | FST |
Building | For each document, the stored content of the field is analyzed according to the suggestFreeTextAnalyzerFieldType. As a last step, a ShingleFilter is added to the analysis chain, with minShingle=2 and maxShingle equal to the configured ngrams (3 in our example). The final tokens produced are added to the index FST. |
Lookup strategy | The query is analyzed according to the suggestFreeTextAnalyzerFieldType. As a last step, a ShingleFilter is added to the analysis chain, with minShingle=2 and maxShingle equal to the configured ngrams. Only the last “ngrams” tokens of the query are evaluated to produce the lookup keys. |
Suggestions returned | ngram token suggestions |
This lookup strategy is completely different from the others seen so far: its main difference is that the suggestions are ngram tokens (and NOT the full content of the field).
We must take extra care in using this suggester, as it is quite easily prone to errors. Some guidelines:
- Don’t use a heavy analyzer; the suggested terms will come from the index, so be sure they are meaningful tokens. A really basic analyzer is suggested; stop words and stemming are not.
- Be sure you use a proper separator (‘ ‘ is suggested); the default one will show up encoded as “#30;” in the suggestions.
- The ngrams parameter sets the number of trailing query tokens to be considered for the lookup.
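The shingle production and the trailing-token lookup can be sketched like this (a toy Python illustration of ShingleFilter-style output and of how the last ngrams tokens form the lookup keys):

```python
def shingles(tokens, min_shingle=2, max_shingle=3):
    # ShingleFilter-style output: the unigrams plus all word n-grams
    # of size min_shingle..max_shingle, joined by the separator
    out = list(tokens)
    for n in range(min_shingle, max_shingle + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out

def freetext_keys(query_tokens, ngrams=3):
    # only the trailing `ngrams` tokens take part in the lookup,
    # from the longest shingle down to the last single token
    tail = query_tokens[-ngrams:]
    return [" ".join(tail[i:]) for i in range(len(tail))]
```

For instance, `freetext_keys(["games", "the", "h"])` produces the keys `"games the h"`, `"the h"` and `"h"`, which is exactly the token breakdown used in the examples.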
Let’s see some examples:
Query to autocomplete | Suggestions | Explanation |
---|---|---|
“video g” | (screenshot) | The input query is analyzed, and the tokens produced are: “video g”, “g”. The analysis was applied at building time as well, producing 2-3 shingles. “video g” matches by prefix 2 shingles from the index FST; “g” matches by prefix 1 shingle from the index FST. |
“games the h” | (screenshot) | The input query is analyzed, and the tokens produced are: “games the h”, “the h”, “h”. The analysis was applied at building time as well, producing 2-3 shingles. “games the h” matches by prefix 2 shingles from the index FST; “the h” matches by prefix 1 shingle; “h” matches by prefix 1 shingle. |
[2] Solr suggester
Alessandro Benedetti is the founder of Sease Ltd.
Senior Search Software Engineer, his focus is on R&D in information retrieval, information extraction, natural language processing, and machine learning.
hi there!
Thanks for your post – very helpful info that I couldn’t find elsewhere! I wonder if you can help me with a problem. I am trying to use the FreeTextLookupFactory lookup to provide suggestions that are part of the actual indexed content field. But I keep getting Solr errors like:
IllegalArgumentException: tokens must not contain separator byte
Would it be possible for you to provide an example field and type definition that can be used with this? Perhaps my field set up is incorrect. Thanks very much!
Hi Unknown, unfortunately I missed this comment!
Have you solved your problem?
What was the solution?
This kind of suggester actually does not use the field type, but the specific analysis you specify in the suggester configuration.
Be careful with the note about the separator; it was a tricky one!
Cheers
Hi, a very good analysis of the different suggesters. Can you please explain ‘context filtering’ in the AnalyzingInfixSuggester? Just curious about how the filtering happens in this case over the auxiliary Lucene index.
Hey Arsha,
thanks for the comment 🙂
For the context filtering, what happens is that we actually add the field we want to filter on to the auxiliary index data structure.
Then it is possible to configure a query and filter the results (the suggestions) by the content of that field.
I will take a note and add a deep analysis of the feature in the blog post 🙂
Thanks for the feedback !
http://jirasearch.mikemccandless.com/search.py?index=jira uses both the AnalyzingInfixSuggester “context” feature, to only show suggestions for the project you’ve drilled into, and its “payload” feature, to hold the metadata behind each suggestion.
Hi Alex, thanks for the detailed information on suggesters with examples. The Solr Suggester wiki is confusing and misleading – https://cwiki.apache.org/confluence/display/solr/Suggester. They should link to this page from that page.
About getting matches for “Video gamign” using the FuzzyLookupFactory: what if we applied analysis to the spell-corrected “gamign”, i.e. “gaming”, to get the stemmed tokens? This way we would get results.
Hi Shyamsunder,
you are correct, the context filtering is used in Michael portal 🙂
But what about the “payload” ?
Which metadata are you referring to ? I can see only the title in the suggestions ( but I just quickly played with it)
Cheers
Thank you very much Shyamsunder!
Much appreciated!
Hi Shyamsunder, you mean using an analyzer that performs spell correction ( dictionary based ? ) and then stemming ?
It could be possible.
First we define a TokenFilter that does the spell correction based on a dictionary ( it is actually a good idea, but I think it doesn't exist out of the box).
Then we can specify a stemming token filter, and the game is done.
This is actually a good idea, and can be potentially useful in a number of use cases:
https://issues.apache.org/jira/browse/SOLR-9429
You got it. Thanks for considering my idea.
Hi, in the case of the AnalyzingInfixSuggester, if the auxiliary index build is in progress (after “suggest.build=true”), will the suggestions keep working during this interval?
Great Post
We have the below example data:

solrId1 | New York | City | 1
solrId2 | New York | City | 1
solrId3 | New York | City | 1
solrId33 | New York | City | 1
solrId22 | New Manhatan | City | 1
solrId32 | New Manhatan | City | 1
solrId333 | New Manhatan | City | 1
solrId4 | New jersey | City | 1
solrId5 | New jersey | City | 1
solrId6 | newark | City | 1
I am able to implement the Suggester on top of the field “meta”, but I am not able to sort the suggestions based on the number of documents containing the suggested value.
Expected result for the query “New”: New York, New Manhatan, New jersey, Newark
Actual result for the query “New”: New Manhatan, New jersey, Newark, New York (not in the expected order)