Have you ever run a fuzzy query in Elasticsearch and not obtained the results you expected?
Let’s dive into how fuzziness works in Elasticsearch, when to use it, and what to watch out for.
WHAT ARE FUZZY QUERIES?
As you can read from the Elasticsearch documentation[1]:
Fuzzy queries are queries that return documents that contain terms close to the search term, as measured by a Levenshtein edit distance.
An edit distance is the number of one-character changes needed to turn one term into another. These changes can include:
- Changing a character (box → fox)
- Removing a character (black → lack)
- Inserting a character (sic → sick)
- Transposing two adjacent characters (act → cat)
If you type something like Elasticsearc, the Levenshtein distance from Elasticsearch is 1, since you just need to add the h character at the end to turn one term into the other.
If you type something like Elastacsearc, the Levenshtein distance from Elasticsearch is 2, since you need to make two changes to turn one term into the other: Elastacsearc → Elasticsearc → Elasticsearch.
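To check distances like these by hand, here is a minimal Python sketch of an edit distance that counts the four changes listed above (change, remove, insert, transpose) as one edit each. The function name edit_distance is my own, and this is only an illustration of the definition, not how Elasticsearch computes matches internally.

```python
def edit_distance(a: str, b: str) -> int:
    """Number of one-character changes needed to turn a into b.

    Counts substitutions, deletions, insertions, and transpositions of
    two adjacent characters as one edit each (illustrative sketch only).
    """
    rows, cols = len(a) + 1, len(b) + 1
    # dist[i][j] holds the distance between a[:i] and b[:j]
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete every character of a[:i]
    for j in range(cols):
        dist[0][j] = j  # insert every character of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # remove a character
                dist[i][j - 1] + 1,         # insert a character
                dist[i - 1][j - 1] + cost,  # change a character (or keep it)
            )
            # transpose two adjacent characters
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[-1][-1]


print(edit_distance("Elasticsearc", "Elasticsearch"))  # 1
print(edit_distance("Elastacsearc", "Elasticsearch"))  # 2
print(edit_distance("act", "cat"))                     # 1
```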
How to Perform a Fuzzy Query in Elasticsearch
There are three possible ways of doing a fuzzy query:
- With the fuzzy clause in the query Domain Specific Language (DSL) of Elasticsearch [1];
- With the fuzziness parameter, where supported by the query API you are using [2];
- Using the ~ operator in the query_string query [3].
Let’s see some examples!
Suppose we indexed a book collection in Elasticsearch, and suppose we would like to retrieve all the documents with an author name equal to “thomas” but the user wrongly typed “toma”.
1. Perform a fuzzy query with the fuzzy clause
GET /_search
{
  "query": {
    "fuzzy": {
      "name": {
        "value": "toma"
      }
    }
  }
}
Here, to find similar terms, the fuzzy query creates a set of all possible variations, or expansions, of the search term within a specified edit distance. The query then returns exact matches for each expansion [1].
In this example:
- name is the field to search on;
- value is the term to find in the provided field.
There are several other parameters for the fuzzy query. If you want to go into detail, you can read about them in the documentation. Here I would like to talk to you about the fuzziness and max_expansions parameters.
The fuzziness parameter in Elasticsearch
fuzziness is the maximum edit distance allowed for matching [2], that is, our maximum Levenshtein distance. It can take two kinds of values:
- 0, 1, 2: the number of edits (Levenshtein distance);
- AUTO: generates an edit distance based on the length of the term.
With AUTO, low and high distance arguments may optionally be provided as AUTO:[low],[high].
If not specified, the default values are 3 and 6, equivalent to AUTO:3,6, which makes short words (length <= 2) match exactly, medium-length words (3 <= length <= 5) accept one edit, and long words (length > 5) accept two edits.
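The length rule for AUTO can be written down in a few lines. The following Python sketch mirrors the thresholds described above; the function name auto_fuzziness is hypothetical and exists only for illustration.

```python
def auto_fuzziness(term: str, low: int = 3, high: int = 6) -> int:
    """Edits allowed for a term under AUTO:low,high (illustrative sketch).

    With the defaults (3, 6): length 0..2 -> 0 edits,
    length 3..5 -> 1 edit, length 6 and above -> 2 edits.
    """
    length = len(term)
    if length < low:
        return 0  # short terms must match exactly
    if length < high:
        return 1  # medium-length terms accept one edit
    return 2      # long terms accept two edits


print(auto_fuzziness("to"))      # 0
print(auto_fuzziness("toma"))    # 1
print(auto_fuzziness("thomas"))  # 2
```

Note that the allowed distance depends on the length of the term the user typed: a four-character input such as toma gets one edit under the defaults, even though the term it is meant to match may be longer.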
The max_expansions parameter in Elasticsearch
max_expansions is the maximum number of variations created. Defaults to 50. Pay attention to the value you choose here since a high value in max_expansions can cause poor performance due to the high number of variations examined.
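To get a feel for why the number of variations matters, here is a toy Python model of the process described above: generate every string one edit away from the search term, keep those that exist in the index, and cap the result at max_expansions. The names variations and expand, the lowercase ASCII alphabet, and the alphabetical truncation order are all assumptions of this sketch; the real engine is more sophisticated than literally enumerating strings.

```python
import string


def variations(term: str, alphabet: str = string.ascii_lowercase) -> set:
    """All strings exactly one edit away from term (toy model)."""
    splits = [(term[:i], term[i:]) for i in range(len(term) + 1)]
    deletes = {left + right[1:] for left, right in splits if right}
    transposes = {left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1}
    replaces = {left + ch + right[1:]
                for left, right in splits if right for ch in alphabet}
    inserts = {left + ch + right for left, right in splits for ch in alphabet}
    return (deletes | transposes | replaces | inserts) - {term}


def expand(term: str, indexed_terms, max_expansions: int = 50) -> list:
    """Indexed terms within one edit of term, capped at max_expansions.

    The alphabetical ordering used for the cap is an assumption of this
    sketch, not a statement about how Elasticsearch picks expansions.
    """
    candidates = variations(term) | {term}
    matches = sorted(candidates & set(indexed_terms))
    return matches[:max_expansions]


indexed = {"tom", "tomas", "thomas", "roma", "toma"}
print(len(variations("toma")))  # a few hundred candidates for one short term
print(expand("toma", indexed))  # ['roma', 'tom', 'toma', 'tomas']
```

Even with a 26-letter alphabet and a single edit, one short term produces a few hundred candidate strings, which is why a large max_expansions value can hurt query performance.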
2. Perform a fuzzy query with the fuzziness parameter
GET /_search
{
  "query": {
    "match": {
      "text": {
        "query": "toma",
        "fuzziness": "2"
      }
    }
  }
}
Here we are asking for all the documents matching terms that have at most a Levenshtein distance of 2 from toma.
The maximum allowed distance is defined in the fuzziness parameter.
The fuzziness parameter is the same as defined in point 1 but used in a different query type.
3. Perform a fuzzy query using the ~ operator
GET /_search
{
  "query": {
    "query_string": {
      "query": "toma~",
      "fields": ["author"]
    }
  }
}
The main thing to keep in mind with this type of query is that the query_string passed is normalized.
What is normalization?
Normalization is a sort of text analysis that supports only per-character filters. This is because no tokenizer is used, only a single token is emitted, and therefore filters that need to look at the keyword as a whole cannot be used. Here you can find the list of the currently supported filters:
arabic_normalization, asciifolding, bengali_normalization, cjk_width, decimal_digit, elision, german_normalization, hindi_normalization, indic_normalization, lowercase, pattern_replace, persian_normalization, scandinavian_folding, serbian_normalization, sorani_normalization, trim, uppercase.
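As a rough illustration of what per-character normalization does, the Python sketch below applies three steps in the spirit of trim, lowercase, and asciifolding to a whole term, without splitting it into tokens. The normalize function is hypothetical, and Unicode decomposition is only an approximation of the asciifolding filter; which filters actually run depends on the analyzer configured for the field.

```python
import unicodedata


def normalize(term: str) -> str:
    """Toy normalizer: trim, lowercase, and strip accents.

    The input is treated as a single token; no tokenizer is involved.
    """
    term = term.strip().lower()
    decomposed = unicodedata.normalize("NFKD", term)
    # Drop combining marks left over after decomposition (e.g. the accent in "á")
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(normalize("  Tomá "))      # toma
print(normalize("Thomas Mann"))  # thomas mann (still one token, not split)
```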
After normalization, the query behaviour follows the same process as points 1 and 2. A set of all possible variations, or expansions, of the search term within a specified edit distance is created and the exact matches for each expansion are returned from the query.
The default edit distance is 2, but a different edit distance can be specified as:
quikc~1
Best Practices: What to Pay Attention to with Fuzzy Queries in Elasticsearch
From what we have learned so far, there are four main things to pay attention to:
PERFORMANCE
When using fuzzy queries, a large set of variations/expansions is created and the search considers matches with all of them. This is a very expensive process that can lead to poor performance at query time. You can read more about this in this Elasticsearch blog post.
NORMALIZATION
When using the ~ operator in the query_string clause, normalization is applied. Be aware that this modifies the way text analysis is executed, since only specific filters are applied. This also changes the final query construction.
Let’s see an example [4]. Suppose you execute a query like:
{
  "query": {
    "query_string": {
      "query": "dog toma~",
      "fields": [
        "title",
        "author"
      ],
      "default_operator": "and"
    }
  }
}
When using fuzziness, the query text is split on whitespace. Then, for each term, a mandatory clause is built (the AND operator states that each term MUST appear in the document). You can call this a “term-centric” query. Each term is then searched across the given fields with a disjunction (|) clause. You therefore see:
"explanation": "+(title:dog | author:dog) +(title:toma~1 | author:toma~1)"
“dog MUST be in title OR author, AND toma (with variations within the Levenshtein distance) MUST be in title OR author”
If no fuzziness is used, the query is passed to each field and then the terms will follow the corresponding field text analysis (if any). You may call this a “field-centric” query. Therefore you see:
"explanation": "(title:dog toma | author:dog toma)"
“dog toma MUST be in title OR author”
You are effectively moving from an “I want all the terms to appear in the document” (it could be in different fields) to “I want all terms to appear in a single field” [8].
WILDCARDS
Avoid mixing fuzziness and wildcards [5]. This is not supported in Elasticsearch.
CHOOSE THE BEST FUZZY OPTION
From [6], a match query with the fuzziness parameter set is perhaps the most versatile of the fuzzy queries. The fuzzy query type supports the same behaviour, except that it does not allow for any analysis of the query text. Since the fuzzy query type offers only a subset of the functionality of a match query, it is more confusing than useful.
Fuzzy Search Alternatives
Sometimes, fuzzy search is not the best solution for imprecise matches [6]. There are good alternatives that can be used depending on your needs:
- Using a phonetic analysis plugin can help in finding words that sound similar to other words.
- Using N-grams can speed up the query process while giving similarly good results, since a search just needs to have a plurality of matches of sub-parts of a given term [7]. N-grams do, however, come at the cost of additional storage/memory usage, slightly more index-time processing, and a long tail of false positives after good matches [6].
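To make the N-gram idea concrete, here is a small Python sketch that splits terms into character bigrams and ranks candidates by how many bigrams they share with the query. The function names, the bigram size, and the min_shared threshold are assumptions chosen for illustration; in Elasticsearch this work would be done by an n-gram tokenizer or token filter at index time.

```python
def ngrams(term: str, n: int = 2) -> set:
    """Set of character n-grams of a term, e.g. 'toma' -> {'to', 'om', 'ma'}."""
    return {term[i:i + n] for i in range(len(term) - n + 1)}


def rank(query: str, terms, n: int = 2, min_shared: int = 2) -> list:
    """Terms sharing at least min_shared n-grams with the query, best first."""
    query_grams = ngrams(query, n)
    scored = [(term, len(query_grams & ngrams(term, n))) for term in terms]
    hits = [(term, shared) for term, shared in scored if shared >= min_shared]
    # Most shared n-grams first, ties broken alphabetically
    return sorted(hits, key=lambda pair: (-pair[1], pair[0]))


print(rank("toma", ["thomas", "tomato", "roma", "dog"]))
# [('tomato', 3), ('roma', 2), ('thomas', 2)]
```

The output also shows the false-positive tail mentioned above: tomato and roma share as many or more bigrams with toma as thomas does, so in this toy ranking they score at least as high as the intended match.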





