Find and Replace in Elasticsearch Fields

Hi Elasticsearch users,

In this short blog post, we explore a common situation that many of us encounter when working with Elasticsearch indexes.
Suppose we have an index, which for this example we call ‘cartoons’. In this index, there is a field called ‘title’ that contains names of cartoons such as ‘Tom & Jerry’, ‘Mila & Shiro’, and so on.
If the goal is to change the symbol ‘&’ into ‘and’, thus obtaining for example ‘Tom and Jerry’, how can this find-and-replace task be tackled in Elasticsearch?

The good news is that Elasticsearch offers a painless way to update documents.
You can use the Update By Query API, which allows you to update multiple documents based on a specific query. It is an efficient tool for those needing to make bulk changes to field values in their Elasticsearch index.

Here is a quick example to show you how this can be done:

Create the Index

We create an index called ‘cartoons’, in which we define how documents should be stored and indexed. Here, we disable dynamic mapping and set up a field called ‘title’ of type ‘text’:

				
PUT /cartoons
{
  "mappings": {
    "dynamic": false,
    "properties": {
      "title": { "type": "text" }
    }
  }
}
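To double-check that the index was created as intended, the mapping can be read back. This is only a small sanity check, not part of the original walkthrough:

```console
GET /cartoons/_mapping
```

The response should list ‘title’ as a ‘text’ field and show "dynamic" set to false. With this setting, any field that is not listed under "properties" is still kept in the _source of the document, but it is not indexed and therefore cannot be searched.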
				
			
Index Some Documents

This command allows multiple documents to be indexed in a single request to the ‘cartoons’ index. Each document is composed of a unique ID and the title:

				
POST /cartoons/_bulk
{ "index" : { "_id" : "1" } }
{ "title" : "Tom & Jerry" }
{ "index" : { "_id" : "2" } }
{ "title" : "Mila & Shiro" }
{ "index" : { "_id" : "3" } }
{ "title" : "Hansel & Gretel" }
				
			

For simplicity, we index only three documents.
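Before moving on, it can be useful to confirm that the bulk request indexed what we expect. A minimal sketch, assuming the index contains only the three documents above: refresh the index so that the new documents become visible, then count them.

```console
POST /cartoons/_refresh

GET /cartoons/_count
```

If everything went well, the count should be 3. The bulk response itself also carries an "errors" flag, which is worth checking because a bulk request can fail for some documents and succeed for others.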

Find and Replace

This is the command to update documents in the ‘cartoons’ index using the Update By Query API:

				
POST /cartoons/_update_by_query
{
    "script":
    {
        "lang": "painless",
        "source": "ctx._source.title = ctx._source.title.replace('&', 'and')"
    }
}
				
			

In this case, no query is provided, so the update is applied to every document in the index.
Update By Query accepts a "script" that modifies the document source; here the script is written in the Painless scripting language and replaces ‘&’ with ‘and’ in the title of each cartoon.
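On a real index, some documents may have no title at all, or a title that does not contain ‘&’. The variant below is a sketch of one way to handle that, not the only one: an exists query skips documents without the field, the search and replacement strings are passed as script parameters (the names ‘find’ and ‘replacement’ are our own choice), and documents that need no change are marked as a no-op so they are not rewritten.

```console
POST /cartoons/_update_by_query
{
  "query": {
    "exists": { "field": "title" }
  },
  "script": {
    "lang": "painless",
    "source": "if (ctx._source.title instanceof String && ctx._source.title.contains(params.find)) { ctx._source.title = ctx._source.title.replace(params.find, params.replacement); } else { ctx.op = 'noop'; }",
    "params": {
      "find": "&",
      "replacement": "and"
    }
  }
}
```

Documents skipped this way are reported under "noops" in the response, while the ones actually changed are counted under "updated". Keeping the strings in "params" also means the same script source can be reused for a different replacement without editing it.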

When dealing with large indexes or long-running updates, it is better to add the wait_for_completion=false parameter to the _update_by_query request (the request body stays the same). This avoids the risk of a connection timeout and lets the operation run asynchronously:

				
POST /cartoons/_update_by_query?wait_for_completion=false
				
			

This way, the update operation continues to run in the background, and the response includes a task ID that you can use to monitor the status of the operation through the Tasks API.
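As a sketch of what the monitoring can look like: the task ID used below is a made-up placeholder, so replace it with the value returned in the "task" field of your own response.

```console
GET /_tasks/oTUltX4IQMOUUVeiohTt8A:12345

GET /_tasks?detailed=true&actions=*byquery

POST /_tasks/oTUltX4IQMOUUVeiohTt8A:12345/_cancel
```

The first request returns the status of that single task, the second lists all running "by query" operations when the ID is not at hand, and the third cancels the task if it needs to be stopped.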

Queries

To show the result of the update operation, we run a match_all query, which matches and retrieves all documents:

				
GET cartoons/_search
{
  "query": {
    "match_all": {}
  }
}
				
			

This is the response:

				
{
  ...
  "hits": {
    "total": {
      "value": 3,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "cartoons",
        "_id": "1",
        "_score": 1,
        "_source": {
          "title": "Tom and Jerry"
        }
      },
      {
        "_index": "cartoons",
        "_id": "2",
        "_score": 1,
        "_source": {
          "title": "Mila and Shiro"
        }
      },
      {
        "_index": "cartoons",
        "_id": "3",
        "_score": 1,
        "_source": {
          "title": "Hansel and Gretel"
        }
      }
    ]
  }
}
				
			

The response confirms that the update performed through the _update_by_query API worked as expected: in every title of the ‘cartoons’ index, the ‘&’ symbol has been replaced with ‘and’.

The match_all query makes it clear that the _source was changed, but was the re-indexing done automatically?
Yes, the Update By Query API both modifies the source of the documents and reindexes them. Once the index has been refreshed after the update, subsequent searches reflect these changes. So, if you update a document’s field from an old value to a new one, searching for the new value should return the updated document.
Conversely, if you search for the previous text (the one that was present before the update), you will not find the updated documents, because the old text no longer exists in the indexed data of these documents.

To check this, we can run a search using a match_phrase query:

				
GET cartoons/_search
{
  "query": {
    "match_phrase": {
      "title": "Tom and Jerry"
    }
  }
}
				
			

If we search for “Tom and Jerry”, we get one result:

				
{
    ...
    "hits": {
        "total": {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 2.0951898,
        "hits": [
            {
                "_index": "cartoons",
                "_id": "1",
                "_score": 2.0951898,
                "_source": {
                    "title": "Tom and Jerry"
                }
            }
        ]
    }
}
				
			

If we search for “Tom & Jerry”, no results are returned:

				
GET cartoons/_search
{
  "query": {
    "match_phrase": {
      "title": "Tom & Jerry"
    }
  }
}
				
			

This is the response: 

				
{
    ...
    "hits": {
        "total": {
            "value": 0,
            "relation": "eq"
        },
        "max_score": null,
        "hits": []
    }
}
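To see why the old phrase no longer matches, it can help to look at the tokens that the ‘title’ field produces. The Analyze API below is a small sketch for doing that; it only inspects the analysis and does not change any data.

```console
GET /cartoons/_analyze
{
  "field": "title",
  "text": "Tom & Jerry"
}

GET /cartoons/_analyze
{
  "field": "title",
  "text": "Tom and Jerry"
}
```

With the default standard analyzer, which this mapping uses because no analyzer is configured, we would expect ‘&’ to be dropped, so that the first text yields ‘tom’ and ‘jerry’ as adjacent tokens, while the second yields ‘tom’, ‘and’ and ‘jerry’. If that holds on your cluster, the phrase “Tom & Jerry” fails to match because ‘and’ now sits between the two words in the updated documents.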
				
			

This process keeps Elasticsearch’s search results in sync with the current state of your data.
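One nuance: search in Elasticsearch is near real-time, so the updated documents become visible once the affected shards have been refreshed, which by default typically happens within about a second. If the very next search must already see the changes, for example in a script or a test, one option is to ask Update By Query to refresh when it finishes. The request below is the same one used earlier, with only the refresh parameter added:

```console
POST /cartoons/_update_by_query?refresh=true
{
  "script": {
    "lang": "painless",
    "source": "ctx._source.title = ctx._source.title.replace('&', 'and')"
  }
}
```

On large indexes, forcing a refresh has a cost, so it is usually better kept for cases where immediate visibility really matters.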

Conclusions

To sum up, we hope this short blog post has helped you understand how the Update By Query API can be a powerful and efficient solution for modifying multiple documents within Elasticsearch.

See you in the next pill!

We are Sease, an Information Retrieval Company based in London, focused on providing R&D project guidance and implementation, Search consulting services, Training, and Search solutions using open source software like Apache Lucene/Solr, Elasticsearch, OpenSearch and Vespa.
