JSON Tips And Tricks

Efficiently Manage Numeric Ids in JSON and Pandas

It is very common when manipulating data, to deal with ids. They are very useful for identifying objects univocally. Usually, they are represented as a sequence of digits.

Suppose now to manage a JSON file containing several objects, and suppose these objects to be represented by a set of features as key-value pairs. Some of the features make use of numerical ids like the objectId and the objectCategoriesId.

Here is an example of our JSON object, where the object represents a book:

{
"bookId":"527054",
"author":"Dan Brown",
"title":"The Da Vinci Code",
"numberPages":300,
"publishingYear":2003,
"totalReviews":3514,
"avgReviews":4.5,
"price":20,
"bookCategories":["237", "451", "110", "745"]
"bookDiscount":0,
"bookSales":1048
}

One question that naturally arises when dealing with large amount of data (many of these objects) is: which data type is better to use to manage ids?

We indeed know that the chosen data type can make a big difference in terms of the performance of the pipeline that manages these data.

If we look at ids’ meaning, the most correct way to store them should be as Strings.
Ids, indeed, are not representing real numbers with mathematical operations and ordering, but they are labels usually representing real-world objects.
As we all know, however, Strings are expensive to store in memory and to process in time. Therefore, using Integers comes naturally in order to improve performance.

How much effort does it require to process and store ids as Strings? How much is the gain in managing ids as Integers?

I made some experiments to measure the difference between these two types of ids management.

First of all, I created an input file containing a single JSON object where each id feature is passed as a String.
Then, I created an identical input file with the only difference that the id features’ values are passed as Integers.

I execute the same pipeline (that processes these data) twice, giving in input the String file before and then the Integer version.

I then compare the two experiments in terms of both memory and time.

STRINGS

For the first experiment, I used an input file where the id features are passed as Strings. Here is an example of the content:

{
"bookId":"527054",
"author":"Dan Brown",
"title":"The Da Vinci Code",
"numberPages":300,
"publishingYear":2003,
"totalReviews":3514,
"avgReviews":4.5,
"price":20,
"bookCategoriesIds":["237", "451", "110", "745"]
"bookDiscount":0,
"bookSales":1048
}

In this file, there is a single element representing a book. This element has two id features: bookId and bookCategoriesIds. Here they are Strings.

INTEGERS

For the second experiment, I used an input file where the id features are passed as Integers. Here is an example of the content:

{
"bookId":527054,
"author":"Dan Brown",
"title":"The Da Vinci Code",
"numberPages":300,
"publishingYear":2003,
"totalReviews":3514,
"avgReviews":4.5,
"price":20,
"bookCategories":[237, 451, 110, 745]
"bookDiscount":0,
"bookSales":1048
}

In this file, I have the same book element as before, with the only difference being that we are using Integers ids.

RESULTS

As mentioned before, I executed the same pipeline twice. Once with id features as Strings and once with id features as Integers.

I started with a single JSON object in the input file and repeated the pipeline 10 time to average the performance on all the iterations.

I then repeated the same evaluation again, increasing the number of JSON objects in the input file.

Here are the results of the first analysis. Here I use a single JSON object and repeat the pipeline 10 times. On the left of the table, the results use Integers ids, on the right the results use Strings ids.

Here we can see that using Strings requires more disk memory with respect to Integers, while the RAM and the total time required are very similar in both strategies.

The same process has been done for a JSON file containing 11 books:

Here we can see that using Strings requires more disk memory and time with respect to Integers, while the RAM remains very similar in both the strategies.

Again with 12020 books:

Here we can see that the performance gap is starting to get bigger. Integers start to be better in both memory and time consumption.

And finally with 24040 books:

The previous results are here confirmed, with better performance from Integers with respect to Strings.

Summary

Collecting the average results from the previous tables. Here are the results for Integers:

Number of books Memory RAM peak MiB Time ms
1 3.0 KB 19.10 0.031
11 32.6 KB 19.49 0.119
12020 35.6 MB 279.41 100.198
24040 71.1 MB 539.64 197.410

And here for Strings:

Number of books Memory RAM peak MiB Time ms
1 3.8 KB 19.16 0.028
11 41.5 KB 19.46 0.121
12020 45.3 MB 334.79 103.213
24040 90.5 MB 650.50 202.945

From the most expensive elaboration, we can see that using Integers is better in terms of both memory and time.

As a final consideration be aware that using numbers for ids when parsing a JSON is not a silver bullet and it has its own difficulties, especially with typing: make sure you know the boundaries of your Ids so that you can choose the correct numerical type (float, integer, long…).
When dealing with numeric ids make also sure you are not sorting them or doing math operations when it’s not intended (they use numeric characters but they are not numbers after all!)

// our service

Shameless plug for our training and services!

Did I mention we do Learning To Rank and Artificial Intelligence in Search training?
We also provide consulting on these topics, get in touch if you want to bring your search engine to the next level with the power of AI!

// STAY ALWAYS UP TO DATE

Subscribe to our newsletter

Did you like this post about Efficiently Manage Numeric Ids in JSON and Pandas? Don’t forget to subscribe to our Newsletter to stay always updated from the Information Retrieval world!

Author

Anna Ruggero

Anna Ruggero is a software engineer passionate about Information Retrieval and Data Mining. She loves to find new solutions to problems, suggesting and testing new ideas, especially those that concern the integration of machine learning techniques into information retrieval systems.

Leave a comment

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.