JSON, Tips And Tricks

Efficiently Manage Numeric Ids in JSON and Pandas

It is very common when manipulating data, to deal with ids. They are very useful for identifying objects univocally. Usually, they are represented as a sequence of digits.

Suppose now to manage a JSON file containing several objects, and suppose these objects to be represented by a set of features as key-value pairs. Some of the features make use of numerical ids like the objectId and the objectCategoriesId.

Here is an example of our JSON object, where the object represents a book:

				
					{
"bookId":<strong>"527054"</strong>,
"author":"Dan Brown",
"title":"The Da Vinci Code",
"numberPages":300,
"publishingYear":2003,
"totalReviews":3514,
"avgReviews":4.5,
"price":20,
"bookCategories":[<strong>"237"</strong>, <strong>"451"</strong>, <strong>"110"</strong>, <strong>"745"</strong>]
"bookDiscount":0,
"bookSales":1048
}

One question that naturally arises when dealing with large amounts of data (many of these objects) is: which data type is better to use to manage ids?

We indeed know that the chosen data type can make a big difference in terms of the performance of the pipeline that manages these data.

If we look at ids’ meanings, the most correct way to store them should be as Strings.
Ids, indeed, do not represent real numbers with mathematical operations and order, but they are labels usually representing real-world objects.
As we all know, however, Strings are expensive to store in memory and to process in time. Therefore, using Integers comes naturally to improve performance.

How much effort does it require to process and store ids as Strings? How much is the gain in managing ids as Integers?

I made some experiments to measure the difference between these two types of ids management.

First of all, I created an input file containing a single JSON object where each id feature is passed as a String.
Then, I created an identical input file with the only difference that the id features’ values are passed as Integers.

I execute the same pipeline (that processes these data) twice, giving in input the String file before and then the Integer version.

I then compare the two experiments in terms of both memory and time.

STRINGS

For the first experiment, I used an input file where the id features are passed as Strings. Here is an example of the content:

				
					{
"bookId":<strong>"527054"</strong>,
"author":"Dan Brown",
"title":"The Da Vinci Code",
"numberPages":300,
"publishingYear":2003,
"totalReviews":3514,
"avgReviews":4.5,
"price":20,
"bookCategoriesIds":[<strong>"237"</strong>, <strong>"451"</strong>, <strong>"110"</strong>, <strong>"745"</strong>]
"bookDiscount":0,
"bookSales":1048
}

In this file, there is a single element representing a book. This element has two id features: bookId and bookCategoriesIds. Here they are Strings.

INTEGERS

For the second experiment, I used an input file where the id features are passed as Integers. Here is an example of the content:

				
					{
"bookId":<strong>527054</strong>,
"author":"Dan Brown",
"title":"The Da Vinci Code",
"numberPages":300,
"publishingYear":2003,
"totalReviews":3514,
"avgReviews":4.5,
"price":20,
"bookCategories":[<strong>237</strong>, <strong>451</strong>,<strong> 110</strong>,<strong> 745</strong>]
"bookDiscount":0,
"bookSales":1048
}

In this file, I have the same book element as before, with the only difference being that we are using Integers ids.

RESULTS

As mentioned before, I executed the same pipeline twice. Once with id features as Strings and once with id features as Integers.

I started with a single JSON object in the input file and repeated the pipeline 10 times to average the performance on all the iterations.

I then repeated the same evaluation, increasing the number of JSON objects in the input file.

Here are the results of the first analysis. Here I use a single JSON object and repeat the pipeline 10 times. On the left of the table, the results use Integers ids, on the right the results use Strings ids.

Here we can see that using Strings requires more disk memory with respect to Integers, while the RAM and the total time required are very similar in both strategies.

The same process has been done for a JSON file containing 11 books:

Here we can see that using Strings requires more disk memory and time with respect to Integers, while the RAM remains very similar in both strategies.

Again with 12020 books:

Here we can see that the performance gap is starting to get bigger. Integers start to be better in both memory and time consumption.

And finally with 24040 books:

The previous results are here confirmed, with better performance from Integers with respect to Strings.

Summary

Collecting the average results from the previous tables. Here are the results for Integers:

Number of books	Memory	RAM peak MiB	Time ms
1	3.0 KB	19.10	0.031
11	32.6 KB	19.49	0.119
12020	35.6 MB	279.41	100.198
24040	71.1 MB	539.64	197.410

And here for Strings:

Number of books	Memory	RAM peak MiB	Time ms
1	3.8 KB	19.16	0.028
11	41.5 KB	19.46	0.121
12020	45.3 MB	334.79	103.213
24040	90.5 MB	650.50	202.945

From the most expensive elaboration, we can see that using Integers is better in terms of both memory and time.

As a final consideration be aware that using numbers for ids when parsing a JSON is not a silver bullet and it has its difficulties, especially with typing: make sure you know the boundaries of your Ids so that you can choose the correct numerical type (float, integer, long…).
When dealing with numeric ids make also sure you are not sorting them or doing math operations when it’s not intended (they use numeric characters but they are not numbers after all!)

Need Help With This Topic?

If you’re struggling with Numeric Ids, don’t worry – we’re here to help! Our team offers expert services and training to help you optimize your search engine and get the most out of your system. Contact us today to learn more!

Need Help with this topic?

If you're struggling with Numeric Ids, don't worry - we're here to help! Our team offers expert services and training to help you optimize your search engine and get the most out of your system. Contact us today to learn more!

Click Here

data preparation, efficiency, json, performance

Sign up for our Newsletter

Did you like this post? Don’t forget to subscribe to our Newsletter to stay always updated in the Information Retrieval world!

About the company

about our work

Rated Ranking Evaluator
(RRE)

Rated Ranking Evaluator Enterprise (RREE)

Apache Solr LLM Highlighter plugin

News

Main Blog

TIPS AND TRICKS

LATEST BLOG POST

contact us

Don't miss all the news - subscribe to our newsletter!

Efficiently Manage Numeric Ids in JSON and Pandas

How much effort does it require to process and store ids as Strings? How much is the gain in managing ids as Integers?

STRINGS

INTEGERS

RESULTS

Summary

Need Help With This Topic?

Need Help with this topic?

Other posts you may find useful

When and How to Use N-grams in Elasticsearch

The Missing Piece: ID Discovery in RRE Enterprise

DocValues VS Stored Fields: Apache Solr Features and Performance SmackDown

Anna Ruggero

Anna Ruggero

Follow Us

Top Categories

Recent Posts

Retrieval and Responsibility: The Ethics of Augmented Knowledge

Faster Vector Search: Early Termination Strategy Now in Apache Solr

OpenSearch and Large Language Models

Monthly video

Sign up for our Newsletter

Leave a Reply Cancel reply

Quick Links

Services

Subscribe

About the company

about our work

Rated Ranking Evaluator (RRE)

Rated Ranking Evaluator Enterprise (RREE)

Apache Solr LLM Highlighter plugin

News

Main Blog

TIPS AND TRICKS

LATEST BLOG POST

contact us

Don't miss all the news - subscribe to our newsletter!

Efficiently Manage Numeric Ids in JSON and Pandas

How much effort does it require to process and store ids as Strings? How much is the gain in managing ids as Integers?

STRINGS

INTEGERS

RESULTS

Summary

Need Help With This Topic?​​

Need Help with this topic?​

Other posts you may find useful

When and How to Use N-grams in Elasticsearch

The Missing Piece: ID Discovery in RRE Enterprise

DocValues VS Stored Fields: Apache Solr Features and Performance SmackDown

Anna Ruggero

Anna Ruggero

Follow Us

Top Categories

Recent Posts

Retrieval and Responsibility: The Ethics of Augmented Knowledge

Faster Vector Search: Early Termination Strategy Now in Apache Solr

OpenSearch and Large Language Models

Monthly video

Sign up for our Newsletter

Leave a Reply Cancel reply

Rated Ranking Evaluator
(RRE)

Need Help With This Topic?

Need Help with this topic?