JSON, Tips And Tricks

How to manage large JSON efficiently and quickly: multiple files

If you’ve read the first blog post, you have already learned some tips and tricks on how to handle a large JSON file in Python. In this post, I want to focus on how to work efficiently with multiple JSON files.

As already suggested, it is better to read a JSON file via Pandas, using the read_json() method and passing the chunksize parameter (which requires lines=True), so that only a limited number of rows is loaded and manipulated at a time. With chunksize set, the method does not return a DataFrame but a JsonReader object to iterate over. To access the file contents and create a Pandas data frame, you can use:

1) pandas.concat

				
import pandas as pd

interactions_data_frames = []
for interactions_input_file in json_files:
    # pd.concat consumes the JsonReader, yielding one data frame per file
    interactions_temp = pd.concat(pd.read_json(interactions_input_folder + '/' + interactions_input_file,
                                               orient='records', lines=True, chunksize=chunk_value))
    interactions_data_frames.append(interactions_temp)

# Merge the per-file data frames into a single data frame
interactions = pd.concat(interactions_data_frames, ignore_index=True, sort=True)

2) For loop

				
import pandas as pd

interactions_data_frames = []
for interactions_input_file in json_files:
    # With chunksize set, read_json returns a JsonReader rather than a data frame
    interactions_temp = pd.read_json(interactions_input_folder + '/' + interactions_input_file,
                                     orient='records', lines=True, chunksize=chunk_value)
    # Each chunk is a small data frame
    for chunk in interactions_temp:
        interactions_data_frames.append(chunk)

# Merge all the chunks into a single data frame
interactions = pd.concat(interactions_data_frames, ignore_index=True, sort=True)

CODE EXPLANATION

- In both cases, we created an empty list named interactions_data_frames

- In both cases, we iterated over json_files, the list containing all the JSON files

- In example 1), interactions_temp is a Pandas DataFrame: pd.concat consumes the JsonReader, so the concatenation only takes place once the entire file has been read. We then append each of these data frames (one per file) to the list

- In example 2), interactions_temp is a JsonReader object; by iterating over it, we obtain smaller data frames (one per chunk) and append each of them to the list

- In both cases, we created a single data frame (interactions) by concatenating the data frames in interactions_data_frames
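
To make the comparison concrete, here is a minimal, self-contained sketch that writes two tiny JSON Lines files to a temporary folder and loads them with both approaches. The file names, the sample records, the helper function names, and the chunk size of 2 are made up for illustration; only the read_json() arguments come from the two examples.

```python
import json
import os
import tempfile

import pandas as pd

# Hypothetical sample data: two small JSON Lines files in a temporary folder.
input_folder = tempfile.mkdtemp()
sample_files = {
    "part-1.log": [{"user": 1, "clicks": 3}, {"user": 2, "clicks": 0}, {"user": 3, "clicks": 5}],
    "part-2.log": [{"user": 4, "clicks": 1}, {"user": 5, "clicks": 2}],
}
for name, records in sample_files.items():
    with open(os.path.join(input_folder, name), "w") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")

json_files = sorted(sample_files)
chunk_value = 2  # tiny on purpose, so that each file is split into several chunks


def load_with_concat(folder, files, chunksize):
    """Approach 1: one data frame per file, then one final concat."""
    frames = []
    for name in files:
        reader = pd.read_json(os.path.join(folder, name),
                              orient="records", lines=True, chunksize=chunksize)
        frames.append(pd.concat(reader))
    return pd.concat(frames, ignore_index=True, sort=True)


def load_with_loop(folder, files, chunksize):
    """Approach 2: collect every chunk of every file, then one final concat."""
    frames = []
    for name in files:
        reader = pd.read_json(os.path.join(folder, name),
                              orient="records", lines=True, chunksize=chunksize)
        for chunk in reader:
            frames.append(chunk)
    return pd.concat(frames, ignore_index=True, sort=True)


interactions_concat = load_with_concat(input_folder, json_files, chunk_value)
interactions_loop = load_with_loop(input_folder, json_files, chunk_value)

# Both approaches should yield the same rows in the same order.
print(interactions_concat.equals(interactions_loop))
print(len(interactions_loop))
```

The two functions differ only in when the per-file concatenation happens, which is the difference the timing and memory comparison is about.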

We tested both methods on the same data and found that the for loop appears to be faster than pd.concat (139.496203 seconds versus 146.306893) but is not better in terms of memory, as the results of this experiment showed.

You can also tune the value of the chunksize parameter until you reach a good balance; the right value depends on your data. In particular, we noticed that the bigger the chunks, the faster the parsing and the higher the memory usage. It is therefore worth experimenting until both parsing time and memory usage are acceptable. In our case, a chunksize value of 10000 was a good compromise.
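
One way to run this kind of experiment is sketched below. It is only an illustration: the sample file, its fields, and the three chunk sizes are invented, and tracemalloc only sees allocations made through Python's allocators, so the peak it reports is a rough proxy rather than the memory footprint of the whole process.

```python
import json
import os
import tempfile
import time
import tracemalloc

import pandas as pd

# Hypothetical sample file: 20,000 small records in JSON Lines format.
sample_path = os.path.join(tempfile.mkdtemp(), "interactions.log")
with open(sample_path, "w") as handle:
    for i in range(20_000):
        handle.write(json.dumps({"user": i, "score": i / 7, "query": f"q{i % 50}"}) + "\n")


def measure(path, chunksize):
    """Return (rows, seconds, peak traced MiB) for one full parse of the file."""
    tracemalloc.start()
    started = time.perf_counter()
    chunks = list(pd.read_json(path, orient="records", lines=True, chunksize=chunksize))
    frame = pd.concat(chunks, ignore_index=True, sort=True)
    elapsed = time.perf_counter() - started
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return len(frame), elapsed, peak / 2**20


# Try a few candidate chunk sizes and print one line per run.
results = {size: measure(sample_path, size) for size in (100, 1_000, 10_000)}
for size, (rows, seconds, peak_mib) in results.items():
    print(f"chunksize={size:>6}: {rows} rows, {seconds:.3f} s, peak {peak_mib:.1f} MiB")
```

On a file this small the numbers will be noisy; the pattern only becomes meaningful on data of realistic size.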

Here you can find some community discussions on the topic: [2], [3].

Many small files vs few large files

When working with JSON, is it better to have many small files or few large files?

To explain what I mean, let’s imagine running an e-commerce website.

We collect all the interactions that users have with the website products and save them in JSON logs. Then we have to read all these files with Python, manipulate them, and create the training and the test sets in order to train a Learning to Rank model.

We configure the system to roll log files based on date/time and we hypothesize two different scenarios:

1) Roll log files once daily

Every day we obtain one log (around 2 GB), for example: user-interactions-2021-06-20.log

2) Roll log files every 15 minutes

Every day we obtain 96 logs (around 20 MB each), for example:
user-interactions-2021-06-20-00-00.log
user-interactions-2021-06-20-00-15.log
user-interactions-2021-06-20-00-30.log
…
user-interactions-2021-06-20-23-45.log

We would have about 2 GB of data per day in both cases, but we wondered whether it would be better, in terms of time and memory usage, to manage one large file or several small files.
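
As a quick check of the second scenario, the sketch below generates the file names that a 15-minute roll would produce in one day and multiplies the count by the approximate file size. The function name is hypothetical; the naming pattern and the 20 MB figure are the ones quoted in the scenario.

```python
from datetime import datetime, timedelta


def rolled_log_names(day, minutes):
    """List the log file names produced in one day when rolling every `minutes` minutes."""
    start = datetime.strptime(day, "%Y-%m-%d")
    count = 24 * 60 // minutes
    return [
        (start + timedelta(minutes=i * minutes)).strftime("user-interactions-%Y-%m-%d-%H-%M.log")
        for i in range(count)
    ]


names = rolled_log_names("2021-06-20", 15)
print(len(names))              # 96
print(names[0], names[-1])
# Approximate daily volume in GB, at roughly 20 MB per file.
print(len(names) * 20 / 1000)  # 1.92
```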

We tested the pipeline using both approaches on half-month user interaction data to simulate a real-world application.

Here are the differences between parsing many small files and a few large files:

Parsing 27 GB of JSON files takes around 40 minutes, and the data frame memory usage is roughly 60 GB.

- Using many small files, we have advantages in terms of TIME when loading logs: approximately 3 minutes less
- Using few large files, we have advantages in terms of MEMORY: 4 GB less

The Pandas function dataframe.info() [4] was used to print summary information about the data frame. It includes the column names, the non-null counts, and the dtypes. The memory_usage parameter specifies whether the total memory usage of the data frame elements (including the index) should be displayed. A value of ‘deep’ performs a real memory usage calculation instead of an estimate:

				
					dataframe.info(memory_usage='deep')

OUTPUT

				
					<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28242019 entries, 0 to 28242018
Data columns (total 77 columns):
Feature1              63580000 non-null object
Feature2              62740023 non-null uint8
Feature3              63587849 non-null float64
........              ......
Feature75             61678009 non-null object
Feature76             63490887 non-null float16
dtypes: datetime64[ns](1), float16(23), float64(6), object(45), uint32(1), uint8(1)
memory usage: 60.2 GB
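
The difference between the default estimate and the ‘deep’ calculation is easy to see on a toy example. The frame below is hypothetical (one numeric column and one ‘object’ column holding Python lists); note that even the deep figure only counts the list objects themselves, not the floats stored inside them.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical toy frame: one numeric column and one 'object' column holding Python lists.
frame = pd.DataFrame({
    "numeric": np.arange(1_000, dtype="float64"),
    "arrays": [[float(i), float(i + 1), float(i + 2)] for i in range(1_000)],
})

# Shallow usage counts only the 8-byte references held by the 'object' column.
shallow_bytes = frame.memory_usage(deep=False).sum()
# Deep usage also asks every referenced Python object for its size.
deep_bytes = frame.memory_usage(deep=True).sum()
print(shallow_bytes, deep_bytes)

# info() writes to stdout by default; a buffer lets us capture the report as text.
buffer = io.StringIO()
frame.info(memory_usage="deep", buf=buffer)
report = buffer.getvalue()
print(report)
```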

To record the CPU and memory activity of the entire Python process, we used psrecord [5], a utility that can store the data in a file or plot it in a graph:

				
					pip install psrecord

psrecord 11653 --interval 20 --plot plot1.png --log log1.log

where:

- 11653 is the PID (the ID of the process to monitor)
- --interval: the time (in seconds) to wait between samples
- --plot: the path where the plot is saved
- --log: the path where the log is saved
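
If the script to monitor is itself written in Python, the PID does not need to be looked up by hand. The sketch below only assembles the command line for the current process; the helper name is hypothetical, and it assumes psrecord is installed and on the PATH if the command is actually run.

```python
import os
import shlex


def build_psrecord_command(pid, interval, plot_path, log_path):
    """Assemble a psrecord invocation for a given process ID."""
    return [
        "psrecord", str(pid),
        "--interval", str(interval),
        "--plot", plot_path,
        "--log", log_path,
    ]


# Command to monitor the current Python process, polling every 20 seconds.
command = build_psrecord_command(os.getpid(), 20, "plot1.png", "log1.log")
print(shlex.join(command))
```

Passing the resulting list to subprocess.Popen at the start of the pipeline would start the monitoring alongside it.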

From our experiments, we noticed that when we have a significant amount of data and most of the features are categorical (object), the RAM usage can be up to 5 times the size of the original files on disk. In our case, the size of the original files was 27 GB, the data frame memory usage was 60.2 GB, and the process memory usage was around 129 GB.
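
The ratios behind that statement can be worked out directly from the three figures quoted (the variable names are only for illustration):

```python
# Figures quoted in the text, all in GB.
files_on_disk = 27.0
data_frame_memory = 60.2
process_memory = 129.0

data_frame_ratio = data_frame_memory / files_on_disk
process_ratio = process_memory / files_on_disk

print(f"data frame / disk: {data_frame_ratio:.1f}x")   # 2.2x
print(f"process / disk:    {process_ratio:.1f}x")      # 4.8x
```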

Too many Objects!

Let’s see what happens if we drop all the ‘object’ features from the same dataset.

Before, we had 77 columns, including 45 of ‘object’ type (all of them arrays). After deleting them, we end up with 32 columns with the following types:

				
					dtypes: datetime64[ns](1), float16(23), float64(6), uint32(1), uint8(1)
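
A minimal sketch of how such columns can be dropped is shown below, using select_dtypes on a hypothetical toy frame. The column names and values are invented, and this is not necessarily how the experiment itself removed them.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame mixing primitive columns with 'object' columns that hold lists.
frame = pd.DataFrame({
    "timestamp": pd.date_range("2021-06-20", periods=4, freq="15min"),
    "price": np.array([1.5, 2.0, 3.25, 4.0], dtype="float16"),
    "clicks": np.array([1, 0, 2, 5], dtype="uint8"),
    "viewed_ids": [[1, 2], [3], [], [4, 5, 6]],
    "tags": [["a"], ["b", "c"], [], ["d"]],
})

# Keep every column whose dtype is not 'object'.
primitive_only = frame.select_dtypes(exclude=["object"])
print(primitive_only.dtypes)
```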

Here are the differences between parsing many small files and a few large files:

In this case, parsing 25 GB of JSON files takes around 10 minutes, and the data frame memory usage is roughly 3 GB. RAM usage has also dropped significantly.

This shows that the more primitive types your dataset contains, the less costly the data parsing will be. The more ‘object’ features there are, the wider the gap between the size of the JSON files on disk and the memory used by the data frame.

The fact that parsing many small files takes less time but more RAM than parsing a few large files still holds!

For simplicity, in this experiment, we just dropped the ‘object’ features to show you the advantages in terms of time and memory. In a real-world scenario, we have to find a way to convert all the ‘object’ columns into more memory-efficient types.
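
As one possible direction, and only under the assumption that every array in a column has the same length, an ‘object’ column of lists can be expanded into one numeric column per position. The column name, the length of 3, and the choice of float32 below are all hypothetical; variable-length arrays would need padding or a different representation.

```python
import numpy as np
import pandas as pd

# Hypothetical 'object' column: every row holds a list of exactly three floats.
frame = pd.DataFrame({
    "embedding": [[0.1 * i, 0.2 * i, 0.3 * i] for i in range(1_000)],
})

# Stack the lists into a 2-D float32 array, then give each position its own column.
values = np.array(frame["embedding"].tolist(), dtype="float32")
expanded = pd.DataFrame(
    values,
    columns=[f"embedding_{i}" for i in range(values.shape[1])],
    index=frame.index,
)

# Compare the deep memory usage of the two representations, in bytes.
before = frame.memory_usage(deep=True).sum()
after = expanded.memory_usage(deep=True).sum()
print(before, after)
```

The deep figure for the list column does not even count the floats stored inside each list, so the real saving is larger than the printed numbers suggest.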

Need Help With This Topic?

If you’re struggling to manage large multiple JSON files, don’t worry – we’re here to help! Our team offers expert services and training to help you optimize your search engine and get the most out of your system. Contact us today to learn more!


datascience, featureengineering, informationretrieval, json, pandas, python


Ilaria Petreti
