Tips And Tricks

If you’ve read the first blog post, you have already learned some tips and tricks on how to handle a large JSON file in Python. In this one, I want to focus on how to work efficiently with multiple JSON files.

As already suggested, it is better to read a JSON file via Pandas, using the read_json() method and passing the chunksize parameter, so that only a limited number of rows is loaded and manipulated at a time (chunksize can only be used together with lines=True). With chunksize set, the method does not return a DataFrame but a JsonReader object to iterate over. To access the file contents and create a Pandas data frame, you can use:

1) pandas.concat [2]
import pandas as pd

interactions_data_frames = []
for interactions_input_file in json_files:
    # pd.concat consumes every chunk of the JsonReader and returns one data frame per file
    interactions_temp = pd.concat(
        pd.read_json(interactions_input_folder + '/' + interactions_input_file,
                     orient='records', lines=True, chunksize=chunk_value))
    interactions_data_frames.append(interactions_temp)

interactions = pd.concat(interactions_data_frames, ignore_index=True, sort=True)
2) For loop
interactions_data_frames = []
for interactions_input_file in json_files:
    interactions_temp = pd.read_json(interactions_input_folder + '/' + interactions_input_file, orient='records', lines=True, chunksize=chunk_value)
    for chunk in interactions_temp:
        interactions_data_frames.append(chunk)

interactions = pd.concat(interactions_data_frames, ignore_index=True, sort=True)

CODE EXPLANATION

  • In both cases, we created an empty list named interactions_data_frames
  • In both cases, we iterated over json_files, the list containing all the JSON files
  • In example 1), interactions_temp is a Pandas DataFrame: pd.concat consumes all the chunks of a file, so the concatenation only completes once the entire file has been read. We then append each data frame (one per file) to the list
  • In example 2), interactions_temp is a JsonReader object; by iterating over it, we obtain smaller data frames (one per chunk) and append each of them to the list
  • In both cases, we created a unique data frame (interactions) by the concatenation of the data frame objects in interactions_data_frames
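To make the comparison concrete, here is a minimal sketch that runs both approaches on a few tiny JSON Lines files and checks that they yield the same data frame. The sample data, the file names, and the helper functions (write_sample_logs, load_with_concat, load_with_loop) are made up for illustration and are not part of the original pipeline.

```python
import json
import os
import tempfile

import pandas as pd


def write_sample_logs(folder, n_files=3, rows_per_file=25):
    """Write small JSON Lines files (hypothetical sample data) and return their names."""
    names = []
    for file_idx in range(n_files):
        name = f"sample-{file_idx}.log"
        with open(os.path.join(folder, name), "w") as handle:
            for row_idx in range(rows_per_file):
                record = {"user": file_idx, "item": row_idx, "score": row_idx / 10}
                handle.write(json.dumps(record) + "\n")
        names.append(name)
    return names


def load_with_concat(folder, json_files, chunk_value):
    """Example 1: one pd.concat per file, then a final concat over the files."""
    frames = []
    for name in json_files:
        reader = pd.read_json(os.path.join(folder, name), orient="records",
                              lines=True, chunksize=chunk_value)
        frames.append(pd.concat(reader))
    return pd.concat(frames, ignore_index=True, sort=True)


def load_with_loop(folder, json_files, chunk_value):
    """Example 2: collect every chunk of every file, then a single concat."""
    frames = []
    for name in json_files:
        reader = pd.read_json(os.path.join(folder, name), orient="records",
                              lines=True, chunksize=chunk_value)
        for chunk in reader:
            frames.append(chunk)
    return pd.concat(frames, ignore_index=True, sort=True)


with tempfile.TemporaryDirectory() as folder:
    json_files = write_sample_logs(folder)
    by_concat = load_with_concat(folder, json_files, chunk_value=10)
    by_loop = load_with_loop(folder, json_files, chunk_value=10)

# Both strategies should produce the same rows, columns and index
print(len(by_loop), by_concat.equals(by_loop))
```

The two functions differ only in when the intermediate concatenation happens, which is where the time and memory differences discussed in this post come from.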

We tested both methods on the same data and found that the for loop was slightly faster than the per-file concat (139.496203 seconds versus 146.306893 seconds) but not better in terms of memory, as you can see from the results of this experiment:

You can also tune the value of the chunksize parameter until you reach a good balance; the right value depends on your data and on the resources you have available. In particular, we noticed that the bigger the chunks, the faster the parsing and the higher the memory usage. It is therefore worth experimenting until both parsing time and memory usage are acceptable. In our case, a chunksize value of 10000 was a good compromise.
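A quick way to run this kind of experiment is to time the parsing for a few candidate values. The sketch below is only an illustration on a small synthetic file: the helper names and the sample data are assumptions, and tracemalloc only tracks allocations made through Python's allocators, so its peak figure should be treated as indicative rather than as the real process memory.

```python
import json
import os
import tempfile
import time
import tracemalloc

import pandas as pd


def parse_in_chunks(path, chunk_value):
    """Read one JSON Lines file chunk by chunk and return a single data frame."""
    reader = pd.read_json(path, orient="records", lines=True, chunksize=chunk_value)
    return pd.concat(list(reader), ignore_index=True, sort=True)


def benchmark_chunksizes(path, chunk_values):
    """Return {chunksize: (seconds, peak_bytes, n_rows)} for each candidate value."""
    results = {}
    for chunk_value in chunk_values:
        tracemalloc.start()
        started = time.perf_counter()
        frame = parse_in_chunks(path, chunk_value)
        elapsed = time.perf_counter() - started
        _, peak_bytes = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        results[chunk_value] = (elapsed, peak_bytes, len(frame))
    return results


with tempfile.TemporaryDirectory() as folder:
    # Hypothetical sample file: 2000 rows with one numeric and one list column
    sample_path = os.path.join(folder, "sample.log")
    with open(sample_path, "w") as handle:
        for row_idx in range(2000):
            handle.write(json.dumps({"item": row_idx, "tags": ["a", "b"]}) + "\n")
    results = benchmark_chunksizes(sample_path, chunk_values=[100, 500, 2000])

for chunk_value, (elapsed, peak_bytes, n_rows) in results.items():
    print(f"chunksize={chunk_value}: {elapsed:.3f}s, "
          f"peak {peak_bytes / 1e6:.1f} MB, {n_rows} rows")
```

On a file this small the differences are mostly noise; the trend only becomes meaningful on data of realistic size.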

Here you can find some community discussions on the topic:
https://stackoverflow.com/questions/51278619/what-are-the-efficient-ways-to-parse-process-huge-json-files-in-python
https://github.com/pandas-dev/pandas/issues/17048

Many small files vs few large files

When working with JSON, is it better to have many small files or few large files?

To explain what I mean, let’s imagine we run an e-commerce website.

We collect all the interactions that users have with the website products and save them in JSON logs. Then we have to read all these files with Python, manipulate them, and create the training and the test sets in order to train a Learning to Rank model. 

We configure the system to roll log files based on date/time and we hypothesize two different scenarios:

1) Roll log files once daily

Every day we obtain one log (around 2GB), for example: user-interactions-2021-06-20.log

2) Roll log files every 15 minutes

Every day we obtain 96 logs (around 20 MB each), for example:
user-interactions-2021-06-20-00-00.log
user-interactions-2021-06-20-00-15.log
user-interactions-2021-06-20-00-30.log

user-interactions-2021-06-20-23-45.log

We would have about 2GB of data in both cases, but we wondered if it would be better to manage one large file or several small files in terms of time and memory usage.
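Whichever rolling policy is used, the loading code shown earlier needs the list of file names (json_files). A minimal sketch of how it could be built, assuming all logs sit in a single folder and follow the naming pattern above; list_log_files is a hypothetical helper, and the files created here are empty placeholders:

```python
import os
import tempfile


def list_log_files(folder, prefix="user-interactions-", suffix=".log"):
    """Return the matching file names in the folder, sorted by name."""
    return sorted(name for name in os.listdir(folder)
                  if name.startswith(prefix) and name.endswith(suffix))


with tempfile.TemporaryDirectory() as interactions_input_folder:
    # Create a few empty placeholder logs, deliberately out of order
    for stamp in ("2021-06-20-00-15", "2021-06-20-00-00", "2021-06-20-23-45"):
        file_name = f"user-interactions-{stamp}.log"
        open(os.path.join(interactions_input_folder, file_name), "w").close()
    # A file that does not match the pattern and should be ignored
    open(os.path.join(interactions_input_folder, "notes.txt"), "w").close()

    json_files = list_log_files(interactions_input_folder)

print(json_files)
```

Because the timestamps in the names are zero-padded, sorting by name also sorts the logs chronologically.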

We tested the pipeline using both approaches on half-month user interaction data to simulate a real-world application.

Here are the differences between parsing many small files and a few large files:

Parsing 27 GB of JSON files takes around 40 minutes, and the data frame memory usage is roughly 60 GB.

  • Using many small files, we have advantages in terms of TIME when loading logs: approximately 3 minutes less

  • Using few large files, we have advantages in terms of MEMORY: 4GB less

The Pandas method DataFrame.info() was used to print summary information about the data frame. It includes the column names, the non-null counts, and the dtypes. The memory_usage parameter specifies whether the total memory usage of the data frame elements (including the index) should be displayed. A value of ‘deep’ performs a real memory usage calculation, which also inspects the contents of ‘object’ columns:

dataframe.info(memory_usage='deep')

OUTPUT

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28242019 entries, 0 to 28242018
Data columns (total 77 columns):
Feature1              63580000 non-null object
Feature2              62740023 non-null uint8
Feature3              63587849 non-null float64
........              ......
Feature75             61678009 non-null object
Feature76             63490887 non-null float16
dtypes: datetime64[ns](1), float16(23), float64(6), object(45), uint32(1), uint8(1)
memory usage: 60.2 GB
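To see why ‘deep’ matters for data like this, here is a small sketch on a made-up data frame (the column names and values are illustrative only). It compares the shallow and deep figures per column with DataFrame.memory_usage(), the per-column counterpart of the total printed by info():

```python
import pandas as pd

frame = pd.DataFrame({
    "numeric": pd.Series(range(1000), dtype="uint32"),
    "arrays": [[i, i + 1, i + 2] for i in range(1000)],  # stored with dtype 'object'
})

# Shallow: size of the underlying arrays only (pointers for 'object' columns)
shallow = frame.memory_usage(deep=False)
# Deep: also asks each Python object in 'object' columns for its size
deep = frame.memory_usage(deep=True)

print(frame.dtypes)
print("shallow bytes:", shallow.to_dict())
print("deep bytes:   ", deep.to_dict())
```

The numeric column reports the same size either way, while the ‘object’ column grows once the Python objects it points to are counted.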

To record the CPU and memory activity of the entire Python process, we used the utility called psrecord that allows us to store the data to a file or plot it in a graph:

pip install psrecord

psrecord 11653 --interval 20 --plot plot1.png --log log1.log

where:

  • 11653 is the PID (the ID of the process to monitor)
  • interval: to specify the time interval (in seconds) at which data is polled
  • plot: to specify the path where to save the plot
  • log: to specify the path where to save the log
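If looking up the PID by hand is inconvenient, the same command can be assembled from inside the Python script that does the parsing. This is only a sketch: build_psrecord_command is a hypothetical helper, the default file names are the ones used above, and actually launching the recorder assumes psrecord is installed and on the PATH.

```python
import os
import shlex


def build_psrecord_command(pid, interval=20, plot_path="plot1.png", log_path="log1.log"):
    """Build the psrecord invocation shown above for the given PID."""
    return ["psrecord", str(pid),
            "--interval", str(interval),
            "--plot", plot_path,
            "--log", log_path]


# os.getpid() returns the PID of the current Python process
command = build_psrecord_command(os.getpid())
print(shlex.join(command))

# To start the recording from the script itself (assumes psrecord is installed):
# import subprocess
# recorder = subprocess.Popen(command)
```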

From our experiments, we noticed that when we have a significant amount of data and most of the features are categorical (object), RAM usage can reach up to 5 times the size of the original files on disk. In our case, the size of the original files was 27 GB, the data frame memory usage was 60.2 GB, and the process memory usage was around 129 GB.

Too many objects!

Let’s see what happens if we drop all the ‘object’ features from the same dataset.

Before, we had 77 columns, including 45 of type ‘object’ (all of them arrays). After deleting them, we end up with 32 columns with the following types:

dtypes: datetime64[ns](1), float16(23), float64(6), uint32(1), uint8(1)
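The post does not show how the columns were dropped; one possible way, sketched here on a made-up data frame with hypothetical column names, is DataFrame.select_dtypes() with the exclude argument:

```python
import pandas as pd

frame = pd.DataFrame({
    "timestamp": pd.to_datetime(["2021-06-20", "2021-06-21"]),
    "price": pd.Series([9.5, 12.0], dtype="float16"),
    "clicks": pd.Series([3, 7], dtype="uint8"),
    "categories": [["shoes", "sport"], ["books"]],  # lists are stored as 'object'
})

# Keep every column whose dtype is not 'object'
without_objects = frame.select_dtypes(exclude=["object"])

print(without_objects.dtypes)
```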

Here are the differences between parsing many small files and a few large files:

Time and memory have drastically changed!

In this case, parsing 25 GB of JSON files takes around 10 minutes, and the data frame memory usage is roughly 3 GB. RAM usage has also dropped significantly.

This shows that the more primitive types your dataset contains, the lighter the data parsing will be. As the number of ‘object’ features grows, so does the gap between the size of the JSON files on disk and the memory used by the data frame.

The fact that many small files require less time but more RAM usage in parsing than a few large files remained unaffected!

For simplicity, in this experiment, we just dropped the ‘object’ features to show you the advantages in terms of time and memory. In a real-world scenario, we have to find a way to convert all the ‘object’ columns into more memory-efficient types.

In the next “tips & tricks”, we will discuss how this can be solved in detail. Stay tuned!


Author

Ilaria Petreti

Ilaria is a Data Scientist passionate about the world of Artificial Intelligence. She loves applying Data Mining and Machine Learning techniques, strongly believing in the power of Big Data and Digital Transformation.
