Tips And Tricks

In this blog post, I want to give you some tips and tricks to find efficient ways to read and parse a big JSON file in Python.

JSON (JavaScript Object Notation) is an open standard file format and data interchange format that uses human-readable text to store and transmit data objects consisting of attribute-value pairs and arrays.

Wikipedia

Working with files containing multiple JSON objects (e.g. several JSON rows) is pretty simple with the Python built-in package called json. Once imported, this module provides many methods that help us encode and decode JSON data [1].

However, if you have to parse a big JSON file and the structure of the data is complex, parsing can be very expensive in terms of time and memory. A JSON file is generally parsed in its entirety and then handled in memory: for a large amount of data, this is clearly problematic.

Let’s look together at some solutions that can help you import and manage a large JSON file in Python:

1) Use the method pandas.read_json passing the chunksize parameter

Input: JSON file
Desired Output: Pandas Data frame

Instead of reading the whole file at once, the ‘chunksize‘ parameter generates a reader that reads a specific number of lines at a time; depending on the length of your file, a certain number of chunks will be created and loaded into memory. For example, if your file has 100,000 lines and you pass chunksize=10,000, you will get 10 chunks. “Breaking” the data into smaller pieces through the choice of chunk size hopefully allows you to fit them into memory.

N.B.
– The ‘chunksize’ parameter can only be passed together with another argument: lines=True
– The method does not return a Data frame but a JsonReader object to iterate over.
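The two notes above can be sketched in a short example. The file name and columns here are hypothetical, and a small JSON Lines file is generated on the fly to stand in for a big one:

```python
import json
import pandas as pd

# Build a small JSON Lines file to stand in for a big one (hypothetical data)
with open("dataset.json", "w") as f:
    for i in range(100):
        f.write(json.dumps({"feature1": i, "feature2": "a" if i % 2 else "b"}) + "\n")

# chunksize must be paired with lines=True; it returns a JsonReader, not a DataFrame
reader = pd.read_json("dataset.json", lines=True, chunksize=25)

chunks = []
for chunk in reader:
    # Each chunk is a regular DataFrame of up to 25 rows: filter or aggregate it
    # here so the full file never has to sit in memory at once
    chunks.append(chunk)

df = pd.concat(chunks, ignore_index=True)
print(len(chunks), len(df))  # → 4 100
```

In a real pipeline you would process each chunk inside the loop and keep only the reduced result, rather than concatenating everything back as done here for illustration.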

2) Change the data type of the features

Pandas automatically detects data types for us, but, as we know from the documentation, the default ones are not the most memory-efficient [3].

Especially for strings or columns that contain mixed data types, Pandas uses the dtype ‘object‘. It takes up a lot of space in memory, so when possible it is better to avoid it. The ‘category’ data type has a much smaller footprint, especially when the number of possible values (categories) is small compared to the number of rows.

The pandas.read_json method has the ‘dtype’ parameter, with which you can explicitly specify the type of your columns. It accepts a dictionary that has column names as the keys and column types as the values.

N.B.
– The ‘dtype‘ parameter cannot be passed if orient=’table’. ‘orient’ is another argument that can be passed to the method to indicate the expected JSON string format. As per the official documentation, there are several accepted orientation values that describe how your JSON file is structured internally: split, records, index, columns, values, table. Here is the reference to understand the orient options and find the right one for your case. Remember that if ‘table’ is used, the file must adhere to the JSON Table Schema, which already preserves metadata such as dtypes and index names, so it is not possible to pass the ‘dtype‘ parameter.

– As reported here, the ‘dtype‘ parameter does not always work correctly: in fact, it does not always apply the data type specified in the dictionary.

As regards the second point, I’ll show you an example.

We specify a dictionary and pass it to the ‘dtype’ parameter:

dtypes_dict = {
    'feature1': 'int64',
    'feature2': 'category',
    'feature3': 'float16',
    'feature4': 'int8'
}

df = pd.read_json('dataset.json', lines=True, dtype=dtypes_dict)
df.info()

The output of .info() will be:

feature1        int64
feature2        object
feature3        float16
feature4        float64

You can see that Pandas ignores the setting for two features:

  • feature2: the data type is ‘object’, not ‘category’ as specified in the dictionary; in this case, the easiest way to convert it to ‘category’ is to use .astype(). Let’s use memory_usage() to verify that using ‘category’ instead of ‘object’ drastically reduces memory usage [4]:

memory_used_if_object = df['feature2'].memory_usage(deep=True) / 1e6
# 2.67064 MB

memory_used_if_category = df['feature2'].astype('category').memory_usage(deep=True) / 1e6
# 0.077111 MB

  • feature4: the data type is ‘float64’, not ‘int8’; Pandas was most likely unable to convert the column to integer due to the presence of non-finite values (NaN or inf).
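If you still want an integer type for a column with missing values, one option (a sketch, not part of the original post) is Pandas’ nullable ‘Int8’ extension dtype, which, unlike NumPy’s ‘int8’, can represent missing values:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan])

# NumPy's 'int8' cannot hold NaN, so astype('int8') would raise here;
# the nullable extension dtype 'Int8' (capital I) stores missing values as pd.NA
converted = s.astype("Int8")
print(converted.dtype)  # → Int8
```

Note the capital letter: ‘Int8’ is the nullable extension dtype, while ‘int8’ is the plain NumPy dtype that cannot hold NA.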

3) Drop unimportant columns

To save more time and memory for data manipulation and calculation, you can simply drop [5] or filter out some columns that you know are not useful at the beginning of the pipeline:

df.drop(columns=['feature2', 'feature3'], inplace=True)

4) Use different libraries: Dask or PySpark

Pandas is one of the most popular data science tools for the Python programming language; it is simple and flexible, does not require clusters, makes it easy to implement complex algorithms, and is very efficient with small data. If you have memory constraints, you can try to apply all the tricks seen above.

Despite this, when dealing with Big Data, Pandas has its limitations, and libraries with the features of parallelism and scalability can come to our aid, like Dask and PySpark.

Dask Features

  • Open source and included in Anaconda Distribution
  • Familiar coding since it reuses existing Python libraries scaling Pandas, NumPy, and Scikit-Learn workflows
  • It can enable efficient parallel computations on single machines by leveraging multi-core CPUs and streaming data efficiently from disk [6]
  • Code Implementation using Dask Bags or Dask Dataframe

PySpark Features

  • Open source and included in Anaconda Distribution
  • The syntax of PySpark is very different from that of Pandas; the motivation lies in the fact that PySpark is the Python API for Apache Spark, written in Scala. To get a familiar interface that aims to be a Pandas equivalent while taking advantage of PySpark with minimal effort, you can take a look at Koalas, the Pandas API for Spark created by Databricks.
  • Like Dask, it is multi-threaded and can make use of all cores of your machine [7]
  • Code Implementation: https://sparkbyexamples.com/pyspark/pyspark-read-json-file-into-dataframe/

We have not tried these two libraries yet but we are curious to explore them and see if they are truly revolutionary tools for Big Data as we have read in many articles.


Author

Ilaria Petreti

Ilaria is a Data Scientist passionate about the world of Artificial Intelligence. She loves applying Data Mining and Machine Learning techniques, strongly believing in the power of Big Data and Digital Transformation.
