When working with large JSON files in Python, it’s crucial to use efficient methods to parse and read JSON data without hitting memory constraints. Here, we’ll cover essential tips and techniques to help you process these files, especially if you’re working with complex structures or big data.
Understanding JSON and Its Challenges with Large Files
JSON (JavaScript Object Notation) is an open standard file format and data interchange format that uses human-readable text to store and transmit data objects consisting of attribute-value pairs and arrays.
– Wikipedia
Working with files containing multiple JSON objects (e.g. several JSON rows) is pretty simple through the Python built-in package called json [1]. Once imported, this module provides many methods that help us encode and decode JSON data [2].
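As a minimal sketch (assuming a hypothetical newline-delimited file called dataset.json), reading such a file row by row with the built-in module looks like this:

import json

records = []
with open('dataset.json', 'r', encoding='utf-8') as f:
    for line in f:
        # Each line is expected to contain one JSON object (one row)
        records.append(json.loads(line))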
However, if you have to parse a big JSON file with a complex structure, this can be very expensive in terms of time and memory: a JSON file is generally parsed in its entirety and then handled in memory, which is problematic for a large amount of data.
Let’s see together some solutions that can help you import and manage large JSON in Python:
1) USE PANDAS.READ_JSON PASSING THE CHUNKSIZE PARAMETER
Input: JSON file
Desired Output: Pandas DataFrame
Instead of reading the whole file at once, the 'chunksize' parameter generates a reader that reads a specific number of lines at a time; depending on the length of your file, a certain number of chunks will be created and loaded into memory. For example, if your file has 100,000 lines and you pass chunksize=10,000, you will get 10 chunks. "Breaking" the data into smaller pieces through the chunk size selection hopefully allows you to fit them into memory.
N.B.
– The 'chunksize' parameter can only be passed together with another argument: lines=True
– The method will not return a DataFrame but a JsonReader object to iterate over, as shown in the example below.
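Here is a minimal sketch of this approach; the file name, the chunk size, and the filter on 'feature1' are placeholders:

import pandas as pd

# With lines=True and chunksize, read_json returns a JsonReader
# that yields one DataFrame of 10,000 rows per iteration
reader = pd.read_json('dataset.json', lines=True, chunksize=10000)

chunks = []
for chunk in reader:
    # Process or filter each chunk before keeping it in memory
    chunks.append(chunk[chunk['feature1'] > 0])

df = pd.concat(chunks, ignore_index=True)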
2) CHANGE THE DATA TYPE OF THE FEATURES
Pandas automatically detects data types for us, but as we know from the documentation, the default ones are not the most memory-efficient [3].
Especially for strings or columns that contain mixed data types, Pandas uses the dtype 'object'. This dtype takes up a lot of space in memory, so it is better to avoid it when possible. The 'category' data type will certainly have less impact, especially when the number of possible values (categories) is small compared to the number of rows.
The pandas.read_json method has the ‘dtype’ parameter, with which you can explicitly specify the type of your columns. It accepts a dictionary that has column names as the keys and column types as the values.
N.B.
– The 'dtype' parameter cannot be passed if orient='table': orient is another argument that can be passed to the method to indicate the expected JSON string format. As per the official documentation, there are several accepted orientation values that indicate how your JSON file is structured internally: split, records, index, columns, values, and table. Here is the reference to understand the orient options and find the right one for your case [4]. Remember that if 'table' is used, the JSON must adhere to the JSON Table Schema, which already preserves metadata such as dtypes and index names, so it is not possible to pass the 'dtype' parameter.
– As reported here [5], the ‘dtype‘ parameter does not appear to work correctly: in fact, it does not always apply the data type expected and specified in the dictionary.
As regards the second point, I’ll show you an example.
We specify a dictionary and pass it with ‘dtype’ parameter:
dtypes_dict = {
    'feature1': 'int64',
    'feature2': 'category',
    'feature3': 'float16',
    'feature4': 'int8'
}
df = pd.read_json('dataset.json', lines=True, dtype=dtypes_dict)
df.info()
The output of .info will be:
feature1 int64
feature2 object
feature3 float16
feature4 float64

You can see that Pandas ignores the settings of two features:

- feature2: the data type is 'object', not 'category'. The memory cost of keeping it as 'object' becomes clear when we compare it with an explicit conversion:

memory_used_if_object = df['feature2'].memory_usage(deep=True) / 1e6
2.67064

memory_used_if_category = df['feature2'].astype('category').memory_usage(deep=True) / 1e6
0.077111

- feature4: the data type is 'float64', not 'int8'; Pandas was most likely unable to convert that column to integer due to the presence of non-finite values (NA or inf).
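A possible workaround, sketched here with the hypothetical column names used above, is to enforce the types explicitly after loading:

# 'Int8' (capital I) is Pandas' nullable integer type: unlike NumPy's 'int8',
# it tolerates NA values (the remaining values must still be whole numbers in the Int8 range)
df['feature2'] = df['feature2'].astype('category')
df['feature4'] = df['feature4'].astype('Int8')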
3) DROP UNIMPORTANT COLUMNS
To save more time and memory for data manipulation and calculation, you can simply drop [8] or filter out some columns that you know are not useful at the beginning of the pipeline:
df.drop(columns=['feature2', 'feature3'], inplace=True)
4) USE DIFFERENT LIBRARIES: DASK OR PYSPARK
Pandas is one of the most popular data science tools used in the Python programming language; it is simple, flexible, does not require clusters, makes the implementation of complex algorithms easy, and is very efficient with small data. If you have certain memory constraints, you can try to apply all the tricks seen above.
Despite this, when dealing with Big Data, Pandas has its limitations, and libraries with the features of parallelism and scalability can come to our aid, like Dask and PySpark.
Dask Features
- Open source and included in Anaconda Distribution
- Familiar coding since it reuses existing Python libraries scaling Pandas, NumPy, and Scikit-Learn workflows
- It can enable efficient parallel computations on single machines by leveraging multi-core CPUs and streaming data efficiently from disk [9]
- Code Implementation using Dask Bags [10] or Dask Dataframe [11] (see the sketch below)
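Based on the documentation, a Dask Bag pipeline for a newline-delimited JSON file could look roughly like this (an untested sketch; the file and field names are placeholders):

import json
import dask.bag as db

# Each ~64 MB block of the file becomes a partition;
# nothing is actually read until .compute() is called
bag = db.read_text('dataset.json', blocksize='64MB').map(json.loads)

# Keep only the fields of interest, then switch to the Dask DataFrame API
ddf = bag.map(lambda r: {'feature1': r.get('feature1'),
                         'feature2': r.get('feature2')}).to_dataframe()

print(ddf['feature1'].mean().compute())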
PySpark Features
- Open source and included in Anaconda Distribution
- The syntax of PySpark is very different from that of Pandas; the motivation lies in the fact that PySpark is the Python API for Apache Spark, written in Scala. To get a familiar interface that aims to be a Pandas equivalent while taking advantage of PySpark with minimal effort, you can take a look at Koalas [12], the Pandas API for Spark created by Databricks.
- Like Dask, it is multi-threaded and can make use of all cores of your machine [13]
- Code Implementation [14] (see the sketch below)
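Again an untested sketch based on the documented API (the file and column names are placeholders):

from pyspark.sql import SparkSession

# A local session; tune the master/memory settings for your machine or cluster
spark = SparkSession.builder.appName('large-json').getOrCreate()

# spark.read.json expects JSON Lines by default: one JSON object per line
sdf = spark.read.json('dataset.json')

# Transformations are lazy; only the final action triggers the computation
sdf.select('feature1', 'feature2').show(5)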
We have not tried these two libraries yet but we are curious to explore them and see if they are truly revolutionary tools for Big Data as we have read in many articles.
Need Help With This Topic?
If you’re struggling with how to manage large JSON files, don’t worry – we’re here to help! Our team offers expert services and training to help you optimize your search engine and get the most out of your system. Contact us today to learn more!
4 Responses
Is R or Python better for reading large JSON files as a dataframe?
My idea is to load a JSON file of about 6 GB, read it as a dataframe, select the columns that interest me, and export the final dataframe to a CSV file. Which of the two options (R or Python) do you recommend? I have tried both and I have had quite a few memory problems with each.
Hi Miguel,
thanks for your question.
We mainly work with Python in our projects, and honestly, we have never compared the performance of R and Python when reading data in JSON format.
Since you have a memory issue with both programming languages, the root cause may be different.
Have you already tried all the tips we covered in the blog post?
How much RAM/CPU do you have in your machine?
Also (if you haven’t read them yet), you may find 2 other blog posts about JSON files useful:
– https://sease.io/2021/11/how-to-manage-large-json-efficiently-and-quickly-multiple-files.html
having many smaller files instead of few large files (or vice versa)
– https://sease.io/2022/03/how-to-deal-with-too-many-object-in-pandas-from-json-parsing.html
memory issue when most of the features are ‘object’ type
Hi dear Ilaria
Thank you for this beautiful post
I am Hamed
I want to test big data processing from JSON with JavaScript.
If you have time, please help me.
Hi Hamed,
thank you for the comment.
If you need professional help, feel free to contact info@sease.io
If you have any specific comment on the blog post, let me know, I’m happy to elaborate!