Tips And Tricks

In this blog post, I want to give you some tips and tricks to find efficient ways to read and parse a big JSON file in Python.

JSON (JavaScript Object Notation) is an open standard file format and data interchange format that uses human-readable text to store and transmit data objects consisting of attribute-value pairs and arrays.

Wikipedia

Working with files containing multiple JSON objects (e.g. several JSON rows) is pretty simple with the Python built-in package called json. Once imported, this module provides many methods that help us encode and decode JSON data [1].

However, if you have to parse a big JSON file and the structure of the data is complex, parsing can be very expensive in terms of time and memory. A JSON file is generally parsed in its entirety and then handled in memory: for a large amount of data, this is clearly problematic.

Let’s look together at some solutions that can help you import and manage a large JSON file in Python:

1) Use the method pandas.read_json passing the chunksize parameter

Input: JSON file
Desired Output: Pandas Data frame

Instead of reading the whole file at once, the ‘chunksize‘ parameter generates a reader that reads a specific number of lines at a time; depending on the length of your file, a certain number of chunks will be created and loaded into memory. For example, if your file has 100,000 lines and you pass chunksize=10,000, you will get 10 chunks. “Breaking” the data into smaller pieces through the choice of chunk size hopefully allows you to fit them into memory.

N.B.
– The ‘chunksize’ parameter can only be passed together with another argument: lines=True
– The method does not return a Data frame but a JsonReader object to iterate over.
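The two notes above can be sketched in a short example. The file name and columns here are hypothetical, and a small JSON Lines file is generated on the fly to stand in for a big one:

```python
import json
import pandas as pd

# Build a small JSON Lines file to stand in for a big one (hypothetical data)
with open("dataset.json", "w") as f:
    for i in range(100):
        f.write(json.dumps({"feature1": i, "feature2": "a" if i % 2 else "b"}) + "\n")

# chunksize must be paired with lines=True; it returns a JsonReader, not a DataFrame
reader = pd.read_json("dataset.json", lines=True, chunksize=25)

chunks = []
for chunk in reader:
    # Each chunk is a regular DataFrame of up to 25 rows: filter or aggregate it
    # here so the full file never has to sit in memory at once
    chunks.append(chunk)

df = pd.concat(chunks, ignore_index=True)
print(len(chunks), len(df))  # → 4 100
```

In a real pipeline you would process each chunk inside the loop and keep only the reduced result, rather than concatenating everything back as done here for illustration.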

2) Change the data type of the features

Pandas automatically detects data types for us, but, as we know from the documentation, the default ones are not the most memory-efficient [3].

Especially for strings or columns that contain mixed data types, Pandas uses the dtype ‘object‘. It takes up a lot of space in memory, so when possible it is better to avoid it. The ‘category’ data type has a much smaller footprint, especially when the number of possible values (categories) is small compared to the number of rows.

The pandas.read_json method has the ‘dtype’ parameter, with which you can explicitly specify the type of your columns. It accepts a dictionary that has column names as the keys and column types as the values.

N.B.
– The ‘dtype‘ parameter cannot be passed if orient=’table’. ‘orient’ is another argument that can be passed to the method to indicate the expected JSON string format. As per the official documentation, there are several accepted orientation values that describe how your JSON file is structured internally: split, records, index, columns, values, table. Here is the reference to understand the orient options and find the right one for your case. Remember that if ‘table’ is used, the file must adhere to the JSON Table Schema, which already preserves metadata such as dtypes and index names, so it is not possible to pass the ‘dtype‘ parameter.

– As reported here, the ‘dtype‘ parameter does not always work correctly: in fact, it does not always apply the data type specified in the dictionary.

As regards the second point, I’ll show you an example.

We specify a dictionary and pass it to the ‘dtype’ parameter:

dtypes_dict = {
    'feature1': 'int64',
    'feature2': 'category',
    'feature3': 'float16',
    'feature4': 'int8'
}

df = pd.read_json('dataset.json', lines=True, dtype=dtypes_dict)
df.info()

The output of .info() will be:

feature1        int64
feature2        object
feature3        float16
feature4        float64

You can see that Pandas ignores the setting for two features:

  • feature2: the data type is ‘object’, not ‘category’ as specified in the dictionary; in this case, the easiest way to convert it to ‘category’ is to use .astype(). Let’s use memory_usage() to verify that using ‘category’ instead of ‘object’ drastically reduces memory usage [4]:

memory_used_if_object = df['feature2'].memory_usage(deep=True) / 1e6
# 2.67064 MB

memory_used_if_category = df['feature2'].astype('category').memory_usage(deep=True) / 1e6
# 0.077111 MB

  • feature4: the data type is ‘float64’, not ‘int8’; Pandas was most likely unable to convert the column to integer due to the presence of non-finite values (NaN or inf).
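If you still want an integer type for a column with missing values, one option (a sketch, not part of the original post) is Pandas’ nullable ‘Int8’ extension dtype, which, unlike NumPy’s ‘int8’, can represent missing values:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan])

# NumPy's 'int8' cannot hold NaN, so astype('int8') would raise here;
# the nullable extension dtype 'Int8' (capital I) stores missing values as pd.NA
converted = s.astype("Int8")
print(converted.dtype)  # → Int8
```

Note the capital letter: ‘Int8’ is the nullable extension dtype, while ‘int8’ is the plain NumPy dtype that cannot hold NA.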

3) Drop unimportant columns

To save more time and memory for data manipulation and calculation, you can simply drop [5] or filter out some columns that you know are not useful at the beginning of the pipeline:

df.drop(columns=['feature2', 'feature3'], inplace=True)

4) Use different libraries: Dask or PySpark

Pandas is one of the most popular data science tools for the Python programming language; it is simple and flexible, does not require clusters, makes it easy to implement complex algorithms, and is very efficient with small data. If you have memory constraints, you can try to apply all the tricks seen above.

Despite this, when dealing with Big Data, Pandas has its limitations, and libraries with the features of parallelism and scalability can come to our aid, like Dask and PySpark.

Dask Features

  • Open source and included in Anaconda Distribution
  • Familiar coding since it reuses existing Python libraries scaling Pandas, NumPy, and Scikit-Learn workflows
  • It can enable efficient parallel computations on single machines by leveraging multi-core CPUs and streaming data efficiently from disk [6]
  • Code Implementation using Dask Bags or Dask Dataframe

PySpark Features

  • Open source and included in Anaconda Distribution
  • The syntax of PySpark is very different from that of Pandas; the motivation lies in the fact that PySpark is the Python API for Apache Spark, written in Scala. To get a familiar interface that aims to be a Pandas equivalent while taking advantage of PySpark with minimal effort, you can take a look at Koalas, the Pandas API for Spark created by Databricks.
  • Like Dask, it is multi-threaded and can make use of all cores of your machine [7]
  • Code Implementation: https://sparkbyexamples.com/pyspark/pyspark-read-json-file-into-dataframe/

We have not tried these two libraries yet but we are curious to explore them and see if they are truly revolutionary tools for Big Data as we have read in many articles.


Author

Ilaria Petreti

Ilaria is a Data Scientist passionate about the world of Artificial Intelligence. She loves applying Data Mining and Machine Learning techniques, strongly believing in the power of Big Data and Digital Transformation.
