Pandas automatically detects data types for us but, as we know from the documentation, the defaults are not the most memory-efficient ones [3].
Especially for strings, or for columns that contain mixed data types, Pandas uses the ‘object’ dtype. It takes up a lot of space in memory, so when possible it is better to avoid it. The ‘Categorical’ data type will certainly have a smaller impact, especially when the number of possible values (categories) is small compared to the number of rows.
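To see the difference in practice, here is a minimal sketch (the data is made up) that compares the memory footprint of the same column stored as ‘object’ and as ‘category’:

```python
import pandas as pd

# One million repetitions of three distinct values: low cardinality
df = pd.DataFrame({"color": ["red", "green", "blue"] * 1_000_000})

as_object = df["color"].memory_usage(deep=True)
as_category = df["color"].astype("category").memory_usage(deep=True)

print(f"object:   {as_object / 1e6:.1f} MB")
print(f"category: {as_category / 1e6:.1f} MB")
# 'category' stores each distinct value once plus small integer codes,
# so the saving grows with the ratio of rows to distinct values
```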
The pandas.read_json method has a ‘dtype’ parameter with which you can explicitly specify the type of your columns. It accepts a dictionary whose keys are the column names and whose values are the column types.
N.B.
– The ‘dtype‘ parameter cannot be passed if orient=’table’. ‘orient’ is another argument that can be passed to the method to indicate the expected JSON string format; as per the official documentation, it accepts a number of values that describe how your JSON file is structured internally: split, records, index, columns, values, table. Here is the reference to understand the orient options and find the right one for your case [4]. Remember that if ‘table’ is used, the file adheres to the JSON Table Schema, which preserves metadata such as dtypes and index names, so it is not possible (nor necessary) to pass the ‘dtype‘ parameter; see the sketch after this note.
– As reported here [5], the ‘dtype‘ parameter does not always work correctly: it does not always apply the data types specified in the dictionary.
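To make the orientations concrete, here is a minimal sketch (with made-up data) of a round trip with orient=’records’ and orient=’table’; note how ‘table’ restores the dtypes from its embedded schema without any ‘dtype’ argument:

```python
from io import StringIO

import pandas as pd

df = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})

# orient='records': a plain list of row objects
records_json = df.to_json(orient="records")
# -> [{"id":1,"name":"a"},{"id":2,"name":"b"}]

# orient='table': JSON Table Schema, i.e. {"schema": {...}, "data": [...]}
table_json = df.to_json(orient="table")

# The orient used for reading must match the one used for writing
df_records = pd.read_json(StringIO(records_json), orient="records")
df_table = pd.read_json(StringIO(table_json), orient="table")

# With orient='table' the dtypes come back from the schema itself;
# passing dtype= here would raise an error
print(df_table.dtypes)
```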
Regarding the second point (the ‘dtype‘ issue), I’ll show you an example.
We define a dictionary of column types and pass it via the ‘dtype’ parameter:
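A minimal sketch of what this looks like (the file name and column names are hypothetical, and we assume a line-delimited JSON file):

```python
import pandas as pd

# Hypothetical columns and the types we would like them to have
dtypes = {
    "id": "int32",
    "price": "float32",
    "status": "category",
}

df = pd.read_json("data.json", lines=True, dtype=dtypes)

# Depending on the pandas version, df.dtypes may reveal that some of the
# requested types (e.g. 'category') were not applied -- the behaviour
# reported in [5]
print(df.dtypes)
```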
Miguel Angel
June 6, 2022
Is R or Python better for reading large JSON files as a dataframe?
My idea is to load a JSON file of about 6 GB, read it as a dataframe, select the columns that interest me, and export the final dataframe to a CSV file. Which of the two options (R or Python) do you recommend? I have tried both and have run into quite a few memory problems with each.
Ilaria Petreti
June 7, 2022
Hi Miguel,
thanks for your question.
We mainly work with Python in our projects and, honestly, we have never compared the performance of R and Python when reading data in JSON format.
Since you have memory issues with both programming languages, the root cause may not be the language itself.
Have you already tried all the tips we covered in the blog post?
How much RAM/CPU do you have in your machine?
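In the meantime, if your file is line-delimited JSON (one object per line), a chunked pipeline is usually the first thing to try; here is a minimal sketch (the path, chunk size and column names are made up):

```python
import pandas as pd

cols = ["id", "name", "price"]  # hypothetical columns to keep

# chunksize requires lines=True (line-delimited JSON)
reader = pd.read_json("big_file.json", lines=True, chunksize=100_000)

for i, chunk in enumerate(reader):
    # Keep only the columns of interest and append each chunk to the CSV
    chunk[cols].to_csv(
        "output.csv",
        mode="a",          # append
        header=(i == 0),   # write the header only once
        index=False,
    )
```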
Also (if you haven’t read them yet), you may find 2 other blog posts about JSON files useful:
– https://sease.io/2021/11/how-to-manage-large-json-efficiently-and-quickly-multiple-files.html (having many smaller files instead of a few large files, or vice versa)
– https://sease.io/2022/03/how-to-deal-with-too-many-object-in-pandas-from-json-parsing.html (memory issues when most of the features are of the ‘object’ type)