Here we are with a new “episode” about managing large JSON, as promised.
If you have not yet read the first two blog posts, I suggest catching up on them in order to better follow what I'm going to discuss now:
How to manage a large JSON file efficiently and quickly
How to manage large JSON efficiently and quickly: multiple files
From our experiment in the second blog post, we noticed that the more primitive types the dataset contains, the less impact data parsing has on time and memory. Conversely, the more object dtype features we have, the larger the memory gap between the JSON files on disk and the Python data frame.
Dealing with a lot of categorical data is very expensive, and we need to find the most efficient way to analyze it.
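To give a feel for the gap, here is a minimal sketch (the column names and values are hypothetical) comparing the memory footprint of the same categorical information stored as strings (object dtype) versus as integer codes:

```python
import numpy as np
import pandas as pd

# Hypothetical example: one million user actions stored as strings
# (object dtype) versus as integer codes (primitive dtype).
n_rows = 1_000_000
actions = np.random.choice(["click", "add_to_cart", "purchase"], size=n_rows)

df = pd.DataFrame({
    "action_str": actions,                  # object dtype
    "action_id": pd.factorize(actions)[0],  # int64 dtype
})

# deep=True accounts for the actual size of the Python string objects
print(df.memory_usage(deep=True))
```

On a typical run, the object column takes several times the memory of the integer column, which is exactly the kind of overhead we want to avoid.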
As usual, we contextualize the experiment in a real-world scenario: e-commerce.
We collect all the interactions that users have with the website products and save them in JSON logs. Then we have to read all these files with Python, manipulate them, and create the training and the test sets in order to train a Learning to Rank model.
The approach we propose here is to manage categorical data while collecting it, storing it directly as numeric types in the JSON. That way, when parsing the JSON files, we don't have to worry about object dtype features.
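As a minimal sketch of this idea (the field names, the mapping, and the helper function are hypothetical, not the exact logging code from our system), the categorical value is encoded to its numeric ID at write time, so the log already contains primitive types:

```python
import json

# Hypothetical mapping, fixed at collection time: every categorical value
# (here, the user action) is written to the log as a numeric code.
ACTION_CODES = {"click": 0, "add_to_cart": 1, "purchase": 2}

def log_interaction(user_id: int, product_id: int, action: str, path: str) -> None:
    """Append one interaction to a JSON Lines log, with the action already encoded."""
    record = {
        "user_id": user_id,
        "product_id": product_id,
        "action": ACTION_CODES[action],  # numeric code instead of a string
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

# Example usage
log_interaction(42, 1001, "add_to_cart", "interactions.json")
```

When such a file is read back (for example with pandas' read_json and lines=True), the action column is already numeric, so no expensive object dtype conversion is needed downstream.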