Here we are with a new “episode” about managing large JSON, as promised.
If you have not yet read the first two blog posts, I suggest catching up on them first to better understand what I'm going to discuss here.
From our experiment in the second blog post, we noticed that the more primitive types the dataset contains, the less impact data parsing has on time and memory. The more object dtype features we have, the larger the memory gap between the JSON files on disk and the Python data frame.
Dealing with a lot of categorical data is very expensive, and we need to find the most efficient way to analyze it.
As usual, we contextualize the experiment in a real-world scenario: e-commerce.
We collect all the interactions that users have with the website products and save them in JSON logs. Then we have to read all these files with Python, manipulate them, and create the training and the test sets in order to train a Learning to Rank model.
The approach we propose here is to manage categorical data while collecting it and store it directly as numeric types in the JSON. That way, when parsing JSON files, we don’t have to worry about the object dtype features.
How could this be done?
It depends on what the categorical features look like. Here are 3 different cases:
1) WHEN HAVING MULTIPLE VALUES FOR A SINGLE FEATURE (ARRAY)
"feature_name" : [array]
'cartCategories' is the feature that stores all the category ids of the products in the cart:
"cartCategories": [43, 46, 48, 60, 63, 64, 65, 105, 108, 3163, 3456, 3466, 3468, 3476, 3477, 3478, 4099, 4432, 4456, 4534, 4642, 5269, 5406, 5825, 43, 3456]
How to encode it?
1) creating arrays with a fixed-length dimension (e.g. selecting top 5 values only):
"cartCategories": [43, 3456, 46, 48, 60]
2) ‘exploding’ each array in N numerical columns (where N is the fixed-length dimension of the array):
"cartCategories_position1": 43, "cartCategories_position2": 3456, "cartCategories_position3": 46, "cartCategories_position4": 48, "cartCategories_position5": 60
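The two operations above could be sketched in Python as follows. This is only a minimal sketch, not the code used in the experiment: the helper names `top_n_values` and `explode_positions` are hypothetical, and it assumes that "top 5" means the five most frequent ids, with ties broken by first appearance (which is consistent with the example above).

```python
from collections import Counter


def top_n_values(values, n=5):
    """Return the n most frequent values; ties keep first-appearance order."""
    counts = Counter(values)  # keys are kept in insertion order, like a dict
    # sorted() is stable, so equally frequent ids stay in first-seen order
    return sorted(counts, key=lambda v: -counts[v])[:n]


def explode_positions(name, values, n=5):
    """Spread an array over n numeric columns, padding short arrays with None."""
    trimmed = list(values[:n])
    padded = trimmed + [None] * (n - len(trimmed))
    return {f"{name}_position{i + 1}": v for i, v in enumerate(padded)}


cart = [43, 46, 48, 60, 63, 64, 65, 105, 108, 3163, 3456, 3466, 3468,
        3476, 3477, 3478, 4099, 4432, 4456, 4534, 4642, 5269, 5406, 5825,
        43, 3456]

top5 = top_n_values(cart)
columns = explode_positions("cartCategories", top5)
```

Note that padding with None (an assumption of this sketch) ends up as null in the JSON, which pandas then reads as a missing value.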
2) WHEN THE NUMBER OF POSSIBLE VALUES IS LIMITED TO A FIXED SET
"feature_name" : "value"
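userDevice may have 3 distinct values: desktop, mobile, and tablet
How to encode it?
With one-hot encoding, i.e. one 0/1 feature per possible value. For a desktop user:
"desktop": 1, "mobile": 0, "tablet": 0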
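As an illustration, the one-hot flags could be produced at collection time with a few lines of Python. This is a sketch under assumptions: the function name `one_hot` and the `raw` record are made up for the example, and only the three device values come from the text above.

```python
import json

DEVICES = ("desktop", "mobile", "tablet")  # the fixed set of allowed values


def one_hot(value, categories=DEVICES):
    """Map one categorical value to a dict of 0/1 flags, one per category."""
    if value not in categories:
        raise ValueError(f"unexpected category: {value!r}")
    return {category: int(category == value) for category in categories}


# hypothetical interaction record, before encoding
raw = {"userDevice": "desktop"}

encoded = one_hot(raw["userDevice"])
line = json.dumps(encoded)  # what would be written to the JSON log
```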
3) WHEN THE NUMBER OF POSSIBLE VALUES IS NOT LIMITED TO A MANAGEABLE SET
"feature_name" : "value"
userID may have N distinct values, depending on how many users are registered on the website
"userID" : "mdn456osnb210mn"
How to encode it?
A categorical feature is said to have high cardinality when it has a very large number of unique values. In this case, one-hot encoding is not considered a good approach, since it leads to two problems: space consumption and the curse of dimensionality. Unless you decide to keep only the N most frequent categories of a column and apply one-hot encoding to them, you should evaluate the solutions and techniques most suitable to your use case.
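The "N most frequent categories" option mentioned above could look like the following pandas sketch. It is not what we did in the experiment (the high-cardinality features were left untouched); the helper name `one_hot_top_n`, the "other" bucket, and the sample ids are all assumptions made for illustration.

```python
import pandas as pd


def one_hot_top_n(series, n):
    """One-hot encode the n most frequent values; group the rest as 'other'."""
    top = series.value_counts().index[:n]  # value_counts sorts by frequency
    reduced = series.where(series.isin(top), other="other")
    return pd.get_dummies(reduced, prefix=series.name, dtype="uint8")


# made-up user ids, just to show the shape of the result
users = pd.Series(["u1", "u2", "u1", "u3", "u1", "u2", "u4"], name="userID")
encoded = one_hot_top_n(users, n=2)
```

With n=2 the result has three columns (the two most frequent ids plus the "other" bucket) instead of one column per distinct user.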
In our scenario, we have a total of 45 object-dtype features:
- most of the features are arrays so we used the first solution described to manipulate them
- 3 features fall into the second type so we treated them using one-hot encoding
- 2 features have very high cardinality, as described in the third case. Depending on the input information and the encoding method, some information may be lost during encoding, so we decided to leave them as they are and preserve their data.
We trust that transforming 43 features out of 45 should be enough to improve the performance.
For the experiment, we have used the same dataset, in order to compare this approach with the previous one in terms of:
- JSON file size: in the new approach, after the categorical variables have been manipulated, each JSON file contains more features and could get bigger → we need to make sure the increase is tolerable
- performance: time and memory while parsing
The dataset used is the same: half-month user interaction data → 1434 JSON files
In the previous experiment, we had 77 columns/features including 45 object dtypes:
dtypes: datetime64[ns](1), float16(23), float64(6), object(45), uint32(1), uint8(1) memory usage: 60.0 GB
In this experiment, after the feature transformation, we have 209 columns and just 2 object dtypes:
dtypes: datetime64[ns](1), float16(23), float64(178), int64(3), object(2), uint32(1), uint8(1) memory usage: 41.8 GB
Here are the differences between parsing many small files and a few large files:
New experiment observations
- the total size of the JSON files on disk is bigger than before, as expected: roughly 1.5 times as large
- the memory usage of the data frame elements in Python is less than the disk storage, which is good
- the RAM consumption peak is roughly 4 times the disk storage (previously it was 5 times)
- parsing more data took only a few more minutes
Again, the finding that many small files require less time but more RAM to parse than a few large files still holds!
Would it be better if we removed (or transformed) the 2 remaining object features?
Out of curiosity, we dropped the remaining object features from the dataset and repeated the experiment. Again, we obtained very good performance in terms of time and of the memory usage of the data frame elements in Python, but not in terms of RAM consumption:
It should also be added that, in the proposed approach, we transformed the object dtypes into a lot of float64 dtypes, due to the presence of missing values in the manipulated features. float64 also has a high memory impact compared to float16/32 or integer types.
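The float64 effect is easy to reproduce on a toy column. The sketch below is not part of the experiment; the column values are made up, and the two conversions shown (float32 and pandas' nullable `UInt16`) are just two possible options, to be chosen according to the actual range of the ids.

```python
import pandas as pd

# integer ids with one missing value, as produced by a padded array
df = pd.DataFrame({"cartCategories_position1": [43, 3456, None]})
column = df["cartCategories_position1"]

# pandas falls back to float64 because NaN cannot be stored in an int column
original_dtype = str(column.dtype)

# option 1: float32 halves the size and is exact for integers up to 2**24
as_float32 = column.astype("float32")

# option 2: a nullable integer dtype keeps integers and marks gaps as <NA>
as_nullable = column.astype("UInt16")

bytes_float64 = column.memory_usage(index=False)
bytes_float32 = as_float32.memory_usage(index=False)
bytes_nullable = as_nullable.memory_usage(index=False)
```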
We were hoping for a smaller peak memory impact on the parsing and a reduced time at the cost of initial bigger JSON storage on disk.
Unfortunately, the additional overhead on disk and the additional number of features proved to be quite costly (in time and memory when parsing), dwarfing the benefit of having a smaller final data frame.
Contrary to what we expected, this approach did not come with great absolute benefits. Although transforming most of the 'object' dtype features gives us a smaller data frame to work on, we need more space to store the data and, consequently, roughly the same time and memory as in the previous experiment to parse it as a whole.
Given that the storage is higher, and that the peak memory and time required for parsing are not lower, it seems better to just manipulate the data frame after parsing to reduce its size.
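As a rough idea of what manipulating the data frame after parsing can mean, here is a small pandas sketch. It is an illustration under assumptions, not the procedure from the previous posts: the `price` column and all the values are made up, and converting to float32 trades some precision for memory.

```python
import pandas as pd

# toy data frame standing in for the parsed JSON logs
df = pd.DataFrame({
    "userDevice": ["desktop", "mobile", "desktop", "tablet"] * 1000,
    "price": [9.99, 19.5, 5.0, 12.25] * 1000,  # hypothetical numeric feature
})

before = df.memory_usage(deep=True).sum()

# strings with few distinct values compress well as 'category';
# float64 columns can often be stored as float32
small = df.astype({"userDevice": "category", "price": "float32"})

after = small.memory_usage(deep=True).sum()
```

Comparing `before` and `after` shows how much is saved on a given data frame; the gain depends on how many distinct values each object column holds.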
This is the conclusion of this preliminary study.