When working with JSON, is it better to have many small files or few large files?
To explain what I mean, let’s imagine we run an e-commerce website.
We collect all the interactions users have with the site’s products and save them as JSON logs. We then have to read all these files with Python, manipulate them, and create the training and test sets needed to train a Learning to Rank model.
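As a minimal sketch of what that pipeline might look like, assuming each log line is a standalone JSON object (JSON Lines) and borrowing the daily file name from the first scenario below, the reading and splitting step could be as simple as:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Each line of the log is assumed to be one JSON object (JSON Lines),
# so pandas can load the file directly into a DataFrame.
def load_log(path):
    return pd.read_json(path, lines=True)

# Daily file name taken from scenario 1 below.
interactions = load_log("user-interactions-2021-06-20.log")

# Arbitrary 80/20 split into training and test sets for the
# Learning to Rank model.
train_df, test_df = train_test_split(interactions, test_size=0.2, random_state=42)
```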
We configure the system to roll log files based on date/time and we consider two different scenarios:
1) Roll log files once daily
Every day we obtain one log (around 2GB), for example: user-interactions-2021-06-20.log
2) Roll log files every 15 minutes
Every day we obtain 96 logs of around 20 MB each; a short sketch for reading them follows this list. For example:
user-interactions-2021-06-20-00-00.log
user-interactions-2021-06-20-00-15.log
user-interactions-2021-06-20-00-30.log
…
user-interactions-2021-06-20-23-45.log
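For this second scenario, a sketch of how the 96 quarter-hour files for a single day could be collected and merged, again assuming JSON Lines content and the hypothetical file-name pattern shown above:

```python
import json
from glob import glob

def read_interactions(path):
    """Parse one JSON-lines log file into a list of interaction dicts."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Collect the 96 quarter-hour logs of 2021-06-20 and merge them.
paths = sorted(glob("user-interactions-2021-06-20-*.log"))
daily_interactions = []
for path in paths:
    daily_interactions.extend(read_interactions(path))

print(f"{len(paths)} files, {len(daily_interactions)} interactions")
```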
We would have about 2 GB of data per day in both cases, but we wondered whether, in terms of time and memory usage, it is better to manage one large file or many small files.
To simulate a real-world application, we tested the pipeline with both approaches on half a month of user interaction data.
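A minimal harness along these lines could be used to compare wall-clock time and peak memory for the two approaches; this is only a sketch (using time.perf_counter and tracemalloc), not the exact measurement code behind the results below, and the file names are the hypothetical ones from the scenarios above:

```python
import json
import time
import tracemalloc
from glob import glob

def parse_file(path):
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def measure(paths):
    """Parse every file in paths; return (elapsed seconds, peak MiB)."""
    tracemalloc.start()
    start = time.perf_counter()
    records = []
    for path in paths:
        records.extend(parse_file(path))
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return elapsed, peak / 1024 ** 2

# One large daily log vs. the 96 small quarter-hour logs for the same day.
t_large, m_large = measure(["user-interactions-2021-06-20.log"])
t_small, m_small = measure(sorted(glob("user-interactions-2021-06-20-*.log")))
print(f"large file:  {t_large:.1f}s, peak {m_large:.0f} MiB")
print(f"small files: {t_small:.1f}s, peak {m_small:.0f} MiB")
```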
Here are the differences between parsing many small files and a few large files: