Low reading performance of TFRecords via S3 #5551
Comments
We identified a few inefficiencies in the way the TFRecord reader accesses S3 storage compared to the File reader. #5554 optimizes that.
Great! I'll test it as soon as it gets into a nightly build. Thanks!
I reran the S3 tests using DALI 1.41.0.dev20240718, configuring
While the performance has certainly improved, it's still surprising that accessing fewer, larger TFRecords is slower than accessing multiple smaller files. Can you replicate this behaviour as well? Thanks!
Even if there are fewer, larger files, the requests to S3 are independent. I am not aware of reading parts of large files from AWS S3 being less efficient than reading smaller individual files. Are you able to get similar performance with other S3 reading libraries?
If I understand correctly, you're accessing the parts/blobs within the TFRecords independently through S3. I believe it would be much faster, whenever possible, to read an entire TFRecord into memory (or onto the filesystem) with a single request and then access the individual parts locally from there. If individual TFRecords might be too large for this approach (they can be generated arbitrarily large, though that is not recommended), you could still read larger chunks (such as 64 MB) into local memory and unpack the contents of each chunk locally, as in the sketch below.
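To make the suggestion concrete, here is a minimal sketch of that access pattern using boto3. This only illustrates the idea, not DALI's internals; the bucket name, object key, and MinIO endpoint are hypothetical placeholders, and credentials are assumed to be configured in the environment:

```python
# Sketch: one ranged GET per 64 MB window, with record unpacking done
# locally. Bucket, key, and endpoint below are hypothetical.
import boto3

CHUNK = 64 * 1024 * 1024  # 64 MB window, matching the TFRecord size above

s3 = boto3.client("s3", endpoint_url="http://minio.example:9000")

def iter_chunks(bucket: str, key: str):
    """Yield the object's contents in CHUNK-sized buffers, one GET each."""
    size = s3.head_object(Bucket=bucket, Key=key)["ContentLength"]
    for offset in range(0, size, CHUNK):
        end = min(offset + CHUNK, size) - 1
        resp = s3.get_object(Bucket=bucket, Key=key, Range=f"bytes={offset}-{end}")
        yield resp["Body"].read()

for buf in iter_chunks("imagenet-train", "part-00000.tfrecord"):
    # Parse individual TFRecord entries out of `buf` here, with no
    # further S3 round trips per record.
    pass
```

The point is that the per-request overhead is paid once per 64 MB window rather than once per record.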
Hi, I wanted to ask if there are any updates on this issue. We are currently doing some benchmarking and would like to achieve the best possible performance with DALI. Do you have any plans to work on this in the near future, or is it currently a low priority? Thank you!
Hi @fversaci, We have a couple of ideas on how to approach this, but it requires noticeable effort and we cannot commit to any particular date at the moment.
Hi @JanuszL, thanks anyway for the update. |
Will keep you posted. |
Version
1.40.0.dev20240628
Describe the bug.
Hi,
I am testing the throughput of DALI (v1.40.0.dev20240628) when reading TFRecords via S3 and I am obtaining unexpectedly poor results.
I am scanning the ImageNet training dataset (140 GB) on a system tuned to have 30 GB of available RAM, to prevent automatic memory caching. The dataset is stored either in its original FILES format or as TFRECORD files of 64 MB each. In both cases the dataset is read in batches of size 128.
When reading (with no decoding) from fast NVMe disks, these are the results I am getting:
This difference seems reasonable, since TFRECORD entails less file I/O overhead.
However, when I run the same tests with the datasets stored on the same disks and the same node, but made available via a MinIO server, the results are:
In this second scenario, the speed of accessing FILES seems reasonable in comparison to the previous case, as accessing S3 incurs higher overhead than directly accessing the filesystem. However, I would have anticipated much faster TFRECORD reads, given that they are supposed to amortize access latencies across 64 MB blocks, which are significantly larger than the average 115 kB JPEG file (scanning 140 GB means roughly 1.2 million requests for individual JPEGs versus about 2,200 for 64 MB blocks). Surprisingly, TFRECORD is even slower than the FILES setup.
We also conducted a test using your 12 MB tfrecord example with a batch size of 47, and we achieved performance consistent with the previous results.
Do you have any ideas about what could be causing the low reading performance of TFRecords via S3? Are we overlooking some optimization parameters in the TFRecord reader?
Thanks!
Minimum reproducible example
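A hedged sketch of the kind of pipeline used for these measurements (object keys, index paths, and the feature schema are placeholders; S3 credentials and the MinIO endpoint are assumed to come from the usual AWS environment variables such as AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_ENDPOINT_URL):

```python
# Sketch of a DALI pipeline reading raw (undecoded) TFRecord samples
# from S3 and timing the throughput. Paths and features are hypothetical.
import time

from nvidia.dali import pipeline_def, fn
import nvidia.dali.tfrecord as tfrec

BATCH = 128

@pipeline_def(batch_size=BATCH, num_threads=4, device_id=0)
def scan_pipeline():
    sample = fn.readers.tfrecord(
        path=["s3://imagenet/train-00000.tfrecord"],  # hypothetical 64 MB shard
        index_path=["/local/idx/train-00000.idx"],    # generated with tfrecord2idx
        features={
            "image/encoded": tfrec.FixedLenFeature([], tfrec.string, ""),
            "image/class/label": tfrec.FixedLenFeature([1], tfrec.int64, -1),
        },
        name="Reader",
    )
    # No decoding: return the raw bytes to measure pure read throughput.
    return sample["image/encoded"]

pipe = scan_pipeline()
pipe.build()

iters = 100
start = time.time()
for _ in range(iters):
    pipe.run()
print(f"{iters * BATCH / (time.time() - start):.0f} samples/s")
```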