Issue with map #6789
Comments
For multi-processing you should probably pass the …
Hi @Modexus, Thank you very much for the help! Yep, after playing around with map, I managed to get the parallel processing to work by implementing it like you suggested. Regarding the temp files, it seems like they just keep growing in size as the map continues. Eventually, once map finishes, the temp files are deleted, but they are instead saved as cache .arrow files. These cache files are absolutely gigantic (~30-50x the size of the initial dataset!). After playing around with the …
When I remove this line, the final cache files are approximately the same size as the initial dataset. Can I check whether this is expected behavior with the whisper feature extractor? I can't imagine the spectrograms are that large! Thank you so much for the help!
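For scale, a back-of-the-envelope estimate (assuming the usual Whisper log-mel settings: every clip padded to 30 s, 80 mel bins × 3000 frames, float32 features) suggests the extracted features really are large compared with compressed audio:

```python
# Rough size of one Whisper log-mel spectrogram, independent of clip length,
# because the feature extractor pads/truncates everything to 30 seconds.
n_mels = 80            # mel frequency bins
n_frames = 3000        # 30 s at a 10 ms hop
bytes_per_float = 4    # float32
spectrogram_bytes = n_mels * n_frames * bytes_per_float
print(spectrogram_bytes)  # 960000 bytes, i.e. ~1 MB per clip
```

A few-second compressed audio clip is often only tens of kilobytes, so a ~1 MB feature per clip can plausibly account for a 30-50x blow-up in cache size.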
I'm having a similar issue with the spectrograms taking up an incredibly large amount of space (e.g. 100GB for 3GB of audio). Is this really normal behavior?
Same problem: dataset.map takes a long time to process 12GB of raw audio data and creates a 200GB cache file. Is there any way to run the processing (map) during training, instead of the current up-front run?
Same issue here. Just trying to normalise image data for a 300MB dataset ends up with an 11GB cache. The initial .map() call takes 80s over the 15000 images, but then simply iterating over the dataset takes almost 2 minutes. It should be doing no processing at that point! Something seems wrong.
@eufrizz How do you handle this using set_transform?
Just give the preprocessing function you were using for map to set_transform; see the set_transform documentation. If you're going to run lots of epochs, you might be better off saving the preprocessed data into a new dataset instead.
Describe the bug
Map has been taking an extremely long time to preprocess my data.
It seems to process 1000 examples (which it does really fast, in about 10 seconds), then it hangs for a good 1-2 minutes before moving on to the next batch of 1000 examples.
It also keeps eating up my hard drive space for some reason by creating a file named tmp1335llua that is over 300GB.
Trying to set num_proc > 1 also gives me the following error: `NameError: name 'processor' is not defined`
Please advise on how I could optimise this.
Steps to reproduce the bug
In general, I have been using map as per normal. Here is a snippet of my code:
Expected behavior
Based on the speed at which map processes each batch of examples, I would expect the full mapping to complete in 5-6 hours.
However, because it hangs every 1000 examples, I roughly estimate it would instead take about 40 hours!
Moreover, I can't even finish the map because it keeps exponentially eating up my hard drive space.
Environment info
- `datasets` version: 2.18.0
- `huggingface_hub` version: 0.22.2
- `fsspec` version: 2024.2.0