Issue with map #6789

Open
Nsohko opened this issue Apr 7, 2024 · 8 comments

@Nsohko

Nsohko commented Apr 7, 2024

Describe the bug

Map has been taking an extremely long time to preprocess my data.

It seems to process 1000 examples really fast (in about 10 seconds), then it hangs for a good 1-2 minutes before it moves on to the next batch of 1000 examples.

It also keeps eating up my hard drive space for some reason by creating a file named tmp1335llua that is over 300GB.

Trying to set num_proc to be >1 also gives me the following error: NameError: name 'processor' is not defined

Please advise on how I could optimise this?

Steps to reproduce the bug

In general, I have been using map as per normal. Here is a snippet of my code:

###########################        DATASET LOADING AND PREP        #########################

from datasets import Audio, DatasetDict, concatenate_datasets, load_from_disk

def load_custom_dataset(split):
    ds = []
    if split == 'train':
        for dset in args.train_datasets:
            ds.append(load_from_disk(dset))
    if split == 'test':
        for dset in args.test_datasets:
            ds.append(load_from_disk(dset))

    ds_to_return = concatenate_datasets(ds)
    ds_to_return = ds_to_return.shuffle(seed=22)
    return ds_to_return



def prepare_dataset(batch):
    # load and (possibly) resample audio data to 16kHz
    audio = batch["audio"]

    # compute log-Mel input features from input audio array
    batch["input_features"] = processor.feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]
    # compute input length of audio sample in seconds
    batch["input_length"] = len(audio["array"]) / audio["sampling_rate"]

    # optional pre-processing steps
    transcription = batch["sentence"]
    if do_lower_case:
        transcription = transcription.lower()
    if do_remove_punctuation:
        transcription = normalizer(transcription).strip()

    # encode target text to label ids
    batch["labels"] = processor.tokenizer(transcription).input_ids
    return batch

print('DATASET PREPARATION IN PROGRESS...')

# case 3: combine_and_shuffle is true, only train provided
# load train datasets
train_set = load_custom_dataset('train')

# split dataset
raw_dataset = DatasetDict()
raw_dataset = train_set.train_test_split(test_size = args.test_size, shuffle=True, seed=42)

raw_dataset = raw_dataset.cast_column("audio", Audio(sampling_rate=args.sampling_rate))

print("Before Map:")
print(raw_dataset)

raw_dataset = raw_dataset.map(prepare_dataset, num_proc=1)

print("After Map:")
print(raw_dataset)

Expected behavior

Based on the speed at which map is processing examples, I would expect the full mapping to complete in about 5-6 hours.

However, because it hangs every 1000 examples, I instead roughly estimate it would take about 40 hours!

Moreover, I can't even finish the map because it keeps rapidly eating up my hard drive space.

Environment info

  • datasets version: 2.18.0
  • Platform: Windows-10-10.0.22631-SP0
  • Python version: 3.10.14
  • huggingface_hub version: 0.22.2
  • PyArrow version: 15.0.2
  • Pandas version: 2.2.1
  • fsspec version: 2024.2.0
@Modexus
Contributor

Modexus commented Apr 8, 2024

Default writer_batch_size is set to 1000 (see map).
The "tmp1335llua" is probably the temp file it creates while writing to disk.
Maybe try lowering the writer_batch_size.

For multi-processing, you should probably pass the processor as an argument to the function (with e.g. functools.partial) or create it inside the function, so that the sub-processes have access to it, and maybe add an if __name__ == "__main__" guard (not sure that's necessary?), roughly as in the sketch below.
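A minimal sketch of that suggestion (hedged: the checkpoint name, dataset path, num_proc and writer_batch_size values are placeholders, and the function reuses the fields from the snippet above):

from functools import partial

from datasets import load_from_disk
from transformers import WhisperProcessor


def prepare_dataset(batch, processor):
    audio = batch["audio"]
    # compute log-Mel input features from the input audio array
    batch["input_features"] = processor.feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    # encode target text to label ids
    batch["labels"] = processor.tokenizer(batch["sentence"]).input_ids
    return batch


if __name__ == "__main__":
    # create the processor in the main process and hand it to the workers via partial
    processor = WhisperProcessor.from_pretrained("openai/whisper-small")  # placeholder checkpoint
    raw_dataset = load_from_disk("path/to/prepared_dataset")              # placeholder path
    raw_dataset = raw_dataset.map(
        partial(prepare_dataset, processor=processor),
        num_proc=4,              # example value
        writer_batch_size=100,   # flush smaller write batches to the temp file
    )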

@Nsohko
Author

Nsohko commented Apr 8, 2024

Hi @Modexus,

Thank you very much for the help! Yep, after playing around with map, I managed to get the parallel processing to work by implementing it as you suggested.

Regarding the temp files, they just keep growing in size as the map continues. Once map finishes, the temp files are deleted, but the output is instead saved as cache .arrow files. These cache files are absolutely gigantic (~30-50x the size of the initial dataset!).

After playing around with the prepare_dataset() function above, it seems this issue is caused by the following line in the function, where the log-Mel spectrogram of the audio is calculated:

# compute log-Mel input features from input audio array
batch["input_features"] = processor.feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]

When I remove this line, the final cache files are approximately the same size as the initial dataset.
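As an aside (not from the original thread): the Whisper feature extractor pads or truncates every clip to 30 seconds and returns a fixed-size log-Mel array of roughly 80 × 3000 floats, so each example stores on the order of 1-2 MB of features regardless of the original clip length, which would be consistent with the 30-50x growth. A quick way to check what is being written, assuming the mapped raw_dataset from the snippet above:

import numpy as np

# inspect the features written for one mapped example
feats = np.asarray(raw_dataset["train"][0]["input_features"])
print(feats.shape)                            # expected to be roughly (80, 3000)
print(feats.nbytes / 1e6, "MB per example")   # ~1 MB at float32, ~2 MB at float64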

Can I check whether this is expected behavior with the Whisper feature extractor? I can't imagine the spectrograms are that large!

Thank you so much for the help!

@gibsonpil

I'm having a similar issue with the spectrograms taking up an incredibly large amount of space (e.g. 100 GB for 3 GB of audio). Is this really normal behavior?

@gibsonpil

Upon taking a look at the hex contents of the mapped dataset files, I found that the overwhelming majority of the data contained within them was duplicated junk similar to this. I'm not very familiar with the inner workings of AI, but I have to assume this is an inefficient way of storing data at best and a bug at worst.
[screenshot of the hex dump omitted]

@zqhi71

zqhi71 commented Jun 18, 2024

Same problem: dataset.map takes a long time to process 12 GB of raw audio data and creates a 200 GB cache file. Is there a way to run the processing (map) during training, instead of running it once and saving a cache file?

@eufrizz

eufrizz commented Jul 17, 2024

Same issue here. Just trying to normalise image data for a 300 MB dataset ends up with an 11 GB cache. The initial .map() call takes 80 s over the 15,000 images, but then simply iterating over the dataset takes almost 2 minutes. It should be doing no processing here! Something seems wrong.
keep_in_memory=True also offers no speedup.
EDIT: Running the normalisation with set_transform (i.e. on the fly) iterates through the dataset in 18 s. With no normalisation it takes around 14 s. No reason for .map() to take 5 minutes!
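A minimal sketch of that on-the-fly approach for image normalisation (hedged: dataset, the column names, and the 0-1 scaling are placeholders, not taken from the thread):

import numpy as np

def normalise(batch):
    # set_transform hands over a batch, so each column is a list of values
    batch["pixel_values"] = [np.asarray(img, dtype=np.float32) / 255.0 for img in batch["image"]]
    return batch

# applied lazily whenever examples are accessed; nothing is written to the cache
dataset.set_transform(normalise)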

@VafaKnm

VafaKnm commented Jul 23, 2024

@eufrizz How do you handle this using set_transform?
I have a really big dataset (1.2 TB) that I am going to use for fine-tuning a Whisper model. If I use map for the dataset-preparation function, it will take over 20 days!

@eufrizz

eufrizz commented Jul 23, 2024

@eufrizz How do you handle this using set_transform?
I have a really big dataset (1.2 TB) that I am going to use for fine-tuning a Whisper model. If I use map for the dataset-preparation function, it will take over 20 days!

Just give the preprocessing function you were using for map to set_transform; have a look at the set_transform documentation and the sketch below. If you're going to do lots of epochs, you might be better off saving the preprocessed data into a new dataset.
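A rough sketch of that approach for the Whisper preparation function from earlier in the thread (hedged: it assumes the processor object is already created as in the earlier sketch, and note that set_transform passes batches, so each column arrives as a list):

from functools import partial

def prepare_batch(batch, processor):
    audio = batch["audio"]
    batch["input_features"] = processor.feature_extractor(
        [a["array"] for a in audio],
        sampling_rate=audio[0]["sampling_rate"],
    ).input_features
    batch["labels"] = processor.tokenizer(batch["sentence"]).input_ids
    return batch

# computed on the fly at access time, so no giant cache files are written
raw_dataset.set_transform(partial(prepare_batch, processor=processor))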
