Issue with map #6789

Open
Nsohko opened this issue Apr 7, 2024 · 8 comments

@Nsohko

Nsohko commented Apr 7, 2024

Describe the bug

Map has been taking an extremely long time to preprocess my data.

It seems to process 1000 examples really fast (in about 10 seconds), then it hangs for a good 1-2 minutes before it moves on to the next batch of 1000 examples.

It also keeps eating up my hard drive space for some reason by creating a file named tmp1335llua that is over 300GB.

Trying to set num_proc to be >1 also gives me the following error: NameError: name 'processor' is not defined

Please advise on how I could optimise this?

Steps to reproduce the bug

In general, I have been using map as per normal. Here is a snippet of my code:

###########################        DATASET LOADING AND PREP        #########################

from datasets import Audio, DatasetDict, concatenate_datasets, load_from_disk

def load_custom_dataset(split):
    ds = []
    if split == 'train':
        for dset in args.train_datasets:
            ds.append(load_from_disk(dset))
    if split == 'test':
        for dset in args.test_datasets:
            ds.append(load_from_disk(dset))

    ds_to_return = concatenate_datasets(ds)
    ds_to_return = ds_to_return.shuffle(seed=22)
    return ds_to_return



def prepare_dataset(batch):
    # load and (possibly) resample audio data to 16kHz
    audio = batch["audio"]

    # compute log-Mel input features from input audio array
    batch["input_features"] = processor.feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]
    # compute input length of audio sample in seconds
    batch["input_length"] = len(audio["array"]) / audio["sampling_rate"]

    # optional pre-processing steps
    transcription = batch["sentence"]
    if do_lower_case:
        transcription = transcription.lower()
    if do_remove_punctuation:
        transcription = normalizer(transcription).strip()

    # encode target text to label ids
    batch["labels"] = processor.tokenizer(transcription).input_ids
    return batch

print('DATASET PREPARATION IN PROGRESS...')

# case 3: combine_and_shuffle is true, only train provided
# load train datasets
train_set = load_custom_dataset('train')

# split dataset
raw_dataset = DatasetDict()
raw_dataset = train_set.train_test_split(test_size = args.test_size, shuffle=True, seed=42)

raw_dataset = raw_dataset.cast_column("audio", Audio(sampling_rate=args.sampling_rate))

print("Before Map:")
print(raw_dataset)

raw_dataset = raw_dataset.map(prepare_dataset, num_proc=1)

print("After Map:")
print(raw_dataset)

Expected behavior

Based on the speed at which map is processing examples, I would expect the full mapping to complete in about 5-6 hours.

However, because it hangs every 1000 examples, I instead roughly estimate it would take about 40 hours!

Moreover, I can't even finish the map because it keeps rapidly eating up my hard drive space.

Environment info

  • datasets version: 2.18.0
  • Platform: Windows-10-10.0.22631-SP0
  • Python version: 3.10.14
  • huggingface_hub version: 0.22.2
  • PyArrow version: 15.0.2
  • Pandas version: 2.2.1
  • fsspec version: 2024.2.0
@Modexus
Contributor

Modexus commented Apr 8, 2024

Default writer_batch_size is set to 1000 (see map).
The "tmp1335llua" is probably the temp file it creates while writing to disk.
Maybe try lowering the writer_batch_size.

For multi-processing, you should probably pass the processor as an argument to the function (with e.g. functools.partial) or create it inside the function, so that the sub-processes have access to it, and maybe add an if __name__ == "__main__" guard (not sure that's necessary?), roughly as in the sketch below.
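A minimal sketch of that suggestion (hedged: the checkpoint name, dataset path, num_proc and writer_batch_size values are placeholders, and the function reuses the fields from the snippet above):

from functools import partial

from datasets import load_from_disk
from transformers import WhisperProcessor


def prepare_dataset(batch, processor):
    audio = batch["audio"]
    # compute log-Mel input features from the input audio array
    batch["input_features"] = processor.feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    # encode target text to label ids
    batch["labels"] = processor.tokenizer(batch["sentence"]).input_ids
    return batch


if __name__ == "__main__":
    # create the processor in the main process and hand it to the workers via partial
    processor = WhisperProcessor.from_pretrained("openai/whisper-small")  # placeholder checkpoint
    raw_dataset = load_from_disk("path/to/prepared_dataset")              # placeholder path
    raw_dataset = raw_dataset.map(
        partial(prepare_dataset, processor=processor),
        num_proc=4,              # example value
        writer_batch_size=100,   # flush smaller write batches to the temp file
    )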

@Nsohko
Author

Nsohko commented Apr 8, 2024

Hi @Modexus,

Thank you very much for the help! Yep, after playing around with map, I managed to get the parallel processing to work by implementing it as you suggested.

Regarding the temp files, they just keep growing in size as the map continues. Once map finishes, the temp files are deleted, but the output is instead saved as cache .arrow files. These cache files are absolutely gigantic (~30-50x the size of the initial dataset!).

After playing around with the prepare_dataset() function above, it seems this issue is caused by the following line in the function, where the log-Mel spectrogram of the audio is calculated:

# compute log-Mel input features from input audio array
batch["input_features"] = processor.feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]

When I remove this line, the final cache files are approximately the same size as the initial dataset.
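As an aside (not from the original thread): the Whisper feature extractor pads or truncates every clip to 30 seconds and returns a fixed-size log-Mel array of roughly 80 × 3000 floats, so each example stores on the order of 1-2 MB of features regardless of the original clip length, which would be consistent with the 30-50x growth. A quick way to check what is being written, assuming the mapped raw_dataset from the snippet above:

import numpy as np

# inspect the features written for one mapped example
feats = np.asarray(raw_dataset["train"][0]["input_features"])
print(feats.shape)                            # expected to be roughly (80, 3000)
print(feats.nbytes / 1e6, "MB per example")   # ~1 MB at float32, ~2 MB at float64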

Can I check whether this is expected behavior with the Whisper feature extractor? I can't imagine the spectrograms are that large!

Thank you so much for the help!

@gibsonpil

I'm having a similar issue with the spectrograms taking up an incredibly large amount of space (e.g. 100 GB for 3 GB of audio). Is this really normal behavior?

@gibsonpil

Upon taking a look at the hex contents of the mapped dataset files, I found that the overwhelming majority of the data contained within them was duplicated junk similar to this. I'm not very familiar with the inner workings of AI, but I have to assume this is an inefficient way of storing data at best and a bug at worst.
[screenshot of the hex dump omitted]

@zqhi71

zqhi71 commented Jun 18, 2024

Same problem: dataset.map takes a long time to process 12 GB of raw audio data and creates a 200 GB cache file. Is there a way to run the processing (map) during training, instead of running it once and saving a cache file?

@eufrizz

eufrizz commented Jul 17, 2024

Same issue here. Just trying to normalise image data for a 300 MB dataset ends up with an 11 GB cache. The initial .map() call takes 80 s over the 15,000 images, but then simply iterating over the dataset takes almost 2 minutes. It should be doing no processing here! Something seems wrong.
keep_in_memory=True also offers no speedup.
EDIT: Running the normalisation with set_transform (i.e. on the fly) iterates through the dataset in 18 s. With no normalisation it takes around 14 s. No reason for .map() to take 5 minutes!
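A minimal sketch of that on-the-fly approach for image normalisation (hedged: dataset, the column names, and the 0-1 scaling are placeholders, not taken from the thread):

import numpy as np

def normalise(batch):
    # set_transform hands over a batch, so each column is a list of values
    batch["pixel_values"] = [np.asarray(img, dtype=np.float32) / 255.0 for img in batch["image"]]
    return batch

# applied lazily whenever examples are accessed; nothing is written to the cache
dataset.set_transform(normalise)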

@VafaKnm

VafaKnm commented Jul 23, 2024

@eufrizz How do you handle this using set_transform?
I have a really big dataset (1.2 TB) that I am going to use for fine-tuning a Whisper model. If I use map for the dataset-preparation function, it will take over 20 days!

@eufrizz

eufrizz commented Jul 23, 2024

@eufrizz How do you handle this using set_transform?
I have a really big dataset (1.2 TB) that I am going to use for fine-tuning a Whisper model. If I use map for the dataset-preparation function, it will take over 20 days!

Just give the preprocessing function you were using for map to set_transform; have a look at the set_transform documentation and the sketch below. If you're going to do lots of epochs, you might be better off saving the preprocessed data into a new dataset.
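A rough sketch of that approach for the Whisper preparation function from earlier in the thread (hedged: it assumes the processor object is already created as in the earlier sketch, and note that set_transform passes batches, so each column arrives as a list):

from functools import partial

def prepare_batch(batch, processor):
    audio = batch["audio"]
    batch["input_features"] = processor.feature_extractor(
        [a["array"] for a in audio],
        sampling_rate=audio[0]["sampling_rate"],
    ).input_features
    batch["labels"] = processor.tokenizer(batch["sentence"]).input_ids
    return batch

# computed on the fly at access time, so no giant cache files are written
raw_dataset.set_transform(partial(prepare_batch, processor=processor))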
