Dataset.from_generator() cost much more time in vscode debugging mode then running mode #6254

dontnet-wuenze · 2023-09-23T02:07:26Z

Describe the bug

Hey there,
I'm using Dataset.from_generator() to convert a torch dataset to a Hugging Face Dataset.
However, when I debug my code in VS Code, Dataset.from_generator() runs really slowly; it can take up to 20 times longer than running the script from the terminal.

Steps to reproduce the bug

I wrote a simple test script:

import time
from typing import Callable

from torch.utils.data import Dataset as TorchDataset

from datasets import Dataset as HFDataset


class SimpleDataset(TorchDataset):
    def __init__(self, data):  
        self.data = data  
        self.keys = list(data[0].keys())
      
    def __len__(self):  
        return len(self.data)  
      
    def __getitem__(self, index):  
        sample = self.data[index]  
        return {key: sample[key] for key in self.keys}  
  

def TorchDataset2HuggingfaceDataset(
    torch_dataset: TorchDataset, cache_dir: str = None
) -> HFDataset:
    """Convert a torch dataset to a Hugging Face dataset."""

    def generator():
        # Yield the sample dicts one by one; from_generator() calls this function.
        yield from torch_dataset

    return HFDataset.from_generator(generator, cache_dir=cache_dir)

if __name__ == '__main__':
    data = [  
        {'id': 1, 'name': 'Alice'},  
        {'id': 2, 'name': 'Bob'},  
        {'id': 3, 'name': 'Charlie'}  
    ]
    
    torch_dataset = SimpleDataset(data)
    start_time = time.time() 
    huggingface_dataset = TorchDataset2HuggingfaceDataset(torch_dataset)
    end_time = time.time()
    print("time: ", end_time - start_time)
    print(huggingface_dataset)
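
As a side note on why the generator above can iterate over the dataset at all: assuming the dataset class defines no `__iter__`, Python falls back to calling `__getitem__` with 0, 1, 2, ... until an `IndexError` is raised. The sketch below shows this with a hypothetical stand-in class (`IndexOnlyDataset` is not part of torch or datasets) and uses only the standard library.

```python
class IndexOnlyDataset:
    """Hypothetical stand-in for a map-style dataset: indexing only, no __iter__."""

    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        # Indexing a list past its end raises IndexError, which ends the iteration.
        return self.data[index]


dataset = IndexOnlyDataset([{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}])

# Iteration works through the legacy sequence protocol (__getitem__ + IndexError).
samples = [sample for sample in dataset]
print(samples)
```

If `__getitem__` raised a different exception at the end of the data, the loop would not stop cleanly, so this behaviour depends on the underlying container raising `IndexError`.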

Expected behavior

On my machine, this test takes 0.086 s when run from the terminal,
but 0.25 s in debugging mode in VS Code, which I think is much longer than expected.

I'd like to know whether there is anything wrong in the code, or whether this is just due to debugging.
I traced the code and found that this is the function where it gets stuck:

def create_config_id(
    self,
    config_kwargs: dict,
    custom_features: Optional[Features] = None,
) -> str:
    ...
    # stuck in this line
    suffix = Hasher.hash(config_kwargs_to_add_to_suffix)
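
The cause is not stated in this thread. One plausible contributor, assumed here rather than confirmed, is tracing overhead: a debugger can hook execution through `sys.settrace`, which invokes a Python callback for every function call and executed line. The sketch below is a rough way to see that effect using only the standard library. `hash_like_work` and `counting_tracer` are hypothetical names, and `pickle` merely stands in for the real hashing code, so the numbers say nothing about `Hasher.hash` itself.

```python
import pickle
import sys
import time


def hash_like_work(n=2000):
    """Stand-in for hashing: serialize many small dicts in a pure-Python loop."""
    total = 0
    for i in range(n):
        total += len(pickle.dumps({"id": i, "name": str(i)}))
    return total


def timed(fn, tracer=None):
    """Run fn once, optionally with a trace function installed, and time it."""
    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        start = time.perf_counter()
        result = fn()
        elapsed = time.perf_counter() - start
    finally:
        # Restore whatever trace function was active before.
        sys.settrace(previous)
    return result, elapsed


events = {"count": 0}


def counting_tracer(frame, event, arg):
    # Invoked on each call event and, because it returns itself, on each traced line.
    events["count"] += 1
    return counting_tracer


plain_result, plain_time = timed(hash_like_work)
traced_result, traced_time = timed(hash_like_work, counting_tracer)
print(f"plain:  {plain_time:.4f} s")
print(f"traced: {traced_time:.4f} s ({events['count']} trace events)")
```

The result is the same in both runs; only the elapsed time differs. A similar check on the real script is to print `sys.gettrace()` at startup: a value other than `None` means some trace function is installed.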

Environment info

datasets version: 2.12.0
Platform: Linux-5.11.0-27-generic-x86_64-with-glibc2.31
Python version: 3.11.3
Huggingface_hub version: 0.17.2
PyArrow version: 11.0.0
Pandas version: 2.0.1


mariosasko · 2023-09-26T14:10:57Z

Answered on the forum: https://discuss.huggingface.co/t/dataset-from-generator-cost-much-more-time-in-vscode-debugging-mode-then-running-mode/56005/2

mariosasko closed this as completed Oct 3, 2023
