load_from_disk #7268

Open
ghaith-mq opened this issue Oct 31, 2024 · 1 comment

@ghaith-mq

Describe the bug

I have data saved with save_to_disk. The dataset is large (700 GB). The only option I have found for loading it is load_from_disk, and this function copies the data to a tmp directory, which makes me run out of disk space. Is there an alternative that avoids the copy?

Steps to reproduce the bug

Save a large dataset with save_to_disk, then try to load it with load_from_disk.

Expected behavior

load_from_disk should load the data without copying it to a tmp directory. Instead, I run out of disk space.

Environment info

latest version
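
One possible workaround, offered only as a sketch and not as a confirmed fix from the maintainers: read the Arrow shard files that save_to_disk wrote and stream them with the "arrow" builder, so that nothing has to be copied up front. The helper names list_arrow_shards and stream_saved_dataset are hypothetical. The sketch assumes the directory holds a single Dataset (not a DatasetDict) whose shards sit directly in it as *.arrow files.

```python
from pathlib import Path


def list_arrow_shards(dataset_dir):
    """Return the sorted *.arrow files found directly inside dataset_dir.

    Assumption: dataset_dir was written by Dataset.save_to_disk and the
    shards sit at the top level of that directory.
    """
    return sorted(str(p) for p in Path(dataset_dir).glob("*.arrow"))


def stream_saved_dataset(dataset_dir):
    """Open the shards as a streaming dataset instead of load_from_disk."""
    # Imported lazily so list_arrow_shards works without `datasets` installed.
    from datasets import load_dataset

    shards = list_arrow_shards(dataset_dir)
    if not shards:
        raise FileNotFoundError(f"no .arrow files found in {dataset_dir}")
    # streaming=True yields an IterableDataset that reads examples lazily.
    return load_dataset("arrow", data_files=shards, split="train", streaming=True)
```

Usage would look like `ds = stream_saved_dataset("/path/to/saved_dataset")` followed by iterating over `ds`. The trade-off is that a streaming dataset is iterated sequentially and does not support random access by index.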

Jourdelune commented Oct 31, 2024

Hello, this is an interesting issue. I have the same problem: I have a local dataset that I want to push to the Hub, but Hugging Face makes a copy of it first.

from datasets import load_dataset

dataset = load_dataset("webdataset", data_files="/media/works/data/*.tar") # copy here
dataset.push_to_hub("WaveGenAI/audios2")

Edit: I can use HfApi for my use case
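
The comment does not show how HfApi was used, so the following is only a guess at what that route might look like: uploading the .tar files as they are, without going through load_dataset. The function name upload_tar_folder and its api parameter are hypothetical; the folder path and repo id are taken from the snippet above.

```python
def upload_tar_folder(folder_path, repo_id, api=None):
    """Upload the *.tar files in folder_path to a dataset repo on the Hub.

    `api` can be any object exposing create_repo and upload_folder; when it
    is omitted, a real huggingface_hub.HfApi client is created.
    """
    if api is None:
        # Imported lazily so the function can be exercised with a stand-in client.
        from huggingface_hub import HfApi

        api = HfApi()
    # Create the dataset repo if it does not exist yet.
    api.create_repo(repo_id=repo_id, repo_type="dataset", exist_ok=True)
    # Upload the files directly from disk; only *.tar files are included.
    return api.upload_folder(
        folder_path=folder_path,
        repo_id=repo_id,
        repo_type="dataset",
        allow_patterns="*.tar",
    )
```

A call such as `upload_tar_folder("/media/works/data", "WaveGenAI/audios2")` would need a logged-in Hub token. This uploads the raw archives only; it does not convert them the way push_to_hub does.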
