Investigate using HF Datasets class for splitting (and more!) #72

Open
mtauraso opened this issue Sep 24, 2024 · 0 comments

Hugging Face (HF) seems to have a dataset definition library that I think we could leverage both for the immediate problem of splitting data and for some longer-term purposes.
HF not only defines a scheme for dataset splits, but also supports things like progressive downloading of a dataset and making the dataset available to both TensorFlow-based and PyTorch-based code.
We can either use HF's off-the-shelf images-in-a-folder format (`ImageFolder`) or define our own loading scheme that reads our existing metadata files:
https://huggingface.co/docs/datasets/en/image_dataset#imagefolder
https://huggingface.co/docs/datasets/en/image_dataset#legacy-loading-script
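
As a rough sketch of the `ImageFolder` route: that format looks for a `metadata.csv` (with a `file_name` column) sitting next to the images, so one option is to export our existing metadata into that layout. The record fields below (`object_id`, `label`) and the helper name `write_imagefolder_metadata` are made up for illustration; they are not fibad's actual metadata schema.

```python
import csv
import tempfile
from pathlib import Path


def write_imagefolder_metadata(image_dir, records):
    """Write a metadata.csv next to the images, in the layout ImageFolder reads.

    `records` is a hypothetical stand-in for rows from our existing metadata:
    each row is a dict with a "file_name" key plus any extra columns.
    """
    image_dir = Path(image_dir)
    # ImageFolder expects a `file_name` column; put it first, extras after.
    extra = sorted({key for row in records for key in row} - {"file_name"})
    fieldnames = ["file_name"] + extra
    out_path = image_dir / "metadata.csv"
    with out_path.open("w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)
    return out_path


# Hypothetical example rows (not real fibad metadata).
records = [
    {"file_name": "obj_0001.png", "object_id": "0001", "label": "spiral"},
    {"file_name": "obj_0002.png", "object_id": "0002", "label": "elliptical"},
]

with tempfile.TemporaryDirectory() as tmp:
    metadata_path = write_imagefolder_metadata(tmp, records)
    header = metadata_path.read_text().splitlines()[0]
```

With a file like that in place, `load_dataset("imagefolder", data_dir=...)` from the `datasets` package should pick up the extra columns alongside each image, though I haven't tried this against our data yet.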
Splits in this library appear to be implemented by splitting metadata rather than by creating separate folders. We already have metadata and keep all images in one folder, so we have the bones of this system already.
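
To make the metadata-splitting idea concrete, here is a minimal standard-library sketch that partitions metadata rows into named splits while the images stay in one folder. It is not how HF implements splits internally; the function name `split_metadata`, the split names, and the fractions are assumptions for illustration.

```python
import random


def split_metadata(rows, fractions, seed=0):
    """Partition metadata rows into named splits without touching image files.

    `fractions` maps split name -> fraction of rows; the last split listed
    receives whatever rows remain, so every row lands in exactly one split.
    """
    rows = list(rows)
    # A seeded shuffle keeps the split reproducible across runs.
    random.Random(seed).shuffle(rows)
    names = list(fractions)
    splits = {}
    start = 0
    for index, name in enumerate(names):
        if index == len(names) - 1:
            end = len(rows)
        else:
            end = start + int(len(rows) * fractions[name])
        splits[name] = rows[start:end]
        start = end
    return splits


# Hypothetical metadata rows; only the file name matters for this sketch.
rows = [{"file_name": f"obj_{i:04d}.png"} for i in range(10)]
splits = split_metadata(rows, {"train": 0.8, "validation": 0.1, "test": 0.1})
```

If we adopt the HF library, `Dataset.train_test_split` would presumably replace a helper like this; the point is only that a split is a partition of metadata rows, not a directory layout.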
Here's where they keep the code: https://github.com/huggingface/datasets/tree/v2.21.0-release
It's also my hope that exploring this library will pave the way for folks using fibad to easily upload their datasets to HF and share them with other researchers.
