[feat] Move dataset card creation to method for easier overriding #6988
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
I'd suggest using a separate function to push changes to the Dataset card, and call it after
Would you consider an alternative where a Dataset instance carries a dataset card template which can be updated? I don't want to burden my users with having to call another method after
Actually I find the idea of overriding Well if it's been working fine on your side why not, but make sure you test correctly features that could not work because of subclassing (e.g. I'm pretty sure If it sounds good to you I'm fine with merging your addition to let you override the dataset card.
I understand that there are limitations such as this one. The subclass doesn't have to be robust - I'd just like some simple automatic dataset card generation options directly after generating the dataset. This can be removed if the user does additional steps before pushing the dataset, e.g. mapping, filtering, saving to disk and uploading the loaded dataset, etc.
That would be quite useful for me! I appreciate it. I'm not sure what causes the test failures; I believe the only change in behaviour is that
DatasetInfosDict({config_name: info_to_dump}).to_dataset_card_data(dataset_card_data)
MetadataConfigs({config_name: metadata_config_to_dump}).to_dataset_card_data(dataset_card_data)
are not called when
Let's try to have this PR merged then! IMO your current implementation can be improved since you pass both the dataset card data and the dataset card itself, which is redundant. Also I anticipate the failures in the CI to come from your default implementation which doesn't correspond to what it was doing before
Indeed the dataset_card_data is the value from attribute of the dataset_card from a few lines before your changes, so yes it modifies the dataset_card object too.
Hello!
Pull Request overview
Details
It's common for me to fully automatically download, reformat, and upload a dataset (e.g. see https://huggingface.co/datasets?other=sentence-transformers), but one aspect that I cannot easily automate is the dataset card generation. This is because during push_to_hub, the dataset card is created in 3 lines of code in a much larger method. To automatically generate a dataset card, I need to either:
1. Subclass Dataset/DatasetDict and copy the entire push_to_hub method to override the ~3 lines used to generate the dataset card. This is not viable as the method is likely to change over time.
2. Call push_to_hub normally, then separately download the pushed (but empty) dataset card, update it, and reupload the modified dataset. This works fine, but prevents me from being able to return a Dataset to my users which will automatically use a nice dataset card.

So, in this PR I'm proposing to move the dataset card generation into another method so that it can be overridden more easily. For example, imagine the following use case:
In this script, I've created a subclass which stores some additional information about how the dataset was generated. It's a bit hacky (e.g. setting a mining_kwargs parameter in from_dict that wasn't created in __init__, but that's just a consequence of how the from_... methods don't accept kwargs), but it allows me to create a "hard negatives mining" function that returns a dataset which people can use locally like normal, but if they choose to upload it, then it'll automatically include some information, e.g.: https://huggingface.co/datasets/tomaarsen/mining_demo

This allows others to actually find this dataset (e.g. via the sentence-transformers tag) and get an idea of the quality, source, etc. by looking at the model card.

Note
I'm not fixed on this solution whatsoever: I am also completely fine with other solutions, e.g. a dataset.set_dataset_card_creator method that allows me to provide a function without even having to subclass anything. I'm open to all ideas :)

cc @albertvillanova @lhoestq
cc @LysandreJik
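
For illustration only, here is a minimal sketch of the two options discussed in the description: an overridable card-creation method, and a callback set without subclassing. It does not use the real datasets library. ToyDataset, MinedDataset, create_dataset_card, the example mining_kwargs values, and the fake push_to_hub (which just returns what it would upload) are all hypothetical stand-ins. set_dataset_card_creator borrows the name floated in the note above, but its behaviour here is an assumption.

```python
class ToyDataset:
    """Stand-in for a dataset class; NOT the real datasets.Dataset."""

    def __init__(self, rows):
        self.rows = rows
        # Optional user-supplied function (the "no subclassing" alternative).
        self._card_creator = None

    def set_dataset_card_creator(self, creator):
        """Hypothetical: register a function that builds the card text."""
        self._card_creator = creator

    def create_dataset_card(self):
        """Hook that subclasses can override to customise the card."""
        if self._card_creator is not None:
            return self._card_creator(self)
        return f"# Dataset\n\n{len(self.rows)} rows."

    def push_to_hub(self, repo_id):
        """Hypothetical upload: returns the payload instead of uploading."""
        # The larger upload logic would live here, untouched by subclasses;
        # only the card text comes from the overridable hook.
        return {"repo_id": repo_id, "README.md": self.create_dataset_card()}


class MinedDataset(ToyDataset):
    """Subclass that remembers how the data was generated."""

    def __init__(self, rows, mining_kwargs):
        super().__init__(rows)
        self.mining_kwargs = mining_kwargs  # hypothetical extra metadata

    def create_dataset_card(self):
        # Override only the card creation, not the whole push_to_hub.
        lines = ["# Mined hard negatives", "", "Mining parameters:"]
        lines += [f"- {k}: {v}" for k, v in sorted(self.mining_kwargs.items())]
        return "\n".join(lines)
```

With this shape, a function can return a MinedDataset that behaves like a normal dataset locally and only produces the richer card if the user decides to push it. The callback variant gives the same result without a subclass, at the cost of carrying a function reference on the instance.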