
[feat] Move dataset card creation to method for easier overriding #6988

Open

wants to merge 2 commits into base: main
Conversation

@tomaarsen (Member) commented Jun 20, 2024

Hello!

Pull Request overview

  • Move dataset card creation to method for easier overriding

Details

It's common for me to fully automatically download, reformat, and upload a dataset (e.g. see https://huggingface.co/datasets?other=sentence-transformers), but one aspect that I cannot easily automate is the dataset card generation. This is because during push_to_hub, the dataset card is created in 3 lines of code in a much larger method. To automatically generate a dataset card, I need to either:

  1. Subclass Dataset/DatasetDict and copy the entire push_to_hub method just to override the ~3 lines used to generate the dataset card. This is not viable, as the method is likely to change over time.
  2. Use push_to_hub normally, then separately download the pushed (but empty) dataset card, update it, and re-upload the modified card. This works, but prevents me from returning a Dataset to my users that automatically gets a nice dataset card when pushed.
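Workflow 2 can be sketched with huggingface_hub's card API. This is only a sketch: `update_pushed_card` and `append_section` are hypothetical helper names, and it assumes the dataset has already been pushed to `repo_id`.

```python
def append_section(card_text: str, extra_markdown: str) -> str:
    """Pure helper: append a custom section to the card body text."""
    return card_text.rstrip() + "\n\n" + extra_markdown + "\n"


def update_pushed_card(repo_id: str, extra_markdown: str) -> None:
    """Download the auto-generated (mostly empty) card, extend it, re-upload it."""
    # Deferred import so the sketch can be read without huggingface_hub installed.
    from huggingface_hub import DatasetCard

    card = DatasetCard.load(repo_id)  # fetches README.md from the Hub
    card.text = append_section(card.text, extra_markdown)
    card.push_to_hub(repo_id, repo_type="dataset")
```

The downside, as described above, is the extra round-trips to the Hub and the need to call a second function after push_to_hub.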

So, in this PR I'm proposing to move the dataset card generation into a separate method so that it can be overridden more easily. For example, imagine the following use case:

import json
from typing import Any, Dict, Optional
from datasets import Dataset, load_dataset
from datasets.info import DatasetInfosDict, DatasetInfo
from datasets.utils.metadata import MetadataConfigs
from huggingface_hub import DatasetCardData, DatasetCard

TEMPLATE = r"""---
{dataset_card_data}
---

# Dataset Card for {source_dataset_name} with mined hard negatives

This dataset is a collection of {column_one}-{column_two}-negative triplets from the {source_dataset_name} dataset. See [{source_dataset_name}](https://huggingface.co/datasets/{source_dataset_id}) for additional information. This dataset can be used directly with Sentence Transformers to train embedding models.

## Mining Parameters
The negative samples have been mined using the following parameters:
- `range_min`: {range_min}, i.e. we skip the {range_min} most similar samples
- `range_max`: {range_max}, i.e. we only look at the top {range_max} most similar samples
- `margin`: {margin}, i.e. we require negative similarity + margin < positive similarity, so negative samples can't be more similar than the known true answer
- `sampling_strategy`: {sampling_strategy}, i.e. whether to randomly sample from the candidate negatives or take the "top" negatives
- `num_negatives`: {num_negatives}, i.e. we mine {num_negatives} negatives per question-answer pair

## Dataset Format
- Columns: {column_one}, {column_two}, negative
- Column types: str, str, str
- Example:
```python
{example}
```
"""

class HNMDataset(Dataset):
    @classmethod
    def from_dict(cls, *args, mining_kwargs: Dict[str, Any], **kwargs) -> "HNMDataset":
        dataset = super().from_dict(*args, **kwargs)
        dataset.mining_kwargs = mining_kwargs
        return dataset
    
    def _create_dataset_card(
        self,
        dataset_card_data: DatasetCardData,
        dataset_card: Optional[DatasetCard],
        config_name: str,
        info_to_dump: DatasetInfo,
        metadata_config_to_dump: MetadataConfigs,
    ) -> DatasetCard:
        if dataset_card:
            return dataset_card

        DatasetInfosDict({config_name: info_to_dump}).to_dataset_card_data(dataset_card_data)
        MetadataConfigs({config_name: metadata_config_to_dump}).to_dataset_card_data(dataset_card_data)
        dataset_card_data.tags = ["sentence-transformers"]
        dataset_name = self.mining_kwargs["source_dataset"].info.dataset_name
        # Very messy, just as an example:
        dataset_id = list(self.mining_kwargs["source_dataset"].info.download_checksums.keys())[0].removeprefix("hf://datasets/").split("@")[0]
        content = TEMPLATE.format(**{
            "dataset_card_data": str(dataset_card_data),
            "source_dataset_name": dataset_name,
            "source_dataset_id": dataset_id,
            "range_min": self.mining_kwargs["range_min"],
            "range_max": self.mining_kwargs["range_max"],
            "margin": self.mining_kwargs["margin"],
            "sampling_strategy": self.mining_kwargs["sampling_strategy"],
            "num_negatives": self.mining_kwargs["num_negatives"],
            "column_one": self.column_names[0],
            "column_two": self.column_names[1],
            "example": json.dumps(self[0], indent=4),
        })
        return DatasetCard(content)

source_dataset = load_dataset("sentence-transformers/gooaq", split="train[:100]")
dataset = HNMDataset.from_dict({
    "query": source_dataset["question"],
    "answer": source_dataset["answer"],
    # "negative": ... <- In my case, this column would be 'mined' automatically with these parameters
}, mining_kwargs={
    "range_min": 10,
    "range_max": 20,
    "max_score": 0.9,
    "margin": 0.1,
    "sampling_strategy": "random",
    "num_negatives": 3,
    "source_dataset": source_dataset,
})
dataset.push_to_hub("tomaarsen/mining_demo", private=True)

In this script, I've created a subclass that stores some additional information about how the dataset was generated. It's a bit hacky (e.g. setting a mining_kwargs attribute in from_dict that was never declared in __init__, but that's a consequence of the from_... methods not accepting extra kwargs). Still, it lets me write a "hard negatives mining" function that returns a dataset which people can use locally like normal, and if they choose to upload it, it will automatically include some useful information, e.g.: https://huggingface.co/datasets/tomaarsen/mining_demo

This allows others to actually find this dataset (e.g. via the sentence-transformers tag) and get an idea of its quality, source, etc. by looking at the dataset card.

Note

I'm not fixed on this solution whatsoever: I am also completely fine with other solutions, e.g. a dataset.set_dataset_card_creator method that allows me to provide a function without even having to subclass anything. I'm open to all ideas :)

cc @albertvillanova @lhoestq
cc @LysandreJik

  • Tom Aarsen

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@lhoestq (Member) commented Jun 20, 2024

Dataset objects are not made to be subclassed, so I don't think going in that direction is a good idea. In particular, there are no tests to make sure it works well, and nothing in the internals was designed to anticipate this use case.

I'd suggest using a separate function to push changes to the dataset card and calling it after push_to_hub(). That way, people can reuse similar logic with tools other than datasets. You can also use composition instead of subclassing.
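The composition approach suggested here could look roughly like the following; `render_card` and `push_with_card` are hypothetical names for illustration, not part of the datasets API.

```python
def render_card(template: str, **fields) -> str:
    """Fill a dataset card template; a pure function, easy to test."""
    return template.format(**fields)


def push_with_card(dataset, repo_id: str, card_content: str, private: bool = False) -> None:
    """Compose instead of subclass: push the dataset, then overwrite the generated card."""
    from huggingface_hub import DatasetCard  # deferred optional dependency

    dataset.push_to_hub(repo_id, private=private)
    DatasetCard(card_content).push_to_hub(repo_id, repo_type="dataset")
```

With this, the library owns the card-generation logic in a plain function, and any Dataset (not just a subclass) can be pushed with a custom card.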

@tomaarsen (Member, Author)

Would you consider an alternative where a Dataset instance carries a dataset card template which can be updated?

I don't want to burden my users with having to call another method after push_to_hub themselves. If you're not a fan of the template approach above either, then I'll likely override push_to_hub to once again download the just-uploaded-but-empty dataset card, update it, and re-upload it. That's a few more requests than necessary, but not a big deal overall.

  • Tom Aarsen

@lhoestq (Member) commented Jun 21, 2024

Actually, I find the idea of overriding _create_dataset_card better than implementing templating logic. My main concern is that if we go in that direction, we had better make sure that subclasses of Dataset work well.

Well, if it's been working fine on your side, why not. But make sure you properly test the features that might not work because of subclassing (e.g. I'm pretty sure map() won't return your subclass of Dataset), or at least the ones that matter for your library.

If it sounds good to you I'm fine with merging your addition to let you override the dataset card.

@tomaarsen (Member, Author)

e.g. I'm pretty sure map() won't return your subclass of Dataset

I understand that there are limitations such as this one. The subclass doesn't have to be robust: I'd just like some simple automatic dataset card generation directly after generating the dataset. That behaviour can be lost if the user performs additional steps before pushing the dataset, e.g. mapping, filtering, or saving to disk and uploading the reloaded dataset.
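The map() caveat mentioned above typically arises because base-class methods construct their results with the base class explicitly. A minimal stand-in (Base/Sub are illustrative classes, not datasets code) shows how the subclass type gets lost:

```python
class Base:
    """Stand-in for Dataset: map() hard-codes the Base constructor."""

    def __init__(self, rows):
        self.rows = rows

    def map(self, fn):
        # Returns Base(...), not type(self)(...), so any subclass type is lost.
        return Base([fn(r) for r in self.rows])


class Sub(Base):
    """Stand-in for a Dataset subclass like HNMDataset."""


out = Sub([1, 2]).map(lambda r: r * 2)
print(type(out).__name__)  # Base
```

Any extra attributes set on the subclass instance (like mining_kwargs) would also be gone on the mapped result.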

If it sounds good to you I'm fine with merging your addition to let you override the dataset card.

That would be quite useful for me! I appreciate it.

I'm not sure what's causing the test failures; I believe the only change in behaviour is that

        DatasetInfosDict({config_name: info_to_dump}).to_dataset_card_data(dataset_card_data)
        MetadataConfigs({config_name: metadata_config_to_dump}).to_dataset_card_data(dataset_card_data)

are not called when dataset_card is already defined. Unless these calls have side effects other than updating dataset_card_data, the behaviour shouldn't differ from main.

  • Tom Aarsen

@lhoestq (Member) commented Jun 21, 2024

Let's try to get this PR merged then!

IMO your current implementation can be improved, since you pass both the dataset card data and the dataset card itself, which is redundant. I also anticipate that the CI failures come from your default implementation not matching what the code did before.

Unless these have side-effects other than updating dataset_card_data, it shouldn't be any different than main.

Indeed, dataset_card_data is the data attribute of the dataset_card created a few lines before your changes, so yes, modifying it modifies the dataset_card object too.
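That aliasing can be illustrated with minimal stand-ins (FakeCardData/FakeCard are illustrative classes, not the real huggingface_hub types):

```python
class FakeCardData:
    """Stand-in for DatasetCardData."""

    def __init__(self):
        self.tags = []


class FakeCard:
    """Stand-in for DatasetCard; exposes its data object as .data."""

    def __init__(self):
        self.data = FakeCardData()


card = FakeCard()
card_data = card.data  # same object as card.data, not a copy
card_data.tags.append("sentence-transformers")
print(card.data.tags)  # ['sentence-transformers']: the mutation is visible through the card
```

So skipping the to_dataset_card_data calls when dataset_card is provided also changes what ends up in the card itself, which may explain the CI differences.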
