Provide ability to load label mappings from file #6059

Open
david-waterworth opened this issue Jul 22, 2023 · 3 comments
Labels
enhancement New feature or request

Comments

@david-waterworth

Feature request

My task is to classify a dataset with a large label set that includes a hierarchy. Even ignoring the hierarchy, I can't find an example using datasets where the label names aren't hard-coded. Hard-coding works fine for a handful of labels, but ideally there would be a way to load the name/id mappings required for datasets.features.ClassLabel from a file.

It is possible to pass a file to ClassLabel, but I cannot see an easy way to use this with GeneratorBasedBuilder: self._info is called before the dl_manager is constructed, so even if my dataset contains, say, label_mappings.json, there's no way to load it in order to construct the datasets.DatasetInfo.
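As a rough illustration of the file-based route, the sketch below turns a mapping file into the ordered name list that ClassLabel expects. The JSON layout (`{"name": id}`) and the helper names `names_from_mapping` and `write_names_file` are assumptions for illustration; the issue does not say how label_mappings.json is structured. It uses only the standard library and does not solve the ordering problem with self._info described above.

```python
import json
from pathlib import Path


def names_from_mapping(path):
    """Return label names ordered by id from a {"name": id} JSON file (assumed layout)."""
    mapping = json.loads(Path(path).read_text(encoding="utf-8"))
    # ClassLabel assigns ids by list position, so ids must be exactly 0..n-1.
    if sorted(mapping.values()) != list(range(len(mapping))):
        raise ValueError("label ids must be contiguous integers starting at 0")
    return sorted(mapping, key=mapping.get)


def write_names_file(names, path):
    """Write one label name per line, the layout ClassLabel(names_file=...) reads."""
    Path(path).write_text("\n".join(names) + "\n", encoding="utf-8")
```

The resulting list could be passed as `datasets.ClassLabel(names=names)`, or the written file as `datasets.ClassLabel(names_file=path)`, once the file is available locally.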

I can see other uses for accessing the download manager from self._info, e.g. if the files contain a schema (Arrow or Parquet files), the datasets.DatasetInfo could be inferred.

The workaround that was suggested in the forum is to generate a .py file from the label_mappings.json and import it.

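import csv

import datasets
from datasets.tasks import TextClassification

# _DESCRIPTION, _TRAIN_DOWNLOAD_URL and _TEST_DOWNLOAD_URL are assumed to be defined elsewhere in the script.

class TestDatasetBuilder(datasets.GeneratorBasedBuilder):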
    VERSION = datasets.Version("1.0.0")

    def _info(self):
        return datasets.DatasetInfo(
            description=_DESCRIPTION,
            features=datasets.Features(
                {
                    "text": datasets.Value("string"),
                    "label": datasets.features.ClassLabel(names=["label_1", "label_2"]),
                }
            ),
            task_templates=[TextClassification(text_column="text", label_column="label")],
        )

    def _split_generators(self, dl_manager):
        train_path = dl_manager.download_and_extract(_TRAIN_DOWNLOAD_URL)
        test_path = dl_manager.download_and_extract(_TEST_DOWNLOAD_URL)
        return [
            datasets.SplitGenerator(name=datasets.Split.TRAIN, gen_kwargs={"filepath": train_path}),
            datasets.SplitGenerator(name=datasets.Split.TEST, gen_kwargs={"filepath": test_path}),
        ]

    def _generate_examples(self, filepath):
        """Yield (key, example) pairs from a CSV file."""
        with open(filepath, encoding="utf-8") as csv_file:
            csv_reader = csv.DictReader(csv_file)
            for id_, row in enumerate(csv_reader):
                yield id_, row
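The forum workaround mentioned above (generating a .py file from label_mappings.json) might look roughly like the sketch below. The module name `label_names`, the constant `LABEL_NAMES`, the helper names, and the `{"name": id}` JSON layout are all hypothetical; nothing here is part of the datasets API.

```python
import importlib.util
import json
from pathlib import Path


def write_label_module(mapping_path, module_path):
    """Generate a Python module exposing LABEL_NAMES, ordered by label id."""
    mapping = json.loads(Path(mapping_path).read_text(encoding="utf-8"))
    names = sorted(mapping, key=mapping.get)
    source = "# Generated from label_mappings.json; do not edit by hand.\n"
    # repr() of a list of strings is a valid Python literal.
    source += f"LABEL_NAMES = {names!r}\n"
    Path(module_path).write_text(source, encoding="utf-8")


def load_label_names(module_path):
    """Import the generated module from its path and return LABEL_NAMES."""
    spec = importlib.util.spec_from_file_location("label_names", module_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module.LABEL_NAMES
```

In the builder, the hard-coded list would then be replaced by something like `ClassLabel(names=LABEL_NAMES)` after importing the generated module. The generation step has to run before the dataset script is published, since _info cannot download anything.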

Motivation

Allow datasets.DatasetInfo to be generated based on the contents of the dataset.

Your contribution

I'm willing to work on a PR with guidance.

david-waterworth added the enhancement (New feature or request) label on Jul 22, 2023
@danielduckworth

I would like this too, as I have been working with a dataset with hierarchical classes. In fact, I encountered this very issue when trying to define the dataset with a script. I couldn't find a workaround and reverted to hard-coding the class names in the README YAML.

@david-waterworth do you envision also being able to define the hierarchical structure of the classes?

@david-waterworth
Author

@danielduckworth yes, I did need to do that (but I ended up ditching datasets, as it looks like this is a "won't fix").

@danielduckworth

@david-waterworth Hmm, that's a shame. What are you using now? Also, I’m curious to know about the work you’re doing that involves hierarchical classes, if you don’t mind sharing.
