Provide ability to load label mappings from file #6059

david-waterworth · 2023-07-22T02:04:19Z

Feature request

My task is classification on a dataset with a large label set that includes a hierarchy. Even ignoring the hierarchy, I'm not able to find an example using datasets where the label names aren't hard-coded. This works fine for classifying a handful of labels, but ideally there would be a way of loading the name/id mappings required for datasets.features.ClassLabel from a file.

It is possible to pass a file to ClassLabel, but I cannot see an easy way of using this with GeneratorBasedBuilder: self._info is called before the dl_manager is constructed, so even if my dataset contains, say, label_mappings.json, there is no way to load it in order to construct the datasets.DatasetInfo.
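For reference, the file that ClassLabel accepts (via its names_file argument) is a plain-text list with one label name per line. Below is a minimal sketch of producing such a file from a name-to-id mapping. The mapping layout (`{"name": id}`), the helper `write_names_file`, and the output file name are assumptions made for illustration. It does not solve the timing problem described above; it only shows the file format.

```python
import tempfile
from pathlib import Path


def write_names_file(name_to_id: dict, out_path: Path) -> list:
    """Write label names to out_path, one per line, ordered by integer id.

    Hypothetical helper: assumes the mapping has the form {"name": id}.
    """
    ids = sorted(name_to_id.values())
    # Line position stands in for the class id, so ids must be 0..n-1.
    if ids != list(range(len(ids))):
        raise ValueError("label ids must be contiguous and start at 0")
    names = sorted(name_to_id, key=name_to_id.get)
    out_path.write_text("\n".join(names) + "\n", encoding="utf-8")
    return names


# Example mapping, deliberately listed out of id order.
names_path = Path(tempfile.mkdtemp()) / "label_names.txt"
names = write_names_file({"label_2": 1, "label_1": 0}, names_path)

# The resulting file could then be handed to
# datasets.features.ClassLabel(names_file=str(names_path)).
print(names_path.read_text(encoding="utf-8"))
```

The contiguity check is there because a names file has no explicit ids: a gap in the mapping would silently shift every later label.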

I can see other uses for accessing the download manager from self._info, e.g. if the files carry a schema (Arrow or Parquet files), the datasets.DatasetInfo could be inferred from it.

The workaround that was suggested in the forum is to generate a .py file from the label_mappings.json and import it.
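A rough sketch of that workaround, run as a one-off step before publishing the dataset script, might look like the following. The module name `labels.py`, the constant `LABEL_NAMES`, both helper functions, and the `{"name": id}` layout of label_mappings.json are all assumptions for illustration; the forum suggestion did not prescribe them.

```python
import importlib.util
import json
import tempfile
from pathlib import Path


def write_labels_module(mapping_path: Path, module_path: Path) -> None:
    """Render a name -> id JSON mapping as a Python module defining LABEL_NAMES."""
    mapping = json.loads(mapping_path.read_text(encoding="utf-8"))
    # Order names by id so that list position matches the integer class id.
    names = sorted(mapping, key=mapping.get)
    module_path.write_text(f"LABEL_NAMES = {names!r}\n", encoding="utf-8")


def import_module_from_path(module_path: Path):
    """Import a module from an explicit path (stand-in for a normal import)."""
    spec = importlib.util.spec_from_file_location(module_path.stem, module_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


# Demo in a temporary directory with a made-up two-label mapping.
workdir = Path(tempfile.mkdtemp())
mapping_file = workdir / "label_mappings.json"
mapping_file.write_text(json.dumps({"label_2": 1, "label_1": 0}), encoding="utf-8")

labels_file = workdir / "labels.py"
write_labels_module(mapping_file, labels_file)

labels = import_module_from_path(labels_file)
print(labels.LABEL_NAMES)
```

In a dataset script, the generated module would be imported in the usual way and the list passed as ClassLabel(names=LABEL_NAMES) inside _info, so no download is needed at that point.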

import csv

import datasets
from datasets.tasks import TextClassification


class TestDatasetBuilder(datasets.GeneratorBasedBuilder):
    VERSION = datasets.Version("1.0.0")

    def _info(self):
        return datasets.DatasetInfo(
            description=_DESCRIPTION,
            features=datasets.Features(
                {
                    "text": datasets.Value("string"),
                    # Label names are hard-coded here; the request is to load them from a file.
                    "label": datasets.features.ClassLabel(names=["label_1", "label_2"]),
                }
            ),
            task_templates=[TextClassification(text_column="text", label_column="label")],
        )

    def _split_generators(self, dl_manager):
        train_path = dl_manager.download_and_extract(_TRAIN_DOWNLOAD_URL)
        test_path = dl_manager.download_and_extract(_TEST_DOWNLOAD_URL)
        return [
            datasets.SplitGenerator(name=datasets.Split.TRAIN, gen_kwargs={"filepath": train_path}),
            datasets.SplitGenerator(name=datasets.Split.TEST, gen_kwargs={"filepath": test_path}),
        ]

    def _generate_examples(self, filepath):
        """Yield (key, example) pairs read from a CSV file."""
        with open(filepath, encoding="utf-8") as csv_file:
            csv_reader = csv.DictReader(csv_file)
            for id_, row in enumerate(csv_reader):
                yield id_, row

Motivation

Allow datasets.DatasetInfo to be generated based on the contents of the dataset.

Your contribution

I'm willing to work on a PR with guidance.


danielduckworth · 2024-04-15T12:43:18Z

I would like this too, as I have been working with a dataset with hierarchical classes. In fact, I encountered this very issue when trying to define the dataset with a script. I couldn't find a workaround and reverted to hard-coding the class names in the README YAML.

@david-waterworth do you envision also being able to define the hierarchical structure of the classes?

david-waterworth · 2024-04-15T23:47:10Z

@danielduckworth yes, I did need to do that (but I ended up ditching datasets, as it looks like this is a "won't fix").

danielduckworth · 2024-04-16T08:07:55Z

@david-waterworth Hmm, that's a shame. What are you using now? Also, I’m curious to know about the work you’re doing that involves hierarchical classes, if you don’t mind sharing.

david-waterworth added the "enhancement" (New feature or request) label on Jul 22, 2023
