My task is classification on a dataset with a large label set that includes a hierarchy. Even ignoring the hierarchy, I'm not able to find an example using datasets where the label names aren't hard-coded. Hard-coding works fine for classifying a handful of labels, but ideally there would be a way of loading the name/id mappings required for datasets.features.ClassLabel from a file.
It is possible to pass a file to ClassLabel, but I cannot see an easy way of using this with GeneratorBasedBuilder, since self._info is called before the dl_manager is constructed. So even if my dataset contains, say, label_mappings.json, there's no way of loading it in order to construct the datasets.DatasetInfo.
I can see other uses for accessing the download_manager from self._info, e.g. if the files contain a schema (such as arrow or parquet files), the datasets.DatasetInfo could be inferred.
The workaround that was suggested in the forum is to generate a .py file from the label_mappings.json and import it.
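As a rough illustration of that workaround, here is a minimal sketch using only the standard library. It assumes label_mappings.json holds a flat name-to-id object such as {"label_1": 0, "label_2": 1}; that layout, the helper names write_label_module and load_label_names, and the generated label_names.py file are all hypothetical and not part of datasets.

```python
import importlib.util
import json
import tempfile
from pathlib import Path


def write_label_module(mappings_path: Path, module_path: Path) -> None:
    """Generate a Python module exposing LABEL_NAMES from a JSON name -> id mapping.

    Assumption: the JSON file is a flat object such as {"label_1": 0, "label_2": 1}.
    """
    name_to_id = json.loads(mappings_path.read_text(encoding="utf-8"))
    # Order names by id so that each list position matches the integer class id.
    names = sorted(name_to_id, key=name_to_id.get)
    module_path.write_text(f"LABEL_NAMES = {names!r}\n", encoding="utf-8")


def load_label_names(module_path: Path) -> list:
    """Import the generated module from its path and return its LABEL_NAMES list."""
    spec = importlib.util.spec_from_file_location("label_names", module_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module.LABEL_NAMES


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        mappings = Path(tmp) / "label_mappings.json"
        generated = Path(tmp) / "label_names.py"
        mappings.write_text(json.dumps({"label_2": 1, "label_1": 0}), encoding="utf-8")
        write_label_module(mappings, generated)
        print(load_label_names(generated))
```

The resulting list could then be passed as names= to ClassLabel in place of a hard-coded list like the one in the builder below. The generated file would need to be created before the loading script runs, which is what makes this a workaround rather than a fix.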
import csv

import datasets
from datasets.tasks import TextClassification

# _DESCRIPTION, _TRAIN_DOWNLOAD_URL and _TEST_DOWNLOAD_URL are module-level
# constants defined elsewhere in the loading script.


class TestDatasetBuilder(datasets.GeneratorBasedBuilder):

    VERSION = datasets.Version("1.0.0")

    def _info(self):
        return datasets.DatasetInfo(
            description=_DESCRIPTION,
            features=datasets.Features(
                {
                    "text": datasets.Value("string"),
                    # The label names are hard-coded here; this is the part that should come from a file.
                    "label": datasets.features.ClassLabel(names=["label_1", "label_2"]),
                }
            ),
            task_templates=[TextClassification(text_column="text", label_column="label")],
        )

    def _split_generators(self, dl_manager):
        train_path = dl_manager.download_and_extract(_TRAIN_DOWNLOAD_URL)
        test_path = dl_manager.download_and_extract(_TEST_DOWNLOAD_URL)
        return [
            datasets.SplitGenerator(name=datasets.Split.TRAIN, gen_kwargs={"filepath": train_path}),
            datasets.SplitGenerator(name=datasets.Split.TEST, gen_kwargs={"filepath": test_path}),
        ]

    def _generate_examples(self, filepath):
        """Generate AG News examples."""
        with open(filepath, encoding="utf-8") as csv_file:
            csv_reader = csv.DictReader(csv_file)
            for id_, row in enumerate(csv_reader):
                yield id_, row
Motivation
Allow datasets.DatasetInfo to be generated based on the contents of the dataset.
Your contribution
I'm willing to work on a PR with guidance.
I would like this too, as I have been working with a dataset with hierarchical classes. In fact, I encountered this very issue when trying to define the dataset with a script. I couldn't find a workaround and reverted to hard-coding the class names in the README YAML.
@david-waterworth do you envision also being able to define the hierarchical structure of the classes?
@david-waterworth Hmm, that's a shame. What are you using now? Also, I’m curious to know about the work you’re doing that involves hierarchical classes, if you don’t mind sharing.