add split argument to Generator #7015

Merged: 11 commits merged into huggingface:main on Jul 26, 2024

Conversation

@piercus (Contributor) commented on Jul 1, 2024

Actual

When creating a multi-split dataset using generators like

datasets.DatasetDict({
  "val": datasets.Dataset.from_generator(
      generator=generator_val,
      features=features
  ),
  "test": datasets.Dataset.from_generator(
      generator=generator_test,
      features=features,
  )
})

It displays (for both test and val)

Generating train split

Expected

I would like to improve this behavior by doing

datasets.DatasetDict({
  "val": datasets.Dataset.from_generator(
      generator=generator_val,
      features=features,
      split="val"
  ),
  "test": datasets.Dataset.from_generator(
      generator=generator_test,
      features=features,
      split="test"
  )
})

It would display

Generating val split

and

Generating test split
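
For illustration, here is a self-contained version of the example above as it would look with the proposed argument; the generator bodies and the feature schema are placeholders:

import datasets

features = datasets.Features({"text": datasets.Value("string")})

def generator_val():
    yield {"text": "a validation example"}

def generator_test():
    yield {"text": "a test example"}

dataset_dict = datasets.DatasetDict({
    "val": datasets.Dataset.from_generator(
        generator=generator_val,
        features=features,
        split="val",
    ),
    "test": datasets.Dataset.from_generator(
        generator=generator_test,
        features=features,
        split="test",
    ),
})
# With the proposed argument, the progress messages read
# "Generating val split" and "Generating test split".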

Proposal

The current PR adds an explicit split argument and replaces the implicit "train" split in the following classes/functions:

  • Generator
  • from_generator
  • AbstractDatasetInputStream
  • GeneratorDatasetInputStream

Please share your feedback.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@albertvillanova (Member) left a comment:

Thanks for your proposed contribution @piercus.

This is a nice one!

Some comments below. Basically, I would propose to define the split parameter just as an attribute of GeneratorConfig instead of Generator, GeneratorDatasetInputStream, AbstractDatasetInputStream and SqlDatasetReader.
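
To make the suggestion concrete, a minimal sketch of that approach, assuming the split is carried by the packaged generator module's builder config (illustrative only, not the merged implementation):

from dataclasses import dataclass
from typing import Callable, Optional

import datasets


@dataclass
class GeneratorConfig(datasets.BuilderConfig):
    generator: Optional[Callable] = None
    gen_kwargs: Optional[dict] = None
    features: Optional[datasets.Features] = None
    split: datasets.NamedSplit = datasets.Split.TRAIN  # new: split name to generate


class Generator(datasets.GeneratorBasedBuilder):
    BUILDER_CONFIG_CLASS = GeneratorConfig

    def _info(self):
        return datasets.DatasetInfo(features=self.config.features)

    def _split_generators(self, dl_manager):
        # name the single generated split from the config instead of hardcoding "train"
        return [datasets.SplitGenerator(name=self.config.split, gen_kwargs=self.config.gen_kwargs or {})]

    def _generate_examples(self, **gen_kwargs):
        for idx, example in enumerate(self.config.generator(**gen_kwargs)):
            yield idx, example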

src/datasets/arrow_dataset.py (outdated, resolved)
@@ -1088,6 +1089,8 @@ def from_generator(
Number of processes when downloading and generating the dataset locally.
This is helpful if the dataset is made of multiple files. Multiprocessing is disabled by default.
If `num_proc` is greater than one, then all list values in `gen_kwargs` must be the same length. These values will be split between calls to the generator. The number of shards will be the minimum of the shortest list in `gen_kwargs` and `num_proc`.
split (`str`, defaults to `"train"`):
Split name to be assigned to the dataset.
@albertvillanova (Member) commented:

This docstring should go below <Added version="2.7.0"/>, because that version-added tag corresponds to the num_proc parameter above split.

I would suggest aligning its type with the rest of the code as: ([`NamedSplit`], defaults to `Split.TRAIN`).

I would also add a specific version-added tag for the split parameter; we may eventually change this depending on the next release.

@piercus (Contributor, Author) replied:

Just used <Added version="2.21.0"/>, please cross-check.
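
Putting both suggestions together, the resulting entry presumably reads roughly like this (a sketch, not a verbatim quote of the merged docstring):

    split ([`NamedSplit`], defaults to `Split.TRAIN`):
        Split name to be assigned to the dataset.

        <Added version="2.21.0"/>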

src/datasets/io/abc.py (outdated, resolved)
src/datasets/io/abc.py (outdated, resolved)
src/datasets/io/generator.py (outdated, resolved)
src/datasets/iterable_dataset.py (outdated, resolved)
@@ -2074,7 +2075,8 @@ def from_generator(
Keyword arguments to be passed to the `generator` callable.
You can define a sharded iterable dataset by passing the list of shards in `gen_kwargs`.
This can be used to improve shuffling and when iterating over the dataset with multiple workers.

split(`str`, default="train"):
Split name to be assigned to the dataset.
@albertvillanova (Member) commented:

Same comments as before.

@piercus (Contributor, Author) replied:

I also added <Added version="2.21.0"/>, please cross-check.

src/datasets/packaged_modules/generator/generator.py (outdated, resolved)
src/datasets/packaged_modules/generator/generator.py (outdated, resolved)
-    dataset = Dataset.from_generator(data_generator, features=features, cache_dir=cache_dir)
-    _check_generator_dataset(dataset, expected_features)
+    dataset = Dataset.from_generator(data_generator, features=features, cache_dir=cache_dir, split=split)
+    _check_generator_dataset(dataset, expected_features, split)

@albertvillanova (Member) commented:

I would add a specific test_dataset_from_generator_split with parametrized split values, such as not passing any value, passing NamedSplit("train"), passing the literal "train", passing another NamedSplit, etc.
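
A minimal sketch of such a parametrized test, assuming the split argument introduced in this PR (names and parametrization here are illustrative; the merged test may differ):

import pytest
from datasets import Dataset, NamedSplit


def data_generator():
    for i in range(4):
        yield {"col_1": i}


@pytest.mark.parametrize("split", [None, NamedSplit("train"), "train", NamedSplit("test")])
def test_dataset_from_generator_split(split, tmp_path):
    # not passing the argument at all should keep the current default "train" split
    expected_split = str(split) if split is not None else "train"
    kwargs = {"split": split} if split is not None else {}
    dataset = Dataset.from_generator(data_generator, cache_dir=str(tmp_path), **kwargs)
    assert str(dataset.split) == expected_split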

@piercus (Contributor, Author) replied:

test_dataset_from_generator_split added; I have also modified _check_generator_dataset so that the same generic check is shared everywhere.
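
For context, the shared helper could look roughly like this (the fixture size and column names below are hypothetical, not the actual test code):

def _check_generator_dataset(dataset, expected_features, split="train"):
    # generic checks reused by all generator tests; split defaults to "train"
    # so existing call sites keep working unchanged
    assert isinstance(dataset, Dataset)
    assert dataset.num_rows == 4  # hypothetical fixture size
    assert dataset.column_names == ["col_1"]  # hypothetical schema
    for column, expected_dtype in expected_features.items():
        assert dataset.features[column].dtype == expected_dtype
    assert str(dataset.split) == str(split)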

@piercus (Contributor, Author) commented on Jul 10, 2024:

@albertvillanova thanks for the review, please take a look

@albertvillanova (Member) left a comment:

Thanks! Good work.

Just a fix for the failing test and a nit.

src/datasets/iterable_dataset.py (outdated, resolved)
tests/test_arrow_dataset.py (outdated, resolved)
piercus and others added 3 commits on July 10, 2024 (co-authored by Albert Villanova del Moral)
@piercus (Contributor, Author) commented on Jul 11, 2024:

@albertvillanova please take a look

@albertvillanova (Member) left a comment:

Thanks for your contribution!

Note the CI action to generate the docs is failing due to an unrelated issue: https://github.com/huggingface/datasets/actions/runs/9887484572/job/27309892176?pr=7015

Therefore, if we do not want to break the generation of docs, this other PR should be merged before yours:

@albertvillanova merged commit ead089d into huggingface:main on Jul 26, 2024 (14 checks passed)
@albertvillanova (Member):
Thank you again! Your PR is merged.

Benchmarks

PyArrow==8.0.0


Benchmark: benchmark_array_xd.json

new / old (diff) per metric:
  read_batch_formatted_as_numpy after write_array2d: 0.005267 / 0.011353 (-0.006086)
  read_batch_formatted_as_numpy after write_flattened_sequence: 0.003711 / 0.011008 (-0.007297)
  read_batch_formatted_as_numpy after write_nested_sequence: 0.062288 / 0.038508 (0.023780)
  read_batch_unformated after write_array2d: 0.031357 / 0.023109 (0.008248)
  read_batch_unformated after write_flattened_sequence: 0.233592 / 0.275898 (-0.042306)
  read_batch_unformated after write_nested_sequence: 0.257722 / 0.323480 (-0.065758)
  read_col_formatted_as_numpy after write_array2d: 0.003124 / 0.007986 (-0.004861)
  read_col_formatted_as_numpy after write_flattened_sequence: 0.003335 / 0.004328 (-0.000994)
  read_col_formatted_as_numpy after write_nested_sequence: 0.048594 / 0.004250 (0.044344)
  read_col_unformated after write_array2d: 0.043853 / 0.037052 (0.006801)
  read_col_unformated after write_flattened_sequence: 0.248589 / 0.258489 (-0.009900)
  read_col_unformated after write_nested_sequence: 0.278474 / 0.293841 (-0.015367)
  read_formatted_as_numpy after write_array2d: 0.029573 / 0.128546 (-0.098973)
  read_formatted_as_numpy after write_flattened_sequence: 0.011779 / 0.075646 (-0.063868)
  read_formatted_as_numpy after write_nested_sequence: 0.204989 / 0.419271 (-0.214282)
  read_unformated after write_array2d: 0.035734 / 0.043533 (-0.007799)
  read_unformated after write_flattened_sequence: 0.240064 / 0.255139 (-0.015075)
  read_unformated after write_nested_sequence: 0.263105 / 0.283200 (-0.020094)
  write_array2d: 0.018764 / 0.141683 (-0.122919)
  write_flattened_sequence: 1.115705 / 1.452155 (-0.336449)
  write_nested_sequence: 1.175457 / 1.492716 (-0.317260)

Benchmark: benchmark_getitem_100B.json

new / old (diff) per metric:
  get_batch_of_1024_random_rows: 0.092664 / 0.018006 (0.074657)
  get_batch_of_1024_rows: 0.297893 / 0.000490 (0.297403)
  get_first_row: 0.000217 / 0.000200 (0.000017)
  get_last_row: 0.000047 / 0.000054 (-0.000007)

Benchmark: benchmark_indices_mapping.json

new / old (diff) per metric:
  select: 0.019056 / 0.037411 (-0.018355)
  shard: 0.062472 / 0.014526 (0.047946)
  shuffle: 0.073462 / 0.176557 (-0.103094)
  sort: 0.119723 / 0.737135 (-0.617412)
  train_test_split: 0.074420 / 0.296338 (-0.221919)

Benchmark: benchmark_iterating.json

new / old (diff) per metric:
  read 5000: 0.283131 / 0.215209 (0.067922)
  read 50000: 2.776694 / 2.077655 (0.699039)
  read_batch 50000 10: 1.455586 / 1.504120 (-0.048534)
  read_batch 50000 100: 1.323902 / 1.541195 (-0.217293)
  read_batch 50000 1000: 1.333169 / 1.468490 (-0.135321)
  read_formatted numpy 5000: 0.723921 / 4.584777 (-3.860856)
  read_formatted pandas 5000: 2.385842 / 3.745712 (-1.359870)
  read_formatted tensorflow 5000: 2.926843 / 5.269862 (-2.343018)
  read_formatted torch 5000: 1.896773 / 4.565676 (-2.668903)
  read_formatted_batch numpy 5000 10: 0.079754 / 0.424275 (-0.344521)
  read_formatted_batch numpy 5000 1000: 0.005188 / 0.007607 (-0.002419)
  shuffled read 5000: 0.342466 / 0.226044 (0.116421)
  shuffled read 50000: 3.404204 / 2.268929 (1.135275)
  shuffled read_batch 50000 10: 1.856575 / 55.444624 (-53.588049)
  shuffled read_batch 50000 100: 1.554507 / 6.876477 (-5.321970)
  shuffled read_batch 50000 1000: 1.564065 / 2.142072 (-0.578007)
  shuffled read_formatted numpy 5000: 0.810363 / 4.805227 (-3.994864)
  shuffled read_formatted_batch numpy 5000 10: 0.135537 / 6.500664 (-6.365127)
  shuffled read_formatted_batch numpy 5000 1000: 0.041987 / 0.075469 (-0.033482)

Benchmark: benchmark_map_filter.json

new / old (diff) per metric:
  filter: 0.962288 / 1.841788 (-0.879500)
  map fast-tokenizer batched: 11.310837 / 8.074308 (3.236529)
  map identity: 9.630034 / 10.191392 (-0.561358)
  map identity batched: 0.131108 / 0.680424 (-0.549316)
  map no-op batched: 0.015225 / 0.534201 (-0.518976)
  map no-op batched numpy: 0.304211 / 0.579283 (-0.275072)
  map no-op batched pandas: 0.272707 / 0.434364 (-0.161657)
  map no-op batched pytorch: 0.341550 / 0.540337 (-0.198787)
  map no-op batched tensorflow: 0.444528 / 1.386936 (-0.942408)
PyArrow==latest

Benchmark: benchmark_array_xd.json

new / old (diff) per metric:
  read_batch_formatted_as_numpy after write_array2d: 0.005665 / 0.011353 (-0.005688)
  read_batch_formatted_as_numpy after write_flattened_sequence: 0.003916 / 0.011008 (-0.007092)
  read_batch_formatted_as_numpy after write_nested_sequence: 0.049946 / 0.038508 (0.011438)
  read_batch_unformated after write_array2d: 0.031760 / 0.023109 (0.008651)
  read_batch_unformated after write_flattened_sequence: 0.273826 / 0.275898 (-0.002072)
  read_batch_unformated after write_nested_sequence: 0.300193 / 0.323480 (-0.023287)
  read_col_formatted_as_numpy after write_array2d: 0.004350 / 0.007986 (-0.003635)
  read_col_formatted_as_numpy after write_flattened_sequence: 0.002749 / 0.004328 (-0.001579)
  read_col_formatted_as_numpy after write_nested_sequence: 0.048451 / 0.004250 (0.044201)
  read_col_unformated after write_array2d: 0.039798 / 0.037052 (0.002746)
  read_col_unformated after write_flattened_sequence: 0.284570 / 0.258489 (0.026081)
  read_col_unformated after write_nested_sequence: 0.318855 / 0.293841 (0.025014)
  read_formatted_as_numpy after write_array2d: 0.032724 / 0.128546 (-0.095822)
  read_formatted_as_numpy after write_flattened_sequence: 0.012103 / 0.075646 (-0.063543)
  read_formatted_as_numpy after write_nested_sequence: 0.059857 / 0.419271 (-0.359414)
  read_unformated after write_array2d: 0.034185 / 0.043533 (-0.009348)
  read_unformated after write_flattened_sequence: 0.276079 / 0.255139 (0.020940)
  read_unformated after write_nested_sequence: 0.294070 / 0.283200 (0.010871)
  write_array2d: 0.018168 / 0.141683 (-0.123515)
  write_flattened_sequence: 1.149681 / 1.452155 (-0.302473)
  write_nested_sequence: 1.191349 / 1.492716 (-0.301367)

Benchmark: benchmark_getitem_100B.json

new / old (diff) per metric:
  get_batch_of_1024_random_rows: 0.092676 / 0.018006 (0.074669)
  get_batch_of_1024_rows: 0.304971 / 0.000490 (0.304481)
  get_first_row: 0.000203 / 0.000200 (0.000003)
  get_last_row: 0.000050 / 0.000054 (-0.000004)

Benchmark: benchmark_indices_mapping.json

new / old (diff) per metric:
  select: 0.023110 / 0.037411 (-0.014301)
  shard: 0.079117 / 0.014526 (0.064591)
  shuffle: 0.087457 / 0.176557 (-0.089099)
  sort: 0.128295 / 0.737135 (-0.608840)
  train_test_split: 0.089747 / 0.296338 (-0.206592)

Benchmark: benchmark_iterating.json

new / old (diff) per metric:
  read 5000: 0.305158 / 0.215209 (0.089949)
  read 50000: 2.992277 / 2.077655 (0.914623)
  read_batch 50000 10: 1.595369 / 1.504120 (0.091249)
  read_batch 50000 100: 1.462955 / 1.541195 (-0.078240)
  read_batch 50000 1000: 1.476269 / 1.468490 (0.007779)
  read_formatted numpy 5000: 0.731652 / 4.584777 (-3.853124)
  read_formatted pandas 5000: 0.961053 / 3.745712 (-2.784659)
  read_formatted tensorflow 5000: 2.800259 / 5.269862 (-2.469602)
  read_formatted torch 5000: 1.881249 / 4.565676 (-2.684428)
  read_formatted_batch numpy 5000 10: 0.079503 / 0.424275 (-0.344772)
  read_formatted_batch numpy 5000 1000: 0.005252 / 0.007607 (-0.002355)
  shuffled read 5000: 0.354921 / 0.226044 (0.128877)
  shuffled read 50000: 3.495272 / 2.268929 (1.226343)
  shuffled read_batch 50000 10: 1.956419 / 55.444624 (-53.488205)
  shuffled read_batch 50000 100: 1.654941 / 6.876477 (-5.221536)
  shuffled read_batch 50000 1000: 1.782506 / 2.142072 (-0.359567)
  shuffled read_formatted numpy 5000: 0.816487 / 4.805227 (-3.988741)
  shuffled read_formatted_batch numpy 5000 10: 0.135870 / 6.500664 (-6.364794)
  shuffled read_formatted_batch numpy 5000 1000: 0.041114 / 0.075469 (-0.034355)

Benchmark: benchmark_map_filter.json

new / old (diff) per metric:
  filter: 1.050346 / 1.841788 (-0.791442)
  map fast-tokenizer batched: 12.510129 / 8.074308 (4.435821)
  map identity: 10.524835 / 10.191392 (0.333443)
  map identity batched: 0.152388 / 0.680424 (-0.528036)
  map no-op batched: 0.016073 / 0.534201 (-0.518128)
  map no-op batched numpy: 0.301956 / 0.579283 (-0.277327)
  map no-op batched pandas: 0.126871 / 0.434364 (-0.307493)
  map no-op batched pytorch: 0.339554 / 0.540337 (-0.200783)
  map no-op batched tensorflow: 0.435873 / 1.386936 (-0.951064)

albertvillanova added commits that referenced this pull request on Aug 13 and Aug 14, 2024:
* add split argument to Generator, from_generator, AbstractDatasetInputStream, GeneratorDatasetInputStream

* split generator review feedbacks

* import Split

* tag added version in iterable_dataset, rollback change in _concatenate_iterable_datasets

* rm useless Generator __init__

* docstring formatting

Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>

* format docstring

Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>

* fix test_dataset_from_generator_split[None]

---------

Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Successfully merging this pull request may close these issues.

from_generator does not allow to specify the split name
3 participants