Use `hf-internal-testing` repos for hosting test dataset repos #6180

mariosasko · 2023-08-25T13:10:26Z

Host the test dataset repos under the `hf-internal-testing` org instead of the maintainers' dataset repos.


HuggingFaceDocBuilderDev · 2023-08-25T13:16:59Z

The documentation is not available anymore as the PR was closed or merged.

github-actions · 2023-08-25T13:17:08Z

Show benchmarks

PyArrow==8.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006505 / 0.011353 (-0.004847)	0.003950 / 0.011008 (-0.007058)	0.084554 / 0.038508 (0.046046)	0.074376 / 0.023109 (0.051267)	0.350184 / 0.275898 (0.074286)	0.380704 / 0.323480 (0.057224)	0.004011 / 0.007986 (-0.003975)	0.003890 / 0.004328 (-0.000438)	0.065483 / 0.004250 (0.061232)	0.054912 / 0.037052 (0.017860)	0.359586 / 0.258489 (0.101097)	0.403360 / 0.293841 (0.109519)	0.030614 / 0.128546 (-0.097932)	0.008530 / 0.075646 (-0.067117)	0.288220 / 0.419271 (-0.131052)	0.052270 / 0.043533 (0.008737)	0.352557 / 0.255139 (0.097418)	0.380509 / 0.283200 (0.097309)	0.025513 / 0.141683 (-0.116170)	1.488469 / 1.452155 (0.036315)	1.559182 / 1.492716 (0.066466)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.266163 / 0.018006 (0.248157)	0.596345 / 0.000490 (0.595855)	0.004368 / 0.000200 (0.004168)	0.000211 / 0.000054 (0.000156)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.027137 / 0.037411 (-0.010274)	0.082251 / 0.014526 (0.067725)	0.094745 / 0.176557 (-0.081812)	0.148756 / 0.737135 (-0.588379)	0.094580 / 0.296338 (-0.201758)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.383506 / 0.215209 (0.168297)	3.823147 / 2.077655 (1.745493)	1.859627 / 1.504120 (0.355507)	1.687969 / 1.541195 (0.146775)	1.720786 / 1.468490 (0.252296)	0.476552 / 4.584777 (-4.108225)	3.539558 / 3.745712 (-0.206154)	3.209032 / 5.269862 (-2.060830)	1.999643 / 4.565676 (-2.566034)	0.056484 / 0.424275 (-0.367791)	0.007443 / 0.007607 (-0.000164)	0.456089 / 0.226044 (0.230044)	4.562522 / 2.268929 (2.293593)	2.348286 / 55.444624 (-53.096338)	1.984323 / 6.876477 (-4.892154)	2.148988 / 2.142072 (0.006915)	0.570761 / 4.805227 (-4.234466)	0.131439 / 6.500664 (-6.369225)	0.059752 / 0.075469 (-0.015717)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.276803 / 1.841788 (-0.564985)	19.406812 / 8.074308 (11.332504)	13.979088 / 10.191392 (3.787696)	0.157418 / 0.680424 (-0.523006)	0.018051 / 0.534201 (-0.516150)	0.392307 / 0.579283 (-0.186976)	0.406603 / 0.434364 (-0.027760)	0.458450 / 0.540337 (-0.081888)	0.622569 / 1.386936 (-0.764367)

PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006552 / 0.011353 (-0.004800)	0.004060 / 0.011008 (-0.006948)	0.063522 / 0.038508 (0.025014)	0.072537 / 0.023109 (0.049428)	0.398452 / 0.275898 (0.122554)	0.422059 / 0.323480 (0.098579)	0.005577 / 0.007986 (-0.002409)	0.003413 / 0.004328 (-0.000916)	0.064095 / 0.004250 (0.059845)	0.056883 / 0.037052 (0.019831)	0.407119 / 0.258489 (0.148630)	0.435889 / 0.293841 (0.142048)	0.031549 / 0.128546 (-0.096998)	0.008418 / 0.075646 (-0.067228)	0.070315 / 0.419271 (-0.348957)	0.047828 / 0.043533 (0.004295)	0.398705 / 0.255139 (0.143566)	0.416986 / 0.283200 (0.133786)	0.022304 / 0.141683 (-0.119379)	1.512597 / 1.452155 (0.060442)	1.570588 / 1.492716 (0.077871)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.295100 / 0.018006 (0.277094)	0.541883 / 0.000490 (0.541393)	0.007375 / 0.000200 (0.007175)	0.000100 / 0.000054 (0.000045)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.030877 / 0.037411 (-0.006534)	0.090807 / 0.014526 (0.076281)	0.106155 / 0.176557 (-0.070402)	0.155546 / 0.737135 (-0.581589)	0.103847 / 0.296338 (-0.192492)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.441176 / 0.215209 (0.225967)	4.401025 / 2.077655 (2.323371)	2.394764 / 1.504120 (0.890644)	2.226434 / 1.541195 (0.685239)	2.247248 / 1.468490 (0.778758)	0.489149 / 4.584777 (-4.095628)	3.642468 / 3.745712 (-0.103244)	3.235597 / 5.269862 (-2.034265)	1.992660 / 4.565676 (-2.573016)	0.057457 / 0.424275 (-0.366818)	0.007192 / 0.007607 (-0.000415)	0.515978 / 0.226044 (0.289934)	5.147728 / 2.268929 (2.878800)	2.837394 / 55.444624 (-52.607230)	2.505753 / 6.876477 (-4.370723)	2.653090 / 2.142072 (0.511018)	0.583274 / 4.805227 (-4.221954)	0.132116 / 6.500664 (-6.368548)	0.058794 / 0.075469 (-0.016675)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.331630 / 1.841788 (-0.510158)	20.056890 / 8.074308 (11.982582)	14.950561 / 10.191392 (4.759169)	0.165449 / 0.680424 (-0.514975)	0.020161 / 0.534201 (-0.514040)	0.395791 / 0.579283 (-0.183492)	0.415631 / 0.434364 (-0.018733)	0.474440 / 0.540337 (-0.065898)	0.643060 / 1.386936 (-0.743876)

github-actions · 2023-08-25T16:33:29Z

Show benchmarks

PyArrow==8.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007440 / 0.011353 (-0.003913)	0.004456 / 0.011008 (-0.006552)	0.099498 / 0.038508 (0.060990)	0.077579 / 0.023109 (0.054470)	0.374934 / 0.275898 (0.099036)	0.409590 / 0.323480 (0.086110)	0.005876 / 0.007986 (-0.002110)	0.003642 / 0.004328 (-0.000687)	0.076781 / 0.004250 (0.072531)	0.060185 / 0.037052 (0.023133)	0.374762 / 0.258489 (0.116273)	0.445608 / 0.293841 (0.151767)	0.036557 / 0.128546 (-0.091990)	0.009941 / 0.075646 (-0.065706)	0.345214 / 0.419271 (-0.074058)	0.061912 / 0.043533 (0.018379)	0.378346 / 0.255139 (0.123207)	0.415275 / 0.283200 (0.132076)	0.027396 / 0.141683 (-0.114287)	1.776602 / 1.452155 (0.324447)	1.827683 / 1.492716 (0.334967)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.235227 / 0.018006 (0.217221)	0.491846 / 0.000490 (0.491356)	0.004987 / 0.000200 (0.004787)	0.000127 / 0.000054 (0.000073)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.032517 / 0.037411 (-0.004894)	0.099217 / 0.014526 (0.084691)	0.109749 / 0.176557 (-0.066807)	0.176190 / 0.737135 (-0.560946)	0.109868 / 0.296338 (-0.186471)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.455188 / 0.215209 (0.239979)	4.560143 / 2.077655 (2.482489)	2.249928 / 1.504120 (0.745809)	2.032808 / 1.541195 (0.491614)	2.090096 / 1.468490 (0.621605)	0.567813 / 4.584777 (-4.016964)	4.338299 / 3.745712 (0.592587)	3.701589 / 5.269862 (-1.568273)	2.404805 / 4.565676 (-2.160871)	0.067931 / 0.424275 (-0.356344)	0.009011 / 0.007607 (0.001404)	0.542565 / 0.226044 (0.316521)	5.406578 / 2.268929 (3.137650)	2.773508 / 55.444624 (-52.671116)	2.402926 / 6.876477 (-4.473550)	2.679318 / 2.142072 (0.537246)	0.683781 / 4.805227 (-4.121446)	0.155970 / 6.500664 (-6.344694)	0.070108 / 0.075469 (-0.005361)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.541583 / 1.841788 (-0.300205)	21.592562 / 8.074308 (13.518254)	16.426868 / 10.191392 (6.235476)	0.168618 / 0.680424 (-0.511806)	0.021560 / 0.534201 (-0.512641)	0.467062 / 0.579283 (-0.112221)	0.479968 / 0.434364 (0.045604)	0.540747 / 0.540337 (0.000410)	0.775502 / 1.386936 (-0.611434)

PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.008632 / 0.011353 (-0.002721)	0.004523 / 0.011008 (-0.006485)	0.075814 / 0.038508 (0.037306)	0.087096 / 0.023109 (0.063987)	0.482136 / 0.275898 (0.206238)	0.529969 / 0.323480 (0.206489)	0.006882 / 0.007986 (-0.001103)	0.003720 / 0.004328 (-0.000609)	0.076232 / 0.004250 (0.071981)	0.069307 / 0.037052 (0.032254)	0.491554 / 0.258489 (0.233065)	0.528989 / 0.293841 (0.235148)	0.042219 / 0.128546 (-0.086327)	0.009717 / 0.075646 (-0.065929)	0.103006 / 0.419271 (-0.316266)	0.060377 / 0.043533 (0.016844)	0.484454 / 0.255139 (0.229315)	0.536072 / 0.283200 (0.252872)	0.027482 / 0.141683 (-0.114201)	1.844677 / 1.452155 (0.392522)	2.001800 / 1.492716 (0.509083)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.252367 / 0.018006 (0.234361)	0.483601 / 0.000490 (0.483111)	0.007445 / 0.000200 (0.007245)	0.000163 / 0.000054 (0.000108)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.036463 / 0.037411 (-0.000948)	0.108837 / 0.014526 (0.094311)	0.122256 / 0.176557 (-0.054300)	0.186455 / 0.737135 (-0.550681)	0.122270 / 0.296338 (-0.174069)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.506291 / 0.215209 (0.291082)	5.038044 / 2.077655 (2.960389)	2.751017 / 1.504120 (1.246897)	2.553655 / 1.541195 (1.012460)	2.612724 / 1.468490 (1.144234)	0.581755 / 4.584777 (-4.003022)	4.376012 / 3.745712 (0.630300)	3.749755 / 5.269862 (-1.520107)	2.394059 / 4.565676 (-2.171618)	0.068727 / 0.424275 (-0.355548)	0.008714 / 0.007607 (0.001107)	0.607371 / 0.226044 (0.381326)	6.062053 / 2.268929 (3.793125)	3.278378 / 55.444624 (-52.166247)	2.866417 / 6.876477 (-4.010060)	3.056150 / 2.142072 (0.914077)	0.695090 / 4.805227 (-4.110137)	0.155274 / 6.500664 (-6.345390)	0.071106 / 0.075469 (-0.004363)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.584552 / 1.841788 (-0.257236)	23.092569 / 8.074308 (15.018260)	17.381905 / 10.191392 (7.190513)	0.206535 / 0.680424 (-0.473888)	0.025401 / 0.534201 (-0.508800)	0.514297 / 0.579283 (-0.064986)	0.507487 / 0.434364 (0.073123)	0.566883 / 0.540337 (0.026545)	0.811074 / 1.386936 (-0.575862)

github-actions · 2023-08-25T16:58:02Z

Show benchmarks

PyArrow==8.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.008400 / 0.011353 (-0.002953)	0.004872 / 0.011008 (-0.006136)	0.104434 / 0.038508 (0.065926)	0.074411 / 0.023109 (0.051302)	0.395970 / 0.275898 (0.120072)	0.431661 / 0.323480 (0.108181)	0.005365 / 0.007986 (-0.002621)	0.005495 / 0.004328 (0.001167)	0.081255 / 0.004250 (0.077004)	0.057141 / 0.037052 (0.020089)	0.397242 / 0.258489 (0.138753)	0.456052 / 0.293841 (0.162211)	0.048362 / 0.128546 (-0.080184)	0.014077 / 0.075646 (-0.061569)	0.351128 / 0.419271 (-0.068143)	0.067842 / 0.043533 (0.024309)	0.372820 / 0.255139 (0.117681)	0.407917 / 0.283200 (0.124717)	0.037707 / 0.141683 (-0.103976)	1.677136 / 1.452155 (0.224981)	1.764614 / 1.492716 (0.271897)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.269850 / 0.018006 (0.251844)	0.601458 / 0.000490 (0.600969)	0.006500 / 0.000200 (0.006300)	0.000107 / 0.000054 (0.000053)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.030340 / 0.037411 (-0.007072)	0.098041 / 0.014526 (0.083515)	0.107270 / 0.176557 (-0.069287)	0.173502 / 0.737135 (-0.563633)	0.113296 / 0.296338 (-0.183043)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.575788 / 0.215209 (0.360579)	5.723878 / 2.077655 (3.646223)	2.326339 / 1.504120 (0.822219)	2.130667 / 1.541195 (0.589472)	2.080885 / 1.468490 (0.612395)	0.800936 / 4.584777 (-3.783841)	5.227888 / 3.745712 (1.482176)	4.592647 / 5.269862 (-0.677214)	2.935765 / 4.565676 (-1.629911)	0.095909 / 0.424275 (-0.328367)	0.008763 / 0.007607 (0.001156)	0.697362 / 0.226044 (0.471318)	6.968105 / 2.268929 (4.699176)	3.129070 / 55.444624 (-52.315554)	2.554818 / 6.876477 (-4.321658)	2.776005 / 2.142072 (0.633933)	1.017064 / 4.805227 (-3.788163)	0.211552 / 6.500664 (-6.289112)	0.072132 / 0.075469 (-0.003338)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.517072 / 1.841788 (-0.324716)	23.737742 / 8.074308 (15.663433)	22.236447 / 10.191392 (12.045055)	0.235408 / 0.680424 (-0.445016)	0.031889 / 0.534201 (-0.502312)	0.458997 / 0.579283 (-0.120286)	0.610513 / 0.434364 (0.176149)	0.536508 / 0.540337 (-0.003830)	0.750137 / 1.386936 (-0.636799)

PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.008696 / 0.011353 (-0.002657)	0.005374 / 0.011008 (-0.005634)	0.077974 / 0.038508 (0.039466)	0.083471 / 0.023109 (0.060362)	0.498890 / 0.275898 (0.222992)	0.517570 / 0.323480 (0.194090)	0.006523 / 0.007986 (-0.001462)	0.004315 / 0.004328 (-0.000013)	0.082262 / 0.004250 (0.078012)	0.064828 / 0.037052 (0.027776)	0.473101 / 0.258489 (0.214612)	0.534172 / 0.293841 (0.240331)	0.051884 / 0.128546 (-0.076662)	0.015191 / 0.075646 (-0.060455)	0.084307 / 0.419271 (-0.334965)	0.066050 / 0.043533 (0.022517)	0.518007 / 0.255139 (0.262868)	0.511145 / 0.283200 (0.227946)	0.045302 / 0.141683 (-0.096381)	1.670973 / 1.452155 (0.218818)	1.829225 / 1.492716 (0.336509)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.436537 / 0.018006 (0.418531)	0.608380 / 0.000490 (0.607890)	0.075211 / 0.000200 (0.075011)	0.000733 / 0.000054 (0.000679)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.039117 / 0.037411 (0.001706)	0.103525 / 0.014526 (0.088999)	0.124413 / 0.176557 (-0.052144)	0.192352 / 0.737135 (-0.544783)	0.120379 / 0.296338 (-0.175959)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.673338 / 0.215209 (0.458129)	6.799435 / 2.077655 (4.721780)	3.600913 / 1.504120 (2.096793)	2.881008 / 1.541195 (1.339814)	2.667154 / 1.468490 (1.198664)	0.868775 / 4.584777 (-3.716002)	5.517063 / 3.745712 (1.771351)	4.646706 / 5.269862 (-0.623156)	2.914825 / 4.565676 (-1.650852)	0.098784 / 0.424275 (-0.325491)	0.011504 / 0.007607 (0.003897)	0.724233 / 0.226044 (0.498188)	7.311045 / 2.268929 (5.042117)	3.685490 / 55.444624 (-51.759135)	2.892360 / 6.876477 (-3.984117)	3.253189 / 2.142072 (1.111117)	0.983065 / 4.805227 (-3.822162)	0.201097 / 6.500664 (-6.299567)	0.068020 / 0.075469 (-0.007450)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.793904 / 1.841788 (-0.047884)	24.451356 / 8.074308 (16.377048)	21.697191 / 10.191392 (11.505799)	0.228545 / 0.680424 (-0.451879)	0.034600 / 0.534201 (-0.499601)	0.483253 / 0.579283 (-0.096030)	0.615103 / 0.434364 (0.180739)	0.564600 / 0.540337 (0.024262)	0.799688 / 1.386936 (-0.587248)
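
Each cell in the benchmark tables above has the form `new / old (diff)`, where the value in parentheses is the new measurement minus the old one. As a minimal sketch (not part of the benchmark tooling; the helper names `parse_cell` and `parse_row` are hypothetical), a tab-separated header and value row copied from one of these comments can be parsed like this:

```python
import re

# One cell looks like "0.027137 / 0.037411 (-0.010274)".
CELL = re.compile(r"^\s*([\d.]+)\s*/\s*([\d.]+)\s*\(([-+]?[\d.]+)\)\s*$")


def parse_cell(cell):
    """Return (new, old, diff) as floats for one 'new / old (diff)' cell."""
    match = CELL.match(cell)
    if match is None:
        raise ValueError(f"unexpected benchmark cell: {cell!r}")
    new, old, diff = (float(group) for group in match.groups())
    return new, old, diff


def parse_row(header_line, values_line):
    """Map each metric name to its (new, old, diff) tuple.

    The first tab-separated field of each line is a row label
    ("metric" / "new / old (diff)") and is skipped.
    """
    names = header_line.split("\t")[1:]
    cells = values_line.split("\t")[1:]
    return {name: parse_cell(cell) for name, cell in zip(names, cells)}


# Values copied from the first benchmark_indices_mapping.json table (PyArrow==8.0.0).
header = "metric\tselect\tshard\tshuffle\tsort\ttrain_test_split"
values = (
    "new / old (diff)\t0.027137 / 0.037411 (-0.010274)\t"
    "0.082251 / 0.014526 (0.067725)\t0.094745 / 0.176557 (-0.081812)\t"
    "0.148756 / 0.737135 (-0.588379)\t0.094580 / 0.296338 (-0.201758)"
)

row = parse_row(header, values)
for name, (new, old, diff) in row.items():
    # The ratio makes it easier to compare metrics of very different magnitude.
    print(f"{name}: new/old = {new / old:.2f}, diff = {diff:+.6f}")
```

The reported diff can differ from `new - old` in the last digit (for example `0.006505 / 0.011353 (-0.004847)`), presumably because it is computed before rounding, so any consistency check should allow a small tolerance.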

* Use `hf-internal-testing` repos for testing
* Fix

mariosasko added 2 commits August 25, 2023 00:28

Use hf-internal-testing repos for testing

d077fa1

Merge branch 'main' of github.com:huggingface/datasets into use-hf-in…

712185e

…ternal-testing-repos

Fix

5fb0129

mariosasko merged commit 74d6021 into main Aug 25, 2023
13 checks passed

mariosasko deleted the use-hf-internal-testing-repos branch August 25, 2023 16:46

albertvillanova pushed a commit that referenced this pull request Oct 24, 2023

Use hf-internal-testing repos for hosting test dataset repos (#6180)

048bce3

* Use `hf-internal-testing` repos for testing
* Fix
