Fix prepare_single_hop_path_and_storage_options #7068

albertvillanova · 2024-07-24T05:52:34Z

Fix _prepare_single_hop_path_and_storage_options:

- Do not pass HF authentication headers and HF user-agent to non-HF HTTP URLs
- Do not overwrite nested values of the passed storage_options:
  - Before, when passing
    DownloadConfig(storage_options={"https": {"client_kwargs": {"raise_for_status": True}}}),
    it was overwritten with
    {"https": {"client_kwargs": {"trust_env": True}}}
  - Now, the result combines both:
    {"https": {"client_kwargs": {"trust_env": True, "raise_for_status": True}}}
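For illustration, the combining behavior described above can be sketched as a recursive dictionary merge. This is only a minimal sketch, not the code in src/datasets/utils/file_utils.py: the helper name `merge_nested` is hypothetical, and the default value is taken from the example above.

```python
import copy


def merge_nested(defaults: dict, overrides: dict) -> dict:
    """Return a new dict combining `defaults` with `overrides`, recursing into nested dicts.

    Hypothetical helper for illustration only; neither input is modified.
    """
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides hold a dict under this key: merge them instead of replacing.
            merged[key] = merge_nested(merged[key], value)
        else:
            # Otherwise the user-supplied value wins.
            merged[key] = copy.deepcopy(value)
    return merged


# Value the user passed via DownloadConfig(storage_options=...)
user_options = {"https": {"client_kwargs": {"raise_for_status": True}}}
# Default assumed from the example above
default_options = {"https": {"client_kwargs": {"trust_env": True}}}

combined = merge_nested(default_options, user_options)
print(combined)
```

Because the sketch works on deep copies, the dictionary held by the caller's DownloadConfig is left untouched, which matches the later commits in this PR that ensure DownloadConfig.storage_options are not modified.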

HuggingFaceDocBuilderDev · 2024-07-24T05:54:55Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

src/datasets/utils/file_utils.py

This reverts commit a337212.

lhoestq

Great!

github-actions · 2024-07-29T07:02:06Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005725 / 0.011353 (-0.005628)	0.004149 / 0.011008 (-0.006859)	0.065051 / 0.038508 (0.026543)	0.030220 / 0.023109 (0.007111)	0.256768 / 0.275898 (-0.019130)	0.269767 / 0.323480 (-0.053713)	0.003256 / 0.007986 (-0.004730)	0.003378 / 0.004328 (-0.000951)	0.049407 / 0.004250 (0.045156)	0.046041 / 0.037052 (0.008988)	0.270977 / 0.258489 (0.012488)	0.288771 / 0.293841 (-0.005070)	0.030401 / 0.128546 (-0.098145)	0.012203 / 0.075646 (-0.063443)	0.227365 / 0.419271 (-0.191906)	0.036356 / 0.043533 (-0.007176)	0.262763 / 0.255139 (0.007624)	0.268172 / 0.283200 (-0.015028)	0.020698 / 0.141683 (-0.120984)	1.171679 / 1.452155 (-0.280476)	1.155353 / 1.492716 (-0.337363)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.138740 / 0.018006 (0.120733)	0.300962 / 0.000490 (0.300473)	0.000240 / 0.000200 (0.000040)	0.000050 / 0.000054 (-0.000005)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.019056 / 0.037411 (-0.018355)	0.062922 / 0.014526 (0.048396)	0.075339 / 0.176557 (-0.101218)	0.122587 / 0.737135 (-0.614548)	0.078622 / 0.296338 (-0.217716)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.273878 / 0.215209 (0.058669)	2.753188 / 2.077655 (0.675533)	1.446877 / 1.504120 (-0.057243)	1.325034 / 1.541195 (-0.216160)	1.332849 / 1.468490 (-0.135641)	0.721042 / 4.584777 (-3.863735)	2.457241 / 3.745712 (-1.288471)	3.008013 / 5.269862 (-2.261848)	1.925773 / 4.565676 (-2.639903)	0.077725 / 0.424275 (-0.346550)	0.005232 / 0.007607 (-0.002375)	0.331398 / 0.226044 (0.105354)	3.273689 / 2.268929 (1.004761)	1.818291 / 55.444624 (-53.626334)	1.532233 / 6.876477 (-5.344244)	1.545236 / 2.142072 (-0.596837)	0.809853 / 4.805227 (-3.995374)	0.137571 / 6.500664 (-6.363093)	0.042829 / 0.075469 (-0.032640)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	0.962599 / 1.841788 (-0.879189)	11.593394 / 8.074308 (3.519086)	9.564848 / 10.191392 (-0.626544)	0.131547 / 0.680424 (-0.548876)	0.014724 / 0.534201 (-0.519477)	0.309343 / 0.579283 (-0.269940)	0.263476 / 0.434364 (-0.170888)	0.350755 / 0.540337 (-0.189582)	0.445279 / 1.386936 (-0.941657)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005818 / 0.011353 (-0.005534)	0.004028 / 0.011008 (-0.006980)	0.050337 / 0.038508 (0.011829)	0.033234 / 0.023109 (0.010125)	0.273498 / 0.275898 (-0.002400)	0.299130 / 0.323480 (-0.024350)	0.004391 / 0.007986 (-0.003595)	0.002854 / 0.004328 (-0.001474)	0.048616 / 0.004250 (0.044365)	0.040354 / 0.037052 (0.003302)	0.287980 / 0.258489 (0.029491)	0.323940 / 0.293841 (0.030099)	0.033031 / 0.128546 (-0.095515)	0.012539 / 0.075646 (-0.063108)	0.061129 / 0.419271 (-0.358143)	0.034410 / 0.043533 (-0.009123)	0.276367 / 0.255139 (0.021228)	0.295266 / 0.283200 (0.012066)	0.018558 / 0.141683 (-0.123125)	1.149051 / 1.452155 (-0.303104)	1.207995 / 1.492716 (-0.284721)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.095732 / 0.018006 (0.077726)	0.305774 / 0.000490 (0.305284)	0.000222 / 0.000200 (0.000022)	0.000044 / 0.000054 (-0.000010)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.023680 / 0.037411 (-0.013731)	0.077147 / 0.014526 (0.062621)	0.088850 / 0.176557 (-0.087706)	0.130219 / 0.737135 (-0.606917)	0.090582 / 0.296338 (-0.205756)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.306099 / 0.215209 (0.090890)	2.952515 / 2.077655 (0.874861)	1.593090 / 1.504120 (0.088970)	1.471887 / 1.541195 (-0.069308)	1.484277 / 1.468490 (0.015787)	0.741158 / 4.584777 (-3.843619)	0.976520 / 3.745712 (-2.769192)	2.904631 / 5.269862 (-2.365231)	1.940287 / 4.565676 (-2.625389)	0.079828 / 0.424275 (-0.344447)	0.005482 / 0.007607 (-0.002125)	0.353376 / 0.226044 (0.127332)	3.502412 / 2.268929 (1.233483)	1.976571 / 55.444624 (-53.468053)	1.675141 / 6.876477 (-5.201336)	1.821075 / 2.142072 (-0.320998)	0.814290 / 4.805227 (-3.990937)	0.135227 / 6.500664 (-6.365437)	0.041631 / 0.075469 (-0.033838)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.041495 / 1.841788 (-0.800293)	12.275647 / 8.074308 (4.201339)	10.569540 / 10.191392 (0.378148)	0.143136 / 0.680424 (-0.537288)	0.015010 / 0.534201 (-0.519191)	0.302177 / 0.579283 (-0.277106)	0.125924 / 0.434364 (-0.308440)	0.340977 / 0.540337 (-0.199360)	0.438467 / 1.386936 (-0.948469)

* Transform all HF HTTP URLs to HF protocol
* Fix test URL
* Remove HF headers for non-HF HTTP URLs
* Fix for HTTP storage_options without 'headers'
* Remove unused cookies
* Refactor
* Refactor list to set to check membership
* Refactor to add protocol key to storage_options only at the end
* Fix overwriting storage_options nested values
* Add tests
* Revert "Transform all HF HTTP URLs to HF protocol"
  This reverts commit a337212.
* Test that DownloadConfig.storage_options are not modified
* Fix so DownloadConfig.storage_options are not modified
* Refactor fix
* Test also GitHub URL
* Fix DownloadConfig.storage_options for GitHub URL
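As a rough sketch of the "Remove HF headers for non-HF HTTP URLs" change listed above, the idea is to attach the Hub-specific headers only when the URL points at the Hub. The names `HF_ENDPOINT`, `is_hf_url` and `headers_for`, the user-agent string, and the host comparison are all assumptions made for this example; the actual check in the library may differ.

```python
from typing import Optional
from urllib.parse import urlparse

# Assumption: the Hub endpoint; the real library reads this from its configuration.
HF_ENDPOINT = "https://huggingface.co"


def is_hf_url(url: str) -> bool:
    """Return True if `url` points at the assumed Hub host."""
    return urlparse(url).netloc == urlparse(HF_ENDPOINT).netloc


def headers_for(url: str, token: Optional[str] = None) -> dict:
    """Hypothetical helper: build HF-specific headers only for Hub URLs."""
    if not is_hf_url(url):
        # Non-HF hosts (e.g. GitHub) receive neither the token nor the HF user-agent.
        return {}
    headers = {"user-agent": "datasets-example"}  # placeholder user-agent string
    if token:
        headers["authorization"] = f"Bearer {token}"
    return headers


hub_headers = headers_for(
    "https://huggingface.co/datasets/user/repo/resolve/main/data.csv", token="hf_xxx"
)
other_headers = headers_for(
    "https://raw.githubusercontent.com/user/repo/main/data.csv", token="hf_xxx"
)
print(hub_headers)
print(other_headers)
```

Returning an empty dict for other hosts keeps the authentication token from being sent to third-party servers.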

Transform all HF HTTP URLs to HF protocol

a337212

albertvillanova added 9 commits July 24, 2024 08:32

Fix test URL

7e5af29

Remove HF headers for non-HF HTTP URLs

85aa4d0

Fix for HTTP storage_options without 'headers'

11f56e4

Remove unused cookies

e093dcf

Refactor

799732e

Refactor list to set to check membership

4defd60

Refactor to add protocol key to storage_options only at the end

9a60ebf

Fix overwriting storage_options nested values

560f194

Add tests

ec2bc84

albertvillanova marked this pull request as ready for review July 24, 2024 08:54

lhoestq reviewed Jul 25, 2024

src/datasets/utils/file_utils.py (outdated, resolved)

lhoestq reviewed Jul 25, 2024

src/datasets/utils/file_utils.py (outdated, resolved)

albertvillanova added 6 commits July 26, 2024 14:07

Revert "Transform all HF HTTP URLs to HF protocol"

7367535

This reverts commit a337212.

Test that DownloadConfig.storage_options are not modified

e446fc4

Fix so DownloadConfig.storage_options are not modified

45badbc

Refactor fix

533ddb9

Test also GitHub URL

e526f6c

Fix DownloadConfig.storage_options for GitHub URL

e1de6f2

lhoestq approved these changes Jul 26, 2024

albertvillanova merged commit baea190 into main Jul 29, 2024
15 checks passed

albertvillanova deleted the fix-prepare-path-storage-options branch July 29, 2024 06:56
