Fix push_to_hub by not calling create_branch if PR branch #7069

albertvillanova · 2024-07-25T07:50:04Z

Fix push_to_hub by not calling create_branch if PR branch (e.g. refs/pr/1).

Note that currently create_branch raises a 400 Bad Request error if the user passes a PR branch (e.g. refs/pr/1).

EDIT:
~~Fix push_to_hub by not calling create_branch if branch exists.~~

Note that currently create_branch raises a 403 Forbidden error even if all these conditions are met:

- exist_ok is passed
- the branch already exists
- the user does not have WRITE permission

Fix #7067.

Related issue:

Different behavior when passing exist_ok huggingface_hub#2419

HuggingFaceDocBuilderDev · 2024-07-25T07:52:30Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

lhoestq · 2024-07-25T13:31:26Z

cc @Wauplin maybe it's a huggingface_hub bug ?

EDIT: ah actually the issue is opened at huggingface/huggingface_hub#2419

albertvillanova · 2024-07-25T14:16:54Z

src/datasets/arrow_dataset.py

@@ -5620,7 +5620,8 @@ def push_to_hub(
        )
        repo_id = repo_url.repo_id

-        if revision is not None:
+        if revision is not None and not api.revision_exists(repo_id, revision, repo_type="dataset", token=token):


Alternatively:

Suggested change

-        if revision is not None and not api.revision_exists(repo_id, revision, repo_type="dataset", token=token):
+        if revision is not None and not revision.startswith("refs/pr/"):

I think this modification would be more efficient:

- It fixes the 400 Bad Request error without the additional network call to revision_exists.
- In the case of a 403 Forbidden error, an error is raised anyway a few lines below, when trying to push to the branch, if the user does not have write permission.

On the other hand, it is true that the call to revision_exists is a more general solution.
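
To make the trade-off between the two guards concrete, here is a minimal sketch. It does not touch the network: `StubApi` is a hypothetical stand-in for `HfApi` that only records which methods get called, and the helper names `ensure_branch_prefix_check` and `ensure_branch_existence_check` are made up for illustration. The `ValueError` merely mimics the 400 Bad Request described above; it is not what the real client raises.

```python
PR_REF_PREFIX = "refs/pr/"  # prefix of pull-request refs such as "refs/pr/1"


class StubApi:
    """Hypothetical stand-in for HfApi: records calls, never hits the network."""

    def __init__(self, existing_revisions):
        self.existing_revisions = set(existing_revisions)
        self.calls = []

    def revision_exists(self, repo_id, revision, repo_type=None, token=None):
        self.calls.append("revision_exists")
        return revision in self.existing_revisions

    def create_branch(self, repo_id, branch, repo_type=None, token=None, exist_ok=False):
        self.calls.append("create_branch")
        if branch.startswith(PR_REF_PREFIX):
            # Assumption: stands in for the 400 Bad Request on PR refs.
            raise ValueError("400 Bad Request: a PR ref is not a branch")
        self.existing_revisions.add(branch)


def ensure_branch_prefix_check(api, repo_id, revision, token=None):
    """Skip create_branch for PR refs; no extra call is needed to decide."""
    if revision is not None and not revision.startswith(PR_REF_PREFIX):
        api.create_branch(repo_id, branch=revision, repo_type="dataset", token=token, exist_ok=True)


def ensure_branch_existence_check(api, repo_id, revision, token=None):
    """Skip create_branch for any existing revision; costs one extra call."""
    if revision is not None and not api.revision_exists(
        repo_id, revision, repo_type="dataset", token=token
    ):
        api.create_branch(repo_id, branch=revision, repo_type="dataset", token=token, exist_ok=True)
```

With an existing PR ref, the prefix check makes no call at all, while the existence check makes exactly one `revision_exists` call. Both avoid calling `create_branch` on `refs/pr/1`.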

albertvillanova · 2024-07-30T07:22:41Z

I think we need to make this fix anyway, ~~unless we pin the lower version of huggingface-hub (once they release the patch)~~.

Calling create_branch with a PR ref raises an error

albertvillanova · 2024-07-30T07:55:15Z

Comment by @Wauplin: huggingface/huggingface_hub#2426 (comment)

I think this should be something to fix in datasets directly. Having a 400 Bad request when trying to create the branch refs/pr/1 seems normal to me since it's not a branch.

lhoestq · 2024-07-30T10:04:17Z

does this mean we should use create_pull_request() in that case ?

Wauplin · 2024-07-30T10:10:37Z

> does this mean we should use create_pull_request() in that case ?

If a user wants to push some data to a new PR, they can already pass create_pr=True, which will automatically do the job (without using revision). If the user passes revision="refs/pr/1" explicitly, you should assume the PR already exists.
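
From the caller's side, the two cases described above could be sketched as follows. The helper `push_kwargs` is hypothetical (it is not part of datasets); it only builds the keyword arguments one might pass to `push_to_hub`, under the assumption that an explicit PR number always refers to a PR that already exists.

```python
from typing import Any, Dict, Optional


def push_kwargs(pr_number: Optional[int] = None) -> Dict[str, Any]:
    """Build push_to_hub keyword arguments (hypothetical helper).

    - No PR number: ask the Hub to open a new PR via create_pr=True.
    - PR number given: target the existing PR ref; no branch is created.
    """
    if pr_number is None:
        return {"create_pr": True}
    return {"revision": f"refs/pr/{pr_number}"}


# Usage sketch (not executed here, since it needs network access and a token):
#   ds.push_to_hub("user/my_dataset", **push_kwargs())   # opens a new PR
#   ds.push_to_hub("user/my_dataset", **push_kwargs(1))  # pushes to refs/pr/1
```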

lhoestq · 2024-07-30T10:28:53Z

ah yes we do pass create_pr in preupload_lfs_files() ! sounds good then

github-actions · 2024-07-30T10:56:56Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005806 / 0.011353 (-0.005547)	0.004082 / 0.011008 (-0.006927)	0.064277 / 0.038508 (0.025769)	0.032289 / 0.023109 (0.009180)	0.242066 / 0.275898 (-0.033832)	0.272574 / 0.323480 (-0.050906)	0.003281 / 0.007986 (-0.004705)	0.002957 / 0.004328 (-0.001371)	0.049930 / 0.004250 (0.045679)	0.047306 / 0.037052 (0.010253)	0.252216 / 0.258489 (-0.006273)	0.286678 / 0.293841 (-0.007163)	0.030182 / 0.128546 (-0.098364)	0.012967 / 0.075646 (-0.062680)	0.204949 / 0.419271 (-0.214323)	0.036999 / 0.043533 (-0.006534)	0.243577 / 0.255139 (-0.011562)	0.265044 / 0.283200 (-0.018156)	0.021149 / 0.141683 (-0.120534)	1.112293 / 1.452155 (-0.339862)	1.186483 / 1.492716 (-0.306233)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.093239 / 0.018006 (0.075233)	0.286372 / 0.000490 (0.285883)	0.000224 / 0.000200 (0.000024)	0.000062 / 0.000054 (0.000007)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.019042 / 0.037411 (-0.018369)	0.063690 / 0.014526 (0.049164)	0.075034 / 0.176557 (-0.101523)	0.123053 / 0.737135 (-0.614083)	0.076843 / 0.296338 (-0.219495)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.276554 / 0.215209 (0.061345)	2.749338 / 2.077655 (0.671683)	1.442764 / 1.504120 (-0.061356)	1.327860 / 1.541195 (-0.213335)	1.369885 / 1.468490 (-0.098606)	0.722645 / 4.584777 (-3.862132)	2.430707 / 3.745712 (-1.315005)	3.105293 / 5.269862 (-2.164568)	1.961617 / 4.565676 (-2.604060)	0.077728 / 0.424275 (-0.346547)	0.005189 / 0.007607 (-0.002418)	0.335511 / 0.226044 (0.109467)	3.315618 / 2.268929 (1.046690)	1.858254 / 55.444624 (-53.586371)	1.552173 / 6.876477 (-5.324304)	1.627086 / 2.142072 (-0.514987)	0.790871 / 4.805227 (-4.014356)	0.136958 / 6.500664 (-6.363706)	0.043207 / 0.075469 (-0.032262)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	0.969314 / 1.841788 (-0.872473)	12.145318 / 8.074308 (4.071010)	9.834839 / 10.191392 (-0.356553)	0.141896 / 0.680424 (-0.538528)	0.014304 / 0.534201 (-0.519897)	0.306091 / 0.579283 (-0.273192)	0.260994 / 0.434364 (-0.173369)	0.348096 / 0.540337 (-0.192242)	0.441458 / 1.386936 (-0.945478)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005989 / 0.011353 (-0.005363)	0.003907 / 0.011008 (-0.007102)	0.050819 / 0.038508 (0.012310)	0.033178 / 0.023109 (0.010069)	0.279059 / 0.275898 (0.003161)	0.300161 / 0.323480 (-0.023319)	0.004383 / 0.007986 (-0.003603)	0.002834 / 0.004328 (-0.001495)	0.048779 / 0.004250 (0.044528)	0.040502 / 0.037052 (0.003450)	0.291786 / 0.258489 (0.033297)	0.323827 / 0.293841 (0.029986)	0.032175 / 0.128546 (-0.096371)	0.012157 / 0.075646 (-0.063489)	0.060796 / 0.419271 (-0.358476)	0.033924 / 0.043533 (-0.009609)	0.278047 / 0.255139 (0.022908)	0.297878 / 0.283200 (0.014678)	0.019137 / 0.141683 (-0.122546)	1.138996 / 1.452155 (-0.313158)	1.172731 / 1.492716 (-0.319985)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.110148 / 0.018006 (0.092142)	0.307232 / 0.000490 (0.306742)	0.000209 / 0.000200 (0.000009)	0.000044 / 0.000054 (-0.000010)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.023082 / 0.037411 (-0.014330)	0.076590 / 0.014526 (0.062065)	0.088444 / 0.176557 (-0.088113)	0.129293 / 0.737135 (-0.607842)	0.090470 / 0.296338 (-0.205868)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.305016 / 0.215209 (0.089807)	2.931671 / 2.077655 (0.854016)	1.586055 / 1.504120 (0.081935)	1.463517 / 1.541195 (-0.077678)	1.479654 / 1.468490 (0.011164)	0.726194 / 4.584777 (-3.858583)	0.970512 / 3.745712 (-2.775200)	2.850496 / 5.269862 (-2.419365)	1.920112 / 4.565676 (-2.645564)	0.079921 / 0.424275 (-0.344354)	0.005367 / 0.007607 (-0.002240)	0.347022 / 0.226044 (0.120978)	3.472425 / 2.268929 (1.203497)	1.965400 / 55.444624 (-53.479225)	1.669116 / 6.876477 (-5.207361)	1.859504 / 2.142072 (-0.282568)	0.802703 / 4.805227 (-4.002525)	0.134776 / 6.500664 (-6.365888)	0.041800 / 0.075469 (-0.033669)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.039665 / 1.841788 (-0.802122)	12.024071 / 8.074308 (3.949763)	10.338743 / 10.191392 (0.147351)	0.139495 / 0.680424 (-0.540929)	0.015249 / 0.534201 (-0.518952)	0.298580 / 0.579283 (-0.280703)	0.124625 / 0.434364 (-0.309739)	0.341868 / 0.540337 (-0.198470)	0.431396 / 1.386936 (-0.955540)

* Fix push_to_hub by not calling create_branch if branch exists
* Fix push_to_hub by not calling create_branch if branch exists
* Reword comment
* Fix push_to_hub by not calling create_branch if PR ref
* Update test

Fix push_to_hub by not calling create_branch if branch exists

06b076a

albertvillanova added 2 commits July 25, 2024 10:16

Fix push_to_hub by not calling create_branch if branch exists

857b73c

Reword comment

e4148ae

albertvillanova commented Jul 25, 2024

Wauplin mentioned this pull request Jul 29, 2024

Do not raise if branch exists and no write permission huggingface/huggingface_hub#2426

Merged

Fix push_to_hub by not calling create_branch if PR ref

dbabdda

albertvillanova added 2 commits July 30, 2024 09:59

Merge branch 'main' into fix-7067

ec8b690

Update test

cf9f2b8

lhoestq approved these changes Jul 30, 2024

albertvillanova merged commit 65b9499 into main Jul 30, 2024
15 checks passed

albertvillanova deleted the fix-7067 branch July 30, 2024 10:51

albertvillanova changed the title ~~Fix push_to_hub by not calling create_branch if branch exists~~ Fix push_to_hub by not calling create_branch if PR branch Jul 31, 2024
