Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CI is broken for convert_to_parquet: Invalid rev id: refs/pr/1 404 error causes RevisionNotFoundError #7073

Closed
albertvillanova opened this issue Jul 26, 2024 · 9 comments · Fixed by #7074
Assignees

Comments

@albertvillanova
Copy link
Member

albertvillanova commented Jul 26, 2024

See: https://github.com/huggingface/datasets/actions/runs/10095313567/job/27915185756

FAILED tests/test_hub.py::test_convert_to_parquet - huggingface_hub.utils._errors.RevisionNotFoundError: 404 Client Error. (Request ID: Root=1-66a25839-31ce7b475e70e7db1e4d44c2;b0c8870f-d5ef-4bf2-a6ff-0191f3df0f64)

Revision Not Found for url: https://hub-ci.huggingface.co/api/datasets/__DUMMY_TRANSFORMERS_USER__/test-dataset-5188a8-17219154347516/preupload/refs%2Fpr%2F1.
Invalid rev id: refs/pr/1
 /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/datasets/hub.py:86: in convert_to_parquet
    dataset.push_to_hub(
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/datasets/dataset_dict.py:1722: in push_to_hub
    split_additions, uploaded_size, dataset_nbytes = self[split]._push_parquet_shards_to_hub(
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/datasets/arrow_dataset.py:5511: in _push_parquet_shards_to_hub
    api.preupload_lfs_files(
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/huggingface_hub/hf_api.py:4231: in preupload_lfs_files
    _fetch_upload_modes(
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/huggingface_hub/utils/_validators.py:118: in _inner_fn
    return fn(*args, **kwargs)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/huggingface_hub/_commit_api.py:507: in _fetch_upload_modes
    hf_raise_for_status(resp)
@albertvillanova albertvillanova self-assigned this Jul 26, 2024
@albertvillanova albertvillanova changed the title CI is broken for convert_to_parquet: 404 error causes RevisionNotFoundError CI is broken for convert_to_parquet: Invalid rev id: refs/pr/1 404 error causes RevisionNotFoundError Jul 26, 2024
@albertvillanova
Copy link
Member Author

albertvillanova commented Jul 26, 2024

Any recent change in the API backend rejecting parameter revision="refs/pr/1" to HfApi.preupload_lfs_files?

f"{endpoint}/api/{repo_type}s/{repo_id}/preupload/{revision}"

https://hub-ci.huggingface.co/api/datasets/__DUMMY_TRANSFORMERS_USER__/test-dataset-5188a8-17219154347516/preupload/refs%2Fpr%2F1.
Invalid rev id: refs/pr/1

@Wauplin @huggingface/datasets @huggingface/moon-landing @huggingface/moon-landing-back

@albertvillanova
Copy link
Member Author

I have temporarily fixed the CI with:

However, the underlying issue must be fixed and #7074 must be reverted.

@coyotte508
Copy link
Member

Hmm does it do the preupload call before creating the ref cc @Wauplin ?

(in that case it should do a preupload call on the base branch with ?create_pr=1)

@albertvillanova
Copy link
Member Author

@coyotte508, the CI test was implemented 2 months ago and it was working OK until yesterday. See the CI status of the commits in the main branch of datasets: https://github.com/huggingface/datasets/commits/main/

@coyotte508
Copy link
Member

coyotte508 commented Jul 26, 2024

Yes i get that

We changed the preupload response to return the commit id in https://github.com/huggingface-internal/moon-landing/pull/10756

This line is probably causing the error: https://github.com/huggingface-internal/moon-landing/pull/10756/files#diff-558f6f9865e30bfa091b94d6a4a900138103ddb4eb0bec96b6deec5bf5626fa0R2322

It's weird the error is returned, it means that maybe a ref with 0 history (not even the first commit) was created

Does this change have any impact in production, or just the CI test? If it's just the CI test it should be fixed on your side, if it impacts production we can look at a solution

@albertvillanova
Copy link
Member Author

albertvillanova commented Jul 26, 2024

@coyotte508 it impacts production: convert_to_parquet raises the above error when the dataset has more that one configs/subsets:

  • First subset calls push_to_hub with create_pr=True
  • Second subset uses the refs/pr/# returned by the call above, and calls push_to_hub with revision="refs/pr/#"

@coyotte508
Copy link
Member

coyotte508 commented Jul 26, 2024

I tried removing the mock_commit call: #7076

And the tests seem to work.

So it's probably because the commit is not actually called, it doesn't actually create the pull request on the remote (and the associated refs/pr/1). But the preupload call is not mocked.

Anyway it shouldn't impact production, since production isn't mocked

@albertvillanova
Copy link
Member Author

albertvillanova commented Jul 26, 2024

@coyotte508 thanks a lot for the investigation and sorry for the noise.
I promise not trying to fix things when I have a slight fever: my head does not work well.

We need indeed to mock preupload_lfs_files: before it was not necessary, but now it is.

@albertvillanova
Copy link
Member Author

I fixed the test in:

Thanks again, @coyotte508.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants