Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update datasets to 2.19.0 and fsspec to 2024.3.1 #2740

Merged
merged 12 commits into from
Apr 24, 2024
Merged

Conversation

albertvillanova
Copy link
Member

Update datasets to 2.19.0.

Fix #2739.

@albertvillanova
Copy link
Member Author

Tests for worker are not passing: https://github.com/huggingface/dataset-viewer/actions/runs/8753796799/job/24024224560?pr=2740

I am fixing it.

@albertvillanova albertvillanova marked this pull request as draft April 19, 2024 12:52
@albertvillanova
Copy link
Member Author

The issue was introduced by @lhoestq in PR:

if any(
    NON_WORD_GLOB_SEPARATOR not in pattern.format(keyword="train", sep=NON_WORDS_CHARS)
    for pattern in datasets.data_files.KEYWORDS_IN_PATH_NAME_BASE_PATTERNS
):
    raise ImportError(

Any reason why did you put an indirect constraint on the upper version of fsspec and how to overcome it now?

@albertvillanova
Copy link
Member Author

As @mariosasko and @lhoestq suggested (see huggingface/datasets#6704 (comment) and huggingface/datasets#6704 (comment)), I am fixing the issue by creating a new variable by joining the 2 new ones).

@albertvillanova
Copy link
Member Author

@lhoestq, I have just checked that the fix is not as direct as suggested.

Before the changes introduced by @mariosasko in huggingface/datasets#6704, the ImportError was not raised if fsspec version was < 2023.12.0, i.e. the pattern NON_WORD_GLOB_SEPARATOR is present in ALL KEYWORDS_IN_PATH_NAME_BASE_PATTERNS for both cases fsspec < 2023.9.0 or fsspec < 2023.12.0.
That means the ImportError was only raised if fsspec >= 2023.12.0.

But after the changes introduced by @mariosasko in huggingface/datasets#6704, if we join the variables KEYWORDS_IN_DIR_NAME_BASE_PATTERNS and KEYWORDS_IN_FILENAME_BASE_PATTERNS, the ImportError will always be raised: there is always at least one pattern that does not contain the NON_WORD_GLOB_SEPARATOR, for example:

  • for fsspec < 2023.9.0:
    • filename = "{keyword}[{sep}]*"
    • firname = "{keyword}/**"
  • for fsspec < 2023.12.0:
    • filename = "{keyword}[{sep}]*"
    • dirname = "{keyword}/**/*"
  • for fsspec >= 2023.12.0: any of the filenames or dirnames

@albertvillanova
Copy link
Member Author

@lhoestq if you are trying to put a constraint on the version of fsspec, wouldn't it be easier to check the installed version?

@lhoestq
Copy link
Member

lhoestq commented Apr 23, 2024

Then we need to update NON_WORD_GLOB_SEPARATOR, I can continue this PR

Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tests are all passing now, feel free to check if my changes look good to you @albertvillanova and merge

@albertvillanova albertvillanova changed the title Update datasets to 2.19.0 Update datasets to 2.19.0 and fsspec to 2024.3.1 Apr 23, 2024
@albertvillanova
Copy link
Member Author

@lhoestq I have updated fsspec in docs, e2e and admin_ui as well.

@albertvillanova albertvillanova marked this pull request as ready for review April 23, 2024 14:06
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@albertvillanova albertvillanova merged commit 13a2c12 into main Apr 24, 2024
23 checks passed
@albertvillanova albertvillanova deleted the fix-2739 branch April 24, 2024 13:31
@lhoestq
Copy link
Member

lhoestq commented Apr 24, 2024

thanks !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Update datasets to 2.19.0
3 participants