-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve default patterns resolution #6704
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
Awesome ! Note that it can still create duplicates if a path matches several dir patterns, e.g.
matches two dir patterns:
PS: feel free to update your branch, I just updated ruff on |
Yes, I didn't mention that case on purpose 🙂. One solution would be deprecating the |
…efault-patterns-resolution
I think it's too big of a breaking change yes :/ (and would make the docs / logic more complex for users to get imo) Though I think your approach is already a nice step in the right direction |
…efault-patterns-resolution
These changes to the |
Nice ! Though since with open("my/local/dir/0000.txt", "w") as f:
f.write("Hello there")
d1 = load_dataset("my/local/dir")
with open("my/local/dir/0001.txt", "w") as f:
f.write("General Kenobi")
d2 = load_dataset("my/local/dir")
assert list(d1) != list(d2) |
Yes. But I think I have a solution for this. |
I'm not satisfied with the context manager approach... A clean solution would require a bigger rewrite of the resolution logic (e.g., merging The current changes make the local resolution 2-3x faster, which is good enough for now, I think. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good
Show benchmarksPyArrow==8.0.0 Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
|
@@ -46,36 +46,57 @@ class EmptyDatasetError(FileNotFoundError): | |||
} | |||
NON_WORDS_CHARS = "-._ 0-9" | |||
if config.FSSPEC_VERSION < version.parse("2023.9.0"): | |||
KEYWORDS_IN_PATH_NAME_BASE_PATTERNS = ["{keyword}[{sep}/]**", "**[{sep}/]{keyword}[{sep}/]**"] | |||
KEYWORDS_IN_FILENAME_BASE_PATTERNS = ["**[{sep}/]{keyword}[{sep}]*", "{keyword}[{sep}]*"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a breaking change impacting all uses of the variable KEYWORDS_IN_PATH_NAME_BASE_PATTERNS
. See: https://github.com/huggingface/dataset-viewer/actions/runs/8753796799/job/24024224560?pr=2740
CC: @mariosasko @lhoestq
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dataset-viewer
seems to be the only repo on GH using KEYWORDS_IN_PATH_NAME_BASE_PATTERNS
...
But I think it's okay to add it back if it's hard to fix datasets-viewer
's CI otherwise (merging KEYWORDS_IN_FILENAME_BASE_PATTERNS
and KEYWORDS_IN_DIR_NAME_BASE_PATTERNS
should fix it, no?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no need to add it back imo, we can just use the new variables (adding the two together) in dataset-viewer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am fixing this in dataset-viewer: huggingface/dataset-viewer#2740 (comment)
This change is breaking in datasets/src/datasets/arrow_dataset.py Line 1515 in f96e74d
when the input is |
I opened #6828 to add proper Path support to save_to_disk / load_from_disk |
Separate the default patterns that match directories from the ones matching files and ensure directories are checked first (reverts the change from #6244, which merged these patterns). Also, ensure that the glob patterns do not overlap to avoid duplicates in the result.
Additionally, replace
get_fs_token_paths
withurl_to_fs
to avoid unnecessary glob calls.fix #6259
fix #6272