Duplicate `data_files` when named `<split>/<split>.parquet` #6272

lhoestq · 2023-10-01T15:43:56Z

For example, with `u23429/stock_1_minute_ticker`:

In [1]: from datasets import *

In [2]: b = load_dataset_builder("u23429/stock_1_minute_ticker")
Downloading readme: 100%|██████████████████████████| 627/627 [00:00<00:00, 246kB/s]

In [3]: b.config.data_files
Out[3]: 
{NamedSplit('train'): ['hf://datasets/u23429/stock_1_minute_ticker@65c973cf4ec061f01a363b40da4c1bb128ba4166/train/train.parquet',
  'hf://datasets/u23429/stock_1_minute_ticker@65c973cf4ec061f01a363b40da4c1bb128ba4166/train/train.parquet'],
 NamedSplit('validation'): ['hf://datasets/u23429/stock_1_minute_ticker@65c973cf4ec061f01a363b40da4c1bb128ba4166/validation/validation.parquet',
  'hf://datasets/u23429/stock_1_minute_ticker@65c973cf4ec061f01a363b40da4c1bb128ba4166/validation/validation.parquet'],
 NamedSplit('test'): ['hf://datasets/u23429/stock_1_minute_ticker@65c973cf4ec061f01a363b40da4c1bb128ba4166/test/test.parquet',
  'hf://datasets/u23429/stock_1_minute_ticker@65c973cf4ec061f01a363b40da4c1bb128ba4166/test/test.parquet']}

This bug is present in the current release, datasets 2.14.5, and also on main, even after #6244. cc @mariosasko


mariosasko · 2023-10-01T17:18:27Z

Also reported in #6259

mariosasko · 2023-10-02T17:49:13Z

I think it's best to drop duplicates with a set (as a temporary fix) and improve the patterns when/if fsspec/filesystem_spec#1382 gets merged. @lhoestq Do you have some other ideas?
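A minimal sketch of the "drop duplicates" idea, assuming the resolved data files are plain strings in a list. The helper name `drop_duplicates` is hypothetical and not part of `datasets`. It uses `dict.fromkeys` rather than a bare `set` so that the original file order is kept.

```python
def drop_duplicates(paths):
    """Return paths with repeated entries removed, keeping first-seen order."""
    # dict keys are unique and preserve insertion order (Python 3.7+).
    return list(dict.fromkeys(paths))


# Shortened, illustrative paths shaped like the ones in the report above.
resolved = [
    "train/train.parquet",
    "train/train.parquet",
    "validation/validation.parquet",
    "validation/validation.parquet",
    "test/test.parquet",
    "test/test.parquet",
]

deduped = drop_duplicates(resolved)
print(deduped)
```

Keeping the order matters here because a plain `set` would make the order of files within a split arbitrary.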

lhoestq · 2023-10-02T18:19:49Z

Alternatively, couldn't we just use this?

if config.FSSPEC_VERSION < version.parse("2023.9.0"):
    KEYWORDS_IN_PATH_NAME_BASE_PATTERNS = [
        "{keyword}[{sep}/]**",
        "**[{sep}]{keyword}[{sep}/]**",
        "**/{keyword}[{sep}/]**",
    ]
else:
    KEYWORDS_IN_PATH_NAME_BASE_PATTERNS = [
        "{keyword}[{sep}/]**",
        "**/*[{sep}]{keyword}[{sep}/]**",
        "**/*/{keyword}[{sep}/]**",
    ]

This way there's no need to implement sets, which would require a bit of work, since we've always considered a list of patterns to resolve to the concatenated list of resolved files for each pattern (including duplicates).
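To illustrate why concatenating per-pattern results yields duplicates for `<split>/<split>.parquet`, here is a rough sketch. It is not the `datasets` resolver: `resolve_patterns` is a hypothetical helper, `fnmatch` stands in for fsspec globbing (its `*` also matches `/`), and the two patterns are simplified versions of the ones above with `{keyword}` set to `train` and `{sep}` assumed to be `-._ 0-9`.

```python
from fnmatch import fnmatchcase

# Simplified stand-ins for "{keyword}[{sep}/]**" and "**/{keyword}[{sep}/]**".
PATTERNS = [
    "train[-._ 0-9/]*",    # keyword at the start of the path
    "*/train[-._ 0-9/]*",  # keyword after a directory separator
]


def resolve_patterns(patterns, files):
    """Resolve each pattern separately and concatenate the results."""
    resolved = []
    for pattern in patterns:
        resolved += [f for f in files if fnmatchcase(f, pattern)]
    return resolved


files = ["train/train.parquet", "data/train-0000.parquet"]
resolved = resolve_patterns(PATTERNS, files)
print(resolved)
```

In this sketch `train/train.parquet` is matched once through its directory name and once through its file name, so it appears twice, while `data/train-0000.parquet` only matches the second pattern.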

lhoestq · 2023-10-02T18:22:18Z

Arf, `"**/*/{keyword}[{sep}/]**"` does return `data/keyword.txt` in the latest fsspec, but not in `glob.glob`.

EDIT: actually, I forgot to set `recursive=True`.

lhoestq · 2023-10-02T18:25:52Z

Actually, `glob.glob` does return it with `recursive=True`! My bad.
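The effect of `recursive=True` can be checked with the standard library alone. The sketch below uses a simplified pattern (`**/*/keyword.txt`, without the `[{sep}/]` character class) on a temporary directory containing only `data/keyword.txt`; the helper name `match` is made up for this example.

```python
import glob
import os
import tempfile


def match(root, recursive):
    """Glob for **/*/keyword.txt under root and return paths relative to it."""
    pattern = os.path.join(glob.escape(root), "**", "*", "keyword.txt")
    hits = glob.glob(pattern, recursive=recursive)
    return sorted(os.path.relpath(h, root).replace(os.sep, "/") for h in hits)


with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "data"))
    open(os.path.join(root, "data", "keyword.txt"), "w").close()

    # Without recursive=True, "**" behaves like "*", so the pattern
    # needs two directory levels above keyword.txt.
    without_recursive = match(root, recursive=False)
    # With recursive=True, "**" may match zero directories.
    with_recursive = match(root, recursive=True)

print(without_recursive)
print(with_recursive)
```

Only the recursive call finds `data/keyword.txt`, which is consistent with the correction above.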

lhoestq · 2023-10-04T15:29:39Z

Pff, I just tested and my idea sucks: patterns 1 and 3 obviously give duplicates.

lhoestq · 2023-10-05T10:32:26Z

> I think it's best to drop duplicates with a set (as a temporary fix)

I started #6278 to use `DataFilesSet` objects instead of `DataFilesList`.

lhoestq added the `bug` label Oct 1, 2023

This was referenced Oct 5, 2023

No data files duplicates #6278

Closed

Drop data_files duplicates #6282

Closed

mariosasko mentioned this issue Mar 1, 2024

Improve default patterns resolution #6704

Merged

mariosasko closed this as completed in #6704 Mar 15, 2024
