

Duplicated Rows When Loading Parquet Files from Root Directory with Subdirectories #6259

Closed
MF-FOOM opened this issue Sep 25, 2023 · 1 comment · Fixed by #6704
MF-FOOM commented Sep 25, 2023

Describe the bug

When parquet files are saved in "train" and "val" subdirectories under a root directory, and the dataset is then loaded with load_dataset("parquet", data_dir="root_directory"), the resulting dataset contains duplicated rows in both the training and validation splits.

Steps to reproduce the bug

  1. Create a root directory, e.g., "testing123".
  2. Under "testing123", create two subdirectories: "train" and "val".
  3. Create and save a parquet file with 3 unique rows in the "train" subdirectory.
  4. Create and save a parquet file with 4 unique rows in the "val" subdirectory.
  5. Load the datasets from the root directory using load_dataset("parquet", data_dir="testing123")
  6. Iterate through the datasets and print the rows

Here's a Colab notebook reproducing these steps:

https://colab.research.google.com/drive/11NEdImnQ3OqJlwKSHRMhr7jCBesNdLY4?usp=sharing
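For reference, a minimal script along the lines of the steps above (the column name and row values are illustrative; the file names match the paths used in the workaround below):

import os

import pandas as pd
from datasets import load_dataset

# Steps 1-4: root directory with "train" and "val" subdirectories,
# each holding one parquet file with unique rows (3 and 4 respectively).
os.makedirs("testing123/train", exist_ok=True)
os.makedirs("testing123/val", exist_ok=True)
pd.DataFrame({"text": ["a", "b", "c"]}).to_parquet("testing123/train/output_train.parquet")
pd.DataFrame({"text": ["d", "e", "f", "g"]}).to_parquet("testing123/val/output_val.parquet")

# Step 5: load from the root directory.
ds = load_dataset("parquet", data_dir="testing123")

# Step 6: print the rows; with datasets 2.14.5 each row shows up twice.
for split in ds:
    for row in ds[split]:
        print(split, row)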

Expected behavior

  • Training set should contain 3 unique rows.
  • Validation set should contain 4 unique rows.

Environment info

  • datasets version: 2.14.5
  • Platform: Linux-5.15.120+-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Huggingface_hub version: 0.17.2
  • PyArrow version: 9.0.0
  • Pandas version: 1.5.3
@mariosasko
Collaborator

Thanks for reporting this issue! We should be able to avoid this by making our glob patterns more precise. In the meantime, you can load the dataset by directly assigning splits to the data files:

from datasets import load_dataset

ds = load_dataset(
    "parquet",
    data_files={
        "train": "testing123/train/output_train.parquet",
        "validation": "testing123/val/output_val.parquet",
    },
)
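With this workaround, the split sizes should match the expected counts (a quick sanity check, assuming the same file names as above):

print(len(ds["train"]))       # expected: 3
print(len(ds["validation"]))  # expected: 4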
