

Duplicated Rows When Loading Parquet Files from Root Directory with Subdirectories #6259

Closed
MF-FOOM opened this issue Sep 25, 2023 · 1 comment · Fixed by #6704
MF-FOOM commented Sep 25, 2023

Describe the bug

When parquet files are saved in "train" and "val" subdirectories under a root directory, and the dataset is then loaded with load_dataset("parquet", data_dir="root_directory"), the resulting dataset contains duplicated rows in both the training and validation splits.

Steps to reproduce the bug

  1. Create a root directory, e.g., "testing123".
  2. Under "testing123", create two subdirectories: "train" and "val".
  3. Create and save a parquet file with 3 unique rows in the "train" subdirectory.
  4. Create and save a parquet file with 4 unique rows in the "val" subdirectory.
  5. Load the datasets from the root directory using load_dataset("parquet", data_dir="testing123")
  6. Iterate through the datasets and print the rows

Here's a Colab notebook reproducing these steps:

https://colab.research.google.com/drive/11NEdImnQ3OqJlwKSHRMhr7jCBesNdLY4?usp=sharing
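For reference, a minimal script along the lines of the steps above (the column name and row values are illustrative; the file names match the paths used in the workaround below):

import os

import pandas as pd
from datasets import load_dataset

# Steps 1-4: root directory with "train" and "val" subdirectories,
# each holding one parquet file with unique rows (3 and 4 respectively).
os.makedirs("testing123/train", exist_ok=True)
os.makedirs("testing123/val", exist_ok=True)
pd.DataFrame({"text": ["a", "b", "c"]}).to_parquet("testing123/train/output_train.parquet")
pd.DataFrame({"text": ["d", "e", "f", "g"]}).to_parquet("testing123/val/output_val.parquet")

# Step 5: load from the root directory.
ds = load_dataset("parquet", data_dir="testing123")

# Step 6: print the rows; with datasets 2.14.5 each row shows up twice.
for split in ds:
    for row in ds[split]:
        print(split, row)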

Expected behavior

  • Training set should contain 3 unique rows.
  • Validation set should contain 4 unique rows.

Environment info

  • datasets version: 2.14.5
  • Platform: Linux-5.15.120+-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Huggingface_hub version: 0.17.2
  • PyArrow version: 9.0.0
  • Pandas version: 1.5.3
@mariosasko
Collaborator

Thanks for reporting this issue! We should be able to avoid this by making our glob patterns more precise. In the meantime, you can load the dataset by directly assigning splits to the data files:

from datasets import load_dataset

ds = load_dataset(
    "parquet",
    data_files={
        "train": "testing123/train/output_train.parquet",
        "validation": "testing123/val/output_val.parquet",
    },
)
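With this workaround, the split sizes should match the expected counts (a quick sanity check, assuming the same file names as above):

print(len(ds["train"]))       # expected: 3
print(len(ds["validation"]))  # expected: 4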
