Describe the bug

When parquet files are saved in "train" and "val" subdirectories under a root directory, and the dataset is then loaded with `load_dataset("parquet", data_dir="root_directory")`, the resulting dataset contains duplicated rows in both the training and validation splits.

Steps to reproduce the bug
1. Create a root directory, e.g., "testing123".
2. Under "testing123", create two subdirectories: "train" and "val".
3. Create and save a parquet file with 3 unique rows in the "train" subdirectory.
4. Create and save a parquet file with 4 unique rows in the "val" subdirectory.
5. Load the dataset from the root directory with `load_dataset("parquet", data_dir="testing123")` (see the sketch below).
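A minimal script covering these steps; the column name `x` and the row values are illustrative, not taken from the report:

```python
from pathlib import Path

import pandas as pd
from datasets import load_dataset

# Create the directory layout: testing123/train and testing123/val
Path("testing123/train").mkdir(parents=True, exist_ok=True)
Path("testing123/val").mkdir(parents=True, exist_ok=True)

# Save 3 unique rows under "train" and 4 unique rows under "val"
pd.DataFrame({"x": [1, 2, 3]}).to_parquet("testing123/train/data.parquet")
pd.DataFrame({"x": [4, 5, 6, 7]}).to_parquet("testing123/val/data.parquet")

# Load from the root directory; with the buggy glob patterns the
# reported split sizes come out larger than the 3 and 4 rows saved
ds = load_dataset("parquet", data_dir="testing123")
print(ds)
```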
Here's a Colab notebook reproducing these steps:
https://colab.research.google.com/drive/11NEdImnQ3OqJlwKSHRMhr7jCBesNdLY4?usp=sharing
Expected behavior

The train split should contain only the 3 unique rows saved under "train", and the val split only the 4 unique rows saved under "val", with no duplication.

Environment info
`datasets` version: 2.14.5

Thanks for reporting this issue! We should be able to avoid this by making our glob patterns more precise. In the meantime, you can load the dataset by directly assigning splits to the data files:
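A sketch of that workaround, assuming the directory layout from the reproduction steps (the glob patterns are illustrative):

```python
from datasets import load_dataset

# Map each split to its own files explicitly instead of relying on
# data_dir's split inference
ds = load_dataset(
    "parquet",
    data_files={
        "train": "testing123/train/*.parquet",
        "val": "testing123/val/*.parquet",
    },
)
```

Because each split is tied to an explicit glob, every parquet file is matched exactly once, so the train and val splits come back with their original row counts.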