load_dataset with multiple jsonlines files interprets data structure too early #7092
I’ll take a look.
Possible definitions of done for this issue:

Option 1 is trivial. I think option 2 requires significant changes to the library. Since you outlined something akin to option 2 in the issue description, here's a solution for option 1 in the meantime:
```python
import datasets

data_dir = './data/annotated/api'

# All of the "*_inputs" columns share the same nested structure,
# so define it once and reuse it.
sampler = {'filter': datasets.Value(dtype='string'),
           'internal': datasets.Value(dtype='string'),
           'srgb': datasets.Value(dtype='string'),
           'vflip': datasets.Value(dtype='string'),
           'wrap': datasets.Value(dtype='string')}
inputs = [{'channel': datasets.Value(dtype='int64'),
           'ctype': datasets.Value(dtype='string'),
           'id': datasets.Value(dtype='int64'),
           'published': datasets.Value(dtype='int64'),
           'sampler': sampler,
           'src': datasets.Value(dtype='string')}]

features = datasets.Features({
    'id': datasets.Value(dtype='string'),
    'name': datasets.Value(dtype='string'),
    'author': datasets.Value(dtype='string'),
    'description': datasets.Value(dtype='string'),
    'tags': datasets.Sequence(feature=datasets.Value(dtype='string'), length=-1),
    'likes': datasets.Value(dtype='int64'),
    'viewed': datasets.Value(dtype='int64'),
    'published': datasets.Value(dtype='int64'),
    'date': datasets.Value(dtype='string'),
    'time_retrieved': datasets.Value(dtype='string'),
    'image_code': datasets.Value(dtype='string'),
    'image_inputs': inputs,
    'common_code': datasets.Value(dtype='string'),
    'sound_code': datasets.Value(dtype='string'),
    'sound_inputs': inputs,
    'buffer_a_code': datasets.Value(dtype='string'),
    'buffer_a_inputs': inputs,
    'buffer_b_code': datasets.Value(dtype='string'),
    'buffer_b_inputs': inputs,
    'buffer_c_code': datasets.Value(dtype='string'),
    'buffer_c_inputs': inputs,
    'buffer_d_code': datasets.Value(dtype='string'),
    'buffer_d_inputs': inputs,
    'cube_a_code': datasets.Value(dtype='string'),
    'cube_a_inputs': inputs,
    'thumbnail': datasets.Value(dtype='string'),
    'access': datasets.Value(dtype='string'),
    'license': datasets.Value(dtype='string'),
    'functions': datasets.Sequence(feature=datasets.Sequence(feature=datasets.Value(dtype='int64'), length=-1), length=-1),
    'test': datasets.Value(dtype='string'),
})

datasets.load_dataset('json', data_dir=data_dir, features=features)
```
As pointed out by @hvaara, you can define explicit features so that you avoid the error from type inference.

Note that the feature inference is done from the first few samples of JSON-Lines on purpose, so that the entire data does not need to be parsed twice (it would be inefficient for very large datasets).
I understand this. But can there be a solution that doesn't require the end user to write this schema by hand (in my case some fields contain a nested structure)? Maybe offer an option to infer the schema automatically before loading the dataset, or trigger such a method when this error arises? Is this "first few files" heuristic accessible via kwargs, perhaps? Maybe an error message that suggests passing explicit features? There might be efficient implementations to solve this problem for larger datasets.
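One shape such an implementation could take (a sketch built on existing PyArrow calls, not a `datasets` feature; the file layout and the promotion option are assumptions, and `promote_options` needs a recent PyArrow):

```python
import glob

import pyarrow as pa
import pyarrow.json as paj
import datasets

files = sorted(glob.glob("./data/annotated/api/*.jsonl"))  # hypothetical layout

# Read each file once with Arrow and collect the per-file schemas.
# (A cheaper variant could parse only the first rows of each file.)
schemas = [paj.read_json(f).schema for f in files]

# Unify the schemas; default promotion lets a null-typed column (inferred
# from a file where it is always []) merge with a typed one.
unified = pa.unify_schemas(schemas, promote_options="default")

features = datasets.Features.from_arrow_schema(unified)
ds = datasets.load_dataset("json", data_dir="./data/annotated/api", features=features)
```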
@Vipitis raised a good point on the HF Discord regarding the use of a dataset script to provide the schema during initialization. Using this approach requires setting `trust_remote_code=True`, which many users can't or won't do.

For cases where using a dataset script is acceptable, would it be helpful to add functionality to the library (not necessarily in `load_dataset` itself) to generate such a schema for you? Alternatively, for situations where features need to be known at load-time without using a dataset script, another option could be loading the dataset schema from a file format that doesn't require `trust_remote_code=True`.
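For the last point, `Features` already round-trips through a plain dict, so the schema file could be ordinary JSON. A minimal sketch, assuming a file name and a stand-in schema:

```python
import json

import datasets

features = datasets.Features({"id": datasets.Value("string")})  # stand-in schema

# Persist the schema once as plain JSON...
with open("features.json", "w") as f:
    json.dump(features.to_dict(), f)

# ...and reload it at load-time, with no dataset script and no
# trust_remote_code involved.
with open("features.json") as f:
    features = datasets.Features.from_dict(json.load(f))

ds = datasets.load_dataset("json", data_dir="./data/annotated/api", features=features)
```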
Describe the bug
likely related to #6460
using

```python
datasets.load_dataset("json", data_dir= ... )
```

with multiple `.jsonl` files will error if one of the files (maybe the first file?) contains a full column of empty data.

Steps to reproduce the bug
real world example:

data is available in this PR-branch. Because my files are chunked by months, some months contain all empty data for some columns, just by chance - these are `[]`. Otherwise it's all the same structure.

You get a long error trace, where in the middle it says something like:
toy example: (on request)
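A hedged guess at what such a toy example could look like (file names and columns are invented for illustration; this should reproduce the cast error described above):

```python
import json

import datasets

# Two JSON-Lines files with the same structure; in the first one the
# "tags" column happens to be empty ([]) in every row.
with open("part1.jsonl", "w") as f:
    f.write(json.dumps({"id": "a", "tags": []}) + "\n")
with open("part2.jsonl", "w") as f:
    f.write(json.dumps({"id": "b", "tags": ["x", "y"]}) + "\n")

# Type inference sees only the first samples, types "tags" as a list of
# nulls, and then fails to cast the list of strings in the second file.
datasets.load_dataset("json", data_files=["part1.jsonl", "part2.jsonl"])
```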
Expected behavior
Some suggestions
As a workaround, I have lazily implemented the following (essentially step 2):
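The snippet itself isn't preserved here; a minimal sketch of such a workaround, assuming it reads every file up front so that type inference sees the full data (paths are hypothetical):

```python
import glob

import pandas as pd
import datasets

files = sorted(glob.glob("./data/annotated/api/*.jsonl"))

# Parse every file before building the dataset, so column types are
# inferred from all rows instead of just the first few samples.
df = pd.concat([pd.read_json(f, lines=True) for f in files], ignore_index=True)

ds = datasets.Dataset.from_pandas(df)
```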
this works fine for my use case, but is potentially slower and less memory efficient for really large datasets (where this is unlikely to happen in the first place).
Environment info

- `datasets` version: 2.20.0
- `huggingface_hub` version: 0.23.4
- `fsspec` version: 2023.10.0