Describe the bug
I'm using a dataset with map and multiprocessing to run a function that returns a variable-length list of outputs. This output list may be empty. Normally this is handled fine, but there is an edge case that crops up when using multiprocessing: in some cases, an empty list result ends up in a dataset shard consisting of a single item. This results in a "The features can't be aligned" error that is difficult to debug because it depends on the number of processes/shards used.
I've reproduced a minimal example below. My current workaround is to fill empty results with a dummy value that I filter after, but this was a weird error that took a while to track down.
Steps to reproduce the bug
import datasets

dataset = datasets.Dataset.from_list([{'idx': i} for i in range(60)])

def test_func(row, idx):
    if idx == 58:
        return {'output': []}
    else:
        return {'output': [{'test': 1}, {'test': 2}]}

# this works fine
test1 = dataset.map(lambda row, idx: test_func(row, idx), with_indices=True, num_proc=4)

# this fails
test2 = dataset.map(lambda row, idx: test_func(row, idx), with_indices=True, num_proc=32)
> ValueError: The features can't be aligned because the key output of features {'idx': Value(dtype='int64', id=None), 'output': Sequence(feature=Value(dtype='null', id=None), length=-1, id=None)} has unexpected type - Sequence(feature=Value(dtype='null', id=None), length=-1, id=None) (expected either [{'test': Value(dtype='int64', id=None)}] or Value("null").
The error occurs during the feature alignment check. When the multiprocessing split lines up just right with the empty return value, one dset in dsets will contain a single item with an empty list value, which causes the error.
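To see why only some num_proc values trigger this, the following sketch computes which shard row 58 lands in. It assumes the rows are split into contiguous shards that are as even as possible, with the earlier shards taking the remainder; that matches my understanding of how the split is done but is an assumption, not something confirmed from the library source. The helper names are made up for illustration.

```python
def contiguous_shard_bounds(num_rows, num_shards):
    """Return (start, end) row ranges for evenly sized contiguous shards.

    Assumption: the first `num_rows % num_shards` shards get one extra row.
    """
    div, mod = divmod(num_rows, num_shards)
    bounds = []
    for index in range(num_shards):
        start = div * index + min(index, mod)
        end = start + div + (1 if index < mod else 0)
        bounds.append((start, end))
    return bounds


def shard_of(row_idx, num_rows, num_shards):
    """Return the (start, end) range of the shard containing row_idx."""
    for start, end in contiguous_shard_bounds(num_rows, num_shards):
        if start <= row_idx < end:
            return start, end
    raise IndexError(row_idx)


for num_proc in (4, 32):
    start, end = shard_of(58, 60, num_proc)
    print(num_proc, (start, end), end - start)
# 4 (45, 60) 15
# 32 (58, 59) 1
```

Under that assumption, num_proc=4 puts row 58 in a 15-row shard where the other rows supply the nested type, while num_proc=32 leaves it alone in a one-row shard whose only value is the empty list.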
Expected behavior
The result should be the same regardless of the num_proc value used.
Environment info
Datasets version 2.11.0
Python 3.9.16
I just encountered the same error in the same situation (multiprocessing with variable length outputs).
The funny (or dangerous?) thing is that this error only showed up when testing with a small test dataset (16 examples, ValueError with num_proc > 1), but the same code works fine for the full dataset (~70k examples).
@mariosasko Any idea on how to do that with a nested feature with lists of variable lengths containing dicts?
EDIT: Was able to narrow it down: >200 Examples: no error, <150 Examples: Error.
No idea what to make of this, but it's pretty obvious that this is a bug...
I'm running into the same error. Is there any working workaround for this that doesn't involve using a larger subset or reducing the number of workers? I couldn't get the features setting mentioned above to work...