Inconsistent "The features can't be aligned" error when combining map, multiprocessing, and variable length outputs #6020

Open
kheyer opened this issue Jul 11, 2023 · 4 comments


@kheyer

kheyer commented Jul 11, 2023

Describe the bug

I'm using a dataset with map and multiprocessing to run a function that returns a variable length list of outputs. This output list may be empty. Normally this is handled fine, but there is an edge case that crops up when using multiprocessing. In some cases, an empty list result ends up in a dataset shard consisting of a single item. This results in a "The features can't be aligned" error that is difficult to debug because it depends on the number of processes/shards used.

I've reproduced a minimal example below. My current workaround is to fill empty results with a dummy value that I filter out afterward, but this was a weird error that took a while to track down.
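
A minimal sketch of the dummy-value workaround, in plain Python so it runs without datasets installed. The sentinel DUMMY and the helper names pad_empty and strip_dummy are hypothetical, chosen only for illustration; the idea is that the mapped function returns pad_empty(output) so that no shard ever sees a bare empty list.

```python
# Hypothetical sentinel: pick a value that can never occur in real outputs.
DUMMY = {"test": -1}


def pad_empty(output):
    """Replace an empty result with a one-item list holding a copy of the sentinel."""
    return output if output else [dict(DUMMY)]


def strip_dummy(output):
    """Drop sentinel items again, leaving real items untouched."""
    return [item for item in output if item != DUMMY]


padded = pad_empty([])
print(padded)               # the empty result now carries the struct type
print(strip_dummy(padded))  # and is empty again after stripping
print(strip_dummy(pad_empty([{"test": 1}, {"test": 2}])))
```

Stripping the sentinel is probably safest when the values are consumed, or in a single-process pass, since a second map with a high num_proc could presumably run into the same single-item shard again.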

Steps to reproduce the bug

import datasets

dataset = datasets.Dataset.from_list([{'idx':i} for i in range(60)])

def test_func(row, idx):
    if idx==58:
        return {'output': []}
    else:
        return {'output' : [{'test':1}, {'test':2}]}

# this works fine
test1 = dataset.map(lambda row, idx: test_func(row, idx), with_indices=True, num_proc=4)

# this fails
test2 = dataset.map(lambda row, idx: test_func(row, idx), with_indices=True, num_proc=32)
>ValueError: The features can't be aligned because the key output of features {'idx': Value(dtype='int64', id=None), 'output': Sequence(feature=Value(dtype='null', id=None), length=-1, id=None)} has unexpected type - Sequence(feature=Value(dtype='null', id=None), length=-1, id=None) (expected either [{'test': Value(dtype='int64', id=None)}] or Value("null").

The error occurs during the check

_check_if_features_can_be_aligned([dset.features for dset in dsets])

When the multiprocessing splitting lines up just right with the empty return value, one of the dset in dsets will have a single item with an empty list value, causing the error.
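
To illustrate why num_proc=4 works and num_proc=32 fails, here is a rough sketch of the shard arithmetic. It assumes the rows are split into contiguous shards with the remainder handed to the first shards, which is my understanding of how the contiguous split behaves; the function names are made up for this example and are not part of the datasets API.

```python
def contiguous_shard_bounds(num_rows, num_shards):
    """Return one (start, end) row range per shard, end exclusive.

    Assumption: the first (num_rows % num_shards) shards get one extra row.
    """
    div, mod = divmod(num_rows, num_shards)
    bounds = []
    for index in range(num_shards):
        start = div * index + min(index, mod)
        end = start + div + (1 if index < mod else 0)
        bounds.append((start, end))
    return bounds


def shard_containing(row, num_rows, num_shards):
    """Return the (start, end) range of the shard that holds the given row."""
    for start, end in contiguous_shard_bounds(num_rows, num_shards):
        if start <= row < end:
            return (start, end)
    raise ValueError("row out of range")


# Row 58 is the one that returns an empty list in the example above.
print(shard_containing(58, 60, 4))   # (45, 60): shares a shard with 14 other rows
print(shard_containing(58, 60, 32))  # (58, 59): alone in its shard
```

Under that assumption, with 32 processes row 58 sits in a shard by itself, so the only value seen for output in that shard is an empty list and its item type is inferred as null.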

Expected behavior

The result should be the same regardless of the num_proc value used.

Environment info

Datasets version 2.11.0
Python 3.9.16

@mariosasko
Collaborator

mariosasko commented Jul 12, 2023

This scenario currently requires explicitly passing the target features (to avoid the error):

import datasets

...

features = dataset.features
features["output"] = [{"test": datasets.Value("int64")}]
test2 = dataset.map(lambda row, idx: test_func(row, idx), with_indices=True, num_proc=32, features=features)

@jphme

jphme commented Oct 25, 2023

I just encountered the same error in the same situation (multiprocessing with variable length outputs).

The funny (or dangerous?) thing is that this error only showed up when testing with a small test dataset (16 examples, ValueError with num_proc > 1), while the same code works fine for the full dataset (~70k examples).

@mariosasko Any idea on how to do that with a nested feature with lists of variable lengths containing dicts?

EDIT: Was able to narrow it down: >200 examples: no error, <150 examples: error.
No idea what to make of this, but it's pretty obvious that this is a bug.

@Ananthzeke

This error also occurs while concatenating the datasets.

@SirRob1997

SirRob1997 commented Oct 27, 2024

I'm running into the same error. Is there any workaround for this that doesn't involve using a larger subset or reducing the number of workers? I couldn't get the features approach mentioned above to work...
