Inconsistent "The features can't be aligned" error when combining map, multiprocessing, and variable length outputs #6020

kheyer · 2023-07-11T20:40:38Z

Describe the bug

I'm using a dataset with map and multiprocessing to run a function that returns a variable-length list of outputs. This output list may be empty. Normally this is handled fine, but there is an edge case that crops up when using multiprocessing. In some cases, an empty list result ends up in a dataset shard consisting of a single item. This results in a "The features can't be aligned" error that is difficult to debug because it depends on the number of processes/shards used.

I've reproduced a minimal example below. My current workaround is to fill empty results with a dummy value that I filter after, but this was a weird error that took a while to track down.
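The dummy-value workaround mentioned above could look roughly like the following. This is only a sketch: `PLACEHOLDER`, `pad_empty`, `strip_placeholder`, and `produce_output` are hypothetical names, and the sentinel value is an assumption that must not collide with real data.

```python
# Hypothetical sentinel with the same structure as a real output item.
PLACEHOLDER = {"test": -1}

def pad_empty(output):
    """Replace an empty result with a one-item list holding the sentinel,
    so every shard sees at least one item of the real type."""
    return output if output else [PLACEHOLDER]

def strip_placeholder(output):
    """Remove the sentinel again when the results are consumed."""
    return [item for item in output if item != PLACEHOLDER]

def produce_output(row, idx):
    """Stand-in for the mapped function: row 58 has no results."""
    output = [] if idx == 58 else [{"test": 1}, {"test": 2}]
    return {"output": pad_empty(output)}

padded = produce_output({}, 58)["output"]
print(padded)                     # sentinel instead of an empty list
print(strip_placeholder(padded))  # back to an empty list
```

`produce_output` would be passed to `dataset.map(..., with_indices=True, num_proc=...)`. Stripping the sentinel is probably safest when the rows are consumed, since writing the empty lists back through another multiprocess `map` would presumably hit the same inference problem.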

Steps to reproduce the bug

import datasets

dataset = datasets.Dataset.from_list([{'idx':i} for i in range(60)])

def test_func(row, idx):
    if idx==58:
        return {'output': []}
    else:
        return {'output' : [{'test':1}, {'test':2}]}

# this works fine
test1 = dataset.map(lambda row, idx: test_func(row, idx), with_indices=True, num_proc=4)

# this fails
test2 = dataset.map(lambda row, idx: test_func(row, idx), with_indices=True, num_proc=32)
>ValueError: The features can't be aligned because the key output of features {'idx': Value(dtype='int64', id=None), 'output': Sequence(feature=Value(dtype='null', id=None), length=-1, id=None)} has unexpected type - Sequence(feature=Value(dtype='null', id=None), length=-1, id=None) (expected either [{'test': Value(dtype='int64', id=None)}] or Value("null").

The error occurs during the check

_check_if_features_can_be_aligned([dset.features for dset in dsets])

When the multiprocessing splitting lines up just right with the empty return value, one of the dset in dsets will have a single item with an empty list value, causing the error.
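To see why 32 processes fail while 4 succeed, the shard boundaries can be worked out by hand. The sketch below assumes that `map` with `num_proc` splits the dataset into contiguous shards of near-equal size, with the first `num_rows % num_shards` shards receiving one extra row. That matches my understanding of contiguous sharding in `datasets`, but I have not checked it against version 2.11.0; `contiguous_shards` and `shard_containing` are hypothetical helpers, not library functions.

```python
def contiguous_shards(num_rows, num_shards):
    """Return (start, end) row ranges for near-equal contiguous shards.

    Assumption: the first `num_rows % num_shards` shards get one extra row.
    """
    div, mod = divmod(num_rows, num_shards)
    bounds = []
    for i in range(num_shards):
        start = div * i + min(i, mod)
        end = start + div + (1 if i < mod else 0)
        bounds.append((start, end))
    return bounds

def shard_containing(row, num_rows, num_shards):
    """Return the (start, end) range of the shard that holds `row`."""
    for start, end in contiguous_shards(num_rows, num_shards):
        if start <= row < end:
            return (start, end)
    raise ValueError(f"row {row} is outside 0..{num_rows - 1}")

for num_proc in (4, 32):
    start, end = shard_containing(58, 60, num_proc)
    print(f"num_proc={num_proc}: row 58 shares a shard with rows {start}..{end - 1}")
```

Under this assumption, with 4 shards row 58 sits among rows 45..59, so the shard still contains non-empty outputs to infer the type from. With 32 shards, row 58 ends up alone in its shard, so the only value seen there is an empty list and the inferred type is a sequence of null.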

Expected behavior

The result should be the same regardless of the num_proc value used.

Environment info

Datasets version 2.11.0
Python 3.9.16


mariosasko · 2023-07-12T15:58:05Z

This scenario currently requires explicitly passing the target features (to avoid the error):

import datasets

...

features = dataset.features
features["output"] = [{"test": datasets.Value("int64")}]
test2 = dataset.map(lambda row, idx: test_func(row, idx), with_indices=True, num_proc=32, features=features)

jphme · 2023-10-25T13:54:42Z

I just encountered the same error in the same situation (multiprocessing with variable length outputs).

The funny (or dangerous?) thing is that this error only showed up when testing with a small test dataset (16 examples, ValueError with num_proc > 1), but the same code works fine for the full dataset (~70k examples).

@mariosasko Any idea on how to do that with a nested feature with lists of variable lengths containing dicts?

EDIT: Was able to narrow it down: >200 Examples: no error, <150 Examples: Error.
No idea what to make of this, but it seems pretty obvious that this is a bug.

Ananthzeke · 2024-02-10T19:24:28Z

This error also occurs while concatenating the datasets.

SirRob1997 · 2024-10-27T06:19:31Z

I'm running into the same error. Is there a working workaround for this that doesn't involve using a larger subset or reducing the number of workers? I couldn't get the features setting mentioned above to work...
