fast array extraction #7227

alex-hh · 2024-10-14T20:51:32Z

Implements #7210 using the method suggested in #7207 (comment).

```python
import numpy as np
from datasets import Dataset, Features, Array3D

features = Features(**{"array0": Array3D((None, 10, 10), dtype="float32"), "array1": Array3D((None, 10, 10), dtype="float32")})
dataset = Dataset.from_dict({f"array{i}": [np.zeros((x, 10, 10), dtype=np.float32) for x in [2000, 1000] * 25] for i in range(2)}, features=features)
```

~0.02 s vs 0.9s on main

```python
import time

ds = dataset.to_iterable_dataset()
t0 = time.time()
for ex in ds:
    pass
t1 = time.time()
print(f"{t1 - t0:.2f} s")  # elapsed iteration time
```

< 0.01 s vs 1.3 s on main

@lhoestq I can see this breaks a bunch of array-related tests but can update the test cases if you would support making this change?

I also added an Array1D feature which will always be decoded into a numpy array and likewise improves extraction performance:

```python
from datasets import Dataset, Features, Array1D, Sequence, Value

array_features = Features(**{"array0": Array1D((None,), dtype="float32"), "array1": Array1D((None,), dtype="float32")})
sequence_features = Features(**{"array0": Sequence(feature=Value("float32"), length=-1), "array1": Sequence(feature=Value("float32"), length=-1)})
array_dataset = Dataset.from_dict({f"array{i}": [np.zeros((x,), dtype=np.float32) for x in [20000, 10000] * 25] for i in range(2)}, features=array_features)
sequence_dataset = Dataset.from_dict({f"array{i}": [np.zeros((x,), dtype=np.float32) for x in [20000, 10000] * 25] for i in range(2)}, features=sequence_features)
```


```python
t0 = time.time()
for ex in array_dataset.to_iterable_dataset():
    pass
t1 = time.time()
```

< 0.01 s

```python
t0 = time.time()
for ex in sequence_dataset.to_iterable_dataset():
    pass
t1 = time.time()
```

~1.1 s

I also added support for extracting structs of arrays as dicts of numpy arrays:

```python
import numpy as np
from datasets import Dataset, Features, Array3D, Sequence

features = Features(
    struct={"array0": Array3D((None, 10, 10), dtype="float32"), "array1": Array3D((None, 10, 10), dtype="float32")},
    _list=Sequence(feature=Array3D((None, 10, 10), dtype="float32")),
)
dataset = Dataset.from_dict(
    {
        "struct": [{f"array{i}": np.zeros((x, 10, 10), dtype=np.float32) for i in range(2)} for x in [2000, 1000] * 25],
        "_list": [[np.zeros((x, 10, 10), dtype=np.float32) for i in range(2)] for x in [2000, 1000] * 25],
    },
    features=features,
)

t0 = time.time()
for ex in dataset.to_iterable_dataset():
    pass
t1 = time.time()
assert isinstance(ex["struct"]["array0"], np.ndarray) and ex["struct"]["array0"].ndim == 3
```

~0.02 s and no exception vs ~7s with an exception on main

lhoestq

nice ! sure feel free to update the tests

lhoestq · 2024-10-15T14:50:13Z

src/datasets/formatting/formatting.py

+        if pa.types.is_struct(pa_array.field(field.name).type):
+            batch[field.name] = extract_struct_array(pa_array.field(field.name))


lhoestq · 2024-10-15

you can also check if it's a list or large_list type

alex-hh · 2024-10-15

I checked that lists of ArrayExtensionType features will call ArrayExtensionArray.to_pylist(), which didn't seem to be the case for struct, and is the main performance issue there

Not sure about large list?

lhoestq · 2024-10-15

cool ! maybe also check list of struct of ArrayExtensionType but no big deal, we can fix that rare case later (large list is also rare)

alex-hh · 2024-10-15

the list of struct case might require an ArrayExtensionScalar or something with an as_py method that returns a numpy object.

Seems like it could be useful but have no idea whether this is possible or how best to do it if so?

alex-hh · 2024-10-15

unless you know how to do this could we leave as issue?

lhoestq · 2024-10-15

maybe just add a TODO comment about it for now ?

HuggingFaceDocBuilderDev · 2024-10-15T14:51:35Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

alex-hh · 2024-10-15T16:56:37Z

I've updated the most straightforward failing test cases - lmk if you agree with those.

Might need some help / pointers on the remaining new failing tests, which seem a little bit more subtle.

alex-hh · 2024-10-18T11:36:54Z

@lhoestq I've had a go at fixing a few more test cases but getting quite uncertain about the remaining ones (as well as about some of the array writing ones that I tried to fix in my last commit). There are still 27 failures vs 21 on main. I'm not completely sure in some cases what intended behaviour is and my understanding of the flow for typed writing is a bit vague.

alex-hh added 7 commits October 14, 2024 21:43

fast array extraction

a181015

add array 1d feature

426178e

fast struct extraction by invoking extension type to_pylist

303c4e2

also use to_pylist for list array

0be0895

improve struct extraction

deee87e

handle structs and lists of arrays

ac5a46d

restore arrow array to numpy to numpy extractor

a89ef52

lhoestq reviewed Oct 15, 2024

alex-hh force-pushed the fast-array-extraction branch from 550d2f0 to a89ef52 Compare October 15, 2024 16:19

alex-hh added 3 commits October 15, 2024 17:43

fix failing array tests

7f1e217

test cast array xd to features fix

abbb59a

test array write

c39c4bc

alex-hh added 3 commits October 15, 2024 18:07

formatting

67f65b5

fix a couple more test cases

97f0f19

fix writing struct arrays

0f37d05

alex-hh added 3 commits November 2, 2024 20:52

handle null rows in struct array

7f8c00c

handle field name inference

0d80abc

formatting

0c01621
