-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fast array extraction #7227
base: main
Are you sure you want to change the base?
fast array extraction #7227
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice ! sure feel free to update the tests
if pa.types.is_struct(pa_array.field(field.name).type): | ||
batch[field.name] = extract_struct_array(pa_array.field(field.name)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can also check if it's a list
or large_list
type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I checked that lists of ArrayExtensionType features will call ArrayExtensionArray.to_pylist(), which didn't seem to be the case for struct, and is the main performance issue there
Not sure about large list?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cool ! maybe also check list of struct of ArrayExtensionType but no big deal, we can fix that rare case later (large list is also rare)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the list of struct case might require an ArrayExtensionScalar or something with an as_py method that returns a numpy object.
Seems like it could be useful but have no idea whether this is possible or how best to do it if so?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unless you know how to do this could we leave as issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe just add a TODO comment about it for now ?
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
550d2f0
to
a89ef52
Compare
I've updated the most straightforward failing test cases - lmk if you agree with those. Might need some help / pointers on the remaining new failing tests, which seem a little bit more subtle. |
@lhoestq I've had a go at fixing a few more test cases but getting quite uncertain about the remaining ones (as well as about some of the array writing ones that I tried to fix in my last commit). There are still 27 failures vs 21 on main. I'm not completely sure in some cases what intended behaviour is and my understanding of the flow for typed writing is a bit vague. |
Implements #7210 using method suggested in #7207 (comment)
~0.02 s vs 0.9s on main
< 0.01 s vs 1.3 s on main
@lhoestq I can see this breaks a bunch of array-related tests but can update the test cases if you would support making this change?
I also added an Array1D feature which will always be decoded into a numpy array and likewise improves extraction performance:
< 0.01 s
~1.1s
And also added support for extracting structs of arrays as dicts of numpy arrays:
~0.02 s and no exception vs ~7s with an exception on main