-
Notifications
You must be signed in to change notification settings - Fork 760
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Align buffers from Python (FFI) #6472
Conversation
I think there is an expectation that FFI doesn't copy buffers, and that data provided is aligned as per the arrow specification. This is actually a bug in pyarrow and there are some solutions on the linked issue - apache/arrow#32276 (comment) That being said it might be good to document this further as this bug has been around for a while now |
@tustvold How does Rust Arrow Flight avoid mis-aligned buffers given it also uses gRPC? |
Tonic, our gRPC implementation, provides a zero-copy decode mechanism using Bytes, this internally works by taking refcounted slices of the underlying HTTP frame's payload. This is then fed into the IPC decoder which then constructs Buffer from this, copying when necessary for alignment. This does mean that potentially some data is copied twice, once from the network socket and once into the destination Buffer, but ultimately one of these copies is unavoidable with a networked protocol and practically the network throughput will be the major bottleneck regardless. It is also worth noting we enforce alignment based on the encoded data, this means that string data, which often comprises the majority of data by volume, will never be copied more than once as there is no alignment requirements for bytes. |
@tustvold Maybe C++ Flight client should do the same: copy buffers if needed when handed to IPC: apache/arrow#44279 |
That makes sense to me, although the final call will lie with the maintainers of the C++ implementation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
arrow/src/pyarrow.rs
Outdated
@@ -363,13 +363,14 @@ impl FromPyArrow for RecordBatch { | |||
|
|||
let schema_ptr = unsafe { schema_capsule.reference::<FFI_ArrowSchema>() }; | |||
let ffi_array = unsafe { FFI_ArrowArray::from_raw(array_capsule.pointer().cast()) }; | |||
let array_data = unsafe { ffi::from_ffi(ffi_array, schema_ptr) }.map_err(to_py_err)?; | |||
let mut array_data = unsafe { ffi::from_ffi(ffi_array, schema_ptr) }.map_err(to_py_err)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It also appears we need to run |
I feel fairly strongly that we should keep the current contract where a misalignment yields an error, as it will be surprising to users if buffers get copied across an FFI boundary, which is reasonably expected to keep the buffers as-is. That being said, should it not be possible to make progress on apache/arrow#44279 which fixes the actual upstream issue we can potentially move forward on this, perhaps gated behind a feature flag. |
FWIW I think this PR is very specific to pyarrow (it will align misaligned data from pyarrow but not other arbitrary FFI boundaries).
It seems reasonable to me to make a feature flag like "ffi_strict" that would throw an error if pyarrow provides misaligned memory. I think it should be disabled by default (aka accept data made by pyarrow by default) so as not to penalize arrow-rs users for a bug in pyarrow. |
Aah, I missed that detail, in which case this is probably fine. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I merged this PR up from main, ran cargo fmt
and added some comments
I plan to create the next arrow release candidate today and would like to include this fix as well
It would still be good to have tests for this scenario, but I think creating the testing framework (python flight client, server, etc) might be overly burdensome. @EnricoMi let me know your thoughts.
That is actually quite simple: https://github.com/EnricoMi/arrow-rs/pull/1/files#diff-6a2014ebd05cb0c56d1154780d5836d04ac1f4d8626de3da67edc64bb03414a0 But I don't know how to pass the data through the |
I will be honest I don't understand how to invoke this code either 🤔 |
Which issue does this PR close?
Together with #6462 fixes #6471.
Rationale for this change
Data coming from Python could be misaligned. Using them will fail (see #6471):
Costly copying the data is better than failing and not being able to use the data.
What changes are included in this PR?
Call existing
ArrayData.align_buffers
right after array data are retrieved via FFI (as suggested by the error message).Are there any user-facing changes?
Data become useable.