
feat: Support for Arrow PyCapsule interface #23

Merged: 4 commits into manzt:main, Aug 13, 2024

Conversation

kylebarron (Collaborator)

The Arrow PyCapsule Interface defines a way for Python libraries to exchange Arrow data at the binary level without needing to know the library-specific APIs of the producer or consumer. If an object exposes the `__arrow_c_stream__` dunder method, you can pass it into another Arrow library's constructor, which will call that method to get a PyCapsule object wrapping a pointer to an Arrow C stream.

I've been working to promote its adoption throughout the Python Arrow ecosystem.

Because the PyCapsule Interface makes it free to share data across libraries, I've also been working on arro3, a minimal alternative to pyarrow.

With this PR, any object that implements this interface will just work with quak:

[screenshot]

My PyCapsule interface PRs were just merged into Polars, so with the next release of Python Polars this will just work without going through the DataFrame API (if Polars, DuckDB, and quak all speak Arrow, why go through the DataFrame API?).

DuckDB hasn't implemented the interface itself, so for now I believe you have to go through pyarrow.

Comment on lines +43 to +44
# arrow_table = pa.RecordBatchReader.from_stream(data)
arrow_table = pa.table(data)
@kylebarron (Collaborator, Author), Jul 26, 2024

This is not ideal here because `pa.table` materializes the entire stream in memory at once. I tried the `RecordBatchReader`, but I think you might need to adjust your queries to actually create a table instead of querying the stream directly. I assume you run multiple queries on the input, and presumably the first query exhausts the input stream, leaving no data for the second.

If I uncomment the RecordBatchReader line above instead, I get:

[screenshot]

Note that it shows 11k rows in the bottom right, but then doesn't display anything. So I think you need to persist the Arrow stream input manually.

@manzt (Owner), Jul 26, 2024

Yeah, I figured this is not ideal... I just wasn't sure how to wire up DuckDB to query something consistently. I'm guessing something like CREATE VIEW over the input stream might allow us to register tables that DuckDB can continuously query.

@kylebarron (Collaborator, Author), Jul 26, 2024

I'm not 100% certain, but I don't think you can always use CREATE VIEW here for Arrow streams. Reading https://duckdb.org/docs/api/python/data_ingestion#directly-accessing-dataframes-and-arrow-objects makes me think the first time you query the view it will exhaust the input stream.

I think we're running into the same issue as in ibis: ibis-project/ibis#9663 (comment) . In particular, the presence of the dunder method doesn't tell you whether the input data is already materialized or a stream. If you know that the input data is an in-memory table, then you can call __arrow_c_stream__ multiple times and get the full data each time. In that case you'd prefer to make a DuckDB view, because then you don't need to copy the input data. But if the input is a stream, then you would need to persist it into a full duckdb table for ongoing queries.

@@ -37,7 +39,10 @@ def __init__(self, data, *, table: str = "df"):
conn = data
else:
conn = duckdb.connect(":memory:")
@kylebarron (Collaborator, Author)

Do you know if the DuckDB :memory: DB will spill to disk? Or for very large input would it just crash the process?

@manzt (Owner)

> Do you know if the DuckDB :memory: DB will spill to disk? Or for very large input would it just crash the process?

I believe it spills to disk

@kylebarron (Collaborator, Author)

I thought I was talking to someone recently who asserted that :memory: in particular didn't spill to disk, because it doesn't have a path to store a local database. But I'm not sure.

[third participant]

Hi, just stopping by: in-memory databases will spill to disk. DuckDB will create/use a .tmp folder in the current working directory; a database file is not required to spill to disk.

You can test this by setting a relatively low memory limit and creating a temp table that exceeds it. Running SELECT * FROM duckdb_temporary_files() will show which temporary files were created to back the temporary table.

Comment on lines 26 to 27
else:
raise ValueError()
@manzt (Owner)

These branches aren't mutually exclusive (we grab Arrow from DuckDB so that it is captured in the next block), and we don't want to raise a ValueError when displaying things that can't be visualized with quak.

Right now

"hello world"

would raise a ValueError since it doesn't match any of the conditions.

@kylebarron (Collaborator, Author)

> we don't want to raise a ValueError for displaying stuff that can't be visualized with quak

Oh, I see; this is probably the error I was hitting when I tried to repr something else (which is why I tried to unload quak).

@kylebarron (Collaborator, Author)

> we grab arrow from the duckdb

As a general note, why do you convert it to Arrow if you're passing it back to DuckDB? I think DuckDB can do replacement scans on DuckDBPyRelation objects.

@kylebarron (Collaborator, Author)

I made a few changes, and now I think this should be ready for review as long as we're OK with materializing the full Arrow input stream. It may be better not to materialize the full Arrow input (especially because DuckDB is capable of handling larger-than-memory datasets), but that would require creating a DuckDB table from the data input.

@manzt (Owner), Aug 13, 2024

> I think it may be better to not materialize the full Arrow input (especially because DuckDB is capable of larger-than-memory datasets) but that would require creating a DuckDB table from the data input.

I agree. Do you think we could change the behavior in a follow-up PR, or is that something we want to handle here? The main thing we need to do is give a table_name for the front end to dispatch queries against.

We could create a DuckDB PyRelation from the Arrow stream; for example, this API is already supported:

conn = duckdb.connect(":memory:")
conn.sql("CREATE VIEW df AS SELECT * FROM ???")  # consume arrow c stream
widget = quak.Widget(conn, table="df")

@kylebarron (Collaborator, Author)

I think it's fine to do in a follow-up PR; up to you.

@manzt (Owner), Aug 13, 2024

Ok, let's just merge for now!

@manzt changed the title from "Support for Arrow PyCapsule interface" to "feat: Support for Arrow PyCapsule interface" on Aug 13, 2024
@manzt manzt merged commit 28d46ce into manzt:main Aug 13, 2024
3 checks passed
3 participants