Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Arrow string_view raises DuckDB NotImplementedException #41

Closed
manzt opened this issue Aug 14, 2024 · 20 comments
Closed

Arrow string_view raises DuckDB NotImplementedException #41

manzt opened this issue Aug 14, 2024 · 20 comments

Comments

@manzt
Copy link
Owner

manzt commented Aug 14, 2024

#23 led to a regression for string data types, specifically with the string_view Arrow data type not being recognized by DuckDB. I'm going to add some tests to ensure our Arrow/DuckDB connection code works, as we should have end-to-end tests to catch this.

cc: @kylebarron

import polars as pl
import pyarrow as pa
import quak

movies = pl.read_json("movies.json")
print(pa.table(movies).schema)
quak.Widget(movies.select("Director"))
Title: string_view
US Gross: int64
Worldwide Gross: int64
US DVD Sales: int64
Production Budget: int64
Release Date: string_view
MPAA Rating: string_view
Running Time min: int64
Distributor: string_view
Source: string_view
Major Genre: string_view
Creative Type: string_view
Director: string_view
Rotten Tomatoes Rating: int64
IMDB Rating: double
IMDB Votes: int64
Director: string_view
---------------------------------------------------------------------------
NotImplementedException                   Traceback (most recent call last)
Cell In[9], line 7
      5 movies = pl.read_json("movies.json")
      6 print(pa.table(movies).schema)
----> 7 quak.Widget(movies.select("Director"))

File ~/github/manzt/quak/src/quak/_widget.py:60, in Widget.__init__(self, data, table)
     55     else:
     56         raise ValueError(
     57             "input must be a DuckDB connection, DataFrame-like, an Arrow IPC "
     58             "table, or an Arrow object exporting the Arrow C Stream interface."
     59         )
---> 60     conn.register(table, arrow_table)
     61 self._conn = conn
     62 super().__init__(
     63     _table_name=table,
     64     _columns=get_columns(conn, table),
     65     temp_indexes=True,
     66     sql=f'SELECT * FROM "{table}"',
     67 )

NotImplementedException: Not implemented Error: Unsupported Internal Arrow Type vu
@kylebarron
Copy link
Collaborator

😢

I'd argue this is a DuckDB bug, since DuckDB says it supports Arrow input, and so it should check for these data types. The Arrow PyCapsule Interface actually has a mechanism to fix this: it allows data consumers to check the input schema and ask the producer to cast the data to a desired output schema.

For now, it shouldn't be too hard to manually work around this using pyarrow APIs

@manzt
Copy link
Owner Author

manzt commented Aug 14, 2024

All good!

For now, it shouldn't be too hard to manually work around this using pyarrow APIs

So once we get back the table we should cast some columns?

@kylebarron
Copy link
Collaborator

Yeah, we should be able to check for any string view (probably also binary view, and maybe run-end-array) and cast those to duckdb-supported types. It would be nice if there were a resource in DuckDB that said which Arrow types it supports.

@manzt
Copy link
Owner Author

manzt commented Aug 14, 2024

Yeah, we should be able to check for any string view (probably also binary view, and maybe run-end-array) and cast those to duckdb-supported types.

Ok working on a PR!

@manzt
Copy link
Owner Author

manzt commented Aug 14, 2024

@kylebarron
Copy link
Collaborator

Oh so the latest version (maybe unreleased) of duckdb should support it: https://github.com/duckdb/duckdb/blame/4e3a192ce94a793510f11b598805f104d7531c15/src/function/table/arrow.cpp#L88-L89

@manzt
Copy link
Owner Author

manzt commented Aug 14, 2024

@manzt
Copy link
Owner Author

manzt commented Aug 14, 2024

Oh... yup lol you beat me to it

@kylebarron
Copy link
Collaborator

I made a little note in my argument for DuckDB to support the interface: duckdb/duckdb#10716 (comment)

@manzt
Copy link
Owner Author

manzt commented Aug 14, 2024

Made a PR, but the implementation isn't working since the string_view -> string cast doesnt' seem to be supported. I'll try to dig into it later... but it would suck if we needed to pull the data into a pylist first.

@kylebarron
Copy link
Collaborator

😱 how is string_view -> string not a valid cast??

I would suggest temporarily using arro3 to do the cast, but arrow-rs won't support view types until the next major release (in ~one month)

The other option to support polars specifically for now is to call polars' to_arrow method, which allows Polars to cast to non-view array types if you set the CompatLevel as such.

@manzt
Copy link
Owner Author

manzt commented Aug 14, 2024

😱 how is string_view -> string not a valid cast??

Lol i know...

I would suggest temporarily using arro3 to do the cast, but arrow-rs won't support view types until the next major release (in ~one month)

I would defer to you here as the arrow guru :) what do you think would be the best choice for the time being?

I could definitely do the polars patch today... but i am not familiar enough with the arro3 yet :)

@kylebarron
Copy link
Collaborator

Just checking re duckdb/duckdb#10716 (reply in thread) you're on the latest DuckDB version?

@kylebarron
Copy link
Collaborator

It turns out that was a simple repro. This fails:

import polars as pl
import duckdb
import pyarrow as pa

df = pl.DataFrame({"a": ["a", "b", "c"]})
table = pa.table(df)
duckdb.from_arrow(table)
---------------------------------------------------------------------------
NotImplementedException                   Traceback (most recent call last)
File /Users/kyle/github/kylebarron/arro3/tests/core/test_ffi.py:[1](https://file+.vscode-resource.vscode-cdn.net/Users/kyle/github/kylebarron/arro3/tests/core/test_ffi.py:1)
----> 1 duckdb.from_arrow(table)

File ~/github/kylebarron/arro3/.venv/lib/python3.11/site-packages/duckdb/__init__.py:505, in from_arrow(arrow_object, **kwargs)
    [503](https://file+.vscode-resource.vscode-cdn.net/Users/kyle/github/kylebarron/arro3/~/github/kylebarron/arro3/.venv/lib/python3.11/site-packages/duckdb/__init__.py:503) else:
    [504](https://file+.vscode-resource.vscode-cdn.net/Users/kyle/github/kylebarron/arro3/~/github/kylebarron/arro3/.venv/lib/python3.11/site-packages/duckdb/__init__.py:504)     conn = duckdb.connect(":default:")
--> [505](https://file+.vscode-resource.vscode-cdn.net/Users/kyle/github/kylebarron/arro3/~/github/kylebarron/arro3/.venv/lib/python3.11/site-packages/duckdb/__init__.py:505) return conn.from_arrow(arrow_object, **kwargs)

NotImplementedException: Not implemented Error: Unsupported Internal Arrow Type vu

With polars 1.4.1, duckdb 1.0.0, pyarrow 17.0.0.

@manzt
Copy link
Owner Author

manzt commented Aug 14, 2024

Just checking re duckdb/duckdb#10716 (reply in thread) you're on the latest DuckDB version?

Yes, and just tried out with nightly 1.0.1.dev4096

@kylebarron
Copy link
Collaborator

I would defer to you here as the arrow guru :) what do you think would be the best choice for the time being?

I don't think there's a good way today to handle string view -> string data type casting. I suppose the best workaround right now is to hard-code support for polars, check for a polars.DataFrame object, and call its to_arrow() method.

Even though this goes against what I want with the pycapsule interface, which is for consumers to not have to think about where the data is coming from 😛

arro3 isn't capable of this until the next arrow-rs release

@manzt
Copy link
Owner Author

manzt commented Aug 14, 2024

Done, see #42

@kylebarron
Copy link
Collaborator

Yes, and just tried out with nightly 1.0.1.dev4096

Oh it works for me with this same nightly

image image

And the upstream was closed for being fixed on latest main duckdb/duckdb#13424

@manzt
Copy link
Owner Author

manzt commented Aug 14, 2024

ok maybe I didn't restart my jupyter kernel :(

@manzt
Copy link
Owner Author

manzt commented Aug 14, 2024

Yup, can confirm that 1.0.1.dev4096 is working:

image

I'll revert #42 after the next python release

@manzt manzt closed this as completed Aug 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants