Cannot append Pandas dataframe to existing array #592
Hi Florian, currently, nullable attributes require using Pandas nullable types for input columns. The following diff works for me:
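For illustration (this is not the exact diff from the comment), converting an input column to a Pandas nullable dtype looks like the sketch below; the column name `GQ` is borrowed from the example dataframe later in the thread:

```python
import numpy as np
import pandas as pd

# A column with a missing value: plain int64 cannot hold it, so Pandas
# silently upcasts the column to float64 and stores NaN
df = pd.DataFrame({"GQ": [79, np.nan, 60]})
print(df["GQ"].dtype)  # float64

# Converting to the Pandas nullable integer dtype keeps integer values
# and represents the missing entry as pd.NA
df["GQ"] = df["GQ"].astype("Int64")
print(df["GQ"].dtype)            # Int64
print(df["GQ"].isna().tolist())  # [False, True, False]
```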
(I've looked at numpy masked arrays a bit as well, but they don't seem to be widely used, so we probably won't support them unless there's a strong use-case.)
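For reference, a minimal sketch of what numpy masked arrays look like (they bundle a data buffer with a boolean mask; this is independent of any TileDB support):

```python
import numpy as np

# A masked array pairs data with a mask; True marks an invalid/missing entry
gq = np.ma.masked_array([79, 0, 60], mask=[False, True, False])

print(gq.mean())      # computed over unmasked values only: 69.5
print(gq.filled(-1))  # masked slot replaced by a fill value: [79 -1 60]
```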
Thanks for pointing this out, will fix.
Will be fixed by #593.
Ah, thanks for the hint!
Yes, they will be refactored anyway after NEP 47 is done.
Yes, I will improve the error message at the very least.
These typically induce a copy, which can be expensive (memory-wise) for large input dataframes. I'm curious what the use-case is for storing a plain int64 array in a nullable attribute?
Thanks - I don't see any discussion of nullability/mask functionality in that document, so hopefully it is not too much of an afterthought (I haven't read all the links, though).
(In other words, there is no way to represent nullability.)
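A minimal demonstration of that point: plain numpy integer arrays have no slot for "missing", so NaN cannot be stored in them, and Pandas has to upcast instead:

```python
import numpy as np
import pandas as pd

arr = np.array([79, 60, 99], dtype=np.int64)
try:
    arr[1] = np.nan  # integers have no NaN representation
except ValueError as e:
    print("int64 cannot hold NaN:", e)

# Pandas therefore upcasts an integer column to float64 when a value is missing
s = pd.Series([79, None, 99])
print(s.dtype)  # float64
```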
In this very concrete example, we might have variant data without corresponding genotype quality assigned.
None of those solutions is really nice. Another example comes from xarray. Bottom line: I very rarely end up with a dataset that has no missing values at all. That's also why people with "Null" as a last name might have a bad time 😁
Thanks for the explanation; I agree it's a tricky issue. What I'm still unclear about is why you would set the TileDB attribute as nullable if the input is always going to be
Hm, I'm not sure I got you right. Or is your point about how to store plain numpy arrays in conjunction with a boolean mask?
Exactly; when the dtype already matches, it's not doing anything.
In general, I believe you should aim for first-class TileDB support with Apache Arrow. For example, a very big advantage of Parquet is that you can read/write the same dataframe in literally every language that supports Apache Arrow. Now imagine replacing Parquet with TileDB 😁
What I'm specifically trying to understand is why you want to create
Re Arrow: yes, agreed. We have to/from support for Arrow buffers in TileDB core, which we use internally by default in TileDB-Py for operations creating a Pandas dataframe.
As I mentioned previously,

```python
import json
import pandas as pd

test_df = pd.DataFrame.from_records(json.loads('{"chrom":{"0":"chr1","1":"chr1","2":"chr1","3":"chr1","4":"chr1","5":"chr1","8":"chr1","9":"chr1"},"log10_len":{"0":1,"1":1,"2":1,"3":1,"4":1,"5":1,"8":0,"9":0},"start":{"0":10108,"1":10108,"2":10108,"3":10108,"4":10108,"5":10108,"8":10143,"9":10143},"end":{"0":10114,"1":10114,"2":10114,"3":10114,"4":10114,"5":10114,"8":10144,"9":10144},"ref":{"0":"AACCCT","1":"AACCCT","2":"AACCCT","3":"AACCCT","4":"AACCCT","5":"AACCCT","8":"T","9":"T"},"alt":{"0":"A","1":"A","2":"A","3":"A","4":"A","5":"A","8":"C","9":"C"},"sample_id":{"0":"A","1":"B","2":"C","3":"D","4":"E","5":"F","8":"A","9":"B"},"GT":{"0":1,"1":1,"2":1,"3":1,"4":1,"5":1,"8":1,"9":1},"GQ":{"0":79,"2":60,"3":99,"4":26,"5":62,"8":22,"9":65},"DP":{"0":12,"1":9,"2":39,"3":26,"4":9,"5":9,"8":35,"9":34}}'))
test_df
```

However, I defined in the TileDB schema that the column should be

With numpy, it's more tricky. There you need some special handling, e.g. returning a
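The "special handling" for plain numpy input can be sketched as a pair of buffers — a values array plus a separate validity array — which is roughly how nullable data is represented at the buffer level in columnar systems (a hedged sketch, not TileDB-Py's actual API):

```python
import numpy as np

# Values buffer plus a separate validity buffer; the payload at a null
# slot is arbitrary, and validity == 0 marks the slot as null
values = np.array([79, 0, 60, 99], dtype=np.int64)
validity = np.array([1, 0, 1, 1], dtype=np.uint8)

# Reading back only the valid entries
print(values[validity.astype(bool)])  # [79 60 99]
```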
Got it! Apologies for belaboring the point, and I appreciate the explanation.
Hi, I'm trying to write an array like this:

However, the last line causes the following error:

Is there some mistake in my code?

PS: I had to set `sparse=True` in `from_dataframe` to be able to write, although the schema is already present.