Cannot append Pandas dataframe to existing array #592
Hi Florian, currently, nullable attributes require using Pandas nullable types for input columns. The following diff works for me:
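For illustration (this is not the exact diff from the comment), converting an input column to a Pandas nullable dtype looks like the sketch below; the column name `GQ` is borrowed from the example dataframe later in the thread:

```python
import numpy as np
import pandas as pd

# A column with a missing value: plain int64 cannot hold it, so Pandas
# silently upcasts the column to float64 and stores NaN
df = pd.DataFrame({"GQ": [79, np.nan, 60]})
print(df["GQ"].dtype)  # float64

# Converting to the Pandas nullable integer dtype keeps integer values
# and represents the missing entry as pd.NA
df["GQ"] = df["GQ"].astype("Int64")
print(df["GQ"].dtype)            # Int64
print(df["GQ"].isna().tolist())  # [False, True, False]
```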
(I've looked at numpy masked arrays a bit as well, but they don't seem to be widely used, so we probably won't support them unless there's a strong use-case.)
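For reference, a minimal sketch of what numpy masked arrays look like (they bundle a data buffer with a boolean mask; this is independent of any TileDB support):

```python
import numpy as np

# A masked array pairs data with a mask; True marks an invalid/missing entry
gq = np.ma.masked_array([79, 0, 60], mask=[False, True, False])

print(gq.mean())      # computed over unmasked values only: 69.5
print(gq.filled(-1))  # masked slot replaced by a fill value: [79 -1 60]
```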
Thanks for pointing this out, will fix.
Will be fixed by #593.
Ah, thanks for the hint!
Yes, they will be refactored anyway after NEP 47 is done.
Yes, I will improve the error message at the very least.
These typically induce a copy, which can be expensive (memory-wise) for large input dataframes. I'm curious what the use-case is for storing a plain int64 array in a nullable attribute?
Thanks - I don't see any discussion of nullability/mask functionality in that document, so hopefully it is not too much of an afterthought (I haven't read all the links, though).
(In other words, there is no way to represent nullability.)
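A minimal demonstration of that point: plain numpy integer arrays have no slot for "missing", so NaN cannot be stored in them, and Pandas has to upcast instead:

```python
import numpy as np
import pandas as pd

arr = np.array([79, 60, 99], dtype=np.int64)
try:
    arr[1] = np.nan  # integers have no NaN representation
except ValueError as e:
    print("int64 cannot hold NaN:", e)

# Pandas therefore upcasts an integer column to float64 when a value is missing
s = pd.Series([79, None, 99])
print(s.dtype)  # float64
```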
In this very concrete example, we might have variant data without corresponding genotype quality assigned.
None of those solutions is really nice. Another example comes from xarray. Bottom line: I very rarely end up with a dataset that has no missing values at all. That's also why people with "Null" as a last name might have a bad time 😁
Thanks for the explanation; I agree it's a tricky issue. What I'm still unclear about is why you would set the TileDB attribute as nullable if the input is always going to be
Hm, I'm not sure I got you right. Or is your point about how to store plain numpy arrays in conjunction with a boolean mask?
Exactly; when the dtype already matches, it's not doing anything.
In general, I believe you should aim for first-class TileDB support with Apache Arrow. For example, a very big advantage of Parquet is that you can read/write the same dataframe in literally every language that supports Apache Arrow. Now imagine replacing Parquet with TileDB 😁
What I'm specifically trying to understand is why you want to create
Re Arrow: yes, agreed. We have to/from support for Arrow buffers in TileDB core, which we use internally by default in TileDB-Py for operations creating a Pandas dataframe.
As I mentioned previously,

```python
import json
import pandas as pd

test_df = pd.DataFrame.from_records(json.loads('{"chrom":{"0":"chr1","1":"chr1","2":"chr1","3":"chr1","4":"chr1","5":"chr1","8":"chr1","9":"chr1"},"log10_len":{"0":1,"1":1,"2":1,"3":1,"4":1,"5":1,"8":0,"9":0},"start":{"0":10108,"1":10108,"2":10108,"3":10108,"4":10108,"5":10108,"8":10143,"9":10143},"end":{"0":10114,"1":10114,"2":10114,"3":10114,"4":10114,"5":10114,"8":10144,"9":10144},"ref":{"0":"AACCCT","1":"AACCCT","2":"AACCCT","3":"AACCCT","4":"AACCCT","5":"AACCCT","8":"T","9":"T"},"alt":{"0":"A","1":"A","2":"A","3":"A","4":"A","5":"A","8":"C","9":"C"},"sample_id":{"0":"A","1":"B","2":"C","3":"D","4":"E","5":"F","8":"A","9":"B"},"GT":{"0":1,"1":1,"2":1,"3":1,"4":1,"5":1,"8":1,"9":1},"GQ":{"0":79,"2":60,"3":99,"4":26,"5":62,"8":22,"9":65},"DP":{"0":12,"1":9,"2":39,"3":26,"4":9,"5":9,"8":35,"9":34}}'))
test_df
```

However, I defined in the TileDB schema that the column should be

With numpy, it's more tricky. There you need some special handling, e.g. returning a
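The "special handling" for plain numpy input can be sketched as a pair of buffers — a values array plus a separate validity array — which is roughly how nullable data is represented at the buffer level in columnar systems (a hedged sketch, not TileDB-Py's actual API):

```python
import numpy as np

# Values buffer plus a separate validity buffer; the payload at a null
# slot is arbitrary, and validity == 0 marks the slot as null
values = np.array([79, 0, 60, 99], dtype=np.int64)
validity = np.array([1, 0, 1, 1], dtype=np.uint8)

# Reading back only the valid entries
print(values[validity.astype(bool)])  # [79 60 99]
```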
Got it! Apologies for belaboring the point, and I appreciate the explanation.
Hi, I'm trying to write an array like this:

However, the last line causes the following error:

Is there some mistake in my code?

PS: I had to set `sparse=True` in `from_dataframe` to be able to write, although the schema is already present.