Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Indexing bugs #594

Open
Hoeze opened this issue Jun 13, 2021 · 8 comments
Open

Indexing bugs #594

Hoeze opened this issue Jun 13, 2021 · 8 comments

Comments

@Hoeze
Copy link

Hoeze commented Jun 13, 2021

Hi @ihnorton, sorry to bother once more, but I think I found a couple of bugs in conjunction with indexing.

Setup:

# +
import json

import tiledb
import numpy as np
import pandas as pd
import random
# -

import json
test_df = pd.DataFrame.from_records(json.loads('{"chrom":{"0":"chr1","1":"chr1","2":"chr2","3":"chr2","4":"chr1","5":"chr1","8":"chr1","9":"chr1"},"log10_len":{"0":1,"1":1,"2":1,"3":1,"4":1,"5":1,"8":0,"9":0},"start":{"0":10108,"1":10108,"2":10108,"3":10108,"4":10108,"5":10108,"8":10143,"9":10143},"end":{"0":10114,"1":10114,"2":10114,"3":10114,"4":10114,"5":10114,"8":10144,"9":10144},"ref":{"0":"AACCCT","1":"AACCCT","2":"AACCCT","3":"AACCCT","4":"AACCCT","5":"AACCCT","8":"T","9":"T"},"alt":{"0":"A","1":"A","2":"A","3":"A","4":"A","5":"A","8":"C","9":"C"},"sample_id":{"0":"A","1":"B","2":"C","3":"D","4":"E","5":"F","8":"A","9":"B"},"GT":{"0":1,"1":1,"2":1,"3":1,"4":1,"5":1,"8":1,"9":1},"GQ":{"0":79,"2":60,"3":99,"4":26,"5":62,"8":22,"9":65},"DP":{"0":12,"1":9,"2":39,"3":26,"4":9,"5":9,"8":35,"9":34}}'))
test_df

output_path="test.tdb"

ctx = tiledb.default_ctx()
ctx

# +
genotype_domain = tiledb.Domain(
    tiledb.Dim(name="chrom", domain=(None,None), tile=1, dtype=np.bytes_, ctx=ctx),
    tiledb.Dim(name="log10_len", domain=(0, np.iinfo(np.int8).max), tile=1, dtype=np.int8, ctx=ctx),
    tiledb.Dim(name="start", domain=(0, np.iinfo(np.int32).max), tile=100000, dtype=np.int32, ctx=ctx),
    tiledb.Dim(name="alt", domain=(None,None), tile=None, dtype=np.bytes_, ctx=ctx),
#     tiledb.Dim(name="end", domain=(1, np.iinfo(np.int32).max), dtype=np.int32, ctx=ctx),
    tiledb.Dim(name="sample_id", domain=(None,None), tile=None, dtype=np.bytes_, ctx=ctx),
    ctx=ctx,
)

string_filters = tiledb.FilterList([tiledb.ZstdFilter(level=-1),])
int_filters = tiledb.FilterList([tiledb.ByteShuffleFilter(), tiledb.ZstdFilter(level=-1),])
attrs = [
    tiledb.Attr(name='end', dtype='int32', var=False, nullable=False, filters=int_filters),
    tiledb.Attr(name='ref', dtype='S', nullable=False, filters=string_filters),
    tiledb.Attr(name='GT', dtype='int8', var=False, nullable=False, filters=int_filters),
    tiledb.Attr(name='GQ', dtype='int32', var=False, nullable=True, filters=int_filters),
    tiledb.Attr(name='DP', dtype='int32', var=False, nullable=True, filters=int_filters),
]
# -

schema = tiledb.ArraySchema(
    domain=genotype_domain,
    attrs=attrs,
    sparse=True,
    cell_order="hilbert",
#     capacity=10000,
    ctx=ctx,
)
schema

if tiledb.array_exists(output_path):
    print("Deleting array at '%s'..." % output_path)
    import shutil
    shutil.rmtree(output_path)
print("Creating array at '%s'..." % output_path)
tiledb.array.SparseArray.create(output_path, schema, ctx=ctx)

tiledb.from_dataframe(
    output_path,
    test_df.astype({
        'end': 'int32',
        'ref': 'S',
        'GT': 'int8',
        'GQ': 'Int32',
        'DP': 'Int32',
    }),
    sparse=True,
    mode="append"
)

tiledb.open_dataframe(output_path, use_arrow=True)

A = tiledb.open(output_path, ctx=ctx, mode='r')
A

Now my trials:

# works
A.query(use_arrow=True, coords=True).df[:]

# works
A["chr2"]

# +
# kernel breaks with `realloc(): invalid pointer`
# A.df["chr2"]
# kernel breaks with `realloc(): invalid pointer`
# A.query(use_arrow=True, coords=True).df["chr2"]
# -

# empty result
A.multi_index["chr2"]

# works
A["chr1", 0]

# works
A.multi_index[:]

# empty result
A.multi_index[:, 0]

# empty result
A.multi_index["chr2", 0]

# ---------------------------------------------------------------------------
# IndexError                                Traceback (most recent call last)
# <ipython-input-22-20675d6deb0c> in <module>
#      12 
#      13 # IndexError: invalid index type: <class 'list'>
# ---> 14 A[[("chr1", 0), ("chr1", 1),]]
#
# tiledb/libtiledb.pyx in tiledb.libtiledb.SparseArrayImpl.__getitem__()
#
# tiledb/libtiledb.pyx in tiledb.libtiledb.SparseArrayImpl.subarray()
#
# tiledb/libtiledb.pyx in tiledb.libtiledb.index_domain_subarray()
#
# IndexError: invalid index type: <class 'list'>
A[[
    ("chr1", 0), 
    ("chr1", 1),
]]
  • Is there a way to keep the python kernel from crashing, even when having invalid input to tiledb?
  • How to do key-based indexing? (A[[("chr1", 0), ("chr1", 1),]])
  • How to do multi_index indexing with DataFrame as return type?
@ihnorton
Copy link
Member

ihnorton commented Jun 14, 2021

Is there a way to keep the python kernel from crashing, even when having invalid input to tiledb?

kernel breaks with realloc(): invalid pointer

Not sure what is going on there, I will take a look and debug tomorrow.

How to do key-based indexing? (A[[("chr1", 0), ("chr1", 1),]])

I think what you want is: A.multi_index["chr1", [0,1]] which will select "chr1" on first dim, and select 0 and 1 on 2nd dim. [edit: use A.multi_index["chr1", [0,1], :, :, :]]

How to do multi_index indexing with DataFrame as return type?

A.df[<selection>] is for that -- same semantics as .multi_index (and shared implementation for the selection parsing).

@ihnorton
Copy link
Member

ihnorton commented Jun 14, 2021

For these:

# empty result
A.multi_index[:, 0]

# empty result
A.multi_index["chr2", 0]

# empty result
A.multi_index["chr2"]

It is a bug; I think we have a fix in progress in libtiledb core, but I will see if we can apply the substance of the following work around automatically in TileDB-Py.

RIght now you can work around it like this (place-holder : for each non-indexed dimension):

In [16]: A.multi_index["chr2", :, :, :, :]
retries: 0
Out[16]:
OrderedDict([('chrom', array([b'chr2', b'chr2'], dtype=object)),
             ('log10_len', array([1, 1], dtype=int8)),
             ('start', array([10108, 10108], dtype=int32)),
             ('alt', array([b'A', b'A'], dtype=object)),
             ('sample_id', array([b'C', b'D'], dtype=object)),
             ('end', array([10114, 10114], dtype=int32)),
             ('ref', array([b'AACCCT', b'AACCCT'], dtype=object)),
             ('GT', array([1, 1], dtype=int8)),
             ('GQ', array([60, 99], dtype=int32)),
             ('DP', array([39, 26], dtype=int32))])

Also, this works for .df:

In [11]: A.query(use_arrow=True, coords=True).df["chr2", :, :, :,:]
retries: 0
Out[11]:
  chrom  log10_len  start alt sample_id    end        ref  GT  GQ  DP
0  chr2          1  10108   A         C  10114  b'AACCCT'   1  60  39
1  chr2          1  10108   A         D  10114  b'AACCCT'   1  99  26

and this (for my suggestion):

In [14]: A.df["chr1", [0,1], :, :, :]
retries: 0
Out[14]:
  chrom  log10_len  start alt sample_id    end        ref  GT    GQ  DP
0  chr1          1  10108   A         A  10114  b'AACCCT'   1  79.0  12
1  chr1          1  10108   A         B  10114  b'AACCCT'   1   NaN   9
2  chr1          0  10143   C         B  10144       b'T'   1  65.0  34
3  chr1          0  10143   C         A  10144       b'T'   1  22.0  35
4  chr1          1  10108   A         E  10114  b'AACCCT'   1  26.0   9
5  chr1          1  10108   A         F  10114  b'AACCCT'   1  62.0   9

and this one (avoids the crash):

In [15]: A.query(use_arrow=True, coords=True).df["chr2", :, :, :, :]
retries: 0
Out[15]:
  chrom  log10_len  start alt sample_id    end        ref  GT  GQ  DP
0  chr2          1  10108   A         C  10114  b'AACCCT'   1  60  39
1  chr2          1  10108   A         D  10114  b'AACCCT'   1  99  26

@Hoeze
Copy link
Author

Hoeze commented Jun 14, 2021

Alright, thanks!

I also did another test for the key-value type access:
grafik
30it/s is not too great, compared to RocksDB where we reach up to 40,000 it/s.
Is there something I can do about?

@ihnorton
Copy link
Member

Hi @Hoeze, are you able to share any more details about how you are using RocksDB? I’m not very familiar with it yet, and we’d like to dig in to the comparison a bit more.

@stavrospapadopoulos
Copy link
Member

Hi @Hoeze, a couple more notes on key-value queries:

  1. Not sure how you use RocksDB, but I would suggest you use a single string dimension in TileDB for such queries.
  2. You may want to tweak the tile capacity and compression for key-value queries (e.g., shrink the tile size, use no compression for local deployments).
  3. Issuing all key-value queries using multi_range will be significantly faster than single-point queries in a loop.
  4. We have some upcoming optimizations in the read algorithm that will benefit significantly even key-value queries.
  5. We have not optimized for (local) key-value performance yet, there are a couple of extra improvements we can do if this becomes critical (e.g., use mmap, optimize the tile format and read algorithm for key-value queries using hashing, etc.)

If you provided with more details, we could take a closer look and see if there are any low-hanging fruits for optimizations here.

@Hoeze
Copy link
Author

Hoeze commented Jun 15, 2021

Hi, I set up a RocksDB example for you.
To run it, you need to download the database attached here.

from pathlib import Path
import rocksdb


class VariantDB:

    def __init__(self, path, rocksdb_options=None):
        if rocksdb_options is None:
            rocksdb_options = rocksdb.Options(
                create_if_missing=True,
                max_open_files=100,
            )

        self.db = rocksdb.DB(
            path,
            rocksdb_options,
            read_only=True
        )

    @staticmethod
    def _variant_to_byte(variant):
        return bytes(str(variant), 'utf-8')

    def _type(self, value):
        raise NotImplementedError()

    def _get(self, variant):
        if not variant.startswith('chr'):
            variant = 'chr%s' % variant
        return self.db.get(self._variant_to_byte(variant))

    def __getitem__(self, variant):
        maf = self._get(variant)
        if maf:
            return self._type(maf)
        else:
            raise KeyError('This variant is not in the db')

    def __contains__(self, variant):
        return self._get(variant) is not None

    def get(self, variant, default=None):
        try:
            return self[variant]
        except KeyError:
            return default


class VariantMafDB(VariantDB):

    def _type(self, value):
        return float(value)


mafdb_path = "maf.db"
mafdb = VariantMafDB(mafdb_path)
variants = [
        '1:10468:T>TAA',
        '1:10468:TCGCGG>T',
        "10:107494853:C>A",
        "10:107494857:C>A",
        "10:107494858:T>C",
        "10:107494873:C>T",
        "10:107494874:G>A",
        "10:107494905:GAGAA>G",
        "10:107494908:A>G",
        "10:107494929:T>C",
        "10:107494933:T>C",
        "10:107494935:G>A",
        "10:107494937:C>G",
        "10:107494941:CTTG>C",
        "10:107494942:T>A",
        "10:107494943:T>C",
        "10:107494960:G>T",
        "10:107494964:C>A",
        "10:107494979:G>A",
        "10:10749497:A>G",
        "10:107494988:T>C",
        "10:107494989:C>T",
        "10:10749498:C>T",
        "10:107494998:T>C",
        "10:10749499:G>A",
        "10:107495002:T>C",
    ]

Benchmark:

print(len(variants))
# 26
print([mafdb.get(v, 0) for v in variants])
# [0, 0, 0.00149653, 3.18451e-05, 3.18492e-05, 0.000223029, 0.014212, 3.18471e-05, 3.18573e-05, 3.18451e-05, 3.18634e-05, 3.18573e-05, 6.37349e-05, 3.18552e-05, 3.18431e-05, 0.000127502, 0.00350877, 3.18573e-05, 3.18471e-05, 3.18431e-05, 3.18492e-05, 0.0316323, 0.00012747, 3.18431e-05, 6.37105e-05, 3.18451e-05]
%timeit [mafdb.get(v, 0) for v in variants]
# 273 µs ± 4.22 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
  • This example database is only 3.6GB of size, but we are working with database >100GB of size.
  • 1. Not sure how you use RocksDB, but I would suggest you use a single string dimension in TileDB for such queries.
    
    This RocksDB is very simple: its key is a string and the value a float. However, this also means that RocksDB has no idea about the data structure.
    • Using the variant schema from my first post (without sample_id, log10_len and only a single attribute), TileDB should have a great chance to beat RocksDB since ordererd requests are always hitting the same fragment.
    • Also, keeping the different dimensions allows for range queries ("get all variants on chr1", etc.)
  • 2. You may want to tweak the tile capacity and compression for key-value queries (e.g., shrink the tile size, use no compression for local deployments).
    
    Thanks for the hint. I tried without any compression filters but I'm still reaching only ~50it/s.
  • 3. Issuing all key-value queries using `multi_range` will be significantly faster than single-point queries in a loop.
    
    Yes, but I think this is only possible with a single key, right? I.e. querying at once [("chr1", 0, 10108, "A"), ("chr1", 1, 10143, "C"),] will not work I believe?

Having a very fast TileDB solution with a single String dimension would already be a great improvement to us because:

  • RocksDB is completely schema-free
  • RocksDB does not support parallel writes

However, I believe that TileDB should also be able to improve on RocksDB speed when it is provided with some well-defined dimensions.

I hope my benchmark is of some use for you :)

@stavrospapadopoulos
Copy link
Member

Thanks for for the additional information @Hoeze, this is very valuable! We do have a lot of ideas on how to boost performance for this use case, as it is much simpler than the range queries we are currently performing and our current algorithms are an overkill. We will hopefully push them to the next couple of releases. Thanks again!

@nguyenv
Copy link
Collaborator

nguyenv commented Nov 17, 2021

The bugs listed have been fixed as of 0.11.0 (switching out tiledb.from_dataframe for tiledb.from_pandas).

(tiledb-3.9) vivian@mangonada:~/TileDB-Py/tiledb$ python ~/tiledb-bugs/indexing_bugs.py
Deleting array at 'test.tdb'...
Creating array at 'test.tdb'...
OrderedDict([('end', array([10114, 10114, 10114, 10144, 10144, 10114, 10114, 10114],
      dtype=int32)), ('ref', array([b'AACCCT', b'AACCCT', b'AACCCT', b'T', b'T', b'AACCCT', b'AACCCT',
       b'AACCCT'], dtype=object)), ('GT', array([1, 1, 1, 1, 1, 1, 1, 1], dtype=int8)), ('GQ', array([79,  0, 60, 65, 22, 99, 26, 62], dtype=int32)), ('DP', array([12,  9, 39, 34, 35, 26,  9,  9], dtype=int32)), ('chrom', array([b'chr1', b'chr1', b'chr2', b'chr1', b'chr1', b'chr2', b'chr1',
       b'chr1'], dtype=object)), ('log10_len', array([1, 1, 1, 0, 0, 1, 1, 1], dtype=int8)), ('start', array([10108, 10108, 10108, 10143, 10143, 10108, 10108, 10108],
      dtype=int32)), ('alt', array([b'A', b'A', b'A', b'C', b'C', b'A', b'A', b'A'], dtype=object)), ('sample_id', array([b'A', b'B', b'C', b'B', b'A', b'D', b'E', b'F'], dtype=object))])

We will be benchmarking the code soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants