Indexing bugs #594
Comments
Not sure what is going on there, I will take a look and debug tomorrow.
I think what you want is:
For these:
It is a bug; I think we have a fix in progress in libtiledb core, but I will see if we can apply the substance of the following workaround automatically in TileDB-Py. Right now you can work around it like this (place-holder
Also, this works for
and this (for my suggestion):
and this one (avoids the crash):
Hi @Hoeze, are you able to share any more details about how you are using RocksDB? I’m not very familiar with it yet, and we’d like to dig into the comparison a bit more.
Hi @Hoeze, a couple more notes on key-value queries:
If you can provide more details, we could take a closer look and see whether there is any low-hanging fruit for optimization here.
Hi, I set up a RocksDB example for you.

from pathlib import Path
import rocksdb


class VariantDB:
    def __init__(self, path, rocksdb_options=None):
        if rocksdb_options is None:
            rocksdb_options = rocksdb.Options(
                create_if_missing=True,
                max_open_files=100,
            )
        self.db = rocksdb.DB(
            path,
            rocksdb_options,
            read_only=True
        )

    @staticmethod
    def _variant_to_byte(variant):
        return bytes(str(variant), 'utf-8')

    def _type(self, value):
        raise NotImplementedError()

    def _get(self, variant):
        if not variant.startswith('chr'):
            variant = 'chr%s' % variant
        return self.db.get(self._variant_to_byte(variant))

    def __getitem__(self, variant):
        maf = self._get(variant)
        if maf:
            return self._type(maf)
        else:
            raise KeyError('This variant is not in the db')

    def __contains__(self, variant):
        return self._get(variant) is not None

    def get(self, variant, default=None):
        try:
            return self[variant]
        except KeyError:
            return default


class VariantMafDB(VariantDB):
    def _type(self, value):
        return float(value)


mafdb_path = "maf.db"
mafdb = VariantMafDB(mafdb_path)
variants = [
'1:10468:T>TAA',
'1:10468:TCGCGG>T',
"10:107494853:C>A",
"10:107494857:C>A",
"10:107494858:T>C",
"10:107494873:C>T",
"10:107494874:G>A",
"10:107494905:GAGAA>G",
"10:107494908:A>G",
"10:107494929:T>C",
"10:107494933:T>C",
"10:107494935:G>A",
"10:107494937:C>G",
"10:107494941:CTTG>C",
"10:107494942:T>A",
"10:107494943:T>C",
"10:107494960:G>T",
"10:107494964:C>A",
"10:107494979:G>A",
"10:10749497:A>G",
"10:107494988:T>C",
"10:107494989:C>T",
"10:10749498:C>T",
"10:107494998:T>C",
"10:10749499:G>A",
"10:107495002:T>C",
]

Benchmark:

print(len(variants))
# 26
print([mafdb.get(v, 0) for v in variants])
# [0, 0, 0.00149653, 3.18451e-05, 3.18492e-05, 0.000223029, 0.014212, 3.18471e-05, 3.18573e-05, 3.18451e-05, 3.18634e-05, 3.18573e-05, 6.37349e-05, 3.18552e-05, 3.18431e-05, 0.000127502, 0.00350877, 3.18573e-05, 3.18471e-05, 3.18431e-05, 3.18492e-05, 0.0316323, 0.00012747, 3.18431e-05, 6.37105e-05, 3.18451e-05]
%timeit [mafdb.get(v, 0) for v in variants]
# 273 µs ± 4.22 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
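To reproduce this kind of measurement outside IPython, the standard-library `timeit` module can stand in for `%timeit`. The following is only a minimal sketch: `time_lookups` and the dict-backed `fake_db` are hypothetical stand-ins for `mafdb`, so the number it prints says nothing about RocksDB or TileDB themselves. It also reports the best run rather than the mean ± std. dev. that `%timeit` shows.

```python
import timeit


def time_lookups(get, keys, number=1000, repeat=7):
    """Return the best per-loop time (seconds) for looking up all keys.

    `get` is any callable taking (key, default), for example `mafdb.get`.
    """
    def loop():
        return [get(k, 0) for k in keys]

    # timeit.repeat returns `repeat` totals, each covering `number` loops.
    totals = timeit.repeat(loop, number=number, repeat=repeat)
    return min(totals) / number


# Hypothetical in-memory stand-in for the database (an assumption);
# the single value is copied from the benchmark output in this thread.
fake_db = {"chr10:107494853:C>A": 0.00149653}
keys = ["chr10:107494853:C>A", "chr1:10468:T>TAA"]

per_loop = time_lookups(fake_db.get, keys, number=100, repeat=3)
print(f"{per_loop * 1e6:.2f} µs per loop")
```

Passing `mafdb.get` (or the equivalent TileDB lookup) as `get` would let both backends be timed with the same harness.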
Having a very fast TileDB solution with a single String dimension would already be a great improvement for us because:
However, I believe that TileDB should also be able to improve on RocksDB speed when it is provided with some well-defined dimensions. I hope my benchmark is of some use for you :)
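As an illustration of what "well-defined dimensions" could mean for these keys, the variant strings can be split into typed fields before they are stored. This is a sketch under assumptions: `Variant` and `parse_variant` are hypothetical helpers, and the `chrom:pos:ref>alt` layout is inferred from the example keys in this thread, not from any TileDB or RocksDB API.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str


def parse_variant(key: str) -> Variant:
    """Split a 'chrom:pos:ref>alt' key into typed fields."""
    chrom, pos, alleles = key.split(":")
    ref, alt = alleles.split(">")
    # Mirror the 'chr' prefix normalization used by VariantDB._get.
    if not chrom.startswith("chr"):
        chrom = "chr" + chrom
    return Variant(chrom, int(pos), ref, alt)


print(parse_variant("10:107494853:C>A"))
# Variant(chrom='chr10', pos=107494853, ref='C', alt='A')
```

With the key split this way, `chrom` and `pos` become candidates for separate array dimensions, which is the shape the `("chr1", 0)` style of indexing in this issue suggests.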
Thanks for the additional information @Hoeze, this is very valuable! We do have a lot of ideas on how to boost performance for this use case, as it is much simpler than the range queries we are currently performing and our current algorithms are overkill. We will hopefully get them into the next couple of releases. Thanks again!
The bugs listed have been fixed as of 0.11.0 (switching out
We will be benchmarking the code soon.
Hi @ihnorton, sorry to bother once more, but I think I found a couple of bugs in conjunction with indexing.
Setup:
Now my trials:
A[[("chr1", 0), ("chr1", 1),]]
)multi_index
indexing with DataFrame as return type?