HDF representation #13
@bittremieux, would you then propose a group system that looks like this:
with metadata embedded as attributes for each relevant group? That seems very structured and efficient for reading a given spectrum, but it would mean that one library could have a lot of keys (up to a million or more), and you mentioned performance degradation in that case.
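For illustration, here is a minimal h5py sketch of what such a group-per-spectrum layout with attribute metadata could look like (all group, dataset, and attribute names below are made up, not part of any agreed format):

```python
# Minimal sketch of a group-per-spectrum HDF5 layout; names/values are illustrative.
import h5py
import numpy as np

with h5py.File("library.hdf5", "w") as f:
    spec = f.create_group("spectra/spectrum_00001")
    # Peak arrays stored as datasets within the spectrum group.
    spec.create_dataset("mz", data=np.array([147.113, 175.119, 263.140]))
    spec.create_dataset("intensity", data=np.array([1200.0, 850.0, 430.0]))
    # Metadata embedded as HDF5 attributes on the group.
    spec.attrs["precursor_mz"] = 502.271
    spec.attrs["precursor_charge"] = 2
    spec.attrs["analyte"] = "ELVISLIVESK"
```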
That seems a pretty logical layout. And yes, for large-scale spectral libraries the number of keys will be in the millions, but that is still significantly lower than with the previously suggested layout where each peak would be stored separately. This is a much better trade-off between ease of querying and performance, and I can't immediately come up with a better alternative. Also, by using groups the performance hit should be relatively modest, even for millions of keys; HDF5 was developed as a big-data format after all. Here are some relevant threads on HDF5 performance:
Hi, the above structure
Thanks for the input!
mz5 is a nice format in between HDF5 and mzML. If you want input from @mwilhelm as well, I can ping him. Is there really no newer binary format for scientific data? Another interesting approach is BiblioSpec, which is just SQLite. Not sure how it scales to many millions of spectra (but years ago I had SQLite spectral libraries > 100 GB that were fine and fast). Their trick is to not have a peak table, but rather to store the compressed version of an array as a binary blob: https://github.com/ProteoWizard/pwiz/blob/dbd6221d39dc43b3b8a595bd9bd63661fee1daa6/pwiz_tools/BiblioSpec/src/LibToSqlite3.cpp#L262-L286 Original publication: https://www.ncbi.nlm.nih.gov/pubmed/18428681
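As a rough sketch of that trick (not BiblioSpec's actual schema), the peak arrays could be compressed and stored as one blob per spectrum, e.g. in Python:

```python
# Sketch of storing compressed peak arrays as binary blobs in SQLite.
# Table and column names are illustrative.
import sqlite3
import zlib
import numpy as np

conn = sqlite3.connect("library.sqlite")
conn.execute(
    "CREATE TABLE IF NOT EXISTS spectra ("
    "id INTEGER PRIMARY KEY, num_peaks INTEGER, peak_mz BLOB, peak_intensity BLOB)"
)

mz = np.array([147.113, 175.119, 263.140], dtype=np.float64)
intensity = np.array([1200.0, 850.0, 430.0], dtype=np.float32)
conn.execute(
    "INSERT INTO spectra (num_peaks, peak_mz, peak_intensity) VALUES (?, ?, ?)",
    (len(mz), zlib.compress(mz.tobytes()), zlib.compress(intensity.tobytes())),
)
conn.commit()

# Reading back: decompress the blob and reinterpret the raw bytes as an array.
num_peaks, mz_blob, int_blob = conn.execute(
    "SELECT num_peaks, peak_mz, peak_intensity FROM spectra WHERE id = 1"
).fetchone()
mz_read = np.frombuffer(zlib.decompress(mz_blob), dtype=np.float64)
```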
There are loads of binary formats designed for one purpose or another. They tend to be tuned specifically to a use case, written for one language (usually C with a binding for a scripting language), and probably not capable of all of the features that HDF5 or SQLite3 have. Both HDF5 and SQLite3 can store array data, though HDF5 may have an edge with certain features being built in, and even more plugin support. My experience has been that HDF5 also makes those arrays far more transparent to the caller, with less work to manipulate them before fully loading them into memory. I don't think the design goal here is to create large arrays to be sliced and indexed on disk, though.

Does HDF5 support queries over attributes? Suppose you've got a repository-scale library containing a mix of ETD and HCD spectra (or positive-mode and negative-mode spectra) over a range of activation energies, and you want to find all the library entries that are of only one of those types AND have a precursor m/z within some interval AND came from a mass analyzer with a mass error tolerance below 10 ppm AND an activation energy close to the one you used.

When first considering the problem of storing a variable number of descriptive properties in SQLite3, I thought of what was done in mzDB (10.1074/mcp.O114.039115), which was to store serialized XML "param_tree"s, which queries cannot parse or index over. SQLite3 supports JSON object-valued columns and partial indices over them with the JSON1 extension that already ships with the SQLite3 source amalgamation today. This would let us keep the simple "param_tree" storage and the query-ability without needing a huge entity-attribute-value table. Note that JSON1 isn't ubiquitous or enabled by default, though, so you might still need to statically link with or ship an up-to-date SQLite3 shared library. Alternatively, entity-attribute-value tables aren't evil, just uncomfortable to use and slow to completely enumerate.
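As a sketch of how such a combined metadata query could look with the JSON1 approach (assuming a hypothetical `spectra` table with a `params` JSON column and an SQLite build that includes JSON1; all names, CV accessions, and thresholds are only illustrative):

```python
# Sketch of a combined metadata query over a JSON params column in SQLite.
import sqlite3

conn = sqlite3.connect("library.sqlite")
etd_hits = conn.execute(
    """
    SELECT id FROM spectra
    WHERE json_extract(params, '$."MS:1000598"') IS NOT NULL   -- ETD entries only
      AND precursor_mz BETWEEN ? AND ?                         -- precursor m/z window
      AND json_extract(params, '$.ppm_error_tolerance') < 10   -- analyzer tolerance
    """,
    (500.0, 505.0),
).fetchall()
```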
I also want to add that entity-attribute-value tables are a pain in the butt, especially if you use them in a standard (SQL) setup.
AFAIK you can't query entries in an HDF5 file based on the value of their attributes. You can use attributes to store metadata, but to filter on a specific value you'd need to walk the entire tree and filter manually. Not great. Also, +1 on the point that entity-attribute-value tables in SQL are pretty annoying.
This is always a problem, but the issue is that there are potentially hundreds of metadata attributes that we want to be able to capture. So we need a system that can capture and store any attribute. BUT, it is usually the case that there is only a relatively small subset of attributes that one would want to filter or search on. So the data model and the archival/transmission format need to be able to store anything. The optimized active-use format should probably also be able to store anything, but then optimize search and filter on a subset of terms. Some RDBMSs have sparse-column support, and maybe other systems do too. Otherwise we could implement a sparse storage scheme in a custom binary format, or just reuse SQLite or HDF5 where the most common attributes are indexed as columns and the rest are simply stored (see the sketch below). But all of that is very usage-specific: metabolomics will want different columns than proteomics. So I think the best goal is to make a nice format that can store anything with controlled vocabulary, and then everyone can try to implement their idea of a fast storage mechanism using the technology they like, which history has shown we can't all agree on.
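A minimal sketch of that hybrid idea in SQLite, assuming a hypothetical table where a few common attributes are typed, indexed columns and every remaining CV-term attribute lands in a JSON blob:

```python
# Sketch of "common attributes as typed, indexed columns; everything else just stored".
# Column names and the catch-all JSON `params` column are illustrative.
import sqlite3

conn = sqlite3.connect("library.sqlite")
conn.executescript(
    """
    CREATE TABLE IF NOT EXISTS spectra (
        id           INTEGER PRIMARY KEY,
        precursor_mz REAL,     -- common, searchable attributes as typed columns
        charge       INTEGER,
        analyte      TEXT,
        params       TEXT      -- all remaining CV-term attributes stored as JSON
    );
    CREATE INDEX IF NOT EXISTS idx_precursor_mz ON spectra (precursor_mz);
    """
)
```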
With SQLite3's JSON extension enabled, you can create partial indices over JSON-path expressions:

```sql
CREATE INDEX idx_spectrum_is_etd ON [spectrum_table](id)
WHERE json_extract(params, '$.MS:1000598') IS NOT NULL;
```

Then searching for ETD spectra will incur one index traversal followed by a fast series of reads from the spectrum table, whatever it ends up being called. This does not require that every row in the spectrum table store a value for that term in its params.

Of course, we can solve this problem in HDF5 by simply building an index array ourselves and storing it in the library too. This complicates the library-writing code, because now we have to manage building and updating all the indices whenever we add a new entry or decide we want to add/remove an index. It also complicates reading, because it requires that the library-reading code manually manipulate these index arrays.
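A possible sketch of such a self-managed index with h5py, assuming the group-per-spectrum layout discussed above (dataset and attribute names are only illustrative): a dataset of (precursor_mz, group_name) records sorted by m/z, so a reader can binary-search it instead of walking every spectrum group.

```python
# Sketch of a hand-rolled index dataset stored inside the HDF5 library itself.
import h5py
import numpy as np

with h5py.File("library.hdf5", "a") as f:
    names = sorted(f["spectra"].keys())
    mzs = np.array([f["spectra"][name].attrs["precursor_mz"] for name in names])
    order = np.argsort(mzs)
    index = np.zeros(len(names), dtype=[("precursor_mz", "f8"), ("group_name", "S64")])
    index["precursor_mz"] = mzs[order]
    index["group_name"] = np.array(names, dtype="S64")[order]
    # Rebuild the index dataset from scratch whenever entries change.
    if "index/precursor_mz" in f:
        del f["index/precursor_mz"]
    f.create_dataset("index/precursor_mz", data=index)
```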
My idea for HDF5 was indeed to have an additional index array with the most common attributes as columns. We could define a small set of required attributes (e.g. precursor m/z, charge...) and make it possible for the user to expand that list. Writing does indeed get a little more complex, but it should still be manageable. SQLite with the JSON extension also looks nice, but that could easily be another representation of the format.
I think I came into the middle of this without knowing the data model, and just engaged in pedantry over data format engineering. Is there something I can read that goes over the data model formally? I saw some Google Docs links, but it wasn't clear whether they were current/authoritative. Wouldn't reading be trickier too, since there would need to be a way to introspect read requests to decide which indices to use? This would be tightly coupled to the reading interface, of course. The more general/composable the interface is, the closer the implementation gets to implementing a query optimizer. It seems tempting to say that indices are an implementation detail, but that makes library exchange more difficult unless you completely rebuild the file on receipt. On the other hand, that might be necessary with a proliferation of back-ends in the first place, requiring that libraries be exchanged in one of the less "optimized" formats before being tuned for searching by the consuming application. Is this the intent?
Over the last few weeks I have heard several comments offline from people asking about the current data model, not knowing its current state. Can we make a push to collect that first? I'd expect that this would also form the basis of reporting & discussions at PSI-MS in San Diego next month. Yours, Steffen
@RalfG any interest in giving this a try since most things are settled now? |
Are there any other binary implementations/representations planned?
It was already decided to allow multiple representations (txt, json, csv, hdf...) of the new spectral library format, based on a common framework (required (meta)data, controlled vocabulary...). In this issue thread, we can discuss the best way to represent the spectral library format in HDF.
As a reference, the current TXT format looks like this:
And JSON (for one metadata item) would take the following shape:
Discussion spun off from issue #12:
@bittremieux:
@RalfG: