HDF representation #13
@bittremieux, would you then propose a group system that looks like this:
with metadata embedded as attributes for each relevant group? That seems very structured and efficient for reading a given spectrum, but it would mean that one library could have a lot of keys (up to a million or more), and you mentioned performance degradation in that case.
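For illustration, here is a minimal h5py sketch of what such a group-per-spectrum layout with attribute metadata could look like (all group, dataset, and attribute names below are made up, not part of any agreed format):

```python
# Minimal sketch of a group-per-spectrum HDF5 layout; names/values are illustrative.
import h5py
import numpy as np

with h5py.File("library.hdf5", "w") as f:
    spec = f.create_group("spectra/spectrum_00001")
    # Peak arrays stored as datasets within the spectrum group.
    spec.create_dataset("mz", data=np.array([147.113, 175.119, 263.140]))
    spec.create_dataset("intensity", data=np.array([1200.0, 850.0, 430.0]))
    # Metadata embedded as HDF5 attributes on the group.
    spec.attrs["precursor_mz"] = 502.271
    spec.attrs["precursor_charge"] = 2
    spec.attrs["analyte"] = "ELVISLIVESK"
```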
That seems a pretty logical layout. And yes, for large-scale spectral libraries the number of keys will be in the millions, but that is still significantly lower than with the previously suggested layout where each peak would be stored separately. This is a much better trade-off between ease of querying and performance, and I can't immediately come up with a better alternative. Also, by using groups the performance hit should be relatively modest, even for millions of keys; HDF5 was developed as a big-data format after all. Here are some relevant threads on HDF5 performance:
Hi, the above structure
Thanks for the input!
mz5 is a nice format in between HDF5 and mzML. If you want input from @mwilhelm as well, I can ping him. Is there really no newer binary format for scientific data? Another interesting approach is BiblioSpec, which is just SQLite. Not sure how it scales to many millions of spectra (but years ago I had SQLite spectral libraries > 100 GB that were fine and fast). Their trick is to not have a peak table, but rather to store the compressed version of an array as a binary blob: https://github.com/ProteoWizard/pwiz/blob/dbd6221d39dc43b3b8a595bd9bd63661fee1daa6/pwiz_tools/BiblioSpec/src/LibToSqlite3.cpp#L262-L286 Original publication: https://www.ncbi.nlm.nih.gov/pubmed/18428681
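As a rough sketch of that trick (not BiblioSpec's actual schema), the peak arrays could be compressed and stored as one blob per spectrum, e.g. in Python:

```python
# Sketch of storing compressed peak arrays as binary blobs in SQLite.
# Table and column names are illustrative.
import sqlite3
import zlib
import numpy as np

conn = sqlite3.connect("library.sqlite")
conn.execute(
    "CREATE TABLE IF NOT EXISTS spectra ("
    "id INTEGER PRIMARY KEY, num_peaks INTEGER, peak_mz BLOB, peak_intensity BLOB)"
)

mz = np.array([147.113, 175.119, 263.140], dtype=np.float64)
intensity = np.array([1200.0, 850.0, 430.0], dtype=np.float32)
conn.execute(
    "INSERT INTO spectra (num_peaks, peak_mz, peak_intensity) VALUES (?, ?, ?)",
    (len(mz), zlib.compress(mz.tobytes()), zlib.compress(intensity.tobytes())),
)
conn.commit()

# Reading back: decompress the blob and reinterpret the raw bytes as an array.
num_peaks, mz_blob, int_blob = conn.execute(
    "SELECT num_peaks, peak_mz, peak_intensity FROM spectra WHERE id = 1"
).fetchone()
mz_read = np.frombuffer(zlib.decompress(mz_blob), dtype=np.float64)
```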
There are loads of binary formats designed for one purpose or another. They tend to be tuned specifically to a use case, written for one language (usually C with a binding for a scripting language), and probably not capable of all of the features that HDF5 or SQLite3 have. Both HDF5 and SQLite3 can store array data, though HDF5 may have an edge with certain features being built in, and even more plugin support. My experience has been that HDF5 also makes those arrays far more transparent to the caller, with less work to manipulate them before fully loading them into memory. I don't think the design goal here is to create large arrays to be sliced and indexed on disk, though.

Does HDF5 support queries over attributes? Suppose you've got a repository-scale library containing a mix of ETD and HCD spectra (or positive-mode and negative-mode spectra) over a range of activation energies, and you want to find all the library entries that are of only one of those types AND have a precursor m/z within some interval AND came from a mass analyzer with a mass error tolerance below 10 ppm AND an activation energy close to the one you used.

When first considering the problem of storing a variable number of descriptive properties in SQLite3, I thought of what was done in mzDB (10.1074/mcp.O114.039115), which was to store serialized XML "param_tree"s, which queries cannot parse or index over. SQLite3 supports JSON object-valued columns and partial indices over them with the JSON1 extension that already ships with the SQLite3 source amalgamation today. This would let us keep the simple "param_tree" storage and the query-ability without needing a huge entity-attribute-value table. Note that JSON1 isn't ubiquitous or enabled by default, though, so you might still need to statically link with or ship an up-to-date SQLite3 shared library. Alternatively, entity-attribute-value tables aren't evil, just uncomfortable to use and slow to completely enumerate.
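As a sketch of how such a combined metadata query could look with the JSON1 approach (assuming a hypothetical `spectra` table with a `params` JSON column and an SQLite build that includes JSON1; all names, CV accessions, and thresholds are only illustrative):

```python
# Sketch of a combined metadata query over a JSON params column in SQLite.
import sqlite3

conn = sqlite3.connect("library.sqlite")
etd_hits = conn.execute(
    """
    SELECT id FROM spectra
    WHERE json_extract(params, '$."MS:1000598"') IS NOT NULL   -- ETD entries only
      AND precursor_mz BETWEEN ? AND ?                         -- precursor m/z window
      AND json_extract(params, '$.ppm_error_tolerance') < 10   -- analyzer tolerance
    """,
    (500.0, 505.0),
).fetchall()
```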
I also want to add that entity-attribute-value tables are a pain in the butt, especially if you use them in a standard (SQL) setup.
AFAIK you can't query entries in an HDF5 file based on the value of their attributes. You can use attributes to store metadata, but to filter on a specific value you'd need to walk the entire tree and filter manually. Not great. Also, +1 on the point that entity-attribute-value tables in SQL are pretty annoying.
This is always a problem, but the issue is that there are potentially hundreds of metadata attributes that we want to be able to capture. So we need a system that can capture and store any attribute. BUT, it is usually the case that there is only a relatively small subset of attributes that one would want to filter or search on. So the data model and the archival/transmission format need to be able to store anything. The optimized active-use format should probably also be able to store anything, but then optimize search and filter on a subset of terms. Some RDBMSs have sparse-column support, and maybe other systems do too. Otherwise we could implement a sparse storage scheme in a custom binary format, or just reuse SQLite or HDF5 where the most common attributes are indexed as columns and the rest are simply stored (see the sketch below). But all of that is very usage-specific: metabolomics will want different columns than proteomics. So I think the best goal is to make a nice format that can store anything with controlled vocabulary, and then everyone can try to implement their idea of a fast storage mechanism using the technology they like, which history has shown we can't all agree on.
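A minimal sketch of that hybrid idea in SQLite, assuming a hypothetical table where a few common attributes are typed, indexed columns and every remaining CV-term attribute lands in a JSON blob:

```python
# Sketch of "common attributes as typed, indexed columns; everything else just stored".
# Column names and the catch-all JSON `params` column are illustrative.
import sqlite3

conn = sqlite3.connect("library.sqlite")
conn.executescript(
    """
    CREATE TABLE IF NOT EXISTS spectra (
        id           INTEGER PRIMARY KEY,
        precursor_mz REAL,     -- common, searchable attributes as typed columns
        charge       INTEGER,
        analyte      TEXT,
        params       TEXT      -- all remaining CV-term attributes stored as JSON
    );
    CREATE INDEX IF NOT EXISTS idx_precursor_mz ON spectra (precursor_mz);
    """
)
```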
With SQLite3's JSON extension enabled, you can create partial indices over JSON-path expressions:

```sql
CREATE INDEX idx_spectrum_is_etd ON [spectrum_table](id)
WHERE json_extract(params, '$.MS:1000598') IS NOT NULL;
```

Then searching for ETD spectra will incur one index traversal followed by a fast series of reads from the spectrum table, whatever it ends up being called. This does not require that every row in the spectrum table store a value for that term in its params.

Of course, we can solve this problem in HDF5 by simply building an index array ourselves and storing it in the library too. This complicates the library-writing code, because now we have to manage building and updating all the indices whenever we add a new entry or decide we want to add/remove an index. It also complicates reading, because it requires that the library-reading code manually manipulate these index arrays.
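A possible sketch of such a self-managed index with h5py, assuming the group-per-spectrum layout discussed above (dataset and attribute names are only illustrative): a dataset of (precursor_mz, group_name) records sorted by m/z, so a reader can binary-search it instead of walking every spectrum group.

```python
# Sketch of a hand-rolled index dataset stored inside the HDF5 library itself.
import h5py
import numpy as np

with h5py.File("library.hdf5", "a") as f:
    names = sorted(f["spectra"].keys())
    mzs = np.array([f["spectra"][name].attrs["precursor_mz"] for name in names])
    order = np.argsort(mzs)
    index = np.zeros(len(names), dtype=[("precursor_mz", "f8"), ("group_name", "S64")])
    index["precursor_mz"] = mzs[order]
    index["group_name"] = np.array(names, dtype="S64")[order]
    # Rebuild the index dataset from scratch whenever entries change.
    if "index/precursor_mz" in f:
        del f["index/precursor_mz"]
    f.create_dataset("index/precursor_mz", data=index)
```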
My idea for HDF5 was indeed to have an additional index array with the most common attributes as columns. We could define a small set of required attributes (e.g. precursor m/z, charge...) and make it possible for the user to expand that list. Writing does indeed get a little more complex, but it should still be manageable. SQLite with the JSON extension also looks nice, but that could easily be another representation of the format.
I think I came into the middle of this without knowing the data model, and just engaged in pedantry over data format engineering. Is there something I can read that goes over the data model formally? I saw some Google Docs links, but it wasn't clear whether they were current/authoritative. Wouldn't reading be trickier too, since there would need to be a way to introspect read requests to decide which indices to use? This would be tightly coupled to the reading interface, of course. The more general/composable the interface is, the closer the implementation gets to implementing a query optimizer. It seems tempting to say that indices are an implementation detail, but that makes library exchange more difficult unless you completely rebuild the file on receipt. On the other hand, that might be necessary with a proliferation of back-ends in the first place, requiring that libraries be exchanged in one of the less "optimized" formats before being tuned for searching by the consuming application. Is this the intent?
Over the last few weeks I have heard several comments offline from people asking about the current data model, not knowing its current state. Can we make a push to collect that first? I'd expect that this would also form the basis of reporting & discussions at PSI-MS in San Diego next month. Yours, Steffen
@RalfG any interest in giving this a try since most things are settled now? |
Are there any other binary implementations/representations planned?
It was already decided to allow multiple representations (txt, json, csv, hdf...) of the new spectral library format, based on a common framework (required (meta)data, controlled vocabulary...). In this issue thread, we can discuss the best way to represent the spectral library format in HDF.
As a reference, the current TXT format looks like this:
And JSON (for one metadata item) would take the following shape:
Discussion spun off from issue #12:
@bittremieux:
@RalfG: