Tabular format: compact version of peak level table? #12
Comments
I think the design decisions can be quite different based on the data format, i.e. between text-based (CSV, TSV) and binary (HDF5). Personally, with spectral libraries increasing in size, I'm strongly in favor of HDF5. HDF5 has built-in compression, resulting in much smaller files. Also, it's much easier and more efficient to slice and dice the data.

Taking that into consideration, the compact option seems considerably superior to me. You could go even further and just have two arrays per spectrum (m/z and intensity), which fits the HDF5 data model perfectly. This minimizes the query time (2 lookups to retrieve a spectrum versus 2 * k per spectrum with k peaks), and HDF5 was developed to store (binary) arrays.

Related to your final question, the main consideration here is what the goal of this version of the format is. If it is readability, then CSV is obviously superior. But I don't care about readability here; as you mention in #11, there's already the text version (and to a lesser extent the JSON version). Instead, when going for HDF5, performance should be the main goal. And that means using HDF5 the way it was intended and storing values in arrays instead of each individual value separately. Make it as compact as possible, and make spectrum reading efficient by storing the peaks in compact arrays.
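As a minimal sketch of the "two arrays per spectrum" layout suggested above, using h5py (the group/dataset names `spectra`, `mz`, and `intensity` are illustrative assumptions, not part of any agreed specification):

```python
import h5py
import numpy as np

# Write one group per spectrum, with one compressed array per peak
# dimension instead of one row per individual peak.
with h5py.File("library.h5", "w") as f:
    spec = f.create_group("spectra/spectrum_0001")
    spec.create_dataset("mz", data=np.array([147.11, 204.13, 361.21]),
                        compression="gzip")
    spec.create_dataset("intensity", data=np.array([0.12, 1.00, 0.43]),
                        compression="gzip")

# Reading a spectrum back is two dataset lookups, independent of the
# number of peaks k (versus 2 * k lookups in a fully exploded layout).
with h5py.File("library.h5", "r") as f:
    mz = f["spectra/spectrum_0001/mz"][:]
    intensity = f["spectra/spectrum_0001/intensity"][:]
```

Group attributes (`spec.attrs`) could additionally hold spectrum-level metadata, which is where HDF5's nested key/metadata model pays off.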
Thanks for the reply! I started looking into HDF and there's a lot more to it than I initially thought. Definitely the nested key system and metadata for each group would be very useful for the spectral library format. This means that an optimal HDF representation would look pretty different from this general tabular format. Since we want the specification to allow multiple representations (txt, json, csv, hdf...), I propose that we keep this discussion about general tabular formats (such as csv/tsv) and move the discussion on the HDF format to a new issue (#13).
Following the current JSON format, the tabular format (HDF, CSV, TSV...) would have four tables, one for each data level (`library`, `spectrum`, `peak`, and `peak interpretation`) with the following columns: `cv_param_group`, `accession`, `name`, `value_accession`, and `value` (and some additional grouping columns):

Library level
Spectrum level
Peak level
Peak interpretation level
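To sketch what this long format looks like for the spectrum level (the accessions shown are real PSI-MS CV terms, but the specific rows and the `spectrum` grouping column are illustrative assumptions):

```python
import pandas as pd

# Illustrative long-format "spectrum level" table: one row per attribute,
# with the cv_param_group / accession / name / value_accession / value
# columns described above, plus a grouping column for the spectrum index.
spectrum_table = pd.DataFrame(
    [
        {"spectrum": 1, "cv_param_group": None,
         "accession": "MS:1000744", "name": "selected ion m/z",
         "value_accession": None, "value": 504.25},
        {"spectrum": 1, "cv_param_group": None,
         "accession": "MS:1000041", "name": "charge state",
         "value_accession": None, "value": 2},
    ]
)
print(spectrum_table.to_csv(index=False))
```

Each attribute occupies a full row, which is flexible for the metadata-heavy levels but verbose when repeated for every single peak.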
This works perfectly fine for the `library`, `spectrum`, and `peak interpretation` levels (where there are a lot of possible attributes per entry), but for the `peak` level, it might be better to have a compact form:

Peak level (compact)
This could be extended with a few optional columns.

To keep everything well standardized and machine readable, I would add an additional table, `Peak level columns`, defining the columns used in the `Peak level (compact)` table, which could also contain info about the units used (if applicable). E.g.:

Peak level columns additional table
To summarize:

Questions:

- ... the `peak` level? ... the `value`, `value_accession` duality, which we do not have at the `peak` level, as all values are just numbers.
- What does everyone think about the "full-on compact form" idea?