StudyEntry model changes #179

j-coll · 2020-03-18T12:40:52Z

This ticket lists improvements to the Variant data model, mainly to the StudyEntry model, proposed for the next major version. Some of these changes break compatibility with previous versions.

1 - Replace `List<List<String>>` samplesData with `List<SampleEntry>` samples

The current implementation of the variant data model makes it difficult to:

associate samples with files
index multiple files per sample

Current

StudyEntry {
  ...
  List<String> format;
  List<List<String>> samplesData;
  ...
}

Required

/**
 * New data model
 */
SampleEntry {
  String sampleId;       // Optional
  Integer fileIndex;     // Mandatory unless files are excluded
  List<String> data;     // Mandatory
}

StudyEntry {
  ...
  List<String> sampleDataKeys;
  List<SampleEntry> samples;
  ...
}

Implementation notes

A few important implementation notes:

fileIndex points to the files array and is mandatory unless files are excluded
the first value in the samples.data list is always the genotype field (GT), even in somatic studies
the samples.data values follow the order of the sampleDataKeys field in StudyEntry (see below)
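To make the notes above concrete, here is a minimal sketch of how a value could be looked up by key and how a sample could be tied to its file through fileIndex. The nested classes are simplified, hypothetical stand-ins for the proposed models, not the real ones; in particular, `fileIds`, `getSampleData` and `getFileId` are names assumed for illustration only.

```java
import java.util.Arrays;
import java.util.List;

class StudyEntryDemo {

    // Hypothetical, simplified stand-in for the proposed SampleEntry.
    static final class SampleEntry {
        final String sampleId;    // optional
        final Integer fileIndex;  // position in the files list; null when files are excluded
        final List<String> data;  // values in the same order as sampleDataKeys

        SampleEntry(String sampleId, Integer fileIndex, List<String> data) {
            this.sampleId = sampleId;
            this.fileIndex = fileIndex;
            this.data = data;
        }
    }

    // Hypothetical, simplified stand-in for the proposed StudyEntry.
    static final class StudyEntry {
        final List<String> sampleDataKeys;
        final List<String> fileIds;  // assumption: files reduced to their ids for brevity
        final List<SampleEntry> samples;

        StudyEntry(List<String> sampleDataKeys, List<String> fileIds, List<SampleEntry> samples) {
            this.sampleDataKeys = sampleDataKeys;
            this.fileIds = fileIds;
            this.samples = samples;
        }

        /** Returns the value stored under the given key for one sample, or null if absent. */
        String getSampleData(int samplePosition, String key) {
            int keyPosition = sampleDataKeys.indexOf(key);
            if (keyPosition < 0) {
                return null;
            }
            List<String> data = samples.get(samplePosition).data;
            return keyPosition < data.size() ? data.get(keyPosition) : null;
        }

        /** Resolves the file of a sample through fileIndex, or null if files are excluded. */
        String getFileId(int samplePosition) {
            Integer fileIndex = samples.get(samplePosition).fileIndex;
            return fileIndex == null ? null : fileIds.get(fileIndex);
        }
    }

    public static void main(String[] args) {
        StudyEntry study = new StudyEntry(
                Arrays.asList("GT", "DP"),  // GT is always the first key
                Arrays.asList("file1.vcf", "file2.vcf"),
                Arrays.asList(
                        new SampleEntry("s1", 0, Arrays.asList("0/1", "30")),
                        new SampleEntry("s2", 1, Arrays.asList("1/1", "12"))));
        System.out.println(study.getSampleData(1, "DP"));
        System.out.println(study.getFileId(1));
    }
}
```

Because each sample carries its own fileIndex, two entries with the same sampleId but different fileIndex values could coexist in the list, which is what allows several files per sample.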

2 - Rename `StudyEntry.format` to `StudyEntry.sampleDataKeys`

The field name format was taken from the VCF file specification and, unless you have experience with VCF files, it is hard to guess its content from the name. It is renamed to sampleDataKeys, and it specifies the keys of the values in the samples.data array.

3 - Add Issues to `StudyEntry`

You can follow this at #177

4 - Rename `FileEntry.attributes` to `FileEntry.data`

This change makes the name consistent with the SampleEntry.data field.

5 - Replace `map<VariantStats>` with `array<VariantStats>`

To make the data model more homogeneous, replace the map of cohortId -> VariantStats with a list of VariantStats. This requires adding an id field to the VariantStats model.
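A rough sketch of what this change implies for callers: the cohort id moves from the map key into each element, and lookups become a scan of the list. The `VariantStats` class below is a cut-down, hypothetical stand-in with a single value field, and `toList` and `getStats` are helper names assumed for illustration, not part of the model.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class VariantStatsListDemo {

    // Hypothetical, cut-down VariantStats: the new id field plus one value.
    static final class VariantStats {
        final String id;            // cohort id, previously the map key
        final float altAlleleFreq;

        VariantStats(String id, float altAlleleFreq) {
            this.id = id;
            this.altAlleleFreq = altAlleleFreq;
        }
    }

    /** Converts the old cohortId -> value layout into a list whose elements carry their own id. */
    static List<VariantStats> toList(Map<String, Float> freqByCohort) {
        List<VariantStats> stats = new ArrayList<>(freqByCohort.size());
        for (Map.Entry<String, Float> entry : freqByCohort.entrySet()) {
            stats.add(new VariantStats(entry.getKey(), entry.getValue()));
        }
        return stats;
    }

    /** Finds the stats of one cohort by scanning the list; returns null if not present. */
    static VariantStats getStats(List<VariantStats> stats, String cohortId) {
        for (VariantStats s : stats) {
            if (s.id.equals(cohortId)) {
                return s;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Map<String, Float> old = new LinkedHashMap<>();
        old.put("ALL", 0.25f);
        old.put("CASES", 0.5f);
        List<VariantStats> stats = toList(old);
        System.out.println(getStats(stats, "CASES").altAlleleFreq);
    }
}
```

The linear scan is a deliberate simplification; the number of cohorts per study is assumed to be small here.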

6 - Remove unused `hgvs` field in Variant

This field was added to VariantAnnotation and therefore has not been used for a long time.

7 - Replace `FileEntry.call` string with record

Instead of having a single String with the variant and the alleleIdx separated by a colon, replace it with a small model with two fields:

FileEntry {
  fileId: ""
  data: {}
  call : {
    variantId: "chr:pos:ref:alt[,secAlts]",
    alleleIndex: N
  }
}

Unlike the previous string call field, the new "call.variantId" field starts with the chromosome. The correct way to parse it is with new Variant(call.getVariantId()).
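For readers without the library at hand, the layout of the variantId string can be illustrated with a small stand-alone sketch. This is not the Variant constructor mentioned above and does not replace it: the `OriginalCall` class and its accessors are hypothetical, and the split assumes that the chromosome, reference and alternate fields contain no colon.

```java
import java.util.Arrays;
import java.util.List;

class OriginalCallDemo {

    // Hypothetical stand-in for the proposed call record.
    static final class OriginalCall {
        final String variantId;     // "chr:pos:ref:alt[,secAlts]"
        final Integer alleleIndex;

        OriginalCall(String variantId, Integer alleleIndex) {
            this.variantId = variantId;
            this.alleleIndex = alleleIndex;
        }

        // Splits the id into chr, pos, ref and the comma-separated alternates.
        private String[] fields() {
            String[] fields = variantId.split(":", 4);
            if (fields.length != 4) {
                throw new IllegalArgumentException(
                        "Expected chr:pos:ref:alt[,secAlts] but got " + variantId);
            }
            return fields;
        }

        String chromosome() { return fields()[0]; }

        int position() { return Integer.parseInt(fields()[1]); }

        String reference() { return fields()[2]; }

        /** Main alternate first, followed by any secondary alternates. */
        List<String> alternates() { return Arrays.asList(fields()[3].split(",", -1)); }
    }

    public static void main(String[] args) {
        OriginalCall call = new OriginalCall("1:1000:A:T,G", 1);
        System.out.println(call.chromosome() + " " + call.position()
                + " " + call.reference() + " " + call.alternates());
    }
}
```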

Tasks

1 - StudyEntry.samplesData to StudyEntry.samples
2 - StudyEntry.format to StudyEntry.sampleDataKeys
3 - Add StudyEntry.issues
4 - FileEntry.attributes to FileEntry.data
5 - map<VariantStats> to array<VariantStats>
6 - Remove hgvs
7 - Replace string FileEntry.call with specific record

j-coll added this to the v2.0.0 milestone Mar 18, 2020

j-coll self-assigned this Mar 18, 2020

j-coll added a commit that referenced this issue Mar 18, 2020

models: Replace List<List<String>> samplesData with List<SampleEntry>…

360283d

… samples. #179

j-coll mentioned this issue Mar 18, 2020

Adapt code to Biodata Variant model changes opencb/opencga#1553

Open

imedina added the enhancement label Mar 20, 2020

j-coll added a commit that referenced this issue Mar 20, 2020

models: Rename StudyEntry.format to StudyEntry.sampleDataKeys #179

a81b6c4

j-coll added a commit that referenced this issue Mar 20, 2020

models: Replace StudyEntry.samples map with list #179

c97f2bf

j-coll added a commit that referenced this issue Mar 20, 2020

models: Remove hgvs from Variant model. #179

93591c2

This was referenced Mar 23, 2020

Remove SAMPLE_ID and FILE_IDX from Format. Add INCLUDE_SAMPLE_ID opencb/opencga#1555

Open

Rename VariantQueryParams 'format', 'includeFormat' and 'info' opencb/opencga#1556

Open

j-coll added a commit that referenced this issue Mar 25, 2020

models: Replace FileEntry.call with dedicated model #179

6875e4a
