StudyEntry model changes #179

Open
7 tasks done
j-coll opened this issue Mar 18, 2020 · 0 comments
j-coll commented Mar 18, 2020

This ticket lists improvements to the Variant data model, mainly to the StudyEntry model, proposed for the next major version. Some of these changes break compatibility with previous versions.

1 - Replace List<List<String>> samplesData with List<SampleEntry> samples

The current implementation of the variant data model makes it difficult to:

  1. associate samples and files
  2. have multiple files indexed per sample
Current
StudyEntry {
  ...
  List<String> format;
  List<List<String>> samplesData;
  ...
}
Required
/**
 * New data model
 */
SampleEntry {
  String sampleId;       // Optional
  Integer fileIndex;     // Mandatory unless files are excluded
  List<String> data;     // Mandatory
}

StudyEntry {
  ...
  List<String> sampleDataKeys;
  List<SampleEntry> samples;
  ...
}
Implementation notes

A few important implementation notes:

  1. fileIndex points to a position in the files array, and it is mandatory unless files are excluded
  2. the first value in the samples.data list is always the genotype field (GT), even in somatic studies
  3. the samples.data values are sorted following the sampleDataKeys field in the StudyEntry (see below)
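
The notes above can be sketched as follows. This is only an illustration under stated assumptions: the nested `SampleEntry` and `StudyEntrySketch` classes are simplified, hypothetical stand-ins for the proposed models, `files` is reduced to a list of file ids, and the helper methods `getSampleData` and `getFileId` are invented for this example.

```java
import java.util.Arrays;
import java.util.List;

class SampleEntryDemo {
    // Hypothetical, simplified stand-in for the proposed SampleEntry.
    static class SampleEntry {
        final String sampleId;    // optional
        final Integer fileIndex;  // position in StudyEntry.files; null when files are excluded
        final List<String> data;  // values in the same order as sampleDataKeys

        SampleEntry(String sampleId, Integer fileIndex, List<String> data) {
            this.sampleId = sampleId;
            this.fileIndex = fileIndex;
            this.data = data;
        }
    }

    // Hypothetical, simplified stand-in for StudyEntry; files holds only file ids here.
    static class StudyEntrySketch {
        final List<String> sampleDataKeys;
        final List<String> files;
        final List<SampleEntry> samples;

        StudyEntrySketch(List<String> sampleDataKeys, List<String> files, List<SampleEntry> samples) {
            this.sampleDataKeys = sampleDataKeys;
            this.files = files;
            this.samples = samples;
        }

        // Value of `key` for the sample at `samplePosition`, or null if the key is absent.
        String getSampleData(int samplePosition, String key) {
            int keyIndex = sampleDataKeys.indexOf(key);
            if (keyIndex < 0) {
                return null;
            }
            List<String> data = samples.get(samplePosition).data;
            return keyIndex < data.size() ? data.get(keyIndex) : null;
        }

        // Id of the file the sample points to, or null when files are excluded.
        String getFileId(int samplePosition) {
            Integer fileIndex = samples.get(samplePosition).fileIndex;
            return fileIndex == null ? null : files.get(fileIndex);
        }
    }

    // Two samples, each coming from a different file; GT is the first key.
    static StudyEntrySketch example() {
        return new StudyEntrySketch(
                Arrays.asList("GT", "DP"),
                Arrays.asList("file1.vcf", "file2.vcf"),
                Arrays.asList(
                        new SampleEntry("s1", 0, Arrays.asList("0/1", "30")),
                        new SampleEntry("s2", 1, Arrays.asList("1/1", "12"))));
    }

    public static void main(String[] args) {
        StudyEntrySketch study = example();
        System.out.println(study.getSampleData(1, "DP")); // prints "12"
        System.out.println(study.getFileId(1));           // prints "file2.vcf"
    }
}
```

Because every SampleEntry carries its own fileIndex, two entries for the same sampleId could point to different files, which is what the flat List<List<String>> layout could not express.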

2 - Rename StudyEntry.format to StudyEntry.sampleDataKeys

The field name format was taken from the VCF file specification and, unless you have experience with VCF files, it is hard to guess its content from the name. It is renamed to sampleDataKeys, and it specifies the keys of the samples.data array.

3 - Add Issues to StudyEntry

You can follow this at #177

4 - Rename FileEntry.attributes to FileEntry.data

This change makes the name consistent with the SampleEntry.data field.

5 - Replace map<VariantStats> with array<VariantStats>

To make the data model more homogeneous, replace the map of cohortId -> VariantStats with a list of VariantStats. This requires adding an id field to the VariantStats model.
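
A minimal sketch of what this change means for client code, assuming a trimmed-down, hypothetical `Stats` class (only the new id and one illustrative value, named `altAlleleFreq` here) and modelling the old map as cohortId -> value for brevity. The helpers `toList` and `find` are invented for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class VariantStatsListDemo {
    // Hypothetical, trimmed-down VariantStats: the new id plus one illustrative value.
    static class Stats {
        final String id;            // cohort id, formerly the map key
        final double altAlleleFreq; // illustrative field only

        Stats(String id, double altAlleleFreq) {
            this.id = id;
            this.altAlleleFreq = altAlleleFreq;
        }
    }

    // Converts the old shape (keyed by cohort id) into the new list shape,
    // moving each map key into the id field of its element.
    static List<Stats> toList(Map<String, Double> freqByCohort) {
        List<Stats> list = new ArrayList<>(freqByCohort.size());
        for (Map.Entry<String, Double> e : freqByCohort.entrySet()) {
            list.add(new Stats(e.getKey(), e.getValue()));
        }
        return list;
    }

    // With a list, lookup by cohort id becomes a linear scan; returns null if not found.
    static Stats find(List<Stats> list, String cohortId) {
        for (Stats s : list) {
            if (s.id.equals(cohortId)) {
                return s;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Map<String, Double> old = new LinkedHashMap<>();
        old.put("ALL", 0.5);
        old.put("CASES", 0.25);
        List<Stats> stats = toList(old);
        System.out.println(find(stats, "CASES").altAlleleFreq); // prints 0.25
    }
}
```

The trade-off is that lookups by cohort are no longer constant time, but each element is self-describing, like the other arrays in the model.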

6 - Remove unused hgvs field in Variant

An hgvs field was added to VariantAnnotation, so the one in Variant has not been used for a long time.

7 - Replace FileEntry.call string with record

Instead of having a single String with the variant and the alleleIdx separated by a colon, replace it with a small model with two fields:

FileEntry {
  fileId: "",
  data: {},
  call: {
    variantId: "chr:pos:ref:alt[,secAlts]",
    alleleIndex: N
  }
}

Unlike the previous string call field, the new call.variantId field starts with the chromosome. The correct way to parse it is with new Variant(call.getVariantId()).
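
To illustrate the layout of the new field only, here is a rough sketch that splits a variantId of the form chr:pos:ref:alt[,secAlts] into its parts. The `Call` and `Parts` classes and the `split` helper are hypothetical and written for this example; real code should rely on new Variant(call.getVariantId()) as stated above, since this naive split assumes the chromosome name contains no colon.

```java
import java.util.Arrays;
import java.util.List;

class CallSketchDemo {
    // Hypothetical stand-in for the proposed FileEntry.call record.
    static class Call {
        final String variantId; // "chr:pos:ref:alt[,secAlts]"
        final int alleleIndex;

        Call(String variantId, int alleleIndex) {
            this.variantId = variantId;
            this.alleleIndex = alleleIndex;
        }
    }

    // Components of a variantId, for illustration only.
    static class Parts {
        final String chromosome;
        final int position;
        final String reference;
        final List<String> alternates; // main alternate first, then secondary alternates

        Parts(String chromosome, int position, String reference, List<String> alternates) {
            this.chromosome = chromosome;
            this.position = position;
            this.reference = reference;
            this.alternates = alternates;
        }
    }

    // Naive split on ':' and ','; assumes the chromosome name has no colon.
    static Parts split(String variantId) {
        String[] fields = variantId.split(":", 4); // the limit keeps empty ref/alt fields
        if (fields.length != 4) {
            throw new IllegalArgumentException("Expected chr:pos:ref:alt, got: " + variantId);
        }
        List<String> alternates = Arrays.asList(fields[3].split(",", -1));
        return new Parts(fields[0], Integer.parseInt(fields[1]), fields[2], alternates);
    }

    public static void main(String[] args) {
        Call call = new Call("1:1000:A:C,T", 1); // illustrative values
        Parts parts = split(call.variantId);
        System.out.println(parts.chromosome + " " + parts.position + " "
                + parts.reference + " " + parts.alternates); // prints "1 1000 A [C, T]"
    }
}
```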

Tasks

  • 1 - StudyEntry.samplesData to StudyEntry.samples
  • 2 - StudyEntry.format to StudyEntry.sampleDataKeys
  • 3 - Add StudyEntry.issues
  • 4 - FileEntry.attributes to FileEntry.data
  • 5 - map<VariantStats> to array<VariantStats>
  • 6 - Remove hgvs
  • 7 - Replace string FileEntry.call with specific record