GFA format and ML path format #23

Open
bricoletc opened this issue Mar 5, 2021 · 1 comment

Comments

@bricoletc
Member

@leoisl @mbhall88 could you paste here an example of the ML-path format that pandora uses to describe an ML path with respect to the linearised PRG?

Then I can implement it in gramtools.

Alternatively, we could express this ML path on a GFA. This would mean pandora and gramtools both need to use the GFA produced by make_prg.
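To make the GFA option concrete, here is a minimal sketch of how an ML path could be written as a GFA 1 path (`P`) line. It assumes the path is expressed as an ordered list of segment names, all traversed in the forward orientation, with no overlaps between segments. The function name `gfa_path_line`, the path name `sample1.locus1` and the segment names `s0`, `s2`, `s3` are hypothetical; the names that make_prg actually writes to its GFA are not given in this thread.

```python
from typing import Sequence


def gfa_path_line(path_name: str, segment_names: Sequence[str]) -> str:
    """Build a GFA 1 'P' line for a path that visits segments in forward orientation.

    Assumption: every segment is traversed forward ('+') and overlaps are
    unspecified, which GFA 1 encodes as '*'.
    """
    if not segment_names:
        raise ValueError("a path needs at least one segment")
    # GFA 1 lists oriented segment names separated by commas, e.g. "s0+,s2+".
    oriented = ",".join(f"{name}+" for name in segment_names)
    return "\t".join(["P", path_name, oriented, "*"])


# Hypothetical names, for illustration only.
line = gfa_path_line("sample1.locus1", ["s0", "s2", "s3"])
print(line)
```

Whether a `P` line is the right record type for both tools is a design choice to agree on; the point is only that the path becomes a list of node names rather than text coordinates.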

@leoisl
Collaborator

leoisl commented Mar 8, 2021

This is the current output for a toy example in the make_prg branch:

2 samples
Sample toy_sample_1
1 loci with denovo variants
GC00010897
9 nodes 
(0 [0, 110) ATGCAGATACGTGAACAGGGCCGCAAAATTCAGTGCATCCGCACCGTGTACGACAAGGCCATTGGCCGGGGTCGGCAGACGGTCATTGCCACACTGGCCCGCTATACGAC)
(1 [113, 114) G)
(3 [121, 171) GAAATGCCCACGACCGGGCTGGATGAGCTGACAGAGGCCGAACGCGAGAC)
(4 [174, 175) G)
(6 [182, 301) CTGGCCGAATGGCTGGCCAAGCGCCGGGAAGCCTCGCAGAAGTCGCAGGAGGCCTACACGGCCATGTCTGCGGATCGGTGGCTGGTCACGCTGGCCAAGGCCATCAGGGAAGGGCAGGA)
(7 [304, 308) ACTG)
(9 [319, 360) CGCCCCGAACAGGCGGCCGCGATCTGGCACGGCATGGGGGA)
(10 [364, 365) G)
(12 [374, 491) GTCGGCAAGGCCTTGCGCAAGGCTGGTCACGCGAAGCCCAAGGCGGTCAGAAAGGGCAAGCCGGTCGATCCGGCTGATCCCAAGGATCAAGGGGAGGGGGCACCAAAGGGGAAATGA)
2 denovo variants for this locus
toy_sample_1.GC00010897	44	.	C	T	10.7923	.	DP=1;SGB=-0.379885;MQ0F=0;AC=1;AN=1;DP4=0,0,1,0;MQ=42	GT:PL:GP:GQ	1:40,0:-2.14748e+09,0:127
toy_sample_1.GC00010897	422	.	A	T	10.7923	.	DP=1;SGB=-0.379885;MQ0F=0;AC=1;AN=1;DP4=0,0,1,0;MQ=42	GT:PL:GP:GQ	1:40,0:-2.14748e+09,0:127
Sample toy_sample_2
1 loci with denovo variants
GC00006032
11 nodes 
(0 [0, 145) TTGAGTAAAACAATCCCCCGCGCTTATATAAGCGCGTTGATATTTTTAATTATTAACAAGCAACATCATGCTAATACAGACATACAAGGAGATCATCTCTCTTTGCCTGTTTTTTATTATTTCAGGAGTGTAAACACATTTTCCG)
(2 [152, 153) T)
(3 [156, 169) CTCCCTGGCTAAT)
(5 [176, 177) A)
(6 [180, 237) ACCACATTGGCATTTATGGAGCACATCACAATATTTCAATACCATTAAAGCACTGCA)
(8 [245, 246) T)
(9 [249, 267) CAAAATGAAACACTGCGA)
(11 [276, 277) T)
(12 [281, 290) ATTAAAATT)
(14 [299, 300) A)
(15 [304, 312) TTTCAATT)
1 denovo variants for this locus
toy_sample_2.GC00006032	49	.	A	G	10.7923	.	DP=1;SGB=-0.379885;MQ0F=0;AC=1;AN=1;DP4=0,0,1,0;MQ=42	GT:PL:GP:GQ	1:40,0:-2.14748e+09,0:127

Variants are now described as VCF records, but the ML path representation is still a proprietary internal format. Taking one node as an example:
(3 [156, 169) CTCCCTGGCTAAT)

3 is an internal ID that pandora assigns to this node; it should be ignored.
[156, 169) is the half-open interval that this node's sequence, CTCCCTGGCTAAT, spans in the textual representation of the PRG.
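The node format described above can be parsed with a short sketch like the following. It assumes that each node sits on its own line in exactly the `(id [start, end) SEQUENCE)` shape shown, and that the interval length equals the sequence length, as it does for the example node. The names `PathNode`, `NODE_RE` and `parse_node` are hypothetical and are not part of pandora.

```python
import re
from typing import NamedTuple


class PathNode(NamedTuple):
    internal_id: int  # pandora-internal ID; callers should ignore it
    start: int        # inclusive start in the textual PRG
    end: int          # exclusive end in the textual PRG
    seq: str          # sequence carried by the node


# Matches lines such as "(3 [156, 169) CTCCCTGGCTAAT)".
NODE_RE = re.compile(r"^\((\d+) \[(\d+), (\d+)\) ([ACGTN]+)\)$")


def parse_node(line: str) -> PathNode:
    """Parse one ML-path node line into a PathNode."""
    match = NODE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a node line: {line!r}")
    node = PathNode(int(match.group(1)), int(match.group(2)),
                    int(match.group(3)), match.group(4))
    # Assumption: a half-open interval [start, end) spans exactly len(seq) characters.
    if node.end - node.start != len(node.seq):
        raise ValueError(f"interval and sequence lengths differ: {line!r}")
    return node


node = parse_node("(3 [156, 169) CTCCCTGGCTAAT)")
print(node.start, node.end, node.seq)
```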

One issue is that we use this sequence interval in the textual representation to match PRG nodes to nodes in the recursion tree. I guess the proper solution would be for make_prg to assign an ID to each node in the PRG, so that any tool processing a PRG can refer to a node by its ID.
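As a rough illustration of the difference, the sketch below translates an ML path given as text intervals into a list of stable node IDs. It assumes make_prg could provide a mapping from each node's interval to an ID. That mapping, the IDs `n2`, `n3`, `n5` and the function `path_to_ids` are all hypothetical; nothing like them exists yet.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) in the textual PRG


def path_to_ids(path: List[Interval], interval_to_id: Dict[Interval, str]) -> List[str]:
    """Translate an interval-based ML path into a list of node IDs."""
    ids = []
    for interval in path:
        if interval not in interval_to_id:
            # Today this lookup is the only link between a path node and a PRG node.
            raise ValueError(f"no PRG node spans {interval}")
        ids.append(interval_to_id[interval])
    return ids


# Hypothetical IDs that make_prg might assign to three nodes of GC00006032.
interval_to_id = {(152, 153): "n2", (156, 169): "n3", (176, 177): "n5"}
ml_path = [(152, 153), (156, 169), (176, 177)]
node_ids = path_to_ids(ml_path, interval_to_id)
print(node_ids)
```

With IDs in place, the path would no longer depend on coordinates in the textual PRG, so the text representation could change without invalidating it.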


2 participants