GitHub - hall-lab/assembly_validation

Assembly evaluation pipeline

This pipeline evaluates assemblies by aligning them to a reference, calling variants, quantifying those variants, and, if applicable, comparing them to gold-standard variant lists. The end result is a jupyter R notebook template that can be run to generate plots and a summary statistic table.

Inputs

assembly_list - tab-delimited file with the fields <assembly name> <first fasta> <second fasta> <truth name> <assembly id> <datatype> If an assembly contains only one fasta, the second fasta can be set to an empty file. If no truthset is available, <truth name> should be NONE. Note that the <assembly name> should consist of the sample name followed by a period, followed by arbitrary other text. Sample name should not contain a period.
populations - tab-delimited file connecting sample to population. Fields are: <sample> <population code> <superpopulation code> <sex>
ref - reference fasta (not gzipped!)
ref_index - fai index of reference fasta
truth_list_snp_indel - tab-delimited file with the fields <truth name> <vcf location>
truth_list_confident_regions - tab-delimited file with the fields <truth name> <bed location>
truth_list_sv_vcf - tab-delimited file with the fields <truth name> <vcf location>
gap_file - bed file denoting reference gaps
str_track - bed file denoting simple repeats
seg_dup_track - bed file denoting segmental duplications
jupyter_notebook_template - path to VariantCalling.ipynb

Alignment and variant calling methodology

Contigs from both fasta files (possibly representing two different haplotypes) are combined into a single fasta, with the suffixes _H1 and _H2 appended to contig names from the first and second fasta file, respectively. If the second fasta file is empty, the first fasta file is simply taken as-is.
Contigs are aligned to the given reference using minimap2 with the asm5 preset.
paftools call is used to call variants from the cs tags in the minimap alignment.
A custom perl script is used to "genotype" the variant calls:
- If site is covered by 1 contig, call a homozygous variant (1/1)
- If site is covered by 2+ contigs and the variant appears 2+ times, call a homozygous variant (1/1)
- If site is covered by 2+ contigs and different variants appear, call a heterozygous variant (1/2)
- If site is covered by 2+ contigs and only one variant appears, call heterozygous variant (0/1)
Custom python scripts are used to call SVs based on split reads and classify them by type.
Indels >= 40bp from step 3 are combined with SVs from step 5 to form a combined set of SVs.

Outputs from workflow

new.ipynb is a jupyter notebook that can be opened and run to produce the final plots/table

Outputs from jupyter notebook

happy.sensitivity.png - Gives sensitivity to the snp/indel truthset as determined by hap.py
het_fates.png - For each variant in the truthset, show whether it was called heterozygous, homozygous alt, or not called (homozygous ref) in the variant callset from the assembly.
<assembly name>.no_gaps_coverage_summary.png - Gives the percentage of chr1-22 non-gap reference bases at each coverage 0X-9X, as well as 10X+
<assembly name>.no_gaps_coverage.png - Same thing, broken down by chr and including chrX and chrY
<assembly name>.ours.nonRep.lengths.txt.sv_lengths.png - Histogram of INS and DEL lengths. Between 0 and 1000 bp the bin size is 10bp. There are also bins for 1001-10000bp, 10001-100000bp, 100001-1000000bp, and 1Mb+.
nonrepSvCounts.png - Counts for all SVs that weren't classified as STR or SegDup, broken down by SV type
segDupSvCounts.png - Counts for all SVs that weren't classified as STR but did match the SegDup track with either breakpoint.
strSvCounts.png - Counts for all SVs that matched the STR track with either breakpoint.
sv.sensitivity.png - Percentage of the SV truthset found in the assembly using a variety of comparison techniques. SVs are stratified by whether either breakpoint matched the STR track, SegDup track, or neither (Non-repetitive)
sv.sensitivityCounts.png - Gives the counts of SVs from the truthset detected by each comparison method.
truthHetIndelFates.png - For each heterozygous indel in the truthset, show whether it was called heterozygous, homozygous alt, or not called (homozygous ref) using 1bp, 10bp, or 50bp slop.
truthHetIndelFatesErrorsOnly.png - Same as above but only shows the errors to give more detailed counts.
truthset.repetitive.count.png - Counts the number of unique elements from the STR and SegDup tracks covered by the SV truthset, and determines how many of those were also found in the SV callset.
truthset.repetitive.sensitivity.png - Same as above but displayed as sensitivity.
stats.tsv - Tab-delimited file with the following summary stats for each assembly:
- ID: from the input assembly_list
- Name: from the input assembly_list
- snv_count: Total SNVs found
- deletions: Total small (1-50 bp) deletions found
- insertions: Total small (1-50 bp) insertions found
- sv_count: Total SVs found
- snv_detection_sensitivity: SNV sensitivity ignoring genotype errors (TRUTH.TP + FP.gt)/TRUTH.TOTAL in hap.py metrics
- indel_detection_sensitivity: Indel sensitivity ignoring genotype errors (TRUTH.TP + FP.gt)/TRUTH.TOTAL in hap.py metrics
- snv_detection_specificity: SNV specificity ignoring genotype errors (QUERY.TP + FP.gt)/QUERY.TOTAL in hap.py metrics
- indel_detection_specificity: Indel specificity ignoring genotype errors (QUERY.TP + FP.gt)/QUERY.TOTAL in hap.py metrics
- sv_sensitivity: SV sensitivity using bedtools pairtopair with 50 bp slop at each breakpoint
- sv_specificity: SV specificity using bedtools pairtopair with 50 bp slot at each breakpoint
- het_snp_sensitivity: Sensitivity for heterozygous snps only
- het_indel_sensitivity: Sensitivity for heterozygous indels only
- hom_snp_sensitivity: Sensitivity for homozygous snps only
- hom_indel_sensitivity: Sensitivity for homozygous indels only
- DEL: Total deletions detected from split-read alignments
- INS: Total insertions detected from split-read alignments
- DUP: Total duplications detected from split-read alignments
- INV: Total inversions detected from split-read alignments
- TRANS: Total translocations detected from split-read alignments
- sv_nonrep_count: Total SVs that don't overlap STR or SEGDUP regions
- sv_str_count: Total SVs that overlap STR regions
- sv_segdup_count: Total SVs that overlap SEGDUP but not STR regions

Name		Name	Last commit message	Last commit date
Latest commit History 70 Commits
docker		docker
truth_sets		truth_sets
LICENSE		LICENSE
README.md		README.md
VariantCalling.ipynb		VariantCalling.ipynb
analyze_assembly.wdl		analyze_assembly.wdl
plot_assemblies.inputs.json		plot_assemblies.inputs.json
plot_assemblies.wdl		plot_assemblies.wdl

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Assembly evaluation pipeline

Inputs

Alignment and variant calling methodology

Outputs from workflow

Outputs from jupyter notebook

About

Releases

Packages

Contributors 2

Languages

License

hall-lab/assembly_validation

Folders and files

Latest commit

History

Repository files navigation

Assembly evaluation pipeline

Inputs

Alignment and variant calling methodology

Outputs from workflow

Outputs from jupyter notebook

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages