gatk4-cnn-variant-filter

Purpose :

This repo provides workflows that take advantage of GATK's CNN tool, a deep learning approach to filtering variants based on Convolutional Neural Networks.

Please read the following discussion to learn more about the CNN tool: Deep Learning in GATK4.

cram2filtered.wdl

This workflow takes an input CRAM/BAM, calls variants with HaplotypeCaller, then filters the calls with the CNNVariant neural net tool using the specified filtering model.

The site-level scores are added to the INFO field of the VCF. The architecture, info_key and tensor_type arguments MUST be in agreement: 2D models must have a tensor_type of read_tensor and an info_key of CNN_2D, while 1D models must have a tensor_type of reference and an info_key of CNN_1D. The INFO field key will therefore be CNN_1D or CNN_2D, depending on the neural net architecture used for inference.

The architecture arguments specify pre-trained networks. New networks can be trained with the GATK tools CNNVariantWriteTensors and CNNVariantTrain.

The input CRAM can be generated by the single-sample pipeline. If you would like to test the workflow on a more representative example file, use the following CRAM file as input and change the scatter count from 4 to 200: gs://gatk-best-practices/cnn-h38/NA12878_NA12878_IntraRun_1_SM-G947Y_v1.cram.
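The agreement rule between model dimension, tensor_type and info_key can be checked before launching a run. The following is a minimal sketch, not part of the workflow: the function name `check_cnn_arguments` and the `"1D"`/`"2D"` labels are hypothetical, and only the pairings themselves come from the description above.

```python
from __future__ import annotations

# Pairings stated in this README: model dimension -> (tensor_type, info_key).
EXPECTED = {
    "1D": ("reference", "CNN_1D"),
    "2D": ("read_tensor", "CNN_2D"),
}


def check_cnn_arguments(dimension: str, tensor_type: str, info_key: str) -> list[str]:
    """Return a list of problems; an empty list means the arguments agree.

    Hypothetical helper for illustration; it is not provided by GATK or by
    the workflows in this repo.
    """
    if dimension not in EXPECTED:
        return [f"unknown model dimension {dimension!r}; expected one of {sorted(EXPECTED)}"]

    want_tensor, want_key = EXPECTED[dimension]
    problems = []
    if tensor_type != want_tensor:
        problems.append(
            f"{dimension} models need tensor_type {want_tensor!r}, got {tensor_type!r}"
        )
    if info_key != want_key:
        problems.append(
            f"{dimension} models need info_key {want_key!r}, got {info_key!r}"
        )
    return problems


# A consistent 2D configuration yields no problems.
print(check_cnn_arguments("2D", "read_tensor", "CNN_2D"))
# A 1D model paired with the 2D settings reports both mismatches.
print(check_cnn_arguments("1D", "read_tensor", "CNN_2D"))
```

Returning a list of messages rather than raising on the first mismatch lets a caller report every inconsistent argument at once.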

Requirements/expectations :

  • CRAM/BAM
  • BAM Index (if input is BAM)

Output :

  • Filtered VCF and its index.
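Because the site-level score is written to the INFO field, it can be read back from the filtered VCF with plain text parsing. The sketch below assumes an uncompressed, tab-separated VCF data line with INFO in the eighth column, as in the VCF format; the function name `cnn_score` and the example record are made up for illustration and are not real workflow output.

```python
def cnn_score(vcf_line, info_key="CNN_1D"):
    """Return the CNN score from one VCF data line, or None if the key is absent.

    Hypothetical helper: info_key should match the key the model wrote
    (CNN_1D for 1D models, CNN_2D for 2D models).
    """
    fields = vcf_line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("a VCF data line needs at least 8 tab-separated columns")

    # INFO is the eighth column: semicolon-separated KEY=VALUE entries.
    for entry in fields[7].split(";"):
        key, sep, value = entry.partition("=")
        if key == info_key and sep:
            return float(value)
    return None


# Made-up record for illustration only.
example = "chr1\t12345\t.\tA\tG\t50.0\tPASS\tDP=30;CNN_1D=2.5"
print(cnn_score(example))
print(cnn_score(example, info_key="CNN_2D"))
```

For real files, a VCF library handles compression, headers and multi-valued fields more robustly than this line-level sketch.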

cram2model.wdl

This optional workflow is for advanced users who would like to train a CNN model for filtering variants.

Requirements/expectations :

  • CRAM
  • Truth VCF and its index
  • Truth Confidence Interval Bed

Output :

  • Model HD5
  • Model JSON
  • Model Plots PNG

run_happy.wdl

This optional evaluation and plotting workflow runs a filtering model against truth data (e.g. NIST Genome in a Bottle, Synthetic Diploid Truth Set) and plots the accuracy.

Requirements/expectations :

  • File of VCF Files
  • Truth VCF and its index
  • Truth Confidence Interval Bed
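The "File of VCF Files" input is not specified further here. As a sketch only, and assuming it is a plain text file with one VCF path per line (an assumption, not something this README confirms), such a list could be written as follows. The function name `write_vcf_list` and the paths are hypothetical.

```python
import tempfile
from pathlib import Path


def write_vcf_list(vcf_paths, list_path):
    """Write one VCF path per line to list_path and return the number written.

    Assumes a one-path-per-line layout; check the workflow inputs before
    relying on this format.
    """
    paths = [str(p) for p in vcf_paths]
    bad = [p for p in paths if not p.endswith((".vcf", ".vcf.gz"))]
    if bad:
        raise ValueError(f"not VCF paths: {bad}")
    Path(list_path).write_text("\n".join(paths) + "\n")
    return len(paths)


# Hypothetical paths, written to a temporary directory for demonstration.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "vcfs.txt"
    count = write_vcf_list(["sample_a.vcf.gz", "sample_b.vcf.gz"], target)
    written = target.read_text().splitlines()

print(count, written)
```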

Output :

  • Evaluation summary
  • Plots

Software version notes :

  • GATK 4.1
  • samtools 1.3.1
  • Cromwell version support
    • Successfully tested on v37
    • Does not work on versions < v23 due to output syntax
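The version notes above can be turned into a quick pre-flight check. This is a minimal sketch: the function name `cromwell_status` and the status labels are hypothetical, and it assumes the Cromwell version is given as a string such as `37` or `v37`. Versions from v23 upward other than v37 are reported as untested, since the notes only confirm v37.

```python
MIN_WORKING = 23  # versions below v23 do not work due to output syntax
TESTED = 37       # version the workflows were successfully tested on


def cromwell_status(version):
    """Classify a Cromwell version string as 'unsupported', 'tested' or 'untested'."""
    # Accept forms like "37", "v37" or "37.1"; only the major number matters here.
    major = int(version.strip().lstrip("vV").split(".")[0])
    if major < MIN_WORKING:
        return "unsupported"
    if major == TESTED:
        return "tested"
    return "untested"


for v in ("v22", "v23", "37"):
    print(v, cromwell_status(v))
```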

Important Note :

  • The provided JSON is meant to be a ready-to-use example template for the workflow. It is the user's responsibility to correctly set the reference and resource input variables using the GATK Tool and Tutorial Documentation.
  • Relevant reference and resource bundles can be accessed in Resource Bundle.
  • Runtime parameters are optimized for Broad's Google Cloud Platform implementation.
  • For help running workflows on the Google Cloud Platform or locally, please view the following tutorial: (How to) Execute Workflows from the gatk-workflows Git Organization.
  • The following material is provided by the GATK Team. Please post any questions or concerns to one of our forum sites: GATK, FireCloud or Terra, WDL/Cromwell.
  • Please visit the User Guide site for further documentation on our workflows and tools.

LICENSING :

This script is released under the WDL source code license (BSD-3) (see LICENSE in https://github.com/broadinstitute/wdl). Note however that the programs it calls may be subject to different licenses. Users are responsible for checking that they are authorized to run all programs before running this script.