WGS-MGI-Natter 🐍

Snakemake workflow to process WGS sequencing data from MGI Instrument

Requirements

To run this analysis, a couple of dependencies need to be met:

python ≥ 3.8
snakemake ≥ 5.22.1
Singularity ≥ 3.7

In addition, the template.cfg needs to be modified to your individual data. There are comments in the file for guidance. As an example, one may check the test.cfg. The used reference file should be indexed both with samtools and bwa-mem2. For detailed instructions on how to prep the reference, please refer to the section Prep reference. There are also suitable files available at https://console.cloud.google.com/storage/browser/genomics-public-data/resources/broad/hg38/v0.

Usage

The workflow may be started doing something like this:

snakemake \
  -s main.smk \
  --configfile test.cfg \
  --use-singularity \
  --singularity-args " --cleanenv --bind /Path/to/data/and/references"

Prep reference

Starting of with the latest version of the human genome as a fasta file, we use samtools to index the file like so:

samtools faidx genome.fasta

In addition to the fasta file, the {genome}_assembly_report.txt and the par_align.gff are needed. The former to build the right subset of regions, excluding alternate loci, and rename them, and the latter to mask the pseudo-autosomal region (PAR) on chromosome Y. Starting with generating a list for the subset of sequences:

sort -k1,1V {genome}_assembly_report.txt | awk -v FS="\t" '$8 == "Primary Assembly" || $8 == "non-nuclear" {print $7}' > subset_ids.txt

With this list we can do the actual subset:

samtools faidx genome.fasta -r subset_ids.txt -o genome_subset.fasta

The next step is to mask the PAR on chromosome Y. We extract the positions from the par_align.gff:

sed -E 's/.*Target=([^;]+).*/\1/g' par_align.gff | awk -v OFS="\t" '$0 !~ "^#" {print $1, $2-1, $3}'  > parY.bed

and then use bedtools to do the actual masking:

bedtools maskfasta -fi genome_subset.fasta -bed parY.bed -fo genome_subset_masked.fasta

Finally, the sequence headers need to be renamed to replace the GenBank Accession Numbers:

awk -v FS="\t" 'NR==FNR {header[">"$5] = ">"$1" "$5" "$7" "$10; next} $0 ~ "^>" {$0 = header[$0]}1' {genome}_assembly_report.txt genome_subset_masked.fasta > genome_alignment.fasta

Name		Name	Last commit message	Last commit date
Latest commit History 47 Commits
.github/workflows		.github/workflows
src		src
test		test
.gitignore		.gitignore
README.md		README.md
main.smk		main.smk
template.cfg		template.cfg
test.cfg		test.cfg

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

WGS-MGI-Natter 🐍

Requirements

Usage

Prep reference

About

Releases

Packages

Languages

marrip/wgs_mgi_natter

Folders and files

Latest commit

History

Repository files navigation

WGS-MGI-Natter 🐍

Requirements

Usage

Prep reference

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages