Pathogen whole genome sequence (WGS) data analysis pipeline

xiaoli-dong/pathogenseq

pathogenseq

pathogenseq is a pathogen whole genome sequence (WGS) data analysis pipeline that includes sequence quality checking, quality control, taxonomy assignment, assembly, assembly quality assessment, assembled contig annotation, MLST, antimicrobial resistance, virulome, plasmid, and taxonomy prediction.

The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker/Singularity containers making installation trivial and results highly reproducible.

Pipeline summary

By default, the pipeline supports both short and long reads:

  • Sequence quality check and quality control
    • Short reads
      • Illumina short read quality checks (FastQC)
      • Short read quality control (BBDuk | fastp)
      • Short read statistics (seqkit stats)
      • Taxonomic assignment and contamination check (Kraken2)
    • Long reads
      • Nanopore long read quality checks (NanoPlot)
      • Nanopore long read adapter trimming (Porechop)
      • Nanopore long read quality and length filter (chopper)
      • Nanopore long read statistics (seqkit stats)
  • Assembly
    • Short read assembly with a user-selected assembler (SPAdes | SKESA | Unicycler | MEGAHIT | Shovill)
    • Long read assembly follows the steps below:
      • Nanopore long read de novo assembly (Flye)
      • Circular Flye contigs are rotated to start in the center of the contig (in-house Perl script)
      • Long read polishing and consensus generation (Medaka)
      • Short read polishing when short reads are available (Polypolish | POLCA)
  • Assembly quality check
    • Rapid assessment of genome assembly completeness and contamination using a machine learning approach (CheckM2)
    • Rapid taxonomic identification of microbial pathogens from assemblies and assessment of sample relatedness (GAMBIT)
    • Assembly depth prediction and reporting (Minimap2, samtools, and in-house scripts)
  • Genome annotation
    • Gene prediction and annotation (Bakta)
    • Identify acquired antimicrobial resistance genes in the assembled contigs (AMRFinderPlus)
    • Scan contig files against traditional PubMLST typing schemes (mlst)
    • Typing and reconstruction of plasmid sequences from assembled contigs (MOB-suite)
    • Virulome detection (abricate with VFDB)
  • Tools for specific organisms
    • Mycobacterium tuberculosis lineage and drug resistance prediction from quality-controlled Illumina and Nanopore reads (TB-Profiler)
    • Streptococcus pneumoniae capsular typing from quality-controlled Illumina reads (PneumoCaT)
    • Streptococcus pyogenes emm-typing from assembled contigs (emmtyper)
    • Streptococcus agalactiae (Group B Streptococcus) serotyping from assembled contigs (GBS-SBG)
  • Summarize the results and generate the analysis report and software version reports

Pipeline reference databases

Quick Start

The workflow uses Nextflow to manage compute and software resources, so Nextflow must be installed before running the workflow. The required software can currently be isolated with containers (Docker or Singularity) or with Conda. Container-based runs work out of the box provided Docker or Singularity is installed. You do not need to clone or download the git repository to run the workflow.

Note: If you are new to Nextflow and nf-core, please refer to this page on how to set up Nextflow. Make sure to test your setup with -profile test before running the workflow on actual data.

Install any of Docker, Singularity, Podman, Shifter or Charliecloud for full pipeline reproducibility (please only use Conda as a last resort; see docs)

Check workflow options

You can clone or download pathogenseq from GitHub to your local computer, or you can run the pipeline directly from GitHub. To check the pipeline command line options:

# run directly from GitHub without downloading or cloning
# replace 8657a20 with the revision you want to run
nextflow run xiaoli-dong/pathogenseq -r 8657a20 --help

# download the pipeline and run it from the local computer

nextflow run main.nf --help
N E X T F L O W  ~  version 23.04.1
Launching `main.nf` [spontaneous_kowalevski] DSL2 - revision: 4ef093f544


------------------------------------------------------
  xiaoli-dong/pathogenseq v1.1.0
------------------------------------------------------
Typical pipeline command:

  nextflow run xiaoli-dong/pathogenseq --input samplesheet.csv --outdir results --platform illumina -profile singularity

Input/output options
  --input                            [string]  Path to comma-separated samplesheet file containing information about the samples in the experiment.
  --outdir                           [string]  The output directory where the results will be saved. You have to use absolute paths to storage on Cloud 
                                               infrastructure. 
  --platform                         [string]  Specifies the platform used to generate the sequences - available options are 'illumina|nanopore'. [default: 
                                               illumina] 
  --igenomes_ignore                  [boolean] Whether ignore igenome configuration loading.
  --email                            [string]  Email address for completion summary.
  --multiqc_title                    [string]  MultiQC report title. Printed as page header, used for filename if not otherwise specified.

illumina_options
  --illumina_reads_qc_tool           [string]  Specifies the reads triming and qc tool to use - available options are 'fastp|bbduk'. [default: fastp]
  --illumina_reads_assembler         [string]  Specifies the illumina reads assembly tool to use - available options are 
                                               'megahit|spades|skesa|unicycler|shovill'. [default: unicycler] 
  --min_tbp_for_assembly_illumina    [integer] Required total basepairs of the reads to get into assembly stage. [default: 1000000]
  --skip_illumina_reads_qc           [boolean] Skip illumina read quality control step. [default: false]
  --skip_illumina_dehost             [boolean] Skip illumina read quality control step. [default: false]
  --skip_illumina_reads_assembly     [boolean] Skip illumina read assembly step. [default: false]
  --hostile_human_ref_bowtie2        [string]  hostile human genome index file [default: 
                                               /nfs/APL_Genomics/db/prod/hostile/bowtie2_indexes/human-t2t-hla-argos985] 

nanopore_options
  --hostile_human_ref_minimap2       [string]  hostile human reference genome for minimap2 [default: 
                                               /nfs/APL_Genomics/db/prod/hostile/minimap2_ref/human-t2t-hla-argos985.fa.gz] 
  --skip_nanopore_dehost             [boolean] Skip illumina read quality control step. [default: false]
  --nanopore_reads_assembler         [string]  Specifies the nanopore reads assembly tool to use - available options are 'flye+medaka'. [default: 
                                               flye+medaka] 
  --min_tbp_for_assembly_nanopore    [integer] Required total basepairs of the reads to get into assembly stage. [default: 1000000]
  --skip_polypolish                  [boolean] Skip contig polish steps with illumina reads using polypolish tool. [default: false]
  --skip_polca                       [boolean] Skip contig polish steps with illumina reads using polca tool. [default: false]
  --skip_nanopore_reads_qc           [boolean] Skip nanopore read quality control step. [default: false]
  --skip_nanopore_reads_assembly     [boolean] Skip nanopore read assembly step. [default: false]
  --skip_illumina_reads_polish       [boolean] Skip contig polishing steps with illumina reads. [default: false]

annotation_options
  --bakta_db                         [string]  Path to bakta database. [default: /nfs/APL_Genomics/db/prod/bakta/db]
  --checkm2_db                       [string]  Path to checkm2 database. [default: /nfs/APL_Genomics/db/prod/CheckM2_database/uniref100.KO.1.dmnd]
  --amrfinderplus_db                 [string]  null
  --skip_checkm2                     [boolean] Skip checkm2 step. [default: false]
  --skip_bakta                       [boolean] Skip bakta step. [default: false]
  --skip_mlst                        [boolean] Skip mlst step. [default: false]
  --skip_mobsuite                    [boolean] Skip skip_mobsuite step. [default: false]
  --skip_virulome                    [boolean] Skip virulome step. [default: false]
  --skip_multiqc                     [boolean] null [default: false]
  --skip_amr                         [boolean] Skip amr step. [default: false]
  --skip_depth_and_coverage          [boolean] Skip assembly depth calculation step. [default: false]

taxonomic_tool_options
  --kraken2_db                       [string]  Specify path to kraken2 database [default: /nfs/APL_Genomics/db/prod/kraken2/k2_standard_08gb_20220926]
  --gambit_db                        [string]  Path to gambit database. [default: /nfs/APL_Genomics/db/prod/gambit]
  --skip_illumina_kraken2            [boolean] Skip kraken2 with illumina reads step. [default: false]
  --skip_nanopore_kraken2            [boolean] Skip kraken2 with nanopore reads step. [default: true]
  --skip_gambit                      [boolean] Skip gambit step. [default: false]

special_tool_options
  --gbssbg_db                        [string]  Path to GBS-SBG database. [default: /nfs/APL_Genomics/db/prod/gbs-sbg/GBS-SBG.fasta]
  --skip_tbprofiler                  [boolean] Skip Mycobacterium tuberculosis lineage and drug resistance analysis. [default: true]
  --skip_emmtyper                    [boolean] Skip emm-typing of Streptococcus pyogenes. [default: true]
  --skip_pneumocat                   [boolean] Skip capsular typing of Streptococcus pneumoniae using pneumocat. [default: true]
  --skip_gbssbg                      [boolean] Skip serotyping of Streptococcus agalactiae using GBS-SBG. [default: true]

!! Hiding 25 params, use --show_hidden_params to show them !!
------------------------------------------------------
If you use xiaoli-dong/pathogenseq for your analysis please cite:

* The nf-core framework
  https://doi.org/10.1038/s41587-020-0439-x

* Software dependencies
  https://github.com/xiaoli-dong/pathogenseq/blob/master/CITATIONS.md
------------------------------------------------------

Prepare required samplesheet input

The pathogenseq pipeline requires a CSV samplesheet as input, which contains the sequence file information for each sample. An example samplesheet is shown below:

samplesheet.csv

sample,fastq_1,fastq_2,long_fastq,basecaller_mode
sample1,shortreads_1.fastq.gz,shortreads_2.fastq.gz,longreads.fastq.gz,r1041_e82_400bps_hac_v4.2.0
sample2,shortreads.fastq,NA,longreads.fastq.gz,r1041_e82_400bps_sup_v4.2.0
sample3,NA,NA,longreads.fastq.gz,NA
sample4,shortreads_1.fastq.gz,shortreads_2.fastq.gz,NA,NA

The csv format samplesheet has five required columns:

  • The first row of the csv file is the header describing the columns
  • Each row represents a unique sample to be processed; the first column is the unique sample id
  • When the information for a particular column is missing, please fill the column with "NA"
  • The "fastq_1" and "fastq_2" columns are reserved for supplying the short read sequence files
  • The "long_fastq" column is for supplying the Nanopore long read sequence file
  • "basecaller_mode" is for the user to provide the Nanopore basecalling model, for example: r1041_e82_400bps_hac_v4.2.0. The available models for medaka_consensus from medaka (v1.8.0) are listed below:

medaka model, (default: r1041_e82_400bps_sup_v4.2.0).

Choices: r103_fast_g507 r103_hac_g507 r103_min_high_g345 r103_min_high_g360 r103_prom_high_g360 r103_sup_g507 r1041_e82_260bps_fast_g632 r1041_e82_260bps_hac_g632 r1041_e82_260bps_hac_v4.0.0 r1041_e82_260bps_hac_v4.1.0 r1041_e82_260bps_sup_g632 r1041_e82_260bps_sup_v4.0.0 r1041_e82_260bps_sup_v4.1.0 r1041_e82_400bps_fast_g615 r1041_e82_400bps_fast_g632 r1041_e82_400bps_hac_g615 r1041_e82_400bps_hac_g632 r1041_e82_400bps_hac_v4.0.0 r1041_e82_400bps_hac_v4.1.0 r1041_e82_400bps_hac_v4.2.0 r1041_e82_400bps_sup_g615 r1041_e82_400bps_sup_v4.0.0 r1041_e82_400bps_sup_v4.1.0 r1041_e82_400bps_sup_v4.2.0 r104_e81_fast_g5015 r104_e81_hac_g5015 r104_e81_sup_g5015 r104_e81_sup_g610 r10_min_high_g303 r10_min_high_g340 r941_e81_fast_g514 r941_e81_hac_g514 r941_e81_sup_g514 r941_min_fast_g303 r941_min_fast_g507 r941_min_hac_g507 r941_min_high_g303 r941_min_high_g330 r941_min_high_g340_rle r941_min_high_g344 r941_min_high_g351 r941_min_high_g360 r941_min_sup_g507 r941_prom_fast_g303 r941_prom_fast_g507 r941_prom_hac_g507 r941_prom_high_g303 r941_prom_high_g330 r941_prom_high_g344 r941_prom_high_g360 r941_prom_high_g4011 r941_prom_sup_g507 r941_sup_plant_g610
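
The samplesheet rules above can be checked before launching a run. The following is a minimal sketch, not part of pathogenseq: the function name check_samplesheet is hypothetical, and the rules about empty cells and samples without any read file are assumptions drawn from the description above rather than from the pipeline's own validation.

```python
import csv
import io

# Column names taken from the example samplesheet header.
EXPECTED_HEADER = ["sample", "fastq_1", "fastq_2", "long_fastq", "basecaller_mode"]


def check_samplesheet(text):
    """Return a list of problems found in samplesheet text (empty list if none)."""
    rows = list(csv.reader(io.StringIO(text.strip())))
    if not rows:
        return ["samplesheet is empty"]

    problems = []
    if [name.strip() for name in rows[0]] != EXPECTED_HEADER:
        problems.append("line 1: header should be " + ",".join(EXPECTED_HEADER))

    seen = set()
    for line_no, row in enumerate(rows[1:], start=2):
        if len(row) != len(EXPECTED_HEADER):
            problems.append(
                f"line {line_no}: expected {len(EXPECTED_HEADER)} fields, found {len(row)}"
            )
            continue
        sample, fastq_1, fastq_2, long_fastq, mode = (field.strip() for field in row)
        if sample in ("", "NA"):
            problems.append(f"line {line_no}: missing sample id")
        elif sample in seen:
            problems.append(f"line {line_no}: duplicate sample id '{sample}'")
        seen.add(sample)
        # Assumption: empty cells are not allowed because missing values are written as NA.
        if "" in (fastq_1, fastq_2, long_fastq, mode):
            problems.append(f"line {line_no}: empty field, use NA for missing values")
        # Assumption: a sample with neither short nor long reads cannot be processed.
        if fastq_1 == "NA" and long_fastq == "NA":
            problems.append(f"line {line_no}: no read files given")
    return problems


if __name__ == "__main__":
    demo = (
        "sample,fastq_1,fastq_2,long_fastq,basecaller_mode\n"
        "sampleA,a_1.fastq.gz,a_2.fastq.gz,NA,NA\n"
        "sampleB,NA,NA,b_long.fastq.gz\n"
    )
    for problem in check_samplesheet(demo):
        print(problem)
```

The check only looks at the text of the samplesheet; it does not test whether the listed files exist or whether the basecaller model is one of the medaka choices listed above.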

Run the pipeline:

  • If you are using singularity then the pipeline will auto-detect this and attempt to download the Singularity images directly as opposed to performing a conversion from Docker images. If you are persistently observing issues downloading Singularity images directly due to timeout or network issues then please use the --singularity_pull_docker_container parameter to pull and convert the Docker image instead. Alternatively, it is highly recommended to use the nf-core download command to pre-download all of the required containers before running the pipeline and to set the NXF_SINGULARITY_CACHEDIR or singularity.cacheDir Nextflow options to be able to store and re-use the images from a central location for future pipeline runs.
  • If you are using conda, it is highly recommended to use the NXF_CONDA_CACHEDIR or conda.cacheDir settings to store the environments in a central location for future pipeline runs.
# Running the pipeline directly from GitHub
# -r takes the revision to run, for example the seven-character commit id 8657a20
nextflow run xiaoli-dong/pathogenseq \
  -r 8657a20 \
  --input samplesheet.csv \
  -profile <docker|singularity|podman|shifter|charliecloud|conda/institute> \
  --outdir results \
  --platform <illumina|nanopore>

# run the pipeline with a local clone
nextflow run your_path_to/pathogenseq/main.nf \
  --input samplesheet.csv \
  -profile <docker|singularity|podman|shifter|charliecloud|conda/institute> \
  --outdir results \
  --platform <illumina|nanopore>

# an example command to launch the pipeline from a local clone and run it with the singularity configuration profile
nextflow run your_path_to/pathogenseq/main.nf \
  -profile singularity \
  --input samplesheet.csv \
  --outdir results_singularity \
  --platform <illumina|nanopore>

# an example command to launch the pipeline from a local clone and run it with the conda configuration profile
nextflow run your_path_to/pathogenseq/main.nf \
  -profile conda \
  --input samplesheet.csv \
  --outdir results_conda \
  --platform <illumina|nanopore>
  • Notes: Please provide pipeline parameters via the CLI or Nextflow -params-file option. Custom config files including those provided by the -c Nextflow option can be used to provide any configuration except for parameters; see docs.
# an example command to launch the pipeline from a local clone and run it with the singularity profile and a local configuration file
nextflow run your_path_to/pathogenseq/main.nf \
  -profile singularity \
  -c my.config \
  --outdir results_singularity \
  --platform <illumina|nanopore> \
  --illumina_reads_assembler shovill

## content of the my.config file
params {
    config_profile_name        = 'myconfig'
    config_profile_description = 'this is a demo'

    // resource configuration
    max_cpus   = 8
    max_memory = '36.GB'
    max_time   = '12.h'
    // Input data
    input = './samplesheet.csv'
}
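
The note above recommends passing pipeline parameters on the command line or through Nextflow's -params-file option, which accepts a JSON or YAML file. As a minimal sketch of that route, the helper below collects parameters and writes them as JSON. The function names build_params and write_params_file are hypothetical; the parameter names (input, outdir, platform, illumina_reads_assembler) are the ones shown in the --help output.

```python
import json

PLATFORMS = ("illumina", "nanopore")  # values listed for --platform in the help output


def build_params(input_csv, outdir, platform, **extra):
    """Collect pipeline parameters into a dict suitable for -params-file."""
    if platform not in PLATFORMS:
        raise ValueError(f"platform must be one of {PLATFORMS}, got {platform!r}")
    params = {"input": input_csv, "outdir": outdir, "platform": platform}
    params.update(extra)  # e.g. illumina_reads_assembler="shovill"
    return params


def write_params_file(params, path):
    """Write the parameters as JSON, one of the formats -params-file accepts."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(params, handle, indent=2)
        handle.write("\n")


if __name__ == "__main__":
    params = build_params(
        "./samplesheet.csv",
        "results_singularity",
        "illumina",
        illumina_reads_assembler="shovill",
    )
    # Print the JSON; call write_params_file(params, "params.json") to save it.
    print(json.dumps(params, indent=2))
```

A file written this way could then be passed as: nextflow run your_path_to/pathogenseq/main.nf -profile singularity -params-file params.json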

Pipeline output

Fig 1. Pathogenseq output top level layout

Fig 2. Pathogenseq pipeline_info directory layout

Fig 3. Pathogenseq report directory layout

From the above screenshots, you can see:
  • Results of each sample included in the analysis go to its own directory.
  • The pipeline_info directory contains software version information, the Nextflow workflow report, the resource usage report, and the task report.
  • The report directory contains the summary files of each analysis task.
Example pathogenseq data analysis outputs for a particular sample

Credits

pathogenseq was originally written by Xiaoli Dong. Extensive support was provided by other co-authors on the scientific or technical input required for the pipeline:

  • Dr. Matthew Croxen
  • Dr. Tarah Lynch

Notes

With conda, fastqc 0.11.9--0 does not work; the version needs to be changed to 0.12.1.

With conda, mobsuite 3.0.3 does not work; the version needs to be changed to 3.1.4.

Reference