PikaVirus

A workflow for viral mapping-based discovery in metagenomic samples.

Introduction

PikaVirus is a bioinformatics best-practice analysis pipeline for metagenomic analysis. It follows a new approach based on eliminatory k-mer analysis, followed by assembly and subsequent contig binning.

The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It comes with Docker containers, making installation trivial and results highly reproducible.

Quick Start

  1. Install Nextflow

  2. Install any of Docker, Singularity or Podman for full pipeline reproducibility (please only use Conda as a last resort; see docs)

  3. Download the pipeline and test it on a minimal dataset with a single command:

    nextflow run nf-core/pikavirus -profile test,<docker/singularity/podman/conda/institute>

    Please check nf-core/configs to see if a custom config file to run nf-core pipelines already exists for your Institute. If so, you can simply use -profile <institute> in your command. This will enable either docker or singularity and set the appropriate execution settings for your local compute environment.

  4. Start running your own analysis!

    nextflow run nf-core/pikavirus -profile <docker/singularity/podman/conda> --input '*_R{1,2}.fastq.gz'

See usage docs for all of the available options when running the pipeline.
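
As an illustration of what the `--input '*_R{1,2}.fastq.gz'` pattern in step 4 selects, the following minimal Python sketch groups file names into paired-end samples. It is not part of the pipeline: the function name `pair_reads` and the example file names are hypothetical, and it assumes each sample has exactly one `_R1` and one `_R2` file.

```python
import re
from collections import defaultdict

# Matches names such as "sampleA_R1.fastq.gz" and captures the sample
# name and the read number (1 or 2).
READ_PATTERN = re.compile(r"^(?P<sample>.+)_R(?P<read>[12])\.fastq\.gz$")


def pair_reads(filenames):
    """Group FASTQ file names into {sample: (R1, R2)} pairs.

    Files that do not match the pattern are ignored; a sample that is
    missing one of its mates raises a ValueError.
    """
    found = defaultdict(dict)
    for name in filenames:
        match = READ_PATTERN.match(name)
        if match:
            found[match["sample"]][match["read"]] = name

    pairs = {}
    for sample, reads in sorted(found.items()):
        if set(reads) != {"1", "2"}:
            raise ValueError(f"sample {sample!r} is missing a mate file")
        pairs[sample] = (reads["1"], reads["2"])
    return pairs


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    files = ["a_R1.fastq.gz", "a_R2.fastq.gz", "b_R2.fastq.gz", "b_R1.fastq.gz"]
    print(pair_reads(files))
```

Checking the pairing in this way before launching a run can catch misnamed files early, since a sample with a missing mate would otherwise only surface once the pipeline has started.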

Pipeline Summary

By default, the pipeline currently performs the following:

  • Sequencing quality control (FastQC)
  • Trimming of low-quality regions in the reads (fastp)
  • Trimmed sequences quality control (FastQC)
  • Identification and isolation of viral, bacterial, fungal and unknown reads (Kraken2)
  • Assembly of unknown reads (MetaQuast) and mapping against databases (Kaiju) to identify possible new pathogens
  • Selection of suitable viral, bacterial and fungal references from the provided directory (MASH)
  • Alignment of viral, bacterial and fungal reads against reference genomes to ensure the presence of certain organisms (Bowtie2)
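
To give an intuition for the reference-selection step in the list above, the sketch below estimates a Mash-style distance between two sequences from MinHash sketches of their k-mers. It is a toy illustration in pure Python, not the MASH implementation and not the pipeline's code: the function names, the small k-mer and sketch sizes, and the use of SHA-1 as the hash function are all assumptions made for readability (the real tool uses much larger defaults and canonical k-mers).

```python
import hashlib
import math


def kmer_hashes(sequence, k):
    """Return the set of integer hashes of all k-mers in the sequence."""
    hashes = set()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k].upper()
        digest = hashlib.sha1(kmer.encode("ascii")).digest()
        hashes.add(int.from_bytes(digest[:8], "big"))
    return hashes


def minhash_sketch(sequence, k=5, size=50):
    """Bottom-s sketch: the `size` smallest k-mer hashes, sorted."""
    return sorted(kmer_hashes(sequence, k))[:size]


def mash_distance(sketch_a, sketch_b, k=5, size=50):
    """Estimate the Mash distance from two bottom-s sketches."""
    # Jaccard estimate: fraction of the `size` smallest hashes of the
    # union that are present in both sketches.
    merged = sorted(set(sketch_a) | set(sketch_b))[:size]
    if not merged:
        raise ValueError("both sketches are empty")
    shared = len(set(merged) & set(sketch_a) & set(sketch_b))
    jaccard = shared / len(merged)
    if jaccard == 0:
        return 1.0
    # D = -(1/k) * ln(2j / (1 + j)), the distance defined in the Mash paper.
    return max(0.0, -math.log(2 * jaccard / (1 + jaccard)) / k)
```

Under these assumptions, a reference whose sketch lies at a small distance from the sketch of the sample's reads would be a plausible candidate for the later alignment step, while a distance of 1.0 means no k-mers were shared.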

Documentation

The PikaVirus pipeline comes with documentation about the pipeline: usage and output.

The PikaVirus Database

The PikaVirus database is composed of a datasheet obtained from the NCBI. It is versatile, though huge. Instructions on how to download it are given in the Tutorial section.

Tutorial

Downloading the database

PikaVirus requires a very large database, which is why the Download_assembly.py script exists. This script downloads all the viral, fungal or bacterial assemblies available in RefSeq, GenBank or both.
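
Purely as a sketch of the general idea, and not necessarily how Download_assembly.py works, the following Python snippet shows how download URLs could be derived from an NCBI `assembly_summary.txt` table. It assumes the table is tab-separated, that its header line names `assembly_accession` and `ftp_path` columns, and that each genome is stored as `<ftp_path>/<directory name>_genomic.fna.gz`. The function name `genome_urls` is hypothetical.

```python
import csv
import io


def genome_urls(summary_text):
    """Map assembly accession -> URL of its genomic FASTA file."""
    # Keep the header line (it names the columns), drop other comment lines.
    lines = []
    for line in summary_text.splitlines():
        if line.startswith("#") and "assembly_accession" not in line:
            continue
        lines.append(line.lstrip("# "))

    reader = csv.DictReader(
        io.StringIO("\n".join(lines)),
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # organism names may contain quote characters
    )

    urls = {}
    for row in reader:
        ftp_path = (row["ftp_path"] or "").rstrip("/")
        if not ftp_path or ftp_path == "na":
            continue  # no files published for this assembly
        directory = ftp_path.rsplit("/", 1)[-1]
        urls[row["assembly_accession"]] = f"{ftp_path}/{directory}_genomic.fna.gz"
    return urls
```

The resulting dictionary could then be passed to any downloader; keeping the URL construction separate from the download itself makes it easy to inspect or filter the list first.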

Untrusted assemblies

While testing, we came across some assemblies that did not meet the requirements for PikaVirus to work optimally. Here is a list of all of them so far, along with the reasons we have flagged them for exclusion.

Assembly Reason for exclusion
GCA_006449155.1 Assembly too short (20 bp)
GCA_006449195.1 Assembly too short (27 bp)
GCA_006449235.1 Assembly too short (32 bp)
GCA_006449275.1 Assembly too short (44 bp)
GCA_006449355.1 Assembly too short (25 bp)
GCA_006449395.1 Assembly too short (18 bp)
GCA_006449175.1 Assembly too short (17 bp)
GCA_006449215.1 Assembly too short (20 bp)
GCA_006449255.1 Assembly too short (32 bp)
GCA_006449335.1 Assembly too short (12 bp)
GCA_006449375.1 Assembly too short (17 bp)
GCA_006449415.1 Assembly too short (19 bp)
GCA_001857805.1 No identification potential
GCA_001857825.1 No identification potential
GCA_001857745.1 No identification potential
GCA_013086015.1 No identification potential
GCA_013088685.1 No identification potential
GCA_013088695.1 No identification potential
GCA_013088705.1 No identification potential
GCA_013088715.1 No identification potential
GCA_013088725.1 No identification potential
GCA_013088735.1 No identification potential
GCA_013088745.1 No identification potential
GCA_013088755.1 No identification potential
GCA_013088765.1 No identification potential
GCA_013088775.1 No identification potential
GCA_013088785.1 No identification potential
GCA_013088795.1 No identification potential
GCA_013088805.1 No identification potential
GCA_013088815.1 No identification potential
GCA_013088825.1 No identification potential
GCA_013088835.1 No identification potential
GCA_013088845.1 No identification potential
GCA_013088855.1 No identification potential
GCA_013088865.1 No identification potential
GCA_013088875.1 No identification potential
GCA_013088885.1 No identification potential
GCA_013088895.1 No identification potential
GCA_013088905.1 No identification potential
GCA_013088915.1 No identification potential
GCA_013088925.1 No identification potential
GCA_013088935.1 No identification potential
GCA_013088945.1 No identification potential
GCA_013088955.1 No identification potential
GCA_013088965.1 No identification potential
GCA_013088975.1 No identification potential
GCA_013088985.1 No identification potential
GCA_013088995.1 No identification potential
GCA_013089005.1 No identification potential
GCA_013089015.1 No identification potential
GCA_013089025.1 No identification potential
GCA_013089035.1 No identification potential
GCA_013089045.1 No identification potential
GCA_013089055.1 No identification potential
GCA_013089065.1 No identification potential
GCA_013089075.1 No identification potential
GCA_013089085.1 No identification potential
GCA_013089095.1 No identification potential
GCA_013089105.1 No identification potential
GCA_013089115.1 No identification potential
GCA_013089125.1 No identification potential
GCA_013089135.1 No identification potential
GCA_013089145.1 No identification potential
GCA_013089155.1 No identification potential
GCA_013089165.1 No identification potential
GCA_013089175.1 No identification potential
GCA_013089185.1 No identification potential
GCA_013089195.1 No identification potential
GCA_013089205.1 No identification potential
GCA_013089215.1 No identification potential
GCA_013089225.1 No identification potential
GCA_013089235.1 No identification potential
GCA_013089245.1 No identification potential
GCA_013089255.1 No identification potential
GCA_013089265.1 No identification potential
GCA_013089275.1 No identification potential
GCA_013089285.1 No identification potential
GCA_013089295.1 No identification potential
GCA_013089305.1 No identification potential
GCA_013089315.1 No identification potential
GCA_013089325.1 No identification potential
GCA_013089335.1 No identification potential
GCA_013089345.1 No identification potential
GCA_013089355.1 No identification potential
GCA_013089365.1 No identification potential
GCA_013089375.1 No identification potential
GCA_013089385.1 No identification potential
GCA_013089395.1 No identification potential
GCA_013089405.1 No identification potential
GCA_013089415.1 No identification potential
GCA_013089425.1 No identification potential
GCA_013089435.1 No identification potential
GCA_013089445.1 No identification potential
GCA_013089455.1 No identification potential
GCA_013089465.1 No identification potential
GCA_013089475.1 No identification potential
GCA_013089485.1 No identification potential
GCA_013089495.1 No identification potential
GCA_013089505.1 No identification potential
GCA_013089515.1 No identification potential
GCA_013089525.1 No identification potential
GCA_013089535.1 No identification potential
GCA_013089545.1 No identification potential
GCA_013089555.1 No identification potential
GCA_013089565.1 No identification potential
GCA_013089575.1 No identification potential
GCA_013089585.1 No identification potential
GCA_013089595.1 No identification potential
GCA_013089605.1 No identification potential
GCA_013089615.1 No identification potential
GCA_013089625.1 No identification potential
GCA_013089635.1 No identification potential
GCA_013089645.1 No identification potential
GCA_013089655.1 No identification potential
GCA_013089665.1 No identification potential
GCA_013089675.1 No identification potential
GCA_013089685.1 No identification potential
GCA_013089695.1 No identification potential
GCA_013089705.1 No identification potential
GCA_013089715.1 No identification potential
GCA_013089725.1 No identification potential
GCA_013089735.1 No identification potential
GCA_013089745.1 No identification potential
GCA_013089755.1 No identification potential
GCA_013089765.1 No identification potential
GCA_013089775.1 No identification potential
GCA_013089785.1 No identification potential
GCA_013089795.1 No identification potential
GCA_013089805.1 No identification potential
GCA_013089815.1 No identification potential
GCA_013089825.1 No identification potential
GCA_013089835.1 No identification potential
GCA_013089845.1 No identification potential
GCA_013089855.1 No identification potential
GCA_013089865.1 No identification potential
GCA_013089875.1 No identification potential
GCA_013089885.1 No identification potential
GCA_013089895.1 No identification potential
GCA_013089905.1 No identification potential
GCA_013089915.1 No identification potential
GCA_013089925.1 No identification potential
GCA_013089935.1 No identification potential
GCA_013089945.1 No identification potential
GCA_013089955.1 No identification potential
GCA_013089965.1 No identification potential
GCA_013089975.1 No identification potential
GCA_013089985.1 No identification potential
GCA_013089995.1 No identification potential
GCA_013090005.1 No identification potential
GCA_013090015.1 No identification potential
GCA_013090025.1 No identification potential
GCA_013090035.1 No identification potential
GCA_013090045.1 No identification potential
GCA_013096315.1 No identification potential

NOTE: Some of these assemblies may have since been removed from the NCBI. In addition, there may be many other unsuitable assemblies that are not listed here.
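
One way to apply this list is to drop the flagged accessions before building the reference directory. The following is a minimal sketch, assuming reference files carry their accession in the file name (for example `GCA_006449155.1_genomic.fna.gz`, a hypothetical naming scheme). The `EXCLUDED` set shows only a few entries from the table above, and `keep_reference` and `total_length` are illustrative helpers, not part of PikaVirus.

```python
import re

# A few accessions from the table above; extend with the full list.
EXCLUDED = {
    "GCA_006449155.1",
    "GCA_006449195.1",
    "GCA_001857805.1",
}

# GenBank (GCA_) or RefSeq (GCF_) assembly accession with version suffix.
ACCESSION = re.compile(r"GC[AF]_\d{9}\.\d+")


def keep_reference(filename, excluded=EXCLUDED):
    """Return True unless the file name carries an excluded accession."""
    match = ACCESSION.search(filename)
    return match is None or match.group(0) not in excluded


def total_length(fasta_text):
    """Sum the sequence lengths in FASTA text, ignoring header lines."""
    return sum(
        len(line.strip())
        for line in fasta_text.splitlines()
        if line and not line.startswith(">")
    )
```

A length check such as `total_length(text) < 100` could be used to flag further short assemblies; the 100 bp threshold is arbitrary and only for illustration.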

Credits

PikaVirus 2.0 was originally written by Guillermo Jorge Gorines Cordero, under the supervision of the BU-ISCIII team in Madrid, Spain.

PikaVirus has been developed following the nf-core community guidelines, tools and best practices, although it is not an official nf-core pipeline. You can cite the nf-core publication as follows:

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.

We thank the following people for their extensive assistance in the development of this pipeline:

Contributions and Support

If you would like to contribute to this pipeline, please see the contributing guidelines.

Citations

In addition, references of tools and data used in this pipeline are as follows:

Improved metagenomic analysis with Kraken 2.

Derrick E Wood, Jennifer Lu & Ben Langmead.

Genome Biology 2019 Nov 28. doi: 10.1186/s13059-019-1891-0

fastp: an ultra-fast all-in-one FASTQ preprocessor.

Shifu Chen, Yanqing Zhou, Yaru Chen, Jia Gu.

Bioinformatics, Volume 34, Issue 17, 01 September 2018, Pages i884–i890. doi: 10.1093/bioinformatics/bty560

Bioconda: sustainable and comprehensive software distribution for the life sciences

Björn Grüning, Ryan Dale, Andreas Sjödin, Brad A. Chapman, Jillian Rowe, Christopher H. Tomkins-Tinch, Renan Valieris, Johannes Köster & The Bioconda Team

Nature Methods volume 15, pages 475–476 (2018). doi: 10.1038/s41592-018-0046-7

Mash: fast genome and metagenome distance estimation using MinHash

Brian D. Ondov, Todd J. Treangen, Páll Melsted, Adam B. Mallonee, Nicholas H. Bergman, Sergey Koren & Adam M. Phillippy

Genome Biology 17, Article number: 132 (2016). doi: 10.1186/s13059-016-0997-x