The dntap (de novo transcriptome analysis pipeline)

About

This pipeline is designed to process RNA-seq data. It produces results and reports from your data with minimal effort: fill in a configuration file and launch a single command line. It takes advantage of the Snakemake workflow engine; refer to the Snakemake publication for more details.

The pipeline is composed of 7 steps:

  1. .fastq evaluation (FASTQC)
  2. Quality filtering .fastq (Trimmomatic)
  3. Newly filtered .fastq evaluation (FASTQC)
  4. De novo transcriptome assembly (Trinity)
  5. Assembly metrics evaluation (Transrate)
  6. Protein coding domains prediction (Transdecoder)
  7. Functional annotation of predicted protein coding domains (InterProScan 5)

Snakemake auto-generated diagram


Quick example

dntap takes 2 arguments:

  1. The snakefile dntap.py
  2. The configuration file dntap_config.yaml

snakemake --snakefile dntap.py --configfile dntap_config.yaml --cores 20

  • You can set the maximum number of cores available to the dntap.py steps, here: --cores 20.
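
The invocation above can also be assembled from a script, which is convenient when launching several runs. The following is a minimal sketch, not part of dntap: the helper name build_dntap_command is hypothetical, and the only addition to the flags shown above is Snakemake's -n (dry-run) option, which lists the planned jobs without executing them.

```python
import shlex


def build_dntap_command(snakefile, configfile, cores, dry_run=False):
    """Return the Snakemake invocation for dntap as an argument list.

    Hypothetical helper for illustration; it only builds the command.
    """
    command = [
        "snakemake",
        "--snakefile", snakefile,
        "--configfile", configfile,
        "--cores", str(cores),
    ]
    if dry_run:
        # -n asks Snakemake to show the planned jobs without running them.
        command.append("-n")
    return command


command = build_dntap_command("dntap.py", "dntap_config.yaml", 20, dry_run=True)
print(shlex.join(command))
# To actually launch the run: subprocess.run(command, check=True)
```

A dry run is a cheap way to check that the configuration file is read correctly before committing 20 cores to an assembly.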

Version

1.0

Pre-requirements and installation

The pipeline is provided with ready-to-use binaries and sources for:

  • FastQC (v0.11.5)
  • Trimmomatic (v0.36)
  • Transrate (v1.0.3)
  • TransDecoder (v3.0.1)

However, it requires a separate installation of:

  • Trinity
  • InterProScan 5

Note that the current version of the pipeline does not take advantage of the ipr_lookup service of InterProScan. However, you need to download and install the Panther database if you want InterProScan to search against it.

Configuration file

Software location

To use the pipeline properly, start by filling in the software locations in dntap_config.yaml:

software:
    # You must provide the absolute path to each executable file
    # e.g. fastqc: /path/to/fastqc/install/directory/fastqc
    
    fastqc: /path/to/src/FastQC/fastqc
    trimmomatic: /path/to/src/Trimmomatic-0.36/trimmomatic-0.36.jar
    trinity: /path/to/src/trinityrnaseq-Trinity-v2.4.0/Trinity
    transrate: /path/to/src/transrate-1.0.3-linux-x86_64/transrate
    transdecoder_longorfs: /path/to/src/TransDecoder-3.0.1/TransDecoder.LongOrfs
    transdecoder_predict: /path/to/src/TransDecoder-3.0.1/TransDecoder.Predict
    interproscan: /path/to/src/interproscan-5.24-63.0/interproscan.sh

If Trinity is installed at /usr/local/trinityrnaseq-Trinity-v2.4.0/Trinity, simply replace the path in dntap_config.yaml as follows:

trinity: /usr/local/trinityrnaseq-Trinity-v2.4.0/Trinity
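
Because every entry must be an absolute path to an existing file, it can help to check the software section before starting a long run. The sketch below is an illustration rather than part of dntap: the function name check_software_paths is hypothetical, and it assumes the software section has already been loaded into a plain dictionary (for example with a YAML parser of your choice).

```python
import os


def check_software_paths(software):
    """Return a list of problems found in the 'software' section.

    'software' maps tool names to paths, mirroring dntap_config.yaml.
    Hypothetical helper; dntap itself may validate paths differently.
    """
    problems = []
    for tool, path in software.items():
        if not os.path.isabs(path):
            problems.append(f"{tool}: '{path}' is not an absolute path")
        elif not os.path.isfile(path):
            problems.append(f"{tool}: '{path}' does not exist")
        elif not path.endswith(".jar") and not os.access(path, os.X_OK):
            # .jar files are run through java, so they need not be executable.
            problems.append(f"{tool}: '{path}' is not executable")
    return problems


for problem in check_software_paths({"fastqc": "path/to/src/FastQC/fastqc"}):
    print(problem)
```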

Inputs and parameters

You must specify in the config file whether you are using single-end or paired-end .fastq file(s).

data_type:
    # You must specify either 'pe' or 'se' depending on whether you use
    # paired-end files or a single-end file, respectively.

    type: pe

Finally, provide the location of the .fastq files to be processed by the pipeline.

samples:
    # You must provide the absolute paths to the paired-end RNA-seq files (.fastq / .fq).

    forward: /path/to/sample/reads.left.fq
    reverse: /path/to/sample/reads.right.fq
    single: none

If you are using a single-end .fastq file instead of paired-end .fastq files, change the section as follows:

samples:
    # You must provide the absolute path to the single-end RNA-seq file (.fastq / .fq).

    forward: none
    reverse: none
    single: /path/to/sample/single.fq

Note that a file called "none" is created in the input .fastq directory as a placeholder required by the pipeline's internal logic. This will be fixed in a future version of the pipeline.
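
The data_type and samples sections have to agree with each other: 'pe' expects forward and reverse to be set and single to be none, while 'se' expects the opposite. A minimal sketch of that consistency check follows; validate_inputs is a hypothetical name and the check is an assumption about what a correct configuration looks like, not code taken from dntap.

```python
def validate_inputs(data_type, samples):
    """Check that the 'samples' entries agree with data_type['type'].

    Both arguments are dictionaries mirroring the config sections.
    Returns a list of error messages (empty when consistent).
    """
    kind = data_type.get("type")
    if kind == "pe":
        expected = {"forward": True, "reverse": True, "single": False}
    elif kind == "se":
        expected = {"forward": False, "reverse": False, "single": True}
    else:
        return [f"data_type.type must be 'pe' or 'se', got {kind!r}"]

    errors = []
    for key, should_be_set in expected.items():
        value = samples.get(key)
        # The config uses the literal word "none" for unused entries.
        is_set = value is not None and str(value).lower() != "none"
        if is_set != should_be_set:
            wanted = "a file path" if should_be_set else "none"
            errors.append(f"samples.{key} should be {wanted} when type is '{kind}'")
    return errors


paired = {"forward": "/path/to/sample/reads.left.fq",
          "reverse": "/path/to/sample/reads.right.fq",
          "single": "none"}
print(validate_inputs({"type": "pe"}, paired))
```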

Change parameters if needed:

trimmomatic_params:
    MINLEN:32 SLIDINGWINDOW:10:20 LEADING:5 TRAILING:5
    
trinity_params:
    max_memory: 20G
    
transdecoder_params:
    min_protein_len: 100
    
interproscan_params:
    out_format: tsv
    db: TIGRFAM, SFLD, ProDom, Hamap, SMART, CDD, ProSiteProfiles, ProSitePatterns, SUPERFAMILY, PRINTS, PANTHER, Gene3D, PIRSF, Pfam, Coils
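
To make the role of trimmomatic_params concrete, here is a sketch of how such a string could be turned into a Trimmomatic command line. It is an assumption about how the value might be used, not the pipeline's actual rule: the function name and the output file names (built from out_prefix) are hypothetical, while the PE/SE modes, the -threads option and the positional input/output order follow Trimmomatic's documented usage.

```python
def build_trimmomatic_command(jar, data_type, samples, steps, threads,
                              out_prefix="trimmed"):
    """Assemble a Trimmomatic command line as an argument list.

    Hypothetical helper; output names derived from out_prefix are
    illustrative only.
    """
    trimming_steps = steps.split()  # e.g. ["MINLEN:32", "SLIDINGWINDOW:10:20", ...]
    if data_type == "pe":
        mode = "PE"
        files = [
            samples["forward"], samples["reverse"],
            f"{out_prefix}_1P.fq", f"{out_prefix}_1U.fq",  # forward paired / unpaired
            f"{out_prefix}_2P.fq", f"{out_prefix}_2U.fq",  # reverse paired / unpaired
        ]
    elif data_type == "se":
        mode = "SE"
        files = [samples["single"], f"{out_prefix}.fq"]
    else:
        raise ValueError(f"data_type must be 'pe' or 'se', got {data_type!r}")
    return (["java", "-jar", jar, mode, "-threads", str(threads)]
            + files + trimming_steps)


command = build_trimmomatic_command(
    "/path/to/src/Trimmomatic-0.36/trimmomatic-0.36.jar",
    "pe",
    {"forward": "/path/to/sample/reads.left.fq",
     "reverse": "/path/to/sample/reads.right.fq"},
    "MINLEN:32 SLIDINGWINDOW:10:20 LEADING:5 TRAILING:5",
    threads=6,
)
print(" ".join(command))
```

Trimmomatic applies trimming steps in the order they are listed, so if the string is passed through unchanged, the order of the steps in trimmomatic_params matters.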

You can also set the maximum number of threads to be used by each step of the pipeline:

threads:
    # You can set the maximum number of threads to be used by each step.

    fastqc: 20
    trimmomatic: 6
    trinity: 20
    transrate: 20
    transdecoder: 20
    interproscan: 20

Note that these values cannot exceed the maximum number of cores given on the command line (here, --cores 20).
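
In practice, Snakemake scales a rule's thread count down to the number of cores given on the command line, so a step never receives more threads than --cores allows. The sketch below simply mirrors that rule for the threads section; effective_threads is a hypothetical helper, not part of dntap or Snakemake.

```python
def effective_threads(requested, cores):
    """Return the thread count each step can actually use.

    'requested' mirrors the threads section of dntap_config.yaml;
    'cores' is the value passed to --cores.
    """
    return {step: min(count, cores) for step, count in requested.items()}


threads = {"fastqc": 20, "trimmomatic": 6, "trinity": 20,
           "transrate": 20, "transdecoder": 20, "interproscan": 20}
for step, count in effective_threads(threads, cores=8).items():
    print(f"{step}: {count}")
```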

Contacts

Arnaud Meng

Ph.D. student in Bioinformatics

Institut de Biologie Paris Seine

Linkedin | ResearchGate

References

A first version of the pipeline was presented at the European Conference on Computational Biology (ECCB) 2016.

Meng A, Bittner L, Corre E et al. De novo transcriptome assembly dedicated pipeline and its specific application to non-model marine planktonic organisms. F1000Research 2016, 5:2643 (poster) (doi: 10.7490/f1000research.1113381.1)

Included software

FASTQC S. Andrews. FastQC A Quality Control tool for High Throughput Sequence Data. (2014)

Trimmomatic Bolger, A. M., Lohse, M. & Usadel, B. Trimmomatic: A flexible trimmer for Illumina Sequence Data. Bioinformatics btu170 (2014). doi:10.1093/bioinformatics/btu170

Trinity Grabherr, M. G. et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome (Trinity). Nat Biotech 29, 644–652 (2011).

Transrate Smith-Unna, R., Boursnell, C., Patro, R., Hibberd, J. & Kelly, S. TransRate: reference free quality assessment of de novo transcriptome assemblies. Genome Res. gr.196469.115 (2016). doi:10.1101/gr.196469.115

Transdecoder Haas, B. J. et al. De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nat. Protocols 8, 1494–1512 (2013).

InterProScan 5 Jones, P. et al. InterProScan 5: genome-scale protein function classification. Bioinformatics 30, 1236–1240 (2014).