This pipeline is designed to process RNA-seq data. It produces results and reports from your data with minimal effort: fill in a configuration file and launch a single command. It is built on the Snakemake workflow engine; refer to the Snakemake publication for more details.
The pipeline is composed of the following steps:
- .fastq evaluation (FastQC)
- Quality filtering of .fastq (Trimmomatic)
- Newly filtered .fastq evaluation (FastQC)
- De novo transcriptome assembly (Trinity)
- Assembly metrics evaluation (Transrate)
- Protein coding domain prediction (TransDecoder)
- Functional annotation of predicted protein coding domains (InterProScan 5)
dntap takes 2 arguments:
- The snakefile dntap.py
- The configuration file dntap_config.yaml
snakemake --snakefile dntap.py --configfile dntap_config.yaml --cores 20
- A maximum number of cores to be used by the dntap.py steps can be provided, here: --cores 20.
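The invocation above can also be assembled from a script. The following is a minimal sketch, not part of dntap: the helper name build_snakemake_command is hypothetical, and it only uses the three Snakemake options shown in the command above.

```python
import shlex


def build_snakemake_command(snakefile, configfile, cores):
    """Return the snakemake argument list for one run (hypothetical helper)."""
    if cores < 1:
        raise ValueError("cores must be a positive integer")
    return [
        "snakemake",
        "--snakefile", snakefile,
        "--configfile", configfile,
        "--cores", str(cores),
    ]


command = build_snakemake_command("dntap.py", "dntap_config.yaml", 20)
# Print the command as one shell-safe line. To actually launch the run,
# pass `command` to subprocess.run(command, check=True) instead.
print(shlex.join(command))
```

Keeping the arguments in a list rather than a single string avoids quoting problems if a path contains spaces.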
1.0
The pipeline is provided with ready-to-use binaries and sources for:
- FastQC (v0.11.5)
- Trimmomatic (v0.36)
- Transrate (v1.0.3)
- TransDecoder (v3.0.1)
However, it requires a separate installation of:
- Snakemake
- Trinity (v2.4.0)
- InterProScan (v5.24-63)
Note that the current version of the pipeline does not take advantage of the ipr_lookup service of InterProScan.
However, you need to download and install the PANTHER database if you want InterProScan to search against it.
To use the pipeline properly, start by filling in the software locations in dntap_config.yaml:
software:
  # You must provide the absolute path to each executable file
  # e.g. fastqc: /path/to/fastqc/install/directory/fastqc
  fastqc: /path/to/src/FastQC/fastqc
  trimmomatic: /path/to/src/Trimmomatic-0.36/trimmomatic-0.36.jar
  trinity: /path/to/src/trinityrnaseq-Trinity-v2.4.0/Trinity
  transrate: /path/to/src/transrate-1.0.3-linux-x86_64/transrate
  transdecoder_longorfs: /path/to/src/TransDecoder-3.0.1/TransDecoder.LongOrfs
  transdecoder_predict: /path/to/src/TransDecoder-3.0.1/TransDecoder.Predict
  interproscan: /path/to/src/interproscan-5.24-63.0/interproscan.sh
If Trinity is installed at /usr/local/trinityrnaseq-Trinity-v2.4.0/Trinity, simply replace the path in dntap_config.yaml as follows:
trinity: /usr/local/trinityrnaseq-Trinity-v2.4.0/Trinity
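Because every entry in the software section must be an absolute path to an existing file, a quick check before launching a long run can save time. The sketch below is an illustration only and is not shipped with the pipeline: check_software_paths is a hypothetical helper, and it assumes the software section has already been loaded into a plain dictionary (for example with a YAML parser, which is not shown).

```python
import os


def check_software_paths(software):
    """Return a list of problems found in the `software` mapping."""
    problems = []
    for tool, path in software.items():
        if not os.path.isabs(path):
            problems.append(f"{tool}: '{path}' is not an absolute path")
        elif not os.path.isfile(path):
            problems.append(f"{tool}: '{path}' does not exist")
    return problems


# Example mapping mirroring part of the `software` section.
software = {
    "fastqc": "src/FastQC/fastqc",  # relative path: always reported
    "trinity": "/usr/local/trinityrnaseq-Trinity-v2.4.0/Trinity",
}
for problem in check_software_paths(software):
    print(problem)
```

An empty result means every listed tool was found at an absolute location on the machine where the check ran.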
You must specify in the config file whether you are using single-end or paired-end .fastq file(s).
data_type:
  # You must specify either 'pe' or 'se' depending on whether you use paired-end
  # files or a single-end file, respectively.
  type: pe
Finally, provide the location of the .fastq files to be processed by the pipeline.
samples:
  # You must provide the absolute paths to the paired-end RNA-seq files (.fastq / .fq).
  forward: /path/to/sample/reads.left.fq
  reverse: /path/to/sample/reads.right.fq
  single: none
If you are using a single-end .fastq file instead of paired-end .fastq files, change it as follows:
samples:
  # You must provide the absolute path to the single-end RNA-seq file (.fastq / .fq).
  forward: none
  reverse: none
  single: /path/to/sample/single.fq
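The data_type and samples sections have to agree: 'pe' needs forward and reverse, 'se' needs single, and unused entries are set to none. The following sketch shows one way to check this before a run. It is an assumption-laden illustration rather than pipeline code: validate_samples is a hypothetical helper that works on values already read from the config file.

```python
def validate_samples(data_type, samples):
    """Return a list of inconsistencies between `type` and `samples`."""

    def is_set(key):
        # An entry counts as unset when it is missing, empty, or "none".
        value = samples.get(key)
        return value is not None and str(value).strip().lower() not in ("", "none")

    errors = []
    if data_type == "pe":
        for key in ("forward", "reverse"):
            if not is_set(key):
                errors.append(f"type is 'pe' but '{key}' is not set")
    elif data_type == "se":
        if not is_set("single"):
            errors.append("type is 'se' but 'single' is not set")
    else:
        errors.append(f"type must be 'pe' or 'se', got {data_type!r}")
    return errors


# A single-end sample declared with the paired-end type is reported twice,
# once for each missing read file.
samples = {"forward": "none", "reverse": "none", "single": "/path/to/sample/single.fq"}
for error in validate_samples("pe", samples):
    print(error)
```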
Note that a file called "none" is created in the input .fastq directory for algorithmic convenience. This will be fixed in a future version of the pipeline.
Change parameters if needed:
trimmomatic_params:
  MINLEN:32 SLIDINGWINDOW:10:20 LEADING:5 TRAILING:5
trinity_params:
  max_memory: 20G
transdecoder_params:
  min_protein_len: 100
interproscan_params:
  out_format: tsv
  db: TIGRFAM, SFLD, ProDom, Hamap, SMART, CDD, ProSiteProfiles, ProSitePatterns, SUPERFAMILY, PRINTS, PANTHER, Gene3D, PIRSF, Pfam, Coils
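The trimmomatic_params value is a space-separated list of Trimmomatic steps, each with colon-separated settings. Trimmomatic's documentation describes steps as being applied in the order they are given, so it can help to see the string broken down before changing it. The sketch below is illustrative only: parse_trimmomatic_steps is a hypothetical helper, not something the pipeline provides.

```python
def parse_trimmomatic_steps(params):
    """Split a Trimmomatic step string into (step, settings) pairs, in order."""
    steps = []
    for token in params.split():
        name, _, settings = token.partition(":")
        steps.append((name, settings.split(":") if settings else []))
    return steps


steps = parse_trimmomatic_steps("MINLEN:32 SLIDINGWINDOW:10:20 LEADING:5 TRAILING:5")
for position, (name, settings) in enumerate(steps, start=1):
    print(position, name, settings)
```

With the default value, MINLEN is listed first and the three trimming steps follow it; reorder the string if a different sequence is intended.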
You can also set the maximum number of threads to be used by each step of the pipeline:
threads:
  # You can set the maximum number of threads to be used by each step.
  fastqc: 20
  trimmomatic: 6
  trinity: 20
  transrate: 20
  transdecoder: 20
  interproscan: 20
Note that these values cannot exceed the maximum number provided on the command line with --cores 20.
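To see how the per-step thread settings interact with --cores, the sketch below caps each value at the core limit. Snakemake's documentation describes a rule's threads being scaled down when they exceed --cores, and this is assumed to apply to the dntap.py rules as well. The helper name effective_threads is hypothetical.

```python
def effective_threads(threads, cores):
    """Cap each step's thread count at the --cores value."""
    if cores < 1:
        raise ValueError("cores must be a positive integer")
    return {step: min(count, cores) for step, count in threads.items()}


threads = {
    "fastqc": 20,
    "trimmomatic": 6,
    "trinity": 20,
    "transrate": 20,
    "transdecoder": 20,
    "interproscan": 20,
}
# With --cores 8, every step asking for 20 threads is limited to 8,
# while trimmomatic keeps its 6.
for step, count in effective_threads(threads, cores=8).items():
    print(step, count)
```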
Arnaud Meng
Ph.D. student in Bioinformatics
Institut de Biologie Paris Seine
A first version of the pipeline was presented at the European Conference on Computational Biology (ECCB) 2016.
Meng A, Bittner L, Corre E et al. De novo transcriptome assembly dedicated pipeline and its specific application to non-model marine planktonic organisms. F1000Research 2016, 5:2643 (poster) (doi: 10.7490/f1000research.1113381.1)
- FastQC: Andrews, S. FastQC: A Quality Control tool for High Throughput Sequence Data. (2014).
- Trimmomatic: Bolger, A. M., Lohse, M. & Usadel, B. Trimmomatic: A flexible trimmer for Illumina Sequence Data. Bioinformatics btu170 (2014). doi:10.1093/bioinformatics/btu170
- Trinity: Grabherr, M. G. et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome (Trinity). Nat Biotech 29, 644–652 (2011).
- Transrate: Smith-Unna, R., Boursnell, C., Patro, R., Hibberd, J. & Kelly, S. TransRate: reference free quality assessment of de novo transcriptome assemblies. Genome Res. gr.196469.115 (2016). doi:10.1101/gr.196469.115
- TransDecoder: Haas, B. J. et al. De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nat. Protocols 8, 1494–1512 (2013).
- InterProScan 5: Jones, P. et al. InterProScan 5: genome-scale protein function classification. Bioinformatics 30, 1236–1240 (2014).