Pipeline used for single-cell RNAseq read alignment in the Bradham Lab at Boston University.
The pipeline is implemented using SnakeMake*.
To install the pipeline, simply clone this repository and install the required conda
environments using the provided specification files. This pipeline has only been tested in a Linux environment. It is not guaranteed to work on a Mac or Windows machine.
Two conda environments are required to run the pipeline:
- Alignment
Create the alignment environment using the requirements.txt. In a terminal, with an accessible conda installation, issue the following command:
conda create -n alignment --file alignment_spec.txt
- MultiQC
The MultiQC environment should be built using the following instructions*:
conda create -n multiqc pip --no-default-packages
source activate multiqc
pip install --upgrade --force-reinstall git+https://github.com/ewels/MultiQC.git --ignore-installed certifi
You may also need to install Cython for some package use in the multiqc environment. This can be installed using the conda install cython command.
*Note: on some systems the Python 3 version of MultiQC fails due to the click library failing to deal with strings properly. If this is the case, specify python=2.7 upon environment creation.
The pipeline is built with SnakeMake, so it is executed like any other SnakeMake pipeline.
Perform a dry run:
Navigate to the head of the repository. In your terminal issue the following command: snakemake -np
Run the pipeline:
Navigate to the head of the repository. In your terminal issue the following command: snakemake
This pipeline performs the necessary operations to take single-cell RNAseq data from paired-end raw reads to a normalized expression matrix. This transformation is done using the following tools/steps.
input: raw reads (.fastq.gz)
output: trimmed and filtered reads (.fastq.gz)
Perform quality control using fastp by trimming low-quality regions and adapter sequences from reads, and by filtering out reads with too many ambiguous bases (Ns) or with low sequence complexity.
bioRxiv Pre-Print
Shifu Chen, Yanqing Zhou, Yaru Chen, Jia Gu. fastp: an ultra-fast all-in-one FASTQ preprocessor. bioRxiv 274100; doi: https://doi.org/10.1101/274100
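The read-level filters in the fastp step can be illustrated with a minimal Python sketch. This is not how the pipeline calls fastp; it only mimics the two filters described above. The complexity measure (fraction of bases that differ from the following base) follows the definition given in the fastp documentation, while the thresholds `max_n` and `min_complexity` and all function names are assumptions chosen for illustration.

```python
# Illustrative sketch only: fastp is a compiled tool, and the thresholds
# below are assumed values, not settings taken from this pipeline.

def count_ambiguous(seq: str) -> int:
    """Return the number of ambiguous (N) bases in a read."""
    return seq.upper().count("N")


def complexity(seq: str) -> float:
    """Return the fraction of bases that differ from the base after them."""
    if len(seq) < 2:
        return 0.0
    changes = sum(1 for a, b in zip(seq, seq[1:]) if a != b)
    return changes / (len(seq) - 1)


def passes_filters(seq: str, max_n: int = 5, min_complexity: float = 0.30) -> bool:
    """Keep a read only if it has few Ns and is not low complexity."""
    return count_ambiguous(seq) <= max_n and complexity(seq) >= min_complexity


# Toy reads: one normal, one homopolymer, one with six ambiguous bases.
reads = ["ACGTACGTACGT", "AAAAAAAAAAAA", "ACGNNNNNNTGA"]
kept = [r for r in reads if passes_filters(r)]
print(kept)
```

In this toy input only the first read is kept: the homopolymer fails the complexity filter and the third read exceeds the assumed limit on ambiguous bases.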
input: trimmed and filtered reads (.fastq.gz)
output: aligned reads (.bam, .sam)
Align filtered reads to the provided genome using STAR.
Original Paper
Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013. 29. 1. pp 15-21.
input: filtered alignments (.bam)
output: raw read count matrix (.csv)
Retrieve fragment counts from paired-end data using featureCounts.
Original Paper
Liao Y, Smyth GK, Shi W. featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics. 2014. 30. 7. pp 923-930.
input: read counts (.csv)
output: filtered counts (.csv)
Remove genes without any associated counts across all cells. Remove cells with greater than 90% dropout.
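The two filtering rules can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the pipeline's own code: the count matrix is assumed to be genes by cells, dropout is assumed to mean the fraction of zero counts in a cell, and cell dropout is assumed to be computed after the empty genes are removed. The function names are hypothetical.

```python
def dropout_rate(values):
    """Return the fraction of entries equal to zero (0.0 for empty input)."""
    values = list(values)
    if not values:
        return 0.0
    return sum(1 for v in values if v == 0) / len(values)


def filter_counts(counts, max_cell_dropout=0.90):
    """Filter a gene -> per-cell counts mapping.

    Assumes every gene has one count per cell, in the same cell order.
    Returns the filtered mapping and the indices of the cells that were kept.
    """
    # Step 1: drop genes with no counts in any cell.
    expressed = {g: row for g, row in counts.items() if any(c > 0 for c in row)}
    if not expressed:
        return {}, []

    # Step 2: drop cells whose dropout exceeds the threshold.
    n_cells = len(next(iter(expressed.values())))
    kept_cells = [
        j for j in range(n_cells)
        if dropout_rate(row[j] for row in expressed.values()) <= max_cell_dropout
    ]
    filtered = {g: [row[j] for j in kept_cells] for g, row in expressed.items()}
    return filtered, kept_cells


# Toy matrix: geneB is never detected, and the second cell has no counts at all.
counts = {
    "geneA": [5, 0, 2],
    "geneB": [0, 0, 0],
    "geneC": [1, 0, 7],
}
filtered, kept_cells = filter_counts(counts)
print(filtered, kept_cells)
```

Here geneB is removed because it has no counts, and the second cell is removed because all of its remaining genes are zero.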
input: raw read count matrix (.csv)
output: within-sample normalized count matrix
Normalize read counts using SCnorm if dropout is below 80%; otherwise, use scran.
Original Papers
Bacher R, Chu LF, Leng N, Gasch AP, Thomson JA, Stewart RM, Newton M, Kendziorski C. SCnorm: robust normalization of single-cell RNA-seq data. Nature Methods. 2017 Jun 1;14(6):584-6.
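The choice between the two normalization methods can be expressed as a small decision function. The sketch below is an assumption about how the rule might be applied: dropout is taken to be the fraction of zero entries across the whole count matrix, and the function names are hypothetical. SCnorm and scran are R packages, so this code only selects a method name and does not perform any normalization.

```python
def overall_dropout(matrix):
    """Return the fraction of zero entries across a list-of-rows matrix."""
    total = sum(len(row) for row in matrix)
    if total == 0:
        return 0.0
    zeros = sum(1 for row in matrix for v in row if v == 0)
    return zeros / total


def choose_normalizer(matrix, threshold=0.80):
    """Pick SCnorm when dropout is below the threshold, otherwise scran."""
    return "SCnorm" if overall_dropout(matrix) < threshold else "scran"


dense = [[3, 0, 2], [1, 4, 0]]               # 2 of 6 entries are zero
sparse = [[0, 0, 0, 0, 1], [0, 0, 0, 0, 0]]  # 9 of 10 entries are zero
print(choose_normalizer(dense), choose_normalizer(sparse))
```

A matrix with exactly 80% dropout is routed to scran, because the rule as stated requires dropout to be strictly below 80% for SCnorm.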
input: within-sample normalized count matrix
output: batch-corrected normalized count matrix
Remove batch effects using mutual nearest neighbors (MNN).
Original Paper
Haghverdi L, Lun ATL, Morgan MD, Marioni JC. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nature Biotechnology. 2018. 36:421-427.
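The idea behind mutual nearest neighbors can be shown with a toy example. The sketch below covers only the pair-finding step: two cells from different batches form a pair when each is among the other's k nearest neighbors. It assumes Euclidean distance on small 2-D points and uses hypothetical function names. The published method goes on to use these pairs to estimate and apply correction vectors, which is not shown here.

```python
import math


def nearest(query, reference, k):
    """Return the indices of the k cells in `reference` closest to `query`."""
    order = sorted(range(len(reference)),
                   key=lambda j: math.dist(query, reference[j]))
    return set(order[:k])


def mnn_pairs(batch_a, batch_b, k=1):
    """Return (i, j) pairs where cell i of A and cell j of B are mutual k-NN."""
    a_to_b = [nearest(cell, batch_b, k) for cell in batch_a]
    b_to_a = [nearest(cell, batch_a, k) for cell in batch_b]
    return [
        (i, j)
        for i in range(len(batch_a))
        for j in sorted(a_to_b[i])
        if i in b_to_a[j]  # keep the pair only if the relation is mutual
    ]


# Toy expression profiles: the third cell of batch A has no counterpart in B.
batch_a = [(0.0, 0.0), (5.0, 5.0), (20.0, 20.0)]
batch_b = [(0.5, 0.0), (5.5, 5.0)]
pairs = mnn_pairs(batch_a, batch_b, k=1)
print(pairs)
```

The third cell of batch A finds a nearest neighbor in batch B, but that neighbor is closer to a different cell in A, so no pair is formed for it.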
* This pipeline is currently being developed and does not exist in a complete/functional state.