nf-core · LorenzoS96 · Oct 29, 2024 · Oct 29, 2024 · Oct 30, 2024 · Oct 31, 2024
diff --git a/docs/usage/DEanalysis/de_rstudio.md b/docs/usage/DEanalysis/de_rstudio.md
diff --git a/docs/usage/DEanalysis/img/DESeq_function.png b/docs/usage/DEanalysis/img/DESeq_function.png
diff --git a/docs/usage/DEanalysis/img/Excalidraw_RNAseq.png b/docs/usage/DEanalysis/img/Excalidraw_RNAseq.png
diff --git a/docs/usage/DEanalysis/img/MA_plot.png b/docs/usage/DEanalysis/img/MA_plot.png
diff --git a/docs/usage/DEanalysis/img/PCA.png b/docs/usage/DEanalysis/img/PCA.png
diff --git a/docs/usage/DEanalysis/img/RNA_seq_scheme_tutorial.png b/docs/usage/DEanalysis/img/RNA_seq_scheme_tutorial.png
diff --git a/docs/usage/DEanalysis/img/count_distribution.png b/docs/usage/DEanalysis/img/count_distribution.png
diff --git a/docs/usage/DEanalysis/img/dds_comparison.png b/docs/usage/DEanalysis/img/dds_comparison.png
diff --git a/docs/usage/DEanalysis/img/enrichment_plot.png b/docs/usage/DEanalysis/img/enrichment_plot.png
diff --git a/docs/usage/DEanalysis/img/heatmap_de_genes.png b/docs/usage/DEanalysis/img/heatmap_de_genes.png
diff --git a/docs/usage/DEanalysis/img/hierarchical_clustering.png b/docs/usage/DEanalysis/img/hierarchical_clustering.png
diff --git a/docs/usage/DEanalysis/img/nf-core-rnaseq_metro_map_grey.png b/docs/usage/DEanalysis/img/nf-core-rnaseq_metro_map_grey.png
diff --git a/docs/usage/DEanalysis/img/overdispersion.png b/docs/usage/DEanalysis/img/overdispersion.png
diff --git a/docs/usage/DEanalysis/img/plotCounts.png b/docs/usage/DEanalysis/img/plotCounts.png
diff --git a/docs/usage/DEanalysis/img/project_R.png b/docs/usage/DEanalysis/img/project_R.png
diff --git a/docs/usage/DEanalysis/img/volcanoplot.png b/docs/usage/DEanalysis/img/volcanoplot.png
diff --git a/docs/usage/DEanalysis/img/workdir_RStudio.png b/docs/usage/DEanalysis/img/workdir_RStudio.png
diff --git a/docs/usage/DEanalysis/index.md b/docs/usage/DEanalysis/index.md
@@ -0,0 +1,45 @@
+---
+enableToc: false
+title: "nf-core/rnaseq tutorial"
+---
+
+## Welcome
+
+These pages are a tutorial workshop for the [Nextflow](https://www.nextflow.io) pipeline [nf-core/rnaseq](https://nf-co.re/rnaseq).
+
+In this workshop, we will recap the application of next generation sequencing to identify differentially expressed genes. You will learn how to use the rnaseq pipeline out this data-intensive workflow efficiently. We will cover topics such as configuration of the pipeline, code execution and data interpretation.
+
+Please note that this is not an introductory workshop, and we will assume some basic familiarity with Nextflow.
+
+By the end of this workshop, you will be able to:
+
+- analyse simple NGS datasets with the nf-core/rnaseq workflow
+- understand the key concepts behind RNAseq differential expression analysis
+- customise some of its features for your own analyses
+- integrate different sources of information to interpret the results
+
+Let's get started!
+
+## Running with Gitpod
+
+In order to run this using GitPod, please make sure:
+
+1. You have a GitHub account: if not, create one [here](https://github.com/signup)
+2. Once you have a GitHub account, sign up for GitPod using your GitHub user [here](https://gitpod.io/login/) choosing "continue with GitHub".
+
+Now you're all set and can use the following button to launch the service:
+
+[![Open in GitPod](https://img.shields.io/badge/Gitpod-%20Open%20in%20Gitpod-908a85?logo=gitpod)](https://gitpod.io/#https://github.com/lescai-teaching/rnaseq-tutorial)
+
+## Additional documentation
+
+- You can find detailed documentation on **Nextflow** [here](https://www.nextflow.io/docs/latest/)
+- You can find additional training on [these pages](https://training.nextflow.io)
+
+## Credits & Copyright
+
+This training material has been written and completed during the [nf-core](https://nf-co.re) Hackathon in Barcellona, 2024, by Lorenzo Sola, Francesco Lescai and Mariangela Santorsola, with special thanks to Victoria Cepeda for her contributions to the tutorial's revision. The tutorial is aimed at anyone who is interested in using nf-core pipelines for their studies or research activities.
+
+The Docker image and Gitpod environment used in this repository have been created by [Seqera](https://seqera.io) but have been made open-source ([CC BY-NC-ND](https://creativecommons.org/licenses/by-nc-nd/4.0/)) for the community.
+
+All examples and descriptions are licensed under the [Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License](http://creativecommons.org/licenses/by-nc-nd/4.0/).
diff --git a/docs/usage/DEanalysis/interpretation.md b/docs/usage/DEanalysis/interpretation.md
@@ -0,0 +1,73 @@
+# Interpretation
+
+Once DE genes have been identified, the next crucial step is to interpret the results. This involves the inspection of tables and plots generated during the analysis to understand the biological significance of the data. In this part of the tutorial, we will delve into the results by discussing the significant DE genes and we will explore various plots generated during the analysis.
+
+!!! note
+
+    The results illustrated in this section might show slight variations compared to your runs due to randomness in the STAR algorithm. This randomness arises from using variable seed values and parallel processing, leading to minor differences in results between runs on the same data. These small discrepancies are not biologically significant and may affect counts and subsequent plots (such as PCA and count plots). However, the overall patterns and main findings should remain consistent. While exact reproducibility is ideal, minor variations are acceptable in practice, as long as they do not impact the main conclusions of the study.
+
+## Quality control plots
+
+The first plot we will examine is the Principal Component Analysis (PCA) plot. Since we're working with simulated data, our metadata is relatively simple, consisting of just three variables: sample, condition and replica. In a typical RNA-seq experiment, however, metadata can be complex and encompass a wide range of variables that could contribute to sample variation, such as sex, age and developmental stage.
+
+![PCA](./img/PCA.png)
+
+By plotting the PCA on the PC1 and PC2 axes, using `condition` as the main variable of interest, we can quickly identify the primary source of variation in our data. By accounting for this variation in our design model, we should be able to detect more differentially expressed genes related to `condition`. When working with real data, it's often useful to plot the data using different variables to explore how much variation is explained by the first two PCs. Depending on the results, it may be informative to examine variation on additional PC axes, such as PC3 and PC4, to gain a more comprehensive understanding of the data.
+
+Next, we will examine the hierarchical clustering plot to explore the relationships between samples based on their gene expression profiles. The heatmap is organized such that samples with similar expression profiles are close to each other, allowing us to identify patterns in the data.
+
+![Cluster](./img/hierarchical_clustering.png)
+
+Remember that to create this plot, we utilized the `dist()` function, so in the legend on the right, a value of 0 corresponds to high correlation, while a value of 5 corresponds to very low correlation. Similar to PCA, we can see that samples tend to cluster together according to `condition`, indeed we can observe a high degree of correlation between the three control samples and between the three treated samples.
+
+Overall, the integration of these plots suggests that we are working with high-quality data and we can confidently proceed to the differential expression analysis.
+
+## Differential expression results
+
+From this point, we will examine plots that are generated after the differential expression analysis. These plots are not quality control (QC) plots, but rather plots that help us to interpret the results.
+After running the `results()` function, a good way to start to have an idea about the results is to look at the MA plot.
+
+![RNAseq](./img/MA_plot.png)
+
+By default, genes are coloured in blue if the padj is less than 0.1 and the log2foldchange greater than or less than 0. Genes that fall outside the plotting region are represented as open triangles. Note that we have not yet applied a filter to select only significant, which we define as those with a padj value less than 0.5 and a log2 fold change of at least 1 or -1.
+
+After filtering our genes of interest according to our threshold, let's have a look to our significatnt genes:
+
+```bash
+gene                baseMean        log2FoldChange     lfcSE          stat            pvalue              padj
+ENSG00000205726     121645.5908     2.894480           0.1515387      19.100600       2.496005e-81        5.840651e-79
+ENSG00000142192     51065.3192      3.025489           0.1891258      15.997230       1.335883e-57        1.562983e-55
+ENSG00000142156     20805.8078      2.977705           0.2159277      13.790287       2.915972e-43        2.274458e-41
+ENSG00000159231     458.9277        -1.194777          0.3058100      -3.906926       9.347790e-05        5.468457e-03
+ENSG00000156282     481.7624        1.095272           0.2969594      3.688289        2.257672e-04        1.056590e-02
+```
+
+To gain a comprehensive overview of the transcriptional profile, the volcano plot represents a highly informative tool.
+
+![Volcano_plot](./img/volcanoplot.png)
+
+The treatment induced differential expression in five genes, with one downregulated and four upregulated. This plot visually represents the numerical results reported in the table above.
+
+After the identification of DE genes, it's informative to visualise the expression of specific genes of interest. The `plotCounts()` function applied directly on the `dds` object allows us to examine individual gene expression profiles without accessing the full `res` object.
+
+![Counts](./img/plotCounts.png)
+
+In our example, post-treatment, we observe a significant increase in the expression of the _ENSG00000142192_ gene, highlighting its responsiveness to the experimental conditions.
+
+Finally, we can create a heatmap using the normalised expression counts of DE genes. The resulting heatmap visualises how the expression of significant genes varies across samples. Each row represents a gene, and each column represents a sample. The color intensity in the heatmap reflects the normalised expression levels: red colors indicate higher expression, while blue colors indicate lower expression.
+
+![Heatmap](./img/heatmap_de_genes.png)
+
+By examining the heatmap, we can visually identify the expression patterns of our five significant differentially expressed genes. This visualization allows us to identify how these genes respond to the treatment. The heatmap provides a clear and intuitive way to explore gene expression dynamics.
+
+## Over Representation Analysis (ORA)
+
+Finally, we can attempt to assign biological significance to our differentially expressed genes through **Enrichment Analysis (ORA)**. The ORA analysis identifies specific biological pathways, molecular functions and cellular processes, according to the **Gene Ontology (GO)** database, that are enriched within our differentially expressed genes.
+
+![Enrichment](./img/enrichment_plot.png)
+
+The enrichment analysis reveals a possible involvement of cellular structures and processes, including "clathrin-coated pit", "dendritic spine", "neuron spine" and "endoplasmic reticulum lumen". These terms suggest a focus on cellular transport, structural integrity and protein processing, especially in neural contexts. This pattern points to pathways related to cellular organization and maintenance, possibly playing an important role in the biological condition under study.
+
+## Conclusions
+
+In this tutorial, we have walked through the steps of performing differential expression analysis using DESeq2, from preparing the data to interpreting the results. We have seen how to identify differentially expressed genes, visualise the results and perform enrichment analysis to gain insights into the biological significance of the results. By following this tutorial, you should now be able to perform differential expression analysis using DESeq2 and interpret the results in the context of your own research question.
diff --git a/docs/usage/DEanalysis/rnaseq.md b/docs/usage/DEanalysis/rnaseq.md
@@ -0,0 +1,122 @@
+# The nf-core/rnaseq pipeline
+
+In order to carry out a RNA-Seq analysis we will use the nf-core pipeline [rnaseq](https://nf-co.re/rnaseq/3.14.0).
+
+## Overview
+
+The pipeline is organised following the diffent blocks: pre-processing, alignment (or lightweight alignment) and quantification, post-processing and final QC.
+
+![Metromap](./img/nf-core-rnaseq_metro_map_grey.png)
+
+In each process, the users can choose among a range of different options. Importantly, the users can decide to follow one of the two different routes in the alignment and quantification step:
+
+- alignment and quantification (stage 2);
+- pseudoalignment and quantification (stage 3).
+
+## Experimental Design
+
+The number of reads and the number of biological replicates are two critical factors that researchers need to carefully consider during the design of a RNA-seq experiment. While it may seem intuitive that having a large number of reads is always desirable, an excessive number can lead to unnecessary costs and computational burdens, without providing significant improvements in quality of the data. Instead, it is often more beneficial to prioritise the number of biological replicates, as it allows to capture the natural biological variation of the data. Biological replicates involve collecting and sequencing RNA from distinct biological samples (e.g., different individuals, tissues, or time points), helping to detect genuine changes in gene expression.
+
+!!! note
+
+    This concept must not be confused with technical replicates that asses the technical variability of the sequencing platform by sequencing the same RNA sample multiple time.
+
+To obtain optimal results, it is crucial to balance the number of biological replicates and sequencing depth. While deeper sequencing improves the detection of lowly expressed genes, it reaches a plateau, beyond which no additional benefits are gained. Statistical power calculations can inform experimental design by estimating the optimal number of reads and replicates required. For instance, this approach helps establish a suitable log2 fold change (log2FC) threshold for the DE analysis. By incorporating multiple biological replicates into the design and optimizing sequencing depth, researchers can enhance the statistical power of the analysis, reducing the number of false positive results and increasing the reliability of the findings.
+
+## Library design
+
+RNA-seq library design involves critical decisions, including the choice between paired-end and single-end sequencing. Paired-end sequencing provides valuable information on structural variations and transcript isoforms, improving mapping accuracy, especially for longer transcripts and repetitive regions. In contrast, single-end sequencing, where only one end of the fragment is sequenced, can be more cost-effective while still providing high-quality data for gene expression analysis. The decision between paired-end and single-end sequencing ultimately depends on the research question and experimental goals. Paired-end sequencing is preferred for novel transcript identification or isoform characterization, while single-end sequencing is sufficient for gene expression quantification. The type of RNA (e.g., mRNA or total RNA), read length, budget and computational resources can impact the choice.
+
+## Reference genome
+
+nf-core pipelines make use of the Illumina iGenomes collection as [reference genomes](https://nf-co.re/docs/usage/reference_genomes).
+Before starting the analysis, the users might want to check whether the genome they need is part of this collection.
+They also might want to consider downloading the reference locally, when running on premises: this would be useful for multiple runs and to speed up the analysis. In this case the parameter `--igenomes_base` might be used to pass the root directory of the downloaded references.
+
+One might also need to use custom files: in this case the users might either provide specific parameters at command line (`--fasta` option followed by the genome of choiche), or create a config file adding a new section to the `genome` object. See [here](https://nf-co.re/docs/usage/reference_genomes#custom-genomes) for more details.
+
+We will follow this specific approach in this tutorial, since the data we will be using have been simulated on chromosome 21 of the Human GRCh38 reference, and we have prepared genome fasta and genome index containing only this chromosome locally. The two files are `/workspace/gitpod/training/data/refs/Homo_sapiens_assembly38_chr21.fa` and `/workspace/gitpod/training/data/refs/Homo_sapiens_assembly38_chr21.fa.fai`, respectively.
+
+## Reference annoation
+
+The reference annotation plays a crucial role in the RNA-seq analysis. Without a high-quality reference annotation, RNA-seq analysis would result in inaccurate or incomplete results. The reference annotation provides a precise guide for aligning sequencing reads to specific genomic regions, allowing to identify genes, transcripts and regulatory elements. This is particularly important for identifying novel transcripts and alternative splicing events.
+
+nf-core pipelines make use of the Illumina iGenomes collection also as [reference annotation](https://nf-co.re/docs/usage/reference_genomes).
+The reference annotations are vastly out of date with respect to current annotations and miss certain features. So, the general recommendation is to download a newest annotation version compatible with the genome. A user can utilize the `--gtf` or the `--gff` options to specify the annottation files of choiche, or create a config file adding a new section to the `genome` object.
+
+Similarly to the approach utilised for the genome, in this tutorial we will follow this approach. The annotation files include only the annotated transcripts on chromosome 21 of the Human GRCh38 reference genome and we have already prepared these files locally. The two files are `/workspace/gitpod/training/data/refs/gencode_v29_chr21.gff` and `/workspace/gitpod/training/data/refs/gencode_v29_transcripts_chr21.fa`, respectively.
+
+## Input files
+
+The input data should be provided in a CSV file, according to a format that is largely common for nf-core pipelines.
+The format is described in the [rnaseq usage page](https://nf-co.re/rnaseq/3.14.0/docs/usage).
+In the tutorial, the input file is `/workspace/gitpod/training/data/reads/rnaseq_samplesheet.csv`.
+
+## Running nf-core/rnaseq
+
+In the following sections we will:
+
+- prepare our references;
+- set our computational resources in order to be able to run the pipeline on a gitpod VM;
+- edit the optional settings;
+- run the pipeline.
+
+## Reference and annotation files
+
+Following the considerations above, we will first of all edit the `nextflow.config` file in our working directory to add a new genome.
+It is sufficient to add the following code to the `parameters` directive in the config.
+
+```groovy
+igenomes_base = '/workspace/gitpod/training/data/refs/'
+genomes {
+        'GRCh38chr21' {
+            fasta                 = "${params.igenomes_base}/sequence/Homo_sapiens_assembly38_chr21.fasta"
+            fasta_fai             = "${params.igenomes_base}/sequence/Homo_sapiens_assembly38_chr21.fasta.fai"
+            gff                   = "${params.igenomes_base}/gencode_v29_chr21_parsed.gff"
+            transcript_fasta      = "${params.igenomes_base}/gencode.v29.transcripts_chr21.fa"
+            star_index            = "${params.igenomes_base}/star_index_chr21.tar.gz"
+            salmon_index          = "${params.igenomes_base}/salmon_index_chr21.tar.gz"
+	}
+}
+```
+
+To speed up the analysis we will include the `star_index` and the `salmon_index` in the config. These files have already been created locally.
+
+## Computing resources
+
+Based on the choices we made when starting up the gitpod environment, we recommend to use the following additional parameters.
+They can also be added to the parameters directive in the config file we just edited.
+
+```groovy
+params {
+    max_cpus      = 2
+    max_memory    = '6.5GB'
+    max_time      = '2.h'
+}
+```
+
+## Launching the pipeline
+
+Now we are ready to launch the pipeline, and we can use the following command line:
+
+```bash
+nextflow run nf-core/rnaseq -r 3.12.0 \
+--input /workspace/gitpod/training/data/reads/rnaseq_samplesheet.csv \
+--outdir ./results_star_salmon \
+--genome GRCh38chr21 \
+--aligner star_salmon \
+--pseudo_aligner salmon \
+--skip_biotype_qc \
+--skip_stringtie \
+--skip_bigwig \
+--skip_umi_extract \
+--skip_trimming \
+--skip_fastqc \
+--skip_markduplicates \
+--skip_dupradar \
+--skip_rseqc \
+--skip_qualimap
+```
+
+Notice that we will run the pipeline with STAR as aligner and Salmon in alignment-based mode to quantify gene expression. We will also run the pipeline with Salmon in mapping-based mode to perform a lightweight alignment and quantification.
+The `skip` parameters were inserted to reduce the running time.