# Mosaicatcher workshop

## AFAC

Context: you are working on a fancy project and at some point Jan suggests generating some Strand-seq data.

That's the first time you are working with Strand-seq and you are starting to panic.

But you remember hearing that a complex tool was developed in the lab to process Strand-seq data in a systematic way: ta-dah, MOSAICATCHER.



Prerequisite: I asked you to select a sample to process during today's workshop.
TD: feedback on how they traced back the name of the sample, the associated run/flowcell, the date when it was sequenced, ...

So here's the plan for today:
- Small intro (~20-30 min) about MosaiCatcher: the different steps, options, branches, possibilities
- Output examples
- SV trustworthiness
- Web report analysis of RPE-MIXTURE

Hands on: pipeline install, module load, test data execution

`vim scNOVA_input_user/input_subclonality.txt`






Then trigger the pipeline on YOUR data

Once this is running, web report analysis together with questions

Then, Strand-scape:

- Still in beta: some microservices are unstable; the main application is for QC and web report consultation
- Remove MC trigger, too complex in the backend
- Cell selection with username

`cp --preserve=timestamps FROM_ TO_`

`snakemake ...`
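For example, a hypothetical staging command (source and destination paths are placeholders) that keeps the original modification times, so snakemake's mtime-based rerun trigger does not treat the files as new:

```bash
# paths are illustrative; adjust to your own run and workspace
cp --preserve=timestamps \
  /g/korbel/shared/runs/2024-01/Sample_A/fastq/*.fastq.gz \
  /g/korbel/"$USER"/workshop/Sample_A/fastq/
```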

## Technical prerequisites

- SSHFS/SFTP connection to visualise/download/access created files (WinSCP/FileZilla/Cyberduck; see the sshfs sketch below)
- Functional terminal connected to the EMBL cluster (if not, follow the SSH key configuration here: https://www.embl.org/internal-information/it-services/hpc-resources-heidelberg/)
- Have a workspace on /g/korbel
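A minimal sketch of an SSHFS mount (the mount point and login node are assumptions; requires sshfs installed on your machine):

```bash
# mount /g/korbel locally under ~/korbel (illustrative mount point)
mkdir -p ~/korbel
sshfs USERNAME@login01.embl.de:/g/korbel ~/korbel
# unmount when done: fusermount -u ~/korbel (Linux) or umount ~/korbel (macOS)
```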

## Workshop prerequisites

- Pick a sample name to be processed
- Download this MosaiCatcher report: https://oc.embl.de/index.php/s/WBgrzBjyzdYdVJA/download (see the command-line sketch below)
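If you prefer the terminal, the same report can be fetched with wget or curl (the output file name is arbitrary):

```bash
wget -O mosaicatcher_report.zip "https://oc.embl.de/index.php/s/WBgrzBjyzdYdVJA/download"
# or: curl -L -o mosaicatcher_report.zip "https://oc.embl.de/index.php/s/WBgrzBjyzdYdVJA/download"
```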

## EMBL cheatsheet

### connect to seneca

`ssh USERNAME@seneca.embl.de`

### connect to login nodes

`ssh USERNAME@login0[1,2,3,4].embl.de` (login01 to login04)
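Optionally, a minimal `~/.ssh/config` sketch (the host aliases are assumptions, adjust to taste) so that `ssh seneca` or `ssh login01` work directly:

```bash
# append to ~/.ssh/config; replace USERNAME with your EMBL account
cat >> ~/.ssh/config <<'EOF'
Host seneca
    HostName seneca.embl.de
    User USERNAME

Host login01 login02 login03 login04
    HostName %h.embl.de
    User USERNAME
EOF
```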

**ℹ️ Important Note**

Snakemake important arguments/options (see the sketch below):
- `--rerun-triggers`
- `--touch`
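A sketch of how these are typically combined with the profiles used in this workshop (both flags exist in snakemake 7.x):

```bash
# limit rerun decisions to file modification times only
# (avoids full reruns after cosmetic changes to code or params)
snakemake \
  --configfile .tests/config/simple_config.yaml \
  --profile workflow/snakemake_profiles/local/conda/ \
  --rerun-triggers mtime \
  --cores 6

# mark existing outputs as up to date without recomputing them
snakemake \
  --configfile .tests/config/simple_config.yaml \
  --profile workflow/snakemake_profiles/local/conda/ \
  --touch \
  --cores 6
```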

## MosaiCatcher important files

- Counts: PARENT_FOLDER/SAMPLE_NAME/counts/SAMPLE_NAME.txt.raw.gz
- Counts statistics: PARENT_FOLDER/SAMPLE_NAME/counts/SAMPLE_NAME.info_raw
- Ashleys predictions: PARENT_FOLDER/SAMPLE_NAME/cell_selection/labels.tsv
- Counts plot: PARENT_FOLDER/SAMPLE_NAME/plots/CountComplete.raw.pdf
- Counts normalised plot: PARENT_FOLDER/SAMPLE_NAME/plots/CountComplete.normalised.pdf
- Phased W/C regions: PARENT_FOLDER/SAMPLE_NAME/strandphaser/strandphaser_phased_haps_merged.txt
- SV calls (stringent): PARENT_FOLDER/SAMPLE_NAME/mosaiclassifier/sv_calls/stringent_filterTRUE.tsv
- SV calls (lenient): PARENT_FOLDER/SAMPLE_NAME/mosaiclassifier/sv_calls/lenient_filterFALSE.tsv
- Plots folder: PARENT_FOLDER/SAMPLE_NAME/plots/
- scNOVA outputs:
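A quick way to peek at the main ones (example values taken from the test data used below; adjust `PARENT` and `SAMPLE` to your own run):

```bash
PARENT=.tests/data_CHR17
SAMPLE=RPE-BM510
zcat "$PARENT/$SAMPLE/counts/$SAMPLE.txt.raw.gz" | head   # binned Watson/Crick counts
head "$PARENT/$SAMPLE/cell_selection/labels.tsv"          # ashleys predictions per cell
ls "$PARENT/$SAMPLE/plots/"                               # all generated plots
```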

## CLI usage of the pipeline

### Quick Start
Notes:
- Config definition is crucial: set via the command line or via a YAML file, it defines where to stop, which mode, which branch, and which options are used (see the sketch below)
- Profile: defines how and where jobs are executed (local conda/singularity execution, HPC submission)
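A minimal sketch of the YAML route (keys taken from the commands below; the file name is arbitrary):

```bash
# write the config once, then pass it with --configfile
cat > my_config.yaml <<'EOF'
data_location: .tests/data_CHR17
ashleys_pipeline: True
multistep_normalisation: True
MultiQC: True
EOF
snakemake --configfile my_config.yaml \
  --profile workflow/snakemake_profiles/local/conda/ \
  --cores 6 --dry-run
```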

2. Load snakemake

A. Use module load OR create a dedicated conda environment

```bash
module load snakemake/7.32.4-foss-2022b
```

---

**ℹ️ Note**

- Please be careful with your conda/mamba setup: if you applied specific constraints/modifications to your system, this could lead to version discrepancies.
- mamba is usually preferred but might not be installed by default on a shared cluster environment

---

B. Create and activate a dedicated conda environment

```bash
conda create -n snakemake -c bioconda -c conda-forge -c defaults -c anaconda snakemake
conda activate snakemake
```

Give a look at the folder structure of the test data:

```bash
tree -h .tests/data_CHR17
```

It should look similar to this:

```
Parent_folder
|-- Sample_1
|   `-- fastq
|       |-- Cell_01.1.fastq.gz
|       |-- Cell_01.2.fastq.gz
|       |-- Cell_02.1.fastq.gz
|       |-- Cell_02.2.fastq.gz
|       |-- Cell_03.1.fastq.gz
|       |-- Cell_03.2.fastq.gz
|       |-- Cell_04.1.fastq.gz
|       `-- Cell_04.2.fastq.gz
|
`-- Sample_2
    `-- fastq
        |-- Cell_21.1.fastq.gz
        |-- Cell_21.2.fastq.gz
        |-- Cell_22.1.fastq.gz
        |-- Cell_22.2.fastq.gz
        |-- Cell_23.1.fastq.gz
        |-- Cell_23.2.fastq.gz
        |-- Cell_24.1.fastq.gz
        `-- Cell_24.2.fastq.gz
```

**Reminder:** you will need to verify that this conda environment is activated and provides the right snakemake before each execution (`which snakemake` should output something like `<FOLDER>/<USER>/[ana|mini]conda3/envs/snakemake/bin/snakemake`)
3. Run on example data on only one small chromosome (`<disk>` must be replaced by your disk letter/name)
First, use the `--dry-run` option of snakemake to make sure the Graph of Execution is properly connected. (In combination with `--dry-run`, we use the `local/conda` profile, as snakemake still presents a bug when looking for the singularity container.)
```bash
# data_location                -> input data location
# ashleys_pipeline=True        -> download & trigger the ashleys-qc upstream module
# ashleys_pipeline_only=True   -> stop after ashleys QC (validation purpose)
# multistep_normalisation=True -> trigger Marco's multistep normalisation
# MultiQC=True                 -> trigger samtools stats, FastQC & MultiQC reporting
# --profile                    -> execution profile to be used
# --dry-run                    -> only check that everything connects well and is ready for computing
snakemake \
  --configfile .tests/config/simple_config.yaml \
  --config \
    data_location=.tests/data_CHR17 \
    ashleys_pipeline=True \
    ashleys_pipeline_only=True \
    multistep_normalisation=True \
    MultiQC=True \
  --profile workflow/snakemake_profiles/local/conda/ \
  --cores 6 \
  --dry-run
```

If no error message, you are good to go!

```bash
# snakemake profile: if singularity installed:     workflow/snakemake_profiles/local/conda_singularity/
# snakemake profile: if singularity NOT installed: workflow/snakemake_profiles/local/conda/
snakemake \
  --configfile .tests/config/simple_config.yaml \
  --config \
    data_location=.tests/data_CHR17 \
    ashleys_pipeline=True \
    ashleys_pipeline_only=True \
    multistep_normalisation=True \
    MultiQC=True \
  --profile workflow/snakemake_profiles/local/conda_singularity/ \
  --singularity-args "-B /disk:/disk" \
  --cores 6
```
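While this runs, you can follow progress from another terminal (the log path is snakemake's default; the SLURM check applies only on the cluster):

```bash
# follow the most recent snakemake log
tail -f "$(ls -t .snakemake/log/*.snakemake.log | head -n 1)"
# on the EMBL cluster: check your submitted jobs
squeue -u "$USER"
```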

4. Generate report on example data

```bash
snakemake \
  --cores 6 \
  --configfile .tests/config/simple_config.yaml \
  --profile workflow/snakemake_profiles/local/conda_singularity/ \
  --singularity-args "-B /disk:/disk" \
  --report report.zip \
  --report-stylesheet workflow/report/custom-stylesheet.css
```

Then inspect the key output files:

```bash
cat .tests/data_CHR17/RPE-BM510/counts/RPE-BM510.info_raw
zcat .tests/data_CHR17/RPE-BM510/counts/RPE-BM510.txt.raw.gz | less
cat .tests/data_CHR17/RPE-BM510/cell_selection/labels.tsv
```
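For instance, a one-liner to count the cells kept by ashleys (this assumes the binary prediction sits in column 2 of `labels.tsv`, as in the scNOVA `awk` command later in this document):

```bash
# cells passing QC = non-header rows with prediction == 1
awk 'NR>1 && $2==1' .tests/data_CHR17/RPE-BM510/cell_selection/labels.tsv | wc -l
```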

Look at the plots:

`.tests/data_CHR17/RPE-BM510/plots`

---

**ℹ️ Note**

- Steps 0 - 2 are required only during the first execution
- After the first execution, do not forget to go into the git repository and to activate the snakemake environment

---

**ℹ️ Note for 🇪🇺 EMBL users**

- Use the following profile to run on EMBL cluster: `--profile workflow/snakemake_profiles/HPC/slurm_EMBL`

---

## 🔬​ Start running your own analysis

The following commands show an example using local execution (not HPC or cloud).

1. Start running your own Strand-Seq analysis

```bash
snakemake \
  --cores <N> \
  --config \
    data_location=<INPUT_DATA_FOLDER> \
  --profile workflow/snakemake_profiles/local/conda_singularity/
```

2. Generate report

```bash
snakemake \
  --cores <N> \
  --config \
    data_location=<INPUT_DATA_FOLDER> \
  --profile workflow/snakemake_profiles/local/conda_singularity/ \
  --report <INPUT_DATA_FOLDER>/<REPORT.zip> \
  --report-stylesheet workflow/report/custom-stylesheet.css
```

On the workshop test data, run on the EMBL cluster, the equivalent command is:

```bash
# ashleys_pipeline_only=False -> continue after ashleys QC instead of stopping there
snakemake \
  --configfile .tests/config/simple_config.yaml \
  --config \
    data_location=.tests/data_CHR17 \
    ashleys_pipeline=True \
    ashleys_pipeline_only=False \
    multistep_normalisation=True \
    MultiQC=True \
  --profile workflow/snakemake_profiles/HPC/slurm_EMBL/ \
  --cores 24 \
  --report TEST_DATA_REPORT.zip \
  --report-stylesheet workflow/report/custom-stylesheet.css
```

## System requirements

This workflow is meant to be run on a Unix-based operating system (tested on Ubuntu 18.04 & CentOS 7).

Minimum system requirements vary based on the use case. We highly recommend running it in a server environment with 32+GB RAM and 12+ cores.
- [Conda install instructions](https://conda.io/miniconda.html)
- [Singularity install instructions](https://sylabs.io/guides/3.0/user-guide/quick_start.html#quick-installation-steps)

Questions???

## scNOVA

Prepare the `input_subclonality.txt` file expected by scNOVA, using the ashleys predictions to list the cells that passed QC (here all assigned to a single clone named `clone`):

```bash
mkdir -p .tests/data_CHR17/RPE-BM510/scNOVA_input_user
awk 'BEGIN {FS=OFS="\t"} NR==1 {print "Filename", "Subclonality"} NR>1 && $2==1 {sub(/\.sort\.mdup\.bam/, "", $1); print $1, "clone"}' .tests/data_CHR17/RPE-BM510/cell_selection/labels.tsv > .tests/data_CHR17/RPE-BM510/scNOVA_input_user/input_subclonality.txt
```
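The resulting file should look roughly like this (cell names are illustrative):

```bash
head .tests/data_CHR17/RPE-BM510/scNOVA_input_user/input_subclonality.txt
# Filename      Subclonality
# Cell_01       clone
# Cell_02       clone
```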

Then rerun the pipeline with the scNOVA module enabled:

```bash
conda activate mosaicatcher_env
# scNOVA=True                 -> trigger the scNOVA downstream module
# ashleys_pipeline_only=False -> continue after ashleys QC
snakemake \
  --configfile .tests/config/simple_config.yaml \
  --config \
    data_location=.tests/data_CHR17 \
    ashleys_pipeline=True \
    ashleys_pipeline_only=False \
    multistep_normalisation=True \
    MultiQC=True \
    scNOVA=True \
  --profile workflow/snakemake_profiles/HPC/slurm_EMBL/ \
  --cores 24
```

## Detailed usage

### 🐍 1. Mosaicatcher basic conda environment install

MosaiCatcher leverages snakemake built-in features such as execution within containers and predefined modular conda environments. That's why it is only necessary to create an environment that relies on [snakemake](https://github.com/snakemake/snakemake) (to execute the pipeline) and [pandas](https://github.com/pandas-dev/pandas) (to handle basic configuration). If you plan to generate an HTML web report including plots, it is also necessary to install [imagemagick](https://github.com/ImageMagick/ImageMagick).

If possible, it is also highly recommended to install and use the `mamba` package manager instead of `conda`, as it is much more efficient.

```bash
conda install -c conda-forge mamba
mamba create -n snakemake -c bioconda -c conda-forge -c defaults -c anaconda snakemake
```

### ⤵️ 2. Clone repository & go into workflow directory

After cloning the repo, go into the `workflow` directory, which corresponds to the pipeline entry point.

```bash
git clone --recurse-submodules https://github.com/friendsofstrandseq/mosaicatcher-pipeline.git
cd mosaicatcher-pipeline
```

### ⚙️ 3. MosaiCatcher execution (without preprocessing)
