ℹ️ Important Note
From 2.2.0, you don't need to clone both the ashleys-qc-pipeline preprocessing module and mosaicatcher-pipeline. By using ashleys_pipeline_only=True combined with ashleys_pipeline=True in the MosaiCatcher configuration, execution will stop after the generation of the files related to ashleys-qc-pipeline. This allows you to use a single pipeline, repository and container, and limits potential conflicts caused by processing the same folder (data_location) with different repositories and sets of files (including the workflow/data/ref_genomes reference files).
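For instance, a minimal sketch of such an invocation (the core count and data folder are placeholders; the local conda_singularity profile is used here only as an example, matching the commands shown later in this document):
snakemake \
--cores <N> \
--config \
data_location=<INPUT_DATA_FOLDER> \
ashleys_pipeline=True \
ashleys_pipeline_only=True \
--profile workflow/snakemake_profiles/local/conda_singularity/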
- [Optional] Install Singularity
- A. Create a dedicated conda environment
ℹ️ Note
- Please be careful with your conda/mamba setup: if you applied specific constraints/modifications to your system, this could lead to version discrepancies.
- mamba is usually preferred but might not be installed by default on a shared cluster environment
conda create -n snakemake -c bioconda -c conda-forge -c defaults -c anaconda snakemake
- B. Activate the dedicated conda environment
conda activate snakemake
Reminder: before each execution, verify that this conda environment is activated and provides the right snakemake (the command which snakemake should output something like <FOLDER>/<USER>/[ana|mini]conda3/envs/snakemake/bin/snakemake).
- Clone the repository
git clone --recurse-submodules https://github.com/friendsofstrandseq/mosaicatcher-pipeline.git && cd mosaicatcher-pipeline
- Run on example data on only one small chromosome (<disk> must be replaced by your disk letter/name)
First use the --dry-run option of snakemake to make sure the Graph of Execution is properly connected. (In combination with --dry-run, we use the local/conda profile, as snakemake still presents a bug when looking for the singularity container.)
snakemake \
--cores 6 \
--configfile .tests/config/simple_config.yaml \
--profile workflow/snakemake_profiles/local/conda/ \
--dry-run
If no error message, you are good to go!
# Snakemake Profile: if singularity installed: workflow/snakemake_profiles/local/conda_singularity/
# Snakemake Profile: if singularity NOT installed: workflow/snakemake_profiles/local/conda/
snakemake \
--cores 1 \
--configfile .tests/config/simple_config.yaml \
--profile workflow/snakemake_profiles/local/conda_singularity/ \
--singularity-args "-B /disk:/disk"
- Generate report on example data
snakemake \
--cores 1 \
--configfile .tests/config/simple_config.yaml \
--profile workflow/snakemake_profiles/local/conda_singularity/ \
--singularity-args "-B /disk:/disk" \
--report report.zip \
--report-stylesheet workflow/report/custom-stylesheet.css
ℹ️ Note
- Steps 0 - 2 are required only during the first execution
- After the first execution, do not forget to go into the git repository and activate the snakemake environment
ℹ️ Note for 🇪🇺 EMBL users
- Use the following profile to run on EMBL cluster:
--profile workflow/snakemake_profiles/HPC/slurm_EMBL
If you wish to reproduce the results generated in the MosaiCatcher v2 publication, you can follow the steps below:
- Download and prepare data
As the complete dataset is above 50GB, the data was stored in 2 separate Zenodo repositories: https://zenodo.org/record/7696695 & https://zenodo.org/record/7697329. First create a directory in a location with at least 300GB of free space, then download the 3 tar.gz parts and extract them:
# Specify your location
YOUR_LOCATION=DEFINE_YOUR_PATH_HERE
# Create directory
mkdir -p $YOUR_LOCATION/mosaicatcher_publication_data/COMPRESSED
# Download tar.gz parts
wget https://zenodo.org/record/7696695/files/MOSAICATCHER_DATA_PAPER.tar.gz.part1?download=1 -P $YOUR_LOCATION/mosaicatcher_publication_data/COMPRESSED
wget https://zenodo.org/record/7697329/files/MOSAICATCHER_DATA_PAPER.tar.gz.part2?download=1 -P $YOUR_LOCATION/mosaicatcher_publication_data/COMPRESSED
wget https://zenodo.org/record/7697329/files/MOSAICATCHER_DATA_PAPER.tar.gz.part3?download=1 -P $YOUR_LOCATION/mosaicatcher_publication_data/COMPRESSED
# Extract
cd $YOUR_LOCATION/mosaicatcher_publication_data/COMPRESSED
for f in *part*; do tar -xvf "$f" -C $YOUR_LOCATION/mosaicatcher_publication_data; done
You can then verify the checksums to make sure the files were downloaded properly.
# Verify checksum
for f in C7 RPE-BM510 RPE1-WT LCL; do
cd $YOUR_LOCATION/mosaicatcher_publication_data/"$f"/fastq && md5sum -c *.md5 && cd $YOUR_LOCATION
done
If all the checks were successful, you can now remove the COMPRESSED directory before running the pipeline.
rm -r $YOUR_LOCATION/mosaicatcher_publication_data/COMPRESSED
- Run the pipeline
A generic slurm profile is provided in the following directory: mosaicatcher-pipeline/workflow/snakemake_profiles/HPC/slurm_generic. Feel free to edit it or create another profile specific to your cluster/institute.
snakemake \
--config \
data_location=$YOUR_LOCATION/mosaicatcher_publication_data \
reference=T2T \
split_qc_plot=True \
ashleys_pipeline=True \
multistep_normalisation=True \
multistep_normalisation_for_SV_calling=False hgsvc_based_normalized_counts=True \
--profile workflow/snakemake_profiles/HPC/slurm_generic/
- Generate the report
snakemake \
--config \
data_location=$YOUR_LOCATION/mosaicatcher_publication_data \
reference=T2T \
split_qc_plot=True \
ashleys_pipeline=True \
multistep_normalisation=True \
multistep_normalisation_for_SV_calling=False hgsvc_based_normalized_counts=True \
--profile workflow/snakemake_profiles/HPC/slurm_generic/ \
--report $YOUR_LOCATION/MosaiCatcher_V2_publication_data.report.zip --report-stylesheet workflow/report/custom-stylesheet.css
You can then extract the zip archive and open the report.html file to access the HTML report.
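For example, a minimal sketch (the archive name matches the report path used above; the destination directory is an arbitrary choice, and the report.html file typically sits at the top level of the archive):
unzip $YOUR_LOCATION/MosaiCatcher_V2_publication_data.report.zip -d report_dir
# then open report_dir/report.html in a web browser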
The following commands show an example using local execution (not HPC or cloud).
- Start running your own Strand-Seq analysis
snakemake \
--cores <N> \
--config \
data_location=<INPUT_DATA_FOLDER> \
--profile workflow/snakemake_profiles/local/conda_singularity/
- Generate report
snakemake \
--cores <N> \
--config \
data_location=<INPUT_DATA_FOLDER> \
--profile workflow/snakemake_profiles/local/conda_singularity/ \
--report <INPUT_DATA_FOLDER>/<REPORT.zip> \
--report-stylesheet workflow/report/custom-stylesheet.css
This workflow is meant to be run on a Unix-based operating system (tested on Ubuntu 18.04 & CentOS 7).
Minimum system requirements vary based on the use case. We highly recommend running it in a server environment with 32+GB RAM and 12+ cores.
MosaiCatcher leverages snakemake built-in features such as execution within containers and predefined modular conda environments. That is why it is only necessary to create an environment that provides snakemake (to execute the pipeline) and pandas (to handle basic configuration). If you plan to generate the HTML Web report including plots, it is also necessary to install imagemagick.
If possible, it is also highly recommended to install and use the mamba package manager instead of conda, as it is much more efficient.
conda install -c conda-forge mamba
mamba create -n snakemake -c bioconda -c conda-forge -c defaults -c anaconda snakemake
conda activate snakemake
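If you also want pandas and imagemagick in the same environment (mentioned above for configuration handling and HTML report generation), one possible sketch is to install them at creation time (package names from the conda-forge/bioconda channels; adapt to your setup):
mamba create -n snakemake -c conda-forge -c bioconda snakemake pandas imagemagick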
After cloning the repo, go into the workflow directory, which corresponds to the pipeline entry point.
git clone --recurse-submodules https://github.com/friendsofstrandseq/mosaicatcher-pipeline.git
cd mosaicatcher-pipeline
It is important to follow these rules for Strand-Seq single-cell BAM data:
- BAM file names must end with the suffix .sort.mdup.bam
- One BAM file per cell
- BAM files must be sorted and indexed
  - If BAM files are not indexed, please use a writable folder so that the pipeline can generate the .bai index files itself
  - The timestamp of the index files must be newer than that of the BAM files
- Each BAM file must contain a read group (@RG) with a common sample name (SM), which must match the folder name (sampleName above)
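A quick way to sanity-check these requirements (a minimal sketch assuming samtools is installed; file paths are illustrative):
# Check that the read group sample name (SM) matches the sample folder name
samtools view -H Parent_folder/Sample_1/bam/Cell_01.sort.mdup.bam | grep '^@RG'
# Check that each BAM file has a valid header and end-of-file marker
samtools quickcheck Parent_folder/Sample_1/bam/*.sort.mdup.bam && echo "all BAM files look OK"
# Index a BAM file that is missing its .bai
samtools index Parent_folder/Sample_1/bam/Cell_01.sort.mdup.bam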
From MosaiCatcher version ≥ 1.6.1, a snakemake checkpoint rule (filter_bad_cells) was defined to automatically remove from the analysis the cells flagged as low-quality.
The ground state used to define this status (usable/not usable) is determined from the column pass1 (enough coverage to process the cell) in the mosaic count info file (see example below). The value of pass1 is not static and can change according to the window size used (the larger the window, the higher the number of reads to retrieve in a given bin).
Cells flagged as low-quality are listed in the following TSV file: <OUTPUT>/cell_selection/<SAMPLE>/labels.tsv.
In its current flavour, MosaiCatcher requires the input data to be formatted the following way:
Parent_folder
|-- Sample_1
| `-- bam
| |-- Cell_01.sort.mdup.bam
| |-- Cell_02.sort.mdup.bam
| |-- Cell_03.sort.mdup.bam
| `-- Cell_04.sort.mdup.bam
|
`-- Sample_2
`-- bam
|-- Cell_21.sort.mdup.bam
|-- Cell_22.sort.mdup.bam
|-- Cell_23.sort.mdup.bam
`-- Cell_24.sort.mdup.bam
In a Parent_Folder, create a subdirectory Parent_Folder/sampleName/ for each sample. The Strand-seq BAM files of this sample go into the following folder:
- bam for the total set of BAM files
Using the classic behavior, cells flagged as low-quality will only be determined based on coverage (see Note here).
Previous versions of MosaiCatcher (version ≤ 1.5) required not only an all directory (now renamed bam) as described above, but also a selected folder, containing only the high-quality selected libraries to be processed for the rest of the analysis.
You can still use this behavior by enabling the corresponding config parameter, either on the command line with input_bam_legacy=True or by modifying the corresponding entry in the config/config.yaml file.
Thus, in a Parent_Folder, create a subdirectory Parent_Folder/sampleName/ for each sample. The Strand-seq BAM files of this sample go into the following folders:
- bam (formerly all) for the total set of BAM files
- selected for the subset of BAM files that were identified as high-quality (either by copy or symlink)
Your <INPUT> directory should look like this:
Parent_folder
|-- Sample_1
| |-- bam
| | |-- Cell_01.sort.mdup.bam
| | |-- Cell_02.sort.mdup.bam
| | |-- Cell_03.sort.mdup.bam
| | `-- Cell_04.sort.mdup.bam
| `-- selected
| |-- Cell_01.sort.mdup.bam
| `-- Cell_04.sort.mdup.bam
|
`-- Sample_2
|-- bam
| |-- Cell_01.sort.mdup.bam
| |-- Cell_02.sort.mdup.bam
| |-- Cell_03.sort.mdup.bam
| `-- Cell_04.sort.mdup.bam
`-- selected
|-- Cell_03.sort.mdup.bam
`-- Cell_04.sort.mdup.bam
Using the input_bam_legacy parameter, cells flagged as low-quality will be determined based both on their presence in the selected folder presented above and on coverage (see Note here): only the intersection between the cells present in the selected folder and the cells with enough coverage will be kept. Example: if a library is present in the selected folder but has a low coverage (see Note here), it will not be processed. An example invocation is sketched below.
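A minimal sketch of enabling the legacy layout on the command line (the core count and input folder are placeholders):
snakemake \
--cores <N> \
--config \
data_location=<INPUT_DATA_FOLDER> \
input_bam_legacy=True \
--profile workflow/snakemake_profiles/local/conda_singularity/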
From MosaiCatcher version ≥ 1.6.1, it is possible to use the ashleys-qc-pipeline preprocessing module as part of MosaiCatcher. To do so, the user needs to enable the config parameter ashleys_pipeline=True and create a directory respecting the structure below (based on the model used for BAM inputs):
Parent_folder
|-- Sample_1
| `-- fastq
| |-- Cell_01.1.fastq.gz
| |-- Cell_01.2.fastq.gz
| |-- Cell_02.1.fastq.gz
| |-- Cell_02.2.fastq.gz
| |-- Cell_03.1.fastq.gz
| |-- Cell_03.2.fastq.gz
| |-- Cell_04.1.fastq.gz
| `-- Cell_04.2.fastq.gz
|
`-- Sample_2
`-- fastq
|-- Cell_21.1.fastq.gz
|-- Cell_21.2.fastq.gz
|-- Cell_22.1.fastq.gz
|-- Cell_22.2.fastq.gz
|-- Cell_23.1.fastq.gz
|-- Cell_23.2.fastq.gz
|-- Cell_24.1.fastq.gz
`-- Cell_24.2.fastq.gz
Thus, in a Parent_Folder, create a subdirectory Parent_Folder/sampleName/ for each sample. The Strand-seq FASTQ files of each sample need to go into the fastq folder and respect the following syntax: <CELL>.<1|2>.fastq.gz, where 1|2 corresponds to the pair identifier.
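A minimal naming sanity check (a sketch assuming GNU find; the parent folder path is a placeholder): it lists any FASTQ file that does not match the expected <CELL>.<1|2>.fastq.gz pattern, so no output means all names are compliant.
find <PARENT_FOLDER>/*/fastq -name "*.fastq.gz" ! -regex ".*\.[12]\.fastq\.gz"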
Information and modes of execution can be found in the ashleys-qc-pipeline documentation.
- ashleys_pipeline=True and input_bam_legacy=True are mutually exclusive
- We advise limiting the number of "_" characters in your file names to 2 (the ashleys-qc ML tool will react weirdly with more than 3 "_")
After defining your configuration, you can launch the pipeline the following way:
snakemake \
--cores <N> \
--config \
data_location=<INPUT_FOLDER> \
--profile workflow/snakemake_profiles/local/conda/
After defining your configuration, you can launch the pipeline the following way:
snakemake \
--cores <N> \
--config \
data_location=<INPUT_FOLDER> \
--profile workflow/snakemake_profiles/local/conda_singularity/ --singularity-args "-B /<mounting_point>:/<mounting_point>"
ℹ️ Note
It is possible to provide multiple mounting points between the system and the container using as many -B options as needed in the singularity-args command, like the following: "-B /<mounting_point1>:/<mounting_point1> -B /<mounting_point2>:/<mounting_point2>"
For EMBL users, you don't need to specify this, as it is already part of the execution profile (workflow/snakemake_profiles/HPC/slurm_EMBL)
ℹ️ Note
It is recommended to first run the command with the --dry-run option and check for any anomalies.
If you are experiencing any issues with the conda-frontend snakemake option, please use --conda-frontend conda instead of mamba.
MosaiCatcher can be executed on HPC using Slurm by leveraging the snakemake profile feature. The current Slurm profile [workflow/snakemake_profiles/HPC/slurm_EMBL/config.yaml] was defined and tested on the EMBL HPC cluster, but can be modified, especially regarding the partition setting.
Workflow HPC execution usually needs to deal with out-of-memory (OOM) errors, lack of disk space, abnormal paths or missing parameters for the scheduler. To deal with OOM, we currently use the snakemake restart feature, which automatically doubles the memory allocated to a job at each attempt (limited to 6 attempts for the moment). Thus, if a job fails to run with the default 1GB of memory allocated, it will automatically be restarted with 2GB at the 2nd attempt, 4GB at the 3rd, and so on.
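For illustration only (the provided HPC profiles already configure this behaviour, and the retry count shown is an assumption): the restart mechanism corresponds to snakemake's --retries option, while the per-attempt memory scaling is defined in the profile's resource settings.
snakemake \
--config \
data_location=<INPUT_FOLDER> \
--retries 5 \
--profile workflow/snakemake_profiles/HPC/slurm_generic/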
To execute MosaiCatcher on HPC, use the following command.
snakemake \
--config \
data_location=<INPUT_FOLDER> \
--singularity-args "-B /<mounting_point>:/<mounting_point>" \
--profile workflow/snakemake_profiles/HPC/slurm_generic/
The logs and errors directories will be automatically created in the current directory, corresponding respectively to the output and error parameters of the sbatch command.
Optionally, it is also possible to generate a static interactive HTML report to explore the different plots produced by MosaiCatcher, using the same command as before and adding the following at the end: --report <report>.zip --report-stylesheet workflow/report/custom-stylesheet.css, which corresponds to the following complete command:
snakemake \
--config \
data_location=<INPUT_FOLDER> \
--singularity-args "-B /<mounting_point>:/<mounting_point>" \
--profile workflow/snakemake_profiles/HPC/slurm_generic/ \
--report <report>.zip --report-stylesheet workflow/report/custom-stylesheet.css
ℹ️ Note
The zip file produced can be heavy (~1GB for 24 HGSVC samples; 2000 cells) if multiple samples are processed in parallel in the same output folder.
From 1.9.0, it is now possible to run MosaiCatcher in order to genotype a given list of positions specified in a BED file. To do so, the arbigent config parameter needs to be set to True.
Thus, an alternative branch of the pipeline will be executed instead of the classic one, targeting the results produced by ArbiGent.
snakemake \
--cores <N> \
--config \
data_location=<INPUT_DATA_FOLDER> \
arbigent=True \
--profile workflow/snakemake_profiles/local/conda_singularity/
A generic BED file is provided in workflow/data/arbigent/manual_segmentation.bed. A custom BED file can be used by setting the arbigent_bed_file config parameter to the path of your choice.
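For example, a sketch of pointing the pipeline to a custom BED file (the BED path is a placeholder):
snakemake \
--cores <N> \
--config \
data_location=<INPUT_DATA_FOLDER> \
arbigent=True \
arbigent_bed_file=<PATH_TO_CUSTOM_BED> \
--profile workflow/snakemake_profiles/local/conda_singularity/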
ℹ️ Note
If you modify the chromosome list to be processed (removing chrX & chrY for instance), a dedicated rule will extract from the BED file only the rows matching the list of chromosomes to be analysed.
From 1.9.2, it is now possible to run scNOVA directly from MosaiCatcher in order to determine the nucleosome occupancy associated with the SV calls produced during MosaiCatcher execution.
To do so, MosaiCatcher must first be executed in order to generate the SV calls and the associated plots/tables required to determine subclonality.
Once the subclonality has been determined, a table respecting the following template must be provided at the path <DATA_LOCATION>/<SAMPLE>/scNOVA_input_user/input_subclonality.txt, where the different clones are named clone<N>:
| Filename | Subclonality |
|---|---|
| TALL3x1_DEA5_PE20406 | clone2 |
| TALL3x1_DEA5_PE20414 | clone2 |
| TALL3x1_DEA5_PE20415 | clone1 |
| TALL3x1_DEA5_PE20416 | clone1 |
| TALL3x1_DEA5_PE20417 | clone1 |
| TALL3x1_DEA5_PE20418 | clone1 |
| TALL3x1_DEA5_PE20419 | clone1 |
| TALL3x1_DEA5_PE20421 | clone1 |
| TALL3x1_DEA5_PE20422 | clone1 |
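A minimal sketch of creating this table from the shell (assuming a tab-separated file with the header shown above; paths and clone assignments are illustrative):
mkdir -p <DATA_LOCATION>/<SAMPLE>/scNOVA_input_user
printf 'Filename\tSubclonality\n' > <DATA_LOCATION>/<SAMPLE>/scNOVA_input_user/input_subclonality.txt
printf 'TALL3x1_DEA5_PE20406\tclone2\n' >> <DATA_LOCATION>/<SAMPLE>/scNOVA_input_user/input_subclonality.txt
printf 'TALL3x1_DEA5_PE20415\tclone1\n' >> <DATA_LOCATION>/<SAMPLE>/scNOVA_input_user/input_subclonality.txt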
Once done, you can run the exact same command as previously and set the scNOVA config parameter to True, like the following:
snakemake \
--cores <N> \
--config \
data_location=<INPUT_DATA_FOLDER> \
scNOVA=True \
--profile workflow/snakemake_profiles/local/conda_singularity/
ℹ️ Note
scNOVA-related snakemake rules and scripts only support conda execution at the moment.
From 2.1.0, it is now possible to use the counts file resulting from multistep normalisation (library size normalisation, GC correction, Variance Stabilising Transformation) not only for visualisation purposes, but also to rely on it during the SV calling framework. To do so:
snakemake \
--cores <N> \
--config \
data_location=<INPUT_DATA_FOLDER> \
multistep_normalisation=True \
multistep_normalisation_for_SV_calling=True \
--profile workflow/snakemake_profiles/local/conda_singularity/
From 2.2.2, scTRIP multiplot (from Marco Cosenza) is now compatible with MosaiCatcher. The single requirement is to clone the scTRIP multiplot repository (please reach out to Marco if you want to access the repository, which is currently private) inside workflow/scripts/plotting/scTRIP_multiplot.
By default, scTRIP multiplot is set to false. To enable it, please update the config/config.yaml file or use scTRIP_multiplot=True in the config section of the command line. An example of a scTRIP multiplot is available here.
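A minimal sketch of enabling it on the command line (placeholders are illustrative and assume the repository has been cloned as described above):
snakemake \
--cores <N> \
--config \
data_location=<INPUT_DATA_FOLDER> \
scTRIP_multiplot=True \
--profile workflow/snakemake_profiles/local/conda_singularity/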
ℹ️ Note
multistep_normalisation_for_SV_calling is mutually exclusive with hgsvc_based_normalisation
If you already use a previous version of mosaicatcher-pipeline, here is a short procedure to update it:
- First, update all origin/ refs to latest:
git fetch --all
- Jump to a new version (for example 2.2.2) & pull code:
git checkout 2.2.2 && git pull
Then, to initialise or update the git snakemake_profiles submodule:
git submodule update --init --recursive