Merge pull request nf-core#50 from LouisLeNezet/nf-core-template-merge-2.14.1

Nf core template merge 2.14.1
LouisLeNezet · May 13, 2024 · fd603b0
2 parents e9337f4 + e8cb498
commit fd603b0
Showing 307 changed files with 15,016 additions and 626 deletions.
diff --git a/.editorconfig b/.editorconfig
@@ -31,3 +31,9 @@ indent_size = unset
 # ignore python and markdown
 [*.{py,md}]
 indent_style = unset
+
+[/docs/*.xml]
+indent_style = unset
+
+[/docs/images/metro/*.xml]
+indent_style = unset
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -1,5 +1,6 @@
 name: nf-core CI
 # This workflow runs the pipeline with the minimal test dataset to check that it completes without any syntax errors
+
 on:
   push:
     branches:
@@ -26,6 +27,11 @@ jobs:
         NXF_VER:
           - "23.04.0"
           - "latest-everything"
+        TEST_PROFILE:
+          - "test"
+          - "test_sim"
+          - "test_quilt"
+          - "test_stitch"
     steps:
       - name: Check out pipeline code
         uses: actions/checkout@0ad4b8fadaa221de15dcec353f45205ec38ea70b # v4
@@ -43,4 +49,4 @@ jobs:
         # For example: adding multiple test runs with different parameters
         # Remember that you can parallelise this by using strategy.matrix
         run: |
-          nextflow run ${GITHUB_WORKSPACE} -profile test,docker --outdir ./results
+          nextflow run ${GITHUB_WORKSPACE} -profile "${{ matrix.TEST_PROFILE }}",docker --outdir ./results
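Review note: with this change each `NXF_VER` entry is crossed with each `TEST_PROFILE` entry. The Python sketch below enumerates the resulting jobs, assuming these are the only two matrix axes (the hunk does not show the full `strategy` block). The helper name `expand_matrix` is hypothetical.

```python
from itertools import product

# Matrix axes as they appear in the hunk; other axes may exist outside it.
NXF_VER = ["23.04.0", "latest-everything"]
TEST_PROFILE = ["test", "test_sim", "test_quilt", "test_stitch"]


def expand_matrix(versions, profiles):
    """Return one (version, command) pair per matrix combination."""
    jobs = []
    for version, profile in product(versions, profiles):
        command = (
            "nextflow run ${GITHUB_WORKSPACE} "
            f'-profile "{profile}",docker --outdir ./results'
        )
        jobs.append((version, command))
    return jobs


jobs = expand_matrix(NXF_VER, TEST_PROFILE)
print(len(jobs))  # 8
```

Under that assumption, two Nextflow versions times four profiles gives eight CI runs per push.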
diff --git a/.gitignore b/.gitignore
@@ -6,3 +6,5 @@ results/
 testing/
 testing*
 *.pyc
+*.code-workspace
+.nf-test*
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -9,8 +9,23 @@ Initial release of nf-core/phaseimpute, created with the [nf-core](https://nf-co
 
 ### `Added`
 
+### `Changed`
+
+- [#18](https://github.com/nf-core/phaseimpute/pull/18)
+  - Maps and region by chromosome
+  - update tests config files
+  - correct meta map propagation
+  - Test impute and test sim works
+- [#19](https://github.com/nf-core/phaseimpute/pull/19) - Changed reference panel to accept a csv, update modules and subworkflows (glimpse1/2 and shapeit5)
+- [#20](https://github.com/nf-core/phaseimpute/pull/20) - Added automatic detection of vcf contigs for the reference panel and automatic renaming available
+- [#22](https://github.com/nf-core/phaseimpute/pull/22) - Add validation step for concordance analysis. Input channels changed to match input steps. Outdir folder organised by steps. Modules config by subworkflows.
+- [#26](https://github.com/nf-core/phaseimpute/pull/26) - Added QUILT method
+
 ### `Fixed`
 
+- [#15](https://github.com/nf-core/phaseimpute/pull/15) - Changed test csv files to point to nf-core repository
+- [#16](https://github.com/nf-core/phaseimpute/pull/16) - Removed outdir from test config files
+
 ### `Dependencies`
 
 ### `Deprecated`
diff --git a/CITATIONS.md b/CITATIONS.md
@@ -10,9 +10,21 @@
 
 ## Pipeline tools
 
-- [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
+- [QUILT](https://pubmed.ncbi.nlm.nih.gov/34083788/)
 
-  > Andrews, S. (2010). FastQC: A Quality Control Tool for High Throughput Sequence Data [Online].
+  > Davies, R. W., Kucka, M., Su, D., Shi, S., Flanagan, M., Cunniff, C. M., ... & Myers, S. (2021). Rapid genotype imputation from sequence with reference panels. Nature genetics, 53(7), 1104-1111.
+
+- [GLIMPSE](https://www.nature.com/articles/s41588-020-00756-0)
+
+  > Rubinacci, S., Ribeiro, D. M., Hofmeister, R. J., & Delaneau, O. (2021). Efficient phasing and imputation of low-coverage sequencing data using large reference panels. Nature Genetics, 53(1), 120-126.
+
+- [Shapeit](https://odelaneau.github.io/shapeit5/)
+
+  > Hofmeister RJ, Ribeiro DM, Rubinacci S., Delaneau O. (2023). Accurate rare variant phasing of whole-genome and whole-exome sequencing data in the UK Biobank. Nature Genetics doi: https://doi.org/10.1038/s41588-023-01415-w
+
+- [bcftools](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3198575/)
+
+  > Li, H. (2011). A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics, 27(21), 2987-2993.
 
 - [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)
 

diff --git a/README.md b/README.md
@@ -19,50 +19,43 @@
 
 ## Introduction
 
-**nf-core/phaseimpute** is a bioinformatics pipeline that ...
+**nf-core/phaseimpute** is a bioinformatics pipeline to phase and impute genetic data. Several steps are available, each corresponding to a dedicated mode.
 
-<!-- TODO nf-core:
-   Complete this sentence with a 2-3 sentence summary of what types of data the pipeline ingests, a brief overview of the
-   major pipeline sections and the types of output it produces. You're giving an overview to someone new
-   to nf-core here, in 15-20 seconds. For an example, see https://github.com/nf-core/rnaseq/blob/master/README.md#introduction
--->
+### Main steps of the pipeline
 
-<!-- TODO nf-core: Include a figure that guides the user through the major workflow steps. Many nf-core
-     workflows use the "tube map" design for that. See https://nf-co.re/docs/contributing/design_guidelines#examples for examples.   -->
-<!-- TODO nf-core: Fill in short bullet-pointed list of the default steps in the pipeline -->
+The **phaseimpute** pipeline consists of 5 main steps:
 
-1. Read QC ([`FastQC`](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/))
-2. Present QC for raw reads ([`MultiQC`](http://multiqc.info/))
+| Metro map                                                              | Modes                                                                                                                                                                                                                                                                                                                                                                                                                         |
+| ---------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| <img src="docs/images/metro/MetroMap.png" alt="metromap" width="800"/> | - **Pre-processing**: Phasing, QC, variant filtering, variant annotation of the reference panel <br> - **Phase**: Phasing of the target dataset on the reference panel <br> - **Simulate**: Simulation of the target dataset from high quality target data <br> - **Concordance**: Concordance between the target dataset and a truth dataset <br> - **Post-processing**: Variant filtering based on their imputation quality |
 
 ## Usage
 
 > [!NOTE]
 > If you are new to Nextflow and nf-core, please refer to [this page](https://nf-co.re/docs/usage/installation) on how to set-up Nextflow. Make sure to [test your setup](https://nf-co.re/docs/usage/introduction#how-to-run-a-pipeline) with `-profile test` before running the workflow on actual data.
 
-<!-- TODO nf-core: Describe the minimum required steps to execute the pipeline, e.g. how to prepare samplesheets.
-     Explain what rows and columns represent. For instance (please edit as appropriate):
-
+The basic usage of this pipeline is to impute a target dataset based on a phased panel.
 First, prepare a samplesheet with your input data that looks as follows:
 
 `samplesheet.csv`:
 
 ```csv
-sample,fastq_1,fastq_2
-CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz
+sample,bam,bai
+1_BAM_1X,/path/to/.bam,/path/to/.bai
 ```
 
-Each row represents a fastq file (single-end) or a pair of fastq files (paired end).
-
--->
+Each row represents a bam file with its index file.
 
 Now, you can run the pipeline using:
 
-<!-- TODO nf-core: update the following command to include all required parameters for a minimal example -->
-
 ```bash
 nextflow run nf-core/phaseimpute \
    -profile <docker/singularity/.../institute> \
    --input samplesheet.csv \
+   --genome "GRCh38" \
+   --panel <phased_reference_panel.vcf.gz> \
+   --steps "impute" \
+   --tools "glimpse1" \
    --outdir <OUTDIR>
 ```
 
@@ -72,6 +65,19 @@ nextflow run nf-core/phaseimpute \
 
 For more details and further functionality, please refer to the [usage documentation](https://nf-co.re/phaseimpute/usage) and the [parameter documentation](https://nf-co.re/phaseimpute/parameters).
 
+## Description of the different modes of the pipeline
+
+Here is a short description of the different modes of the pipeline.
+For more information please refer to the [documentation](https://nf-core.github.io/phaseimpute/usage/).
+
+| Mode | Flow chart | Description |
+| --- | --- | --- |
+| **Preprocessing** | <img src="docs/images/metro/PreProcessing.png" alt="phase_metro" width="600"/> | The preprocessing mode prepares the input files that will be used by the phasing process. <br> The main processes are: <br> - **Haplotype phasing** of the reference panel using [**Shapeit5**](https://odelaneau.github.io/shapeit5/). <br> - **Filtering** the reference panel to select only the necessary variants. <br> - **Chunking the reference panel** into subsets of regions for all chromosomes. <br> - **Extracting** the positions at which to perform the imputation. |
+| **Phasing** | <img src="docs/images/metro/Phase.png" alt="phase_metro" width="600"/> | The phasing mode is the core mode of this pipeline. <br> It consists of 3 main steps: <br> - **Phasing**: Phasing of the target dataset on the reference panel using one of: <br> &emsp; - [**Glimpse1**](https://odelaneau.github.io/GLIMPSE/glimpse1/index.html) <br> &emsp; This requires computing the genotype likelihoods of the target dataset. <br> &emsp; This step is done using [BCFTOOLS_mpileup](https://samtools.github.io/bcftools/bcftools.html#mpileup) <br> &emsp; - [**Glimpse2**](https://odelaneau.github.io/GLIMPSE/glimpse2/index.html) For this step the reference panel is transformed to binary chunks. <br> &emsp; - [**Stitch**](https://github.com/rwdavies/stitch) <br> &emsp; - [**Quilt**](https://github.com/rwdavies/QUILT) <br> - **Ligation**: all the different chunks are merged together. <br> - **Sampling** (optional) |
+| **Simulate** | <img src="docs/images/metro/Simulate.png" alt="simulate_metro" width="600"/> | The simulation mode is used to create artificial low-information genetic data from high-density data. This allows the imputed result to be compared to a _truth_ and therefore the quality of the imputation to be evaluated. <br> For the moment it is possible to simulate: <br> - Low-pass data by **downsampling** BAM or CRAM files using [SAMTOOLS_view -s]() at different depths <br> - Genotype data by **selecting the SNP** positions used by a designated SNP chip. <br> The simulation mode will also compute the **genotype likelihoods** of the high-density data. |
+| **Concordance** | <img src="docs/images/metro/Concordance.png" alt="concordance_metro" width="600"/> | This mode compares two VCF files to compute a summary of the differences between them. <br> To do so it uses either: <br> - The [**Glimpse1**](https://odelaneau.github.io/GLIMPSE/glimpse1/index.html) concordance process. <br> - The [**Glimpse2**](https://odelaneau.github.io/GLIMPSE/glimpse2/index.html) concordance process. <br> - Or it converts the two VCF files to `.zarr` using [**scikit-allel**](https://scikit-allel.readthedocs.io/en/stable/) and [**anndata**](https://anndata.readthedocs.io/en/latest/) before comparing the SNPs. |
+| **Postprocessing** | <img src="docs/images/metro/PostProcessing.png" alt="postprocessing_metro" width="600"/> | This final process enables looping over the whole pipeline to increase the performance of the imputation. To do so, it selects the best-imputed positions and reruns the analysis using these positions. |
+
 ## Pipeline output
 
 To see the results of an example test run with a full size dataset refer to the [results](https://nf-co.re/phaseimpute/results) tab on the nf-core website pipeline page.
@@ -80,16 +86,20 @@ For more details about the output files and reports, please refer to the
 
 ## Credits
 
-nf-core/phaseimpute was originally written by LouisLeNezet.
+nf-core/phaseimpute was originally written by Louis Le Nézet.
 
 We thank the following people for their extensive assistance in the development of this pipeline:
 
-<!-- TODO nf-core: If applicable, make list of people who have also contributed -->
+- Anabella Trigila
+- Saul Pierotti
+- Eugenia Fontecha
+- Matias Romero Victorica
 
 ## Contributions and Support
 
 If you would like to contribute to this pipeline, please see the [contributing guidelines](.github/CONTRIBUTING.md).
 
+For further information or help, don't hesitate to get in touch on the [Slack `#phaseimpute` channel](https://nfcore.slack.com/channels/phaseimpute) (you can join with [this invite](https://nf-co.re/join/slack)).
 For further information or help, don't hesitate to get in touch on the [Slack `#phaseimpute` channel](https://nfcore.slack.com/channels/phaseimpute) (you can join with [this invite](https://nf-co.re/join/slack)).
 
 ## Citations
@@ -99,6 +109,14 @@ For further information or help, don't hesitate to get in touch on the [Slack `#
 
 <!-- TODO nf-core: Add bibliography of tools and data used in your pipeline -->
 
+You can cite one of the main imputation methods ([`QUILT`](https://github.com/rwdavies/QUILT)) as follows:
+
+> **Rapid genotype imputation from sequence with reference panels.**
+>
+> Davies, R. W., Kucka, M., Su, D., Shi, S., Flanagan, M., Cunniff, C. M., Chan, Y. F., & Myers, S.
+>
+> _Nature genetics_ 2021 June 03. doi: [10.1038/s41588-021-00877-0](https://doi.org/10.1038/s41588-021-00877-0)
+
 An extensive list of references for the tools used by the pipeline can be found in the [`CITATIONS.md`](CITATIONS.md) file.
 
 You can cite the `nf-core` publication as follows:

diff --git a/assets/chr_rename_add.txt b/assets/chr_rename_add.txt
@@ -0,0 +1,39 @@
+1 chr1
+2 chr2
+3 chr3
+4 chr4
+5 chr5
+6 chr6
+7 chr7
+8 chr8
+9 chr9
+10 chr10
+11 chr11
+12 chr12
+13 chr13
+14 chr14
+15 chr15
+16 chr16
+17 chr17
+18 chr18
+19 chr19
+20 chr20
+21 chr21
+22 chr22
+23 chr23
+24 chr24
+25 chr25
+26 chr26
+27 chr27
+28 chr28
+29 chr29
+30 chr30
+31 chr31
+32 chr32
+33 chr33
+34 chr34
+35 chr35
+36 chr36
+37 chr37
+38 chr38
+X chrX
diff --git a/assets/chr_rename_del.txt b/assets/chr_rename_del.txt
@@ -0,0 +1,39 @@
+chr1 1
+chr2 2
+chr3 3
+chr4 4
+chr5 5
+chr6 6
+chr7 7
+chr8 8
+chr9 9
+chr10 10
+chr11 11
+chr12 12
+chr13 13
+chr14 14
+chr15 15
+chr16 16
+chr17 17
+chr18 18
+chr19 19
+chr20 20
+chr21 21
+chr22 22
+chr23 23
+chr24 24
+chr25 25
+chr26 26
+chr27 27
+chr28 28
+chr29 29
+chr30 30
+chr31 31
+chr32 32
+chr33 33
+chr34 34
+chr35 35
+chr36 36
+chr37 37
+chr38 38
+chr39 X
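Review note: the two asset files use a two-column "old new" layout, which resembles the mapping file accepted by `bcftools annotate --rename-chrs` (that usage is an assumption, the diff does not show where the files are consumed). The last line of `chr_rename_add.txt` maps `X` to `chrX`, while the last line of `chr_rename_del.txt` maps `chr39` to `X`, so adding and then removing the prefix would not restore `X`. Whether that asymmetry is intended is not stated in the commit. A minimal Python sketch for checking round trips, using hypothetical excerpts and hypothetical helper names:

```python
def parse_rename_map(text):
    """Parse 'old new' pairs, one per line, into a dict."""
    mapping = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        old, new = line.split()
        mapping[old] = new
    return mapping


def round_trip_failures(add_map, del_map):
    """Return contig names that do not map back to themselves via add then del."""
    return sorted(
        name for name, renamed in add_map.items()
        if del_map.get(renamed) != name
    )


# Hypothetical excerpts, not the full asset files.
add_map = parse_rename_map("1 chr1\n2 chr2\nX chrX\n")
del_map = parse_rename_map("chr1 1\nchr2 2\nchr39 X\n")
print(round_trip_failures(add_map, del_map))  # ['X']
```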
diff --git a/assets/panel.csv b/assets/panel.csv
@@ -0,0 +1,3 @@
+panel,chr,vcf,index
+1000GP,chr21,1000GP_21.vcf,1000GP_21.vcf.csi
+1000GP,chr22,1000GP_22.vcf,1000GP_22.vcf.csi
diff --git a/assets/regionsheet.csv b/assets/regionsheet.csv
@@ -0,0 +1,2 @@
+chr,start,end
+20,20000000,22000000
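Review note: a region sheet like this is easy to mistype by one digit. The Python sketch below flags rows whose `start` exceeds their `end`, assuming the pipeline expects `start <= end` (the commit does not state this). The rows and the helper name `invalid_regions` are illustrative only.

```python
import csv
import io


def invalid_regions(csv_text):
    """Return rows whose start coordinate is greater than their end coordinate."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if int(row["start"]) > int(row["end"])]


# Illustrative rows: the first has a start larger than its end.
sample = "chr,start,end\n20,20000000,2200000\n21,16570000,16610000\n"
for row in invalid_regions(sample):
    print(row["chr"], row["start"], row["end"])
```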
diff --git a/assets/samplesheet.csv b/assets/samplesheet.csv
@@ -1,3 +1,3 @@
-sample,fastq_1,fastq_2
-SAMPLE_PAIRED_END,/path/to/fastq/files/AEG588A1_S1_L002_R1_001.fastq.gz,/path/to/fastq/files/AEG588A1_S1_L002_R2_001.fastq.gz
-SAMPLE_SINGLE_END,/path/to/fastq/files/AEG588A4_S4_L003_R1_001.fastq.gz,
+sample,bam,bai
+1_BAM_1X,/path/to/.bam,/path/to/.bai
+1_BAM_SNP,/path/to/.bam,/path/to/.bai
diff --git a/assets/schema_input.json b/assets/schema_input.json
@@ -1,7 +1,7 @@
 {
     "$schema": "http://json-schema.org/draft-07/schema",
     "$id": "https://raw.githubusercontent.com/nf-core/phaseimpute/master/assets/schema_input.json",
-    "title": "nf-core/phaseimpute pipeline - params.input schema",
+    "title": "nf-core/phaseimpute pipeline - params.input",
     "description": "Schema for the file provided with params.input",
     "type": "array",
     "items": {
@@ -13,21 +13,17 @@
                 "errorMessage": "Sample name must be provided and cannot contain spaces",
                 "meta": ["id"]
             },
-            "fastq_1": {
+            "file": {
                 "type": "string",
-                "format": "file-path",
-                "exists": true,
-                "pattern": "^\\S+\\.f(ast)?q\\.gz$",
-                "errorMessage": "FastQ file for reads 1 must be provided, cannot contain spaces and must have extension '.fq.gz' or '.fastq.gz'"
+                "pattern": "^\\S+\\.((bam)|((vcf|bcf)(\\.gz)?))$",
+                "errorMessage": "BAM, VCF or BCF file must be provided, cannot contain spaces and must have extension '.bam' or '.vcf', '.bcf' with optional '.gz' extension"
             },
-            "fastq_2": {
+            "index": {
+                "errorMessage": "Input file index must be provided, cannot contain spaces and must have extension '.bai', '.tbi' or '.csi'",
                 "type": "string",
-                "format": "file-path",
-                "exists": true,
-                "pattern": "^\\S+\\.f(ast)?q\\.gz$",
-                "errorMessage": "FastQ file for reads 2 cannot contain spaces and must have extension '.fq.gz' or '.fastq.gz'"
+                "pattern": "^\\S+\\.(bai|tbi|csi)$"
             }
         },
-        "required": ["sample", "fastq_1"]
+        "required": ["sample", "file", "index"]
     }
 }
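Review note: in a pattern that mixes `^`, `$` and `|`, the grouping decides what the anchors apply to. The Python sketch below compares an ungrouped alternation with a fully grouped one. Both patterns are illustrative, and the grouped form is an assumption about the intent drawn from the `errorMessage` text ('.bam', or '.vcf'/'.bcf' with an optional '.gz'). Python's `re.search` is used here as a stand-in for the schema validator's regex engine.

```python
import re

# Ungrouped: '^\S+\.(bam)' OR '((vcf|bcf)(\.gz))?$'. The second branch can match
# the empty string at the end of any input, so every string is accepted.
UNGROUPED = re.compile(r"^\S+\.(bam)|((vcf|bcf)(\.gz))?$")

# Grouped (assumed intent): both anchors apply to every alternative.
GROUPED = re.compile(r"^\S+\.((bam)|((vcf|bcf)(\.gz)?))$")

INDEX = re.compile(r"^\S+\.(bai|tbi|csi)$")


def is_valid_row(file_path, index_path, file_pattern=GROUPED):
    """Check a (file, index) pair against the file and index patterns."""
    return bool(file_pattern.search(file_path)) and bool(INDEX.search(index_path))


print(bool(UNGROUPED.search("reads.fastq.gz")))  # True
print(bool(GROUPED.search("reads.fastq.gz")))    # False
print(is_valid_row("sample.vcf.gz", "sample.vcf.gz.csi"))  # True
```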