Merge pull request nf-core#50 from LouisLeNezet/nf-core-template-merge-2.14.1

Nf core template merge 2.14.1
LouisLeNezet authored May 13, 2024
2 parents e9337f4 + e8cb498 commit fd603b0
Showing 307 changed files with 15,016 additions and 626 deletions.
6 changes: 6 additions & 0 deletions .editorconfig
@@ -31,3 +31,9 @@ indent_size = unset
# ignore python and markdown
[*.{py,md}]
indent_style = unset

[/docs/*.xml]
indent_style = unset

[/docs/images/metro/*.xml]
indent_style = unset
8 changes: 7 additions & 1 deletion .github/workflows/ci.yml
@@ -1,5 +1,6 @@
name: nf-core CI
# This workflow runs the pipeline with the minimal test dataset to check that it completes without any syntax errors

on:
  push:
    branches:
@@ -26,6 +27,11 @@ jobs:
        NXF_VER:
          - "23.04.0"
          - "latest-everything"
        TEST_PROFILE:
          - "test"
          - "test_sim"
          - "test_quilt"
          - "test_stitch"
    steps:
      - name: Check out pipeline code
        uses: actions/checkout@0ad4b8fadaa221de15dcec353f45205ec38ea70b # v4
@@ -43,4 +49,4 @@ jobs:
        # For example: adding multiple test runs with different parameters
        # Remember that you can parallelise this by using strategy.matrix
        run: |
          nextflow run ${GITHUB_WORKSPACE} -profile test,docker --outdir ./results
          nextflow run ${GITHUB_WORKSPACE} -profile "${{ matrix.TEST_PROFILE }}",docker --outdir ./results
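The new `TEST_PROFILE` entry multiplies the number of CI jobs, since GitHub Actions runs one job per combination of matrix values. The sketch below enumerates those combinations in Python. It assumes that `NXF_VER` and `TEST_PROFILE` are the only matrix dimensions and that no `include` or `exclude` entries exist outside the hunks shown here. The helper name `expand_matrix` is hypothetical.

```python
from itertools import product

# Values copied from the strategy.matrix shown in this diff.
NXF_VER = ["23.04.0", "latest-everything"]
TEST_PROFILE = ["test", "test_sim", "test_quilt", "test_stitch"]


def expand_matrix(versions, profiles):
    """Return one (version, command) pair per matrix combination."""
    jobs = []
    for version, profile in product(versions, profiles):
        # Doubled braces keep the literal ${GITHUB_WORKSPACE} in the f-string.
        command = (
            f'nextflow run ${{GITHUB_WORKSPACE}} '
            f'-profile "{profile}",docker --outdir ./results'
        )
        jobs.append((version, command))
    return jobs


jobs = expand_matrix(NXF_VER, TEST_PROFILE)
for version, command in jobs:
    print(f"NXF_VER={version}: {command}")
```

Under these assumptions the workflow goes from two jobs (one per Nextflow version) to eight.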
2 changes: 2 additions & 0 deletions .gitignore
@@ -6,3 +6,5 @@ results/
testing/
testing*
*.pyc
*.code-workspace
.nf-test*
15 changes: 15 additions & 0 deletions CHANGELOG.md
@@ -9,8 +9,23 @@ Initial release of nf-core/phaseimpute, created with the [nf-core](https://nf-co

### `Added`

### `Changed`

- [#18](https://github.com/nf-core/phaseimpute/pull/18)
- Maps and region by chromosome
- update tests config files
- correct meta map propagation
- Test impute and test sim works
- [#19](https://github.com/nf-core/phaseimpute/pull/19) - Changed reference panel to accept a csv, update modules and subworkflows (glimpse1/2 and shapeit5)
- [#20](https://github.com/nf-core/phaseimpute/pull/20) - Added automatic detection of vcf contigs for the reference panel and automatic renaming available
- [#22](https://github.com/nf-core/phaseimpute/pull/22) - Add validation step for concordance analysis. Input channels changed to match inputs steps. Outdir folder organised by steps. Modules config by subworkflows.
- [#26](https://github.com/nf-core/phaseimpute/pull/26) - Added QUILT method

### `Fixed`

- [#15](https://github.com/nf-core/phaseimpute/pull/15) - Changed test csv files to point to nf-core repository
- [#16](https://github.com/nf-core/phaseimpute/pull/16) - Removed outdir from test config files

### `Dependencies`

### `Deprecated`
16 changes: 14 additions & 2 deletions CITATIONS.md
@@ -10,9 +10,21 @@
## Pipeline tools

- [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
- [QUILT](https://pubmed.ncbi.nlm.nih.gov/34083788/)

> Andrews, S. (2010). FastQC: A Quality Control Tool for High Throughput Sequence Data [Online].
> Davies, R. W., Kucka, M., Su, D., Shi, S., Flanagan, M., Cunniff, C. M., ... & Myers, S. (2021). Rapid genotype imputation from sequence with reference panels. Nature genetics, 53(7), 1104-1111.
- [GLIMPSE](https://www.nature.com/articles/s41588-020-00756-0)

> Rubinacci, S., Ribeiro, D. M., Hofmeister, R. J., & Delaneau, O. (2021). Efficient phasing and imputation of low-coverage sequencing data using large reference panels. Nature Genetics, 53(1), 120-126.
- [Shapeit](https://odelaneau.github.io/shapeit5/)

> Hofmeister RJ, Ribeiro DM, Rubinacci S., Delaneau O. (2023). Accurate rare variant phasing of whole-genome and whole-exome sequencing data in the UK Biobank. Nature Genetics doi: https://doi.org/10.1038/s41588-023-01415-w
- [bcftools](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3198575/)

> Li, H. (2011). A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics, 27(21), 2987-2993.
- [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)

64 changes: 41 additions & 23 deletions README.md
@@ -19,50 +19,43 @@

## Introduction

**nf-core/phaseimpute** is a bioinformatics pipeline that ...
**nf-core/phaseimpute** is a bioinformatics pipeline to phase and impute genetic data. Different steps are available, each corresponding to a dedicated mode.

<!-- TODO nf-core:
Complete this sentence with a 2-3 sentence summary of what types of data the pipeline ingests, a brief overview of the
major pipeline sections and the types of output it produces. You're giving an overview to someone new
to nf-core here, in 15-20 seconds. For an example, see https://github.com/nf-core/rnaseq/blob/master/README.md#introduction
-->
### Main steps of the pipeline

<!-- TODO nf-core: Include a figure that guides the user through the major workflow steps. Many nf-core
workflows use the "tube map" design for that. See https://nf-co.re/docs/contributing/design_guidelines#examples for examples. -->
<!-- TODO nf-core: Fill in short bullet-pointed list of the default steps in the pipeline -->
The **phaseimpute** pipeline consists of 5 main steps:

1. Read QC ([`FastQC`](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/))
2. Present QC for raw reads ([`MultiQC`](http://multiqc.info/))
| Metro map | Modes |
| ---------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| <img src="docs/images/metro/MetroMap.png" alt="metromap" width="800"/> | - **Pre-processing**: Phasing, QC, variant filtering, variant annotation of the reference panel <br> - **Phase**: Phasing of the target dataset on the reference panel <br> - **Simulate**: Simulation of the target dataset from high quality target data <br> - **Concordance**: Concordance between the target dataset and a truth dataset <br> - **Post-processing**: Variant filtering based on their imputation quality |

## Usage

> [!NOTE]
> If you are new to Nextflow and nf-core, please refer to [this page](https://nf-co.re/docs/usage/installation) on how to set-up Nextflow. Make sure to [test your setup](https://nf-co.re/docs/usage/introduction#how-to-run-a-pipeline) with `-profile test` before running the workflow on actual data.
<!-- TODO nf-core: Describe the minimum required steps to execute the pipeline, e.g. how to prepare samplesheets.
Explain what rows and columns represent. For instance (please edit as appropriate):
The basic usage of this pipeline is to impute a target dataset based on a phased panel.
First, prepare a samplesheet with your input data that looks as follows:

`samplesheet.csv`:

```csv
sample,fastq_1,fastq_2
CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz
sample,bam,bai
1_BAM_1X,/path/to/.bam,/path/to/.bai
```

Each row represents a fastq file (single-end) or a pair of fastq files (paired end).
-->
Each row represents a bam file with its index file.

Now, you can run the pipeline using:

<!-- TODO nf-core: update the following command to include all required parameters for a minimal example -->

```bash
nextflow run nf-core/phaseimpute \
-profile <docker/singularity/.../institute> \
--input samplesheet.csv \
--genome "GRCh38" \
--panel <phased_reference_panel.vcf.gz> \
--steps "impute" \
--tools "glimpse1" \
--outdir <OUTDIR>
```

@@ -72,6 +65,19 @@ nextflow run nf-core/phaseimpute \
For more details and further functionality, please refer to the [usage documentation](https://nf-co.re/phaseimpute/usage) and the [parameter documentation](https://nf-co.re/phaseimpute/parameters).

## Description of the different modes of the pipeline

Here is a short description of the different modes of the pipeline.
For more information, please refer to the [documentation](https://nf-core.github.io/phaseimpute/usage/).

| Mode | Flow chart | Description |
| ------------------ | ---------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Preprocessing** | <img src="docs/images/metro/PreProcessing.png" alt="phase_metro" width="600"/> | The preprocessing mode prepares the input files that will be used by the phasing process. <br> The main processes are: <br> - **Haplotype phasing** of the reference panel using [**Shapeit5**](https://odelaneau.github.io/shapeit5/). <br> - **Filtering** of the reference panel to select only the necessary variants. <br> - **Chunking** of the reference panel into a subset of regions for all the chromosomes. <br> - **Extraction** of the positions at which to perform the imputation. |
| **Phasing** | <img src="docs/images/metro/Phase.png" alt="phase_metro" width="600"/> | The phasing mode is the core mode of this pipeline. <br> It consists of 3 main steps: <br> - **Phasing**: Phasing of the target dataset on the reference panel using either: <br> &emsp; - [**Glimpse1**](https://odelaneau.github.io/GLIMPSE/glimpse1/index.html) <br> &emsp; This requires computing the genotype likelihoods of the target dataset. <br> &emsp; This step is done using [BCFTOOLS_mpileup](https://samtools.github.io/bcftools/bcftools.html#mpileup) <br> &emsp; - [**Glimpse2**](https://odelaneau.github.io/GLIMPSE/glimpse2/index.html) For this step the reference panel is transformed to binary chunks. <br> &emsp; - [**Stitch**](https://github.com/rwdavies/stitch) <br> &emsp; - [**Quilt**](https://github.com/rwdavies/QUILT) <br> - **Ligation**: all the different chunks are merged together. <br> - **Sampling** (optional) |
| **Simulate** | <img src="docs/images/metro/Simulate.png" alt="simulate_metro" width="600"/> | The simulation mode is used to create artificial low-information genetic data from high-density data. This allows the imputed result to be compared to a _truth_ and therefore the quality of the imputation to be evaluated. <br> For the moment it is possible to simulate: <br> - Low-pass data by **downsampling** BAM or CRAM files using [SAMTOOLS_view -s]() at different depths <br> - Genotype data by **selecting** the positions used by a designated SNP chip. <br> The simulation mode will also compute the **genotype likelihoods** of the high-density data. |
| **Concordance** | <img src="docs/images/metro/Concordance.png" alt="concordance_metro" width="600"/> | This mode compares two VCF files to compute a summary of the differences between them. <br> To do so it uses either: <br> - The [**Glimpse1**](https://odelaneau.github.io/GLIMPSE/glimpse1/index.html) concordance process. <br> - The [**Glimpse2**](https://odelaneau.github.io/GLIMPSE/glimpse2/index.html) concordance process. <br> - Or a conversion of the two VCF files to `.zarr` using [**scikit-allel**](https://scikit-allel.readthedocs.io/en/stable/) and [**anndata**](https://anndata.readthedocs.io/en/latest/) before comparing the SNPs. |
| **Postprocessing** | <img src="docs/images/metro/PostProcessing.png" alt="postprocessing_metro" width="600"/> | This final process makes it possible to loop the whole pipeline to increase the performance of the imputation. To do so it filters for the best imputed positions and reruns the analysis using these positions. |

## Pipeline output

To see the results of an example test run with a full size dataset refer to the [results](https://nf-co.re/phaseimpute/results) tab on the nf-core website pipeline page.
@@ -80,16 +86,20 @@ For more details about the output files and reports, please refer to the

## Credits

nf-core/phaseimpute was originally written by LouisLeNezet.
nf-core/phaseimpute was originally written by Louis Le Nézet.

We thank the following people for their extensive assistance in the development of this pipeline:

<!-- TODO nf-core: If applicable, make list of people who have also contributed -->
- Anabella Trigila
- Saul Pierotti
- Eugenia Fontecha
- Matias Romero Victorica

## Contributions and Support

If you would like to contribute to this pipeline, please see the [contributing guidelines](.github/CONTRIBUTING.md).

For further information or help, don't hesitate to get in touch on the [Slack `#phaseimpute` channel](https://nfcore.slack.com/channels/phaseimpute) (you can join with [this invite](https://nf-co.re/join/slack)).
For further information or help, don't hesitate to get in touch on the [Slack `#phaseimpute` channel](https://nfcore.slack.com/channels/phaseimpute) (you can join with [this invite](https://nf-co.re/join/slack)).

## Citations
@@ -99,6 +109,14 @@ For further information or help, don't hesitate to get in touch on the [Slack `#

<!-- TODO nf-core: Add bibliography of tools and data used in your pipeline -->

You can cite one of the main imputation methods ([`QUILT`](https://github.com/rwdavies/QUILT)) as follows:

> **Rapid genotype imputation from sequence with reference panels.**
>
> Davies, R. W., Kucka, M., Su, D., Shi, S., Flanagan, M., Cunniff, C. M., Chan, Y. F., & Myers, S.
>
> _Nature genetics_ 2021 June 03. doi: [10.1038/s41588-021-00877-0](https://doi.org/10.1038/s41588-021-00877-0)
An extensive list of references for the tools used by the pipeline can be found in the [`CITATIONS.md`](CITATIONS.md) file.

You can cite the `nf-core` publication as follows:
39 changes: 39 additions & 0 deletions assets/chr_rename_add.txt
@@ -0,0 +1,39 @@
1 chr1
2 chr2
3 chr3
4 chr4
5 chr5
6 chr6
7 chr7
8 chr8
9 chr9
10 chr10
11 chr11
12 chr12
13 chr13
14 chr14
15 chr15
16 chr16
17 chr17
18 chr18
19 chr19
20 chr20
21 chr21
22 chr22
23 chr23
24 chr24
25 chr25
26 chr26
27 chr27
28 chr28
29 chr29
30 chr30
31 chr31
32 chr32
33 chr33
34 chr34
35 chr35
36 chr36
37 chr37
38 chr38
X chrX
39 changes: 39 additions & 0 deletions assets/chr_rename_del.txt
@@ -0,0 +1,39 @@
chr1 1
chr2 2
chr3 3
chr4 4
chr5 5
chr6 6
chr7 7
chr8 8
chr9 9
chr10 10
chr11 11
chr12 12
chr13 13
chr14 14
chr15 15
chr16 16
chr17 17
chr18 18
chr19 19
chr20 20
chr21 21
chr22 22
chr23 23
chr24 24
chr25 25
chr26 26
chr27 27
chr28 28
chr29 29
chr30 30
chr31 31
chr32 32
chr33 33
chr34 34
chr35 35
chr36 36
chr37 37
chr38 38
chr39 X
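Both rename files are two-column lists of an old contig name followed by a new one. This layout resembles the mapping file accepted by `bcftools annotate --rename-chrs`, but whether the pipeline consumes the files that way is an assumption, since the diff does not show it. The Python sketch below parses short excerpts of the two files and applies them in sequence. The helper names `parse_rename_map` and `rename` are hypothetical.

```python
def parse_rename_map(text):
    """Parse whitespace-separated 'old new' pairs, one per line, into a dict."""
    mapping = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        old, new = fields
        mapping[old] = new
    return mapping


def rename(contigs, mapping):
    """Rename contigs found in the mapping; leave the others untouched."""
    return [mapping.get(name, name) for name in contigs]


# Short excerpts of the two files, not the full 39 lines.
ADD_EXCERPT = "1 chr1\n38 chr38\nX chrX\n"
DEL_EXCERPT = "chr1 1\nchr38 38\nchr39 X\n"

add_map = parse_rename_map(ADD_EXCERPT)
del_map = parse_rename_map(DEL_EXCERPT)

prefixed = rename(["1", "38", "X"], add_map)
restored = rename(prefixed, del_map)
print(prefixed)
print(restored)
```

As listed, the two files are not exact inverses: `chr_rename_add.txt` maps `X` to `chrX`, while the last line of `chr_rename_del.txt` maps `chr39` to `X`. A contig named `chrX` therefore passes through the second mapping unchanged in this sketch. The diff does not say whether that is intended.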
3 changes: 3 additions & 0 deletions assets/panel.csv
@@ -0,0 +1,3 @@
panel,chr,vcf,index
1000GP,chr21,1000GP_21.vcf,1000GP_21.vcf.csi
1000GP,chr22,1000GP_22.vcf,1000GP_22.vcf.csi
2 changes: 2 additions & 0 deletions assets/regionsheet.csv
@@ -0,0 +1,2 @@
chr,start,end
20,20000000,2200000
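The example region above has an `end` value (2,200,000) that is smaller than its `start` (20,000,000). The diff does not state how these coordinates are interpreted, so the check below is only a sketch under the assumption that `start` should be strictly below `end`. The function name `check_regions` is hypothetical.

```python
import csv
import io

# Content of assets/regionsheet.csv as added in this commit.
REGIONSHEET = "chr,start,end\n20,20000000,2200000\n"


def check_regions(text):
    """Return a message for every row whose start is not below its end."""
    problems = []
    for row in csv.DictReader(io.StringIO(text)):
        start, end = int(row["start"]), int(row["end"])
        if start >= end:
            problems.append(
                f"chr {row['chr']}: start {start} is not below end {end}"
            )
    return problems


problems = check_regions(REGIONSHEET)
for message in problems:
    print(message)
```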
6 changes: 3 additions & 3 deletions assets/samplesheet.csv
@@ -1,3 +1,3 @@
sample,fastq_1,fastq_2
SAMPLE_PAIRED_END,/path/to/fastq/files/AEG588A1_S1_L002_R1_001.fastq.gz,/path/to/fastq/files/AEG588A1_S1_L002_R2_001.fastq.gz
SAMPLE_SINGLE_END,/path/to/fastq/files/AEG588A4_S4_L003_R1_001.fastq.gz,
sample,bam,bai
1_BAM_1X,/path/to/.bam,/path/to/.bai
1_BAM_SNP,/path/to/.bam,/path/to/.bai
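The new samplesheet header uses the columns `sample,bam,bai`, while the `required` list in `assets/schema_input.json` from this same commit names `sample`, `file` and `index`. The sketch below is a plain header comparison in Python. How the pipeline's own validation maps columns to schema properties is not shown in the diff, so treat this as an illustration only. The helper name `missing_columns` is hypothetical.

```python
import csv
import io

# Content of assets/samplesheet.csv after this commit.
SAMPLESHEET = (
    "sample,bam,bai\n"
    "1_BAM_1X,/path/to/.bam,/path/to/.bai\n"
    "1_BAM_SNP,/path/to/.bam,/path/to/.bai\n"
)

# "required" list copied from assets/schema_input.json after this commit.
REQUIRED = ["sample", "file", "index"]


def missing_columns(text, required):
    """Return the required column names that are absent from the CSV header."""
    header = next(csv.reader(io.StringIO(text)))
    return [name for name in required if name not in header]


missing = missing_columns(SAMPLESHEET, REQUIRED)
print(missing)
```

Under this literal comparison, the example samplesheet lacks the `file` and `index` columns.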
20 changes: 8 additions & 12 deletions assets/schema_input.json
@@ -1,7 +1,7 @@
{
"$schema": "http://json-schema.org/draft-07/schema",
"$id": "https://raw.githubusercontent.com/nf-core/phaseimpute/master/assets/schema_input.json",
"title": "nf-core/phaseimpute pipeline - params.input schema",
"title": "nf-core/phaseimpute pipeline - params.input",
"description": "Schema for the file provided with params.input",
"type": "array",
"items": {
@@ -13,21 +13,17 @@
"errorMessage": "Sample name must be provided and cannot contain spaces",
"meta": ["id"]
},
"fastq_1": {
"file": {
"type": "string",
"format": "file-path",
"exists": true,
"pattern": "^\\S+\\.f(ast)?q\\.gz$",
"errorMessage": "FastQ file for reads 1 must be provided, cannot contain spaces and must have extension '.fq.gz' or '.fastq.gz'"
"pattern": "^\\S+\\.(bam)|((vcf|bcf)(\\.gz))?$",
"errorMessage": "BAM, VCF or BCF file must be provided, cannot contain spaces and must have extension '.bam' or '.vcf', '.bcf' with optional '.gz' extension"
},
"fastq_2": {
"index": {
"errorMessage": "Input file index must be provided, cannot contain spaces and must have extension '.bai', '.tbi' or '.csi'",
"type": "string",
"format": "file-path",
"exists": true,
"pattern": "^\\S+\\.f(ast)?q\\.gz$",
"errorMessage": "FastQ file for reads 2 cannot contain spaces and must have extension '.fq.gz' or '.fastq.gz'"
"pattern": "^\\S+\\.(bai|tbi|csi)$"
}
},
"required": ["sample", "fastq_1"]
"required": ["sample", "file", "index"]
}
}
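The `pattern` of the `file` property deserves a closer look. The top-level `|` splits it into two alternatives, `^\S+\.(bam)` and `((vcf|bcf)(\.gz))?$`, and the second one can match the empty string at the end of any input. The Python sketch below compares it with a regrouped variant. It assumes the validator applies patterns with unanchored search semantics, as JSON Schema recommends. `REGROUPED` is a hypothetical alternative and is not part of this commit.

```python
import re

# Pattern copied from the "file" property, with the JSON escaping removed.
ORIGINAL = r"^\S+\.(bam)|((vcf|bcf)(\.gz))?$"

# Hypothetical regrouping: a single alternation anchored at both ends.
REGROUPED = r"^\S+\.(bam|(vcf|bcf)(\.gz)?)$"

NAMES = ["a.bam", "a.vcf", "a.bcf.gz", "a.txt", "a.bam.tmp", "a b.bam"]


def accepted(pattern, names):
    """Return the names the pattern matches under search semantics."""
    return [name for name in names if re.search(pattern, name)]


print(accepted(ORIGINAL, NAMES))
print(accepted(REGROUPED, NAMES))
```

With Python's `re` module, the original pattern accepts all six names, including `a.txt` and the name that contains a space. The regrouped variant accepts only `a.bam`, `a.vcf` and `a.bcf.gz`, which is what the error message describes.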