setup nf-core workflow
| author | date | tags |
| --- | --- | --- |
| KH | 2023-01-31 | nextflow, workflow, nf-core, installation, setup |
Nextflow is a workflow scripting language commonly used in bioinformatic applications.
Around Nextflow, a large community of researchers develops standards for pipelines and provides a large catalogue of curated workflows called nextflow core (nf-core).
The nf-core community also provides a toolbox package, `nf-core`, for creating and running workflows, available through pip and bioconda.
You were browsing through the nf-core workflow catalogue and found the desired workflow? Great! The next steps guide you through the process of running it on CUBI infrastructure.
To avoid annoying dependency issues, the `nextflow` and `nf-core` packages should be installed and run via a conda environment.
A working `yml` file, with which this guide was tested, looks like this:
```yaml
name: nextflow-env
channels:
  - conda-forge
  - bioconda
dependencies:
  - python=3.10.*
  - mamba=1.5.6
  - nextflow=23.10.1
  - nf-core=2.11.1
```
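Assuming the environment file above is saved as `nextflow-env.yml` (the filename is an assumption, any name works), the environment can be created and activated like this:

```shell
# create the conda environment from the yml file (filename assumed)
conda env create -f nextflow-env.yml
# activate it before running any nextflow or nf-core commands
conda activate nextflow-env
# quick sanity check that both tools are available
nextflow -version
nf-core --version
```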
It is recommended to download the workflow, also to enable offline usage.
For offline usage of Nextflow, additionally add `export NXF_OFFLINE='true'` (the all-lowercase spelling of 'true' is important!) to your `.bashrc` or scripts, so that Nextflow does not look online for updates.
More information at https://nf-co.re/docs/usage/offline.
The download can be automated using the `nf-core download` command, which is part of the nf-core tools package and prompts you through the necessary options for downloading.
You can also specify them directly on the command line (CL).
NOTE: Nextflow can store and pull previously downloaded singularity images from a local cache folder, which is recommended, e.g., using `export NXF_SINGULARITY_CACHEDIR=~/singularity_dir`.
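The two environment variables mentioned in this section can be set together, e.g., in your `.bashrc`; the cache path here is just an example:

```shell
# disable Nextflow's online update and asset checks (must be lowercase 'true')
export NXF_OFFLINE='true'
# reuse previously downloaded singularity images from a local cache (example path)
export NXF_SINGULARITY_CACHEDIR=~/singularity_dir
```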
Here is an example for the nf-core bamtofastq workflow:
```shell
nf-core download bamtofastq --revision 2.1.0 --compress none --container-system singularity
```
This will create the following three folders within `nf-core-bamtofastq-2.1.0/`:

- `<workflow version>`: workflow code from the GitHub repository. The folder name always corresponds to the version tag of the workflow's GitHub repository, in this example `2_1_0`.
- `singularity/`: containers for workflow processes, e.g., singularity containers.
- `configs/`: basic configs and preconfigured institutional configs.
Moreover, some workflows need additional databases. Most nf-core workflows use the human reference genome, which is provided for nf-core workflows through the AWS iGenomes collection. More information on usage with nf-core can be found at nf-core reference genomes. You can get the download command for specific datasets from AWS-iGenomes. For example, downloading the human reference genome GRCh38 assembled for usage with GATK (>20 GB):

```shell
aws s3 --no-sign-request --region eu-west-1 sync s3://ngi-igenomes/igenomes/Homo_sapiens/GATK/GRCh38/ db/references/Homo_sapiens/GATK/GRCh38/
```
Within your Nextflow parameters, you need to specify the iGenomes base directory and the reference genome using the parameters `igenomes_base` (pointing to the iGenomes directory, e.g., `db/references/`) and `genome` (name of the downloaded reference, e.g., `GATK.GRCh38`).
If `igenomes_base` is not defined, the workflow will automatically fetch the reference from the internet source defined in `conf/igenomes.config`.
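As a sketch, the two parameters could appear in your `nf-params.json` like this (the paths are the example values from above, not fixed defaults):

```json
{
    "igenomes_base": "db/references/",
    "genome": "GATK.GRCh38"
}
```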
Further, some workflows require workflow-specific references that need to be separately downloaded, e.g., the VEP cache in sarek.
For generally configuring nextflow workflows for your infrastructure, see the nextflow config tips.
The params file is required for configuring your nf-core workflow.
For this, you can use the `nf-core launch` command.
For example, after you downloaded the bamtofastq workflow:

```shell
nf-core launch -x -a nf-core-bamtofastq-2.1.0/<workflow version>/
```

The `-a` and `-x` flags ensure that all parameters are displayed and configurable.
You will be guided through a web- or command-line-based interface to configure all parameters.
In the end, this creates the file `nf-params.json`, which can be provided via the `-params-file` flag.
For choosing the right parameters and writing the `samplesheet.csv`, you can read the documentation of the respective workflow, e.g., for bamtofastq at https://nf-co.re/bamtofastq.
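For illustration, a samplesheet could be written like this; the column names shown here are an assumption for bamtofastq, so always check the workflow documentation for the exact format:

```shell
# hypothetical samplesheet sketch; verify the columns against the workflow docs
cat > samplesheet.csv << 'EOF'
sample_id,mapped,index,file_type
patient1,/path/to/patient1.bam,/path/to/patient1.bam.bai,bam
EOF
```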
Several configuration profiles predefined for nf-core workflows deal with the usage of container systems.
Important profiles are `singularity`, `docker`, and `conda`, which control how the containers or software environments for the workflow are provided.
For example, if you downloaded the workflow using singularity images, you need to enable the singularity profile with `-profile singularity`.
Also, some pipelines include a `test` profile for running automatic test samples once you have set up the workflow; these test profiles usually require an internet connection.
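A test run combining the two profile types could look like this; the `--outdir` value is an example, and note that recent nf-core pipelines require this parameter:

```shell
# sketch of a test run with singularity containers (output path is an example)
nextflow run nf-core-bamtofastq-2.1.0/<workflow version> \
    -profile test,singularity \
    --outdir test_results
```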
Nf-core workflows allow the definition of resource caps that prevent Nextflow from submitting jobs that request more resources than the infrastructure provides.
Since these caps are specific to the infrastructure, they should be configured in the `run.config` file as shown here.
The recommended place is within the infrastructure-specific profiles, e.g., in the `hilbert` profile:

```groovy
params {
    // maximum HPC job resources
    max_memory = 200.GB
    max_cpus   = 32
    max_time   = 72.h
}
```
More information at max-resources.
The resource requirements of Nextflow processes are specified via process labels, which are defined in the `base.config` file within the workflow.
If a process exits because of insufficient resources, Nextflow automatically retries it with doubled resources until the specified `max_memory`, `max_cpus`, or `max_time` values are reached. Hence, you can increase these parameters and restart the process.
To avoid long runtimes, e.g., due to several retries by Nextflow or too few CPU cores for big datasets, you can also increase the resource requirements of specific processes in the process scope within a separate `run.config` file.
Each process carries a label that specifies its resource requirements (see the `nextflow.config` file), which can be overwritten, e.g., like this for the medium label:

```groovy
process {
    withLabel:process_medium {
        cpus = 32
    }
}
```

More information at tuning-workflow-resources.
For most processes in nf-core workflows and modules, you can also define extra arguments or flags for the executed command using the process name and the `ext.args` argument. For example, we can set an additional flag for the VEP process in sarek:

```groovy
process {
    withName: 'ENSEMBLVEP' {
        ext.args = '--everything'
    }
}
```
More information about nf-core configurations can be found at https://nf-co.re/usage/configuration.
You downloaded all files, specified your parameters in `nf-params.json`, and added the infrastructure configuration as a separate profile in `run.config`?
Great! Now you can run your workflow using the `nextflow run` command, e.g., for the bamtofastq workflow on HILBERT:

```shell
nextflow run nf-core-bamtofastq-2.1.0/<workflow version> \
    -profile singularity,hilbert \
    -c run.config \
    -params-file nf-params.json
```
NOTE: If the workflow exits unsuccessfully, e.g., due to a wrongly specified parameter, you can relaunch this command after fixing the issue by adding the `-resume` flag.
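For instance, the run command from above would be relaunched as:

```shell
nextflow run nf-core-bamtofastq-2.1.0/<workflow version> \
    -profile singularity,hilbert \
    -c run.config \
    -params-file nf-params.json \
    -resume
```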
For debugging, you can also have a look at the log file `.nextflow.log` or into the `pipeline_info/` folder within your results directory.
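A few commands that are often helpful for a first look at a failed run; the paths assume you are in the launch directory, and `<outdir>` stands for your results directory:

```shell
# show the last lines of the Nextflow log
tail -n 50 .nextflow.log
# list previous runs with their status and session IDs
nextflow log
# execution reports, timelines, and trace files of the run
ls <outdir>/pipeline_info/
```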
Copyright © 2022-2024 Core Unit Bioinformatics, Medical Faculty, HHU
All content in this Wiki is published under the CC BY-NC-SA 4.0 license.