Walkthrough using tracking database
This page documents how to use Clockwork, plus MySQL and nextflow, to track all your samples and run the pipelines. It takes a little initial setting up, but then all the pipelines are simple to run.
We will download two samples from the ENA, and put them through all of the pipelines.
It is important to bear in mind that Clockwork was developed for the CRyPTIC Mycobacterium tuberculosis project, which stored particular information about each sample. On your own data you will need to use some dummy values for some of the metadata (and then just ignore those fields!).
You will need MySQL, nextflow, and (unless you run clockwork using another method) Singularity installed to follow this walkthrough.
This page uses the singularity container, which we assume is called `clockwork_container.img` throughout this document.

In addition to the singularity container, you will need the nextflow scripts from the clockwork repository. You can either download the code from the release page, or `git clone` the repository:

git clone https://github.com/iqbal-lab-org/clockwork.git

This is important: use the nextflow scripts on your host computer, not the nextflow scripts inside the clockwork container. The nextflow scripts are in the directory `nextflow/` in the clockwork repository. In this walkthrough we will assume that the clockwork repository is in your current working directory, so that the nextflow scripts are in `clockwork/nextflow/`.
It is beyond the scope of this documentation to provide help with how to use nextflow. But note that:
- Nextflow writes all intermediate files (there can be a large number!) in a "work directory", which is specified using the option `-w /foo/bar` and is created by nextflow if it does not exist. The default is `work` in your current working directory.
- `nextflow run` options have a single dash (eg `-w my_work_dir`).
- Options to the actual pipeline have a double dash (eg `--db_config_file db.ini`).
- Nextflow writes log and cache files in your current working directory, called `.nextflow*`.
- After a successful run of nextflow, it is a good idea to delete the `.nextflow*` files and the work directory.
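As a sketch of a typical invocation and cleanup (here `PIPELINE.nf` stands in for any of the pipeline scripts used below, with the required pipeline options elided):

# single-dash options (eg -w) are for nextflow itself;
# double-dash options (eg --db_config_file) are for the pipeline
nextflow run clockwork/nextflow/PIPELINE.nf -w my_work_dir --db_config_file db.ini
# after a successful run, delete the cache files and the work directory
rm -rf .nextflow* my_work_dir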
The nextflow pipelines are written in the older nextflow language DSL1, and new versions of nextflow by default assume the newer language DSL2. However, even with newer versions of nextflow, you can tell it to use an older nextflow version in one of two ways:
- run `export NXF_VER=22.10.0` before running nextflow
- prefix all the nextflow commands like this: `NXF_VER=22.10.0 nextflow run ...`

This means that nextflow will not crash with an error about DSL1 being unsupported.
Each clockwork pipeline uses Nextflow to run each stage of its workflow, and a MySQL database to store information about each sample and which samples have had which pipelines run on them. Running one pipeline means one run of a Nextflow script, which will query the database to find samples that have not already had that pipeline run on them, run the pipeline on those samples, and then update the database accordingly.
All clockwork files are stored in a "pipeline root" directory. Clockwork chooses the location of the files within this directory. Those files inside the pipeline root directory should not be moved, renamed, or deleted by the user. Scripts are available to report the location of files inside the pipeline root directory for each sample. The pipeline root directory itself can be moved/renamed because it is a parameter when running pipelines.
The pipelines are:
- Import: imports new data into the tracking database, and puts the FASTQ files on disk inside the pipeline root directory.
- Remove contamination: removes contaminated reads.
- QC: gathers QC information on the reads using FastQC and BWA MEM/Samtools.
- Variant calling using BWA MEM/Samtools and cortex. The calls from these two tools are combined and validated to produce one call set using minos. The output is a VCF file from minos. The unfiltered VCF files from BWA+samtools and Cortex are also kept.
- Mykrobe predict: runs mykrobe predict, outputting the JSON file made by Mykrobe.
IMPORTANT: after importing data, the remove contamination pipeline should be run to produce decontaminated reads. All other pipelines use the decontaminated reads as input. If you try to run, for example, variant calling directly after importing data, then no samples will run because Clockwork will not find any decontaminated reads.
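As a quick sketch of the run order (options elided here; the full commands are given in the sections below):

nextflow run clockwork/nextflow/import.nf ...           # 1. import reads + metadata
nextflow run clockwork/nextflow/remove_contam.nf ...    # 2. decontaminate (required before the rest)
nextflow run clockwork/nextflow/qc.nf ...               # 3. QC (optional)
nextflow run clockwork/nextflow/variant_call.nf ...     # 4. variant calling
nextflow run clockwork/nextflow/mykrobe_predict.nf ...  # 5. mykrobe predict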
You need to make a config file with the login and database name. Here is an example:
[db_login]
user = user_name
password = password
host = host_name
db = name_of_database
There can optionally be a line for the port: `port = N`. If this is not present then the default port is used.
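For example, a config file that also sets the port would look like this (all values here are placeholders):

[db_login]
user = user_name
password = password
host = host_name
port = 3307
db = name_of_database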
Assuming your config file is called `db.ini`, run this to set up the tracking database:

singularity exec clockwork_container.img clockwork make_empty_db db.ini

This will create a new database named using the entry `db` from the file `db.ini` (in the above example `name_of_database`).
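If you want to check the new database, you can list its tables (they will have no rows yet) with the mysql client, replacing the placeholders with the values from `db.ini` (add `-P PORT` if you set one):

mysql --password=PASSWORD -u USER -h HOST -D DATABASE_NAME -e 'show tables'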
The pipelines need to find files on disk when using a database. Make the following directories (you can call them what you like, but we will use these names throughout this document):
mkdir Pipeline_root # this is where reads and analysis files are kept long-term
mkdir Pipeline_refs # where reference genomes are stored
mkdir Pipeline_spreadsheet_archive # import spreadsheets archived in here
The pipelines require the following reference sequences:
- Genomes to use for removing contamination from reads (optional, but recommended at least for TB).
- At least one reference genome to use for QC and variant calling.
There is a script provided in the container that can download reference files for TB. Or you can provide your own files, by following the instructions on preparing remove contamination reference data. The following instructions are for downloading and setting up reference files for TB.
Download the reference genomes for contamination removal and variant calling:
singularity exec clockwork_container.img download_tb_reference_files.pl ref_data
That will download the data into a new directory called `ref_data`, with the files in the form needed for the `reference_prepare` command described next. Note that this could take up to one hour to run, depending on your internet connection speed. It downloads a large number of genomes (including human), and then processes them. If the script crashes, then rerun the same command and it will continue, skipping any genomes that have already been downloaded.
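When it finishes, `ref_data` should contain (among other files) the ones used by the `reference_prepare` commands below:

ls ref_data
# expect to see at least:
# NC_000962.1.fa  NC_000962.2.fa  NC_000962.3.fa
# remove_contam.fa.gz  remove_contam.tsv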
To set up the contamination reference, using the files that were just made in `ref_data/`:
singularity exec clockwork_container.img clockwork reference_prepare \
--db_config_file db.ini --pipeline_references_root Pipeline_refs \
--contam_tsv ref_data/remove_contam.tsv \
--name remove_contam \
ref_data/remove_contam.fa.gz
To set up version 3 of the H37Rv reference genome so that clockwork can use it for the QC and variant calling pipelines:
singularity exec clockwork_container.img clockwork reference_prepare \
--db_config_file db.ini --pipeline_references_root Pipeline_refs \
--name NC_000962.3 \
ref_data/NC_000962.3.fa
The `ref_data` directory will also contain versions 1 and 2 of H37Rv, should you wish to use those instead by changing the 3 to a 1 or 2 in the last two lines of that command. Or you can use your genome of choice by changing the name and providing your own FASTA file.

If you ran the commands above, in MySQL you should see this (replace `PASSWORD`, `USER`, `HOST`, `PORT`, `DATABASE_NAME` with the values from your config file `db.ini`):
mysql --password=PASSWORD -u USER -h HOST -P PORT -D DATABASE_NAME -e 'select * from Reference'
+--------------+----------------------+
| reference_id | name                 |
+--------------+----------------------+
|            1 | remove_contam        |
|            2 | NC_000962.3          |
+--------------+----------------------+
You need to use the `reference_id` to refer to the references when running the pipelines, not the name. The names are for your convenience, to remember what each reference is. If you used your own data, then the names will be whatever you chose when running the commands.
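If you forget an ID later, you can look it up by name, for example:

mysql --password=PASSWORD -u USER -h HOST -D DATABASE_NAME -e 'select reference_id from Reference where name = "NC_000962.3"'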
For this walkthrough, we get the reads from two samples from the ENA. We will use a built-in script that can download the reads and make the spreadsheet that the import pipeline needs to import the data. On your own data, you will instead need to make your own directory of reads and spreadsheet to import the data.
The script needs a file that has the ENA sample (column 1) and the runs for that sample (column 2). We will use two samples, one of which has two sequencing runs. The file looks like this:
SAMEA2534128 ERR551697,ERR551698
SAMEA2534421 ERR552113
where there are TAB characters separating the columns. We call this file `samples.tsv`.
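One way to make this file with real TAB characters is with printf:

printf 'SAMEA2534128\tERR551697,ERR551698\n' > samples.tsv
printf 'SAMEA2534421\tERR552113\n' >> samples.tsv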
Then run this to get the data:
singularity exec clockwork_container.img clockwork ena_download samples.tsv To_import site1 lab1 20220614 dataset1
The script makes a new directory called `To_import`, with these contents:
ERR551697_1.fastq.gz
ERR551697_1.fastq.gz.md5
ERR551697_2.fastq.gz
ERR551697_2.fastq.gz.md5
ERR551698_1.fastq.gz
ERR551698_1.fastq.gz.md5
ERR551698_2.fastq.gz
ERR551698_2.fastq.gz.md5
ERR552113_1.fastq.gz
ERR552113_1.fastq.gz.md5
ERR552113_2.fastq.gz
ERR552113_2.fastq.gz.md5
import.tsv
This directory can then be used as input to the import pipeline, as in the next section. If you want an example TSV file that the import pipeline uses, then have a look inside `import.tsv`. It uses some dummy values where the CRyPTIC project would have used real data. CRyPTIC had a "site", "lab id", sample date, and "dataset name" for each sample. We've used dummy values instead, and can ignore them from now on.
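You can inspect it directly, eg:

head To_import/import.tsv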
This imports FASTQ files, taking metadata from a spreadsheet (in xlsx or TSV format). Please see this page describing the import spreadsheet format. We will use the directory created above, called `To_import`, as input to the pipeline. Run:
nextflow run clockwork/nextflow/import.nf \
-with-singularity clockwork_container.img \
--dropbox_dir To_import \
--pipeline_root Pipeline_root \
--xlsx_archive_dir Pipeline_spreadsheet_archive \
--db_config_file db.ini
It will find all spreadsheets called `*.xlsx` or `*.tsv` in the directory specified by `--dropbox_dir`, and import their data. IMPORTANT: all imported FASTQ files are moved from the input directory into clockwork's pipeline root directory structure.

Clockwork tracks all samples (their metadata and which pipelines have been run), and arranges files in the `Pipeline_root/` directory. There is a script to find data, outputting your original names from the import spreadsheet and the locations on disk. We can run this now to check that two samples and three sequencing runs were imported:
singularity exec clockwork_container.img clockwork find_data db.ini Pipeline_root/ > find_data.tsv
The output, in `find_data.tsv`, is a TSV file with a lot of columns (too wide to easily read here!). Here are four interesting columns:
cut -f 2,5,8,20 find_data.tsv | column -t
subject_id    sequence_replicate_number  import_status  isolate_directory
SAMEA2534128  1                          1              Pipeline_root/00/00/00/01/1
SAMEA2534128  2                          1              Pipeline_root/00/00/00/01/1
SAMEA2534421  1                          1              Pipeline_root/00/00/00/02/2
We have the two samples imported (`import_status` of `1` means successfully imported). Sample `SAMEA2534128` has two sequence replicates (`sequence_replicate_number` `1` and `2`). The location on disk for each sample is in the `isolate_directory` column.
This decontaminates reads. All later pipelines (QC, variant call, mykrobe) use decontaminated reads, so this pipeline must be run before any of them.
The pipeline will find all newly imported samples (ie those that have not been decontaminated), and run the decontamination pipeline on each of them.
Run this command:
nextflow run clockwork/nextflow/remove_contam.nf \
-with-singularity clockwork_container.img \
--ref_id 1 \
--references_root Pipeline_refs \
--pipeline_root Pipeline_root \
--db_config_file db.ini
If you are happy that your reads do not need decontaminating, then you can avoid decontamination by running the "fake" decontamination pipeline, which simply updates the database to say that decontamination has been run. Then the "decontaminated" reads used for the next pipelines will just be the original reads.
nextflow run clockwork/nextflow/fake_remove_contam.nf \
-with-singularity clockwork_container.img \
--pipeline_root Pipeline_root \
--db_config_file db.ini
This is optional. It runs FastQC, and also gathers stats/plots from mapping the reads and running samtools stats/plot-bamstats. It was really intended for troubleshooting, and as such the results are put into the database. This means some MySQL knowledge is needed to make use of the results. Documentation may be added in the future, but for now you may want to skip this pipeline.
Run the QC pipeline with the command below. It will find all samples that have been decontaminated and have not yet had QC run on them.
nextflow run clockwork/nextflow/qc.nf \
-with-singularity clockwork_container.img \
--ref_id 2 \
--references_root Pipeline_refs \
--pipeline_root Pipeline_root \
--db_config_file db.ini
This will run the variant calling pipeline on all samples that have been decontaminated and not yet been variant called.
Run this command:
nextflow run clockwork/nextflow/variant_call.nf \
-with-singularity clockwork_container.img \
--ref_id 2 \
--references_root Pipeline_refs \
--pipeline_root Pipeline_root \
--db_config_file db.ini
At this point, having run the variant calling pipeline, you probably want a VCF file for each sample. Recall that clockwork arranges all files in its own directory structure inside `Pipeline_root/`. We can use the clockwork command `find_data` to get all the VCF file paths:

singularity exec clockwork_container.img clockwork find_data \
--pipeline variant_call \
db.ini Pipeline_root/ > var_calls.tsv

The column `pipeline_directory` has the directory containing output files for the variant calling pipeline. The path will look like this:

/path/to/Pipeline_root/00/00/00/01/1/Pipelines/1_2/variant_call/0.11.1.ref.2
The main files of interest in there are:
- The final call set in `minos/final.vcf`, ie to get the final call set for a sample, take the `pipeline_directory` and add `minos/final.vcf` to it (see the sketch after this list).
- A BAM file of the mapped trimmed reads, with duplicates removed, called `samtools/rmdup.bam`.
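As a sketch, if you ran the `find_data` command above, you can build the list of final VCF paths from `var_calls.tsv` by looking up the `pipeline_directory` column by name in the header line:

awk -F'\t' 'NR==1 {for (i=1;i<=NF;i++) if ($i=="pipeline_directory") c=i; next} {print $c"/minos/final.vcf"}' var_calls.tsv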
First, you need to register a mykrobe panel with clockwork, similarly to setting up the reference genomes for remove contamination and variant calling. To register the panel `201901`, which at the time of writing is the latest built-in panel for mykrobe, run this:

singularity exec clockwork_container.img clockwork db_add_mykrobe_panel \
db.ini tb 201901 Pipeline_refs

That updates the database, telling clockwork that we will be using the `201901` panel. In MySQL you should see this:
select * from Reference;
+--------------+---------------+
| reference_id | name          |
+--------------+---------------+
|            1 | remove_contam |
|            2 | NC_000962.3   |
|            3 | 201901        |
+--------------+---------------+
The newly-added mykrobe panel has `reference_id` `3`. Now run the mykrobe pipeline using the following command:
nextflow run clockwork/nextflow/mykrobe_predict.nf \
-with-singularity clockwork_container.img \
--ref_id 3 \
--references_root Pipeline_refs \
--pipeline_root Pipeline_root \
--db_config_file db.ini
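When it finishes, you can locate the mykrobe JSON output with `find_data`, in the same way as for variant calling; this assumes the pipeline name to pass is `mykrobe_predict`, as it appears in the Pipeline table shown below:

singularity exec clockwork_container.img clockwork find_data \
--pipeline mykrobe_predict \
db.ini Pipeline_root/ > mykrobe.tsv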
Pipelines can crash in unpredictable ways.

If the import pipeline crashes, then it is impossible to predict what state the database and files on disk will be in. We may try to document this in the future, but it is likely that there are too many possibilities to cover.

If the remove contamination, QC, variant call, or mykrobe pipelines crash, then it is relatively easy to tidy things up and rerun. There is a Pipeline table in the database, which tracks which pipelines have been run, or are currently running, on which samples. If a pipeline crashes, then deleting the rows of the table that say a sample is currently running means that when you rerun, those samples will be run again (otherwise, clockwork will think they are still running and ignore them).
If you followed this walkthrough, and everything worked, then you should see this:
select * from Pipeline;
+------------+-----------+-------------+---------+-----------------+--------+--------------+
| isolate_id | seqrep_id | seqrep_pool | version | pipeline_name   | status | reference_id |
+------------+-----------+-------------+---------+-----------------+--------+--------------+
|          1 |         1 | NULL        | 0.11.1  | remove_contam   |      1 |            1 |
|          1 |         2 | NULL        | 0.11.1  | remove_contam   |      1 |            1 |
|          2 |         3 | NULL        | 0.11.1  | remove_contam   |      1 |            1 |
|          1 |         1 | NULL        | 0.11.1  | qc              |      1 |            2 |
|          1 |         2 | NULL        | 0.11.1  | qc              |      1 |            2 |
|          2 |         3 | NULL        | 0.11.1  | qc              |      1 |            2 |
|          1 |      NULL | 1_2         | 0.11.1  | variant_call    |      1 |            2 |
|          2 |      NULL | 1           | 0.11.1  | variant_call    |      1 |            2 |
|          1 |      NULL | 1_2         | 0.11.1  | mykrobe_predict |      1 |            3 |
|          2 |      NULL | 1           | 0.11.1  | mykrobe_predict |      1 |            3 |
+------------+-----------+-------------+---------+-----------------+--------+--------------+
A `status` of `1` means successfully finished. A `status` of `-1` means failed, and `0` means currently running. In the above, everything has status `1`, meaning that all runs were successful. You can delete all the rows whose status is not equal to `1` with:

delete from Pipeline where status != 1;
Be careful with this command and only run it if there are currently no pipelines running!
After cleaning up that Pipeline table, also delete all your `.nextflow*` files and the nextflow work directory (either `work`, or whatever you used with the `-w` option). Then rerun the nextflow pipeline.
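For example, assuming the default work directory:

rm -rf .nextflow* work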