Walkthrough: using tracking database

This page documents how to use Clockwork, plus MySQL and nextflow, to track all your samples and run the pipelines. It takes a little initial setting up, but then all the pipelines are simple to run.

We will download two samples from the ENA, and put them through all of the pipelines.

It is important to bear in mind that Clockwork was developed for the CRyPTIC Mycobacterium tuberculosis project, which stored particular information about each sample. On your own data you will need to use some dummy values for some of the metadata (and then just ignore those fields!).

Setting up

You will need MySQL, nextflow, and (unless you run clockwork using another method) Singularity installed to follow this walkthrough.

This page uses the singularity container, which we assume is called clockwork_container.img throughout this document.

Nextflow

In addition to the singularity container, you will need the nextflow scripts from the clockwork repository. You can either download the code from the release page, or git clone the repository:

git clone https://github.com/iqbal-lab-org/clockwork.git

This is important: use the nextflow scripts on your host computer, not nextflow scripts inside the clockwork container. The nextflow scripts are in the directory nextflow/ in the clockwork repository. In this walkthrough we will assume that the clockwork repository is in your current working directory, so that the nextflow scripts are in clockwork/nextflow/.

Providing general help with how to use nextflow is beyond the scope of this documentation, but note that:

  • Nextflow writes all intermediate files (there can be a large number!) in a "work directory", which is specified using the option -w /foo/bar and is created by nextflow if it does not exist. The default is work in your current working directory.
  • nextflow run options have a single dash (eg -w my_work_dir).
  • Options to the actual pipeline have a double dash (eg --db_config_file db.ini).
  • Nextflow writes log and cache files in your current working directory, called .nextflow*.
  • After a successful run of nextflow, it is a good idea to delete the .nextflow* files and the work directory.

Nextflow versions

The nextflow pipelines are written in the older nextflow language DSL1, and new versions of nextflow assume the newer language DSL2 by default. However, you can tell a newer version of nextflow to behave as an older one in either of two ways:

  1. export NXF_VER=22.10.0 before running nextflow
  2. prefix all the nextflow commands like this: NXF_VER=22.10.0 nextflow run ...

This means nextflow will not crash with an error about DSL1 being unsupported.
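
For example, a typical run combining these points might look like the following sketch (the pipeline options here are abbreviated; full commands appear later in this walkthrough):

export NXF_VER=22.10.0   # pin a nextflow version that still supports DSL1

nextflow run clockwork/nextflow/import.nf \
  -w my_work_dir \
  --db_config_file db.ini   # ...plus the other pipeline options shown later

# after a successful run, clean up the cache files and work directory
rm -rf .nextflow* my_work_dir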

Overview

Each clockwork pipeline uses Nextflow to run each stage of its workflow, and a MySQL database to store information about each sample and which pipelines have been run on it. Running one pipeline means one run of a Nextflow script, which queries the database to find samples that have not already had that pipeline run on them, runs the pipeline on those samples, and then updates the database accordingly.

All clockwork files are stored in a "pipeline root" directory. Clockwork chooses the location of the files within this directory. Those files inside the pipeline root directory should not be moved, renamed, or deleted by the user. Scripts are available to report the location of files inside the pipeline root directory for each sample. The pipeline root directory itself can be moved/renamed because it is a parameter when running pipelines.

The pipelines are:

  1. Import: imports new data, putting the metadata into the tracking database and the FASTQ files on disk.
  2. Remove contamination: removes contaminated reads.
  3. QC: gathers QC information on the reads using FastQC and BWA MEM/Samtools.
  4. Variant calling: calls variants using BWA MEM/Samtools and Cortex. The calls from these two tools are combined and validated to produce one call set using minos. The output is a VCF file from minos. The unfiltered VCF files from BWA MEM/Samtools and Cortex are also kept.
  5. Mykrobe predict: runs mykrobe predict, outputting the JSON file made by Mykrobe.

IMPORTANT: after importing data, the remove contamination pipeline should be run to produce decontaminated reads. All other pipelines use the decontaminated reads as input. If you try to run, for example, variant calling directly after importing data, then no samples will run because Clockwork will not find any decontaminated reads.

Database setup

You need to make a config file with the login and database name. Here is an example:

[db_login]
user = user_name
password = password
host = host_name
db = name_of_database

There can optionally be a line for the port: port = N. If this is not present then the default port is used.
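
For example, a complete config file that also sets a non-default port would look like this (all values are placeholders to replace with your own):

[db_login]
user = user_name
password = password
host = host_name
port = 3306
db = name_of_database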

Assuming your config file is called db.ini, run this to set up the tracking database:

singularity exec clockwork_container.img clockwork make_empty_db db.ini

This will create a new database named using the entry db from the file db.ini (in the above example name_of_database).
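
To check that the database was created, you can query MySQL directly (replace the capitalised values with those from db.ini, as elsewhere in this walkthrough):

mysql --password=PASSWORD -u USER -h HOST -D DATABASE_NAME -e 'show tables'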

The pipelines need to find files on disk when using a database. Make the following directories (you can call them what you like, but we will use these names throughout this document):

mkdir Pipeline_root # this is where reads and analysis files are kept long-term
mkdir Pipeline_refs # where reference genomes are stored
mkdir Pipeline_spreadsheet_archive # import spreadsheets archived in here

Set up reference genomes

The pipelines require the following reference sequences:

  1. Optional, but recommended at least for TB: genomes to use for removing contamination from reads.
  2. At least one reference genome to use for QC and variant calling.

The container provides a script that can download reference files for TB. Alternatively, you can provide your own files by following the instructions on preparing remove contamination reference data. The instructions below are for downloading and setting up the reference files for TB.

Download the reference genomes for contamination removal and variant calling:

singularity exec clockwork_container.img download_tb_reference_files.pl ref_data

That will download the data into a new directory called ref_data, with the files in the form needed for the reference_prepare command described next. Note that this could take up to one hour to run, depending on your internet connection speed. It downloads a large number of genomes (including human), and then processes them. If the script crashes, then rerun the same command and it will continue, skipping any genomes that have already been downloaded.
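
When it finishes, ref_data should contain, among other files, the ones used by the reference_prepare commands below:

ls ref_data
NC_000962.1.fa
NC_000962.2.fa
NC_000962.3.fa
remove_contam.fa.gz
remove_contam.tsv
...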

To set up the contamination reference, using the files that were just made in ref_data/:

singularity exec clockwork_container.img clockwork reference_prepare \
  --db_config_file db.ini --pipeline_references_root Pipeline_refs \
  --contam_tsv ref_data/remove_contam.tsv \
  --name remove_contam \
  ref_data/remove_contam.fa.gz

To set up version 3 of the H37Rv reference genome so that clockwork can use it for the QC and variant calling pipelines:

singularity exec clockwork_container.img clockwork reference_prepare \
  --db_config_file db.ini --pipeline_references_root Pipeline_refs \
  --name NC_000962.3 \
  ref_data/NC_000962.3.fa

The ref_data directory also contains versions 1 and 2 of H37Rv; to use one of those instead, change the 3 to a 1 or 2 in the last two lines of that command. Or you can use your genome of choice by changing the name and providing your own FASTA file. If you ran the commands above, you should see this in MySQL (replace PASSWORD, USER, HOST, PORT, and DATABASE_NAME with the values from your config file db.ini, dropping -P PORT if you did not set a port):

mysql --password=PASSWORD -u USER -h HOST -P PORT -D DATABASE_NAME -e 'select * from Reference'
+--------------+----------------------+
| reference_id | name                 |
+--------------+----------------------+
|            1 | remove_contam        |
|            2 | NC_000962.3          |
+--------------+----------------------+

You need to use the reference_id to refer to the references when running the pipelines, not the name. The names are for your convenience to remember what each reference is. If you used your own data, then the names will be whatever you chose when running the commands.

Get reads

For this walkthrough, we download reads for two samples from the ENA, using a built-in script that fetches the reads and makes the spreadsheet that the import pipeline needs. For your own data, you will instead need to make your own directory of reads and a spreadsheet to import them.

The script needs a file with the ENA sample accession in column 1 and the run accessions for that sample in column 2. We will use two samples, one of which has two sequencing runs. The file looks like this:

SAMEA2534128	ERR551697,ERR551698
SAMEA2534421	ERR552113

The columns are separated by TAB characters. We will call this file samples.tsv.
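
Since the separator must be a literal TAB character, one safe way to make the file is with printf:

printf 'SAMEA2534128\tERR551697,ERR551698\n'  > samples.tsv
printf 'SAMEA2534421\tERR552113\n'           >> samples.tsv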

Then run this to get the data:

singularity exec clockwork_container.img clockwork ena_download samples.tsv To_import site1 lab1 20220614 dataset1

The script makes a new directory called To_import, with these contents:

ERR551697_1.fastq.gz
ERR551697_1.fastq.gz.md5
ERR551697_2.fastq.gz
ERR551697_2.fastq.gz.md5
ERR551698_1.fastq.gz
ERR551698_1.fastq.gz.md5
ERR551698_2.fastq.gz
ERR551698_2.fastq.gz.md5
ERR552113_1.fastq.gz
ERR552113_1.fastq.gz.md5
ERR552113_2.fastq.gz
ERR552113_2.fastq.gz.md5
import.tsv
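
Before importing, you may wish to verify the downloaded files against their checksums. Assuming the .md5 files are in the standard format accepted by md5sum -c, this would be:

(cd To_import && md5sum -c *.md5)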

This directory can then be used as input to the import pipeline, as in the next section.

If you want an example of the TSV file that the import pipeline uses, then have a look inside import.tsv. It uses dummy values where the CRyPTIC project would have used real data: CRyPTIC had a "site", "lab id", sample date, and "dataset name" for each sample. We used dummy values for these (site1, lab1, 20220614, dataset1 in the ena_download command above), and can ignore them from now on.

Import pipeline

This imports FASTQ files, taking metadata from a spreadsheet (in xlsx or TSV format). Please see the page describing the import spreadsheet format. We will use the directory created above, To_import, as input to the pipeline.

Run:

nextflow run clockwork/nextflow/import.nf \
  -with-singularity clockwork_container.img \
  --dropbox_dir To_import \
  --pipeline_root Pipeline_root \
  --xlsx_archive_dir Pipeline_spreadsheet_archive \
  --db_config_file db.ini

It will find all spreadsheets called *.xlsx or *.tsv in the directory specified by --dropbox_dir, and import their data. IMPORTANT: all imported FASTQ files are moved from the input directory into clockwork's pipeline root directory structure.

Clockwork tracks all samples (their metadata and which pipelines have been run), and arranges files in the Pipeline_root/ directory. There is a script to find data, which outputs your original names from the import spreadsheet and the locations on disk. We can run it now to check that two samples and three sequencing runs were imported:

singularity exec clockwork_container.img clockwork find_data db.ini Pipeline_root/ > find_data.tsv

The output, in find_data.tsv, is a TSV file with a lot of columns (too wide to easily read here!). Here are four interesting columns:

cut -f 2,5,8,20  find_data.tsv  | column -t
subject_id    sequence_replicate_number  import_status  isolate_directory
SAMEA2534128  1                          1              Pipeline_root/00/00/00/01/1
SAMEA2534128  2                          1              Pipeline_root/00/00/00/01/1
SAMEA2534421  1                          1              Pipeline_root/00/00/00/02/2

We have the two samples imported (import_status of 1 means successfully imported). Sample SAMEA2534128 has two sequence replicates (sequence_replicate_number 1 and 2). The location on disk for each sample is in the isolate_directory column.

Remove contamination pipeline

This decontaminates reads. All later pipelines (QC, Variant call, mykrobe) use decontaminated reads, so this pipeline must be run for the later pipelines to run.

The pipeline will find all newly imported samples (ie those that have not been decontaminated), and run the decontamination pipeline on each of them.

Run this command:

nextflow run clockwork/nextflow/remove_contam.nf \
  -with-singularity clockwork_container.img \
  --ref_id 1 \
  --references_root Pipeline_refs \
  --pipeline_root Pipeline_root \
  --db_config_file db.ini
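
When it finishes, you can check in MySQL that each sequencing run was decontaminated successfully (a status of 1 means success; the Pipeline table is described in the Troubleshooting section below):

mysql --password=PASSWORD -u USER -h HOST -D DATABASE_NAME \
  -e 'select seqrep_id, pipeline_name, status from Pipeline where pipeline_name = "remove_contam"'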

Alternative - do not use remove contamination

If you are happy that your reads do not need decontaminating, then you can avoid decontamination by running the "fake" decontamination pipeline, which simply updates the database to say that decontamination has been run. Then the "decontaminated" reads used for the next pipelines will just be the original reads.

nextflow run clockwork/nextflow/fake_remove_contam.nf \
  -with-singularity clockwork_container.img \
  --pipeline_root Pipeline_root \
  --db_config_file db.ini

QC pipeline

This is optional. It runs FastQC, and also gathers stats and plots by mapping the reads and running samtools stats/plot-bamstats. It was really intended for troubleshooting, and as such the results are put into the database, so some MySQL knowledge is needed to make use of them. Documentation may be added in the future, but for now you may want to skip this pipeline.

Run the QC pipeline with the command below. It will find all samples that have been decontaminated and have not yet had QC run on them.

nextflow run clockwork/nextflow/qc.nf \
  -with-singularity clockwork_container.img \
  --ref_id 2 \
  --references_root  Pipeline_refs \
  --pipeline_root Pipeline_root \
  --db_config_file db.ini

Variant calling pipeline

This will run the variant calling pipeline on all samples that have been decontaminated and not yet been variant called.

Run this command:

nextflow run clockwork/nextflow/variant_call.nf \
  -with-singularity clockwork_container.img \
  --ref_id 2 \
  --references_root  Pipeline_refs \
  --pipeline_root Pipeline_root \
  --db_config_file db.ini

Getting VCF files

At this point, having run the variant calling pipeline, you probably want a VCF file for each sample. Recall that clockwork arranges all files in its own directory structure inside Pipeline_root/. We can use the clockwork command find_data to get all the VCF file paths:

singularity exec clockwork_container.img clockwork find_data \
  --pipeline variant_call \
  db.ini Pipeline_root/ > var_calls.tsv

The column pipeline_directory has the directory containing output files for the variant calling pipeline. The path will look like this:

/path/to/Pipeline_root/00/00/00/01/1/Pipelines/1_2/variant_call/0.11.1.ref.2

The main files of interest in there are:

  1. The final call set in minos/final.vcf, ie to get the final call set for a sample, take the pipeline_directory and add minos/final.vcf to it.
  2. A BAM file of the mapped trimmed reads, with duplicates removed, called samtools/rmdup.bam.
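
As a sketch of how to collect the final VCF paths for all samples, you could look up the pipeline_directory column by name in the header of var_calls.tsv and append minos/final.vcf to each entry:

# find which column of var_calls.tsv is pipeline_directory
col=$(head -n 1 var_calls.tsv | tr '\t' '\n' | grep -xn pipeline_directory | cut -d: -f1)

# print the path to each sample's final call set
tail -n +2 var_calls.tsv | cut -f "$col" | sed 's|$|/minos/final.vcf|'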

Mykrobe predict pipeline

First, you need to register a mykrobe panel with clockwork, similarly to setting up the reference genomes for remove contamination and variant calling. To register the panel 201901, which at the time of writing is the latest built-in panel for mykrobe, run this:

singularity exec clockwork_container.img clockwork db_add_mykrobe_panel \
  db.ini tb 201901 Pipeline_refs

That updates the database, telling clockwork that we will be using the 201901 panel. In MySQL you should see this:

select * from Reference;
+--------------+---------------+
| reference_id | name          |
+--------------+---------------+
|            1 | remove_contam |
|            2 | NC_000962.3   |
|            3 | 201901        |
+--------------+---------------+

The newly-added mykrobe panel has reference_id 3.

Now run the mykrobe pipeline using the following command.

nextflow run clockwork/nextflow/mykrobe_predict.nf \
  -with-singularity clockwork_container.img \
  --ref_id 3 \
  --references_root Pipeline_refs \
  --pipeline_root Pipeline_root \
  --db_config_file db.ini
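
As with variant calling, find_data should show where the output files are. By analogy with the variant calling example above, passing --pipeline mykrobe_predict (we are assuming this flag value from the pipeline name stored in the Pipeline table) would restrict the output to this pipeline:

singularity exec clockwork_container.img clockwork find_data \
  --pipeline mykrobe_predict \
  db.ini Pipeline_root/ > mykrobe_runs.tsv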

Troubleshooting

Pipelines can crash in unpredictable ways.

If the import pipeline crashes, then it is impossible to predict what state the database and files on disk will be in. We may try to document this in future, but it is likely that there are too many possibilities to cover.

If the remove contamination, QC, variant call, or mykrobe pipelines crash, then it is relatively easy to tidy things up and rerun. The database has a Pipeline table, which tracks which pipelines have been run, or are currently running, on which samples. If a pipeline crashes, delete the rows of the table that say a sample is currently running; when you rerun, those samples will be run again (otherwise, clockwork will think they are still running and ignore them).

If you followed this walkthrough, and everything worked, then you should see this:

select * from Pipeline;
+------------+-----------+-------------+---------+-----------------+--------+--------------+
| isolate_id | seqrep_id | seqrep_pool | version | pipeline_name   | status | reference_id |
+------------+-----------+-------------+---------+-----------------+--------+--------------+
|          1 |         1 | NULL        | 0.11.1  | remove_contam   |      1 |            1 |
|          1 |         2 | NULL        | 0.11.1  | remove_contam   |      1 |            1 |
|          2 |         3 | NULL        | 0.11.1  | remove_contam   |      1 |            1 |
|          1 |         1 | NULL        | 0.11.1  | qc              |      1 |            2 |
|          1 |         2 | NULL        | 0.11.1  | qc              |      1 |            2 |
|          2 |         3 | NULL        | 0.11.1  | qc              |      1 |            2 |
|          1 |      NULL | 1_2         | 0.11.1  | variant_call    |      1 |            2 |
|          2 |      NULL | 1           | 0.11.1  | variant_call    |      1 |            2 |
|          1 |      NULL | 1_2         | 0.11.1  | mykrobe_predict |      1 |            3 |
|          2 |      NULL | 1           | 0.11.1  | mykrobe_predict |      1 |            3 |
+------------+-----------+-------------+---------+-----------------+--------+--------------+

A status of 1 means successfully finished. A status of -1 means failed, and 0 means currently running. In the above, everything has status 1, meaning that all runs were successful.
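
After a crash, you can list the rows that clockwork still thinks are running with:

select * from Pipeline where status = 0;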

You can delete all rows whose status is not equal to 1 with:

delete from Pipeline where status != 1

Be careful with this command and only run it if there are currently no pipelines running!

After cleaning up that Pipeline table, also delete all your .nextflow* files, and the nextflow work directory (either work or whatever you used with the -w option). Then rerun the nextflow pipeline.