diff --git a/README.md b/README.md
index c917f4f..8fc03ca 100644
--- a/README.md
+++ b/README.md
@@ -1,82 +1,353 @@
-
-
-# Run Sentieon pipelines on Google Cloud Platform
-
-Here we present a convenient tool for your to run our typical pieplines on Google Cloud Plarform. If you want to customize your pipeline, you may want to visit https://www.sentieon.com/support/.
-
-## Pipelines
-
-- Mapping(FASTQ -> BAM): `sentieon bwa-mem`
-- Variant Calling(BAM -> VCF)
- - Germline: `DNAseq` `DNAscope`
- - WGS
- - WES
- - Somatic: `TNhaplotyper` `TNhaplotyper2` `TNscope`
- - Tumor-Normal
- - Tumor only
-
-## Prerequisite & Setup
-
-- Python 2.7+
-- GCP account and Cloud SDK
-
- You may want to follow the instructions in "Before you begin" section in this [Google Cloud Genomics Tutorial](https://cloud.google.com/genomics/docs/tutorials/sentieon) to set up the working environment.
-- Set up your virtual environment and install dependencies
- ```bash
- #pip install virtualenv
- virtualenv env
- source env/bin/activate
- pip install --upgrade \
- pyyaml \
- google-api-python-client \
- google-auth \
- google-cloud-storage \
- google-auth-httplib2
- ```
-- Download the example files and set your current directory
-
- ```bash
- git clone https://github.com/sentieon/sentieon-google-genomics.git
- cd sentieon-google-genomics
- ```
-
-## Run your first Sentieon job on GCP
-
-Right now, we are granting free-trial license to your account automatically. You will get 14 days free trial starting from the time you run your first Sentieon job.
-
- 1. Specify your `PROJECT` and `BUCKET` in `examples/example.json`
- ```json
- {
+
+
+# Run Sentieon pipelines on the Google Cloud Platform
+*Easy-to-use pipelines for the Sentieon tools on the Google Cloud*
+
+For a tutorial, see Google's tutorial on [running a Sentieon DNAseq pipeline](https://cloud.google.com/genomics/docs/tutorials/sentieon). For more customized pipelines and additional details on the Sentieon software, please visit https://www.sentieon.com.
+
+## Table of Contents
+- [Highlights](#highlights)
+- [Prerequisites](#prerequisites)
+- [Running a pipeline](#running)
+  - [Configuring your environment](#configure)
+  - [Download the pipeline script](#example)
+ - [Understanding the input format](#input)
+ - [Run the example pipeline](#run)
+ - [Understanding the output](#understand)
+ - [Other example pipelines](#other_examples)
+- [Recommended configurations](#recommended)
+ - [Germline WGS and WES](#germline)
+ - [Somatic WGS and WES](#somatic)
+- [Additional options - germline](#configurations_germline)
+ - [Input file options](#germline_input)
+ - [Machine options](#germline_machine)
+ - [Pipeline configuration](#germline_config)
+ - [Pipeline options](#germline_options)
+- [Additional options - somatic](#configurations_somatic)
+ - [Input file options](#somatic_input)
+ - [Machine options](#somatic_machine)
+ - [Pipeline configuration](#somatic_config)
+ - [Pipeline options](#somatic_options)
+- [Where to get help](#help)
+
+
+
+## Highlights
+
+- Easily run Sentieon pipelines on the Google Cloud.
+- Pipelines are tuned by Sentieon for efficient processing of WES and WGS samples.
+- All Sentieon pipelines and variant callers are available, including DNAseq, DNAscope, TNseq, and TNscope.
+- Results match the GATK [Germline](https://github.com/Sentieon/sentieon-dnaseq) and Somatic Best Practices pipelines.
+- Automatic 14-day free trial of the Sentieon software on the Google Cloud.
+
+
+
+## Prerequisites
+
+1. [Install Python 2.7+](https://www.python.org/downloads/).
+2. Select or create a GCP project.
+3. Make sure that billing is enabled for your Google Cloud Platform project.
+4. Enable the Cloud Life Sciences, Compute Engine, and Cloud Storage APIs.
+5. Install and initialize the Cloud SDK.
+6. Update and install gcloud components:
+```bash
+gcloud components update &&
+gcloud components install alpha
+```
+7. [Install git](https://git-scm.com/downloads) to download the required files.
+8. By default, Compute Engine enforces resource quotas to prevent inadvertent usage. Increasing your quotas lets you launch more virtual machines concurrently, increasing throughput and reducing turnaround time.
+For best results, request additional quota above your project's defaults. The recommended increases below are also the minimum quotas needed to run the examples. Make your quota requests in the us-central1 region:
+    - CPUs: 64
+    - Persistent Disk Standard (GB): 375
+You can leave other quota request fields empty to keep your current quotas. A quick way to check your current quotas is shown below.
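+
+As a convenience (not a required step), you can view a region's current quota limits and usage with `gcloud`; replace `PROJECT_ID` with your own project:
+```bash
+# Show quota metrics, limits, and current usage for us-central1
+gcloud compute regions describe us-central1 --project PROJECT_ID
+```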
+
+
+
+## Running a pipeline
+
+
+
+
+### Configuring your environment
+
+Set up a Python virtualenv to manage the environment. First, install virtualenv if necessary:
+```bash
+pip install --upgrade virtualenv
+```
+Then create the virtualenv and install the required Python dependencies:
+```bash
+virtualenv env
+source env/bin/activate
+pip install --upgrade \
+ pyyaml \
+ google-api-python-client \
+ google-auth \
+ google-cloud-storage \
+ google-auth-httplib2
+```
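+
+If you return to this work later in a new shell, reactivate the same environment before running the pipeline runner:
+```bash
+source env/bin/activate
+```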
+
+
+
+### Download the pipeline script
+
+Download the pipeline script and move into the new directory.
+```bash
+git clone https://github.com/sentieon/sentieon-google-genomics.git
+cd sentieon-google-genomics
+```
+
+
+
+### Understanding the input format
+
+The runner script accepts a JSON file as input. In the repository you downloaded, there is an `examples/example.json` file with the following content:
+```json
+{
"FQ1": "gs://sentieon-test/pipeline_test/inputs/test1_1.fastq.gz",
"FQ2": "gs://sentieon-test/pipeline_test/inputs/test1_2.fastq.gz",
"REF": "gs://sentieon-test/pipeline_test/reference/hs37d5.fa",
- "OUTPUT_BUCKET": "YOUR_BUCKET_HERE",
+ "OUTPUT_BUCKET": "gs://BUCKET",
+ "ZONES": "us-central1-a,us-central1-b,us-central1-c,us-central1-f",
+ "PROJECT_ID": "PROJECT_ID",
+ "EMAIL": "EMAIL"
+}
+```
+
+The following table describes the JSON keys in the file:
+
+| JSON key | Description |
+| ------------- | ----------------------------------------------------------------------------- |
+| FQ1           | The input FASTQ file containing the first read of each pair.                   |
+| FQ2           | The input FASTQ file containing the second read of each pair.                  |
+| BAM | The input BAM file, if applicable. |
+| REF | The reference genome. If set, the reference index files are assumed to exist. |
+| OUTPUT_BUCKET | The bucket and directory used to store the data output from the pipeline. |
+| ZONES | A comma-separated list of GCP zones to use for the worker node. |
+| PROJECT_ID | Your GCP project ID. |
+| EMAIL         | Your email address.                                                             |
+
+The `FQ1`, `FQ2`, `REF`, and `ZONES` fields will work with the defaults. However, the `OUTPUT_BUCKET`, `PROJECT_ID`, and `EMAIL` fields will need to be updated to point to your specific output bucket/path, Project ID, and email address.
+
+
+
+### Run the example pipeline
+
+Edit the `OUTPUT_BUCKET`, `PROJECT_ID`, and `EMAIL` fields in `examples/example.json` to your output bucket/path, the GCP project ID that you set up earlier, and the email address you want associated with your Sentieon license. By supplying the `EMAIL` field, your GCP project will automatically receive a 14-day free trial of the Sentieon software on the Google Cloud.
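+
+You can make these edits in any text editor. As an optional shortcut, if you have `jq` installed (it is not required by this repository), you can write an edited copy from the command line; the bucket, project, and email values below are placeholders:
+```bash
+jq '.OUTPUT_BUCKET = "gs://my-bucket/test-run" |
+    .PROJECT_ID = "my-project-id" |
+    .EMAIL = "me@example.com"' examples/example.json > my_example.json
+```
+If you go this route, pass `my_example.json` to the runner below instead of `examples/example.json`.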
+
+After modifying the `examples/example.json` file, you can use the following command to run the DNAseq pipeline on a small test dataset:
+```bash
+python runner/sentieon_runner.py examples/example.json
+```
+
+
+
+### Understanding the output
+
+If execution is successful, the runner script will print some logging information followed by `Operation succeeded` to the terminal. Output files from the pipeline can then be found in the `OUTPUT_BUCKET` location in Google Cloud Storage including alignment (BAM) files, variant calls, sample metrics and logging information.
+
+In the event of run failure, some diagnostic information will be printed to the screen followed by an error message. For assistance, please send the diagnostic information along with any log files in `OUTPUT_BUCKET`/worker_logs/ to support@sentieon.com.
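+
+For example, assuming `OUTPUT_BUCKET` was set to `gs://my-bucket/test-run` (a placeholder), you could list and download the worker logs with `gsutil`:
+```bash
+# List the log files written by the worker VM
+gsutil ls gs://my-bucket/test-run/worker_logs/
+
+# Copy the logs to the current directory for inspection
+gsutil cp -r gs://my-bucket/test-run/worker_logs .
+```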
+
+
+
+### Other example pipelines
+In the `examples` directory, you can find the following example configurations:
+
+| Configuration | Pipeline |
+| --------------- | ------------------------------------------------------------------ |
+| 100x_wes.json | DNAseq pipeline from FASTQ to VCF for Whole Exome Sequencing Data |
+| 30x_wgs.json | DNAseq pipeline from FASTQ to VCF for Whole Genome Sequencing Data |
+| tn_example.json | TNseq pipeline from FASTQ to VCF for Tumor Normal Pairs |
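+
+Each of these configurations is run the same way as the first example: update `OUTPUT_BUCKET`, `PROJECT_ID`, and `EMAIL` (and the input files, if you are using your own data), then pass the file to the runner. For instance:
+```bash
+python runner/sentieon_runner.py examples/30x_wgs.json
+```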
+
+
+
+## Recommended configurations
+
+Below are recommended configurations for some common use cases. The cost and runtime estimates assume that jobs run on preemptible instances and are not preempted during execution.
+
+
+
+### Germline Whole Genome and Whole Exome Sequencing
+
+The following configuration will run a 30x human genome at a cost of approximately $1.35 and will take about 2 hours. This configuration can also be used to run a 100x whole exome at a cost of approximately $0.30 and will take about 35 minutes.
+```json
+{
+ "FQ1": "gs://my-bucket/sample1_1.fastq.gz",
+  "FQ2": "gs://my-bucket/sample1_2.fastq.gz",
+ "REF": "gs://sentieon-test/pipeline_test/reference/hs37d5.fa",
+ "OUTPUT_BUCKET": "gs://BUCKET",
+ "ZONES": "us-central1-a,us-central1-b,us-central1-c,us-central1-f",
+ "PROJECT_ID": "PROJECT_ID",
+ "EMAIL": "EMAIL",
+ "BQSR_SITES": "gs://sentieon-test/pipeline_test/reference/Mills_and_1000G_gold_standard.indels.b37.vcf.gz,gs://sentieon-test/pipeline_test/reference/1000G_phase1.indels.b37.vcf.gz,gs://sentieon-test/pipeline_test/reference/dbsnp_138.b37.vcf.gz",
+ "DBSNP": "gs://sentieon-test/pipeline_test/reference/dbsnp_138.b37.vcf.gz",
+ "PREEMPTIBLE_TRIES": "2",
+ "NONPREEMPTIBLE_TRY": true,
+ "STREAM_INPUT": "True",
+ "DISK_SIZE": 300,
+ "PIPELINE": "GERMLINE",
+ "CALLING_ALGO": "Haplotyper"
+}
+```
+The `CALLING_ALGO` key can be changed to `DNAscope` to use the Sentieon DNAscope variant caller for improved variant calling accuracy. For large input files, `DISK_SIZE` should be increased.
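+
+For example, starting from a copy of the configuration above saved as `my_germline.json` (a placeholder name), you could switch the caller and enlarge the disk with `jq`, assuming it is installed; the 600 GB value is only illustrative:
+```bash
+jq '.CALLING_ALGO = "DNAscope" | .DISK_SIZE = 600' my_germline.json > my_dnascope.json
+```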
+
+
+
+
+### Somatic Whole Genome and Whole Exome Sequencing
+
+The following configuration will run a paired tumor-normal human genome (60x tumor, 30x normal) at a cost of approximately $3.70 and will take about 7 hours. This configuration can also be used to run a paired tumor-normal human exome (150x tumor, 150x normal) at a cost of approximately $0.60 and will take about 1.5 hours.
+```json
+{
+ "TUMOR_FQ1": "gs://my-bucket/tumor1_1.fastq.gz",
+ "TUMOR_FQ2": "gs://my-bucket/tumor1_2.fastq.gz",
+ "FQ1": "gs://my-bucket/normal1_1.fastq.gz",
+  "FQ2": "gs://my-bucket/normal1_2.fastq.gz",
+ "REF": "gs://sentieon-test/pipeline_test/reference/hs37d5.fa",
+ "OUTPUT_BUCKET": "gs://BUCKET",
"ZONES": "us-central1-a,us-central1-b,us-central1-c,us-central1-f",
- "PROJECT_ID": "YOUR_PROJECT_HERE"
- }
- ```
-
- 2. Run
- ```bash
- python runner/sentieon_runner.py examples/example.json
- ```
-
-## Pipeline Examples
-
- ### JSON examples
-
- In `examples` directory, you could find
-
- file | pipeline
- --- | ---
- 100x_wes.json | DNAseq pipeline from FASTQ to VCF for Whole Exome Sequencing Data
- 30x_wgs.json | DNAseq pipeline from FASTQ to VCF for Whole Genome Sequencing Data
- tn_example.json | TNseq pipeline from FASTQ to VCF for Tumor Normal Pairs
-
- ### Parameters
-
- In `runner/sentieon_germline.yaml` and `runner/sentieon_somatic.yaml`, you could find adjustable parameters for germline and somatic pipelines.
-
-
-
-
+ "PROJECT_ID": "PROJECT_ID",
+ "EMAIL": "EMAIL",
+ "BQSR_SITES": "gs://sentieon-test/pipeline_test/reference/Mills_and_1000G_gold_standard.indels.b37.vcf.gz,gs://sentieon-test/pipeline_test/reference/1000G_phase1.indels.b37.vcf.gz,gs://sentieon-test/pipeline_test/reference/dbsnp_138.b37.vcf.gz",
+ "DBSNP": "gs://sentieon-test/pipeline_test/reference/dbsnp_138.b37.vcf.gz",
+ "PREEMPTIBLE_TRIES": "2",
+ "NONPREEMPTIBLE_TRY": true,
+ "STREAM_INPUT": "True",
+ "DISK_SIZE": 300,
+ "PIPELINE": "SOMATIC",
+ "CALLING_ALGO": "TNhaplotyper"
+}
+```
+The `CALLING_ALGO` key can be changed to `TNsnv`, `TNhaplotyper`, `TNhaplotyper2`, or `TNscope` to use Sentieon's TNsnv, TNhaplotyper, TNhaplotyper2, or TNscope variant callers, respectively. For large input files, `DISK_SIZE` should be increased.
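+
+As with the germline configuration, you can edit a copy of this file directly, or, if `jq` is available, rewrite it from the command line and pass the result to the runner; `my_somatic.json` is a placeholder name:
+```bash
+jq '.CALLING_ALGO = "TNscope"' my_somatic.json > my_tnscope.json
+python runner/sentieon_runner.py my_tnscope.json
+```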
+
+
+
+## Additional options - Germline
+
+
+
+### Input file options
+
+| JSON Key | Description |
+|--------------- | ------------------------------------------------------------------------------------ |
+| FQ1 | A comma-separated list of input R1 FASTQ files |
+| FQ2 | A comma-separated list of input R2 FASTQ files |
+| READGROUP      | A comma-separated list of readgroup headers to add to the read data during alignment  |
+| BAM | A comma-separated list of input BAM files |
+| REF | The path to the reference genome |
+| BQSR_SITES | A comma-separated list of known sites for BQSR |
+| DBSNP | A dbSNP file to use during variant calling |
+| INTERVAL | A string of interval(s) to use during variant calling |
+| INTERVAL_FILE  | A file of interval(s) to use during variant calling                                    |
+| DNASCOPE_MODEL | A trained model to use during DNAscope variant calling |
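+
+`FQ1`, `FQ2`, and `READGROUP` accept comma-separated lists so that a sample sequenced across multiple lanes can be aligned in a single run; entries in each list are matched up by position. A minimal sketch of the input-related keys for such a run, with placeholder bucket paths and written as a shell heredoc, might look like:
+```bash
+cat > multi_lane_inputs.json <<'EOF'
+{
+  "FQ1": "gs://my-bucket/sampleA_L001_R1.fastq.gz,gs://my-bucket/sampleA_L002_R1.fastq.gz",
+  "FQ2": "gs://my-bucket/sampleA_L001_R2.fastq.gz,gs://my-bucket/sampleA_L002_R2.fastq.gz",
+  "REF": "gs://sentieon-test/pipeline_test/reference/hs37d5.fa"
+}
+EOF
+```
+The remaining keys (`OUTPUT_BUCKET`, `PROJECT_ID`, `EMAIL`, and so on) would be filled in as in the earlier examples; if `READGROUP` is supplied, it needs one entry per FASTQ pair, in the same order as `FQ1`.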
+
+
+
+### Machine options
+
+| JSON Key | Description |
+| ------------ | --------------------------------------------------------------------------- |
+| ZONES | GCE Zones to potentially launch the job in |
+| DISK_SIZE | The size of the hard disk to use (should be 3x the size of the input files) |
+| MACHINE_TYPE | The type of GCE machine to use to run the pipeline |
+
+
+
+### Pipeline configurations
+
+| JSON Key | Description |
+| ------------------- | ----------------------------------------------------------------------- |
+| SENTIEON_VERSION | The version of the Sentieon software package to use |
+| DEDUP | Type of duplicate removal to run (nodup, markdup or rmdup) |
+| NO_METRICS | Skip running metrics collection |
+| NO_BAM_OUTPUT | Skip outputting a preprocessed BAM file |
+| NO_HAPLOTYPER | Skip variant calling |
+| GVCF_OUTPUT | Output variant calls in gVCF format rather than VCF format |
+| STREAM_INPUT | Stream the input FASTQ files directly from Google Cloud Storage |
+| RECALIBRATED_OUTPUT | Apply BQSR to the output preprocessed alignments (not recommended) |
+| CALLING_ARGS | A string of additional arguments to pass to the variant caller |
+| PIPELINE | Set to `GERMLINE` to run the germline variant calling pipeline |
+| CALLING_ALGO | The Sentieon variant calling algo to run. Either Haplotyper or DNAscope |
+
+
+
+### Pipeline options
+
+| JSON Key | Description |
+| ------------------- | --------------------------------------------------------------------------------------------------- |
+| OUTPUT_BUCKET | The Google Cloud Storage Bucket and path prefix to use for the output files |
+| EMAIL | An email address to use to obtain an evaluation license for your GCP Project |
+| SENTIEON_KEY | Your Sentieon license key (only applicable for paying customers) |
+| PROJECT_ID | Your GCP Project ID to use when running jobs |
+| PREEMPTIBLE_TRIES | Number of attempts to run the pipeline using preemptible instances |
+| NONPREEMPTIBLE_TRY | After `PREEMPTIBLE_TRIES` are exhausted, whether to try one additional run with standard instances |
+
+
+
+## Additional options - Somatic
+
+
+
+### Input file options
+
+| JSON Key | Description |
+|---------------- | ------------------------------------------------------------------------------------------- |
+| TUMOR_FQ1 | A comma-separated list of input R1 tumor FASTQ files |
+| TUMOR_FQ2       | A comma-separated list of input R2 tumor FASTQ files                                          |
+| FQ1 | A comma-separated list of input R1 normal FASTQ files |
+| FQ2 | A comma-separated list of input R2 normal FASTQ files |
+| TUMOR_READGROUP | A comma-separated list of readgroup headers to add to the tumor read data during alignment   |
+| READGROUP       | A comma-separated list of readgroup headers to add to the normal read data during alignment  |
+| TUMOR_BAM | A comma-separated list of input tumor BAM files |
+| BAM | A comma-separated list of input normal BAM files |
+| REF | The path to the reference genome |
+| BQSR_SITES | A comma-separated list of known sites for BQSR |
+| DBSNP | A dbSNP file to use during variant calling |
+| INTERVAL | A string of interval(s) to use during variant calling |
+| INTERVAL_FILE   | A file of interval(s) to use during variant calling                                           |
+
+
+
+### Machine options
+
+| JSON Key | Description |
+| ------------ | --------------------------------------------------------------------------- |
+| ZONES | GCE Zones to potentially launch the job in |
+| DISK_SIZE | The size of the hard disk to use (should be 3x the size of the input files) |
+| MACHINE_TYPE | The type of GCE machine to use to run the pipeline |
+
+
+
+### Pipeline configurations
+
+| JSON Key | Description |
+| ------------------- | ------------------------------------------------------------------------------------------------------- |
+| SENTIEON_VERSION | The version of the Sentieon software package to use |
+| DEDUP | Type of duplicate removal to run (nodup, markdup or rmdup) |
+| NO_METRICS | Skip running metrics collection |
+| NO_BAM_OUTPUT | Skip outputting a preprocessed BAM file |
+| NO_VCF | Skip variant calling |
+| STREAM_INPUT | Stream the input FASTQ files directly from Google Cloud Storage |
+| RECALIBRATED_OUTPUT | Apply BQSR to the output preprocessed alignments (not recommended) |
+| CALLING_ARGS | A string of additional arguments to pass to the variant caller |
+| PIPELINE | Set to `SOMATIC` to run the somatic variant calling pipeline |
+| RUN_TNSNV | If using the TNseq pipeline, use TNsnv for variant calling |
+| CALLING_ALGO | The Sentieon somatic variant calling algo to run. Either TNsnv, TNhaplotyper, TNhaplotyper2, or TNscope |
+
+
+
+### Pipeline options
+
+| JSON Key | Description |
+| ------------------- | --------------------------------------------------------------------------------------------------- |
+| OUTPUT_BUCKET | The Google Cloud Storage Bucket and path prefix to use for the output files |
+| EMAIL | An email address to use to obtain an evaluation license for your GCP Project |
+| SENTIEON_KEY | Your Sentieon license key (only applicable for paying customers) |
+| PROJECT_ID | Your GCP Project ID to use when running jobs |
+| PREEMPTIBLE_TRIES | Number of attempts to run the pipeline using preemptible instances |
+| NONPREEMPTIBLE_TRY | After `PREEMPTIBLE_TRIES` are exhausted, whether to try one additional run with standard instances |
+
+
+
+## Where to get help
+
+Please email support@sentieon.com with any questions.
diff --git a/examples/30x_wgs_ccdg.json b/examples/30x_wgs_ccdg.json
new file mode 100644
index 0000000..82e41da
--- /dev/null
+++ b/examples/30x_wgs_ccdg.json
@@ -0,0 +1,12 @@
+{
+ "FQ1": "gs://your-bucket/wgs_sample1_1.fastq.gz",
+ "FQ2": "gs://your-bucket/wgs_sample1_2.fastq.gz",
+ "REF": "gs://sentieon-test/pipeline_test/reference/hg38/GRCh38_full_analysis_set_plus_decoy_hla.fa",
+ "OUTPUT_BUCKET": "YOUR_BUCKET_HERE",
+ "BQSR_SITES": "gs://sentieon-test/pipeline_test/reference/hg38/Homo_sapiens_assembly38.dbsnp138.vcf.gz,gs://sentieon-test/pipeline_test/reference/hg38/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz,gs://sentieon-test/pipeline_test/reference/hg38/Homo_sapiens_assembly38.known_indels.vcf.gz",
+ "DBSNP": "gs://sentieon-test/pipeline_test/reference/hg38/Homo_sapiens_assembly38.dbsnp138.vcf.gz",
+ "PREEMPTIBLE_TRIES": "2",
+ "STREAM_INPUT": "True",
+ "ZONES": "us-central1-a,us-central1-b,us-central1-c,us-central1-f",
+ "PROJECT_ID": "YOUR_PROJECT_HERE"
+}
diff --git a/examples/tn_example.json b/examples/tn_example.json
index 4286ec7..7fcfcbd 100644
--- a/examples/tn_example.json
+++ b/examples/tn_example.json
@@ -7,7 +7,7 @@
"OUTPUT_BUCKET": "YOUR_BUCKET_HERE",
"ZONES": "us-central1-a,us-central1-b,us-central1-c,us-central1-f",
"PROJECT_ID": "YOUR_PROJECT_HERE",
- "PIPELINE": "TNseq",
- "RUN_TNSNV": null
+ "PIPELINE": "SOMATIC",
+ "CALLING_ALGO": "TNhaplotyper"
}
diff --git a/pipeline_scripts/Dockerfile b/pipeline_scripts/Dockerfile
index 282438f..424d7b9 100644
--- a/pipeline_scripts/Dockerfile
+++ b/pipeline_scripts/Dockerfile
@@ -1,12 +1,11 @@
-FROM sentieon/sentieon:201808.01 as builder
-ARG version=201808.01
+FROM python:2.7.16-slim-stretch as downloader
+# Install gsutil
RUN apt-get update && \
apt-get install -y \
- gcc \
- python-dev \
- python-setuptools \
curl \
+ python-pip \
+ gcc \
lsb-release && \
export CLOUD_SDK_REPO="cloud-sdk-$(lsb_release -c -s)" && \
echo "deb http://packages.cloud.google.com/apt $CLOUD_SDK_REPO main" | tee -a /etc/apt/sources.list.d/google-cloud-sdk.list && \
@@ -14,6 +13,7 @@ RUN apt-get update && \
apt-get update && apt-get install -y google-cloud-sdk && \
pip install crcmod
+# Install samtools
RUN apt-get update && \
apt-get install -y \
lbzip2 \
@@ -22,34 +22,39 @@ RUN apt-get update && \
libbz2-dev \
liblzma-dev \
make && \
- curl -Lo samtools-1.7.tar.bz2 https://github.com/samtools/samtools/releases/download/1.7/samtools-1.7.tar.bz2 && \
- tar -xf samtools-1.7.tar.bz2 && \
- cd samtools-1.7 && \
+ curl -Lo samtools-1.9.tar.bz2 https://github.com/samtools/samtools/releases/download/1.9/samtools-1.9.tar.bz2 && \
+ tar -xf samtools-1.9.tar.bz2 && \
+ cd samtools-1.9 && \
./configure && \
make install
-RUN apt-get update && \
- apt-get install -y \
- g++ && \
- curl -Lo samblaster-v.0.1.24.tar.gz https://github.com/GregoryFaust/samblaster/releases/download/v.0.1.24/samblaster-v.0.1.24.tar.gz && \
- tar -xf samblaster-v.0.1.24.tar.gz && \
- cd samblaster-v.0.1.24 && \
- make
+# Install samblaster
+RUN curl -Lo samblaster-v.0.1.24.tar.gz https://github.com/GregoryFaust/samblaster/releases/download/v.0.1.24/samblaster-v.0.1.24.tar.gz && \
+ tar -xf samblaster-v.0.1.24.tar.gz && \
+ cd samblaster-v.0.1.24 && \
+ make && \
+ cp samblaster /usr/local/bin/
+
+# Install metadata script dependencies
+RUN pip install requests urllib3
-FROM sentieon/sentieon:201808.01
-ARG version=201808.01
+FROM python:2.7.16-slim-stretch
-COPY --from=builder /usr/lib/google-cloud-sdk /usr/lib/google-cloud-sdk
-COPY --from=builder /usr/local/lib/python2.7/site-packages/crcmod/ /usr/local/lib/python2.7/site-packages/crcmod/
-COPY --from=builder /usr/local/bin/samtools /usr/local/bin
-COPY --from=builder /samblaster-v.0.1.24/samblaster /usr/local/bin
+LABEL container.base.image="python:2.7.16-slim-stretch"
+
+COPY --from=downloader /usr/lib/google-cloud-sdk /usr/lib/google-cloud-sdk
+COPY --from=downloader /usr/local/lib/python2.7/site-packages /usr/local/lib/python2.7/site-packages
+COPY --from=downloader /usr/local/bin/samtools /usr/local/bin
+COPY --from=downloader /usr/local/bin/samblaster /usr/local/bin
+
+CMD ["/bin/bash"]
RUN apt-get update && apt-get install -y \
+ curl \
+ libncurses5-dev \
bc \
dnsutils \
iputils-ping && \
- pip install --upgrade requests && \
ln -s ../lib/google-cloud-sdk/bin/gsutil /usr/bin/gsutil
-ADD gc_functions.sh gc_somatic.sh gc_germline.sh gen_credentials.py /opt/sentieon/
-
+ADD gc_functions.sh gc_somatic.sh gc_germline.sh gc_ccdg_germline.sh gen_credentials.py /opt/sentieon/
diff --git a/pipeline_scripts/gc_ccdg_germline.sh b/pipeline_scripts/gc_ccdg_germline.sh
new file mode 100644
index 0000000..3ec3abd
--- /dev/null
+++ b/pipeline_scripts/gc_ccdg_germline.sh
@@ -0,0 +1,230 @@
+#!/usr/bin/env bash
+
+set -xveo pipefail
+set +H
+
+BASEDIR=$(dirname "$0")
+scratch=/mnt/work
+nt=$(nproc)
+source $BASEDIR/gc_functions.sh
+
+# Set "None" variables to an empty string
+environmental_variables=(FQ1 FQ2 BAM OUTPUT_BUCKET REF READGROUP BQSR_SITES \
+ DBSNP INTERVAL INTERVAL_FILE NO_METRICS NO_BAM_OUTPUT NO_HAPLOTYPER \
+ GVCF_OUTPUT STREAM_INPUT PIPELINE OUTPUT_CRAM_FORMAT SENTIEON_KEY EMAIL \
+ SENTIEON_VERSION CALLING_ARGS DNASCOPE_MODEL CALLING_ALGO)
+unset_none_variables ${environmental_variables[@]}
+OUTPUT_CRAM_FORMAT="" # Not yet supported
+
+readonly FQ1 FQ2 BAM OUTPUT_BUCKET REF READGROUP BQSR_SITES DBSNP INTERVAL \
+ INTERVAL_FILE NO_METRICS NO_BAM_OUTPUT NO_HAPLOTYPER GVCF_OUTPUT \
+ STREAM_INPUT PIPELINE OUTPUT_CRAM_FORMAT SENTIEON_KEY EMAIL \
+ SENTIEON_VERSION CALLING_ARGS DNASCOPE_MODEL CALLING_ALGO
+
+release_dir="/opt/sentieon/sentieon-genomics-${SENTIEON_VERSION}/"
+
+# Basic error handling #
+if [[ -n "$FQ1" && -n "$BAM" ]]; then
+ echo "Please supply either fastq or BAM files"
+ exit 1
+fi
+
+if [[ -n "$INTERVAL" && -n "$INTERVAL_FILE" ]]; then
+ echo "Please supply either an INTERVAL or INTERVAL_FILE, not both"
+ exit 1
+fi
+
+if [[ -n "$NO_BAM_OUTPUT" && -n "$NO_HAPLOTYPER" && -n "$NO_METRICS" ]]; then
+ echo "Nothing to do"
+ exit 1
+fi
+
+# **********************************
+# 0. Setup
+# **********************************
+gc_setup
+export LD_PRELOAD=${release_dir}/lib/libjemalloc.so
+export MALLOC_CONF=lg_dirty_mult:-1
+
+## Download input files
+if [[ -n "$BAM" ]]; then
+ download_bams "$BAM" local_bams $input_dir
+else
+ local_bams=()
+fi
+
+download_intervals
+download_reference
+if [[ $CALLING_ALGO == "DNAscope" && -n "$DNASCOPE_MODEL" ]]; then
+ curl -L -o ${input_dir}/dnascope.model "$DNASCOPE_MODEL"
+fi
+
+## Handle the sites files
+IFS=',' read -r -a gs_bqsr_sites <<< "$BQSR_SITES"
+transfer_all_sites $bqsr_dir "${gs_bqsr_sites[@]}"
+local_bqsr_sites=("${local_sites[@]}")
+bqsr_sites="$local_str"
+
+if [[ -n "$DBSNP" ]]; then
+ transfer_all_sites $dbsnp_dir "$DBSNP"
+ dbsnp="${local_sites[0]}"
+fi
+
+# ******************************************
+# 1. Mapping reads with BWA-MEM, sorting
+# ******************************************
+output_ext="bam"
+if [[ -n $FQ1 ]]; then
+ bwa_mem_align "" "$FQ1" "$FQ2" "$READGROUP" local_bams $output_ext "-K 100000000 -Y" "$util_sort_xargs" "true"
+fi
+
+local_bams_str=""
+for bam in "${local_bams[@]}"; do
+ local_bams_str+=" -i \"$bam\" "
+done
+
+# ******************************************
+# 2. Metrics command
+# ******************************************
+metrics_cmd1=
+metrics_cmd2=
+metrics_files=
+if [[ -z "$NO_METRICS" ]]; then
+ build_metrics_cmd "" metrics_cmd1 metrics_cmd2 metrics_files
+fi
+
+# ******************************************
+# 3. Remove duplicates
+# ******************************************
+output_ext="bam"
+
+run_mark_duplicates "" "markdup" metrics_cmd1 "$local_bams_str" dedup_bam_str dedup_bams "$dedup_xargs" $output_ext "true" "${local_bams[@]}"
+if [[ -z "$NO_METRICS" ]]; then
+ (gsutil cp $metrics_dir/dedup_metrics.txt "$out_metrics" &&
+ rm $metrics_dir/dedup_metrics.txt) &
+ upload_dedup_pid=$!
+else
+ rm $metrics_dir/dedup_metrics.txt &
+fi
+for bam in "${local_bams[@]}"; do
+ if [[ -f "$bam" ]]; then
+ rm "$bam" &
+ fi
+ if [[ -f "${bam}".bai ]]; then
+ rm "${bam}".bai &
+ fi
+ if [[ -f "${bam}".crai ]]; then
+ rm "${bam}".crai &
+ fi
+done
+
+upload_metrics metrics_cmd1 metrics_cmd2 upload_metrics_pid ${metrics_files[@]}
+
+# ******************************************
+# 4. Base recalibration
+# ******************************************
+bqsr_intervals="chr1,chr2,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr21,chr22"
+cmd="$release_dir/bin/sentieon driver -t $nt -r \"$ref\" --interval $bqsr_intervals $dedup_bam_str --algo QualCal $bqsr_sites $work/recal_data.table"
+run "$cmd" "BQSR"
+
+cmd="$release_dir/bin/sentieon driver -t $nt -r \"$ref\" $dedup_bam_str --read_filter QualCalFilter,table=$work/recal_data.table,prior=-1.0,indel=false,levels=10/20/30,min_qual=6 --algo ReadWriter --cram_write_options version=3.0,compressor=gzip+rans $work/recalibrated.cram"
+run "$cmd" "ReadWriter"
+
+for bam in "${dedup_bams[@]}"; do
+ if [[ -f "$bam" ]]; then
+ rm "$bam" &
+ fi
+ if [[ -f "${bam}".bai ]]; then
+ rm "$bam".bai &
+ fi
+ if [[ -f "${bam}".crai ]]; then
+ rm "${bam}".crai &
+ fi
+done
+
+if [[ -z "$NO_BAM_OUTPUT" ]]; then
+ gsutil cp $work/recalibrated.cram $work/recalibrated.cram.crai "$out_bam" &
+ upload_recal_pid=$!
+fi
+
+
+# ******************************************
+# 5. Variant Calling
+# ******************************************
+
+## Generate a non-decoy BED file
+generate_nondecoy_bed "${ref}".fai "${ref}"_nondecoy.bed
+call_interval="$interval"
+call_interval=${call_interval:-"${ref}"_nondecoy.bed}
+
+outgvcf=$work/hc.g.vcf.gz
+outvcf=$work/hc.vcf.gz
+
+algo=Haplotyper
+extra_vcf_args=""
+if [[ $CALLING_ALGO == "DNAscope" ]]; then
+ algo="DNAscope"
+ if [[ -n "$DNASCOPE_MODEL" ]]; then
+ extra_vcf_args="--model ${input_dir}/dnascope.model"
+ else
+ extra_vcf_args="--var_type snp,indel,bnd"
+ fi
+ outvcf=$work/dnascope.vcf.gz
+ outgvcf=$work/dnascope.g.vcf.gz
+ tmpvcf=$work/tmp.vcf.gz
+fi
+
+if [[ -z $NO_HAPLOTYPER ]]; then
+ if [[ -n "$GVCF_OUTPUT" ]]; then
+ cmd="$release_dir/bin/sentieon driver --interval \"$call_interval\" -t $nt -r \"$ref\" -i $work/recalibrated.cram --algo $algo $CALLING_ARGS ${dbsnp:+-d \"$dbsnp\"} --emit_mode gvcf ${outgvcf}"
+ outfile=$outgvcf
+ else
+ cmd="$release_dir/bin/sentieon driver --interval \"$call_interval\" -t $nt -r \"$ref\" -i $work/recalibrated.cram --algo $algo $CALLING_ARGS $extra_vcf_args ${dbsnp:+-d \"$dbsnp\"} ${outvcf}"
+ outfile=$outvcf
+ fi
+ run "$cmd" "Variant calling"
+
+ # DNAscope SV calling
+ if [[ $CALLING_ALGO == "DNAscope" && -z "$GVCF_OUTPUT" && -z "$DNASCOPE_MODEL" ]]; then
+ mv $outvcf $tmpvcf
+ mv ${outvcf}.tbi ${tmpvcf}.tbi
+ cmd="$release_dir/bin/sentieon driver -t $nt -r \"$ref\" --algo SVSolver -v $tmpvcf $outvcf"
+ run "$cmd" "SVSolver"
+ fi
+
+ # DNAscope model apply
+ if [[ $CALLING_ALGO == "DNAscope" && -z "$GVCF_OUTPUT" && -n "$DNASCOPE_MODEL" ]]; then
+ mv $outvcf $tmpvcf
+ mv ${outvcf}.tbi ${tmpvcf}.tbi
+ cmd="$release_dir/bin/sentieon driver -t $nt -r \"$ref\" --algo DNAModelApply --model ${input_dir}/dnascope.model -v $tmpvcf $outvcf"
+ run "$cmd" "DNAscope model apply"
+ fi
+
+ gsutil cp $outfile ${outfile}.tbi "$out_variants" &
+ upload_vcf_pid=$!
+fi
+
+# Wait for all running uploads to finish
+set +e
+if [[ -n $upload_recal_pid ]]; then
+ wait $upload_recal_pid
+fi
+if [[ -n $upload_metrics_pid ]]; then
+ wait $upload_metrics_pid
+fi
+if [[ -n $upload_dedup_pid ]]; then
+ wait $upload_dedup_pid
+fi
+if [[ -n $upload_deduped_pid ]]; then
+ wait $upload_deduped_pid
+fi
+if [[ -n $upload_bqsr_pid ]]; then
+ wait $upload_bqsr_pid
+fi
+if [[ -n $upload_vcf_pid ]]; then
+ wait $upload_vcf_pid
+fi
+if [[ -n $upload_bqsr_metrics_pid ]]; then
+ wait $upload_bqsr_metrics_pid
+fi
+exit 0
diff --git a/pipeline_scripts/gc_functions.sh b/pipeline_scripts/gc_functions.sh
index 58fdf98..3b85484 100644
--- a/pipeline_scripts/gc_functions.sh
+++ b/pipeline_scripts/gc_functions.sh
@@ -45,7 +45,7 @@ transfer()
src_file=$1
dst_file=$2
start_s=`date +%s`
- gsutil cp $src_file $dst_file
+ gsutil cp "$src_file" "$dst_file"
check_error $? "Transfer $src_file to $dst_file"
end_s=`date +%s`
runtime=$(delta_time $start_s $end_s)
@@ -60,21 +60,21 @@ transfer_all_sites()
local_str=""
for src_file in "${src_files[@]}"; do
# File
- local_file=$dst_dir/$(basename $src_file)
- transfer $src_file $local_file
- local_sites+=($local_file)
- local_str+=" -k $local_file "
+ local_file=$dst_dir/$(basename "$src_file")
+ transfer "$src_file" "$local_file"
+ local_sites+=("$local_file")
+ local_str+=" -k \"$local_file\" "
# Index
- if $(test -e ${src_file}.idx) || $(gsutil -q stat ${src_file}.idx); then
- idx=${src_file}.idx
- elif $(test -e ${src_file}.tbi) || $(gsutil -q stat ${src_file}.tbi); then
- idx=${src_file}.tbi
+ if $(test -e "${src_file}".idx) || $(gsutil -q stat "${src_file}".idx); then
+ idx="${src_file}".idx
+ elif $(test -e "${src_file}".tbi) || $(gsutil -q stat "${src_file}".tbi); then
+ idx="${src_file}".tbi
else
echo "Cannot find idx for $src_file"
exit 1
fi
- local_idx=$dst_dir/$(basename $idx)
- transfer $idx $local_idx
+ local_idx=$dst_dir/$(basename "$idx")
+ transfer "$idx" "$local_idx"
done
}
@@ -88,7 +88,7 @@ upload_metrics()
eval "fun_metrics_cmd2=\$$fun_var_cmd2"
if [[ -n "$fun_metrics_cmd2" && -z "$fun_metrics_cmd1" && -f "${fun_metrics_files[0]}" ]]; then
(run "$fun_metrics_cmd2" "Plotting metrics results." &&
- gsutil cp ${fun_metrics_files[@]} $out_metrics &&
+ gsutil cp ${fun_metrics_files[@]} "$out_metrics" &&
rm ${fun_metrics_files[@]}) &
eval "$fun_pid=$! "
eval "$fun_var_cmd2=''"
@@ -106,6 +106,10 @@ unset_none_variables()
gc_setup()
{
+ ## Download the Sentieon software
+ curl -L https://s3.amazonaws.com/sentieon-release/software/sentieon-genomics-${SENTIEON_VERSION}.tar.gz | tar -zxf - -C /opt/sentieon
+ PATH=/opt/sentieon/sentieon-genomics-${SENTIEON_VERSION}/bin:$PATH
+
## Dirs
work=$scratch/work
metrics_dir=$scratch/metrics
@@ -130,8 +134,7 @@ gc_setup()
## Setup license information #
cred=$license_dir/credentials.json
project_file=$license_dir/credentials.json.project
- python /opt/sentieon/gen_credentials.py $cred "$SENTIEON_KEY" &
- credentials_pid=$!
+ python /opt/sentieon/gen_credentials.py ${EMAIL:+--email $EMAIL} $cred "$SENTIEON_KEY"
sleep 10
if [[ -n $SENTIEON_KEY ]]; then
export SENTIEON_AUTH_MECH=proxy_GOOGLE
@@ -171,22 +174,22 @@ download_bams()
dest_arr=$2
download_input_dir=$3
- bams=($(echo $bams | tr ',' ' '))
+ IFS=',' read -r -a bams <<< "$bams"
tmp_bam_dest=()
- for bam in ${bams[@]}; do
- local_bam=$download_input_dir/$(basename $bam)
- transfer $bam $local_bam
- if $(test -e ${bam}.bai) || $(gsutil -q stat ${bam}.bai); then
- bai=${bam}.bai
- elif $(test -e ${bam%%.bam}.bai) || $(gsutil -q stat ${bam%%.bam}.bai); then
- bai=${bam%%.bam}.bai
+ for bam in "${bams[@]}"; do
+ local_bam=$download_input_dir/$(basename "$bam")
+ transfer "$bam" "$local_bam"
+ if $(test -e "${bam}".bai) || $(gsutil -q stat "${bam}".bai); then
+ bai="${bam}".bai
+ elif $(test -e "${bam%%.bam}".bai) || $(gsutil -q stat "${bam%%.bam}".bai); then
+ bai="${bam%%.bam}".bai
else
echo "Cannot find the index file for $bam"
exit 1
fi
- local_bai=$download_input_dir/$(basename $bai)
- transfer $bai $local_bai
- tmp_bam_dest+=($local_bam)
+ local_bai=$download_input_dir/$(basename "$bai")
+ transfer "$bai" "$local_bai"
+ tmp_bam_dest+=("$local_bam")
done
eval "${dest_arr}=(${tmp_bam_dest})"
@@ -195,13 +198,13 @@ download_bams()
download_intervals()
{
if [[ -n "$INTERVAL" ]]; then
- interval=" --interval $INTERVAL "
+ interval="$INTERVAL"
interval_list=""
elif [[ -n "$INTERVAL_FILE" ]]; then
- local_interval_file=$input_dir/$(basename $INTERVAL_FILE)
- transfer $INTERVAL_FILE $local_interval_file
- interval=" --interval $local_interval_file "
- interval_list=" --interval_list $local_interval_file "
+ local_interval_file=$input_dir/$(basename "$INTERVAL_FILE")
+ transfer "$INTERVAL_FILE" "$local_interval_file"
+ interval="$local_interval_file"
+ interval_list="$local_interval_file"
else
interval=""
interval_list=""
@@ -210,25 +213,36 @@ download_intervals()
download_reference()
{
- ref=$ref_dir/$(basename $REF)
- transfer $REF $ref
- transfer ${REF}.fai ${ref}.fai
- if $(test -e ${REF}.dict) || $(gsutil -q stat ${REF}.dict); then
- transfer ${REF}.dict ${ref}.dict
- elif $(test -e ${REF%%.fa}.dict) || $(gsutil -q stat ${REF%%.fa}.dict); then
- transfer ${REF%%.fa}.dict ${ref%%.fa}.dict
- elif $(test -e ${REF%%.fasta}.dict) || $(gsutil -q stat ${REF%%.fasta}.dict); then
- transfer ${REF%%.fasta}.dict ${ref%%.fasta}.dict
+ ref=$ref_dir/$(basename "$REF")
+ transfer "$REF" "$ref"
+ transfer "${REF}".fai "${ref}".fai
+ if $(test -e "${REF}".dict) || $(gsutil -q stat "${REF}".dict); then
+ transfer "${REF}".dict "${ref}".dict
+ elif $(test -e "${REF%%.fa}".dict) || $(gsutil -q stat "${REF%%.fa}".dict); then
+ transfer "${REF%%.fa}".dict "${ref%%.fa}".dict
+ elif $(test -e "${REF%%.fasta}".dict) || $(gsutil -q stat "${REF%%.fasta}".dict); then
+ transfer "${REF%%.fasta}".dict "${ref%%.fasta}".dict
else
echo "Cannot find reference dictionary"
exit 1
fi
if [[ -n "$FQ1" || -n "$TUMOR_FQ1" ]]; then
- transfer ${REF}.amb ${ref}.amb
- transfer ${REF}.ann ${ref}.ann
- transfer ${REF}.bwt ${ref}.bwt
- transfer ${REF}.pac ${ref}.pac
- transfer ${REF}.sa ${ref}.sa
+ if $(test -e "${REF}".64.amb) || $(gsutil -q stat "${REF}".64.amb); then
+ middle=".64"
+ elif $(test -e "${REF}".amb) || $(gsutil -q stat "${REF}".amb); then
+ middle=""
+ else
+            echo "Cannot find BWA index files"
+ exit 1
+ fi
+ transfer "${REF}"${middle}.amb "${ref}"${middle}.amb
+ transfer "${REF}"${middle}.ann "${ref}"${middle}.ann
+ transfer "${REF}"${middle}.bwt "${ref}"${middle}.bwt
+ transfer "${REF}"${middle}.pac "${ref}"${middle}.pac
+ transfer "${REF}"${middle}.sa "${ref}"${middle}.sa
+ if $(test -e "${REF}"${middle}.alt) || $(gsutil -q stat "${REF}"${middle}.alt); then
+ transfer "${REF}"${middle}.alt "${ref}"${middle}.alt
+ fi
fi
}
@@ -242,11 +256,12 @@ bwa_mem_align()
fun_output_ext=$1; shift
fun_bwa_xargs=$1; shift
fun_util_sort_xargs=$1; shift
+ fun_run_samblaster=$1; shift
fun_bam_dest=()
- fun_fq1=($(echo $fun_fq1 | tr ',' ' '))
- fun_fq2=($(echo $fun_fq2 | tr ',' ' '))
- fun_rgs=($(echo $fun_rgs | tr ',' ' '))
+ IFS=',' read -r -a fun_fq1 <<< "$fun_fq1"
+ IFS=',' read -r -a fun_fq2 <<< "$fun_fq2"
+ IFS=',' read -r -a fun_rgs <<< "$fun_rgs"
if [[ -z "$bwt_max_mem" ]]; then
mem_kb=$(cat /proc/meminfo | grep "MemTotal" | awk '{print $2}')
@@ -258,25 +273,31 @@ bwa_mem_align()
fq1=${fun_fq1[$i]}
fq2=${fun_fq2[$i]}
readgroup=${fun_rgs[$i]}
- bwa_cmd="$release_dir/bin/bwa mem ${fun_bwa_xargs} -K 10000000 -M -R \"${readgroup}\" -t $nt $ref "
+ bwa_cmd="$release_dir/bin/bwa mem ${fun_bwa_xargs} -R \"${readgroup}\" -t $nt \"$ref\" "
if [[ -n "$STREAM_INPUT" ]]; then
bwa_cmd="$bwa_cmd <(gsutil cp $fq1 -) "
if [[ -n "$fq2" ]]; then
bwa_cmd="$bwa_cmd <(gsutil cp $fq2 -) "
fi
else
- local_fq1=$input_dir/$(basename $fq1)
- transfer $fq1 $local_fq1
- bwa_cmd="$bwa_cmd $local_fq1"
+ local_fq1=$input_dir/$(basename "$fq1")
+ transfer "$fq1" "$local_fq1"
+ bwa_cmd="$bwa_cmd \"$local_fq1\""
if [[ -n "$fq2" ]]; then
- local_fq2=$input_dir/$(basename $fq2)
- transfer $fq2 $local_fq2
- bwa_cmd="$bwa_cmd $local_fq2"
+ local_fq2=$input_dir/$(basename "$fq2")
+ transfer "$fq2" "$local_fq2"
+ bwa_cmd="$bwa_cmd \"$local_fq2\""
fi
fi
local_bam=$work/${fun_base}sorted_bam-${i}.${fun_output_ext}
+ bwa_log=$work/${fun_base}_bwa.log
+ bwa_cmd="$bwa_cmd 2>$bwa_log"
+ if [[ "$fun_run_samblaster" == "true" ]]; then
+ bwa_cmd="$bwa_cmd | samblaster --addMateTags -a"
+ fi
bwa_cmd="$bwa_cmd | $release_dir/bin/sentieon util sort ${fun_util_sort_xargs} --block_size 512M -o $local_bam -t $nt --sam2bam -i -"
run "$bwa_cmd" "BWA-mem and sorting"
+ gsutil cp $bwa_log "$out_bam"
fun_bam_dest+=($local_bam)
done
echo "BWA ended"
@@ -286,13 +307,13 @@ bwa_mem_align()
i=$((i - 1))
fq1=${fun_fq1[$i]}
fq2=${fun_fq2[$i]}
- local_fq1=$input_dir/$(basename $fq1)
- local_fq2=$input_dir/$(basename $fq2)
+ local_fq1=$input_dir/$(basename "$fq1")
+ local_fq2=$input_dir/$(basename "$fq2")
if [[ -f "$local_fq1" ]]; then
- rm $fq1 &
+ rm $local_fq1 &
fi
if [[ -f "$local_fq2" ]]; then
- rm $fq2 &
+ rm $local_fq2 &
fi
done
@@ -328,35 +349,37 @@ run_mark_duplicates() {
fun_bams_dest=$1; shift
fun_dedup_xargs=$1; shift
fun_output_ext=$1; shift
+ fun_two_pass=$1; shift
fun_local_bams=("$@")
if [[ "$fun_dedup" == "nodup" || (-z "$fun_bam_str" || -z "${fun_local_bams[@]}") ]]; then
eval "${fun_bam_str_dest}=\"$fun_bam_str\""
- eval "${fun_bams_dest}=(${fun_local_bams[@]})"
+ eval "${fun_bams_dest}=(\"${fun_local_bams[@]}\")"
else
# LocusCollector
- cmd="$release_dir/bin/sentieon driver --traverse_param=200000/10000 $fun_bam_str -t $nt -r $ref --algo LocusCollector $work/${fun_base}score.txt"
+ cmd="$release_dir/bin/sentieon driver --traverse_param=200000/10000 $fun_bam_str -t $nt -r \"$ref\" --algo LocusCollector $work/${fun_base}score.txt"
if [[ -n $(eval "echo \$$fun_metrics_cmd") ]]; then
eval "cmd+=\" \$$fun_metrics_cmd \""
eval "$fun_metrics_cmd=''"
fi
run "$cmd" "Locus collector"
+ # Dedup pre-pass
+ if [[ "$fun_two_pass" == "true" ]]; then
+ cmd="$release_dir/bin/sentieon driver --traverse_param=200000/10000 $fun_bam_str -t $nt -r \"$ref\" --algo Dedup --score_info $work/${fun_base}score.txt --metrics $metrics_dir/${fun_base}dedup_metrics.txt --output_dup_read_name $work/${fun_base}dedup_qname.txt.gz"
+ run "$cmd" "Dedup pre-pass"
+ fun_dedup_xargs="${fun_dedup_xargs} --dup_read_name $work/${fun_base}dedup_qname.txt.gz"
+ else
+ fun_dedup_xargs="${fun_dedup_xargs} --score_info $work/${fun_base}score.txt --metrics $metrics_dir/${fun_base}dedup_metrics.txt"
+ fi
+
# Dedup
dedup_bam=$work/${fun_base}dedup.${fun_output_ext}
if [[ "$fun_dedup" != "markdup" ]]; then
- fun_dedup_xargs+=" --rmdup "
+ fun_dedup_xargs="${fun_dedup_xargs} --rmdup "
fi
- cmd="$release_dir/bin/sentieon driver -r $ref --traverse_param=200000/10000 $fun_bam_str -t $nt --algo Dedup ${fun_dedup_xargs} --score_info $work/${fun_base}score.txt --metrics $metrics_dir/${fun_base}dedup_metrics.txt $dedup_bam"
+ cmd="$release_dir/bin/sentieon driver -r \"$ref\" --traverse_param=200000/10000 $fun_bam_str -t $nt --algo Dedup ${fun_dedup_xargs} $dedup_bam"
run "$cmd" $fun_dedup
- for bam in ${local_bams[@]}; do
- if [[ -n $bam ]]; then
- rm $bam &
- fi
- if [[ -n ${bam}.bai ]]; then
- rm ${bam}.bai &
- fi
- done
eval "${fun_bam_str_dest}=\" -i $dedup_bam \""
eval "${fun_bams_dest}=(${dedup_bam})"
fi
@@ -382,7 +405,7 @@ run_bqsr()
if [[ -n "$bqsr_sites" && -n "$fun_bam_str" ]]; then
fun_bqsr_table=$work/${fun_base}recal_data.table
fun_bqsr_post=$work/${fun_base}recal_data.table.post
- cmd="$release_dir/bin/sentieon driver $interval -t $nt -r '$ref' $fun_bam_str --algo QualCal $bqsr_sites $fun_bqsr_table"
+ cmd="$release_dir/bin/sentieon driver ${interval:+--interval \"$interval\"} -t $nt -r \"$ref\" $fun_bam_str --algo QualCal $bqsr_sites $fun_bqsr_table"
if [[ -n $(eval "echo \$$fun_metrics_cmd") ]]; then
eval "cmd+=\" \$$fun_metrics_cmd \""
eval "$fun_metrics_cmd=''"
@@ -395,11 +418,11 @@ run_bqsr()
fi
fi
- eval "$fun_bqsr2=\"$fun_bqsr_cmd2\""
- eval "$fun_bqsr3=\"$fun_bqsr_cmd3\""
- eval "$fun_bqsr4=\"$fun_bqsr_cmd4\""
- eval "$fun_table=\"$fun_bqsr_table\""
- eval "$fun_plot=\"$plot\""
+ eval "$fun_bqsr2='$fun_bqsr_cmd2'"
+ eval "$fun_bqsr3='$fun_bqsr_cmd3'"
+ eval "$fun_bqsr4='$fun_bqsr_cmd4'"
+ eval "$fun_table='$fun_bqsr_table'"
+ eval "$fun_plot='$plot'"
}
run_bqsr_post()
@@ -417,11 +440,11 @@ run_bqsr_post()
eval "fun_bqsr_cmd4=\$$fun_bqsr4"
if [[ -n "$fun_bqsr_cmd2" && -n "$fun_bam_str" && -f "$fun_table" ]]; then
- cmd="$release_dir/bin/sentieon driver $interval -t $nt -r $ref $fun_bam_str -q $fun_table $fun_bqsr_cmd2"
+ cmd="$release_dir/bin/sentieon driver ${interval:+--interval \"$interval\"} -t $nt -r $ref $fun_bam_str -q $fun_table $fun_bqsr_cmd2"
run "$cmd" "BQSR post"
run "$fun_bqsr_cmd3" "BQSR CSV"
run "$fun_bqsr_cmd4" "BQSR plot"
- gsutil cp $fun_plot $out_metrics &
+ gsutil cp $fun_plot "$out_metrics" &
eval "$fun_upload_pid=$1 "
fi
@@ -435,6 +458,6 @@ generate_nondecoy_bed()
fun_reference_fai=$1; shift
fun_output_dest=$1; shift
- grep -v "hs37d5\|chrEBV\|hs38d1\|decoy" $fun_reference_fai | awk 'BEGIN{OFS="\t"} {print $1,0,$2}' > $fun_output_dest
+ grep -v "hs37d5\|chrEBV\|hs38d1\|decoy" "$fun_reference_fai" | awk 'BEGIN{OFS="\t"} {print $1,0,$2}' > "$fun_output_dest"
}
diff --git a/pipeline_scripts/gc_germline.sh b/pipeline_scripts/gc_germline.sh
index 3998dac..3d06fef 100644
--- a/pipeline_scripts/gc_germline.sh
+++ b/pipeline_scripts/gc_germline.sh
@@ -4,35 +4,34 @@ set -xveo pipefail
set +H
BASEDIR=$(dirname "$0")
-version="201808.01"
-release_dir="/opt/sentieon/sentieon-genomics-${version}/"
scratch=/mnt/work
nt=$(nproc)
source $BASEDIR/gc_functions.sh
-export LD_PRELOAD=/opt/sentieon/sentieon-genomics-${version}/lib/libjemalloc.so
-export MALLOC_CONF=lg_dirty_mult:-1
-
# Set "None" variables to an empty string
environmental_variables=(FQ1 FQ2 BAM OUTPUT_BUCKET REF READGROUP DEDUP \
BQSR_SITES DBSNP INTERVAL INTERVAL_FILE NO_METRICS NO_BAM_OUTPUT \
NO_HAPLOTYPER GVCF_OUTPUT STREAM_INPUT PIPELINE OUTPUT_CRAM_FORMAT \
- SENTIEON_KEY RECALIBRATED_OUTPUT)
+ SENTIEON_KEY RECALIBRATED_OUTPUT EMAIL SENTIEON_VERSION CALLING_ARGS \
+ DNASCOPE_MODEL CALLING_ALGO)
unset_none_variables ${environmental_variables[@]}
OUTPUT_CRAM_FORMAT="" # Not yet supported
readonly FQ1 FQ2 BAM OUTPUT_BUCKET REF READGROUP DEDUP BQSR_SITES DBSNP \
INTERVAL INTERVAL_FILE NO_METRICS NO_BAM_OUTPUT NO_HAPLOTYPER GVCF_OUTPUT \
- STREAM_INPUT PIPELINE OUTPUT_CRAM_FORMAT SENTIEON_KEY RECALIBRATED_OUTPUT
+ STREAM_INPUT PIPELINE OUTPUT_CRAM_FORMAT SENTIEON_KEY RECALIBRATED_OUTPUT \
+ EMAIL SENTIEON_VERSION CALLING_ARGS DNASCOPE_MODEL CALLING_ALGO
-# Some basic error handling #
+release_dir="/opt/sentieon/sentieon-genomics-${SENTIEON_VERSION}/"
+
+# Basic error handling #
if [[ -n "$FQ1" && -n "$BAM" ]]; then
echo "Please supply either fastq or BAM files"
exit 1
fi
if [[ -n "$INTERVAL" && -n "$INTERVAL_FILE" ]]; then
- echo "Please supply either an INTERVAL or INTERVAL_FILE"
+ echo "Please supply either an INTERVAL or INTERVAL_FILE, not both"
exit 1
fi
@@ -45,28 +44,31 @@ fi
# 0. Setup
# **********************************
gc_setup
+export LD_PRELOAD=${release_dir}/lib/libjemalloc.so
+export MALLOC_CONF=lg_dirty_mult:-1
## Download input files
if [[ -n "$BAM" ]]; then
- download_bams $BAM local_bams $input_dir
+ download_bams "$BAM" local_bams $input_dir
else
local_bams=()
fi
download_intervals
download_reference
+if [[ $CALLING_ALGO == "DNAscope" && -n "$DNASCOPE_MODEL" ]]; then
+ curl -L -o ${input_dir}/dnascope.model "$DNASCOPE_MODEL"
+fi
## Handle the sites files
-gs_bqsr_sites=($(echo $BQSR_SITES | tr ',' ' '))
+IFS=',' read -r -a gs_bqsr_sites <<< "$BQSR_SITES"
transfer_all_sites $bqsr_dir "${gs_bqsr_sites[@]}"
-local_bqsr_sites="${local_sites[@]}"
+local_bqsr_sites=("${local_sites[@]}")
bqsr_sites="$local_str"
-REALIGN_SITES=
-
if [[ -n "$DBSNP" ]]; then
- transfer_all_sites $dbsnp_dir $DBSNP
- dbsnp=${local_sites[0]}
+ transfer_all_sites $dbsnp_dir "$DBSNP"
+ dbsnp="${local_sites[0]}"
fi
# ******************************************
@@ -74,15 +76,12 @@ fi
# ******************************************
output_ext="bam"
if [[ -n $FQ1 ]]; then
-# if [[ -n "$NO_BAM_OUTPUT" || "$DEDUP" != "nodup" ]]; then
-# util_sort_xargs="${util_sort_xargs} --bam_compression 1 "
-# fi
- bwa_mem_align "" $FQ1 $FQ2 $READGROUP local_bams $output_ext "$bwa_xargs" "$util_sort_xargs"
+ bwa_mem_align "" "$FQ1" "$FQ2" "$READGROUP" local_bams $output_ext "-M -K 10000000" "$util_sort_xargs" "false"
fi
local_bams_str=""
-for bam in ${local_bams[@]}; do
- local_bams_str+=" -i $bam "
+for bam in "${local_bams[@]}"; do
+ local_bams_str+=" -i \"$bam\" "
done
# ******************************************
@@ -99,43 +98,40 @@ fi
# 3. Remove duplicates
# ******************************************
output_ext="bam"
-#if [[ -n "$NO_BAM_OUTPUT" ]]; then
-# dedup_xargs=" --bam_compression 1 "
-#fi
-run_mark_duplicates "" $DEDUP metrics_cmd1 "$local_bams_str" dedup_bam_str dedup_bams "$dedup_xargs" $output_ext ${local_bams[@]}
+run_mark_duplicates "" "$DEDUP" metrics_cmd1 "$local_bams_str" dedup_bam_str dedup_bams "$dedup_xargs" $output_ext "false" "${local_bams[@]}"
if [[ "$DEDUP" != "nodup" ]]; then
if [[ -z "$NO_METRICS" ]]; then
- (gsutil cp $metrics_dir/dedup_metrics.txt $out_metrics &&
+ (gsutil cp $metrics_dir/dedup_metrics.txt "$out_metrics" &&
rm $metrics_dir/dedup_metrics.txt) &
upload_dedup_pid=$!
else
rm $metrics_dir/dedup_metrics.txt &
fi
- for bam in ${local_bams[@]}; do
- if [[ -f $bam ]]; then
- rm $bam &
+ for bam in "${local_bams[@]}"; do
+ if [[ -f "$bam" ]]; then
+ rm "$bam" &
fi
- if [[ -f ${bam}.bai ]]; then
- rm ${bam}.bai &
+ if [[ -f "${bam}".bai ]]; then
+ rm "${bam}".bai &
fi
- if [[ -f ${bam}.crai ]]; then
- rm ${bam}.crai &
+ if [[ -f "${bam}".crai ]]; then
+ rm "${bam}".crai &
fi
done
fi
if [[ -z "$NO_BAM_OUTPUT" && (-z "$bqsr_sites" || -z "$RECALIBRATED_OUTPUT" ) ]]; then
upload_list=""
- for bam in ${dedup_bams[@]}; do
- upload_list+=" $bam "
+ for bam in "${dedup_bams[@]}"; do
+ upload_list+=" \"$bam\" "
if [[ -f "${bam}.bai" ]]; then
- upload_list+=" ${bam}.bai "
+ upload_list+=" \"${bam}.bai\" "
elif [[ -f "${bam}.crai" ]]; then
- upload_list+=" ${bam}.crai "
+ upload_list+=" \"${bam}.crai\" "
fi
done
- gsutil cp $upload_list $out_bam &
+ eval gsutil cp $upload_list "$out_bam" &
upload_deduped_pid=$!
fi
@@ -150,12 +146,12 @@ if [[ -n "$bqsr_sites" && -z "$NO_BAM_OUTPUT" && -n "$RECALIBRATED_OUTPUT" ]]; t
outrecal=$work/recalibrated.bam
cmd="$release_dir/bin/sentieon driver $dedup_bam_str -q $bqsr_table --algo ReadWriter $outrecal"
(run "$cmd" "ReadWriter";
- gsutil cp $outrecal ${outrecal}.bai $out_bam) &
+ gsutil cp $outrecal ${outrecal}.bai "$out_bam") &
upload_recal_pid=$!
fi
if [[ -n "$bqsr_sites" && -z "$NO_BAM_OUTPUT" && -z "$RECALIBRATED_OUTPUT" ]]; then
- gsutil cp $bqsr_table $out_bam &
+ gsutil cp $bqsr_table "$out_bam" &
upload_bqsr_pid=$!
fi
@@ -167,17 +163,22 @@ upload_metrics metrics_cmd1 metrics_cmd2 upload_metrics_pid ${metrics_files[@]}
# ******************************************
## Generate a non-decoy BED file
-generate_nondecoy_bed ${ref}.fai ${ref}_nondecoy.bed
+generate_nondecoy_bed "${ref}".fai "${ref}"_nondecoy.bed
+call_interval="$interval"
+call_interval=${call_interval:-"${ref}"_nondecoy.bed}
outgvcf=$work/hc.g.vcf.gz
outvcf=$work/hc.vcf.gz
algo=Haplotyper
extra_vcf_args=""
-extra_gvcf_args=""
-if [[ $PIPELINE == "DNAscope" ]]; then
+if [[ $CALLING_ALGO == "DNAscope" ]]; then
algo="DNAscope"
- extra_vcf_args="--var_type snp,indel,bnd"
+ if [[ -n "$DNASCOPE_MODEL" ]]; then
+ extra_vcf_args="--model ${input_dir}/dnascope.model"
+ else
+ extra_vcf_args="--var_type snp,indel,bnd"
+ fi
outvcf=$work/dnascope.vcf.gz
outgvcf=$work/dnascope.g.vcf.gz
tmpvcf=$work/tmp.vcf.gz
@@ -185,13 +186,12 @@ fi
if [[ -z $NO_HAPLOTYPER ]]; then
if [[ -n "$GVCF_OUTPUT" ]]; then
- cmd="$release_dir/bin/sentieon driver ${interval:- --interval ${ref}_nondecoy.bed} -t $nt -r '$ref' $dedup_bam_str ${bqsr_table:+-q $bqsr_table} --algo $algo $extra_gvcf_args ${dbsnp:+-d $dbsnp} --emit_mode gvcf ${outgvcf}"
+ cmd="$release_dir/bin/sentieon driver --interval \"$call_interval\" -t $nt -r \"$ref\" $dedup_bam_str ${bqsr_table:+-q $bqsr_table} --algo $algo $CALLING_ARGS ${dbsnp:+-d \"$dbsnp\"} --emit_mode gvcf ${outgvcf}"
outfile=$outgvcf
else
- cmd="$release_dir/bin/sentieon driver ${interval:- --interval ${ref}_nondecoy.bed} -t $nt -r '$ref' $dedup_bam_str ${bqsr_table:+-q $bqsr_table} --algo $algo $extra_vcf_args ${dbsnp:+-d $dbsnp} ${outvcf}"
+ cmd="$release_dir/bin/sentieon driver --interval \"$call_interval\" -t $nt -r \"$ref\" $dedup_bam_str ${bqsr_table:+-q $bqsr_table} --algo $algo $CALLING_ARGS $extra_vcf_args ${dbsnp:+-d \"$dbsnp\"} ${outvcf}"
outfile=$outvcf
fi
-
if [[ -n $metrics_cmd1 ]]; then
cmd+=" $metrics_cmd1"
metrics_cmd1=
@@ -200,20 +200,30 @@ if [[ -z $NO_HAPLOTYPER ]]; then
cmd+=" $bqsr_cmd2"
bqsr_cmd2=
fi
+ run "$cmd" "Variant calling"
- run "$cmd" "Haplotyper variant calling"
- if [[ $PIPELIINE == "DNAscope" ]]; then
+ # DNAscope SV calling
+ if [[ $CALLING_ALGO == "DNAscope" && -z "$GVCF_OUTPUT" && -z "$DNASCOPE_MODEL" ]]; then
mv $outvcf $tmpvcf
mv ${outvcf}.tbi ${tmpvcf}.tbi
- cmd="$release_dir/bin/sentieon driver -t $nt -r '$ref' --algo SVSolver -v $tmpvcf $outvcf"
+ cmd="$release_dir/bin/sentieon driver -t $nt -r \"$ref\" --algo SVSolver -v $tmpvcf $outvcf"
run "$cmd" "SVSolver"
fi
- gsutil cp $outfile ${outfile}.tbi $out_variants &
+
+ # DNAscope model apply
+ if [[ $CALLING_ALGO == "DNAscope" && -z "$GVCF_OUTPUT" && -n "$DNASCOPE_MODEL" ]]; then
+ mv $outvcf $tmpvcf
+ mv ${outvcf}.tbi ${tmpvcf}.tbi
+ cmd="$release_dir/bin/sentieon driver -t $nt -r \"$ref\" --algo DNAModelApply --model ${input_dir}/dnascope.model -v $tmpvcf $outvcf"
+ run "$cmd" "DNAscope model apply"
+ fi
+
+ gsutil cp $outfile ${outfile}.tbi "$out_variants" &
upload_vcf_pid=$!
fi
if [[ -n $metrics_cmd1 ]]; then
- cmd="$release_dir/bin/sentieon driver $interval -t $nt -r '$ref' $dedup_bam_str ${bqsr_table:+-q $bqsr_table}"
+ cmd="$release_dir/bin/sentieon driver ${interval:+--interval \"$interval\"} -t $nt -r \"$ref\" $dedup_bam_str ${bqsr_table:+-q $bqsr_table}"
cmd+=" $metrics_cmd1"
run "$cmd" "Metrics collection"
fi
@@ -222,7 +232,7 @@ upload_metrics metrics_cmd1 metrics_cmd2 upload_metrics_pid ${metrics_files[@]}
if [[ -n $bqsr_cmd2 ]]; then
- cmd="$release_dir/bin/sentieon driver $interval -t $nt -r '$ref' $dedup_bam_str -q $bqsr_table $bqsr_cmd2"
+ cmd="$release_dir/bin/sentieon driver ${interval:+--interval \"$interval\"} -t $nt -r \"$ref\" $dedup_bam_str -q $bqsr_table $bqsr_cmd2"
bqsr_cmd2=
run "$cmd" "BQSR Post"
fi
@@ -230,12 +240,10 @@ fi
if [[ -n $bqsr_cmd3 ]]; then
run "$bqsr_cmd3" "BQSR CSV"
run "$bqsr_cmd4" "BQSR plot"
- gsutil cp $plot $out_metrics &
+ gsutil cp $plot "$out_metrics" &
upload_bqsr_metrics_pid=$!
fi
-kill $credentials_pid
-
# Wait for all running uploads to finish
set +e
if [[ -n $upload_recal_pid ]]; then
diff --git a/pipeline_scripts/gc_somatic.sh b/pipeline_scripts/gc_somatic.sh
index ea0564b..657b896 100644
--- a/pipeline_scripts/gc_somatic.sh
+++ b/pipeline_scripts/gc_somatic.sh
@@ -4,20 +4,16 @@ set -xveo pipefail
set +H
BASEDIR=$(dirname "$0")
-version="201808.01"
-release_dir="/opt/sentieon/sentieon-genomics-${version}/"
scratch=/mnt/work
nt=$(nproc)
source $BASEDIR/gc_functions.sh
-export LD_PRELOAD=/opt/sentieon/sentieon-genomics-${version}/lib/libjemalloc.so
-export MALLOC_CONF=lg_dirty_mult:-1
-
# Set "None" varibles to an empty string
environmental_variables=(FQ1 FQ2 TUMOR_FQ1 TUMOR_FQ2 BAM TUMOR_BAM \
OUTPUT_BUCKET REF READGROUP TUMOR_READGROUP DEDUP BQSR_SITES DBSNP \
INTERVAL INTERVAL_FILE NO_METRICS NO_BAM_OUTPUT NO_VCF RUN_TNSNV \
- STREAM_INPUT PIPELINE REALIGN_SITES OUTPUT_CRAM_FORMAT SENTIEON_KEY)
+ STREAM_INPUT PIPELINE REALIGN_SITES OUTPUT_CRAM_FORMAT SENTIEON_KEY \
+ EMAIL SENTIEON_VERSION CALLING_ARGS CALLING_ALGO)
unset_none_variables ${environmental_variables[@]}
OUTPUT_CRAM_FORMAT="" # Not yet supported
@@ -60,22 +56,27 @@ fi
readonly FQ1 FQ2 TUMOR_FQ1 TUMOR_FQ2 BAM TUMOR_BAM \
OUTPUT_BUCKET REF READGROUP TUMOR_READGROUP DEDUP BQSR_SITES DBSNP \
INTERVAL INTERVAL_FILE NO_METRICS NO_BAM_OUTPUT NO_VCF RUN_TNSNV \
- STREAM_INPUT PIPELINE REALIGN_SITES OUTPUT_CRAM_FORMAT
+ STREAM_INPUT PIPELINE REALIGN_SITES OUTPUT_CRAM_FORMAT EMAIL \
+ SENTIEON_VERSION CALLING_ARGS CALLING_ALGO
+
+release_dir="/opt/sentieon/sentieon-genomics-${SENTIEON_VERSION}/"
# *****************************
# 0. Setup
# *****************************
gc_setup
+export LD_PRELOAD=${release_dir}/lib/libjemalloc.so
+export MALLOC_CONF=lg_dirty_mult:-1
## Download input files
if [[ -n "$BAM" ]]; then
- download_bams $BAM local_bams $input_dir
+ download_bams "$BAM" local_bams $input_dir
else
local_bams=()
fi
if [[ -n "$TUMOR_BAM" ]]; then
- download_bams $TUMOR_BAM tumor_bams $input_dir
+ download_bams "$TUMOR_BAM" tumor_bams $input_dir
else
tumor_bams=()
fi
@@ -84,45 +85,41 @@ download_intervals
download_reference
## Handle the sites files
-gs_bqsr_sites=($(echo $BQSR_SITES | tr ',' ' '))
+IFS=',' read -r -a gs_bqsr_sites <<< "$BQSR_SITES"
transfer_all_sites $bqsr_dir "${gs_bqsr_sites[@]}"
-local_bqsr_sites="${local_sites[@]}"
+local_bqsr_sites=("${local_sites[@]}")
bqsr_sites="$local_str"
-gs_realign_sites=($(echo $REALIGN_SITES | tr ',' ' '))
+IFS=',' read -r -a gs_realign_sites <<< "$REALIGN_SITES"
transfer_all_sites $realign_dir "${gs_realign_sites[@]}"
-local_realign_sites="${local_sites[@]}"
+local_realign_sites=("${local_sites[@]}")
realign_sites="$local_str"
if [[ -n "$DBSNP" ]]; then
- transfer_all_sites $dbsnp_dir $DBSNP
- dbsnp=${local_sites[0]}
+ transfer_all_sites $dbsnp_dir "$DBSNP"
+ dbsnp="${local_sites[0]}"
fi
# ******************************************
# 1. Mapping reads with BWA-MEM, sorting
# ******************************************
output_ext="bam"
-#if [[ -n "$NO_BAM_OUTPUT" || "$DEDUP" != "nodup" || -n "$REALIGN_SITES" ]]; then
-# util_sort_xargs="${util_sort_xargs} --bam_compression 1 "
-#fi
-
if [[ -n "$FQ1" ]]; then
- bwa_mem_align "normal_" $FQ1 $FQ2 $READGROUP local_bams $output_ext "$bwa_xargs" "$util_sort_xargs"
+ bwa_mem_align "normal_" "$FQ1" "$FQ2" "$READGROUP" local_bams $output_ext "-M -K 10000000" "$util_sort_xargs" "false"
fi
local_bams_str=""
-for bam in ${local_bams[@]}; do
- local_bams_str+=" -i $bam "
+for bam in "${local_bams[@]}"; do
+ local_bams_str+=" -i \"$bam\" "
done
if [[ -n "$TUMOR_FQ1" ]]; then
- bwa_mem_align "tumor_" $TUMOR_FQ1 $TUMOR_FQ2 $TUMOR_READGROUP tumor_bams $output_ext "$bwa_xargs" "$util_sort_xargs"
+ bwa_mem_align "tumor_" "$TUMOR_FQ1" "$TUMOR_FQ2" "$TUMOR_READGROUP" tumor_bams $output_ext "-M -K 10000000" "$util_sort_xargs" "false"
fi
tumor_bams_str=""
-for bam in ${tumor_bams[@]}; do
- tumor_bams_str+=" -i $bam "
+for bam in "${tumor_bams[@]}"; do
+ tumor_bams_str+=" -i \"$bam\" "
done
# Detect the tumor and normal sample names
@@ -152,12 +149,9 @@ fi
# 3. Remove duplicates
# ******************************************
output_ext="bam"
-#if [[ -n "$NO_BAM_OUTPUT" || -n "$REALIGN_SITES" ]]; then
-# dedup_xargs=" --bam_compression 1 "
-#fi
-run_mark_duplicates "normal_" $DEDUP metrics_cmd1 "$local_bams_str" dedup_bam_str dedup_bams "$dedup_xargs" $output_ext ${local_bams[@]}
-run_mark_duplicates "tumor_" $DEDUP tumor_metrics_cmd1 "$tumor_bams_str" tumor_dedup_bam_str tumor_dedup_bams "$dedup_xargs" $output_ext ${tumor_bams[@]}
+run_mark_duplicates "normal_" "$DEDUP" metrics_cmd1 "$local_bams_str" dedup_bam_str dedup_bams "$dedup_xargs" $output_ext "false" "${local_bams[@]}"
+run_mark_duplicates "tumor_" "$DEDUP" tumor_metrics_cmd1 "$tumor_bams_str" tumor_dedup_bam_str tumor_dedup_bams "$dedup_xargs" $output_ext "false" "${tumor_bams[@]}"
if [[ "$DEDUP" != "nodup" ]]; then
if [[ -z "$NO_METRICS" ]]; then
@@ -165,36 +159,36 @@ if [[ "$DEDUP" != "nodup" ]]; then
if [[ -n "$dedup_bam_str" ]]; then
to_upload+=" $metrics_dir/normal_dedup_metrics.txt"
fi
- (gsutil cp $to_upload $out_metrics &&
+ (gsutil cp $to_upload "$out_metrics" &&
rm $metrics_dir/*_dedup_metrics.txt) &
upload_dedup_pid=$!
else
rm $metrics_dir/*_dedup_metrics.txt &
fi
- for bam in ${local_bams[@]} ${tumor_bams[@]}; do
- if [[ -f $bam ]]; then
- rm $bam &
+ for bam in "${local_bams[@]}" "${tumor_bams[@]}"; do
+ if [[ -f "$bam" ]]; then
+ rm "$bam" &
fi
- if [[ -f ${bam}.bai ]]; then
- rm ${bam}.bai &
+ if [[ -f "${bam}".bai ]]; then
+ rm "${bam}".bai &
fi
- if [[ -f ${bam}.crai ]]; then
- rm ${bam}.crai &
+ if [[ -f "${bam}".crai ]]; then
+ rm "${bam}".crai &
fi
done
fi
if [[ -z "$NO_BAM_OUTPUT" && -z "$REALIGN_SITES" ]]; then
upload_list=""
- for bam in ${dedup_bams[@]} ${tumor_dedup_bams[@]}; do
- upload_list+=" $bam "
+ for bam in "${dedup_bams[@]}" "${tumor_dedup_bams[@]}"; do
+ upload_list+=" \"$bam\" "
if [[ -f "${bam}.bai" ]]; then
- upload_list+=" ${bam}.bai "
+ upload_list+=" \"${bam}.bai\" "
elif [[ -f "${bam}.crai" ]]; then
- upload_list+=" ${bam}.crai "
+ upload_list+=" \"${bam}.crai\" "
fi
done
- gsutil cp $upload_list $out_bam &
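+ # eval re-parses upload_list so its embedded quotes protect paths containing spaces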
+ eval gsutil cp $upload_list "$out_bam" &
upload_deduped_pid=$!
fi
@@ -205,30 +199,27 @@ upload_metrics tumor_metrics_cmd1 tumor_metrics_cmd2 tumor_upload_metrics_pid ${
# 4. Indel Realignment
# ******************************************
output_ext="bam"
-#if [[ -n "$NO_BAM_OUTPUT" || -n "$dedup_bam_str" ]]; then
-# realign_xargs=" --bam_compression 1 "
-#fi
if [[ -n "$REALIGN_SITES" && -n "$RUN_TNSNV" ]]; then
realigned_bam=$work/normal_realigned.${output_ext}
tumor_realigned_bam=$work/tumor_realigned.${output_ext}
if [[ -n "$dedup_bam_str" ]]; then
- cmd="$release_dir/bin/sentieon driver $dedup_bam_str -t $nt -r $ref --algo Realigner $realign_xargs $interval_list $realign_sites $realigned_bam"
+ cmd="$release_dir/bin/sentieon driver $dedup_bam_str -t $nt -r \"$ref\" --algo Realigner ${interval_list:+--interval_list \"$interval_list\"} $realign_sites $realigned_bam"
run "$cmd" "Indel Realign - Normal"
fi
- cmd="$release_dir/bin/sentieon driver $tumor_dedup_bam_str -t $nt -r $ref --algo Realigner $realign_xargs $interval_list $realign_sites $tumor_realigned_bam"
+ cmd="$release_dir/bin/sentieon driver $tumor_dedup_bam_str -t $nt -r \"$ref\" --algo Realigner ${interval_list:+--interval_list \"$interval_list\"} $realign_sites $tumor_realigned_bam"
run "$cmd" "Indel Realign - Tumor"
# Cleanup
- for bam in ${dedup_bams[@]} ${tumor_dedup_bams[@]}; do
+ for bam in "${dedup_bams[@]}" "${tumor_dedup_bams[@]}"; do
if [[ -f "$bam" ]]; then
- rm $bam &
+ rm "$bam" &
fi
if [[ -f "${bam}.bai" ]]; then
- rm ${bam}.bai &
+ rm "${bam}".bai &
fi
if [[ -f "${bam}.crai" ]]; then
- rm ${bam}.crai &
+ rm "${bam}".crai &
fi
done
@@ -242,9 +233,9 @@ if [[ -n "$REALIGN_SITES" && -n "$RUN_TNSNV" ]]; then
tumor_realigned_bams=($tumor_realigned_bam)
tumor_realigned_bam_str=" -i $tumor_realigned_bam "
else
- realigned_bams=${dedup_bams[@]}
+ realigned_bams=("${dedup_bams[@]}")
realigned_bam_str=${dedup_bam_str}
- tumor_realigned_bams=${tumor_dedup_bams[@]}
+ tumor_realigned_bams=("${tumor_dedup_bams[@]}")
tumor_realigned_bam_str=${tumor_dedup_bam_str}
fi
@@ -263,7 +254,7 @@ if [[ -n "$realigned_bam_str" ]]; then
fi
if [[ -n "$bqsr_sites" && -z "$NO_BAM_OUTPUT" ]]; then
- gsutil cp $bqsr_table $tumor_bqsr_table $out_bam &
+ gsutil cp $bqsr_table $tumor_bqsr_table "$out_bam" &
upload_bqsr_pid=$!
fi
@@ -274,30 +265,27 @@ upload_metrics tumor_metrics_cmd1 tumor_metrics_cmd2 tumor_upload_metrics_pid ${
# 6. Indel corealignment
# ******************************************
output_ext="bam"
-#if [[ -n "$NO_BAM_OUTPUT" ]]; then
-# corealign_xargs=" --bam_compression 1 "
-#fi
if [[ -n "$REALIGN_SITES" && -n "$RUN_TNSNV" && -n "$realigned_bam_str" ]]; then
corealigned_bam=$work/corealigned.${output_ext}
- cmd="$release_dir/bin/sentieon driver $realigned_bam_str $tumor_realigned_bam_str -t $nt -r $ref --algo Realigner $corealign_xargs $interval_list $realign_sites $corealigned_bam"
+ cmd="$release_dir/bin/sentieon driver $realigned_bam_str $tumor_realigned_bam_str -t $nt -r \"$ref\" --algo Realigner ${interval_list:+--interval_list \"$interval_list\"} $realign_sites $corealigned_bam"
run "$cmd" "Indel co-realignment"
# Cleanup
- for bam in ${realigned_bams[@]} ${tumor_realigned_bams[@]}; do
- if [[ -f $bam ]]; then
- rm $bam &
+ for bam in "${realigned_bams[@]}" "${tumor_realigned_bams[@]}"; do
+ if [[ -f "$bam" ]]; then
+ rm "$bam" &
fi
- if [[ -f ${bam}.bai ]]; then
- rm ${bam}.bai &
+ if [[ -f "${bam}".bai ]]; then
+ rm "${bam}".bai &
fi
- if [[ -f ${bam}.crai ]]; then
- rm ${bam}.crai &
+ if [[ -f "${bam}".crai ]]; then
+ rm "${bam}".crai &
fi
done
- for site_file in ${local_realign_sites[@]}; do
- if [[ -f $site_file ]]; then
- rm $site_file &
+ for site_file in "${local_realign_sites[@]}"; do
+ if [[ -f "$site_file" ]]; then
+ rm "$site_file" &
fi
done
@@ -308,21 +296,21 @@ if [[ -n "$REALIGN_SITES" && -n "$RUN_TNSNV" && -n "$realigned_bam_str" ]]; then
elif [[ -f "${corealigned_bam}.crai" ]]; then
upload_list+=" ${corealigned_bam}.crai "
fi
- gsutil cp $upload_list $out_bam &
+ gsutil cp $upload_list "$out_bam" &
upload_corealigned_pid=$!
corealigned_bam_str=" -i $corealigned_bam "
elif [[ -n "$REALIGN_SITES" && -n "$RUN_TNSNV" ]]; then
upload_list=""
- for bam in ${tumor_realigned_bams[@]}; do
- upload_list+=" $bam "
+ for bam in "${tumor_realigned_bams[@]}"; do
+ upload_list+=" \"$bam\" "
if [[ -f "${bam}.bai" ]]; then
- upload_list+=" ${bam}.bai "
+ upload_list+=" \"${bam}.bai\" "
elif [[ -f "${bam}.crai" ]]; then
- upload_list+=" ${bam}.crai "
+ upload_list+=" \"${bam}.crai\" "
fi
done
- gsutil cp $upload_list $out_bam &
+ eval gsutil cp $upload_list "$out_bam" &
upload_corealigned_pid=$!
corealigned_bam_str=" $tumor_realigned_bam_str "
@@ -336,22 +324,16 @@ corealigned_bqsr_str=" ${tumor_bqsr_table:+-q $tumor_bqsr_table} ${bqsr_table:+-
# *******************************************
## Generate a non-decoy BED file
-generate_nondecoy_bed ${ref}.fai ${ref}_nondecoy.bed
+generate_nondecoy_bed "${ref}".fai "${ref}"_nondecoy.bed
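+# Call over the supplied interval if set; otherwise fall back to the non-decoy regions of the reference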
+call_interval="$interval"
+call_interval=${call_interval:-"${ref}"_nondecoy.bed}
if [[ -z "$NO_VCF" ]]; then
extra_calling_args=""
- if [[ "$PIPELINE" == "TNscope" ]]; then
- algo="TNscope"
- vcf=$work/tnscope.vcf.gz
- elif [[ -n "$RUN_TNSNV" ]]; then
- algo="TNsnv"
- vcf=$work/tnsnv.vcf.gz
- else
- algo="TNhaplotyper"
- vcf=$work/tnhaplotyper.vcf.gz
- fi
+ algo="$CALLING_ALGO"
+ vcf="$work"/"$CALLING_ALGO".vcf.gz
- cmd="$release_dir/bin/sentieon driver ${interval:- --interval ${ref}_nondecoy.bed} $corealigned_bam_str $corealigned_bqsr_str -t $nt -r $ref --algo $algo ${normal_sample:+--normal_sample $normal_sample} --tumor_sample $tumor_sample ${dbsnp:+--dbsnp $dbsnp} $vcf"
+ cmd="$release_dir/bin/sentieon driver --interval \"$call_interval\" $corealigned_bam_str $corealigned_bqsr_str -t $nt -r \"$ref\" --algo $algo $CALLING_ARGS ${normal_sample:+--normal_sample $normal_sample} --tumor_sample $tumor_sample ${dbsnp:+--dbsnp \"$dbsnp\"} $vcf"
if [[ -n $tumor_metrics_cmd1 ]]; then
cmd+=" $metrics_cmd1 "
@@ -368,12 +350,12 @@ if [[ -z "$NO_VCF" ]]; then
fi
run "$cmd" "Variant calling"
- gsutil cp $vcf ${vcf}.tbi $out_variants &
+ gsutil cp $vcf ${vcf}.tbi "$out_variants" &
upload_vcf_pid=$!
fi
if [[ -n $tumor_metrics_cmd1 ]]; then
- cmd="$release_dir/bin/sentieon driver ${interval:- --interval ${ref}_nondecoy.bed} -t $nt -r '$ref' $corealigned_bam_str $corealigned_bqsr_str"
+ cmd="$release_dir/bin/sentieon driver --interval \"$call_interval\" -t $nt -r \"$ref\" $corealigned_bam_str $corealigned_bqsr_str"
cmd+=" $metrics_cmd1 "
cmd+=" $tumor_metrics_cmd1 "
metrics_cmd1=
@@ -385,7 +367,7 @@ upload_metrics metrics_cmd1 metrics_cmd2 upload_metrics_pid ${metrics_files[@]}
upload_metrics tumor_metrics_cmd1 tumor_metrics_cmd2 tumor_upload_metrics_pid ${tumor_metrics_files[@]}
if [[ -n $bqsr_cmd2 ]]; then
- cmd="$release_dir/bin/sentieon driver ${interval:- --interval ${ref}_nondecoy.bed} -t $nt -r $ref $corealigned_bam_str $corealigned_bqsr_str $bqsr_cmd2 $tumor_bqsr_cmd2"
+ cmd="$release_dir/bin/sentieon driver --interval \"$call_interval\" -t $nt -r \"$ref\" $corealigned_bam_str $corealigned_bqsr_str $bqsr_cmd2 $tumor_bqsr_cmd2"
bqsr_cmd2=
tumor_bqsr_cmd2=
run "$cmd" "BQSR Post"
@@ -400,12 +382,10 @@ if [[ -n $tumor_bqsr_cmd3 ]]; then
fi
run "$tumor_bqsr_cmd3" "Tumor BQSR CSV"
run "$tumor_bqsr_cmd4" "Tumor BQSR plot"
- gsutil cp $upload_list $tumor_bqsr_plot $out_metrics &
+ gsutil cp $upload_list $tumor_bqsr_plot "$out_metrics" &
upload_bqsr_metrics_pid=$!
fi
-kill $credentials_pid
-
# Wait for all running uploads to finish
set +e
if [[ -n $upload_normal_bqsr_plot_pid ]]; then
@@ -438,4 +418,4 @@ fi
if [[ -n $upload_bqsr_metrics_pid ]]; then
wait $upload_bqsr_metrics_pid
fi
-
+exit 0
diff --git a/pipeline_scripts/gen_credentials.py b/pipeline_scripts/gen_credentials.py
index 0351cf2..2e753d2 100644
--- a/pipeline_scripts/gen_credentials.py
+++ b/pipeline_scripts/gen_credentials.py
@@ -6,23 +6,40 @@
import time
import argparse
import json
+import os
audience = "https://sentieon.com"
headers = {'Metadata-Flavor': 'Google'}
request_format = "full"
-metadata_url = "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/identity?audience={}&format={}"
-project_url = "http://metadata.google.internal/computeMetadata/v1/project/project-id?format={}"
+metadata_url = ("http://metadata.google.internal/computeMetadata/v1/instance/"
+ "service-accounts/default/identity?audience={}&format={}")
+project_url = ("http://metadata.google.internal/computeMetadata/v1/project/"
+ "project-id?format={}")
+
def process_args():
- parser = argparse.ArgumentParser(description="Write fresh instance metadata credentials to a file for license authentication")
- parser.add_argument("auth_data_file", help="A file to hold the instance metadata JWT")
+ parser = argparse.ArgumentParser(description="Write fresh instance "
+ "metadata credentials to a file for "
+ "license authentication")
+ parser.add_argument("auth_data_file", help="A file to hold the instance "
+ "metadata JWT")
parser.add_argument("sentieon_key", help="A license key string")
+ parser.add_argument("--email", help="An email associated with the license")
return parser.parse_args()
-def main(args):
- if not args:
- args = process_args()
+def send_to_background():
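+ # Daemonize with a double fork so the credential refresher keeps running after the parent exits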
+ pid = os.fork()
+ if pid == 0:
+ os.setsid() # Guarantee no controlling terminal
+ pid = os.fork()
+ if pid != 0:
+ os._exit(0)
+ else:
+ os._exit(0)
+
+
+def main(args):
url = project_url.format(request_format)
response = requests.get(url, headers=headers)
project_id = response.text
@@ -50,10 +67,15 @@ def main(args):
out["google_session_token"] = response.text
if args.sentieon_key:
out["license_key"] = args.sentieon_key
+ if args.email:
+ out["email"] = args.email
with open(args.auth_data_file, 'w') as f:
json.dump(out, f)
- time.sleep(55 * 60) # sleep for 55 minutes before refreshing the token or until killed
+ # sleep for 55 minutes before refreshing the token or until killed
+ time.sleep(55 * 60)
-if __name__ == "__main__":
- main(None)
+if __name__ == "__main__":
+ args = process_args()
+ send_to_background()
+ main(args)
diff --git a/runner/ccdg.yaml b/runner/ccdg.yaml
new file mode 100644
index 0000000..4cc8d92
--- /dev/null
+++ b/runner/ccdg.yaml
@@ -0,0 +1,72 @@
+name: Sentieon_ccdg
+description: Run a Sentieon CCDG pipeline on the Google Cloud
+
+inputParameters:
+# Required parameters
+# FQ1 or BAM is required
+- name: FQ1
+ defaultValue: None
+ description: Fastq for a single sample (comma-separated)
+- name: FQ2
+ defaultValue: None
+ description: Fastq pairs for a single sample (comma-separated)
+- name: BAM
+ defaultValue: None
+ description: Bam files for a single sample (comma-separated)
+- name: OUTPUT_BUCKET
+ description: The output Google Cloud Storage directory
+- name: REF
+ description: The reference genome (and associated indices)
+
+# Optional parameters
+- name: EMAIL
+ description: An email to use to obtain an evaluation license
+ defaultValue: None
+- name: SENTIEON_VERSION
+ description: Version of the Sentieon software to use
+ defaultValue: 201808.07
+- name: READGROUP
+ description: Readgroup information to add during alignment
+ defaultValue: "@RG\\tID:read-group\\tSM:sample-name\\tPL:ILLUMINA"
+- name: BQSR_SITES
+ description: Known sites for BQSR (comma-separated)
+ defaultValue: None
+- name: DBSNP
+ description: A dbSNP file to use during variant calling
+ defaultValue: None
+- name: INTERVAL
+ description: A string of interval(s) to use during variant calling
+ defaultValue: None
+- name: INTERVAL_FILE
+ description: An interval file
+ defaultValue: None
+- name: NO_METRICS
+ description: Set to not run metrics collection
+ defaultValue: None
+- name: NO_BAM_OUTPUT
+ description: Set to not output a preprocessed BAM file
+ defaultValue: None
+- name: NO_HAPLOTYPER
+ description: Set to not output a VCF
+ defaultValue: None
+- name: GVCF_OUTPUT
+ description: Set to output a GVCF instead of a VCF
+ defaultValue: None
+- name: STREAM_INPUT
+ description: Stream fastq input directly from storage
+ defaultValue: None
+- name: PIPELINE
+ description: Run the CCDG preprocessing pipeline
+ defaultValue: CCDG
+- name: SENTIEON_KEY
+ description: A Sentieon License Key
+ defaultValue: None
+- name: DNASCOPE_MODEL
+ description: A trained model to use with DNAscope
+ defaultValue: https://s3.amazonaws.com/sentieon-release/other/SentieonDNAscopeModelBeta0.4a-201808.05.model
+- name: CALLING_ARGS
+ description: Additional parameters to set during variant calling
+ defaultValue: None
+- name: CALLING_ALGO
+ description: The variant calling algorithm to use
+ defaultValue: Haplotyper
diff --git a/runner/sentieon_germline.yaml b/runner/germline.yaml
similarity index 73%
rename from runner/sentieon_germline.yaml
rename to runner/germline.yaml
index 0bd1fae..fda8d3e 100644
--- a/runner/sentieon_germline.yaml
+++ b/runner/germline.yaml
@@ -19,6 +19,12 @@ inputParameters:
description: The reference genome (and associated indices)
# Optional parameters
+- name: EMAIL
+ description: An email to use to obtain an evaluation license
+ defaultValue: None
+- name: SENTIEON_VERSION
+ description: Version of the Sentieon software to use
+ defaultValue: 201808.07
- name: READGROUP
description: Readgroup information to add during alignment
defaultValue: "@RG\\tID:read-group\\tSM:sample-name\\tPL:ILLUMINA"
@@ -53,11 +59,20 @@ inputParameters:
description: Stream fastq input directly from storage
defaultValue: None
- name: PIPELINE
- description: Run DNAseq or DNAscope
- defaultValue: DNA
+ description: Run germline variant calling
+ defaultValue: GERMLINE
- name: SENTIEON_KEY
description: A Sentieon License Key
defaultValue: None
- name: RECALIBRATED_OUTPUT
description: Set to apply BQSR to the output BAM file
defaultValue: None
+- name: DNASCOPE_MODEL
+ description: A trained model to use with DNAscope
+ defaultValue: https://s3.amazonaws.com/sentieon-release/other/SentieonDNAscopeModelBeta0.4a-201808.05.model
+- name: CALLING_ARGS
+ description: Additional parameters to set during variant calling
+ defaultValue: None
+- name: CALLING_ALGO
+ description: The variant calling algorithm to use
+ defaultValue: Haplotyper
diff --git a/runner/runner_default.json b/runner/runner_default.json
index b26cc52..89697cf 100644
--- a/runner/runner_default.json
+++ b/runner/runner_default.json
@@ -1,36 +1,41 @@
{
- "PREEMPTIBLE_TRIES": 0,
- "DEDUP": "rmdup",
- "NO_HAPLOTYPER": null,
- "TUMOR_FQ1": null,
- "TUMOR_FQ2": null,
- "BAM": null,
- "REF": null,
- "INTERVAL_FILE": null,
- "REALIGN_SITES": null,
- "PROJECT_ID": null,
- "DISK_SIZE": 300,
- "INTERVAL": null,
- "NO_METRICS": null,
- "TUMOR_READGROUP": "@RG\\tID:tumor-read-group\\tSM:tumor-sample-name\\tPL:ILLUMINA",
- "DBSNP": null,
- "MIN_RAM_GB": 56,
- "NO_BAM_OUTPUT": null,
- "RUN_TNSNV": null,
- "OUTPUT_BUCKET": null,
- "STREAM_INPUT": null,
- "READGROUP": "@RG\\tID:read-group\\tSM:sample-name\\tPL:ILLUMINA",
- "BQSR_SITES": null,
- "PIPELINE": "DNA",
- "MIN_CPU": 64,
- "NONPREEMPTIBLE_TRY": true,
- "ZONES": null,
- "TUMOR_BAM": null,
- "NO_VCF": null,
- "GVCF_OUTPUT": null,
- "FQ2": null,
- "FQ1": null,
- "DOCKER_IMAGE": "sentieon/sentieon-google-cloud:201808.01",
+ "FQ1": null,
+ "FQ2": null,
+ "BAM": null,
+ "OUTPUT_BUCKET": null,
+ "REF": null,
+ "EMAIL": null,
+ "SENTIEON_VERSION": "201808.07",
+ "READGROUP": "@RG\\tID:read-group\\tSM:sample-name\\tPL:ILLUMINA",
+ "DEDUP": "rmdup",
+ "BQSR_SITES": null,
+ "DBSNP": null,
+ "INTERVAL": null,
+ "INTERVAL_FILE": null,
+ "NO_METRICS": null,
+ "NO_BAM_OUTPUT": null,
+ "NO_HAPLOTYPER": null,
+ "GVCF_OUTPUT": null,
+ "STREAM_INPUT": null,
+ "PIPELINE": "GERMLINE",
"SENTIEON_KEY": null,
- "RECALIBRATED_OUTPUT": null
+ "RECALIBRATED_OUTPUT": null,
+ "TUMOR_FQ1": null,
+ "TUMOR_FQ2": null,
+ "TUMOR_BAM": null,
+ "TUMOR_READGROUP": "@RG\\tID:tumor-read-group\\tSM:tumor-sample-name\\tPL:ILLUMINA",
+ "NO_VCF": null,
+ "RUN_TNSNV": null,
+ "REALIGN_SITES": null,
+ "ZONES": null,
+ "DISK_SIZE": 300,
+ "MACHINE_TYPE": "n1-highcpu-64",
+ "CPU_PLATFORM": "Intel Broadwell",
+ "PROJECT_ID": null,
+ "DOCKER_IMAGE": "sentieon/sentieon-google-cloud:0.2.2",
+ "CALLING_ARGS": null,
+ "CALLING_ALGO": "Haplotyper",
+ "DNASCOPE_MODEL": "https://s3.amazonaws.com/sentieon-release/other/SentieonDNAscopeModelBeta0.4a-201808.05.model",
+ "PREEMPTIBLE_TRIES": 0,
+ "NONPREEMPTIBLE_TRY": true
}
diff --git a/runner/sentieon_runner.py b/runner/sentieon_runner.py
index ba2b232..21d4817 100644
--- a/runner/sentieon_runner.py
+++ b/runner/sentieon_runner.py
@@ -11,98 +11,160 @@
import argparse
import os
import sys
-from pprint import pprint
import copy
import time
import ssl
+import warnings
+import logging
+import random
+import google.auth
+import googleapiclient.errors
from apiclient.discovery import build
-import google.auth
+from pprint import pformat
+from googleapiclient.errors import HttpError
script_dir = os.path.dirname(os.path.realpath(__file__))
-germline_yaml = script_dir + "/sentieon_germline.yaml"
-somatic_yaml = script_dir + "/sentieon_somatic.yaml"
+germline_yaml = script_dir + "/germline.yaml"
+somatic_yaml = script_dir + "/somatic.yaml"
+ccdg_yaml = script_dir + "/ccdg.yaml"
default_json = script_dir + "/runner_default.json"
-target_url_base = "https://www.googleapis.com/compute/v1/projects/{project}/zones/{zone}/instances/{instance}"
+target_url_base = ("https://www.googleapis.com/compute/v1/projects/{project}/"
+ "zones/{zone}/instances/{instance}")
def cloud_storage_exists(client, gs_path):
-
try:
bucket, blob = gs_path[5:].split('/', 1)
bucket = client.bucket(bucket)
blob = bucket.blob(blob)
res = blob.exists()
- except: # Catch all exceptions
- raise ValueError("Error: Could not find {gs_path} in Google Cloud Storage".format(**locals()))
+ except: # Catch all exceptions
+ raise ValueError("Error: Could not find {gs_path} in Google Cloud "
+ "Storage".format(**locals()))
return res
def check_inputs_exist(job_vars, credentials):
from google.cloud import storage
- client = storage.Client(credentials=credentials)
+ with warnings.catch_warnings():
+ warnings.filterwarnings("ignore", "Your application has authenticated "
+ "using end user credentials from Google Cloud "
+ "SDK")
+ client = storage.Client(credentials=credentials)
# The DBSNP, BQSR and Realign sites files
sites_files = []
- sites_files += job_vars["BQSR_SITES"].split(',') if job_vars["BQSR_SITES"] else []
- sites_files += job_vars["REALIGN_SITES"].split(',') if job_vars["REALIGN_SITES"] else []
+ sites_files += (job_vars["BQSR_SITES"].split(',') if
+ job_vars["BQSR_SITES"] else [])
+ sites_files += (job_vars["REALIGN_SITES"].split(',') if
+ job_vars["REALIGN_SITES"] else [])
sites_files += [job_vars["DBSNP"]] if job_vars["DBSNP"] else []
for sites_file in sites_files:
if not cloud_storage_exists(client, sites_file):
- sys.exit("Error: Could not find supplied file {}".format(sites_file))
+ logging.error("Could not find supplied file "
+ "{}".format(sites_file))
+ sys.exit(-1)
if sites_file.endswith("vcf.gz"):
if not cloud_storage_exists(client, sites_file + ".tbi"):
- sys.exit("Error: Could not find index for file {}".format(sites_file))
+ logging.error("Could not find index for file "
+ "{}".format(sites_file))
+ sys.exit(-1)
else:
if not cloud_storage_exists(client, sites_file + ".idx"):
- sys.exit("Error: Could not find index for file {}".format(sites_file))
+ logging.error("Could not find index for file "
+ "{}".format(sites_file))
+ sys.exit(-1)
# The data input files
- gs_split_files = (job_vars["FQ1"], job_vars["TUMOR_FQ1"], job_vars["FQ2"], job_vars["TUMOR_FQ2"], job_vars["BAM"], job_vars["TUMOR_BAM"])
+ gs_split_files = (
+ job_vars["FQ1"],
+ job_vars["TUMOR_FQ1"],
+ job_vars["FQ2"],
+ job_vars["TUMOR_FQ2"],
+ job_vars["BAM"],
+ job_vars["TUMOR_BAM"])
gs_files = ()
for split_file in gs_split_files:
if not split_file:
continue
for input_file in split_file.split(','):
if not cloud_storage_exists(client, input_file):
- sys.exit("Error: Could not find the supplied file {}".format(input_file))
+ logging.error("Could not find the supplied file "
+ "{}".format(input_file))
+ sys.exit(-1)
for input_file in gs_files:
if not cloud_storage_exists(client, input_file):
- sys.exit("Error: Could not file the supplied file {}".format(input_file))
+ logging.error("Could not file the supplied file "
+ "{}".format(input_file))
+ sys.exit(-1)
# All reference files
ref = job_vars["REF"]
ref_base = ref[:-3] if ref.endswith(".fa") else ref[:-6]
if not cloud_storage_exists(client, ref):
- sys.exit("Error: Reference file not found")
+ logging.error("Reference file not found")
+ sys.exit(-1)
if not cloud_storage_exists(client, ref + ".fai"):
- sys.exit("Error: Reference fai index not found")
+ logging.error("Reference fai index not found")
+ sys.exit(-1)
if (not cloud_storage_exists(client, ref + ".dict") and
- not cloud_storage_exists(client, ref_base + ".dict")):
- sys.exit("Error: Reference dict index not found")
+ not cloud_storage_exists(client, ref_base + ".dict")):
+ logging.error("Reference dict index not found")
+ sys.exit(-1)
# FQ specific
if job_vars["FQ1"] or job_vars["TUMOR_FQ1"]:
for suffix in [".amb", ".ann", ".bwt", ".pac", ".sa"]:
- if not cloud_storage_exists(client, ref + suffix):
- sys.exit("Error: Reference BWA index {} not found".format(suffix))
+ if (not cloud_storage_exists(client, ref + suffix) and
+ not cloud_storage_exists(client, ref + ".64" + suffix)):
+ logging.error("Reference BWA index {} not "
+ "found".format(suffix))
+ sys.exit(-1)
# BAM specific
bam_vars = ("BAM", "TUMOR_BAM")
for bam_type in bam_vars:
if job_vars[bam_type]:
for bam in job_vars[bam_type].split(','):
if (not cloud_storage_exists(client, bam + ".bai") and
- not cloud_storage_exists(client, bam + "bai")):
- sys.exit("Error: BAM supplied but BAI not found")
-
+ not cloud_storage_exists(client, bam + "bai")):
+ logging.error("BAM supplied but BAI not found")
+ sys.exit(-1)
+
def main(vargs=None):
parser = argparse.ArgumentParser()
parser.add_argument("pipeline_config", help="The json configuration file")
- parser.add_argument("--no_check_inputs_exist", action="store_true", help="Do not check that the input files exist before running the pipeline")
- parser.add_argument("--polling_interval", type=float, default=30, help="Seconds between polling the running operation")
+ parser.add_argument(
+ "--verbose",
+ "-v",
+ action="count",
+ help="Increase the runner verbosity")
+ parser.add_argument(
+ "--no_check_inputs_exist",
+ action="store_true",
+ help="Do not check that the input files exist before running the "
+ "pipeline")
+ parser.add_argument(
+ "--polling_interval",
+ type=float,
+ default=30,
+ help="Seconds between polling the running operation")
args = parser.parse_args()
polling_interval = args.polling_interval
+ logging.getLogger("googleapiclient."
+ "discovery_cache").setLevel(logging.ERROR)
+ log_format = "%(filename)s::%(funcName)s [%(levelname)s] %(message)s"
+ if not hasattr(args, "verbose") or args.verbose is None:
+ log_level = logging.WARNING
+ elif args.verbose == 1:
+ log_level = logging.INFO
+ elif args.verbose >= 2:
+ log_level = logging.DEBUG
+ else:
+ log_level = logging.WARNING
+ logging.basicConfig(level=log_level, format=log_format)
+
# Grab input arguments from the json file
job_vars = json.load(open(default_json))
job_vars.update(json.load(open(args.pipeline_config)))
@@ -110,14 +172,35 @@ def main(vargs=None):
if job_vars["NONPREEMPTIBLE_TRY"]:
non_preemptible_tries = 1
preemptible = True if preemptible_tries > 0 else False
- credentials, project_id = google.auth.default()
+ with warnings.catch_warnings():
+ warnings.filterwarnings("ignore", "Your application has authenticated "
+ "using end user credentials from Google Cloud "
+ "SDK")
+ credentials, project_id = google.auth.default()
+
+ # Warn about deprecated JSON keys
+ if "MIN_RAM_GB" in job_vars or "MIN_CPU" in job_vars:
+ logging.warning("'MIN_RAM_GB' and 'MIN_CPU' are now ignored. "
+ "Please use 'MACHINE_TYPE' to specify the instance "
+ "type")
# Grab the yaml for the workflow
- pipeline_yaml = germline_yaml if job_vars["PIPELINE"] == "DNA" or job_vars["PIPELINE"] == "DNAscope" or job_vars["PIPELINE"] == "DNAseq" else somatic_yaml
+ pipeline = job_vars["PIPELINE"]
+ if pipeline == "GERMLINE":
+ pipeline_yaml = germline_yaml
+ elif pipeline == "SOMATIC":
+ pipeline_yaml = somatic_yaml
+ elif pipeline == "CCDG":
+ pipeline_yaml = ccdg_yaml
+ else:
+ logging.error("Pipeline '" + pipeline + "'. Valid "
+ "values are 'GERMLINE' and 'SOMATIC'")
+ sys.exit(-1)
try:
- pipeline_dict = yaml.load(open(pipeline_yaml))
+ pipeline_dict = yaml.safe_load(open(pipeline_yaml))
except IOError:
- sys.exit("Error. No yaml \"{}\" found.".format(pipeline_yaml))
+ logging.error("No yaml \"{}\" found.".format(pipeline_yaml))
+ sys.exit(-1)
# Try not to create nearly empty directories
while job_vars["OUTPUT_BUCKET"].endswith('/'):
@@ -125,172 +208,295 @@ def main(vargs=None):
# Some basic error checking to fail early
if not job_vars["PROJECT_ID"]:
- sys.exit("Error: Please supply a PROJECT_ID")
+ logging.error("Please supply a PROJECT_ID")
+ sys.exit(-1)
# Shared errors
if job_vars["FQ1"] and job_vars["BAM"]:
- sys.exit("Error: Please supply either 'FQ1' or 'BAM' (not both)")
+ logging.error("Please supply either 'FQ1' or 'BAM' (not both)")
+ sys.exit(-1)
if job_vars["INTERVAL"] and job_vars["INTERVAL_FILE"]:
- sys.exit("Error: Please supply either 'INTERVAL' or 'INTERVAL_FILE'")
+ logging.error("Please supply either 'INTERVAL' or 'INTERVAL_FILE'")
+ sys.exit(-1)
if ((job_vars["FQ1"] and job_vars["READGROUP"]) and
- len(job_vars["FQ1"].split(',')) != len(job_vars["READGROUP"].split(','))):
- sys.exit("Error: The number of fastq files must match the number of supplied readgroups")
+ (len(job_vars["FQ1"].split(',')) !=
+ len(job_vars["READGROUP"].split(',')))):
+ logging.error("The number of fastq files must match the number of "
+ "supplied readgroups")
+ sys.exit(-1)
# Pipeline specific errors
- if job_vars["PIPELINE"] == "DNA" or job_vars["PIPELINE"] == "DNAscope" or job_vars["PIPELINE"] == "DNAseq":
+ if pipeline == "GERMLINE" or pipeline == "CCDG":
if not job_vars["FQ1"] and not job_vars["BAM"]:
- sys.exit("Error: Please supply either 'FQ1' or 'BAM'")
- if job_vars["NO_HAPLOTYPER"] and job_vars["NO_METRICS"] and job_vars["NO_BAM_OUTPUT"]:
- sys.exit("Error: No output files requested")
+ logging.error("Please supply either 'FQ1' or 'BAM'")
+ sys.exit(-1)
+ if (job_vars["NO_HAPLOTYPER"] and
+ job_vars["NO_METRICS"] and
+ job_vars["NO_BAM_OUTPUT"]):
+ logging.error("No output files requested")
+ sys.exit(-1)
if job_vars["RECALIBRATED_OUTPUT"] and job_vars["BQSR_SITES"] is None:
- sys.exit("Error: Cannot output a recalibrated BAM file without running BQSR. Please supply 'BQSR_SITES'")
- elif job_vars["PIPELINE"] == "TN" or job_vars["PIPELINE"] == "TNscope" or job_vars["PIPELINE"] == "TNseq":
+ logging.error("Cannot output a recalibrated BAM file without "
+ "running BQSR. Please supply 'BQSR_SITES'")
+ sys.exit(-1)
+ valid_algos = ("Haplotyper", "DNAscope")
+ if job_vars["CALLING_ALGO"] not in valid_algos:
+ logging.error("'" + job_vars["CALLING_ALGO"] + "' is not a "
+ "valid germline variant calling algo. Please set "
+ "'CALLING_ALGO' to one of " + str(valid_algos))
+ sys.exit(-1)
+ # Additional CCDG checks
+ if pipeline == "CCDG":
+ if job_vars["BQSR_SITES"] is None:
+ logging.error("The CCDG pipeline requires known sites for "
+ "BQSR. Please supply 'BQSR_SITES'")
+ sys.exit(-1)
+ elif pipeline == "SOMATIC":
if job_vars["TUMOR_FQ1"] and job_vars["TUMOR_BAM"]:
- sys.exit("Error: Please supply either 'TUMOR_FQ1' or 'TUMOR_BAM' (not both)")
- if not job_vars["TUMOR_FQ1"] and not job_vars["TUMOR_BAM"]:
- sys.exit("Error: Please supply either 'TUMOR_FQ1' or 'TUMOR_BAM'")
- if job_vars["RUN_TNSNV"] and not job_vars["REALIGN_SITES"]:
- sys.exit("Error: TNsnv requires indel realignment. Please supply 'REALIGN_SITES'")
- if job_vars["NO_BAM_OUTPUT"] and job_vars["NO_VCF"] and job_vars["NO_METRICS"]:
- sys.exit("Error: No output files requested")
+ logging.error("Please supply either 'TUMOR_FQ1' or 'TUMOR_BAM' "
+ "(not both)")
+ sys.exit(-1)
+ if (not job_vars["TUMOR_FQ1"] and
+ not job_vars["TUMOR_BAM"]):
+ logging.error("Please supply either 'TUMOR_FQ1' or 'TUMOR_BAM'")
+ sys.exit(-1)
+ if (job_vars["RUN_TNSNV"] and
+ not job_vars["REALIGN_SITES"]):
+ logging.error("TNsnv requires indel realignment. Please supply "
+ "'REALIGN_SITES'")
+ sys.exit(-1)
+ if (job_vars["NO_BAM_OUTPUT"] and
+ job_vars["NO_VCF"] and job_vars["NO_METRICS"]):
+ logging.error("No output files requested")
+ sys.exit(-1)
if ((job_vars["TUMOR_FQ1"] and job_vars["TUMOR_READGROUP"]) and
- len(job_vars["TUMOR_FQ1"].split(',')) != len(job_vars["TUMOR_READGROUP"].split(','))):
- sys.exit("Error: The number of tumor fastq files must match the number of supplied readgroups")
-
- else:
- sys.exit("Error: DNAseq, DNAscope, TNseq, and TNscope are currently supported")
+ (len(job_vars["TUMOR_FQ1"].split(',')) !=
+ len(job_vars["TUMOR_READGROUP"].split(',')))):
+ logging.error("The number of tumor fastq files must match the "
+ "number of supplied readgroups")
+ sys.exit(-1)
+ valid_algos = ("TNhaplotyper", "TNhaplotyper2", "TNscope", "TNsnv")
+ if job_vars["CALLING_ALGO"] not in valid_algos:
+ logging.error("'" + job_vars["CALLING_ALGO"] + "' is not a "
+ "valid somatic variant calling algo. Please set "
+ "'CALLING_ALGO' to one of " + str(valid_algos))
+ sys.exit(-1)
if not args.no_check_inputs_exist:
check_inputs_exist(job_vars, credentials)
- # Construct the pipeline arguments
- args_dict = {}
- args_dict["projectId"] = job_vars["PROJECT_ID"]
- args_dict["logging"] = {"gcsPath": job_vars["OUTPUT_BUCKET"] + "/worker_logs/"}
- resources_dict = {}
- resources_dict["minimumRamGb"] = job_vars["MIN_RAM_GB"]
- resources_dict["minimumCpuCores"] = job_vars["MIN_CPU"]
- resources_dict["zones"] = job_vars["ZONES"].split(',') if job_vars["ZONES"] else []
- args_dict["resources"] = copy.copy(resources_dict)
-
- # Translate None back to "None"
- input_dict = {}
+ # Resources dict
+ zones = job_vars["ZONES"].split(',') if job_vars["ZONES"] else []
+ if not zones:
+ logging.error("Please supply at least one zone to run the pipeline")
+ region = zones[0][:-2]
+ disk = {
+ "name": "local-disk",
+ "type": "local-ssd",
+ "sizeGb": int(job_vars["DISK_SIZE"])
+ }
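+ # Describe the worker VM for the Cloud Life Sciences v2beta request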
+ vm_dict = {
+ "machineType": job_vars["MACHINE_TYPE"],
+ "preemptible": preemptible,
+ "disks": [disk],
+ "serviceAccount": {"scopes": [
+ "https://www.googleapis.com/auth/cloud-platform"]},
+ "cpuPlatform": job_vars["CPU_PLATFORM"]
+ }
+ resources_dict = {
+ "zones": zones,
+ "virtualMachine": vm_dict
+ }
+
+ # Environment
+ env_dict = {}
for input_var in pipeline_dict["inputParameters"]:
- input_dict[input_var["name"]] = job_vars[input_var["name"]]
- if input_dict[input_var["name"]] is None:
- input_dict[input_var["name"]] = "None"
- args_dict["inputs"] = input_dict
-
- # Construct the pipeline object
- resources_dict = {}
- resources_dict["preemptible"] = preemptible
-
- # Ideally we could use mdadm to combine local SSD, but that is not possible inside the container
- disk = {}
- disk["name"] = "local-disk"
- disk["mountPoint"] = "/mnt/work"
- if int(job_vars["DISK_SIZE"]) <= 375:
- print("Disk is less than 375 GB, using a single local SSD")
- disk["type"] = "LOCAL_SSD"
+ env_dict[input_var["name"]] = job_vars[input_var["name"]]
+ if env_dict[input_var["name"]] is None:
+ env_dict[input_var["name"]] = "None"
+
+ # Action
+ if pipeline == "GERMLINE":
+ _cmd = "/opt/sentieon/gc_germline.sh"
+ elif pipeline == "SOMATIC":
+ _cmd = "/opt/sentieon/gc_somatic.sh"
+ elif pipeline == "CCDG":
+ _cmd = "/opt/sentieon/gc_ccdg_germline.sh"
else:
- print("Disk is greater than 375 GB, using a persistant SSD")
- disk["type"] = "PERSISTENT_SSD"
- disk["sizeGb"] = int(job_vars["DISK_SIZE"])
-
- disks = [disk]
- resources_dict["disks"] = disks
- pipeline_dict["resources"] = resources_dict
- pipeline_dict["projectId"] = job_vars["PROJECT_ID"]
- pipeline_dict["docker"] = {
- "imageName": job_vars["DOCKER_IMAGE"],
- "cmd": "bash /opt/sentieon/gc_germline.sh" if job_vars["PIPELINE"] == "DNA" or job_vars["PIPELINE"] == "DNAscope" or job_vars["PIPELINE"] == "DNAseq" else "bash /opt/sentieon/gc_somatic.sh"
+ logging.error("Error: Unknown pipeline " + pipeline)
+ sys.exit(-1)
+
+ run_action = {
+ "containerName": "run-pipeline",
+ "imageUri": job_vars["DOCKER_IMAGE"],
+ "commands": ["/bin/bash", _cmd],
+ "mounts": [{
+ "disk": "local-disk",
+ "path": "/mnt/work",
+ "readOnly": False
+ }],
+ }
+
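+ # A second action always runs to copy the pipeline's stdout/stderr logs to the output bucket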
+ cleanup_action = {
+ "containerName": "cleanup",
+ "imageUri": job_vars["DOCKER_IMAGE"],
+ "commands": [
+ "/bin/bash",
+ "-c",
+ ("gsutil cp /google/logs/action/1/stderr "
+ "\"{}/worker_logs/stderr.txt\" && "
+ "gsutil cp /google/logs/action/1/stdout "
+ "\"{}/worker_logs/stdout.txt\"").format(
+ job_vars["OUTPUT_BUCKET"], job_vars["OUTPUT_BUCKET"])],
+ "alwaysRun": True
}
# Run the pipeline #
- service = build('genomics', 'v1alpha2', credentials=credentials)
+ project = job_vars["PROJECT_ID"]
+ service_parent = "projects/" + project + "/locations/" + region
+ service = build('lifesciences', 'v2beta', credentials=credentials)
compute_service = build("compute", "v1", credentials=credentials)
operation = None
counter = 0
- project = job_vars["PROJECT_ID"]
-
+
while non_preemptible_tries > 0 or preemptible_tries > 0:
if operation:
- while not operation["done"]:
- time.sleep(polling_interval)
- try:
- operation = service.operations().get(name=operation['name']).execute()
- except ssl.SSLError:
- print("Network error while polling running operation.")
- sys.stdout.flush()
+ while not operation.get("done", False):
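+ # Poll the running operation, tolerating up to five transient HTTP/SSL errors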
+ new_op, tries = None, 0
+ while tries <= 5:
+ time.sleep(polling_interval)
+ try:
+ ops = service.projects().locations().operations()
+ new_op = ops.get(name=operation['name']).execute()
+ break
+ except googleapiclient.errors.HttpError as e:
+ logging.warning(str(e))
+ tries += 1
+ except ssl.SSLError as e:
+ logging.warning(str(e))
+ tries += 1
+ if not new_op:
+ logging.error("Network error while polling running "
+ "operation.")
sys.exit(1)
- pprint(operation, indent=2)
+ operation = new_op
+ logging.debug(pformat(operation, indent=2))
if "error" in operation:
- try:
- zone = operation["metadata"]["runtimeMetadata"]["computeEngine"]["zone"]
- except KeyError:
- print("Genomics operation failed before running:")
- pprint(operation["error"], indent=2)
- sys.stdout.flush()
+ assigned_events = list(filter(
+ lambda x: "workerAssigned" in x.keys(),
+ operation["metadata"]["events"]))
+ if not assigned_events:
+ logging.error("Genomics operation failed before running:")
+ logging.error(pformat(operation["error"], indent=2))
sys.exit(2)
- instance = operation["metadata"]["runtimeMetadata"]["computeEngine"]["instanceName"]
+ startup_event = assigned_events[-1]
+ instance = startup_event["workerAssigned"]["instance"]
+ zone = startup_event["workerAssigned"]["zone"]
url = target_url_base.format(**locals())
- time.sleep(30) # Don't poll too quickly
- compute_ops = compute_service.zoneOperations().list(project=project, zone=zone, filter="(targetLink eq {url}) (operationType eq compute.instances.preempted)".format(**locals())).execute()
- if "items" in compute_ops and any([x["operationType"] == "compute.instances.preempted" for x in compute_ops["items"]]):
- print("Run {} failed. Retrying...".format(counter))
+ time.sleep(300) # Give the preemption operation time to be recorded
+ compute_ops = (
+ compute_service.zoneOperations().list(
+ project=project, zone=zone, filter=(
+ "(targetLink eq {url}) (operationType eq "
+ "compute.instances.preempted)"
+ ).format(**locals())).execute())
+ if ("items" in compute_ops and
+ any([(x["operationType"] ==
+ "compute.instances.preempted")
+ for x in compute_ops["items"]])):
+ logging.warning("Run {} failed. "
+ "Retrying...".format(counter))
else:
- print("Run {} failed, but not due to preemption. Exit".format(counter))
+ logging.error("Run {} failed, but not due to preemption. "
+ "Exit".format(counter))
operation = None
break
else:
break
if preemptible_tries > 0:
- args_dict["resources"]["preemptible"] = True
+ vm_dict["preemptible"] = True
preemptible_tries -= 1
else:
- args_dict["resources"]["preemptible"] = False
+ vm_dict["preemptible"] = False
non_preemptible_tries -= 1
- print("Running pipeline:")
+ logging.debug("Running pipeline:")
body = {
- "ephemeralPipeline": pipeline_dict,
- "pipelineArgs": args_dict
+ "pipeline": {
+ "actions": [run_action, cleanup_action],
+ "resources": resources_dict,
+ "environment": env_dict
+ }
}
- pprint(body, indent=2)
- sys.stdout.flush()
- operation = service.pipelines().run(body=body).execute()
+ logging.debug(pformat(body, indent=2))
+ sys.stderr.flush()
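+ # Launch with exponential backoff and jitter to ride out transient API errors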
+ backoff, backoff_interval = 0, 1
+ while backoff < 6:
+ time.sleep(backoff_interval * random.random() * (2 ** backoff - 1))
+ try:
+ op_pipelines = service.projects().locations().pipelines()
+ request = op_pipelines.run(
+ parent=service_parent,
+ body=body)
+ operation = request.execute()
+ break
+ except googleapiclient.errors.HttpError as e:
+ logging.warning(str(e))
+ backoff += 1
+ if not operation:
+ logging.error("Failed to launch job")
+ sys.exit(3)
+ else:
+ logging.warning("Launched job: " + operation["name"])
counter += 1
+ logging.debug(pformat(operation, indent=2))
if operation:
- while not operation["done"]:
- time.sleep(polling_interval)
- try:
- operation = service.operations().get(name=operation["name"]).execute()
- except ssl.SSLError:
- print("Network error while waiting for the final operation to finish")
- sys.stdout.flush()
+ while not operation.get("done", False):
+ new_op, tries = None, 0
+ while tries <= 5:
+ time.sleep(polling_interval)
+ try:
+ new_op = service.projects().locations().operations().get(
+ name=operation["name"]).execute()
+ break
+ except googleapiclient.errors.HttpError as e:
+ logging.warning(str(e))
+ tries += 1
+ except ssl.SSLError as e:
+ logging.warning(str(e))
+ tries += 1
+ if not new_op:
+ logging.error("Network error while waiting for the final "
+ "operation to finish")
sys.exit(1)
+ operation = new_op
+ logging.debug(pformat(operation, indent=2))
if "error" in operation:
- pprint(operation, indent=2)
- try:
- zone = operation["metadata"]["runtimeMetadata"]["computeEngine"]["zone"]
- except KeyError:
- print("Genomics operation failed before running:")
- pprint(operation["error"], indent=2)
- sys.stdout.flush()
+ assigned_events = list(filter(
+ lambda x: "workerAssigned" in x.keys(),
+ operation["metadata"]["events"]))
+ if not assigned_events:
+ logging.error("Genomics operation failed before running:")
+ logging.error(pformat(operation["error"], indent=2))
sys.exit(2)
- instance = operation["metadata"]["runtimeMetadata"]["computeEngine"]["instanceName"]
+ startup_event = assigned_events[-1]
+ instance = startup_event["workerAssigned"]["instance"]
+ zone = startup_event["workerAssigned"]["zone"]
url = target_url_base.format(**locals())
- compute_ops = compute_service.zoneOperations().list(project=project, zone=zone, filter="(targetLink eq {url}) (operationType eq compute.instances.preempted)".format(**locals())).execute()
- if "items" in compute_ops:
- print("Final run failed due to preemption")
- else:
- print("Final run failed, not due to preemption")
+ compute_ops = compute_service.zoneOperations().list(
+ project=project,
+ zone=zone,
+ filter=("(targetLink eq {url}) (operationType eq "
+ "compute.instances.preempted)").format(
+ **locals())).execute()
+ logging.error("Final run failed.")
else:
- print("Operation succeeded")
+ logging.warning("Operation succeeded")
+
if __name__ == "__main__":
main()
diff --git a/runner/sentieon_somatic.yaml b/runner/somatic.yaml
similarity index 83%
rename from runner/sentieon_somatic.yaml
rename to runner/somatic.yaml
index 54970fb..370be84 100644
--- a/runner/sentieon_somatic.yaml
+++ b/runner/somatic.yaml
@@ -19,6 +19,12 @@ inputParameters:
description: The reference genome (and associated indices)
# Optional parameters
+- name: EMAIL
+ description: An email to use to obtain an evaluation license
+ defaultValue: None
+- name: SENTIEON_VERSION
+ description: Version of the Sentieon software to use
+ defaultValue: 201808.07
- name: FQ1
defaultValue: None
description: Fastq for a single sample (comma-separated)
@@ -65,12 +71,17 @@ inputParameters:
description: Stream fastq input directly from storage
defaultValue: None
- name: PIPELINE
- description: Run DNAseq or DNAscope
- defaultValue: TNseq
+ description: Run the somatic pipeline
+ defaultValue: SOMATIC
- name: REALIGN_SITES
description: Known sites for indel realignment (comma-separated)
defaultValue: None
- name: SENTIEON_KEY
description: A Sentieon License Key
defaultValue: None
-
+- name: CALLING_ARGS
+ description: Additional parameters to set during variant calling
+ defaultValue: None
+- name: CALLING_ALGO
+ description: The variant calling algorithm to use
+ defaultValue: TNhaplotyper
diff --git a/runner/tool_yaml_to_pipeline_json.py b/runner/tool_yaml_to_pipeline_json.py
index 9841c52..40ad150 100644
--- a/runner/tool_yaml_to_pipeline_json.py
+++ b/runner/tool_yaml_to_pipeline_json.py
@@ -6,6 +6,7 @@
import os
import json
+
def add_to_yaml(a, b, ignore_keys=set(("name", "description", "PIPELINE"))):
'''
If an object is in b but not a, add it to a.
@@ -22,8 +23,8 @@ def add_to_yaml(a, b, ignore_keys=set(("name", "description", "PIPELINE"))):
add_to_yaml(a[k], b[k])
else:
a[k] = b[k]
- elif type(a) is list: # In this case, order is unimportant
- try: # Only applicable with zones.
+ elif type(a) is list: # In this case, order is unimportant
+ try: # Only applicable with zones.
a_set, b_set = set(a), set(b)
intersection = list(a_set.intersection(b_set))
for i in range(len(a) - 1, -1, -1):
@@ -35,7 +36,8 @@ def add_to_yaml(a, b, ignore_keys=set(("name", "description", "PIPELINE"))):
except TypeError:
pass
# item["name"] essentially acts as a key
- a_dict, b_dict = dict([(x["name"], x) for x in a]), dict([(x["name"], x) for x in b])
+ a_dict = dict([(x["name"], x) for x in a])
+ b_dict = dict([(x["name"], x) for x in b])
for k, v in b_dict.items():
if k in ignore_keys:
continue
@@ -50,13 +52,15 @@ def add_to_yaml(a, b, ignore_keys=set(("name", "description", "PIPELINE"))):
print("'{}' does not equal '{}'".format(a, b))
raise ValueError("Values are unequal")
-germline = yaml.load(open(os.path.dirname(os.path.realpath(__file__)) + "/sentieon_germline.yaml"))
+
+script_dir = os.path.dirname(os.path.realpath(__file__))
+germline = yaml.safe_load(open(script_dir + "/germline.yaml"))
tn = None
try:
- tn = yaml.load(open(os.path.dirname(os.path.realpath(__file__)) + "/sentieon_somatic.yaml"))
+ tn = yaml.safe_load(open(script_dir + "/somatic.yaml"))
except IOError:
pass
-output = os.path.dirname(os.path.realpath(__file__)) + "/runner_default.json"
+output = script_dir + "/runner_default.json"
additional_input_params = {
"ZONES": None,
@@ -65,7 +69,7 @@ def add_to_yaml(a, b, ignore_keys=set(("name", "description", "PIPELINE"))):
"MIN_RAM_GB": 56,
"PIPELINE": "DNA",
"PROJECT_ID": None,
- "DOCKER_IMAGE": "sentieon/sentieon-google-cloud:201711.02",
+ "DOCKER_IMAGE": "sentieon/sentieon-google-cloud:0.2.0",
"PREEMPTIBLE_TRIES": 0,
"NONPREEMPTIBLE_TRY": True
}