diff --git a/README.md b/README.md index c917f4f..8fc03ca 100644 --- a/README.md +++ b/README.md @@ -1,82 +1,353 @@ - Sentieon - -# Run Sentieon pipelines on Google Cloud Platform - -Here we present a convenient tool for your to run our typical pieplines on Google Cloud Plarform. If you want to customize your pipeline, you may want to visit https://www.sentieon.com/support/. - -## Pipelines - -- Mapping(FASTQ -> BAM): `sentieon bwa-mem` -- Variant Calling(BAM -> VCF) - - Germline: `DNAseq` `DNAscope` - - WGS - - WES - - Somatic: `TNhaplotyper` `TNhaplotyper2` `TNscope` - - Tumor-Normal - - Tumor only - -## Prerequisite & Setup - -- Python 2.7+ -- GCP account and Cloud SDK - - You may want to follow the instructions in "Before you begin" section in this [Google Cloud Genomics Tutorial](https://cloud.google.com/genomics/docs/tutorials/sentieon) to set up the working environment. -- Set up your virtual environment and install dependencies - ```bash - #pip install virtualenv - virtualenv env - source env/bin/activate - pip install --upgrade \ - pyyaml \ - google-api-python-client \ - google-auth \ - google-cloud-storage \ - google-auth-httplib2 - ``` -- Download the example files and set your current directory - - ```bash - git clone https://github.com/sentieon/sentieon-google-genomics.git - cd sentieon-google-genomics - ``` - -## Run your first Sentieon job on GCP - -Right now, we are granting free-trial license to your account automatically. You will get 14 days free trial starting from the time you run your first Sentieon job. - - 1. Specify your `PROJECT` and `BUCKET` in `examples/example.json` - ```json - { + Sentieon + +# Run Sentieon pipelines on the Google Cloud Platform +*Easy to use pipelines for the Sentieon tools on the Google Cloud* + +For a tutorial, see Google's tutorial on [running a Sentieon DNAseq pipeline](https://cloud.google.com/genomics/docs/tutorials/sentieon). For more customized pipelines and additional details on the Sentieon software, please visit https://www.sentieon.com. + +## Table of Contents +- [Highlights](#highlights) +- [Prerequisites](#prerequisites) +- [Running a pipeline](#running) + - [Configure your environment](#configure) + - [Download the example files](#example) + - [Understanding the input format](#input) + - [Run the example pipeline](#run) + - [Understanding the output](#understand) + - [Other example pipelines](#other_examples) +- [Recommended configurations](#recommended) + - [Germline WGS and WES](#germline) + - [Somatic WGS and WES](#somatic) +- [Additional options - germline](#configurations_germline) + - [Input file options](#germline_input) + - [Machine options](#germline_machine) + - [Pipeline configuration](#germline_config) + - [Pipeline options](#germline_options) +- [Additional options - somatic](#configurations_somatic) + - [Input file options](#somatic_input) + - [Machine options](#somatic_machine) + - [Pipeline configuration](#somatic_config) + - [Pipeline options](#somatic_options) +- [Where to get help](#help) + + + +## Highlights + +- Easily run the Sentieon Pipelines on the Google Cloud. +- Pipelines are optimized by Sentieon to be well-tuned for efficiently processing WES and WGS samples. +- All Sentieon pipelines and variant callers are available including DNAseq, DNAscope, TNseq, and TNscope +- Matching results to the GATK [Germline](https://github.com/Sentieon/sentieon-dnaseq) and Somatic Best Practices Pipelines +- Automatic 14-day free-trial of the Sentieon software on the Google Cloud + + + +## Prerequisites + +1. 
[Install Python 2.7+](https://www.python.org/downloads/). +2. Select or create a GCP project. +3. Make sure that billing is enabled for your Google Cloud Platform project. +4. Enable the Cloud Life Sciences, Compute Engine, and Cloud Storage APIs. +5. Install and initialize the Cloud SDK. +6. Update and install gcloud components: +```bash +gcloud components update && +gcloud components install alpha +``` +7. [Install git](https://git-scm.com/downloads) to download the required files. +8. By default, Compute Engine has resource quotas in place to prevent inadvertent usage. By increasing quotas, you can launch more virtual machines concurrently, increasing throughput and reducing turnaround time. +For best results in this tutorial, you should request additional quota above your project's default. Recommendations for quota increases are provided in the following list, as well as the minimum quotas needed to run the tutorial. Make your quota requests in the us-central1 region: + CPUs: 64 + Persistent Disk Standard (GB): 375 +You can leave other quota request fields empty to keep your current quotas. + + + +## Running a pipeline + + + + +### Configuring your environment + +Setup a Python virtualenv to manage the environment. First, install virtualenv if necessary +```bash +pip install --upgrade virtualenv +``` +Install the required Python dependencies +```bash +virtualenv env +source env/bin/activate +pip install --upgrade \ + pyyaml \ + google-api-python-client \ + google-auth \ + google-cloud-storage \ + google-auth-httplib2 +``` + + + +### Download the pipeline script + +Download the pipeline script and move into the new directory. +```bash +git clone https://github.com/sentieon/sentieon-google-genomics.git +cd sentieon-google-genomics +``` + + + +### Understanding the input format + +The runner script accepts a JSON file as input. In the repository you downloaded, there is an `examples/example.json` file with the following content: +```json +{ "FQ1": "gs://sentieon-test/pipeline_test/inputs/test1_1.fastq.gz", "FQ2": "gs://sentieon-test/pipeline_test/inputs/test1_2.fastq.gz", "REF": "gs://sentieon-test/pipeline_test/reference/hs37d5.fa", - "OUTPUT_BUCKET": "YOUR_BUCKET_HERE", + "OUTPUT_BUCKET": "gs://BUCKET", + "ZONES": "us-central1-a,us-central1-b,us-central1-c,us-central1-f", + "PROJECT_ID": "PROJECT_ID", + "EMAIL": "EMAIL" +} +``` + +The following table describes the JSON keys in the file: + +| JSON key | Description | +| ------------- | ----------------------------------------------------------------------------- | +| FQ1 | The first pair of reads in the input fastq file. | +| FQ2 | The second pair of reads in the input fastq file. | +| BAM | The input BAM file, if applicable. | +| REF | The reference genome. If set, the reference index files are assumed to exist. | +| OUTPUT_BUCKET | The bucket and directory used to store the data output from the pipeline. | +| ZONES | A comma-separated list of GCP zones to use for the worker node. | +| PROJECT_ID | Your GCP project ID. | +| EMAIL | Your email | + +The `FQ1`, `FQ2`, `REF`, and `ZONES` fields will work with the defaults. However, the `OUTPUT_BUCKET`, `PROJECT_ID`, and `EMAIL` fields will need to be updated to point to your specific output bucket/path, Project ID, and email address. + + + +### Run the example pipelines + +Edit the `OUTPUT_BUCKET`, `PROJECT_ID`, and `EMAIL` fields in the `examples/example.json` to your output bucket/path, the GCP Project ID that you setup earlier, and email you want associated with your Sentieon license. 
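If you prefer to script this step, the sketch below substitutes the three placeholder values in place. It assumes GNU `sed`, an already-initialized `gcloud` configuration, and that you are in the `sentieon-google-genomics` directory; the output bucket path and email address shown are hypothetical and should be replaced with your own.

```bash
# Fill PROJECT_ID from the active gcloud configuration; the bucket path and
# email below are hypothetical placeholders - substitute your own values.
PROJECT_ID=$(gcloud config get-value project)
sed -i \
    -e "s|gs://BUCKET|gs://my-output-bucket/giab_test|" \
    -e "s|\"PROJECT_ID\": \"PROJECT_ID\"|\"PROJECT_ID\": \"${PROJECT_ID}\"|" \
    -e "s|\"EMAIL\": \"EMAIL\"|\"EMAIL\": \"you@example.com\"|" \
    examples/example.json
```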
By supplying the `EMAIL` field, your PROJECT_ID will automatically receive a 14 day free trial for the Sentieon software on the Google Cloud. + +You after modifying the `examples/example.json` file, you can use the following command to run the DNAseq pipeline on a small test dataset. +```bash +python runner/sentieon_runner.py examples/example.json +``` + + + +### Understanding the output + +If execution is successful, the runner script will print some logging information followed by `Operation succeeded` to the terminal. Output files from the pipeline can then be found in the `OUTPUT_BUCKET` location in Google Cloud Storage including alignment (BAM) files, variant calls, sample metrics and logging information. + +In the event of run failure, some diagnostic information will be printed to the screen followed by an error message. For assistance, please send the diagnostic information along with any log files in `OUTPUT_BUCKET`/worker_logs/ to support@sentieon.com. + + + +### Other example pipelines +In the `examples` directory, you can find the following example configurations: + +| Configuration | Pipeline | +| --------------- | ------------------------------------------------------------------ | +| 100x_wes.json | DNAseq pipeline from FASTQ to VCF for Whole Exome Sequencing Data | +| 30x_wgs.json | DNAseq pipeline from FASTQ to VCF for Whole Genome Sequencing Data | +| tn_example.json | TNseq pipeline from FASTQ to VCF for Tumor Normal Pairs | + + + +## Recommended configurations + +Below are some recommended configurations for some common use-cases. The cost and runtime estimates below assume that jobs are run on preemptible instances that were not preempted during job execution. + + + +### Germline Whole Genome and Whole Exome Sequencing + +The following configuration will run a 30x human genome at a cost of approximately $1.35 and will take about 2 hours. This configuration can also be used to run a 100x whole exome at a cost of approximately $0.30 and will take about 35 minutes. +```json +{ + "FQ1": "gs://my-bucket/sample1_1.fastq.gz", + "FQ2": "gs://my-bucket/sample1._2.fastq.gz", + "REF": "gs://sentieon-test/pipeline_test/reference/hs37d5.fa", + "OUTPUT_BUCKET": "gs://BUCKET", + "ZONES": "us-central1-a,us-central1-b,us-central1-c,us-central1-f", + "PROJECT_ID": "PROJECT_ID", + "EMAIL": "EMAIL", + "BQSR_SITES": "gs://sentieon-test/pipeline_test/reference/Mills_and_1000G_gold_standard.indels.b37.vcf.gz,gs://sentieon-test/pipeline_test/reference/1000G_phase1.indels.b37.vcf.gz,gs://sentieon-test/pipeline_test/reference/dbsnp_138.b37.vcf.gz", + "DBSNP": "gs://sentieon-test/pipeline_test/reference/dbsnp_138.b37.vcf.gz", + "PREEMPTIBLE_TRIES": "2", + "NONPREEMPTIBLE_TRY": true, + "STREAM_INPUT": "True", + "DISK_SIZE": 300, + "PIPELINE": "GERMLINE", + "CALLING_ALGO": "Haplotyper" +} +``` +The `CALLING_ALGO` key can be changed to `DNAscope` to use the Sentieon DNAscope variant caller for improved variant calling accuracy. For large input files, `DISK_SIZE` should be increased. + + + + +### Somatic Whole Genome and Whole Exome Sequencing + +The following configuration will run a paired 60-30x human genome at a cost of approximately $3.70 and will take about 7 hours. This configuration can also be used to run a paired 150-150x human exome at a cost of approximately $0.60 and will take about 1.5 hours. 
+```json +{ + "TUMOR_FQ1": "gs://my-bucket/tumor1_1.fastq.gz", + "TUMOR_FQ2": "gs://my-bucket/tumor1_2.fastq.gz", + "FQ1": "gs://my-bucket/normal1_1.fastq.gz", + "FQ2": "gs://my-bucket/normal1._2.fastq.gz", + "REF": "gs://sentieon-test/pipeline_test/reference/hs37d5.fa", + "OUTPUT_BUCKET": "gs://BUCKET", "ZONES": "us-central1-a,us-central1-b,us-central1-c,us-central1-f", - "PROJECT_ID": "YOUR_PROJECT_HERE" - } - ``` - - 2. Run - ```bash - python runner/sentieon_runner.py examples/example.json - ``` - -## Pipeline Examples - - ### JSON examples - - In `examples` directory, you could find - - file | pipeline - --- | --- - 100x_wes.json | DNAseq pipeline from FASTQ to VCF for Whole Exome Sequencing Data - 30x_wgs.json | DNAseq pipeline from FASTQ to VCF for Whole Genome Sequencing Data - tn_example.json | TNseq pipeline from FASTQ to VCF for Tumor Normal Pairs - - ### Parameters - - In `runner/sentieon_germline.yaml` and `runner/sentieon_somatic.yaml`, you could find adjustable parameters for germline and somatic pipelines. - - - - + "PROJECT_ID": "PROJECT_ID", + "EMAIL": "EMAIL", + "BQSR_SITES": "gs://sentieon-test/pipeline_test/reference/Mills_and_1000G_gold_standard.indels.b37.vcf.gz,gs://sentieon-test/pipeline_test/reference/1000G_phase1.indels.b37.vcf.gz,gs://sentieon-test/pipeline_test/reference/dbsnp_138.b37.vcf.gz", + "DBSNP": "gs://sentieon-test/pipeline_test/reference/dbsnp_138.b37.vcf.gz", + "PREEMPTIBLE_TRIES": "2", + "NONPREEMPTIBLE_TRY": true, + "STREAM_INPUT": "True", + "DISK_SIZE": 300, + "PIPELINE": "SOMATIC", + "CALLING_ALGO": "TNhaplotyper" +} +``` +The `CALLING_ALGO` key key can be change to `TNsnv`, `TNhaplotyper`, `TNhaplotyper2`, or `TNscope` to use Sentieon's TNsnv, TNhaplotyper, TNhaplotyper2 or TNscope variant callers, respectively. For large input files, `DISK_SIZE` should be increased. 
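As a rule of thumb, `DISK_SIZE` should be roughly 3x the total size of the input files (see the machine options tables below). The snippet below is only a sketch of one way to estimate that from the compressed FASTQ sizes with `gsutil du`; the `gs://` paths are placeholders for your own inputs.

```bash
# Estimate DISK_SIZE as ~3x the total size of the compressed input FASTQs.
# The gs:// paths below are placeholders - point them at your own inputs.
INPUTS="gs://my-bucket/tumor1_1.fastq.gz gs://my-bucket/tumor1_2.fastq.gz \
        gs://my-bucket/normal1_1.fastq.gz gs://my-bucket/normal1_2.fastq.gz"
total_bytes=$(gsutil du $INPUTS | awk '{sum += $1} END {printf "%.0f", sum}')
echo "Suggested DISK_SIZE (GB): $(( total_bytes * 3 / 1024 / 1024 / 1024 ))"
```

The same estimate applies to the germline configuration above; simply list the single sample's FASTQ files instead of the tumor-normal pair.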
+ + + +## Additional options - Germline + + + +### Input file options + +| JSON Key | Description | +|--------------- | ------------------------------------------------------------------------------------ | +| FQ1 | A comma-separated list of input R1 FASTQ files | +| FQ2 | A comma-separated list of input R2 FASTQ files | +| READGROUP | A comma-separted list of readgroups headers to add to the read data during alignment | +| BAM | A comma-separated list of input BAM files | +| REF | The path to the reference genome | +| BQSR_SITES | A comma-separated list of known sites for BQSR | +| DBSNP | A dbSNP file to use during variant calling | +| INTERVAL | A string of interval(s) to use during variant calling | +| INTERVAL_FILE | A file of intervals(s) to use during variant calling | +| DNASCOPE_MODEL | A trained model to use during DNAscope variant calling | + + + +### Machine options + +| JSON Key | Description | +| ------------ | --------------------------------------------------------------------------- | +| ZONES | GCE Zones to potentially launch the job in | +| DISK_SIZE | The size of the hard disk to use (should be 3x the size of the input files) | +| MACHINE_TYPE | The type of GCE machine to use to run the pipeline | + + + +### Pipeline configurations + +| JSON Key | Description | +| ------------------- | ----------------------------------------------------------------------- | +| SENTIEON_VERSION | The version of the Sentieon software package to use | +| DEDUP | Type of duplicate removal to run (nodup, markdup or rmdup) | +| NO_METRICS | Skip running metrics collection | +| NO_BAM_OUTPUT | Skip outputting a preprocessed BAM file | +| NO_HAPLOTYPER | Skip variant calling | +| GVCF_OUTPUT | Output variant calls in gVCF format rather than VCF format | +| STREAM_INPUT | Stream the input FASTQ files directly from Google Cloud Storage | +| RECALIBRATED_OUTPUT | Apply BQSR to the output preprocessed alignments (not recommended) | +| CALLING_ARGS | A string of additional arguments to pass to the variant caller | +| PIPELINE | Set to `GERMLINE` to run the germline variant calling pipeline | +| CALLING_ALGO | The Sentieon variant calling algo to run. 
Either Haplotyper or DNAscope | + + + +### Pipeline options + +| JSON Key | Description | +| ------------------- | --------------------------------------------------------------------------------------------------- | +| OUTPUT_BUCKET | The Google Cloud Storage Bucket and path prefix to use for the output files | +| EMAIL | An email address to use to obtain an evaluation license for your GCP Project | +| SENTIEON_KEY | Your Sentieon license key (only applicable for paying customers) | +| PROJECT_ID | Your GCP Project ID to use when running jobs | +| PREEMPTIBLE_TRIES | Number of attempts to run the pipeline using preemptible instances | +| NONPREEMPTIBLE_TRY | After `PREEMPTIBLE_TRIES` are exhausted, whether to try one additional run with standard instances | + + + +## Additional options - Somatic + + + +### Input file options + +| JSON Key | Description | +|---------------- | ------------------------------------------------------------------------------------------- | +| TUMOR_FQ1 | A comma-separated list of input R1 tumor FASTQ files | +| TUMOR FQ2 | A comma-separated list of input R2 tumor FASTQ files | +| FQ1 | A comma-separated list of input R1 normal FASTQ files | +| FQ2 | A comma-separated list of input R2 normal FASTQ files | +| TUMOR_READGROUP | A comma-separted list of readgroups headers to add to the tumor read data during alignment | +| READGROUP | A comma-separted list of readgroups headers to add to the normal read data during alignment | +| TUMOR_BAM | A comma-separated list of input tumor BAM files | +| BAM | A comma-separated list of input normal BAM files | +| REF | The path to the reference genome | +| BQSR_SITES | A comma-separated list of known sites for BQSR | +| DBSNP | A dbSNP file to use during variant calling | +| INTERVAL | A string of interval(s) to use during variant calling | +| INTERVAL_FILE | A file of intervals(s) to use during variant calling | + + + +### Machine options + +| JSON Key | Description | +| ------------ | --------------------------------------------------------------------------- | +| ZONES | GCE Zones to potentially launch the job in | +| DISK_SIZE | The size of the hard disk to use (should be 3x the size of the input files) | +| MACHINE_TYPE | The type of GCE machine to use to run the pipeline | + + + +### Pipeline configurations + +| JSON Key | Description | +| ------------------- | ------------------------------------------------------------------------------------------------------- | +| SENTIEON_VERSION | The version of the Sentieon software package to use | +| DEDUP | Type of duplicate removal to run (nodup, markdup or rmdup) | +| NO_METRICS | Skip running metrics collection | +| NO_BAM_OUTPUT | Skip outputting a preprocessed BAM file | +| NO_VCF | Skip variant calling | +| STREAM_INPUT | Stream the input FASTQ files directly from Google Cloud Storage | +| RECALIBRATED_OUTPUT | Apply BQSR to the output preprocessed alignments (not recommended) | +| CALLING_ARGS | A string of additional arguments to pass to the variant caller | +| PIPELINE | Set to `SOMATIC` to run the somatic variant calling pipeline | +| RUN_TNSNV | If using the TNseq pipeline, use TNsnv for variant calling | +| CALLING_ALGO | The Sentieon somatic variant calling algo to run. 
Either TNsnv, TNhaplotyper, TNhaplotyper2, or TNscope | + + + +### Pipeline options + +| JSON Key | Description | +| ------------------- | --------------------------------------------------------------------------------------------------- | +| OUTPUT_BUCKET | The Google Cloud Storage Bucket and path prefix to use for the output files | +| EMAIL | An email address to use to obtain an evaluation license for your GCP Project | +| SENTIEON_KEY | Your Sentieon license key (only applicable for paying customers) | +| PROJECT_ID | Your GCP Project ID to use when running jobs | +| PREEMPTIBLE_TRIES | Number of attempts to run the pipeline using preemptible instances | +| NONPREEMPTIBLE_TRY | After `PREEMPTIBLE_TRIES` are exhausted, whether to try one additional run with standard instances | + + + +## Where to get help + +Please email support@sentieon.com with any questions. diff --git a/examples/30x_wgs_ccdg.json b/examples/30x_wgs_ccdg.json new file mode 100644 index 0000000..82e41da --- /dev/null +++ b/examples/30x_wgs_ccdg.json @@ -0,0 +1,12 @@ +{ + "FQ1": "gs://your-bucket/wgs_sample1_1.fastq.gz", + "FQ2": "gs://your-bucket/wgs_sample1_2.fastq.gz", + "REF": "gs://sentieon-test/pipeline_test/reference/hg38/GRCh38_full_analysis_set_plus_decoy_hla.fa", + "OUTPUT_BUCKET": "YOUR_BUCKET_HERE", + "BQSR_SITES": "gs://sentieon-test/pipeline_test/reference/hg38/Homo_sapiens_assembly38.dbsnp138.vcf.gz,gs://sentieon-test/pipeline_test/reference/hg38/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz,gs://sentieon-test/pipeline_test/reference/hg38/Homo_sapiens_assembly38.known_indels.vcf.gz", + "DBSNP": "gs://sentieon-test/pipeline_test/reference/hg38/Homo_sapiens_assembly38.dbsnp138.vcf.gz", + "PREEMPTIBLE_TRIES": "2", + "STREAM_INPUT": "True", + "ZONES": "us-central1-a,us-central1-b,us-central1-c,us-central1-f", + "PROJECT_ID": "YOUR_PROJECT_HERE" +} diff --git a/examples/tn_example.json b/examples/tn_example.json index 4286ec7..7fcfcbd 100644 --- a/examples/tn_example.json +++ b/examples/tn_example.json @@ -7,7 +7,7 @@ "OUTPUT_BUCKET": "YOUR_BUCKET_HERE", "ZONES": "us-central1-a,us-central1-b,us-central1-c,us-central1-f", "PROJECT_ID": "YOUR_PROJECT_HERE", - "PIPELINE": "TNseq", - "RUN_TNSNV": null + "PIPELINE": "SOMATIC", + "CALLING_ALGO": "TNhaplotyper" } diff --git a/pipeline_scripts/Dockerfile b/pipeline_scripts/Dockerfile index 282438f..424d7b9 100644 --- a/pipeline_scripts/Dockerfile +++ b/pipeline_scripts/Dockerfile @@ -1,12 +1,11 @@ -FROM sentieon/sentieon:201808.01 as builder -ARG version=201808.01 +FROM python:2.7.16-slim-stretch as downloader +# Install gsutil RUN apt-get update && \ apt-get install -y \ - gcc \ - python-dev \ - python-setuptools \ curl \ + python-pip \ + gcc \ lsb-release && \ export CLOUD_SDK_REPO="cloud-sdk-$(lsb_release -c -s)" && \ echo "deb http://packages.cloud.google.com/apt $CLOUD_SDK_REPO main" | tee -a /etc/apt/sources.list.d/google-cloud-sdk.list && \ @@ -14,6 +13,7 @@ RUN apt-get update && \ apt-get update && apt-get install -y google-cloud-sdk && \ pip install crcmod +# Install samtools RUN apt-get update && \ apt-get install -y \ lbzip2 \ @@ -22,34 +22,39 @@ RUN apt-get update && \ libbz2-dev \ liblzma-dev \ make && \ - curl -Lo samtools-1.7.tar.bz2 https://github.com/samtools/samtools/releases/download/1.7/samtools-1.7.tar.bz2 && \ - tar -xf samtools-1.7.tar.bz2 && \ - cd samtools-1.7 && \ + curl -Lo samtools-1.9.tar.bz2 https://github.com/samtools/samtools/releases/download/1.9/samtools-1.9.tar.bz2 && \ + tar -xf samtools-1.9.tar.bz2 && \ + cd samtools-1.9 && \ 
./configure && \ make install -RUN apt-get update && \ - apt-get install -y \ - g++ && \ - curl -Lo samblaster-v.0.1.24.tar.gz https://github.com/GregoryFaust/samblaster/releases/download/v.0.1.24/samblaster-v.0.1.24.tar.gz && \ - tar -xf samblaster-v.0.1.24.tar.gz && \ - cd samblaster-v.0.1.24 && \ - make +# Install samblaster +RUN curl -Lo samblaster-v.0.1.24.tar.gz https://github.com/GregoryFaust/samblaster/releases/download/v.0.1.24/samblaster-v.0.1.24.tar.gz && \ + tar -xf samblaster-v.0.1.24.tar.gz && \ + cd samblaster-v.0.1.24 && \ + make && \ + cp samblaster /usr/local/bin/ + +# Install metadata script dependencies +RUN pip install requests urllib3 -FROM sentieon/sentieon:201808.01 -ARG version=201808.01 +FROM python:2.7.16-slim-stretch -COPY --from=builder /usr/lib/google-cloud-sdk /usr/lib/google-cloud-sdk -COPY --from=builder /usr/local/lib/python2.7/site-packages/crcmod/ /usr/local/lib/python2.7/site-packages/crcmod/ -COPY --from=builder /usr/local/bin/samtools /usr/local/bin -COPY --from=builder /samblaster-v.0.1.24/samblaster /usr/local/bin +LABEL container.base.image="python:2.7.16-slim-stretch" + +COPY --from=downloader /usr/lib/google-cloud-sdk /usr/lib/google-cloud-sdk +COPY --from=downloader /usr/local/lib/python2.7/site-packages /usr/local/lib/python2.7/site-packages +COPY --from=downloader /usr/local/bin/samtools /usr/local/bin +COPY --from=downloader /usr/local/bin/samblaster /usr/local/bin + +CMD ["/bin/bash"] RUN apt-get update && apt-get install -y \ + curl \ + libncurses5-dev \ bc \ dnsutils \ iputils-ping && \ - pip install --upgrade requests && \ ln -s ../lib/google-cloud-sdk/bin/gsutil /usr/bin/gsutil -ADD gc_functions.sh gc_somatic.sh gc_germline.sh gen_credentials.py /opt/sentieon/ - +ADD gc_functions.sh gc_somatic.sh gc_germline.sh gc_ccdg_germline.sh gen_credentials.py /opt/sentieon/ diff --git a/pipeline_scripts/gc_ccdg_germline.sh b/pipeline_scripts/gc_ccdg_germline.sh new file mode 100644 index 0000000..3ec3abd --- /dev/null +++ b/pipeline_scripts/gc_ccdg_germline.sh @@ -0,0 +1,230 @@ +#!/usr/bin/env bash + +set -xveo pipefail +set +H + +BASEDIR=$(dirname "$0") +scratch=/mnt/work +nt=$(nproc) +source $BASEDIR/gc_functions.sh + +# Set "None" variables to an empty string +environmental_variables=(FQ1 FQ2 BAM OUTPUT_BUCKET REF READGROUP BQSR_SITES \ + DBSNP INTERVAL INTERVAL_FILE NO_METRICS NO_BAM_OUTPUT NO_HAPLOTYPER \ + GVCF_OUTPUT STREAM_INPUT PIPELINE OUTPUT_CRAM_FORMAT SENTIEON_KEY EMAIL \ + SENTIEON_VERSION CALLING_ARGS DNASCOPE_MODEL CALLING_ALGO) +unset_none_variables ${environmental_variables[@]} +OUTPUT_CRAM_FORMAT="" # Not yet supported + +readonly FQ1 FQ2 BAM OUTPUT_BUCKET REF READGROUP BQSR_SITES DBSNP INTERVAL \ + INTERVAL_FILE NO_METRICS NO_BAM_OUTPUT NO_HAPLOTYPER GVCF_OUTPUT \ + STREAM_INPUT PIPELINE OUTPUT_CRAM_FORMAT SENTIEON_KEY EMAIL \ + SENTIEON_VERSION CALLING_ARGS DNASCOPE_MODEL CALLING_ALGO + +release_dir="/opt/sentieon/sentieon-genomics-${SENTIEON_VERSION}/" + +# Basic error handling # +if [[ -n "$FQ1" && -n "$BAM" ]]; then + echo "Please supply either fastq or BAM files" + exit 1 +fi + +if [[ -n "$INTERVAL" && -n "$INTERVAL_FILE" ]]; then + echo "Please supply either an INTERVAL or INTERVAL_FILE, not both" + exit 1 +fi + +if [[ -n "$NO_BAM_OUTPUT" && -n "$NO_HAPLOTYPER" && -n "$NO_METRICS" ]]; then + echo "Nothing to do" + exit 1 +fi + +# ********************************** +# 0. 
Setup +# ********************************** +gc_setup +export LD_PRELOAD=${release_dir}/lib/libjemalloc.so +export MALLOC_CONF=lg_dirty_mult:-1 + +## Download input files +if [[ -n "$BAM" ]]; then + download_bams "$BAM" local_bams $input_dir +else + local_bams=() +fi + +download_intervals +download_reference +if [[ $CALLING_ALGO == "DNAscope" && -n "$DNASCOPE_MODEL" ]]; then + curl -L -o ${input_dir}/dnascope.model "$DNASCOPE_MODEL" +fi + +## Handle the sites files +IFS=',' read -r -a gs_bqsr_sites <<< "$BQSR_SITES" +transfer_all_sites $bqsr_dir "${gs_bqsr_sites[@]}" +local_bqsr_sites=("${local_sites[@]}") +bqsr_sites="$local_str" + +if [[ -n "$DBSNP" ]]; then + transfer_all_sites $dbsnp_dir "$DBSNP" + dbsnp="${local_sites[0]}" +fi + +# ****************************************** +# 1. Mapping reads with BWA-MEM, sorting +# ****************************************** +output_ext="bam" +if [[ -n $FQ1 ]]; then + bwa_mem_align "" "$FQ1" "$FQ2" "$READGROUP" local_bams $output_ext "-K 100000000 -Y" "$util_sort_xargs" "true" +fi + +local_bams_str="" +for bam in "${local_bams[@]}"; do + local_bams_str+=" -i \"$bam\" " +done + +# ****************************************** +# 2. Metrics command +# ****************************************** +metrics_cmd1= +metrics_cmd2= +metrics_files= +if [[ -z "$NO_METRICS" ]]; then + build_metrics_cmd "" metrics_cmd1 metrics_cmd2 metrics_files +fi + +# ****************************************** +# 3. Remove duplicates +# ****************************************** +output_ext="bam" + +run_mark_duplicates "" "markdup" metrics_cmd1 "$local_bams_str" dedup_bam_str dedup_bams "$dedup_xargs" $output_ext "true" "${local_bams[@]}" +if [[ -z "$NO_METRICS" ]]; then + (gsutil cp $metrics_dir/dedup_metrics.txt "$out_metrics" && + rm $metrics_dir/dedup_metrics.txt) & + upload_dedup_pid=$! +else + rm $metrics_dir/dedup_metrics.txt & +fi +for bam in "${local_bams[@]}"; do + if [[ -f "$bam" ]]; then + rm "$bam" & + fi + if [[ -f "${bam}".bai ]]; then + rm "${bam}".bai & + fi + if [[ -f "${bam}".crai ]]; then + rm "${bam}".crai & + fi +done + +upload_metrics metrics_cmd1 metrics_cmd2 upload_metrics_pid ${metrics_files[@]} + +# ****************************************** +# 4. Base recalibration +# ****************************************** +bqsr_intervals="chr1,chr2,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr21,chr22" +cmd="$release_dir/bin/sentieon driver -t $nt -r \"$ref\" --interval $bqsr_intervals $dedup_bam_str --algo QualCal $bqsr_sites $work/recal_data.table" +run "$cmd" "BQSR" + +cmd="$release_dir/bin/sentieon driver -t $nt -r \"$ref\" $dedup_bam_str --read_filter QualCalFilter,table=$work/recal_data.table,prior=-1.0,indel=false,levels=10/20/30,min_qual=6 --algo ReadWriter --cram_write_options version=3.0,compressor=gzip+rans $work/recalibrated.cram" +run "$cmd" "ReadWriter" + +for bam in "${dedup_bams[@]}"; do + if [[ -f "$bam" ]]; then + rm "$bam" & + fi + if [[ -f "${bam}".bai ]]; then + rm "$bam".bai & + fi + if [[ -f "${bam}".crai ]]; then + rm "${bam}".crai & + fi +done + +if [[ -z "$NO_BAM_OUTPUT" ]]; then + gsutil cp $work/recalibrated.cram $work/recalibrated.cram.crai "$out_bam" & + upload_recal_pid=$! +fi + + +# ****************************************** +# 5. 
Variant Calling +# ****************************************** + +## Generate a non-decoy BED file +generate_nondecoy_bed "${ref}".fai "${ref}"_nondecoy.bed +call_interval="$interval" +call_interval=${call_interval:-"${ref}"_nondecoy.bed} + +outgvcf=$work/hc.g.vcf.gz +outvcf=$work/hc.vcf.gz + +algo=Haplotyper +extra_vcf_args="" +if [[ $CALLING_ALGO == "DNAscope" ]]; then + algo="DNAscope" + if [[ -n "$DNASCOPE_MODEL" ]]; then + extra_vcf_args="--model ${input_dir}/dnascope.model" + else + extra_vcf_args="--var_type snp,indel,bnd" + fi + outvcf=$work/dnascope.vcf.gz + outgvcf=$work/dnascope.g.vcf.gz + tmpvcf=$work/tmp.vcf.gz +fi + +if [[ -z $NO_HAPLOTYPER ]]; then + if [[ -n "$GVCF_OUTPUT" ]]; then + cmd="$release_dir/bin/sentieon driver --interval \"$call_interval\" -t $nt -r \"$ref\" -i $work/recalibrated.cram --algo $algo $CALLING_ARGS ${dbsnp:+-d \"$dbsnp\"} --emit_mode gvcf ${outgvcf}" + outfile=$outgvcf + else + cmd="$release_dir/bin/sentieon driver --interval \"$call_interval\" -t $nt -r \"$ref\" -i $work/recalibrated.cram --algo $algo $CALLING_ARGS $extra_vcf_args ${dbsnp:+-d \"$dbsnp\"} ${outvcf}" + outfile=$outvcf + fi + run "$cmd" "Variant calling" + + # DNAscope SV calling + if [[ $CALLING_ALGO == "DNAscope" && -z "$GVCF_OUTPUT" && -z "$DNASCOPE_MODEL" ]]; then + mv $outvcf $tmpvcf + mv ${outvcf}.tbi ${tmpvcf}.tbi + cmd="$release_dir/bin/sentieon driver -t $nt -r \"$ref\" --algo SVSolver -v $tmpvcf $outvcf" + run "$cmd" "SVSolver" + fi + + # DNAscope model apply + if [[ $CALLING_ALGO == "DNAscope" && -z "$GVCF_OUTPUT" && -n "$DNASCOPE_MODEL" ]]; then + mv $outvcf $tmpvcf + mv ${outvcf}.tbi ${tmpvcf}.tbi + cmd="$release_dir/bin/sentieon driver -t $nt -r \"$ref\" --algo DNAModelApply --model ${input_dir}/dnascope.model -v $tmpvcf $outvcf" + run "$cmd" "DNAscope model apply" + fi + + gsutil cp $outfile ${outfile}.tbi "$out_variants" & + upload_vcf_pid=$! +fi + +# Wait for all running uploads to finish +set +e +if [[ -n $upload_recal_pid ]]; then + wait $upload_recal_pid +fi +if [[ -n $upload_metrics_pid ]]; then + wait $upload_metrics_pid +fi +if [[ -n $upload_dedup_pid ]]; then + wait $upload_dedup_pid +fi +if [[ -n $upload_deduped_pid ]]; then + wait $upload_deduped_pid +fi +if [[ -n $upload_bqsr_pid ]]; then + wait $upload_bqsr_pid +fi +if [[ -n $upload_vcf_pid ]]; then + wait $upload_vcf_pid +fi +if [[ -n $upload_bqsr_metrics_pid ]]; then + wait $upload_bqsr_metrics_pid +fi +exit 0 diff --git a/pipeline_scripts/gc_functions.sh b/pipeline_scripts/gc_functions.sh index 58fdf98..3b85484 100644 --- a/pipeline_scripts/gc_functions.sh +++ b/pipeline_scripts/gc_functions.sh @@ -45,7 +45,7 @@ transfer() src_file=$1 dst_file=$2 start_s=`date +%s` - gsutil cp $src_file $dst_file + gsutil cp "$src_file" "$dst_file" check_error $? 
"Transfer $src_file to $dst_file" end_s=`date +%s` runtime=$(delta_time $start_s $end_s) @@ -60,21 +60,21 @@ transfer_all_sites() local_str="" for src_file in "${src_files[@]}"; do # File - local_file=$dst_dir/$(basename $src_file) - transfer $src_file $local_file - local_sites+=($local_file) - local_str+=" -k $local_file " + local_file=$dst_dir/$(basename "$src_file") + transfer "$src_file" "$local_file" + local_sites+=("$local_file") + local_str+=" -k \"$local_file\" " # Index - if $(test -e ${src_file}.idx) || $(gsutil -q stat ${src_file}.idx); then - idx=${src_file}.idx - elif $(test -e ${src_file}.tbi) || $(gsutil -q stat ${src_file}.tbi); then - idx=${src_file}.tbi + if $(test -e "${src_file}".idx) || $(gsutil -q stat "${src_file}".idx); then + idx="${src_file}".idx + elif $(test -e "${src_file}".tbi) || $(gsutil -q stat "${src_file}".tbi); then + idx="${src_file}".tbi else echo "Cannot find idx for $src_file" exit 1 fi - local_idx=$dst_dir/$(basename $idx) - transfer $idx $local_idx + local_idx=$dst_dir/$(basename "$idx") + transfer "$idx" "$local_idx" done } @@ -88,7 +88,7 @@ upload_metrics() eval "fun_metrics_cmd2=\$$fun_var_cmd2" if [[ -n "$fun_metrics_cmd2" && -z "$fun_metrics_cmd1" && -f "${fun_metrics_files[0]}" ]]; then (run "$fun_metrics_cmd2" "Plotting metrics results." && - gsutil cp ${fun_metrics_files[@]} $out_metrics && + gsutil cp ${fun_metrics_files[@]} "$out_metrics" && rm ${fun_metrics_files[@]}) & eval "$fun_pid=$! " eval "$fun_var_cmd2=''" @@ -106,6 +106,10 @@ unset_none_variables() gc_setup() { + ## Download the Sentieon software + curl -L https://s3.amazonaws.com/sentieon-release/software/sentieon-genomics-${SENTIEON_VERSION}.tar.gz | tar -zxf - -C /opt/sentieon + PATH=/opt/sentieon/sentieon-genomics-${SENTIEON_VERSION}/bin:$PATH + ## Dirs work=$scratch/work metrics_dir=$scratch/metrics @@ -130,8 +134,7 @@ gc_setup() ## Setup license information # cred=$license_dir/credentials.json project_file=$license_dir/credentials.json.project - python /opt/sentieon/gen_credentials.py $cred "$SENTIEON_KEY" & - credentials_pid=$! 
+ python /opt/sentieon/gen_credentials.py ${EMAIL:+--email $EMAIL} $cred "$SENTIEON_KEY" sleep 10 if [[ -n $SENTIEON_KEY ]]; then export SENTIEON_AUTH_MECH=proxy_GOOGLE @@ -171,22 +174,22 @@ download_bams() dest_arr=$2 download_input_dir=$3 - bams=($(echo $bams | tr ',' ' ')) + IFS=',' read -r -a bams <<< "$bams" tmp_bam_dest=() - for bam in ${bams[@]}; do - local_bam=$download_input_dir/$(basename $bam) - transfer $bam $local_bam - if $(test -e ${bam}.bai) || $(gsutil -q stat ${bam}.bai); then - bai=${bam}.bai - elif $(test -e ${bam%%.bam}.bai) || $(gsutil -q stat ${bam%%.bam}.bai); then - bai=${bam%%.bam}.bai + for bam in "${bams[@]}"; do + local_bam=$download_input_dir/$(basename "$bam") + transfer "$bam" "$local_bam" + if $(test -e "${bam}".bai) || $(gsutil -q stat "${bam}".bai); then + bai="${bam}".bai + elif $(test -e "${bam%%.bam}".bai) || $(gsutil -q stat "${bam%%.bam}".bai); then + bai="${bam%%.bam}".bai else echo "Cannot find the index file for $bam" exit 1 fi - local_bai=$download_input_dir/$(basename $bai) - transfer $bai $local_bai - tmp_bam_dest+=($local_bam) + local_bai=$download_input_dir/$(basename "$bai") + transfer "$bai" "$local_bai" + tmp_bam_dest+=("$local_bam") done eval "${dest_arr}=(${tmp_bam_dest})" @@ -195,13 +198,13 @@ download_bams() download_intervals() { if [[ -n "$INTERVAL" ]]; then - interval=" --interval $INTERVAL " + interval="$INTERVAL" interval_list="" elif [[ -n "$INTERVAL_FILE" ]]; then - local_interval_file=$input_dir/$(basename $INTERVAL_FILE) - transfer $INTERVAL_FILE $local_interval_file - interval=" --interval $local_interval_file " - interval_list=" --interval_list $local_interval_file " + local_interval_file=$input_dir/$(basename "$INTERVAL_FILE") + transfer "$INTERVAL_FILE" "$local_interval_file" + interval="$local_interval_file" + interval_list="$local_interval_file" else interval="" interval_list="" @@ -210,25 +213,36 @@ download_intervals() download_reference() { - ref=$ref_dir/$(basename $REF) - transfer $REF $ref - transfer ${REF}.fai ${ref}.fai - if $(test -e ${REF}.dict) || $(gsutil -q stat ${REF}.dict); then - transfer ${REF}.dict ${ref}.dict - elif $(test -e ${REF%%.fa}.dict) || $(gsutil -q stat ${REF%%.fa}.dict); then - transfer ${REF%%.fa}.dict ${ref%%.fa}.dict - elif $(test -e ${REF%%.fasta}.dict) || $(gsutil -q stat ${REF%%.fasta}.dict); then - transfer ${REF%%.fasta}.dict ${ref%%.fasta}.dict + ref=$ref_dir/$(basename "$REF") + transfer "$REF" "$ref" + transfer "${REF}".fai "${ref}".fai + if $(test -e "${REF}".dict) || $(gsutil -q stat "${REF}".dict); then + transfer "${REF}".dict "${ref}".dict + elif $(test -e "${REF%%.fa}".dict) || $(gsutil -q stat "${REF%%.fa}".dict); then + transfer "${REF%%.fa}".dict "${ref%%.fa}".dict + elif $(test -e "${REF%%.fasta}".dict) || $(gsutil -q stat "${REF%%.fasta}".dict); then + transfer "${REF%%.fasta}".dict "${ref%%.fasta}".dict else echo "Cannot find reference dictionary" exit 1 fi if [[ -n "$FQ1" || -n "$TUMOR_FQ1" ]]; then - transfer ${REF}.amb ${ref}.amb - transfer ${REF}.ann ${ref}.ann - transfer ${REF}.bwt ${ref}.bwt - transfer ${REF}.pac ${ref}.pac - transfer ${REF}.sa ${ref}.sa + if $(test -e "${REF}".64.amb) || $(gsutil -q stat "${REF}".64.amb); then + middle=".64" + elif $(test -e "${REF}".amb) || $(gsutil -q stat "${REF}".amb); then + middle="" + else + echo "Cannot file BWA index files" + exit 1 + fi + transfer "${REF}"${middle}.amb "${ref}"${middle}.amb + transfer "${REF}"${middle}.ann "${ref}"${middle}.ann + transfer "${REF}"${middle}.bwt "${ref}"${middle}.bwt + transfer 
"${REF}"${middle}.pac "${ref}"${middle}.pac + transfer "${REF}"${middle}.sa "${ref}"${middle}.sa + if $(test -e "${REF}"${middle}.alt) || $(gsutil -q stat "${REF}"${middle}.alt); then + transfer "${REF}"${middle}.alt "${ref}"${middle}.alt + fi fi } @@ -242,11 +256,12 @@ bwa_mem_align() fun_output_ext=$1; shift fun_bwa_xargs=$1; shift fun_util_sort_xargs=$1; shift + fun_run_samblaster=$1; shift fun_bam_dest=() - fun_fq1=($(echo $fun_fq1 | tr ',' ' ')) - fun_fq2=($(echo $fun_fq2 | tr ',' ' ')) - fun_rgs=($(echo $fun_rgs | tr ',' ' ')) + IFS=',' read -r -a fun_fq1 <<< "$fun_fq1" + IFS=',' read -r -a fun_fq2 <<< "$fun_fq2" + IFS=',' read -r -a fun_rgs <<< "$fun_rgs" if [[ -z "$bwt_max_mem" ]]; then mem_kb=$(cat /proc/meminfo | grep "MemTotal" | awk '{print $2}') @@ -258,25 +273,31 @@ bwa_mem_align() fq1=${fun_fq1[$i]} fq2=${fun_fq2[$i]} readgroup=${fun_rgs[$i]} - bwa_cmd="$release_dir/bin/bwa mem ${fun_bwa_xargs} -K 10000000 -M -R \"${readgroup}\" -t $nt $ref " + bwa_cmd="$release_dir/bin/bwa mem ${fun_bwa_xargs} -R \"${readgroup}\" -t $nt \"$ref\" " if [[ -n "$STREAM_INPUT" ]]; then bwa_cmd="$bwa_cmd <(gsutil cp $fq1 -) " if [[ -n "$fq2" ]]; then bwa_cmd="$bwa_cmd <(gsutil cp $fq2 -) " fi else - local_fq1=$input_dir/$(basename $fq1) - transfer $fq1 $local_fq1 - bwa_cmd="$bwa_cmd $local_fq1" + local_fq1=$input_dir/$(basename "$fq1") + transfer "$fq1" "$local_fq1" + bwa_cmd="$bwa_cmd \"$local_fq1\"" if [[ -n "$fq2" ]]; then - local_fq2=$input_dir/$(basename $fq2) - transfer $fq2 $local_fq2 - bwa_cmd="$bwa_cmd $local_fq2" + local_fq2=$input_dir/$(basename "$fq2") + transfer "$fq2" "$local_fq2" + bwa_cmd="$bwa_cmd \"$local_fq2\"" fi fi local_bam=$work/${fun_base}sorted_bam-${i}.${fun_output_ext} + bwa_log=$work/${fun_base}_bwa.log + bwa_cmd="$bwa_cmd 2>$bwa_log" + if [[ "$fun_run_samblaster" == "true" ]]; then + bwa_cmd="$bwa_cmd | samblaster --addMateTags -a" + fi bwa_cmd="$bwa_cmd | $release_dir/bin/sentieon util sort ${fun_util_sort_xargs} --block_size 512M -o $local_bam -t $nt --sam2bam -i -" run "$bwa_cmd" "BWA-mem and sorting" + gsutil cp $bwa_log "$out_bam" fun_bam_dest+=($local_bam) done echo "BWA ended" @@ -286,13 +307,13 @@ bwa_mem_align() i=$((i - 1)) fq1=${fun_fq1[$i]} fq2=${fun_fq2[$i]} - local_fq1=$input_dir/$(basename $fq1) - local_fq2=$input_dir/$(basename $fq2) + local_fq1=$input_dir/$(basename "$fq1") + local_fq2=$input_dir/$(basename "$fq2") if [[ -f "$local_fq1" ]]; then - rm $fq1 & + rm $local_fq1 & fi if [[ -f "$local_fq2" ]]; then - rm $fq2 & + rm $local_fq2 & fi done @@ -328,35 +349,37 @@ run_mark_duplicates() { fun_bams_dest=$1; shift fun_dedup_xargs=$1; shift fun_output_ext=$1; shift + fun_two_pass=$1; shift fun_local_bams=("$@") if [[ "$fun_dedup" == "nodup" || (-z "$fun_bam_str" || -z "${fun_local_bams[@]}") ]]; then eval "${fun_bam_str_dest}=\"$fun_bam_str\"" - eval "${fun_bams_dest}=(${fun_local_bams[@]})" + eval "${fun_bams_dest}=(\"${fun_local_bams[@]}\")" else # LocusCollector - cmd="$release_dir/bin/sentieon driver --traverse_param=200000/10000 $fun_bam_str -t $nt -r $ref --algo LocusCollector $work/${fun_base}score.txt" + cmd="$release_dir/bin/sentieon driver --traverse_param=200000/10000 $fun_bam_str -t $nt -r \"$ref\" --algo LocusCollector $work/${fun_base}score.txt" if [[ -n $(eval "echo \$$fun_metrics_cmd") ]]; then eval "cmd+=\" \$$fun_metrics_cmd \"" eval "$fun_metrics_cmd=''" fi run "$cmd" "Locus collector" + # Dedup pre-pass + if [[ "$fun_two_pass" == "true" ]]; then + cmd="$release_dir/bin/sentieon driver --traverse_param=200000/10000 $fun_bam_str -t 
$nt -r \"$ref\" --algo Dedup --score_info $work/${fun_base}score.txt --metrics $metrics_dir/${fun_base}dedup_metrics.txt --output_dup_read_name $work/${fun_base}dedup_qname.txt.gz" + run "$cmd" "Dedup pre-pass" + fun_dedup_xargs="${fun_dedup_xargs} --dup_read_name $work/${fun_base}dedup_qname.txt.gz" + else + fun_dedup_xargs="${fun_dedup_xargs} --score_info $work/${fun_base}score.txt --metrics $metrics_dir/${fun_base}dedup_metrics.txt" + fi + # Dedup dedup_bam=$work/${fun_base}dedup.${fun_output_ext} if [[ "$fun_dedup" != "markdup" ]]; then - fun_dedup_xargs+=" --rmdup " + fun_dedup_xargs="${fun_dedup_xargs} --rmdup " fi - cmd="$release_dir/bin/sentieon driver -r $ref --traverse_param=200000/10000 $fun_bam_str -t $nt --algo Dedup ${fun_dedup_xargs} --score_info $work/${fun_base}score.txt --metrics $metrics_dir/${fun_base}dedup_metrics.txt $dedup_bam" + cmd="$release_dir/bin/sentieon driver -r \"$ref\" --traverse_param=200000/10000 $fun_bam_str -t $nt --algo Dedup ${fun_dedup_xargs} $dedup_bam" run "$cmd" $fun_dedup - for bam in ${local_bams[@]}; do - if [[ -n $bam ]]; then - rm $bam & - fi - if [[ -n ${bam}.bai ]]; then - rm ${bam}.bai & - fi - done eval "${fun_bam_str_dest}=\" -i $dedup_bam \"" eval "${fun_bams_dest}=(${dedup_bam})" fi @@ -382,7 +405,7 @@ run_bqsr() if [[ -n "$bqsr_sites" && -n "$fun_bam_str" ]]; then fun_bqsr_table=$work/${fun_base}recal_data.table fun_bqsr_post=$work/${fun_base}recal_data.table.post - cmd="$release_dir/bin/sentieon driver $interval -t $nt -r '$ref' $fun_bam_str --algo QualCal $bqsr_sites $fun_bqsr_table" + cmd="$release_dir/bin/sentieon driver ${interval:+--interval \"$interval\"} -t $nt -r \"$ref\" $fun_bam_str --algo QualCal $bqsr_sites $fun_bqsr_table" if [[ -n $(eval "echo \$$fun_metrics_cmd") ]]; then eval "cmd+=\" \$$fun_metrics_cmd \"" eval "$fun_metrics_cmd=''" @@ -395,11 +418,11 @@ run_bqsr() fi fi - eval "$fun_bqsr2=\"$fun_bqsr_cmd2\"" - eval "$fun_bqsr3=\"$fun_bqsr_cmd3\"" - eval "$fun_bqsr4=\"$fun_bqsr_cmd4\"" - eval "$fun_table=\"$fun_bqsr_table\"" - eval "$fun_plot=\"$plot\"" + eval "$fun_bqsr2='$fun_bqsr_cmd2'" + eval "$fun_bqsr3='$fun_bqsr_cmd3'" + eval "$fun_bqsr4='$fun_bqsr_cmd4'" + eval "$fun_table='$fun_bqsr_table'" + eval "$fun_plot='$plot'" } run_bqsr_post() @@ -417,11 +440,11 @@ run_bqsr_post() eval "fun_bqsr_cmd4=\$$fun_bqsr4" if [[ -n "$fun_bqsr_cmd2" && -n "$fun_bam_str" && -f "$fun_table" ]]; then - cmd="$release_dir/bin/sentieon driver $interval -t $nt -r $ref $fun_bam_str -q $fun_table $fun_bqsr_cmd2" + cmd="$release_dir/bin/sentieon driver ${interval:+--interval \"$interval\"} -t $nt -r $ref $fun_bam_str -q $fun_table $fun_bqsr_cmd2" run "$cmd" "BQSR post" run "$fun_bqsr_cmd3" "BQSR CSV" run "$fun_bqsr_cmd4" "BQSR plot" - gsutil cp $fun_plot $out_metrics & + gsutil cp $fun_plot "$out_metrics" & eval "$fun_upload_pid=$1 " fi @@ -435,6 +458,6 @@ generate_nondecoy_bed() fun_reference_fai=$1; shift fun_output_dest=$1; shift - grep -v "hs37d5\|chrEBV\|hs38d1\|decoy" $fun_reference_fai | awk 'BEGIN{OFS="\t"} {print $1,0,$2}' > $fun_output_dest + grep -v "hs37d5\|chrEBV\|hs38d1\|decoy" "$fun_reference_fai" | awk 'BEGIN{OFS="\t"} {print $1,0,$2}' > "$fun_output_dest" } diff --git a/pipeline_scripts/gc_germline.sh b/pipeline_scripts/gc_germline.sh index 3998dac..3d06fef 100644 --- a/pipeline_scripts/gc_germline.sh +++ b/pipeline_scripts/gc_germline.sh @@ -4,35 +4,34 @@ set -xveo pipefail set +H BASEDIR=$(dirname "$0") -version="201808.01" -release_dir="/opt/sentieon/sentieon-genomics-${version}/" scratch=/mnt/work nt=$(nproc) source 
$BASEDIR/gc_functions.sh -export LD_PRELOAD=/opt/sentieon/sentieon-genomics-${version}/lib/libjemalloc.so -export MALLOC_CONF=lg_dirty_mult:-1 - # Set "None" variables to an empty string environmental_variables=(FQ1 FQ2 BAM OUTPUT_BUCKET REF READGROUP DEDUP \ BQSR_SITES DBSNP INTERVAL INTERVAL_FILE NO_METRICS NO_BAM_OUTPUT \ NO_HAPLOTYPER GVCF_OUTPUT STREAM_INPUT PIPELINE OUTPUT_CRAM_FORMAT \ - SENTIEON_KEY RECALIBRATED_OUTPUT) + SENTIEON_KEY RECALIBRATED_OUTPUT EMAIL SENTIEON_VERSION CALLING_ARGS \ + DNASCOPE_MODEL CALLING_ALGO) unset_none_variables ${environmental_variables[@]} OUTPUT_CRAM_FORMAT="" # Not yet supported readonly FQ1 FQ2 BAM OUTPUT_BUCKET REF READGROUP DEDUP BQSR_SITES DBSNP \ INTERVAL INTERVAL_FILE NO_METRICS NO_BAM_OUTPUT NO_HAPLOTYPER GVCF_OUTPUT \ - STREAM_INPUT PIPELINE OUTPUT_CRAM_FORMAT SENTIEON_KEY RECALIBRATED_OUTPUT + STREAM_INPUT PIPELINE OUTPUT_CRAM_FORMAT SENTIEON_KEY RECALIBRATED_OUTPUT \ + EMAIL SENTIEON_VERSION CALLING_ARGS DNASCOPE_MODEL CALLING_ALGO -# Some basic error handling # +release_dir="/opt/sentieon/sentieon-genomics-${SENTIEON_VERSION}/" + +# Basic error handling # if [[ -n "$FQ1" && -n "$BAM" ]]; then echo "Please supply either fastq or BAM files" exit 1 fi if [[ -n "$INTERVAL" && -n "$INTERVAL_FILE" ]]; then - echo "Please supply either an INTERVAL or INTERVAL_FILE" + echo "Please supply either an INTERVAL or INTERVAL_FILE, not both" exit 1 fi @@ -45,28 +44,31 @@ fi # 0. Setup # ********************************** gc_setup +export LD_PRELOAD=${release_dir}/lib/libjemalloc.so +export MALLOC_CONF=lg_dirty_mult:-1 ## Download input files if [[ -n "$BAM" ]]; then - download_bams $BAM local_bams $input_dir + download_bams "$BAM" local_bams $input_dir else local_bams=() fi download_intervals download_reference +if [[ $CALLING_ALGO == "DNAscope" && -n "$DNASCOPE_MODEL" ]]; then + curl -L -o ${input_dir}/dnascope.model "$DNASCOPE_MODEL" +fi ## Handle the sites files -gs_bqsr_sites=($(echo $BQSR_SITES | tr ',' ' ')) +IFS=',' read -r -a gs_bqsr_sites <<< "$BQSR_SITES" transfer_all_sites $bqsr_dir "${gs_bqsr_sites[@]}" -local_bqsr_sites="${local_sites[@]}" +local_bqsr_sites=("${local_sites[@]}") bqsr_sites="$local_str" -REALIGN_SITES= - if [[ -n "$DBSNP" ]]; then - transfer_all_sites $dbsnp_dir $DBSNP - dbsnp=${local_sites[0]} + transfer_all_sites $dbsnp_dir "$DBSNP" + dbsnp="${local_sites[0]}" fi # ****************************************** @@ -74,15 +76,12 @@ fi # ****************************************** output_ext="bam" if [[ -n $FQ1 ]]; then -# if [[ -n "$NO_BAM_OUTPUT" || "$DEDUP" != "nodup" ]]; then -# util_sort_xargs="${util_sort_xargs} --bam_compression 1 " -# fi - bwa_mem_align "" $FQ1 $FQ2 $READGROUP local_bams $output_ext "$bwa_xargs" "$util_sort_xargs" + bwa_mem_align "" "$FQ1" "$FQ2" "$READGROUP" local_bams $output_ext "-M -K 10000000" "$util_sort_xargs" "false" fi local_bams_str="" -for bam in ${local_bams[@]}; do - local_bams_str+=" -i $bam " +for bam in "${local_bams[@]}"; do + local_bams_str+=" -i \"$bam\" " done # ****************************************** @@ -99,43 +98,40 @@ fi # 3. 
Remove duplicates # ****************************************** output_ext="bam" -#if [[ -n "$NO_BAM_OUTPUT" ]]; then -# dedup_xargs=" --bam_compression 1 " -#fi -run_mark_duplicates "" $DEDUP metrics_cmd1 "$local_bams_str" dedup_bam_str dedup_bams "$dedup_xargs" $output_ext ${local_bams[@]} +run_mark_duplicates "" "$DEDUP" metrics_cmd1 "$local_bams_str" dedup_bam_str dedup_bams "$dedup_xargs" $output_ext "false" "${local_bams[@]}" if [[ "$DEDUP" != "nodup" ]]; then if [[ -z "$NO_METRICS" ]]; then - (gsutil cp $metrics_dir/dedup_metrics.txt $out_metrics && + (gsutil cp $metrics_dir/dedup_metrics.txt "$out_metrics" && rm $metrics_dir/dedup_metrics.txt) & upload_dedup_pid=$! else rm $metrics_dir/dedup_metrics.txt & fi - for bam in ${local_bams[@]}; do - if [[ -f $bam ]]; then - rm $bam & + for bam in "${local_bams[@]}"; do + if [[ -f "$bam" ]]; then + rm "$bam" & fi - if [[ -f ${bam}.bai ]]; then - rm ${bam}.bai & + if [[ -f "${bam}".bai ]]; then + rm "${bam}".bai & fi - if [[ -f ${bam}.crai ]]; then - rm ${bam}.crai & + if [[ -f "${bam}".crai ]]; then + rm "${bam}".crai & fi done fi if [[ -z "$NO_BAM_OUTPUT" && (-z "$bqsr_sites" || -z "$RECALIBRATED_OUTPUT" ) ]]; then upload_list="" - for bam in ${dedup_bams[@]}; do - upload_list+=" $bam " + for bam in "${dedup_bams[@]}"; do + upload_list+=" \"$bam\" " if [[ -f "${bam}.bai" ]]; then - upload_list+=" ${bam}.bai " + upload_list+=" \"${bam}.bai\" " elif [[ -f "${bam}.crai" ]]; then - upload_list+=" ${bam}.crai " + upload_list+=" \"${bam}.crai\" " fi done - gsutil cp $upload_list $out_bam & + eval gsutil cp $upload_list "$out_bam" & upload_deduped_pid=$! fi @@ -150,12 +146,12 @@ if [[ -n "$bqsr_sites" && -z "$NO_BAM_OUTPUT" && -n "$RECALIBRATED_OUTPUT" ]]; t outrecal=$work/recalibrated.bam cmd="$release_dir/bin/sentieon driver $dedup_bam_str -q $bqsr_table --algo ReadWriter $outrecal" (run "$cmd" "ReadWriter"; - gsutil cp $outrecal ${outrecal}.bai $out_bam) & + gsutil cp $outrecal ${outrecal}.bai "$out_bam") & upload_recal_pid=$! fi if [[ -n "$bqsr_sites" && -z "$NO_BAM_OUTPUT" && -z "$RECALIBRATED_OUTPUT" ]]; then - gsutil cp $bqsr_table $out_bam & + gsutil cp $bqsr_table "$out_bam" & upload_bqsr_pid=$! 
fi @@ -167,17 +163,22 @@ upload_metrics metrics_cmd1 metrics_cmd2 upload_metrics_pid ${metrics_files[@]} # ****************************************** ## Generate a non-decoy BED file -generate_nondecoy_bed ${ref}.fai ${ref}_nondecoy.bed +generate_nondecoy_bed "${ref}".fai "${ref}"_nondecoy.bed +call_interval="$interval" +call_interval=${call_interval:-"${ref}"_nondecoy.bed} outgvcf=$work/hc.g.vcf.gz outvcf=$work/hc.vcf.gz algo=Haplotyper extra_vcf_args="" -extra_gvcf_args="" -if [[ $PIPELINE == "DNAscope" ]]; then +if [[ $CALLING_ALGO == "DNAscope" ]]; then algo="DNAscope" - extra_vcf_args="--var_type snp,indel,bnd" + if [[ -n "$DNASCOPE_MODEL" ]]; then + extra_vcf_args="--model ${input_dir}/dnascope.model" + else + extra_vcf_args="--var_type snp,indel,bnd" + fi outvcf=$work/dnascope.vcf.gz outgvcf=$work/dnascope.g.vcf.gz tmpvcf=$work/tmp.vcf.gz @@ -185,13 +186,12 @@ fi if [[ -z $NO_HAPLOTYPER ]]; then if [[ -n "$GVCF_OUTPUT" ]]; then - cmd="$release_dir/bin/sentieon driver ${interval:- --interval ${ref}_nondecoy.bed} -t $nt -r '$ref' $dedup_bam_str ${bqsr_table:+-q $bqsr_table} --algo $algo $extra_gvcf_args ${dbsnp:+-d $dbsnp} --emit_mode gvcf ${outgvcf}" + cmd="$release_dir/bin/sentieon driver --interval \"$call_interval\" -t $nt -r \"$ref\" $dedup_bam_str ${bqsr_table:+-q $bqsr_table} --algo $algo $CALLING_ARGS ${dbsnp:+-d \"$dbsnp\"} --emit_mode gvcf ${outgvcf}" outfile=$outgvcf else - cmd="$release_dir/bin/sentieon driver ${interval:- --interval ${ref}_nondecoy.bed} -t $nt -r '$ref' $dedup_bam_str ${bqsr_table:+-q $bqsr_table} --algo $algo $extra_vcf_args ${dbsnp:+-d $dbsnp} ${outvcf}" + cmd="$release_dir/bin/sentieon driver --interval \"$call_interval\" -t $nt -r \"$ref\" $dedup_bam_str ${bqsr_table:+-q $bqsr_table} --algo $algo $CALLING_ARGS $extra_vcf_args ${dbsnp:+-d \"$dbsnp\"} ${outvcf}" outfile=$outvcf fi - if [[ -n $metrics_cmd1 ]]; then cmd+=" $metrics_cmd1" metrics_cmd1= @@ -200,20 +200,30 @@ if [[ -z $NO_HAPLOTYPER ]]; then cmd+=" $bqsr_cmd2" bqsr_cmd2= fi + run "$cmd" "Variant calling" - run "$cmd" "Haplotyper variant calling" - if [[ $PIPELIINE == "DNAscope" ]]; then + # DNAscope SV calling + if [[ $CALLING_ALGO == "DNAscope" && -z "$GVCF_OUTPUT" && -z "$DNASCOPE_MODEL" ]]; then mv $outvcf $tmpvcf mv ${outvcf}.tbi ${tmpvcf}.tbi - cmd="$release_dir/bin/sentieon driver -t $nt -r '$ref' --algo SVSolver -v $tmpvcf $outvcf" + cmd="$release_dir/bin/sentieon driver -t $nt -r \"$ref\" --algo SVSolver -v $tmpvcf $outvcf" run "$cmd" "SVSolver" fi - gsutil cp $outfile ${outfile}.tbi $out_variants & + + # DNAscope model apply + if [[ $CALLING_ALGO == "DNAscope" && -z "$GVCF_OUTPUT" && -n "$DNASCOPE_MODEL" ]]; then + mv $outvcf $tmpvcf + mv ${outvcf}.tbi ${tmpvcf}.tbi + cmd="$release_dir/bin/sentieon driver -t $nt -r \"$ref\" --algo DNAModelApply --model ${input_dir}/dnascope.model -v $tmpvcf $outvcf" + run "$cmd" "DNAscope model apply" + fi + + gsutil cp $outfile ${outfile}.tbi "$out_variants" & upload_vcf_pid=$! 
fi if [[ -n $metrics_cmd1 ]]; then - cmd="$release_dir/bin/sentieon driver $interval -t $nt -r '$ref' $dedup_bam_str ${bqsr_table:+-q $bqsr_table}" + cmd="$release_dir/bin/sentieon driver ${interval:+--interval \"$interval\"} -t $nt -r \"$ref\" $dedup_bam_str ${bqsr_table:+-q $bqsr_table}" cmd+=" $metrics_cmd1" run "$cmd" "Metrics collection" fi @@ -222,7 +232,7 @@ upload_metrics metrics_cmd1 metrics_cmd2 upload_metrics_pid ${metrics_files[@]} if [[ -n $bqsr_cmd2 ]]; then - cmd="$release_dir/bin/sentieon driver $interval -t $nt -r '$ref' $dedup_bam_str -q $bqsr_table $bqsr_cmd2" + cmd="$release_dir/bin/sentieon driver ${interval:+--interval \"$interval\"} -t $nt -r \"$ref\" $dedup_bam_str -q $bqsr_table $bqsr_cmd2" bqsr_cmd2= run "$cmd" "BQSR Post" fi @@ -230,12 +240,10 @@ fi if [[ -n $bqsr_cmd3 ]]; then run "$bqsr_cmd3" "BQSR CSV" run "$bqsr_cmd4" "BQSR plot" - gsutil cp $plot $out_metrics & + gsutil cp $plot "$out_metrics" & upload_bqsr_metrics_pid=$! fi -kill $credentials_pid - # Wait for all running uploads to finish set +e if [[ -n $upload_recal_pid ]]; then diff --git a/pipeline_scripts/gc_somatic.sh b/pipeline_scripts/gc_somatic.sh index ea0564b..657b896 100644 --- a/pipeline_scripts/gc_somatic.sh +++ b/pipeline_scripts/gc_somatic.sh @@ -4,20 +4,16 @@ set -xveo pipefail set +H BASEDIR=$(dirname "$0") -version="201808.01" -release_dir="/opt/sentieon/sentieon-genomics-${version}/" scratch=/mnt/work nt=$(nproc) source $BASEDIR/gc_functions.sh -export LD_PRELOAD=/opt/sentieon/sentieon-genomics-${version}/lib/libjemalloc.so -export MALLOC_CONF=lg_dirty_mult:-1 - # Set "None" varibles to an empty string environmental_variables=(FQ1 FQ2 TUMOR_FQ1 TUMOR_FQ2 BAM TUMOR_BAM \ OUTPUT_BUCKET REF READGROUP TUMOR_READGROUP DEDUP BQSR_SITES DBSNP \ INTERVAL INTERVAL_FILE NO_METRICS NO_BAM_OUTPUT NO_VCF RUN_TNSNV \ - STREAM_INPUT PIPELINE REALIGN_SITES OUTPUT_CRAM_FORMAT SENTIEON_KEY) + STREAM_INPUT PIPELINE REALIGN_SITES OUTPUT_CRAM_FORMAT SENTIEON_KEY \ + EMAIL SENTIEON_VERSION CALLING_ARGS CALLING_ALGO) unset_none_variables ${environmental_variables[@]} OUTPUT_CRAM_FORMAT="" # Not yet supported @@ -60,22 +56,27 @@ fi readonly FQ1 FQ2 TUMOR_FQ1 TUMOR_FQ2 BAM TUMOR_BAM \ OUTPUT_BUCKET REF READGROUP TUMOR_READGROUP DEDUP BQSR_SITES DBSNP \ INTERVAL INTERVAL_FILE NO_METRICS NO_BAM_OUTPUT NO_VCF RUN_TNSNV \ - STREAM_INPUT PIPELINE REALIGN_SITES OUTPUT_CRAM_FORMAT + STREAM_INPUT PIPELINE REALIGN_SITES OUTPUT_CRAM_FORMAT EMAIL \ + SENTIEON_VERSION CALLING_ARGS CALLING_ALGO + +release_dir="/opt/sentieon/sentieon-genomics-${SENTIEON_VERSION}/" # ***************************** # 0. 
Setup # ***************************** gc_setup +export LD_PRELOAD=${release_dir}/lib/libjemalloc.so +export MALLOC_CONF=lg_dirty_mult:-1 ## Download input files if [[ -n "$BAM" ]]; then - download_bams $BAM local_bams $input_dir + download_bams "$BAM" local_bams $input_dir else local_bams=() fi if [[ -n "$TUMOR_BAM" ]]; then - download_bams $TUMOR_BAM tumor_bams $input_dir + download_bams "$TUMOR_BAM" tumor_bams $input_dir else tumor_bams=() fi @@ -84,45 +85,41 @@ download_intervals download_reference ## Handle the sites files -gs_bqsr_sites=($(echo $BQSR_SITES | tr ',' ' ')) +IFS=',' read -r -a gs_bqsr_sites <<< "$BQSR_SITES" transfer_all_sites $bqsr_dir "${gs_bqsr_sites[@]}" -local_bqsr_sites="${local_sites[@]}" +local_bqsr_sites=("${local_sites[@]}") bqsr_sites="$local_str" -gs_realign_sites=($(echo $REALIGN_SITES | tr ',' ' ')) +IFS=',' read -r -a gs_realign_sites <<< "$REALIGN_SITES" transfer_all_sites $realign_dir "${gs_realign_sites[@]}" -local_realign_sites="${local_sites[@]}" +local_realign_sites=("${local_sites[@]}") realign_sites="$local_str" if [[ -n "$DBSNP" ]]; then - transfer_all_sites $dbsnp_dir $DBSNP - dbsnp=${local_sites[0]} + transfer_all_sites $dbsnp_dir "$DBSNP" + dbsnp="${local_sites[0]}" fi # ****************************************** # 1. Mapping reads with BWA-MEM, sorting # ****************************************** output_ext="bam" -#if [[ -n "$NO_BAM_OUTPUT" || "$DEDUP" != "nodup" || -n "$REALIGN_SITES" ]]; then -# util_sort_xargs="${util_sort_xargs} --bam_compression 1 " -#fi - if [[ -n "$FQ1" ]]; then - bwa_mem_align "normal_" $FQ1 $FQ2 $READGROUP local_bams $output_ext "$bwa_xargs" "$util_sort_xargs" + bwa_mem_align "normal_" "$FQ1" "$FQ2" "$READGROUP" local_bams $output_ext "-M -K 10000000" "$util_sort_xargs" "false" fi local_bams_str="" -for bam in ${local_bams[@]}; do - local_bams_str+=" -i $bam " +for bam in "${local_bams[@]}"; do + local_bams_str+=" -i \"$bam\" " done if [[ -n "$TUMOR_FQ1" ]]; then - bwa_mem_align "tumor_" $TUMOR_FQ1 $TUMOR_FQ2 $TUMOR_READGROUP tumor_bams $output_ext "$bwa_xargs" "$util_sort_xargs" + bwa_mem_align "tumor_" "$TUMOR_FQ1" "$TUMOR_FQ2" "$TUMOR_READGROUP" tumor_bams $output_ext "-M -K 10000000" "$util_sort_xargs" "false" fi tumor_bams_str="" -for bam in ${tumor_bams[@]}; do - tumor_bams_str+=" -i $bam " +for bam in "${tumor_bams[@]}"; do + tumor_bams_str+=" -i \"$bam\" " done # Detect the tumor and normal sample names @@ -152,12 +149,9 @@ fi # 3. 
Remove duplicates # ****************************************** output_ext="bam" -#if [[ -n "$NO_BAM_OUTPUT" || -n "$REALIGN_SITES" ]]; then -# dedup_xargs=" --bam_compression 1 " -#fi -run_mark_duplicates "normal_" $DEDUP metrics_cmd1 "$local_bams_str" dedup_bam_str dedup_bams "$dedup_xargs" $output_ext ${local_bams[@]} -run_mark_duplicates "tumor_" $DEDUP tumor_metrics_cmd1 "$tumor_bams_str" tumor_dedup_bam_str tumor_dedup_bams "$dedup_xargs" $output_ext ${tumor_bams[@]} +run_mark_duplicates "normal_" "$DEDUP" metrics_cmd1 "$local_bams_str" dedup_bam_str dedup_bams "$dedup_xargs" $output_ext "false" "${local_bams[@]}" +run_mark_duplicates "tumor_" "$DEDUP" tumor_metrics_cmd1 "$tumor_bams_str" tumor_dedup_bam_str tumor_dedup_bams "$dedup_xargs" $output_ext "false" "${tumor_bams[@]}" if [[ "$DEDUP" != "nodup" ]]; then if [[ -z "$NO_METRICS" ]]; then @@ -165,36 +159,36 @@ if [[ "$DEDUP" != "nodup" ]]; then if [[ -n "$dedup_bam_str" ]]; then to_upload+=" $metrics_dir/normal_dedup_metrics.txt" fi - (gsutil cp $to_upload $out_metrics && + (gsutil cp $to_upload "$out_metrics" && rm $metrics_dir/*_dedup_metrics.txt) & upload_dedup_pid=$! else rm $metrics_dir/*_dedup_metrics.txt & fi - for bam in ${local_bams[@]} ${tumor_bams[@]}; do - if [[ -f $bam ]]; then - rm $bam & + for bam in "${local_bams[@]}" "${tumor_bams[@]}"; do + if [[ -f "$bam" ]]; then + rm "$bam" & fi - if [[ -f ${bam}.bai ]]; then - rm ${bam}.bai & + if [[ -f "${bam}".bai ]]; then + rm "${bam}".bai & fi - if [[ -f ${bam}.crai ]]; then - rm ${bam}.crai & + if [[ -f "${bam}".crai ]]; then + rm "${bam}".crai & fi done fi if [[ -z "$NO_BAM_OUTPUT" && -z "$REALIGN_SITES" ]]; then upload_list="" - for bam in ${dedup_bams[@]} ${tumor_dedup_bams[@]}; do - upload_list+=" $bam " + for bam in "${dedup_bams[@]}" "${tumor_dedup_bams[@]}"; do + upload_list+=" \"$bam\" " if [[ -f "${bam}.bai" ]]; then - upload_list+=" ${bam}.bai " + upload_list+=" \"${bam}.bai\" " elif [[ -f "${bam}.crai" ]]; then - upload_list+=" ${bam}.crai " + upload_list+=" \"${bam}.crai\" " fi done - gsutil cp $upload_list $out_bam & + eval gsutil cp $upload_list "$out_bam" & upload_deduped_pid=$! fi @@ -205,30 +199,27 @@ upload_metrics tumor_metrics_cmd1 tumor_metrics_cmd2 tumor_upload_metrics_pid ${ # 4. 
Indel Realignment # ****************************************** output_ext="bam" -#if [[ -n "$NO_BAM_OUTPUT" || -n "$dedup_bam_str" ]]; then -# realign_xargs=" --bam_compression 1 " -#fi if [[ -n "$REALIGN_SITES" && -n "$RUN_TNSNV" ]]; then realigned_bam=$work/normal_realigned.${output_ext} tumor_realigned_bam=$work/tumor_realigned.${output_ext} if [[ -n "$dedup_bam_str" ]]; then - cmd="$release_dir/bin/sentieon driver $dedup_bam_str -t $nt -r $ref --algo Realigner $realign_xargs $interval_list $realign_sites $realigned_bam" + cmd="$release_dir/bin/sentieon driver $dedup_bam_str -t $nt -r \"$ref\" --algo Realigner ${interval_list:+--interval_list \"$interval_list\"} $realign_sites $realigned_bam" run "$cmd" "Indel Realign - Normal" fi - cmd="$release_dir/bin/sentieon driver $tumor_dedup_bam_str -t $nt -r $ref --algo Realigner $realign_xargs $interval_list $realign_sites $tumor_realigned_bam" + cmd="$release_dir/bin/sentieon driver $tumor_dedup_bam_str -t $nt -r \"$ref\" --algo Realigner ${interval_list:+--interval_list \"$interval_list\"} $realign_sites $tumor_realigned_bam" run "$cmd" "Indel Realign - Tumor" # Cleanup - for bam in ${dedup_bams[@]} ${tumor_dedup_bams[@]}; do + for bam in "${dedup_bams[@]}" "${tumor_dedup_bams[@]}"; do if [[ -f "$bam" ]]; then - rm $bam & + rm "$bam" & fi if [[ -f "${bam}.bai" ]]; then - rm ${bam}.bai & + rm "${bam}".bai & fi if [[ -f "${bam}.crai" ]]; then - rm ${bam}.crai & + rm "${bam}".crai & fi done @@ -242,9 +233,9 @@ if [[ -n "$REALIGN_SITES" && -n "$RUN_TNSNV" ]]; then tumor_realigned_bams=($tumor_realigned_bam) tumor_realigned_bam_str=" -i $tumor_realigned_bam " else - realigned_bams=${dedup_bams[@]} + realigned_bams="${dedup_bams[@]}" realigned_bam_str=${dedup_bam_str} - tumor_realigned_bams=${tumor_dedup_bams[@]} + tumor_realigned_bams="${tumor_dedup_bams[@]}" tumor_realigned_bam_str=${tumor_dedup_bam_str} fi @@ -263,7 +254,7 @@ if [[ -n "$realigned_bam_str" ]]; then fi if [[ -n "$bqsr_sites" && -z "$NO_BAM_OUTPUT" ]]; then - gsutil cp $bqsr_table $tumor_bqsr_table $out_bam & + gsutil cp $bqsr_table $tumor_bqsr_table "$out_bam" & upload_bqsr_pid=$! fi @@ -274,30 +265,27 @@ upload_metrics tumor_metrics_cmd1 tumor_metrics_cmd2 tumor_upload_metrics_pid ${ # 6. 
Indel corealignment # ****************************************** output_ext="bam" -#if [[ -n "$NO_BAM_OUTPUT" ]]; then -# corealign_xargs=" --bam_compression 1 " -#fi if [[ -n "$REALIGN_SITES" && -n "$RUN_TNSNV" && -n "$realigned_bam_str" ]]; then corealigned_bam=$work/corealigned.${output_ext} - cmd="$release_dir/bin/sentieon driver $realigned_bam_str $tumor_realigned_bam_str -t $nt -r $ref --algo Realigner $corealign_xargs $interval_list $realign_sites $corealigned_bam" + cmd="$release_dir/bin/sentieon driver $realigned_bam_str $tumor_realigned_bam_str -t $nt -r \"$ref\" --algo Realigner ${interval_list:+--interval_list \"$interval_list\"} $realign_sites $corealigned_bam" run "$cmd" "Indel co-realignment" # Cleanup - for bam in ${realigned_bams[@]} ${tumor_realigned_bams[@]}; do - if [[ -f $bam ]]; then - rm $bam & + for bam in "${realigned_bams[@]}" "${tumor_realigned_bams[@]}"; do + if [[ -f "$bam" ]]; then + rm "$bam" & fi - if [[ -f ${bam}.bai ]]; then - rm ${bam}.bai & + if [[ -f "${bam}".bai ]]; then + rm "${bam}".bai & fi - if [[ -f ${bam}.crai ]]; then - rm ${bam}.crai & + if [[ -f "${bam}".crai ]]; then + rm "${bam}".crai & fi done - for site_file in ${local_realign_sites[@]}; do - if [[ -f $site_file ]]; then - rm $site_file & + for site_file in "${local_realign_sites[@]}"; do + if [[ -f "$site_file" ]]; then + rm "$site_file" & fi done @@ -308,21 +296,21 @@ if [[ -n "$REALIGN_SITES" && -n "$RUN_TNSNV" && -n "$realigned_bam_str" ]]; then elif [[ -f "${corealigned_bam}.crai" ]]; then upload_list+=" ${corealigned_bam}.crai " fi - gsutil cp $upload_list $out_bam & + gsutil cp $upload_list "$out_bam" & upload_corealigned_pid=$! corealigned_bam_str=" -i $corealigned_bam " elif [[ -n "$REALIGN_SITES" && -n "$RUN_TNSNV" ]]; then upload_list="" - for bam in ${tumor_realigned_bams[@]}; do - upload_list+=" $bam " + for bam in "${tumor_realigned_bams[@]}"; do + upload_list+=" \"$bam\" " if [[ -f "${bam}.bai" ]]; then - upload_list+=" ${bam}.bai " + upload_list+=" \"${bam}.bai\" " elif [[ -f "${bam}.crai" ]]; then - upload_list+=" ${bam}.crai " + upload_list+=" \"${bam}.crai\" " fi done - gsutil cp $upload_list $out_bam & + eval gsutil cp $upload_list "$out_bam" & upload_corealigned_pid=$! 
corealigned_bam_str=" $tumor_realigned_bam_str " @@ -336,22 +324,16 @@ corealigned_bqsr_str=" ${tumor_bqsr_table:+-q $tumor_bqsr_table} ${bqsr_table:+- # ******************************************* ## Generate a non-decoy BED file -generate_nondecoy_bed ${ref}.fai ${ref}_nondecoy.bed +generate_nondecoy_bed "${ref}".fai "${ref}"_nondecoy.bed +call_interval="$interval" +call_interval=${call_interval:-"${ref}"_nondecoy.bed} if [[ -z "$NO_VCF" ]]; then extra_calling_args="" - if [[ "$PIPELINE" == "TNscope" ]]; then - algo="TNscope" - vcf=$work/tnscope.vcf.gz - elif [[ -n "$RUN_TNSNV" ]]; then - algo="TNsnv" - vcf=$work/tnsnv.vcf.gz - else - algo="TNhaplotyper" - vcf=$work/tnhaplotyper.vcf.gz - fi + algo="$CALLING_ALGO" + vcf="$work"/"$CALLING_ALGO".vcf.gz - cmd="$release_dir/bin/sentieon driver ${interval:- --interval ${ref}_nondecoy.bed} $corealigned_bam_str $corealigned_bqsr_str -t $nt -r $ref --algo $algo ${normal_sample:+--normal_sample $normal_sample} --tumor_sample $tumor_sample ${dbsnp:+--dbsnp $dbsnp} $vcf" + cmd="$release_dir/bin/sentieon driver --interval \"$call_interval\" $corealigned_bam_str $corealigned_bqsr_str -t $nt -r \"$ref\" --algo $algo $CALLING_ARGS ${normal_sample:+--normal_sample $normal_sample} --tumor_sample $tumor_sample ${dbsnp:+--dbsnp \"$dbsnp\"} $vcf" if [[ -n $tumor_metrics_cmd1 ]]; then cmd+=" $metrics_cmd1 " @@ -368,12 +350,12 @@ if [[ -z "$NO_VCF" ]]; then fi run "$cmd" "Variant calling" - gsutil cp $vcf ${vcf}.tbi $out_variants & + gsutil cp $vcf ${vcf}.tbi "$out_variants" & upload_vcf_pid=$! fi if [[ -n $tumor_metrics_cmd1 ]]; then - cmd="$release_dir/bin/sentieon driver ${interval:- --interval ${ref}_nondecoy.bed} -t $nt -r '$ref' $corealigned_bam_str $corealigned_bqsr_str" + cmd="$release_dir/bin/sentieon driver --interval \"$call_interval\" -t $nt -r \"$ref\" $corealigned_bam_str $corealigned_bqsr_str" cmd+=" $metrics_cmd1 " cmd+=" $tumor_metrics_cmd1 " metrics_cmd1= @@ -385,7 +367,7 @@ upload_metrics metrics_cmd1 metrics_cmd2 upload_metrics_pid ${metrics_files[@]} upload_metrics tumor_metrics_cmd1 tumor_metrics_cmd2 tumor_upload_metrics_pid ${tumor_metrics_files[@]} if [[ -n $bqsr_cmd2 ]]; then - cmd="$release_dir/bin/sentieon driver ${interval:- --interval ${ref}_nondecoy.bed} -t $nt -r $ref $corealigned_bam_str $corealigned_bqsr_str $bqsr_cmd2 $tumor_bqsr_cmd2" + cmd="$release_dir/bin/sentieon driver --interval \"$call_interval\" -t $nt -r \"$ref\" $corealigned_bam_str $corealigned_bqsr_str $bqsr_cmd2 $tumor_bqsr_cmd2" bqsr_cmd2= tumor_bqsr_cmd2= run "$cmd" "BQSR Post" @@ -400,12 +382,10 @@ if [[ -n $tumor_bqsr_cmd3 ]]; then fi run "$tumor_bqsr_cmd3" "Tumor BQSR CSV" run "$tumor_bqsr_cmd4" "Tumor BQSR plot" - gsutil cp $upload_list $tumor_bqsr_plot $out_metrics & + gsutil cp $upload_list $tumor_bqsr_plot "$out_metrics" & upload_bqsr_metrics_pid=$! 
fi -kill $credentials_pid - # Wait for all running uploads to finish set +e if [[ -n $upload_normal_bqsr_plot_pid ]]; then @@ -438,4 +418,4 @@ fi if [[ -n $upload_bqsr_metrics_pid ]]; then wait $upload_bqsr_metrics_pid fi - +exit 0 diff --git a/pipeline_scripts/gen_credentials.py b/pipeline_scripts/gen_credentials.py index 0351cf2..2e753d2 100644 --- a/pipeline_scripts/gen_credentials.py +++ b/pipeline_scripts/gen_credentials.py @@ -6,23 +6,40 @@ import time import argparse import json +import os audience = "https://sentieon.com" headers = {'Metadata-Flavor': 'Google'} request_format = "full" -metadata_url = "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/identity?audience={}&format={}" -project_url = "http://metadata.google.internal/computeMetadata/v1/project/project-id?format={}" +metadata_url = ("http://metadata.google.internal/computeMetadata/v1/instance/" + "service-accounts/default/identity?audience={}&format={}") +project_url = ("http://metadata.google.internal/computeMetadata/v1/project/" + "project-id?format={}") + def process_args(): - parser = argparse.ArgumentParser(description="Write fresh instance metadata credentials to a file for license authentication") - parser.add_argument("auth_data_file", help="A file to hold the instance metadata JWT") + parser = argparse.ArgumentParser(description="Write fresh instance " + "metadata credentials to a file for " + "license authentication") + parser.add_argument("auth_data_file", help="A file to hold the instance " + "metadata JWT") parser.add_argument("sentieon_key", help="A license key string") + parser.add_argument("--email", help="An email associated with the license") return parser.parse_args() -def main(args): - if not args: - args = process_args() +def send_to_background(): + pid = os.fork() + if pid == 0: + os.setsid() # Guarantee no controlling terminal + pid = os.fork() + if pid != 0: + os._exit(0) + else: + os._exit(0) + + +def main(args): url = project_url.format(request_format) response = requests.get(url, headers=headers) project_id = response.text @@ -50,10 +67,15 @@ def main(args): out["google_session_token"] = response.text if args.sentieon_key: out["license_key"] = args.sentieon_key + if args.email: + out["email"] = args.email with open(args.auth_data_file, 'w') as f: json.dump(out, f) - time.sleep(55 * 60) # sleep for 55 minutes before refreshing the token or until killed + # sleep for 55 minutes before refreshing the token or until killed + time.sleep(55 * 60) -if __name__ == "__main__": - main(None) +if __name__ == "__main__": + args = process_args() + send_to_background() + main(args) diff --git a/runner/ccdg.yaml b/runner/ccdg.yaml new file mode 100644 index 0000000..4cc8d92 --- /dev/null +++ b/runner/ccdg.yaml @@ -0,0 +1,72 @@ +name: Sentieon_ccdg +description: Run a Sentieon CCDG pipeline on the Google Cloud + +inputParameters: +# Required parameters +# FQ1 or BAM is required +- name: FQ1 + defaultValue: None + description: Fastq for a single sample (comma-separated) +- name: FQ2 + defaultValue: None + description: Fastq pairs for a single sample (comma-separated) +- name: BAM + defaultValue: None + description: Bam files for a single sample (comma-separated) +- name: OUTPUT_BUCKET + description: The output Google Cloud Storage directory +- name: REF + description: The refence genome (and assoicated indicies) + +# Optional parameters +- name: EMAIL + description: An email to use to obtain an evaluation license + defaultValue: None +- name: SENTIEON_VERSION + description: 
Version of the Sentieon software to use + defaultValue: 201808.07 +- name: READGROUP + description: Readgroup information to add during alignment + defaultValue: "@RG\\tID:read-group\\tSM:sample-name\\tPL:ILLUMINA" +- name: BQSR_SITES + description: Known sites for BQSR (comma-separated) + defaultValue: None +- name: DBSNP + description: A dbSNP file to use during variant calling + defaultValue: None +- name: INTERVAL + description: A string of interval(s) to use during variant calling + defaultValue: None +- name: INTERVAL_FILE + description: An interval file + defaultValue: None +- name: NO_METRICS + description: Set to not run metrics collection + defaultValue: None +- name: NO_BAM_OUTPUT + description: Set to not output a preprocessed BAM file + defaultValue: None +- name: NO_HAPLOTYPER + description: Set to not output a VCF + defaultValue: None +- name: GVCF_OUTPUT + description: Set to output a GVCF instead of a VCF + defaultValue: None +- name: STREAM_INPUT + description: Stream fastq input directly from storage + defaultValue: None +- name: PIPELINE + description: Run the CCDG preprocessing pipeline + defaultValue: CCDG +- name: SENTIEON_KEY + description: A Sentieon License Key + defaultValue: None +- name: DNASCOPE_MODEL + description: A trained model to use with DNAscope + defaultValue: https://s3.amazonaws.com/sentieon-release/other/SentieonDNAscopeModelBeta0.4a-201808.05.model +- name: CALLING_ARGS + description: Additional parameters to set during variant calling + defaultValue: None +- name: CALLING_ALGO + description: The variant calling algorithm to use + defaultValue: Haplotyper diff --git a/runner/sentieon_germline.yaml b/runner/germline.yaml similarity index 73% rename from runner/sentieon_germline.yaml rename to runner/germline.yaml index 0bd1fae..fda8d3e 100644 --- a/runner/sentieon_germline.yaml +++ b/runner/germline.yaml @@ -19,6 +19,12 @@ inputParameters: description: The refence genome (and assoicated indicies) # Optional parameters +- name: EMAIL + description: An email to use to obtain an evaluation license + defaultValue: None +- name: SENTIEON_VERSION + description: Version of the Sentieon software to use + defaultValue: 201808.07 - name: READGROUP description: Readgroup information to add during alignment defaultValue: "@RG\\tID:read-group\\tSM:sample-name\\tPL:ILLUMINA" @@ -53,11 +59,20 @@ inputParameters: description: Stream fastq input directly from storage defaultValue: None - name: PIPELINE - description: Run DNAseq or DNAscope - defaultValue: DNA + description: Run germline variant calling + defaultValue: GERMLINE - name: SENTIEON_KEY description: A Sentieon License Key defaultValue: None - name: RECALIBRATED_OUTPUT description: Set to apply BQSR to the output BAM file defaultValue: None +- name: DNASCOPE_MODEL + description: A trained model to use with DNAscope + defaultValue: https://s3.amazonaws.com/sentieon-release/other/SentieonDNAscopeModelBeta0.4a-201808.05.model +- name: CALLING_ARGS + description: Additional parameters to set during variant calling + defaultValue: None +- name: CALLING_ALGO + description: The variant calling algorithm to use + defaultValue: Haplotyper diff --git a/runner/runner_default.json b/runner/runner_default.json index b26cc52..89697cf 100644 --- a/runner/runner_default.json +++ b/runner/runner_default.json @@ -1,36 +1,41 @@ { - "PREEMPTIBLE_TRIES": 0, - "DEDUP": "rmdup", - "NO_HAPLOTYPER": null, - "TUMOR_FQ1": null, - "TUMOR_FQ2": null, - "BAM": null, - "REF": null, - "INTERVAL_FILE": null, - "REALIGN_SITES": null, - 
"PROJECT_ID": null, - "DISK_SIZE": 300, - "INTERVAL": null, - "NO_METRICS": null, - "TUMOR_READGROUP": "@RG\\tID:tumor-read-group\\tSM:tumor-sample-name\\tPL:ILLUMINA", - "DBSNP": null, - "MIN_RAM_GB": 56, - "NO_BAM_OUTPUT": null, - "RUN_TNSNV": null, - "OUTPUT_BUCKET": null, - "STREAM_INPUT": null, - "READGROUP": "@RG\\tID:read-group\\tSM:sample-name\\tPL:ILLUMINA", - "BQSR_SITES": null, - "PIPELINE": "DNA", - "MIN_CPU": 64, - "NONPREEMPTIBLE_TRY": true, - "ZONES": null, - "TUMOR_BAM": null, - "NO_VCF": null, - "GVCF_OUTPUT": null, - "FQ2": null, - "FQ1": null, - "DOCKER_IMAGE": "sentieon/sentieon-google-cloud:201808.01", + "FQ1": null, + "FQ2": null, + "BAM": null, + "OUTPUT_BUCKET": null, + "REF": null, + "EMAIL": null, + "SENTIEON_VERSION": "201808.07", + "READGROUP": "@RG\\tID:read-group\\tSM:sample-name\\tPL:ILLUMINA", + "DEDUP": "rmdup", + "BQSR_SITES": null, + "DBSNP": null, + "INTERVAL": null, + "INTERVAL_FILE": null, + "NO_METRICS": null, + "NO_BAM_OUTPUT": null, + "NO_HAPLOTYPER": null, + "GVCF_OUTPUT": null, + "STREAM_INPUT": null, + "PIPELINE": "GERMLINE", "SENTIEON_KEY": null, - "RECALIBRATED_OUTPUT": null + "RECALIBRATED_OUTPUT": null, + "TUMOR_FQ1": null, + "TUMOR_FQ2": null, + "TUMOR_BAM": null, + "TUMOR_READGROUP": "@RG\\tID:tumor-read-group\\tSM:tumor-sample-name\\tPL:ILLUMINA", + "NO_VCF": null, + "RUN_TNSNV": null, + "REALIGN_SITES": null, + "ZONES": null, + "DISK_SIZE": 300, + "MACHINE_TYPE": "n1-highcpu-64", + "CPU_PLATFORM": "Intel Broadwell", + "PROJECT_ID": null, + "DOCKER_IMAGE": "sentieon/sentieon-google-cloud:0.2.2", + "CALLING_ARGS": null, + "CALLING_ALGO": "Haplotyper", + "DNASCOPE_MODEL": "https://s3.amazonaws.com/sentieon-release/other/SentieonDNAscopeModelBeta0.4a-201808.05.model", + "PREEMPTIBLE_TRIES": 0, + "NONPREEMPTIBLE_TRY": true } diff --git a/runner/sentieon_runner.py b/runner/sentieon_runner.py index ba2b232..21d4817 100644 --- a/runner/sentieon_runner.py +++ b/runner/sentieon_runner.py @@ -11,98 +11,160 @@ import argparse import os import sys -from pprint import pprint import copy import time import ssl +import warnings +import logging +import random +import google.auth +import googleapiclient.errors from apiclient.discovery import build -import google.auth +from pprint import pformat +from googleapiclient.errors import HttpError script_dir = os.path.dirname(os.path.realpath(__file__)) -germline_yaml = script_dir + "/sentieon_germline.yaml" -somatic_yaml = script_dir + "/sentieon_somatic.yaml" +germline_yaml = script_dir + "/germline.yaml" +somatic_yaml = script_dir + "/somatic.yaml" +ccdg_yaml = script_dir + "/ccdg.yaml" default_json = script_dir + "/runner_default.json" -target_url_base = "https://www.googleapis.com/compute/v1/projects/{project}/zones/{zone}/instances/{instance}" +target_url_base = ("https://www.googleapis.com/compute/v1/projects/{project}/" + "zones/{zone}/instances/{instance}") def cloud_storage_exists(client, gs_path): - try: bucket, blob = gs_path[5:].split('/', 1) bucket = client.bucket(bucket) blob = bucket.blob(blob) res = blob.exists() - except: # Catch all exceptions - raise ValueError("Error: Could not find {gs_path} in Google Cloud Storage".format(**locals())) + except: # Catch all exceptions + raise ValueError("Error: Could not find {gs_path} in Google Cloud " + "Storage".format(**locals())) return res def check_inputs_exist(job_vars, credentials): from google.cloud import storage - client = storage.Client(credentials=credentials) + with warnings.catch_warnings(): + warnings.filterwarnings("ignore", "Your 
application has authenticated " + "using end user credentials from Google Cloud " + "SDK") + client = storage.Client(credentials=credentials) # The DBSNP, BQSR and Realign sites files sites_files = [] - sites_files += job_vars["BQSR_SITES"].split(',') if job_vars["BQSR_SITES"] else [] - sites_files += job_vars["REALIGN_SITES"].split(',') if job_vars["REALIGN_SITES"] else [] + sites_files += (job_vars["BQSR_SITES"].split(',') if + job_vars["BQSR_SITES"] else []) + sites_files += (job_vars["REALIGN_SITES"].split(',') if + job_vars["REALIGN_SITES"] else []) sites_files += [job_vars["DBSNP"]] if job_vars["DBSNP"] else [] for sites_file in sites_files: if not cloud_storage_exists(client, sites_file): - sys.exit("Error: Could not find supplied file {}".format(sites_file)) + logging.error("Could not find supplied file " + "{}".format(sites_file)) + sys.exit(-1) if sites_file.endswith("vcf.gz"): if not cloud_storage_exists(client, sites_file + ".tbi"): - sys.exit("Error: Could not find index for file {}".format(sites_file)) + logging.error("Could not find index for file " + "{}".format(sites_file)) + sys.exit(-1) else: if not cloud_storage_exists(client, sites_file + ".idx"): - sys.exit("Error: Could not find index for file {}".format(sites_file)) + logging.error("Could not find index for file " + "{}".format(sites_file)) + sys.exit(-1) # The data input files - gs_split_files = (job_vars["FQ1"], job_vars["TUMOR_FQ1"], job_vars["FQ2"], job_vars["TUMOR_FQ2"], job_vars["BAM"], job_vars["TUMOR_BAM"]) + gs_split_files = ( + job_vars["FQ1"], + job_vars["TUMOR_FQ1"], + job_vars["FQ2"], + job_vars["TUMOR_FQ2"], + job_vars["BAM"], + job_vars["TUMOR_BAM"]) gs_files = () for split_file in gs_split_files: if not split_file: continue for input_file in split_file.split(','): if not cloud_storage_exists(client, input_file): - sys.exit("Error: Could not find the supplied file {}".format(input_file)) + logging.error("Could not find the supplied file " + "{}".format(input_file)) + sys.exit(-1) for input_file in gs_files: if not cloud_storage_exists(client, input_file): - sys.exit("Error: Could not file the supplied file {}".format(input_file)) + logging.error("Could not file the supplied file " + "{}".format(input_file)) + sys.exit(-1) # All reference files ref = job_vars["REF"] ref_base = ref[:-3] if ref.endswith(".fa") else ref[:-6] if not cloud_storage_exists(client, ref): - sys.exit("Error: Reference file not found") + logging.error("Reference file not found") + sys.exit(-1) if not cloud_storage_exists(client, ref + ".fai"): - sys.exit("Error: Reference fai index not found") + logging.error("Reference fai index not found") + sys.exit(-1) if (not cloud_storage_exists(client, ref + ".dict") and - not cloud_storage_exists(client, ref_base + ".dict")): - sys.exit("Error: Reference dict index not found") + not cloud_storage_exists(client, ref_base + ".dict")): + logging.error("Reference dict index not found") + sys.exit(-1) # FQ specific if job_vars["FQ1"] or job_vars["TUMOR_FQ1"]: for suffix in [".amb", ".ann", ".bwt", ".pac", ".sa"]: - if not cloud_storage_exists(client, ref + suffix): - sys.exit("Error: Reference BWA index {} not found".format(suffix)) + if (not cloud_storage_exists(client, ref + suffix) and + not cloud_storage_exists(client, ref + ".64" + suffix)): + logging.error("Reference BWA index {} not " + "found".format(suffix)) + sys.exit(-1) # BAM specific bam_vars = ("BAM", "TUMOR_BAM") for bam_type in bam_vars: if job_vars[bam_type]: for bam in job_vars[bam_type].split(','): if (not 
cloud_storage_exists(client, bam + ".bai") and
-                        not cloud_storage_exists(client, bam + "bai")):
-                    sys.exit("Error: BAM supplied but BAI not found")
-
+                        not cloud_storage_exists(client, bam + "bai")):
+                    logging.error("BAM supplied but BAI not found")
+                    sys.exit(-1)
+
 
 def main(vargs=None):
     parser = argparse.ArgumentParser()
     parser.add_argument("pipeline_config", help="The json configuration file")
-    parser.add_argument("--no_check_inputs_exist", action="store_true", help="Do not check that the input files exist before running the pipeline")
-    parser.add_argument("--polling_interval", type=float, default=30, help="Seconds between polling the running operation")
+    parser.add_argument(
+        "--verbose",
+        "-v",
+        action="count",
+        help="Increase the runner verbosity")
+    parser.add_argument(
+        "--no_check_inputs_exist",
+        action="store_true",
+        help="Do not check that the input files exist before running the "
+             "pipeline")
+    parser.add_argument(
+        "--polling_interval",
+        type=float,
+        default=30,
+        help="Seconds between polling the running operation")
     args = parser.parse_args()
     polling_interval = args.polling_interval
 
+    logging.getLogger("googleapiclient."
+                      "discovery_cache").setLevel(logging.ERROR)
+    log_format = "%(filename)s::%(funcName)s [%(levelname)s] %(message)s"
+    if not hasattr(args, "verbose") or args.verbose is None:
+        log_level = logging.WARNING
+    elif args.verbose == 1:
+        log_level = logging.INFO
+    elif args.verbose >= 2:
+        log_level = logging.DEBUG
+    else:
+        log_level = logging.WARNING
+    logging.basicConfig(level=log_level, format=log_format)
+
     # Grab input arguments from the json file
     job_vars = json.load(open(default_json))
     job_vars.update(json.load(open(args.pipeline_config)))
@@ -110,14 +172,35 @@ def main(vargs=None):
     if job_vars["NONPREEMPTIBLE_TRY"]:
         non_preemptible_tries = 1
     preemptible = True if preemptible_tries > 0 else False
-    credentials, project_id = google.auth.default()
+    with warnings.catch_warnings():
+        warnings.filterwarnings("ignore", "Your application has authenticated "
+                                "using end user credentials from Google Cloud "
+                                "SDK")
+        credentials, project_id = google.auth.default()
+
+    # Warn with deprecated JSON keys
+    if "MIN_RAM_GB" in job_vars or "MIN_CPU" in job_vars:
+        logging.warning("'MIN_RAM_GB' and 'MIN_CPU' are now ignored. "
+                        "Please use 'MACHINE_TYPE' to specify the instance "
+                        "type")
 
     # Grab the yaml for the workflow
-    pipeline_yaml = germline_yaml if job_vars["PIPELINE"] == "DNA" or job_vars["PIPELINE"] == "DNAscope" or job_vars["PIPELINE"] == "DNAseq" else somatic_yaml
+    pipeline = job_vars["PIPELINE"]
+    if pipeline == "GERMLINE":
+        pipeline_yaml = germline_yaml
+    elif pipeline == "SOMATIC":
+        pipeline_yaml = somatic_yaml
+    elif pipeline == "CCDG":
+        pipeline_yaml = ccdg_yaml
+    else:
+        logging.error("Unknown pipeline '" + pipeline + "'. Valid "
+                      "values are 'GERMLINE', 'SOMATIC', and 'CCDG'")
+        sys.exit(-1)
     try:
-        pipeline_dict = yaml.load(open(pipeline_yaml))
+        pipeline_dict = yaml.safe_load(open(pipeline_yaml))
     except IOError:
-        sys.exit("Error. 
No yaml \"{}\" found.".format(pipeline_yaml)) + logging.error("No yaml \"{}\" found.".format(pipeline_yaml)) + sys.exit(-1) # Try not to create nearly empty directories while job_vars["OUTPUT_BUCKET"].endswith('/'): @@ -125,172 +208,295 @@ def main(vargs=None): # Some basic error checking to fail early if not job_vars["PROJECT_ID"]: - sys.exit("Error: Please supply a PROJECT_ID") + logging.error("Please supply a PROJECT_ID") + sys.exit(-1) # Shared errors if job_vars["FQ1"] and job_vars["BAM"]: - sys.exit("Error: Please supply either 'FQ1' or 'BAM' (not both)") + logging.error("Please supply either 'FQ1' or 'BAM' (not both)") + sys.exit(-1) if job_vars["INTERVAL"] and job_vars["INTERVAL_FILE"]: - sys.exit("Error: Please supply either 'INTERVAL' or 'INTERVAL_FILE'") + logging.error("Please supply either 'INTERVAL' or 'INTERVAL_FILE'") + sys.exit(-1) if ((job_vars["FQ1"] and job_vars["READGROUP"]) and - len(job_vars["FQ1"].split(',')) != len(job_vars["READGROUP"].split(','))): - sys.exit("Error: The number of fastq files must match the number of supplied readgroups") + (len(job_vars["FQ1"].split(',')) != + len(job_vars["READGROUP"].split(',')))): + logging.error("The number of fastq files must match the number of " + "supplied readgroups") + sys.exit(-1) # Pipeline specific errors - if job_vars["PIPELINE"] == "DNA" or job_vars["PIPELINE"] == "DNAscope" or job_vars["PIPELINE"] == "DNAseq": + if pipeline == "GERMLINE" or pipeline == "CCDG": if not job_vars["FQ1"] and not job_vars["BAM"]: - sys.exit("Error: Please supply either 'FQ1' or 'BAM'") - if job_vars["NO_HAPLOTYPER"] and job_vars["NO_METRICS"] and job_vars["NO_BAM_OUTPUT"]: - sys.exit("Error: No output files requested") + logging.error("Please supply either 'FQ1' or 'BAM'") + sys.exit(-1) + if (job_vars["NO_HAPLOTYPER"] and + job_vars["NO_METRICS"] and + job_vars["NO_BAM_OUTPUT"]): + logging.error("No output files requested") + sys.exit(-1) if job_vars["RECALIBRATED_OUTPUT"] and job_vars["BQSR_SITES"] is None: - sys.exit("Error: Cannot output a recalibrated BAM file without running BQSR. Please supply 'BQSR_SITES'") - elif job_vars["PIPELINE"] == "TN" or job_vars["PIPELINE"] == "TNscope" or job_vars["PIPELINE"] == "TNseq": + logging.error("Cannot output a recalibrated BAM file without " + "running BQSR. Please supply 'BQSR_SITES'") + sys.exit(-1) + valid_algos = ("Haplotyper", "DNAscope") + if job_vars["CALLING_ALGO"] not in valid_algos: + logging.error(job_vars["CALLING_ALGO"] + "' is not a " + "valid germline variant calling algo. Please set " + "'CALLING_ALGO' to one of " + str(valid_algos)) + sys.exit(-1) + # Additional CCDG checks + if pipeline == "CCDG": + if job_vars["BQSR_SITES"] is None: + logging.error("The CCDG pipeline requires known sites for " + "BQSR. Please supply 'BQSR_SITES'") + sys.exit(-1) + elif pipeline == "SOMATIC": if job_vars["TUMOR_FQ1"] and job_vars["TUMOR_BAM"]: - sys.exit("Error: Please supply either 'TUMOR_FQ1' or 'TUMOR_BAM' (not both)") - if not job_vars["TUMOR_FQ1"] and not job_vars["TUMOR_BAM"]: - sys.exit("Error: Please supply either 'TUMOR_FQ1' or 'TUMOR_BAM'") - if job_vars["RUN_TNSNV"] and not job_vars["REALIGN_SITES"]: - sys.exit("Error: TNsnv requires indel realignment. 
Please supply 'REALIGN_SITES'")
-        if job_vars["NO_BAM_OUTPUT"] and job_vars["NO_VCF"] and job_vars["NO_METRICS"]:
-            sys.exit("Error: No output files requested")
+            logging.error("Please supply either 'TUMOR_FQ1' or 'TUMOR_BAM' "
+                          "(not both)")
+            sys.exit(-1)
+        if (not job_vars["TUMOR_FQ1"] and
+                not job_vars["TUMOR_BAM"]):
+            logging.error("Please supply either 'TUMOR_FQ1' or 'TUMOR_BAM'")
+            sys.exit(-1)
+        if (job_vars["RUN_TNSNV"] and
+                not job_vars["REALIGN_SITES"]):
+            logging.error("TNsnv requires indel realignment. Please supply "
+                          "'REALIGN_SITES'")
+            sys.exit(-1)
+        if (job_vars["NO_BAM_OUTPUT"] and
+                job_vars["NO_VCF"] and job_vars["NO_METRICS"]):
+            logging.error("No output files requested")
+            sys.exit(-1)
         if ((job_vars["TUMOR_FQ1"] and job_vars["TUMOR_READGROUP"]) and
-            len(job_vars["TUMOR_FQ1"].split(',')) != len(job_vars["TUMOR_READGROUP"].split(','))):
-            sys.exit("Error: The number of tumor fastq files must match the number of supplied readgroups")
-
-    else:
-        sys.exit("Error: DNAseq, DNAscope, TNseq, and TNscope are currently supported")
+                (len(job_vars["TUMOR_FQ1"].split(',')) !=
+                 len(job_vars["TUMOR_READGROUP"].split(',')))):
+            logging.error("The number of tumor fastq files must match the "
+                          "number of supplied readgroups")
+            sys.exit(-1)
+        valid_algos = ("TNhaplotyper", "TNhaplotyper2", "TNscope", "TNsnv")
+        if job_vars["CALLING_ALGO"] not in valid_algos:
+            logging.error("'" + job_vars["CALLING_ALGO"] + "' is not a "
+                          "valid somatic variant calling algo. Please set "
+                          "'CALLING_ALGO' to one of " + str(valid_algos))
+            sys.exit(-1)
 
     if not args.no_check_inputs_exist:
         check_inputs_exist(job_vars, credentials)
 
-    # Construct the pipeline arguments
-    args_dict = {}
-    args_dict["projectId"] = job_vars["PROJECT_ID"]
-    args_dict["logging"] = {"gcsPath": job_vars["OUTPUT_BUCKET"] + "/worker_logs/"}
-    resources_dict = {}
-    resources_dict["minimumRamGb"] = job_vars["MIN_RAM_GB"]
-    resources_dict["minimumCpuCores"] = job_vars["MIN_CPU"]
-    resources_dict["zones"] = job_vars["ZONES"].split(',') if job_vars["ZONES"] else []
-    args_dict["resources"] = copy.copy(resources_dict)
-
-    # Translate None back to "None"
-    input_dict = {}
+    # Resources dict
+    zones = job_vars["ZONES"].split(',') if job_vars["ZONES"] else []
+    if not zones:
+        logging.error("Please supply at least one zone to run the pipeline")
+        sys.exit(-1)
+    region = zones[0][:-2]
+    disk = {
+        "name": "local-disk",
+        "type": "local-ssd",
+        "sizeGb": int(job_vars["DISK_SIZE"])
+    }
+    vm_dict = {
+        "machineType": job_vars["MACHINE_TYPE"],
+        "preemptible": preemptible,
+        "disks": [disk],
+        "serviceAccount": {"scopes": [
+            "https://www.googleapis.com/auth/cloud-platform"]},
+        "cpuPlatform": job_vars["CPU_PLATFORM"]
+    }
+    resources_dict = {
+        "zones": zones,
+        "virtualMachine": vm_dict
+    }
+
+    # Environment
+    env_dict = {}
     for input_var in pipeline_dict["inputParameters"]:
-        input_dict[input_var["name"]] = job_vars[input_var["name"]]
-        if input_dict[input_var["name"]] is None:
-            input_dict[input_var["name"]] = "None"
-    args_dict["inputs"] = input_dict
-
-    # Construct the pipeline object
-    resources_dict = {}
-    resources_dict["preemptible"] = preemptible
-
-    # Ideally we could use mdadm to combine local SSD, but that is not possible inside the container
-    disk = {}
-    disk["name"] = "local-disk"
-    disk["mountPoint"] = "/mnt/work"
-    if int(job_vars["DISK_SIZE"]) <= 375:
-        print("Disk is less than 375 GB, using a single local SSD")
-        disk["type"] = "LOCAL_SSD"
+        env_dict[input_var["name"]] = job_vars[input_var["name"]]
+        if env_dict[input_var["name"]] is None:
+ env_dict[input_var["name"]] = "None" + + # Action + if pipeline == "GERMLINE": + _cmd = "/opt/sentieon/gc_germline.sh" + elif pipeline == "SOMATIC": + _cmd = "/opt/sentieon/gc_somatic.sh" + elif pipeline == "CCDG": + _cmd = "/opt/sentieon/gc_ccdg_germline.sh" else: - print("Disk is greater than 375 GB, using a persistant SSD") - disk["type"] = "PERSISTENT_SSD" - disk["sizeGb"] = int(job_vars["DISK_SIZE"]) - - disks = [disk] - resources_dict["disks"] = disks - pipeline_dict["resources"] = resources_dict - pipeline_dict["projectId"] = job_vars["PROJECT_ID"] - pipeline_dict["docker"] = { - "imageName": job_vars["DOCKER_IMAGE"], - "cmd": "bash /opt/sentieon/gc_germline.sh" if job_vars["PIPELINE"] == "DNA" or job_vars["PIPELINE"] == "DNAscope" or job_vars["PIPELINE"] == "DNAseq" else "bash /opt/sentieon/gc_somatic.sh" + logging.error("Error: Unknown pipeline " + pipeline) + sys.exit(-1) + + run_action = { + "containerName": "run-pipeline", + "imageUri": job_vars["DOCKER_IMAGE"], + "commands": ["/bin/bash", _cmd], + "mounts": [{ + "disk": "local-disk", + "path": "/mnt/work", + "readOnly": False + }], + } + + cleanup_action = { + "containerName": "cleanup", + "imageUri": job_vars["DOCKER_IMAGE"], + "commands": [ + "/bin/bash", + "-c", + ("gsutil cp /google/logs/action/1/stderr " + "\"{}/worker_logs/stderr.txt\" && " + "gsutil cp /google/logs/action/1/stdout " + "\"{}/worker_logs/stdout.txt\"").format( + job_vars["OUTPUT_BUCKET"], job_vars["OUTPUT_BUCKET"])], + "alwaysRun": True } # Run the pipeline # - service = build('genomics', 'v1alpha2', credentials=credentials) + project = job_vars["PROJECT_ID"] + service_parent = "projects/" + project + "/locations/" + region + service = build('lifesciences', 'v2beta', credentials=credentials) compute_service = build("compute", "v1", credentials=credentials) operation = None counter = 0 - project = job_vars["PROJECT_ID"] - + while non_preemptible_tries > 0 or preemptible_tries > 0: if operation: - while not operation["done"]: - time.sleep(polling_interval) - try: - operation = service.operations().get(name=operation['name']).execute() - except ssl.SSLError: - print("Network error while polling running operation.") - sys.stdout.flush() + while not operation.get("done", False): + new_op, tries = None, 0 + while tries <= 5: + time.sleep(polling_interval) + try: + ops = service.projects().locations().operations() + new_op = ops.get(name=operation['name']).execute() + break + except googleapiclient.errors.HttpError as e: + logging.warning(str(e)) + tries += 1 + except ssl.SSLError as e: + logging.warning(str(e)) + tries += 1 + if not new_op: + logging.error("Network error while polling running " + "operation.") sys.exit(1) - pprint(operation, indent=2) + operation = new_op + logging.debug(pformat(operation, indent=2)) if "error" in operation: - try: - zone = operation["metadata"]["runtimeMetadata"]["computeEngine"]["zone"] - except KeyError: - print("Genomics operation failed before running:") - pprint(operation["error"], indent=2) - sys.stdout.flush() + assigned_events = list(filter( + lambda x: "workerAssigned" in x.keys(), + operation["metadata"]["events"])) + if not assigned_events: + logging.error("Genomics operation failed before running:") + logging.error(pformat(operation["error"], indent=2)) sys.exit(2) - instance = operation["metadata"]["runtimeMetadata"]["computeEngine"]["instanceName"] + startup_event = assigned_events[-1] + instance = startup_event["workerAssigned"]["instance"] + zone = startup_event["workerAssigned"]["zone"] url = 
target_url_base.format(**locals())
-                time.sleep(30) # Don't poll too quickly
-                compute_ops = compute_service.zoneOperations().list(project=project, zone=zone, filter="(targetLink eq {url}) (operationType eq compute.instances.preempted)".format(**locals())).execute()
-                if "items" in compute_ops and any([x["operationType"] == "compute.instances.preempted" for x in compute_ops["items"]]):
-                    print("Run {} failed. Retrying...".format(counter))
+                time.sleep(300)  # It may take some time to set the operation
+                compute_ops = (
+                    compute_service.zoneOperations().list(
+                        project=project, zone=zone, filter=(
+                            "(targetLink eq {url}) (operationType eq "
+                            "compute.instances.preempted)"
+                        ).format(**locals())).execute())
+                if ("items" in compute_ops and
+                        any([(x["operationType"] ==
+                              "compute.instances.preempted")
+                             for x in compute_ops["items"]])):
+                    logging.warning("Run {} failed. "
+                                    "Retrying...".format(counter))
                 else:
-                    print("Run {} failed, but not due to preemption. Exit".format(counter))
+                    logging.error("Run {} failed, but not due to preemption. "
+                                  "Exit".format(counter))
                     operation = None
                     break
             else:
                 break
 
         if preemptible_tries > 0:
-            args_dict["resources"]["preemptible"] = True
+            vm_dict["preemptible"] = True
             preemptible_tries -= 1
         else:
-            args_dict["resources"]["preemptible"] = False
+            vm_dict["preemptible"] = False
             non_preemptible_tries -= 1
 
-        print("Running pipeline:")
+        logging.debug("Running pipeline:")
         body = {
-            "ephemeralPipeline": pipeline_dict,
-            "pipelineArgs": args_dict
+            "pipeline": {
+                "actions": [run_action, cleanup_action],
+                "resources": resources_dict,
+                "environment": env_dict
+            }
         }
-        pprint(body, indent=2)
-        sys.stdout.flush()
-        operation = service.pipelines().run(body=body).execute()
+        logging.debug(pformat(body, indent=2))
+        sys.stderr.flush()
+        backoff, backoff_interval = 0, 1
+        while backoff < 6:
+            time.sleep(backoff_interval * random.random() * (2 ** backoff - 1))
+            try:
+                op_pipelines = service.projects().locations().pipelines()
+                request = op_pipelines.run(
+                    parent=service_parent,
+                    body=body)
+                operation = request.execute()
+                break
+            except googleapiclient.errors.HttpError as e:
+                logging.warning(str(e))
+                backoff += 1
+        if not operation:
+            logging.error("Failed to launch job")
+            sys.exit(3)
+        else:
+            logging.warning("Launched job: " + operation["name"])
         counter += 1
+        logging.debug(pformat(operation, indent=2))
 
     if operation:
-        while not operation["done"]:
-            time.sleep(polling_interval)
-            try:
-                operation = service.operations().get(name=operation["name"]).execute()
-            except ssl.SSLError:
-                print("Network error while waiting for the final operation to finish")
-                sys.stdout.flush()
+        while not operation.get("done", False):
+            new_op, tries = None, 0
+            while tries <= 5:
+                time.sleep(polling_interval)
+                try:
+                    new_op = service.projects().locations().operations().get(
+                        name=operation["name"]).execute()
+                    break
+                except googleapiclient.errors.HttpError as e:
+                    logging.warning(str(e))
+                    tries += 1
+                except ssl.SSLError as e:
+                    logging.warning(str(e))
+                    tries += 1
+            if not new_op:
+                logging.error("Network error while waiting for the final "
+                              "operation to finish")
                 sys.exit(1)
+            operation = new_op
+            logging.debug(pformat(operation, indent=2))
 
         if "error" in operation:
-            pprint(operation, indent=2)
-            try:
-                zone = operation["metadata"]["runtimeMetadata"]["computeEngine"]["zone"]
-            except KeyError:
-                print("Genomics operation failed before running:")
-                pprint(operation["error"], indent=2)
-                sys.stdout.flush()
+            assigned_events = list(filter(
+                lambda x: "workerAssigned" in 
x.keys(), + operation["metadata"]["events"])) + if not assigned_events: + logging.error("Genomics operation failed before running:") + logging.error(pformat(operation["error"], indent=2)) sys.exit(2) - instance = operation["metadata"]["runtimeMetadata"]["computeEngine"]["instanceName"] + startup_event = assigned_events[-1] + instance = startup_event["workerAssigned"]["instance"] + zone = startup_event["workerAssigned"]["zone"] url = target_url_base.format(**locals()) - compute_ops = compute_service.zoneOperations().list(project=project, zone=zone, filter="(targetLink eq {url}) (operationType eq compute.instances.preempted)".format(**locals())).execute() - if "items" in compute_ops: - print("Final run failed due to preemption") - else: - print("Final run failed, not due to preemption") + compute_ops = compute_service.zoneOperations().list( + project=project, + zone=zone, + filter=("(targetLink eq {url}) (operationType eq " + "compute.instances.preempted)").format( + **locals())).execute() + logging.error("Final run failed.") else: - print("Operation succeeded") + logging.warning("Operation succeeded") + if __name__ == "__main__": main() diff --git a/runner/sentieon_somatic.yaml b/runner/somatic.yaml similarity index 83% rename from runner/sentieon_somatic.yaml rename to runner/somatic.yaml index 54970fb..370be84 100644 --- a/runner/sentieon_somatic.yaml +++ b/runner/somatic.yaml @@ -19,6 +19,12 @@ inputParameters: description: The refence genome (and assoicated indicies) # Optional parameters +- name: EMAIL + description: An email to use to obtain an evaluation license + defaultValue: None +- name: SENTIEON_VERSION + description: Version of the Sentieon software to use + defaultValue: 201808.07 - name: FQ1 defaultValue: None description: Fastq for a single sample (comma-separated) @@ -65,12 +71,17 @@ inputParameters: description: Stream fastq input directly from storage defaultValue: None - name: PIPELINE - description: Run DNAseq or DNAscope - defaultValue: TNseq + description: Run the somatic pipeline + defaultValue: SOMATIC - name: REALIGN_SITES description: Known sites for indel realignment (comma-separated) defaultValue: None - name: SENTIEON_KEY description: A Sentieon License Key defaultValue: None - +- name: CALLING_ARGS + description: Additional parameters to set during variant calling + defaultValue: None +- name: CALLING_ALGO + description: The variant calling algorithm to use + defaultValue: TNhaplotyper diff --git a/runner/tool_yaml_to_pipeline_json.py b/runner/tool_yaml_to_pipeline_json.py index 9841c52..40ad150 100644 --- a/runner/tool_yaml_to_pipeline_json.py +++ b/runner/tool_yaml_to_pipeline_json.py @@ -6,6 +6,7 @@ import os import json + def add_to_yaml(a, b, ignore_keys=set(("name", "description", "PIPELINE"))): ''' If an object is in b but not a, add it to a. @@ -22,8 +23,8 @@ def add_to_yaml(a, b, ignore_keys=set(("name", "description", "PIPELINE"))): add_to_yaml(a[k], b[k]) else: a[k] = b[k] - elif type(a) is list: # In this case, order is unimportant - try: # Only applicable with zones. + elif type(a) is list: # In this case, order is unimportant + try: # Only applicable with zones. 
            a_set, b_set = set(a), set(b)
             intersection = list(a_set.intersection(b_set))
             for i in range(len(a) - 1, -1, -1):
@@ -35,7 +36,8 @@ def add_to_yaml(a, b, ignore_keys=set(("name", "description", "PIPELINE"))):
         except TypeError:
             pass
         # item["name"] essentially acts as a key
-        a_dict, b_dict = dict([(x["name"], x) for x in a]), dict([(x["name"], x) for x in b])
+        a_dict = dict([(x["name"], x) for x in a])
+        b_dict = dict([(x["name"], x) for x in b])
         for k, v in b_dict.items():
             if k in ignore_keys:
                 continue
@@ -50,13 +52,15 @@ def add_to_yaml(a, b, ignore_keys=set(("name", "description", "PIPELINE"))):
             print("'{}' does not equal '{}'".format(a, b))
             raise ValueError("Values are unequal")
 
-germline = yaml.load(open(os.path.dirname(os.path.realpath(__file__)) + "/sentieon_germline.yaml"))
+
+script_dir = os.path.dirname(os.path.realpath(__file__))
+germline = yaml.load(open(script_dir + "/germline.yaml"))
 tn = None
 try:
-    tn = yaml.load(open(os.path.dirname(os.path.realpath(__file__)) + "/sentieon_somatic.yaml"))
+    tn = yaml.load(open(script_dir + "/somatic.yaml"))
 except IOError:
     pass
-output = os.path.dirname(os.path.realpath(__file__)) + "/runner_default.json"
+output = script_dir + "/runner_default.json"
 
 additional_input_params = {
     "ZONES": None,
@@ -65,7 +69,7 @@ def add_to_yaml(a, b, ignore_keys=set(("name", "description", "PIPELINE"))):
     "MIN_RAM_GB": 56,
     "PIPELINE": "DNA",
     "PROJECT_ID": None,
-    "DOCKER_IMAGE": "sentieon/sentieon-google-cloud:201711.02",
+    "DOCKER_IMAGE": "sentieon/sentieon-google-cloud:0.2.0",
    "PREEMPTIBLE_TRIES": 0,
    "NONPREEMPTIBLE_TRY": True
}
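
With these changes, the runner resolves its settings by loading `runner/runner_default.json` and overlaying the user-supplied JSON configuration, so any key omitted from the input file keeps the default shown above; `PIPELINE` then selects the germline, somatic, or CCDG workflow, and `CALLING_ALGO` selects the variant caller used within it. A minimal sketch of that lookup, assuming the repository layout shown in this patch and run from the repository root (the user config path below is only illustrative):

```python
import json

DEFAULTS = "runner/runner_default.json"   # shipped defaults shown in this patch
USER_CONFIG = "examples/example.json"     # illustrative user-supplied config

# Later keys win: anything missing from the user config keeps its default value
job_vars = json.load(open(DEFAULTS))
job_vars.update(json.load(open(USER_CONFIG)))

# PIPELINE picks the workflow YAML (GERMLINE, SOMATIC, or CCDG) and
# CALLING_ALGO picks the variant caller run inside that workflow
print(job_vars["PIPELINE"], job_vars["CALLING_ALGO"])
```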