Merge pull request #2 from Sentieon/dev
Merge updates in the Dev branch to Master
DonFreed authored Mar 26, 2021
2 parents c2da855 + 902c276 commit 5426fa2
Showing 16 changed files with 425 additions and 255 deletions.
38 changes: 22 additions & 16 deletions README.md
@@ -106,38 +106,42 @@ The runner script accepts a JSON file as input. In the repository you downloaded
"FQ1": "gs://sentieon-test/pipeline_test/inputs/test1_1.fastq.gz",
"FQ2": "gs://sentieon-test/pipeline_test/inputs/test1_2.fastq.gz",
"REF": "gs://sentieon-test/pipeline_test/reference/hs37d5.fa",
"OUTPUT_BUCKET": "gs://BUCKET",
"OUTPUT_BUCKET": "YOUR_BUCKET_HERE",
"ZONES": "us-central1-a,us-central1-b,us-central1-c,us-central1-f",
"PROJECT_ID": "PROJECT_ID",
"EMAIL": "EMAIL"
"PROJECT_ID": "YOUR_PROJECT_HERE",
"REQUESTER_PROJECT": "YOUR_PROJECT_HERE",
"EMAIL": "YOUR_EMAIL_HERE"
}
```

The following table describes the JSON keys in the file:

| JSON key | Description |
| ------------- | ----------------------------------------------------------------------------- |
| FQ1 | The first pair of reads in the input fastq file. |
| FQ2 | The second pair of reads in the input fastq file. |
| BAM | The input BAM file, if applicable. |
| REF | The reference genome. If set, the reference index files are assumed to exist. |
| OUTPUT_BUCKET | The bucket and directory used to store the data output from the pipeline. |
| ZONES | A comma-separated list of GCP zones to use for the worker node. |
| PROJECT_ID | Your GCP project ID. |
| EMAIL | Your email |
| JSON key | Description |
| ----------------- | ----------------------------------------------------------------------------- |
| FQ1               | The fastq file containing the first read of each pair.                        |
| FQ2               | The fastq file containing the second read of each pair.                       |
| BAM | The input BAM file, if applicable. |
| REF | The reference genome. If set, the reference index files are assumed to exist. |
| OUTPUT_BUCKET | The bucket and directory used to store the data output from the pipeline. |
| ZONES | A comma-separated list of GCP zones to use for the worker node. |
| PROJECT_ID | Your GCP project ID. |
| REQUESTER_PROJECT | A project to bill when transferring data from Requester Pays buckets. |
| EMAIL             | Your email address.                                                           |

The `FQ1`, `FQ2`, `REF`, and `ZONES` fields will work with the defaults. However, the `OUTPUT_BUCKET`, `PROJECT_ID`, and `EMAIL` fields will need to be updated to point to your specific output bucket/path, Project ID, and email address.
The `FQ1`, `FQ2`, `REF`, and `ZONES` fields will work with the defaults. However, the `OUTPUT_BUCKET`, `PROJECT_ID`, `REQUESTER_PROJECT`, and `EMAIL` fields need to be updated to point to your specific output bucket/path, project IDs, and email address.
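Since several values are placeholders, a quick pre-flight check can save a failed launch. A minimal sketch (the `check_placeholders` helper and the placeholder pattern are illustrative, not part of the runner):

```shell
# Illustrative helper: fail if any YOUR_..._HERE placeholder survives
# in a run configuration file.
check_placeholders() {
    if grep -q 'YOUR_[A-Z]*_HERE' "$1"; then
        echo "placeholders remain in $1" >&2
        return 1
    fi
    return 0
}

# Typical use before launching:
#   check_placeholders examples/example.json || exit 1
```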

<a name="run"/>

### Run the example pipelines

Edit the `OUTPUT_BUCKET`, `PROJECT_ID`, and `EMAIL` fields in the `examples/example.json` to your output bucket/path, the GCP Project ID that you setup earlier, and email you want associated with your Sentieon license. By supplying the `EMAIL` field, your PROJECT_ID will automatically receive a 14 day free trial for the Sentieon software on the Google Cloud.
Edit the `OUTPUT_BUCKET`, `PROJECT_ID`, `REQUESTER_PROJECT`, and `EMAIL` fields in `examples/example.json` to your output bucket/path, the GCP Project ID that you set up earlier, and the email you want associated with your Sentieon license. By supplying the `EMAIL` field, your PROJECT_ID will automatically receive a 14-day free trial of the Sentieon software on Google Cloud.

After modifying the `examples/example.json` file, you can use the following command to run the DNAseq pipeline on a small test dataset.
```bash
python runner/sentieon_runner.py examples/example.json
python runner/sentieon_runner.py --requester_project $PROJECT_ID examples/example.json
```
The `--requester_project` argument configures the software to bill the specified PROJECT_ID when polling the input files locally. Alternatively, you can pass `--no_check_inputs_exist` to skip input file polling.


<a name="understand"/>

@@ -278,6 +282,7 @@ The `CALLING_ALGO` key can be changed to `TNsnv`, `TNhaplotyper`, `TNhaplotyp
| EMAIL | An email address to use to obtain an evaluation license for your GCP Project |
| SENTIEON_KEY | Your Sentieon license key (only applicable for paying customers) |
| PROJECT_ID | Your GCP Project ID to use when running jobs |
| REQUESTER_PROJECT | A project to bill when transferring data from Requester Pays buckets |
| PREEMPTIBLE_TRIES | Number of attempts to run the pipeline using preemptible instances |
| NONPREEMPTIBLE_TRY | After `PREEMPTIBLE_TRIES` are exhausted, whether to try one additional run with standard instances |

@@ -343,6 +348,7 @@ The `CALLING_ALGO` key can be changed to `TNsnv`, `TNhaplotyper`, `TNhaplotyp
| EMAIL | An email address to use to obtain an evaluation license for your GCP Project |
| SENTIEON_KEY | Your Sentieon license key (only applicable for paying customers) |
| PROJECT_ID | Your GCP Project ID to use when running jobs |
| REQUESTER_PROJECT | A project to bill when transferring data from Requester Pays buckets |
| PREEMPTIBLE_TRIES | Number of attempts to run the pipeline using preemptible instances |
| NONPREEMPTIBLE_TRY | After `PREEMPTIBLE_TRIES` are exhausted, whether to try one additional run with standard instances |

1 change: 1 addition & 0 deletions examples/100x_wes.json
@@ -9,5 +9,6 @@
"STREAM_INPUT": "True",
"ZONES": "us-central1-a,us-central1-b,us-central1-c,us-central1-f",
"PROJECT_ID": "YOUR_PROJECT_HERE",
"REQUESTER_PROJECT": "YOUR_PROJECT_HERE",
"EMAIL": "YOUR_EMAIL_HERE"
}
1 change: 1 addition & 0 deletions examples/30x_wgs.json
@@ -9,5 +9,6 @@
"STREAM_INPUT": "True",
"ZONES": "us-central1-a,us-central1-b,us-central1-c,us-central1-f",
"PROJECT_ID": "YOUR_PROJECT_HERE",
"REQUESTER_PROJECT": "YOUR_PROJECT_HERE",
"EMAIL": "YOUR_EMAIL_HERE"
}
1 change: 1 addition & 0 deletions examples/30x_wgs_ccdg.json
@@ -9,5 +9,6 @@
"STREAM_INPUT": "True",
"ZONES": "us-central1-a,us-central1-b,us-central1-c,us-central1-f",
"PROJECT_ID": "YOUR_PROJECT_HERE",
"REQUESTER_PROJECT": "YOUR_PROJECT_HERE",
"EMAIL": "YOUR_EMAIL_HERE"
}
1 change: 1 addition & 0 deletions examples/example.json
@@ -5,5 +5,6 @@
"OUTPUT_BUCKET": "YOUR_BUCKET_HERE",
"ZONES": "us-central1-a,us-central1-b,us-central1-c,us-central1-f",
"PROJECT_ID": "YOUR_PROJECT_HERE",
"REQUESTER_PROJECT": "YOUR_PROJECT_HERE",
"EMAIL": "YOUR_EMAIL_HERE"
}
1 change: 1 addition & 0 deletions examples/tn_example.json
@@ -7,6 +7,7 @@
"OUTPUT_BUCKET": "YOUR_BUCKET_HERE",
"ZONES": "us-central1-a,us-central1-b,us-central1-c,us-central1-f",
"PROJECT_ID": "YOUR_PROJECT_HERE",
"REQUESTER_PROJECT": "YOUR_PROJECT_HERE",
"PIPELINE": "SOMATIC",
"CALLING_ALGO": "TNhaplotyper",
"EMAIL": "YOUR_EMAIL_HERE"
30 changes: 7 additions & 23 deletions pipeline_scripts/Dockerfile
@@ -1,17 +1,4 @@
FROM python:2.7.16-slim-stretch as downloader

# Install gsutil
RUN apt-get update && \
apt-get install -y \
curl \
python-pip \
gcc \
lsb-release && \
export CLOUD_SDK_REPO="cloud-sdk-$(lsb_release -c -s)" && \
echo "deb http://packages.cloud.google.com/apt $CLOUD_SDK_REPO main" | tee -a /etc/apt/sources.list.d/google-cloud-sdk.list && \
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add - && \
apt-get update && apt-get install -y google-cloud-sdk && \
pip install crcmod
FROM google/cloud-sdk:333.0.0-slim as downloader

# Install samtools
RUN apt-get update && \
@@ -35,15 +22,10 @@ RUN curl -Lo samblaster-v.0.1.24.tar.gz https://github.com/GregoryFaust/samblast
make && \
cp samblaster /usr/local/bin/

# Install metadata script dependencies
RUN pip install requests urllib3
FROM google/cloud-sdk:333.0.0-slim

FROM python:2.7.16-slim-stretch
LABEL container.base.image="google/cloud-sdk:333.0.0-slim"

LABEL container.base.image="python:2.7.16-slim-stretch"

COPY --from=downloader /usr/lib/google-cloud-sdk /usr/lib/google-cloud-sdk
COPY --from=downloader /usr/local/lib/python2.7/site-packages /usr/local/lib/python2.7/site-packages
COPY --from=downloader /usr/local/bin/samtools /usr/local/bin
COPY --from=downloader /usr/local/bin/samblaster /usr/local/bin

@@ -54,7 +36,9 @@ RUN apt-get update && apt-get install -y \
libncurses5-dev \
bc \
dnsutils \
iputils-ping && \
ln -s ../lib/google-cloud-sdk/bin/gsutil /usr/bin/gsutil
iputils-ping

# Install metadata script dependencies
RUN pip3 install requests urllib3

ADD gc_functions.sh gc_somatic.sh gc_germline.sh gc_ccdg_germline.sh gen_credentials.py /opt/sentieon/
30 changes: 15 additions & 15 deletions pipeline_scripts/gc_functions.sh
@@ -45,7 +45,7 @@ transfer()
src_file=$1
dst_file=$2
start_s=`date +%s`
gsutil cp "$src_file" "$dst_file"
gsutil ${REQUESTER_PROJECT:+-u $REQUESTER_PROJECT} cp "$src_file" "$dst_file"
check_error $? "Transfer $src_file to $dst_file"
end_s=`date +%s`
runtime=$(delta_time $start_s $end_s)
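The `${REQUESTER_PROJECT:+-u $REQUESTER_PROJECT}` idiom added here is standard shell parameter expansion: `${VAR:+word}` expands to `word` only when `VAR` is set and non-empty, so configs without a requester project keep the plain `gsutil cp` behavior. A standalone illustration:

```shell
# ${VAR:+word} expands to nothing when VAR is unset or empty...
unset REQUESTER_PROJECT
echo gsutil ${REQUESTER_PROJECT:+-u $REQUESTER_PROJECT} cp src dst
# → gsutil cp src dst

# ...and to "word" (with $REQUESTER_PROJECT substituted) when it is set.
REQUESTER_PROJECT="my-project"
echo gsutil ${REQUESTER_PROJECT:+-u $REQUESTER_PROJECT} cp src dst
# → gsutil -u my-project cp src dst
```

Because the expansion collapses to nothing, no stray empty argument ever reaches gsutil.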
@@ -65,9 +65,9 @@ transfer_all_sites()
local_sites+=("$local_file")
local_str+=" -k \"$local_file\" "
# Index
if $(test -e "${src_file}".idx) || $(gsutil -q stat "${src_file}".idx); then
if $(test -e "${src_file}".idx) || $(gsutil ${REQUESTER_PROJECT:+-u $REQUESTER_PROJECT} -q stat "${src_file}".idx); then
idx="${src_file}".idx
elif $(test -e "${src_file}".tbi) || $(gsutil -q stat "${src_file}".tbi); then
elif $(test -e "${src_file}".tbi) || $(gsutil ${REQUESTER_PROJECT:+-u $REQUESTER_PROJECT} -q stat "${src_file}".tbi); then
idx="${src_file}".tbi
else
echo "Cannot find idx for $src_file"
@@ -107,7 +107,7 @@ unset_none_variables()
gc_setup()
{
## Download the Sentieon software
curl -L https://s3.amazonaws.com/sentieon-release/software/sentieon-genomics-${SENTIEON_VERSION}.tar.gz | tar -zxf - -C /opt/sentieon
curl -L https://sentieon-release.s3.amazonaws.com/software/sentieon-genomics-${SENTIEON_VERSION}.tar.gz | tar -zxf - -C /opt/sentieon
PATH=/opt/sentieon/sentieon-genomics-${SENTIEON_VERSION}/bin:$PATH

## Dirs
@@ -134,7 +134,7 @@ gc_setup()
## Setup license information #
cred=$license_dir/credentials.json
project_file=$license_dir/credentials.json.project
python /opt/sentieon/gen_credentials.py ${EMAIL:+--email $EMAIL} $cred "$SENTIEON_KEY"
python3 /opt/sentieon/gen_credentials.py ${EMAIL:+--email $EMAIL} $cred "$SENTIEON_KEY"
sleep 10
if [[ -n $SENTIEON_KEY ]]; then
export SENTIEON_AUTH_MECH=proxy_GOOGLE
@@ -179,9 +179,9 @@ download_bams()
for bam in "${bams[@]}"; do
local_bam=$download_input_dir/$(basename "$bam")
transfer "$bam" "$local_bam"
if $(test -e "${bam}".bai) || $(gsutil -q stat "${bam}".bai); then
if $(test -e "${bam}".bai) || $(gsutil ${REQUESTER_PROJECT:+-u $REQUESTER_PROJECT} -q stat "${bam}".bai); then
bai="${bam}".bai
elif $(test -e "${bam%%.bam}".bai) || $(gsutil -q stat "${bam%%.bam}".bai); then
elif $(test -e "${bam%%.bam}".bai) || $(gsutil ${REQUESTER_PROJECT:+-u $REQUESTER_PROJECT} -q stat "${bam%%.bam}".bai); then
bai="${bam%%.bam}".bai
else
echo "Cannot find the index file for $bam"
@@ -216,20 +216,20 @@ download_reference()
ref=$ref_dir/$(basename "$REF")
transfer "$REF" "$ref"
transfer "${REF}".fai "${ref}".fai
if $(test -e "${REF}".dict) || $(gsutil -q stat "${REF}".dict); then
if $(test -e "${REF}".dict) || $(gsutil ${REQUESTER_PROJECT:+-u $REQUESTER_PROJECT} -q stat "${REF}".dict); then
transfer "${REF}".dict "${ref}".dict
elif $(test -e "${REF%%.fa}".dict) || $(gsutil -q stat "${REF%%.fa}".dict); then
elif $(test -e "${REF%%.fa}".dict) || $(gsutil ${REQUESTER_PROJECT:+-u $REQUESTER_PROJECT} -q stat "${REF%%.fa}".dict); then
transfer "${REF%%.fa}".dict "${ref%%.fa}".dict
elif $(test -e "${REF%%.fasta}".dict) || $(gsutil -q stat "${REF%%.fasta}".dict); then
elif $(test -e "${REF%%.fasta}".dict) || $(gsutil ${REQUESTER_PROJECT:+-u $REQUESTER_PROJECT} -q stat "${REF%%.fasta}".dict); then
transfer "${REF%%.fasta}".dict "${ref%%.fasta}".dict
else
echo "Cannot find reference dictionary"
exit 1
fi
if [[ -n "$FQ1" || -n "$TUMOR_FQ1" ]]; then
if $(test -e "${REF}".64.amb) || $(gsutil -q stat "${REF}".64.amb); then
if $(test -e "${REF}".64.amb) || $(gsutil ${REQUESTER_PROJECT:+-u $REQUESTER_PROJECT} -q stat "${REF}".64.amb); then
middle=".64"
elif $(test -e "${REF}".amb) || $(gsutil -q stat "${REF}".amb); then
elif $(test -e "${REF}".amb) || $(gsutil ${REQUESTER_PROJECT:+-u $REQUESTER_PROJECT} -q stat "${REF}".amb); then
middle=""
else
echo "Cannot find BWA index files"
Expand All @@ -240,7 +240,7 @@ download_reference()
transfer "${REF}"${middle}.bwt "${ref}"${middle}.bwt
transfer "${REF}"${middle}.pac "${ref}"${middle}.pac
transfer "${REF}"${middle}.sa "${ref}"${middle}.sa
if $(test -e "${REF}"${middle}.alt) || $(gsutil -q stat "${REF}"${middle}.alt); then
if $(test -e "${REF}"${middle}.alt) || $(gsutil ${REQUESTER_PROJECT:+-u $REQUESTER_PROJECT} -q stat "${REF}"${middle}.alt); then
transfer "${REF}"${middle}.alt "${ref}"${middle}.alt
fi
fi
@@ -275,9 +275,9 @@ bwa_mem_align()
readgroup=${fun_rgs[$i]}
bwa_cmd="$release_dir/bin/bwa mem ${fun_bwa_xargs} -R \"${readgroup}\" -t $nt \"$ref\" "
if [[ -n "$STREAM_INPUT" ]]; then
bwa_cmd="$bwa_cmd <(gsutil cp $fq1 -) "
bwa_cmd="$bwa_cmd <(gsutil ${REQUESTER_PROJECT:+-u $REQUESTER_PROJECT} cp $fq1 -) "
if [[ -n "$fq2" ]]; then
bwa_cmd="$bwa_cmd <(gsutil cp $fq2 -) "
bwa_cmd="$bwa_cmd <(gsutil ${REQUESTER_PROJECT:+-u $REQUESTER_PROJECT} cp $fq2 -) "
fi
else
local_fq1=$input_dir/$(basename "$fq1")
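When `STREAM_INPUT` is set, bwa never sees a fastq on disk: `<(gsutil cp $fq1 -)` is bash process substitution, which hands bwa a file-like path backed by the gsutil stream. A local stand-in (using `cat` in place of `gsutil cp ... -`, since the bucket isn't reachable here) shows the mechanics:

```shell
# A four-line fastq record standing in for a remote gs://... input.
reads=$(mktemp)
printf '@r1\nACGT\n+\nFFFF\n' > "$reads"

# The consumer (wc here, bwa mem in the pipeline) reads the substituted
# "file", which is really the producer's stdout.
wc -l < <(cat "$reads")

rm -f "$reads"
```

`wc` counts the four fastq lines, exactly as if it had been given a regular file, while the data is never materialized by the consumer itself.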
4 changes: 2 additions & 2 deletions pipeline_scripts/gc_germline.sh
@@ -13,14 +13,14 @@ environmental_variables=(FQ1 FQ2 BAM OUTPUT_BUCKET REF READGROUP DEDUP \
BQSR_SITES DBSNP INTERVAL INTERVAL_FILE NO_METRICS NO_BAM_OUTPUT \
NO_HAPLOTYPER GVCF_OUTPUT STREAM_INPUT PIPELINE OUTPUT_CRAM_FORMAT \
SENTIEON_KEY RECALIBRATED_OUTPUT EMAIL SENTIEON_VERSION CALLING_ARGS \
DNASCOPE_MODEL CALLING_ALGO)
DNASCOPE_MODEL CALLING_ALGO REQUESTER_PROJECT)
unset_none_variables ${environmental_variables[@]}
OUTPUT_CRAM_FORMAT="" # Not yet supported

readonly FQ1 FQ2 BAM OUTPUT_BUCKET REF READGROUP DEDUP BQSR_SITES DBSNP \
INTERVAL INTERVAL_FILE NO_METRICS NO_BAM_OUTPUT NO_HAPLOTYPER GVCF_OUTPUT \
STREAM_INPUT PIPELINE OUTPUT_CRAM_FORMAT SENTIEON_KEY RECALIBRATED_OUTPUT \
EMAIL SENTIEON_VERSION CALLING_ARGS DNASCOPE_MODEL CALLING_ALGO
EMAIL SENTIEON_VERSION CALLING_ARGS DNASCOPE_MODEL CALLING_ALGO REQUESTER_PROJECT

release_dir="/opt/sentieon/sentieon-genomics-${SENTIEON_VERSION}/"

6 changes: 3 additions & 3 deletions pipeline_scripts/gc_somatic.sh
@@ -13,7 +13,7 @@ environmental_variables=(FQ1 FQ2 TUMOR_FQ1 TUMOR_FQ2 BAM TUMOR_BAM \
OUTPUT_BUCKET REF READGROUP TUMOR_READGROUP DEDUP BQSR_SITES DBSNP \
INTERVAL INTERVAL_FILE NO_METRICS NO_BAM_OUTPUT NO_VCF RUN_TNSNV \
STREAM_INPUT PIPELINE REALIGN_SITES OUTPUT_CRAM_FORMAT SENTIEON_KEY \
EMAIL SENTIEON_VERSION CALLING_ARGS CALLING_ALGO)
EMAIL SENTIEON_VERSION CALLING_ARGS CALLING_ALGO REQUESTER_PROJECT)
unset_none_variables ${environmental_variables[@]}
OUTPUT_CRAM_FORMAT="" # Not yet supported

@@ -125,9 +125,9 @@ done
# Detect the tumor and normal sample names
normal_sample=""
if [[ -f ${local_bams[0]} ]]; then
normal_sample=$(samtools view -H ${local_bams[0]} | grep "^@RG" | head -n 1 | sed 's/^.*SM:\(.*\) .*$/\1/')
normal_sample=$(samtools view -H ${local_bams[0]} | grep "^@RG" | head -n 1 | sed 's/^.*SM:\([^ ]*\).*$/\1/')
fi
tumor_sample=$(samtools view -H ${tumor_bams[0]} | grep "^@RG" | head -n 1 | sed 's/^.*SM:\(.*\) .*$/\1/')
tumor_sample=$(samtools view -H ${tumor_bams[0]} | grep "^@RG" | head -n 1 | sed 's/^.*SM:\([^ ]*\).*$/\1/')

# ******************************************
# 2. Metrics command
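The `sed` change above tightens the sample-name capture: the old `\(.*\) ` pattern required a literal space after the `SM:` value, so it missed headers where `SM:` is the last tag. A quick check with a made-up header:

```shell
# Tab-separated @RG line with SM: as the final tag -- the case the old
# greedy pattern could not match.
printf '@RG\tID:rg1\tSM:sample1' | sed 's/^.*SM:\([^ ]*\).*$/\1/'
# → sample1
```

Note that `[^ ]` excludes only literal spaces, so the capture relies on `SM:` being either space-delimited or the last field on the line.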
31 changes: 19 additions & 12 deletions pipeline_scripts/gen_credentials.py
@@ -9,20 +9,27 @@
import os

audience = "https://sentieon.com"
headers = {'Metadata-Flavor': 'Google'}
headers = {"Metadata-Flavor": "Google"}
request_format = "full"
metadata_url = ("http://metadata.google.internal/computeMetadata/v1/instance/"
"service-accounts/default/identity?audience={}&format={}")
project_url = ("http://metadata.google.internal/computeMetadata/v1/project/"
"project-id?format={}")
metadata_url = (
"http://metadata.google.internal/computeMetadata/v1/instance/"
"service-accounts/default/identity?audience={}&format={}"
)
project_url = (
"http://metadata.google.internal/computeMetadata/v1/project/"
"project-id?format={}"
)


def process_args():
parser = argparse.ArgumentParser(description="Write fresh instance "
"metadata credentials to a file for "
"license authentication")
parser.add_argument("auth_data_file", help="A file to hold the instance "
"metadata JWT")
parser = argparse.ArgumentParser(
description="Write fresh instance "
"metadata credentials to a file for "
"license authentication"
)
parser.add_argument(
"auth_data_file", help="A file to hold the instance " "metadata JWT"
)
parser.add_argument("sentieon_key", help="A license key string")
parser.add_argument("--email", help="An email associated with the license")
return parser.parse_args()
@@ -43,7 +50,7 @@ def main(args):
url = project_url.format(request_format)
response = requests.get(url, headers=headers)
project_id = response.text
with open(args.auth_data_file + ".project", 'w') as f:
with open(args.auth_data_file + ".project", "w") as f:
print(project_id, file=f)

url = metadata_url.format(audience, request_format)
@@ -69,7 +76,7 @@ def main(args):
out["license_key"] = args.sentieon_key
if args.email:
out["email"] = args.email
with open(args.auth_data_file, 'w') as f:
with open(args.auth_data_file, "w") as f:
json.dump(out, f)
# sleep for 55 minutes before refreshing the token or until killed
time.sleep(55 * 60)
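`gen_credentials.py` talks to the GCE metadata server, so it only works from inside an instance. The request it issues can be sketched in shell (the `curl` line is an equivalent for illustration, not code from the script):

```shell
# Assemble the identity-token URL as gen_credentials.py formats it
# (audience and format values come from the script above).
audience="https://sentieon.com"
fmt="full"
url="http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/identity?audience=${audience}&format=${fmt}"
echo "$url"

# From inside a GCE instance (not reachable elsewhere):
#   curl -s -H 'Metadata-Flavor: Google' "$url"
```

The `Metadata-Flavor: Google` header is mandatory; the metadata server rejects requests without it.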
5 changes: 4 additions & 1 deletion runner/ccdg.yaml
@@ -24,7 +24,7 @@ inputParameters:
defaultValue: None
- name: SENTIEON_VERSION
description: Version of the Sentieon software to use
defaultValue: 201808.07
defaultValue: 201911
- name: READGROUP
description: Readgroup information to add during alignment
defaultValue: "@RG\\tID:read-group\\tSM:sample-name\\tPL:ILLUMINA"
@@ -70,3 +70,6 @@ inputParameters:
- name: CALLING_ALGO
description: The variant calling algorithm to use
defaultValue: Haplotyper
- name: REQUESTER_PROJECT
description: The requester project to use for gsutil requests on the remote server
defaultValue: None