Merge pull request #2 from Sentieon/dev
Merge updates in the Dev branch to Master
DonFreed authored Mar 26, 2021
2 parents c2da855 + 902c276 commit 5426fa2
Showing 16 changed files with 425 additions and 255 deletions.
38 changes: 22 additions & 16 deletions README.md
@@ -106,38 +106,42 @@ The runner script accepts a JSON file as input. In the repository you downloaded
"FQ1": "gs://sentieon-test/pipeline_test/inputs/test1_1.fastq.gz",
"FQ2": "gs://sentieon-test/pipeline_test/inputs/test1_2.fastq.gz",
"REF": "gs://sentieon-test/pipeline_test/reference/hs37d5.fa",
"OUTPUT_BUCKET": "gs://BUCKET",
"OUTPUT_BUCKET": "YOUR_BUCKET_HERE",
"ZONES": "us-central1-a,us-central1-b,us-central1-c,us-central1-f",
"PROJECT_ID": "PROJECT_ID",
"EMAIL": "EMAIL"
"PROJECT_ID": "YOUR_PROJECT_HERE",
"REQUESTER_PROJECT": "YOUR_PROJECT_HERE",
"EMAIL": "YOUR_EMAIL_HERE"
}
```

The following table describes the JSON keys in the file:

| JSON key | Description |
| ------------- | ----------------------------------------------------------------------------- |
| FQ1 | The first pair of reads in the input fastq file. |
| FQ2 | The second pair of reads in the input fastq file. |
| BAM | The input BAM file, if applicable. |
| REF | The reference genome. If set, the reference index files are assumed to exist. |
| OUTPUT_BUCKET | The bucket and directory used to store the data output from the pipeline. |
| ZONES | A comma-separated list of GCP zones to use for the worker node. |
| PROJECT_ID | Your GCP project ID. |
| EMAIL | Your email |
| JSON key | Description |
| ----------------- | ----------------------------------------------------------------------------- |
| FQ1               | The fastq file containing the first read of each pair.                        |
| FQ2               | The fastq file containing the second read of each pair.                       |
| BAM | The input BAM file, if applicable. |
| REF | The reference genome. If set, the reference index files are assumed to exist. |
| OUTPUT_BUCKET | The bucket and directory used to store the data output from the pipeline. |
| ZONES | A comma-separated list of GCP zones to use for the worker node. |
| PROJECT_ID | Your GCP project ID. |
| REQUESTER_PROJECT | A project to bill when transferring data from Requester Pays buckets. |
| EMAIL             | Your email address.                                                           |

The `FQ1`, `FQ2`, `REF`, and `ZONES` fields will work with the defaults. However, the `OUTPUT_BUCKET`, `PROJECT_ID`, and `EMAIL` fields will need to be updated to point to your specific output bucket/path, Project ID, and email address.
The `FQ1`, `FQ2`, `REF`, and `ZONES` fields will work with the defaults. However, the `OUTPUT_BUCKET`, `PROJECT_ID`, `REQUESTER_PROJECT`, and `EMAIL` fields need to be updated to point to your specific output bucket/path, project IDs, and email address.
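Since several values are placeholders, a quick pre-flight check can save a failed launch. A minimal sketch (the `check_placeholders` helper and the placeholder pattern are illustrative, not part of the runner):

```shell
# Illustrative helper: fail if any YOUR_..._HERE placeholder survives
# in a run configuration file.
check_placeholders() {
    if grep -q 'YOUR_[A-Z]*_HERE' "$1"; then
        echo "placeholders remain in $1" >&2
        return 1
    fi
    return 0
}

# Typical use before launching:
#   check_placeholders examples/example.json || exit 1
```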

<a name="run"/>

### Run the example pipelines

Edit the `OUTPUT_BUCKET`, `PROJECT_ID`, and `EMAIL` fields in the `examples/example.json` to your output bucket/path, the GCP Project ID that you setup earlier, and email you want associated with your Sentieon license. By supplying the `EMAIL` field, your PROJECT_ID will automatically receive a 14 day free trial for the Sentieon software on the Google Cloud.
Edit the `OUTPUT_BUCKET`, `PROJECT_ID`, `REQUESTER_PROJECT`, and `EMAIL` fields in `examples/example.json` to your output bucket/path, the GCP Project ID that you set up earlier, and the email you want associated with your Sentieon license. By supplying the `EMAIL` field, your PROJECT_ID will automatically receive a 14-day free trial of the Sentieon software on Google Cloud.

After modifying the `examples/example.json` file, you can use the following command to run the DNAseq pipeline on a small test dataset.
```bash
python runner/sentieon_runner.py examples/example.json
python runner/sentieon_runner.py --requester_project $PROJECT_ID examples/example.json
```
The `--requester_project` argument configures the software to bill the specified PROJECT_ID when polling the input files locally. Alternatively, you can pass `--no_check_inputs_exist` to skip input file polling.


<a name="understand"/>

@@ -278,6 +282,7 @@ The `CALLING_ALGO` key can be changed to `TNsnv`, `TNhaplotyper`, `TNhaplotyp
| EMAIL | An email address to use to obtain an evaluation license for your GCP Project |
| SENTIEON_KEY | Your Sentieon license key (only applicable for paying customers) |
| PROJECT_ID | Your GCP Project ID to use when running jobs |
| REQUESTER_PROJECT | A project to bill when transferring data from Requester Pays buckets |
| PREEMPTIBLE_TRIES | Number of attempts to run the pipeline using preemptible instances |
| NONPREEMPTIBLE_TRY | After `PREEMPTIBLE_TRIES` are exhausted, whether to try one additional run with standard instances |

@@ -343,6 +348,7 @@ The `CALLING_ALGO` key can be changed to `TNsnv`, `TNhaplotyper`, `TNhaplotyp
| EMAIL | An email address to use to obtain an evaluation license for your GCP Project |
| SENTIEON_KEY | Your Sentieon license key (only applicable for paying customers) |
| PROJECT_ID | Your GCP Project ID to use when running jobs |
| REQUESTER_PROJECT | A project to bill when transferring data from Requester Pays buckets |
| PREEMPTIBLE_TRIES | Number of attempts to run the pipeline using preemptible instances |
| NONPREEMPTIBLE_TRY | After `PREEMPTIBLE_TRIES` are exhausted, whether to try one additional run with standard instances |

1 change: 1 addition & 0 deletions examples/100x_wes.json
@@ -9,5 +9,6 @@
"STREAM_INPUT": "True",
"ZONES": "us-central1-a,us-central1-b,us-central1-c,us-central1-f",
"PROJECT_ID": "YOUR_PROJECT_HERE",
"REQUESTER_PROJECT": "YOUR_PROJECT_HERE",
"EMAIL": "YOUR_EMAIL_HERE"
}
1 change: 1 addition & 0 deletions examples/30x_wgs.json
@@ -9,5 +9,6 @@
"STREAM_INPUT": "True",
"ZONES": "us-central1-a,us-central1-b,us-central1-c,us-central1-f",
"PROJECT_ID": "YOUR_PROJECT_HERE",
"REQUESTER_PROJECT": "YOUR_PROJECT_HERE",
"EMAIL": "YOUR_EMAIL_HERE"
}
1 change: 1 addition & 0 deletions examples/30x_wgs_ccdg.json
@@ -9,5 +9,6 @@
"STREAM_INPUT": "True",
"ZONES": "us-central1-a,us-central1-b,us-central1-c,us-central1-f",
"PROJECT_ID": "YOUR_PROJECT_HERE",
"REQUESTER_PROJECT": "YOUR_PROJECT_HERE",
"EMAIL": "YOUR_EMAIL_HERE"
}
1 change: 1 addition & 0 deletions examples/example.json
@@ -5,5 +5,6 @@
"OUTPUT_BUCKET": "YOUR_BUCKET_HERE",
"ZONES": "us-central1-a,us-central1-b,us-central1-c,us-central1-f",
"PROJECT_ID": "YOUR_PROJECT_HERE",
"REQUESTER_PROJECT": "YOUR_PROJECT_HERE",
"EMAIL": "YOUR_EMAIL_HERE"
}
1 change: 1 addition & 0 deletions examples/tn_example.json
@@ -7,6 +7,7 @@
"OUTPUT_BUCKET": "YOUR_BUCKET_HERE",
"ZONES": "us-central1-a,us-central1-b,us-central1-c,us-central1-f",
"PROJECT_ID": "YOUR_PROJECT_HERE",
"REQUESTER_PROJECT": "YOUR_PROJECT_HERE",
"PIPELINE": "SOMATIC",
"CALLING_ALGO": "TNhaplotyper",
"EMAIL": "YOUR_EMAIL_HERE"
30 changes: 7 additions & 23 deletions pipeline_scripts/Dockerfile
@@ -1,17 +1,4 @@
FROM python:2.7.16-slim-stretch as downloader

# Install gsutil
RUN apt-get update && \
apt-get install -y \
curl \
python-pip \
gcc \
lsb-release && \
export CLOUD_SDK_REPO="cloud-sdk-$(lsb_release -c -s)" && \
echo "deb http://packages.cloud.google.com/apt $CLOUD_SDK_REPO main" | tee -a /etc/apt/sources.list.d/google-cloud-sdk.list && \
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add - && \
apt-get update && apt-get install -y google-cloud-sdk && \
pip install crcmod
FROM google/cloud-sdk:333.0.0-slim as downloader

# Install samtools
RUN apt-get update && \
@@ -35,15 +22,10 @@ RUN curl -Lo samblaster-v.0.1.24.tar.gz https://github.com/GregoryFaust/samblast
make && \
cp samblaster /usr/local/bin/

# Install metadata script dependencies
RUN pip install requests urllib3
FROM google/cloud-sdk:333.0.0-slim

FROM python:2.7.16-slim-stretch
LABEL container.base.image="google/cloud-sdk:333.0.0-slim"

LABEL container.base.image="python:2.7.16-slim-stretch"

COPY --from=downloader /usr/lib/google-cloud-sdk /usr/lib/google-cloud-sdk
COPY --from=downloader /usr/local/lib/python2.7/site-packages /usr/local/lib/python2.7/site-packages
COPY --from=downloader /usr/local/bin/samtools /usr/local/bin
COPY --from=downloader /usr/local/bin/samblaster /usr/local/bin

@@ -54,7 +36,9 @@ RUN apt-get update && apt-get install -y \
libncurses5-dev \
bc \
dnsutils \
iputils-ping && \
ln -s ../lib/google-cloud-sdk/bin/gsutil /usr/bin/gsutil
iputils-ping

# Install metadata script dependencies
RUN pip3 install requests urllib3

ADD gc_functions.sh gc_somatic.sh gc_germline.sh gc_ccdg_germline.sh gen_credentials.py /opt/sentieon/
30 changes: 15 additions & 15 deletions pipeline_scripts/gc_functions.sh
@@ -45,7 +45,7 @@ transfer()
src_file=$1
dst_file=$2
start_s=`date +%s`
gsutil cp "$src_file" "$dst_file"
gsutil ${REQUESTER_PROJECT:+-u $REQUESTER_PROJECT} cp "$src_file" "$dst_file"
check_error $? "Transfer $src_file to $dst_file"
end_s=`date +%s`
runtime=$(delta_time $start_s $end_s)
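The `${REQUESTER_PROJECT:+-u $REQUESTER_PROJECT}` idiom added here is standard shell parameter expansion: `${VAR:+word}` expands to `word` only when `VAR` is set and non-empty, so configs without a requester project keep the plain `gsutil cp` behavior. A standalone illustration:

```shell
# ${VAR:+word} expands to nothing when VAR is unset or empty...
unset REQUESTER_PROJECT
echo gsutil ${REQUESTER_PROJECT:+-u $REQUESTER_PROJECT} cp src dst
# → gsutil cp src dst

# ...and to "word" (with $REQUESTER_PROJECT substituted) when it is set.
REQUESTER_PROJECT="my-project"
echo gsutil ${REQUESTER_PROJECT:+-u $REQUESTER_PROJECT} cp src dst
# → gsutil -u my-project cp src dst
```

Because the expansion collapses to nothing, no stray empty argument ever reaches gsutil.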
@@ -65,9 +65,9 @@ transfer_all_sites()
local_sites+=("$local_file")
local_str+=" -k \"$local_file\" "
# Index
if $(test -e "${src_file}".idx) || $(gsutil -q stat "${src_file}".idx); then
if $(test -e "${src_file}".idx) || $(gsutil ${REQUESTER_PROJECT:+-u $REQUESTER_PROJECT} -q stat "${src_file}".idx); then
idx="${src_file}".idx
elif $(test -e "${src_file}".tbi) || $(gsutil -q stat "${src_file}".tbi); then
elif $(test -e "${src_file}".tbi) || $(gsutil ${REQUESTER_PROJECT:+-u $REQUESTER_PROJECT} -q stat "${src_file}".tbi); then
idx="${src_file}".tbi
else
echo "Cannot find idx for $src_file"
@@ -107,7 +107,7 @@ unset_none_variables()
gc_setup()
{
## Download the Sentieon software
curl -L https://s3.amazonaws.com/sentieon-release/software/sentieon-genomics-${SENTIEON_VERSION}.tar.gz | tar -zxf - -C /opt/sentieon
curl -L https://sentieon-release.s3.amazonaws.com/software/sentieon-genomics-${SENTIEON_VERSION}.tar.gz | tar -zxf - -C /opt/sentieon
PATH=/opt/sentieon/sentieon-genomics-${SENTIEON_VERSION}/bin:$PATH

## Dirs
@@ -134,7 +134,7 @@ gc_setup()
## Setup license information #
cred=$license_dir/credentials.json
project_file=$license_dir/credentials.json.project
python /opt/sentieon/gen_credentials.py ${EMAIL:+--email $EMAIL} $cred "$SENTIEON_KEY"
python3 /opt/sentieon/gen_credentials.py ${EMAIL:+--email $EMAIL} $cred "$SENTIEON_KEY"
sleep 10
if [[ -n $SENTIEON_KEY ]]; then
export SENTIEON_AUTH_MECH=proxy_GOOGLE
@@ -179,9 +179,9 @@ download_bams()
for bam in "${bams[@]}"; do
local_bam=$download_input_dir/$(basename "$bam")
transfer "$bam" "$local_bam"
if $(test -e "${bam}".bai) || $(gsutil -q stat "${bam}".bai); then
if $(test -e "${bam}".bai) || $(gsutil ${REQUESTER_PROJECT:+-u $REQUESTER_PROJECT} -q stat "${bam}".bai); then
bai="${bam}".bai
elif $(test -e "${bam%%.bam}".bai) || $(gsutil -q stat "${bam%%.bam}".bai); then
elif $(test -e "${bam%%.bam}".bai) || $(gsutil ${REQUESTER_PROJECT:+-u $REQUESTER_PROJECT} -q stat "${bam%%.bam}".bai); then
bai="${bam%%.bam}".bai
else
echo "Cannot find the index file for $bam"
@@ -216,20 +216,20 @@ download_reference()
ref=$ref_dir/$(basename "$REF")
transfer "$REF" "$ref"
transfer "${REF}".fai "${ref}".fai
if $(test -e "${REF}".dict) || $(gsutil -q stat "${REF}".dict); then
if $(test -e "${REF}".dict) || $(gsutil ${REQUESTER_PROJECT:+-u $REQUESTER_PROJECT} -q stat "${REF}".dict); then
transfer "${REF}".dict "${ref}".dict
elif $(test -e "${REF%%.fa}".dict) || $(gsutil -q stat "${REF%%.fa}".dict); then
elif $(test -e "${REF%%.fa}".dict) || $(gsutil ${REQUESTER_PROJECT:+-u $REQUESTER_PROJECT} -q stat "${REF%%.fa}".dict); then
transfer "${REF%%.fa}".dict "${ref%%.fa}".dict
elif $(test -e "${REF%%.fasta}".dict) || $(gsutil -q stat "${REF%%.fasta}".dict); then
elif $(test -e "${REF%%.fasta}".dict) || $(gsutil ${REQUESTER_PROJECT:+-u $REQUESTER_PROJECT} -q stat "${REF%%.fasta}".dict); then
transfer "${REF%%.fasta}".dict "${ref%%.fasta}".dict
else
echo "Cannot find reference dictionary"
exit 1
fi
if [[ -n "$FQ1" || -n "$TUMOR_FQ1" ]]; then
if $(test -e "${REF}".64.amb) || $(gsutil -q stat "${REF}".64.amb); then
if $(test -e "${REF}".64.amb) || $(gsutil ${REQUESTER_PROJECT:+-u $REQUESTER_PROJECT} -q stat "${REF}".64.amb); then
middle=".64"
elif $(test -e "${REF}".amb) || $(gsutil -q stat "${REF}".amb); then
elif $(test -e "${REF}".amb) || $(gsutil ${REQUESTER_PROJECT:+-u $REQUESTER_PROJECT} -q stat "${REF}".amb); then
middle=""
else
echo "Cannot find BWA index files"
Expand All @@ -240,7 +240,7 @@ download_reference()
transfer "${REF}"${middle}.bwt "${ref}"${middle}.bwt
transfer "${REF}"${middle}.pac "${ref}"${middle}.pac
transfer "${REF}"${middle}.sa "${ref}"${middle}.sa
if $(test -e "${REF}"${middle}.alt) || $(gsutil -q stat "${REF}"${middle}.alt); then
if $(test -e "${REF}"${middle}.alt) || $(gsutil ${REQUESTER_PROJECT:+-u $REQUESTER_PROJECT} -q stat "${REF}"${middle}.alt); then
transfer "${REF}"${middle}.alt "${ref}"${middle}.alt
fi
fi
@@ -275,9 +275,9 @@ bwa_mem_align()
readgroup=${fun_rgs[$i]}
bwa_cmd="$release_dir/bin/bwa mem ${fun_bwa_xargs} -R \"${readgroup}\" -t $nt \"$ref\" "
if [[ -n "$STREAM_INPUT" ]]; then
bwa_cmd="$bwa_cmd <(gsutil cp $fq1 -) "
bwa_cmd="$bwa_cmd <(gsutil ${REQUESTER_PROJECT:+-u $REQUESTER_PROJECT} cp $fq1 -) "
if [[ -n "$fq2" ]]; then
bwa_cmd="$bwa_cmd <(gsutil cp $fq2 -) "
bwa_cmd="$bwa_cmd <(gsutil ${REQUESTER_PROJECT:+-u $REQUESTER_PROJECT} cp $fq2 -) "
fi
else
local_fq1=$input_dir/$(basename "$fq1")
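When `STREAM_INPUT` is set, bwa never sees a fastq on disk: `<(gsutil cp $fq1 -)` is bash process substitution, which hands bwa a file-like path backed by the gsutil stream. A local stand-in (using `cat` in place of `gsutil cp ... -`, since the bucket isn't reachable here) shows the mechanics:

```shell
# A four-line fastq record standing in for a remote gs://... input.
reads=$(mktemp)
printf '@r1\nACGT\n+\nFFFF\n' > "$reads"

# The consumer (wc here, bwa mem in the pipeline) reads the substituted
# "file", which is really the producer's stdout.
wc -l < <(cat "$reads")

rm -f "$reads"
```

`wc` counts the four fastq lines, exactly as if it had been given a regular file, while the data is never materialized by the consumer itself.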
4 changes: 2 additions & 2 deletions pipeline_scripts/gc_germline.sh
@@ -13,14 +13,14 @@ environmental_variables=(FQ1 FQ2 BAM OUTPUT_BUCKET REF READGROUP DEDUP \
BQSR_SITES DBSNP INTERVAL INTERVAL_FILE NO_METRICS NO_BAM_OUTPUT \
NO_HAPLOTYPER GVCF_OUTPUT STREAM_INPUT PIPELINE OUTPUT_CRAM_FORMAT \
SENTIEON_KEY RECALIBRATED_OUTPUT EMAIL SENTIEON_VERSION CALLING_ARGS \
DNASCOPE_MODEL CALLING_ALGO)
DNASCOPE_MODEL CALLING_ALGO REQUESTER_PROJECT)
unset_none_variables ${environmental_variables[@]}
OUTPUT_CRAM_FORMAT="" # Not yet supported

readonly FQ1 FQ2 BAM OUTPUT_BUCKET REF READGROUP DEDUP BQSR_SITES DBSNP \
INTERVAL INTERVAL_FILE NO_METRICS NO_BAM_OUTPUT NO_HAPLOTYPER GVCF_OUTPUT \
STREAM_INPUT PIPELINE OUTPUT_CRAM_FORMAT SENTIEON_KEY RECALIBRATED_OUTPUT \
EMAIL SENTIEON_VERSION CALLING_ARGS DNASCOPE_MODEL CALLING_ALGO
EMAIL SENTIEON_VERSION CALLING_ARGS DNASCOPE_MODEL CALLING_ALGO REQUESTER_PROJECT

release_dir="/opt/sentieon/sentieon-genomics-${SENTIEON_VERSION}/"

6 changes: 3 additions & 3 deletions pipeline_scripts/gc_somatic.sh
@@ -13,7 +13,7 @@ environmental_variables=(FQ1 FQ2 TUMOR_FQ1 TUMOR_FQ2 BAM TUMOR_BAM \
OUTPUT_BUCKET REF READGROUP TUMOR_READGROUP DEDUP BQSR_SITES DBSNP \
INTERVAL INTERVAL_FILE NO_METRICS NO_BAM_OUTPUT NO_VCF RUN_TNSNV \
STREAM_INPUT PIPELINE REALIGN_SITES OUTPUT_CRAM_FORMAT SENTIEON_KEY \
EMAIL SENTIEON_VERSION CALLING_ARGS CALLING_ALGO)
EMAIL SENTIEON_VERSION CALLING_ARGS CALLING_ALGO REQUESTER_PROJECT)
unset_none_variables ${environmental_variables[@]}
OUTPUT_CRAM_FORMAT="" # Not yet supported

@@ -125,9 +125,9 @@ done
# Detect the tumor and normal sample names
normal_sample=""
if [[ -f ${local_bams[0]} ]]; then
normal_sample=$(samtools view -H ${local_bams[0]} | grep "^@RG" | head -n 1 | sed 's/^.*SM:\(.*\) .*$/\1/')
normal_sample=$(samtools view -H ${local_bams[0]} | grep "^@RG" | head -n 1 | sed 's/^.*SM:\([^ ]*\).*$/\1/')
fi
tumor_sample=$(samtools view -H ${tumor_bams[0]} | grep "^@RG" | head -n 1 | sed 's/^.*SM:\(.*\) .*$/\1/')
tumor_sample=$(samtools view -H ${tumor_bams[0]} | grep "^@RG" | head -n 1 | sed 's/^.*SM:\([^ ]*\).*$/\1/')

# ******************************************
# 2. Metrics command
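The `sed` change above tightens the sample-name capture: the old `\(.*\) ` pattern required a literal space after the `SM:` value, so it missed headers where `SM:` is the last tag. A quick check with a made-up header:

```shell
# Tab-separated @RG line with SM: as the final tag -- the case the old
# greedy pattern could not match.
printf '@RG\tID:rg1\tSM:sample1' | sed 's/^.*SM:\([^ ]*\).*$/\1/'
# → sample1
```

Note that `[^ ]` excludes only literal spaces, so the capture relies on `SM:` being either space-delimited or the last field on the line.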
31 changes: 19 additions & 12 deletions pipeline_scripts/gen_credentials.py
@@ -9,20 +9,27 @@
import os

audience = "https://sentieon.com"
headers = {'Metadata-Flavor': 'Google'}
headers = {"Metadata-Flavor": "Google"}
request_format = "full"
metadata_url = ("http://metadata.google.internal/computeMetadata/v1/instance/"
"service-accounts/default/identity?audience={}&format={}")
project_url = ("http://metadata.google.internal/computeMetadata/v1/project/"
"project-id?format={}")
metadata_url = (
"http://metadata.google.internal/computeMetadata/v1/instance/"
"service-accounts/default/identity?audience={}&format={}"
)
project_url = (
"http://metadata.google.internal/computeMetadata/v1/project/"
"project-id?format={}"
)


def process_args():
parser = argparse.ArgumentParser(description="Write fresh instance "
"metadata credentials to a file for "
"license authentication")
parser.add_argument("auth_data_file", help="A file to hold the instance "
"metadata JWT")
parser = argparse.ArgumentParser(
description="Write fresh instance "
"metadata credentials to a file for "
"license authentication"
)
parser.add_argument(
"auth_data_file", help="A file to hold the instance " "metadata JWT"
)
parser.add_argument("sentieon_key", help="A license key string")
parser.add_argument("--email", help="An email associated with the license")
return parser.parse_args()
@@ -43,7 +50,7 @@ def main(args):
url = project_url.format(request_format)
response = requests.get(url, headers=headers)
project_id = response.text
with open(args.auth_data_file + ".project", 'w') as f:
with open(args.auth_data_file + ".project", "w") as f:
print(project_id, file=f)

url = metadata_url.format(audience, request_format)
@@ -69,7 +76,7 @@ def main(args):
out["license_key"] = args.sentieon_key
if args.email:
out["email"] = args.email
with open(args.auth_data_file, 'w') as f:
with open(args.auth_data_file, "w") as f:
json.dump(out, f)
# sleep for 55 minutes before refreshing the token or until killed
time.sleep(55 * 60)
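`gen_credentials.py` talks to the GCE metadata server, so it only works from inside an instance. The request it issues can be sketched in shell (the `curl` line is an equivalent for illustration, not code from the script):

```shell
# Assemble the identity-token URL as gen_credentials.py formats it
# (audience and format values come from the script above).
audience="https://sentieon.com"
fmt="full"
url="http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/identity?audience=${audience}&format=${fmt}"
echo "$url"

# From inside a GCE instance (not reachable elsewhere):
#   curl -s -H 'Metadata-Flavor: Google' "$url"
```

The `Metadata-Flavor: Google` header is mandatory; the metadata server rejects requests without it.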
5 changes: 4 additions & 1 deletion runner/ccdg.yaml
@@ -24,7 +24,7 @@ inputParameters:
defaultValue: None
- name: SENTIEON_VERSION
description: Version of the Sentieon software to use
defaultValue: 201808.07
defaultValue: 201911
- name: READGROUP
description: Readgroup information to add during alignment
defaultValue: "@RG\\tID:read-group\\tSM:sample-name\\tPL:ILLUMINA"
@@ -70,3 +70,6 @@ inputParameters:
- name: CALLING_ALGO
description: The variant calling algorithm to use
defaultValue: Haplotyper
- name: REQUESTER_PROJECT
description: The requester project to use for gsutil requests on the remote server
defaultValue: None