Confirm integrity of intermediate files #30

Open
MattWellie opened this issue May 15, 2022 · 0 comments

During the annotation phase:

  1. genomic intervals are generated (or re-used)
  2. for each interval, a sites-only subset VCF is created
  3. VEP is used to annotate each interval, generating a new file
  4. this new file is copied into -tmp storage prior to aggregation
  5. the component files are aggregated to create an annotated whole-genome VCF or HT

In one test run, the annotation output was truncated, meaning the JSON contents could not be parsed against a fixed schema. A similar failure would also occur when parsing VCF components. This happened at a time when Storage timeouts were being observed for GCP writes, but Hail jobs were also being cancelled as a result, so it is unclear whether a write timeout or a cancelled write was the culprit.

At the moment, steps 2, 3, and 4 are skipped if the expected annotated interval file exists (checked by name only, as in the sketch below). In the failure described above, the corrupted annotated interval(s) had to be deleted manually before the pipeline would regenerate them.
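
For context, a minimal sketch of the existing name-only check, assuming the per-interval outputs live in a GCS bucket (bucket and object names here are hypothetical, not the pipeline's actual paths):

```python
from google.cloud import storage


def interval_output_exists(bucket_name: str, blob_path: str) -> bool:
    """Current behaviour: decide whether to skip steps 2-4 purely by object name."""
    blob = storage.Client().bucket(bucket_name).blob(blob_path)
    # a truncated or corrupted object still returns True here - no integrity check
    return blob.exists()


# hypothetical usage: skip re-annotation of an interval if its output is present
# if interval_output_exists('my-tmp-bucket', 'vep/interval_0001.json.bgz'):
#     ...skip steps 2-4 for this interval...
```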

At this stage it would be useful to calculate a hash of each annotated interval file following successful completion, and write it to a companion file (see the sketch after this list). That way, regenerating the annotation result would be skipped only if:

  1. the file exists by name
  2. its checksum matches the value recorded after successful completion of the annotation process

If the checksum doesn't match the expected value, the annotation part should be deleted and the workflow cancelled.
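
A minimal sketch of what that could look like, assuming GCS storage and a hypothetical `.md5` companion object written next to each annotated interval (names, hash choice, and error handling are all placeholders, nothing here is decided):

```python
import hashlib

from google.cloud import storage


def write_companion_checksum(bucket_name: str, blob_path: str) -> None:
    """After a successful annotation job, record the output's MD5 in a companion object."""
    bucket = storage.Client().bucket(bucket_name)
    data = bucket.blob(blob_path).download_as_bytes()  # fine for a sketch; stream for large files
    bucket.blob(f'{blob_path}.md5').upload_from_string(hashlib.md5(data).hexdigest())


def can_skip_interval(bucket_name: str, blob_path: str) -> bool:
    """Proposed skip condition: the output exists by name AND its checksum matches."""
    bucket = storage.Client().bucket(bucket_name)
    blob = bucket.blob(blob_path)
    companion = bucket.blob(f'{blob_path}.md5')

    # condition 1: the file (and its companion checksum) exists by name
    if not blob.exists() or not companion.exists():
        return False

    # condition 2: the checksum matches the value recorded after successful annotation
    expected = companion.download_as_text().strip()
    actual = hashlib.md5(blob.download_as_bytes()).hexdigest()
    if actual == expected:
        return True

    # mismatch: delete the suspect annotation part and cancel the workflow
    blob.delete()
    companion.delete()
    raise RuntimeError(f'checksum mismatch for {blob_path}: part deleted, cancelling workflow')
```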

Assumptions:

  • A failure during annotation (generating a truncated file) would kill the job, so a checksum file would not be generated
  • The checksum won't change as the file moves from local storage on the original node -> GCP -> the processing node where it is recalculated
    • does GCP offer a checksum operation in situ? (gsutil hash; see the sketch below)
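
On that last point: GCS already stores a CRC32C (and, for non-composite uploads, an MD5) with every object, and `gsutil stat` / `gsutil ls -L` print them without downloading the data. A sketch of reading the stored values via the Python client and computing the matching local value (object names hypothetical):

```python
import base64
import hashlib

from google.cloud import storage


def stored_gcs_checksums(bucket_name: str, blob_path: str):
    """Read the checksums GCS already holds for an object, without downloading it."""
    blob = storage.Client().bucket(bucket_name).blob(blob_path)
    blob.reload()  # fetch object metadata only
    # crc32c is always populated; md5_hash is absent for composite (parallel-upload) objects
    return blob.crc32c, blob.md5_hash


def local_md5_b64(path: str) -> str:
    """MD5 of a local file, base64-encoded to match the GCS md5Hash representation."""
    digest = hashlib.md5()
    with open(path, 'rb') as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b''):
            digest.update(chunk)
    return base64.b64encode(digest.digest()).decode()
```

If the stored hash turns out to be sufficient, the companion-file step could potentially be replaced by comparing the local hash against the object's own metadata after upload.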

No implementation details have been decided yet.
