During the annotation phase:

1. for each interval, a subset sites-only VCF is created
2. VEP is used to annotate each interval, generating a new file
3. this new file is copied into `-tmp` storage prior to aggregation
4. component files are aggregated to create an annotated whole-genome VCF or HT
In one test run, the annotation output was found to be truncated, meaning the JSON contents could not be parsed using a fixed schema. A similar failure would also result when trying to parse VCF components. This was at a time when Storage timeouts were being observed for GCP writes, but Hail jobs were also being cancelled as a result, so it's unclear whether a write timeout or a cancelled write was the culprit.
At the moment, steps 2, 3, and 4 are skipped if the expected annotated interval file exists (checked by name only; see the sketch below). In this case, the corrupted annotated interval(s) had to be deleted for the pipeline to regenerate them.
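Roughly, the resume check today amounts to a bare existence test. A minimal sketch, assuming `google-cloud-storage` and a hypothetical `annotate_interval` helper, with illustrative bucket/path names (the real pipeline code will differ):

```python
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-output-bucket")  # illustrative name

for interval in intervals:  # `intervals` produced by the scatter step
    out_name = f"annotated/{interval}.json"
    if bucket.blob(out_name).exists():
        # skipped purely on existence: a truncated file from a failed
        # write passes this check and later breaks JSON/VCF parsing
        continue
    annotate_interval(interval, out_name)  # re-runs VEP + the copy to -tmp
```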
At this stage it would be useful to calculate a hash of the interval output following successful completion, and write it to a companion file. Generating the new annotation result would then be skippable only if:

- the file exists by name, and
- the checksum matches the value generated after successful completion of the annotation process.

If the checksum doesn't match the expected value, delete the annotation part and cancel the workflow (sketched below).
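A minimal sketch of that check, assuming the `google-crc32c` and `google-cloud-storage` packages; the `<name>.crc32c` companion convention and function names are illustrative, nothing here is decided:

```python
import base64

import google_crc32c
from google.cloud import storage


def local_crc32c(path: str) -> str:
    """Base64-encoded CRC32C of a local file, in the same format
    GCS reports for stored objects (`blob.crc32c`)."""
    checksum = google_crc32c.Checksum()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            checksum.update(chunk)
    return base64.b64encode(checksum.digest()).decode()


def interval_is_reusable(bucket: storage.Bucket, name: str) -> bool:
    """Skip re-annotation only if the part exists AND its stored
    checksum matches the companion file written on success."""
    blob = bucket.get_blob(name)               # None if the part is absent
    companion = bucket.get_blob(f"{name}.crc32c")
    if blob is None or companion is None:
        return False
    if blob.crc32c != companion.download_as_text().strip():
        blob.delete()                          # corrupt part: force regeneration
        raise RuntimeError(f"checksum mismatch for {name}, part deleted")
    return True
```

On the happy path the job would call `local_crc32c` on the VEP output and upload the result as the companion object only after the annotation exits cleanly; a truncated part from a killed job never gains a companion, so it can't be silently reused.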
Assumptions:
- a fail during annotation (generating a truncated file) would kill the job, so a checksum file would not be generated
- the checksum won't be altered between storage locally on the original node -> GCP -> processing node for calculation
- does GCP offer a checksum operation in situ? (`gsutil hash`)
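On that last question: GCS stores a CRC32C checksum on every object at write time (plus an MD5 for non-composite uploads), so it can be read in situ without downloading the object; `gsutil hash` computes the matching digests for a local file, and `gsutil stat` prints the stored ones. For example, with the Python client (object name illustrative):

```python
from google.cloud import storage

blob = storage.Client().bucket("my-bucket").get_blob("annotated/interval_0001.json")
print(blob.crc32c)    # base64 CRC32C, computed by GCS at upload time
print(blob.md5_hash)  # base64 MD5, or None for composite objects
```

Comparing against the stored CRC32C would also sidestep the second assumption: the hash never has to travel with the file between nodes.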
No implementation details decided yet