Confirm integrity of intermediate files #30

Open
MattWellie opened this issue May 15, 2022 · 0 comments

During the annotation phase:

  1. genomic intervals are generated (or re-used)
  2. for each interval, a sites-only subset VCF is created
  3. VEP is used to annotate each interval, generating a new file
  4. this new file is copied into -tmp storage prior to aggregation
  5. the component files are aggregated to create an annotated whole-genome VCF or HT

In one test run, the annotation output was truncated, meaning the JSON contents could not be parsed against a fixed schema. A similar failure would also occur when parsing VCF components. This happened at a time when Storage timeouts were being observed for GCP writes, but Hail jobs were also being cancelled as a result, so it is unclear whether a write timeout or a cancelled write was the culprit.

At the moment, steps 2, 3, and 4 are skipped if the expected annotated interval file exists (checked by name only, as in the sketch below). In the failure described above, the corrupted annotated interval(s) had to be deleted manually before the pipeline would regenerate them.
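
For context, a minimal sketch of the existing name-only check, assuming the per-interval outputs live in a GCS bucket (bucket and object names here are hypothetical, not the pipeline's actual paths):

```python
from google.cloud import storage


def interval_output_exists(bucket_name: str, blob_path: str) -> bool:
    """Current behaviour: decide whether to skip steps 2-4 purely by object name."""
    blob = storage.Client().bucket(bucket_name).blob(blob_path)
    # a truncated or corrupted object still returns True here - no integrity check
    return blob.exists()


# hypothetical usage: skip re-annotation of an interval if its output is present
# if interval_output_exists('my-tmp-bucket', 'vep/interval_0001.json.bgz'):
#     ...skip steps 2-4 for this interval...
```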

At this stage it would be useful to calculate a hash of each annotated interval file following successful completion, and write it to a companion file (see the sketch after this list). That way, regenerating the annotation result would be skipped only if:

  1. the file exists by name
  2. its checksum matches the value recorded after successful completion of the annotation process

If the checksum doesn't match the expected value, the annotation part should be deleted and the workflow cancelled.
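
A minimal sketch of what that could look like, assuming GCS storage and a hypothetical `.md5` companion object written next to each annotated interval (names, hash choice, and error handling are all placeholders, nothing here is decided):

```python
import hashlib

from google.cloud import storage


def write_companion_checksum(bucket_name: str, blob_path: str) -> None:
    """After a successful annotation job, record the output's MD5 in a companion object."""
    bucket = storage.Client().bucket(bucket_name)
    data = bucket.blob(blob_path).download_as_bytes()  # fine for a sketch; stream for large files
    bucket.blob(f'{blob_path}.md5').upload_from_string(hashlib.md5(data).hexdigest())


def can_skip_interval(bucket_name: str, blob_path: str) -> bool:
    """Proposed skip condition: the output exists by name AND its checksum matches."""
    bucket = storage.Client().bucket(bucket_name)
    blob = bucket.blob(blob_path)
    companion = bucket.blob(f'{blob_path}.md5')

    # condition 1: the file (and its companion checksum) exists by name
    if not blob.exists() or not companion.exists():
        return False

    # condition 2: the checksum matches the value recorded after successful annotation
    expected = companion.download_as_text().strip()
    actual = hashlib.md5(blob.download_as_bytes()).hexdigest()
    if actual == expected:
        return True

    # mismatch: delete the suspect annotation part and cancel the workflow
    blob.delete()
    companion.delete()
    raise RuntimeError(f'checksum mismatch for {blob_path}: part deleted, cancelling workflow')
```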

Assumptions:

  • A failure during annotation (generating a truncated file) would kill the job, so a checksum file would not be generated
  • The checksum won't change as the file moves from local storage on the original node -> GCP -> the processing node where it is recalculated
    • does GCP offer a checksum operation in situ? (gsutil hash; see the sketch below)
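
On that last point: GCS already stores a CRC32C (and, for non-composite uploads, an MD5) with every object, and `gsutil stat` / `gsutil ls -L` print them without downloading the data. A sketch of reading the stored values via the Python client and computing the matching local value (object names hypothetical):

```python
import base64
import hashlib

from google.cloud import storage


def stored_gcs_checksums(bucket_name: str, blob_path: str):
    """Read the checksums GCS already holds for an object, without downloading it."""
    blob = storage.Client().bucket(bucket_name).blob(blob_path)
    blob.reload()  # fetch object metadata only
    # crc32c is always populated; md5_hash is absent for composite (parallel-upload) objects
    return blob.crc32c, blob.md5_hash


def local_md5_b64(path: str) -> str:
    """MD5 of a local file, base64-encoded to match the GCS md5Hash representation."""
    digest = hashlib.md5()
    with open(path, 'rb') as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b''):
            digest.update(chunk)
    return base64.b64encode(digest.digest()).decode()
```

If the stored hash turns out to be sufficient, the companion-file step could potentially be replaced by comparing the local hash against the object's own metadata after upload.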

No implementation details have been decided yet.
