Add rename_embedded_ids.py script that updates `@RG SM:{id}` values #504
base: dev
The changes to create_test_subset.py are somewhat draft and need testing. The rename_embedded_ids.py UI may need to become a little more general to handle all the scenarios we want it to work in.
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files:
@@ Coverage Diff @@
##              dev     #504   +/- ##
Coverage   79.79%   79.79%
Files         169      169
Lines       14523    14523
Hits        11589    11589
Misses       2934     2934
LGTM thanks.
Looks good, John!
As we discussed, I think it would be good to call metamist here to make sure that our analysis entries don't go out of sync when this script is run (which could have downstream implications in pipeline runs).
Also it would be useful to add a comment on manual actions from a data management perspective (i.e. deleting the old cram files after this script is run).
Oh, also, could you look at the linting and unit tests?
aa610af to 1c37469
Turns out the unit test failure was a problem on dev that has since been fixed, and it resolved itself with a rebase onto current dev. Lint was whining about something I consider good style, sigh. But I'll probably refactor it in due course to shut the thing up.
adb5c95 to a4e6ea5
a4e6ea5 to f8eb4fe
This version of this script streams from the original blob, uses `samtools reheader` to update header lines, and streams back to the output blob. It also regenerates a new corresponding index file. (Currently assumes that it is given CRAM URLs specifically, not BAM.)
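As a rough illustration of the header-update step only (not the script's actual code), the sketch below rewrites `SM:` values on `@RG` lines of SAM header text. The function name `rename_sample_ids` and the `mapping` argument are hypothetical, and the streaming to and from blobs is deliberately left out.

```python
def rename_sample_ids(header: str, mapping: dict[str, str]) -> str:
    """Return SAM header text with @RG SM: values replaced per `mapping`.

    Hypothetical helper for illustration; IDs absent from `mapping`
    are left unchanged, as are all non-@RG lines.
    """
    out = []
    for line in header.splitlines():
        if line.startswith("@RG"):
            # Header fields are tab-separated TAG:VALUE pairs.
            fields = [
                "SM:" + mapping.get(f[3:], f[3:]) if f.startswith("SM:") else f
                for f in line.split("\t")
            ]
            line = "\t".join(fields)
        out.append(line)
    return "".join(l + "\n" for l in out)


example = (
    "@HD\tVN:1.6\n"
    "@RG\tID:rg1\tSM:OLD1\tPL:ILLUMINA\n"
    "@PG\tID:bwa\tCL:bwa SM:OLD1\n"
)
updated = rename_sample_ids(example, {"OLD1": "NEW1"})
```

Text rewritten this way could then be written to a file and supplied as the replacement header to `samtools reheader`, with `samtools index` run on the result to regenerate the index. Only `@RG` lines are touched here, so an `SM:` string that happens to appear elsewhere (for example inside a `@PG` command line) is not altered.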
f8eb4fe to 06f8e1b
Add a fairly simpleminded script that reads CRAM files from a `gs://…` URL, applies `samtools reheader` to update the `@RG SM:{id}` headers as instructed, and writes the CRAM file and associated index file back to new `gs://…` URLs constructed from the old one by updating the `{id}` similarly.
Fortunately the samtools bug I ran into (samtools/samtools#1866) only affects CRAM v2.1 files, which were superseded by v3.0 in 2015 so we surely don't have any, so this streaming method will work after all. (It was just bad luck that I grabbed one when I was testing this locally!)
It's invoked as