Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add rename_embedded_ids.py script that updates @RG SM:{id} values #504

Open
wants to merge 3 commits into
base: dev
Choose a base branch
from

Conversation

jmarshall
Copy link
Contributor

Add a fairly simpleminded script that reads CRAM files from a gs://… URL, applies samtools reheader to update the @RG SM:{id} headers as instructed, and writes the CRAM file and associated index file back to new gs://… URLs constructed from the old one by updating the {id} similarly.

Fortunately the samtools bug I ran into (samtools/samtools#1866) only affects CRAM v2.1 files, which were superseded by v3.0 in 2015 so we surely don't have any — so this streaming method will work after all. (It was just bad luck that I grabbed one when I was testing this locally!)

It's invoked as

Copies CRAM files and associated .crai index files to URLs based on new IDs,
also replacing SM:{id} values on the CRAM files' @RG header lines.
Each command-line argument is

    OLDURL[,OLDID],NEWID

If OLDID is not specified, a pattern matching /CPGdd...dd/ is used instead
to match CPG-style IDs in the URL and @RG SM fields.

This script should usually be run via analysis-runner. It requires an image
that contains samtools.

@jmarshall
Copy link
Contributor Author

The changes to create_test_subset.py are somewhat draft and need testing. The rename_embedded_ids.py UI may need to become a little more general to handle all the scenarios we want it to work in.

@codecov-commenter
Copy link

codecov-commenter commented Jul 3, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 79.79%. Comparing base (f72850b) to head (06f8e1b).

Additional details and impacted files
@@           Coverage Diff           @@
##              dev     #504   +/-   ##
=======================================
  Coverage   79.79%   79.79%           
=======================================
  Files         169      169           
  Lines       14523    14523           
=======================================
  Hits        11589    11589           
  Misses       2934     2934           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@cassimons cassimons left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM thanks.

Copy link
Contributor

@vivbak vivbak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, John!

As we discussed, I think it would be good to call metamist here to make sure that our analysis entries don't go out of sync when this script is run (which could have downstream implications in pipeline runs).

Also it would be useful to add a comment on manual actions from a data management perspective (i.e. deleting the old cram files after this script is run).

@vivbak
Copy link
Contributor

vivbak commented Jul 6, 2023

Oh also could you look at the linting and unit tests?

@jmarshall
Copy link
Contributor Author

Turns out the unit test failure was a problem on dev that has since been fixed and has resolved itself with a rebase onto current dev.

Lint was whining about something I consider good style, sigh. But I'll probably refactor it in due course to shut the thing up.

@jmarshall jmarshall force-pushed the rename_embedded_ids branch 2 times, most recently from adb5c95 to a4e6ea5 Compare October 27, 2023 01:42
This version of this script streams from the original blob, uses
`samtools reheader` to update header lines, and streams back to the
output blob. It also regenerates a new corresponding index file.
(Currently assumes that it is given CRAM URLs specifically, not BAM.)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: 🏗 In progress
Development

Successfully merging this pull request may close these issues.

4 participants