Prepare input files for FINEMAP #232

Open
wants to merge 5 commits into
base: main

Conversation

hopedisastro
Contributor

This script will prepare the z and ld files for FINEMAP, based on outputs from associaTR (meta-analysis) and corr_matrix_maker.py.

The z file contains the effect size and standard error estimates for each variant associated with a gene.
The ld file contains the correlation matrix calculations for each variant associated with a gene.
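
For orientation, here is a minimal sketch of what the two files could look like. It assumes FINEMAP's documented layout: a space-delimited z file with the header `rsid chromosome position allele1 allele2 maf beta se`, and a headerless, space-delimited square ld matrix whose rows follow the z file order. The helper name `write_finemap_inputs` and the local file paths are hypothetical and are not taken from this PR's script.

```python
from pathlib import Path

# Column order expected in a FINEMAP .z file
Z_COLUMNS = ['rsid', 'chromosome', 'position', 'allele1', 'allele2', 'maf', 'beta', 'se']


def write_finemap_inputs(variants, ld_matrix, z_path, ld_path):
    """Write a space-delimited .z file and a headerless .ld matrix (hypothetical helper)."""
    n = len(variants)
    # The LD matrix must be square and ordered the same way as the z file rows
    if len(ld_matrix) != n or any(len(row) != n for row in ld_matrix):
        raise ValueError('LD matrix must be n x n, with n = number of variants')

    # One header line, then one line per variant
    z_lines = [' '.join(Z_COLUMNS)]
    for variant in variants:
        z_lines.append(' '.join(str(variant[col]) for col in Z_COLUMNS))
    Path(z_path).write_text('\n'.join(z_lines) + '\n')

    # One line per matrix row, no header
    ld_lines = [' '.join(f'{value:.6f}' for value in row) for row in ld_matrix]
    Path(ld_path).write_text('\n'.join(ld_lines) + '\n')
```

Each entry in `variants` is a dict keyed by the z file column names; the values would come from the associaTR meta-analysis output, and `ld_matrix` from corr_matrix_maker.py.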

Comment on lines +129 to +135
if (
    to_path(
        output_path(f'finemap_prep/{celltype}/{chrom}/{gene_name}.ld', 'analysis'),
    ).exists()
    and to_path(output_path(f'finemap_prep/{celltype}/{chrom}/{gene_name}.z', 'analysis')).exists()
):
    continue
Contributor

@MattWellie Jun 24, 2024

Not a biggie, but be aware of how your jobs scale (type * chrom * gene): you're creating 2 existence checks per combination. All of that happens in the driver job, so you're delaying the start of the actual work.

I'd experiment with a change here:

# get all files in the output folder, recursively, in a single query
# (do this once, before the loop; stored as strings so they compare against output_path() results)
all_files = {str(path) for path in to_path(output_path('finemap_prep', 'analysis')).glob('**/*')}

# check whether your intended outputs are in that set
ld_file = output_path(f'finemap_prep/{celltype}/{chrom}/{gene_name}.ld', 'analysis')
z_file = output_path(f'finemap_prep/{celltype}/{chrom}/{gene_name}.z', 'analysis')
if all(filepath in all_files for filepath in [z_file, ld_file]):
    continue

I thiiiiink this should scale a lot better, by posting one large query instead of thousands of individual ones.

This also builds the full output file names, so you can pass them to the relevant methods (you pass the celltype, chrom, and gene name to your methods, but you already made the full path here to check if it exists)
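
To make the pattern concrete, here is a self-contained sketch that uses a local directory as a stand-in for the bucket. The helper names `list_existing` and `pending_genes` are hypothetical; in the PR, the listing and the paths would come from `to_path` and `output_path`. It assumes the listed paths and the constructed paths share the same string form, which is worth checking against the real storage backend.

```python
from pathlib import Path


def list_existing(root):
    """Return every file under root as a set of POSIX-style strings, from one recursive listing."""
    root = Path(root)
    if not root.exists():
        return set()
    return {p.as_posix() for p in root.rglob('*') if p.is_file()}


def pending_genes(root, celltype, chrom, gene_names):
    """Yield (gene_name, ld_file, z_file) for genes whose two outputs are not both present."""
    existing = list_existing(root)  # single listing, done once before the loop
    for gene_name in gene_names:
        ld_file = (Path(root) / celltype / chrom / f'{gene_name}.ld').as_posix()
        z_file = (Path(root) / celltype / chrom / f'{gene_name}.z').as_posix()
        # Set membership is a local lookup, so no further storage queries are made here
        if ld_file in existing and z_file in existing:
            continue
        yield gene_name, ld_file, z_file
```

Because the generator yields the full `ld_file` and `z_file` paths, downstream methods can take those directly instead of rebuilding them from the cell type, chromosome, and gene name.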

2 participants