The Bayesian non-negative matrix factorization (bNMF) procedure, as applied here, detects clusters of GWAS variants for an outcome of interest based on the associations of those variants with a set of additional traits. The pipeline includes pre-processing steps (quality control of variants and traits, and the choice of proxy variants), preparation of the z-score matrix, clustering, and summarization of results.
Important: The current pipeline makes certain assumptions and uses some hard-coded filenames, including:
- VAR_IDs for the GWAS and trait-specific summary statistics follow a specific format (CHR_POS_REF_ALT), with alleles aligned consistently across traits. This matters because variant matching is performed using a simple string match.
- The variant reference file linking VAR_IDs to rsIDs is based on the VAR_ID format above, and points to a file available on the Broad Institute compute cluster. The VAR_ID-to-rsID pairs can be generated using the script "generate_varid_to_rsid_map_file.R".
- The LD reference file uses rsIDs, is based on a European-ancestry population, and points to a file available on the Broad Institute compute cluster. This file can be downloaded from HaploReg v4.1 (https://pubs.broadinstitute.org/mammals/haploreg/haploreg.php).
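
To illustrate why consistent allele alignment matters under simple string matching, here is a minimal Python sketch. It is not part of the pipeline (which is written in R); the function names `make_var_id` and `match_variants` and the example variants are hypothetical.

```python
def make_var_id(chrom, pos, ref, alt):
    """Build a VAR_ID string in CHR_POS_REF_ALT format."""
    return f"{chrom}_{pos}_{ref}_{alt}"


def match_variants(gwas_ids, trait_ids):
    """Return the GWAS VAR_IDs found verbatim in the trait file, preserving order."""
    trait_set = set(trait_ids)
    return [v for v in gwas_ids if v in trait_set]


# Hypothetical example data.
gwas_ids = [make_var_id(1, 12345, "A", "G"), make_var_id(2, 67890, "C", "T")]
# The second trait entry is the same site with REF and ALT swapped,
# so an exact string comparison does not treat it as a match.
trait_ids = ["1_12345_A_G", "2_67890_T_C"]

matched = match_variants(gwas_ids, trait_ids)
print(matched)  # ['1_12345_A_G']
```

Because the comparison is exact, a variant whose alleles are listed in the opposite order in a trait file would be treated as missing for that trait.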
ld_pruning: LD-based pruning of the input variant set
count_traits_per_variant: For each variant of interest, assess the fraction of traits in which it is missing
find_variants_needing_proxies: Determine which variants need proxies (allele considerations, missingness, etc.)
choose_proxies: Select proxies for the necessary variants and output the final variant set for clustering
fetch_summary_stats: Retrieve z-scores and sample sizes across all traits for the final variant set
prep_z_matrix: Final trait filters, sample size adjustment, and creation of the non-negative input matrix (N variants x 2M columns for M traits, with one column per trait holding positive associations (zero otherwise) and a second column holding negative associations (zero otherwise))
run_bNMF: Run the bNMF procedure (over multiple iterations)
summarize_bNMF: Summarize the results and create heatmaps for visualization
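
The conversion from the N x M z-score matrix to the N x 2M non-negative matrix described for prep_z_matrix can be sketched as follows. This is an illustrative Python example, not the pipeline's R code. It assumes the negative associations are stored as absolute values and that all positive columns come before all negative columns; the actual column order and any missing-value handling in the pipeline may differ. The function name `to_nonnegative` is hypothetical.

```python
def to_nonnegative(z_rows):
    """Split an N x M z-score matrix (list of rows) into an N x 2M non-negative matrix.

    Assumed layout: the first M columns hold positive z-scores (zero otherwise),
    the last M columns hold the magnitudes of negative z-scores (zero otherwise).
    """
    out = []
    for row in z_rows:
        pos = [z if z > 0 else 0.0 for z in row]
        neg = [-z if z < 0 else 0.0 for z in row]
        out.append(pos + neg)
    return out


# Hypothetical 2 variants x 2 traits example.
z = [[2.5, -1.0],
     [-3.0, 0.0]]
nonneg = to_nonnegative(z)
print(nonneg)  # [[2.5, 0.0, 0.0, 1.0], [0.0, 0.0, 3.0, 0.0]]
```

Splitting each trait into two columns keeps the direction of association while satisfying the non-negativity requirement of the factorization.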
Most steps of the pipeline will print messages with details of the procedure. In addition, the following outputs will be written to the working directory.
- no_proxies_found.txt: A list of variants that were excluded and for which no acceptable proxies were found.
- run_summary.txt: A table listing the chosen K (number of clusters) and the negative log-likelihood for each bNMF iteration.
- z_score_mat.rds: A binary R object containing the N x M z-score matrix after all preprocessing steps.
- z_score_mat_nonnegative.rds: A binary R object containing the N x 2M z-score matrix for direct input to the bNMF step.
- L2EU.W.mat.K[]: The matrix of feature contributions to clusters for the K in question (one per K chosen in at least one iteration).
- L2EU.H.mat.K[]: The matrix of variant contributions to clusters for the K in question.
- W_plot_K[].pdf: A heatmap displaying feature contributions to clusters for the K in question.
- H_plot_K[].pdf: A heatmap displaying variant contributions to clusters for the K in question.
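
As one way to read the per-iteration information in run_summary.txt, the sketch below tallies how often each K was chosen and records the lowest negative log-likelihood seen for that K. It is a hypothetical Python illustration: the on-disk layout of run_summary.txt is not specified here, so the rows are supplied in memory as (K, negative log-likelihood) pairs, and the numbers are invented.

```python
from collections import Counter


def summarize_runs(runs):
    """Summarize (K, negative log-likelihood) pairs, one pair per bNMF iteration.

    Returns {K: (number of iterations choosing K, lowest negative log-likelihood)}.
    """
    counts = Counter(k for k, _ in runs)
    best = {}
    for k, nll in runs:
        if k not in best or nll < best[k]:
            best[k] = nll
    return {k: (counts[k], best[k]) for k in sorted(counts)}


# Invented values for illustration only.
runs = [(4, 1250.3), (5, 1248.9), (4, 1249.1), (4, 1251.0)]
summary = summarize_runs(runs)
print(summary)  # {4: (3, 1249.1), 5: (1, 1248.9)}
```

Each K that appears in this tally corresponds to one set of W and H matrix files and heatmaps in the list above.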
- Claire Kim (design and code)
- Kenny Westerman (design and code)
- Kirk Smith (code)
- Jaegil Kim (code)
- Marcin von Grotthuss (code)
- Miriam Udler (design and supervision)