TDAmut_Tutorial.Rmd

---
title: "Using TDAmut to analyze low grade glioma data"
output: rmarkdown::github_document
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(TDAmut)
```

```{css, include = FALSE}
.lim-message {
  max-height: 150px;
}
```

## Getting Started

We will use low grade glioma expression and mutation data to demonstrate the TDAmut pipeline. The sample data is formatted from publically available TCGA data and is provided in the `TDAmut` package. Expression data is assumed to be normalized data (e.g. $log_2[1+TPM]$) formatted as a matrix (rows = samples, columns = genes). Mutation data is assumed be in a table organized by _Sample_, _Gene_, and _Type_ (e.g. missense, nonsense, splice, frameshift, ...). 

The user is afforded several options in each function of `TDAmut`. We implemented default options which can be a helpful starting point for the user.

## Creating TDAmut object for use throughout the pipeline

Here, we are creating a `TDAmut` object to be used as an intermediate across all functions. This object holds data and topological representations produced in this pipeline.


``` {r input_data, cache = TRUE, class.message = 'lim-message'}
exp_matrix <- "data/LGG_Full_TPM_matrix.csv"
mut_table <- "data/LGG_Muts.txt"
LGG_object <- create_TDAmut_object(exp_matrix, mut_table)
```

``` {r display_data}
LGG_object@expression_table[1:5, 1:5]
LGG_object@mutation_table[1:5, ]
```

## Creating topological representations of expression data

The `LGG_object` is passed to compute_complexes, which uses TDAmapper and RayleighSelection R packages to create nerve complexes across a grid of Mapper parameters (2D intervals and their percent overlap). We create several nerve complexes across a broad range of parameters to ensure stable results.

The `LGG_object` is populated with the topological representations and the parameters used to make them.

``` {r nerve_complexes, cache = TRUE, dependson = 'input_data', message = FALSE}
LGG_object <- compute_complexes(LGG_object, filter_method = 'KNN', k = 30, min_interval = 10, max_interval = 60, interval_step = 10, min_percent_overlap = 60, max_percent_overlap = 85, percent_step = 5)
```

## Computing mutational load and its localization within the topological representations

Spurious correlations between the mutation rate and tumor expression profile can confound our approach. An example comes from hypermutated tumors and their often distinctive expression profiles. Localized regions in expression spaces consisting of these tumors can also harbor accumulated passenger mutations that confound our approach. To control for these correlations, `compute_mut_load` computes the mutational tumor burden (total number of somatic mutations in each tumor) and assesses its localization across the topological representations of expression data. If significant localization is seen, tumors with a mutational load above a user-defined threshold have their mutations downsampled such that their median mutational load matches the median mutational load of other samples. The user is also able to set a minimum mutational threshold to eliminate samples with few mutations (e.g. low tumor purity). Nerve complexes may be recomputed without samples below this minimum threshold by using `compute_complexes(..., recompute = TRUE)`

After computing mutational load with `compute_mut_load`, 13 samples with mutational load below ~20 ($10^{1.3}$) are removed. Since samples were removed, we must recompute nerve complexes without them. `plot_mut_load` shows us that 3 samples have a significantly higher mutational load than other samples. However, we see very few regions of the parameter grid with localization of mutational load, so we continue. 

``` {r mut_load, cache = TRUE, dependson = 'input_data', message = c(1,2)}
LGG_object <- compute_mut_load(LGG_object, min_mutload = 1.3)
plot_mut_load(LGG_object)
```

``` {r recompute_complexes, cache = TRUE, dependson = 'input_data', message = FALSE}
LGG_object <- compute_complexes(LGG_object, filter_method = 'KNN', k = 30, min_interval = 10, max_interval = 60, interval_step = 10, min_percent_overlap = 60, max_percent_overlap = 85, percent_step = 5, recompute = TRUE)
```

## Assessing current results by visualizing topological representations

To check the validity of the options chosen in the pipeline so far, we want to visualize our data on the topological representations. We expect to see localization of mutated genes which have been previously identified as distinct markers of low grade glioma subtypes.

* IDH1 mutations in a subtype of astrocytoma
* ATRX mutations in a astrocytic gliomas
* CIC mutations in oligodendromas

We've chosen the topological representation created with 2D intervals = 30 and percent overlap = 75. More can be specified in a vector of interval and percent pairs passed to `which_complexes`. By default, 3 representations with parameters in the middle of the parameter range are chosen.


```{r plots}
LGG_genes <- c('EGFR.1956', 'TP53.7157', 'IDH2.3418', 'CIC.23152')

plot_mapper(LGG_object, type = 'mutation', features = LGG_genes, which_complexes = c(30,75))

```

## Filtering genes by mutation frequency, highest fraction of nonsynonymous mutations, and negative correlations between expression and mutation data

Genes can be filtered by 

* Mutational frequency among all samples 
* Those with the greatest ratio of nonsynonymous mutations to total mutations
* Associations between expression and mutation rates of a gene

For example, anticorrelations between expression and mutation rates can arise due to transcription-coupled DNA repair. To correct for this, we assess the similarity of expression and mutation profiles within the topological representations using Jensen Shannon Divergence (JSD). A p value is estimated by comparing the actual JSD to a null distribution generated by permuting the sample IDs in the mutation data and calculating JSD. The number of permutations used should be increased accordingly to the number of genes that pass filtering thresholds. 1000 permutations is sufficient for many scenarios.

Note: `filter_genes` is the most time-intensive function of `TDAmut` and can take a couple hours. Consider using the `num_cores` argument to parallelize this function.

```{r filter_genes_nocores, eval = FALSE}
LGG_object <- filter_genes(LGG_object, freq_threshold = 0.02, top_nonsyn_fraction = 300, negative_correlations = TRUE, num_permutations = 1000)
```

``` {r filter_genes, echo = FALSE, cache = TRUE, dependson = 'input_data', warning = FALSE}
LGG_object <- filter_genes(LGG_object, freq_threshold = 0.02, top_nonsyn_fraction = 300, negative_correlations = TRUE, num_permutations = 1000, num_cores = 8)
```

## Assessing localization of filtered genes

Filtered genes are assessed for localization across the topological representations using `RayleighSelection`. Localization is quantified via a p value, estimated by a similar permutation scheme as noted above. The false discovery rate is controlled using the Benjamini-Hochberg procedure, which results in an accompanying localization q value.

Genes with a median JSD q value above the threshold specified by the `negative_correlations_threshold` argument are not considered, as they display an anticorrelation between expression and mutation rates.

Note: Consider using the `num_cores` argument to parallelize `compute_gene_localization` if filtering thresholds are relaxed.

```{r gene_localization_nocores, eval = FALSE}
LGG_object <- compute_gene_localization(LGG_object, negative_correlations_threshold = 0.8, num_permutations = 5000)
```

``` {r gene_localization, echo = FALSE, cache = TRUE, dependson = 'input_data', message = c(1)}
LGG_object <- compute_gene_localization(LGG_object, negative_correlations_threshold = 0.8, num_permutations = 5000, num_cores = 6)
```

## Identifying and summarizing cancer-associated genes

Results are then summarized by `identify_significant_mutations`. Genes with a median localization q value below the threshold specified by the `q_threshold_localization` argument are considered significant. These genes are predicted to be associated with global expression patterns over a subset of tumors.

```{r final_results, cache = FALSE}
LGG_object <- identify_significant_mutations(LGG_object, q_threshold_localization = 0.15)
```