
Summarizing and visualizing the regression-based statistical tests #51

Open · 2 tasks

michaelgruenstaeudl opened this issue Oct 16, 2024 · 2 comments

@michaelgruenstaeudl (Owner) commented Oct 16, 2024

Overview

The purpose of this issue/document is two-fold:

  • to summarize the regression analyses of the project in textual form
  • to summarize the results of these regression analyses in both textual and visual form

Regression analyses

The regression analyses are conducted through this script: https://github.com/michaelgruenstaeudl/PACVr/blob/master/inst/extdata/scripts_for_figures_tables/PREP_RegressionAnalysis_RunAnalyses.R

The input data for this script is found here: https://github.com/michaelgruenstaeudl/PACVr/tree/master/inst/extdata/scripts_for_figures_tables/input/data_for_regression_analyses_2024_10_16.rds

Textual summary of analyses and results

METHODS
Regression-based statistical tests
Regression models that implement either linear regressions or decision trees were used to test the hypotheses of this investigation. These models employed predictor variables that are also evaluated in our group-based statistical tests as well as control variables (hereafter ‘covariates’) that may have an additional, confounding impact on sequencing depth and sequencing evenness.
The regression models on sequencing depth used the separation into the four structural partitions of a plastid genome and the separation between the coding and the non-coding sections of each plastome as predictor variables, with each category within these predictor variables defined as independent. The effects of these predictor variables on sequencing depth were modeled as a decision tree. Total genome size and the exact ratio of coding versus non-coding plastome sections were specified as covariates.
The regression models on sequencing evenness, by comparison, used the assembly quality indicators (i.e., the number of ambiguous nucleotides and the number of IR mismatches) and the identity of the sequencing platform employed as predictor variables, with each sequencing platform category defined as independent. The effects of each assembly quality indicator on sequencing evenness were modeled as a linear regression, and the effect of the sequencing platform on sequencing evenness was modeled as a decision tree. Total genome size and the exact ratio of coding versus non-coding plastome sections were used as covariates. The identity of the assembly software tool employed, by contrast, was not used as a predictor variable due to its high proportion of missing data.
Outliers within any predictor or control variable category were removed from the dataset prior to model scoring.
All regression-based models were implemented in R via the R package tidymodels (Kuhn and Wickham 2020).
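
To make the model setup above easier to follow, here is a minimal sketch of how such models can be specified with tidymodels/parsnip. The column names used below (depth, evenness, partition, coding_status, genome_size, coding_ratio, n_ambiguous, n_IR_mismatches) are placeholders, not the actual variable names in our dataset.

```r
# Sketch: specifying the linear-regression and decision-tree models via tidymodels.
# All column names below are placeholders for the variables described in the Methods text.
library(tidymodels)

reg_data <- readRDS("data_for_regression_analyses_2024_10_16.rds")

# Linear regression: assembly quality indicators -> sequencing evenness (plus covariates)
lm_spec <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")

lm_fit <- lm_spec %>%
  fit(evenness ~ n_ambiguous + n_IR_mismatches + genome_size + coding_ratio,
      data = reg_data)

# Decision tree: categorical predictors -> sequencing depth (plus covariates)
tree_spec <- decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("regression")

tree_fit <- tree_spec %>%
  fit(depth ~ partition + coding_status + genome_size + coding_ratio,
      data = reg_data)

tidy(lm_fit)   # coefficient estimates and p-values for the linear model
```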

RESULTS
Regression-based statistical tests
The results of our regression-based statistical tests on sequencing depth indicated that both the separation into the four structural partitions of a plastid genome and the separation between the coding versus the non-coding sections of each plastome had a significant effect on, and a moderate explanatory ability for, the observed variability in sequencing depth.
The results of our regression-based statistical tests on sequencing evenness indicated that the identity of the sequencing platform employed had a significant effect on the observed variability in sequencing evenness. The linear regression regarding the effect of the assembly quality indicators on sequencing evenness indicated that only the number of ambiguous nucleotides had a significant impact on sequencing evenness, corroborating the results of our correlation coefficient analysis among the group-based statistical tests.


Is the above summary complete?

  • Yes
  • No

BIBTEX

@Manual{KuhnAndWickham2020,
  title  = {Tidymodels: a collection of packages for modeling and machine learning using tidyverse principles},
  author = {Max Kuhn and Hadley Wickham},
  url    = {https://www.tidymodels.org},
  year   = {2020},
}

Visual summary of results

Premise

The results of the above regression analyses are difficult to interpret for a non-statistician because they are primarily textual. For our planned manuscript, it would be great if we could generate graphical visualizations of these results, especially ones that are simple, colorful, and intuitive (yes, most manuscript readers need to be treated like children!).

$${\Large \color{red}Start~here!}$$

Improving existing figures and combining them into a collage

Two types of plots are already produced through our code:

  • variable importance plots
    [image: a variable importance plot]

  • decision tree plots
    [image: a decision tree plot]

Both of these plots are a good start, but they are somewhat difficult to interpret for a non-statistician. On the one hand, the existing plots could be further simplified to make them more intuitive; decision tree plots, for example, can be improved over the default settings via the options explained in http://www.milbo.org/rpart-plot/prp.pdf (see the sketch below). On the other hand, alternative ways to visualize regression analysis results exist. Below are examples of such visualizations. Which exact visualizations would make sense for our data is a matter we still have to discuss as a group. If we can adapt and implement some of these code examples (especially example 1!) for our data, we should be able to create similar visualizations.
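
To make the first point concrete, here is a minimal sketch of how the default decision tree plot could be simplified via rpart.plot using options from the linked manual. `tree_fit` is assumed to be a fitted tidymodels decision tree (rpart engine) as in the methods sketch above, and the specific option values are merely suggestions, not settled choices.

```r
# Sketch: simplifying a decision tree plot via rpart.plot (options per the prp manual).
# Assumes `tree_fit` is a fitted parsnip decision_tree model with the rpart engine.
library(tidymodels)
library(rpart.plot)

rpart_obj <- extract_fit_engine(tree_fit)   # pull the underlying rpart object

rpart.plot(
  rpart_obj,
  type        = 2,        # label all nodes; split labels below the nodes
  extra       = 101,      # show fitted value plus number/percentage of observations
  box.palette = "Blues",  # shade nodes by fitted value
  shadow.col  = "gray",   # subtle drop shadow for readability
  branch      = 0.5,      # slanted branches
  tweak       = 1.1       # enlarge text slightly
)
```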

Example 1
This code seems to visualize the results of regression decision trees via the package parttree.
https://paulvanderlaken.com/2020/03/31/visualizing-decision-tree-partition-and-decision-boundaries/
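
For orientation, here is a rough sketch of what the parttree approach from that post could look like on our data. The response and the two numeric predictors (evenness, genome_size, coding_ratio) are placeholders, and the geom_parttree() usage follows the pattern shown in the linked post.

```r
# Sketch: decision-boundary plot for a regression tree via parttree (following the linked post).
# `reg_data` and all column names are placeholders for the project dataset.
library(tidymodels)
library(parttree)
library(ggplot2)

# A tree with exactly two numeric predictors, as required for a 2-D partition plot
tree2d <- decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("regression") %>%
  fit(evenness ~ genome_size + coding_ratio, data = reg_data)

ggplot(reg_data, aes(x = genome_size, y = coding_ratio)) +
  geom_parttree(data = tree2d, aes(fill = evenness), alpha = 0.3) +  # shaded tree partitions
  geom_point(aes(colour = evenness)) +                               # raw observations
  theme_minimal()
```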

Example 2
This code seems to visualize the results of regression decision trees via the package parsnip.
https://www.tmwr.org/explain
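
If I recall that chapter correctly, it builds its explanation plots with the DALEX/DALEXtra packages on top of tidymodels fits; a rough sketch of that route, reusing the placeholder objects from the methods sketch above, might look like this.

```r
# Sketch: permutation-based variable importance via DALEX/DALEXtra, as in the tmwr.org chapter.
# Assumes `lm_fit` and `reg_data` (placeholders) from the tidymodels sketch further above.
library(DALEXtra)

explainer <- explain_tidymodels(
  lm_fit,
  data  = dplyr::select(reg_data, -evenness),  # predictors only
  y     = reg_data$evenness,                   # observed response
  label = "linear regression"
)

importance <- DALEX::model_parts(explainer)    # permutation-based variable importance
plot(importance)
```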

Goal

The production of publication-quality figures that would help a manuscript reader in interpreting the regression analysis results.

@bosonicbw

Comment Regarding Issue #51

Background:

Though I do not have an extensive background in writing R scripts, I have been actively building my skills over the past week and a half to better contribute to this data visualization issue, and I am starting to get a better grip on it!

The visualization in this script currently rests primarily on vip and ggplot2, with minor use of svglite and rplot. However, to obtain more refined visualizations, we can build on the current visualization functions and/or the manipulated data via several potential routes that I have been looking into.

Potential Routes:

  • Modifying the current implementation to incorporate aspects from parttree alongside parsnip with vip

    • As mentioned in the original examples presented in the issue above, the code for modelling the decision trees can be rewritten/extended to use capabilities found natively in the parttree package. However, parttree does not seem to provide a manageable option for plotting variable importance, so I would recommend building on the capabilities natively found in vip, since that package is already utilized, while incorporating the parsnip ecosystem to create better, more modern-looking visualizations (see the sketch after this list).
  • Exporting to RDS and using a Python script for data visualization

    • Another option that seems easily doable would be to export the manipulated data to an RDS file, then have a separate Python script that reads the RDS file using pyreadr and creates modern-looking, accurate data visualizations of decision trees and variable importance plots using the popular libraries seaborn and matplotlib.
  • Maximizing the existing capabilities of ggplot2 (Not Recommended)

    • Another route would be to include no additional packages and simply maximize the capabilities already present in ggplot2 and vip, manually changing things such as colors and fonts to create a custom theme from scratch, since both packages are already used in the script.
      (This route is more labor-intensive and thus less preferred.)
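
To make the first route a bit more concrete, here is a minimal sketch of how a variable importance plot could be drawn from a parsnip fit via vip and then themed with ggplot2. `tree_fit` is a placeholder for a fitted tidymodels decision tree, as in the sketches further above.

```r
# Sketch: variable importance plot from a parsnip fit via vip, themed with ggplot2.
# Assumes `tree_fit` is a fitted parsnip decision_tree (rpart engine), as sketched above.
library(tidymodels)
library(vip)
library(ggplot2)

vip(extract_fit_engine(tree_fit), num_features = 10) +   # importance from the underlying rpart fit
  labs(title = "Variable importance: sequencing depth model") +
  theme_minimal(base_size = 14)
```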

My Next Steps:

Personally, even though I am rather inclined to attempt using Python given my background in the language (and I may still do so in my free time), I believe the most straightforward approach is ultimately the first route mentioned above: modifying the current implementation to incorporate aspects of parttree alongside parsnip with vip. Thus, I will continue getting up to speed with R and gaining a better understanding of the regression analysis script at hand!

I should have my finalized draft by this Friday, November 1st, and I look forward to my next comment then!

@michaelgruenstaeudl (Owner, Author) commented Nov 1, 2024

The visualizations that we already have (i.e., variable importance plots, decision tree plots) are not incorrect or otherwise "bad" in any way. Quite the contrary: they are a great first step. They also do not need to be modern-looking in any particular way. In their current form, they are just a bit difficult for a layperson to interpret (as illustrated by the following example), and I believe that we can generate some valid figures that are easier for a non-expert reader to grasp.

Example: The following collage of variable importance plots correctly displays the variable importance for each analysis performed, but does it convey a clear message?

[image: Variable Importance Collage]

By comparison, I consider the visualization of decision boundaries more intuitive. The parttree output as presented at https://paulvanderlaken.com/2020/03/31/visualizing-decision-tree-partition-and-decision-boundaries/ or the decision boundary plot made via ggplot at https://koalaverse.github.io/homlr/notebooks/09-decision-trees.nb.html is something that a non-expert reader would grasp quickly.

Conceptually, something like the motivating example at https://cran.r-project.org/web/packages/ggparty/vignettes/ggparty-graphic-partying.html would be amazing! After all, that figure shows a decision tree with bar charts for each tree node, and we have the same kind of data in our regression analysis results.
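
For reference, here is a rough sketch of what such a ggparty figure could look like for one of our trees. The conversion via partykit::as.party() and the placeholder response name `depth` are assumptions; since our response is numeric, the per-node plot below uses a boxplot instead of the bar charts shown in the vignette.

```r
# Sketch: decision tree with a small plot per terminal node, via partykit + ggparty.
# Assumes `tree_fit` is a fitted parsnip decision_tree (rpart engine); `depth` is a placeholder name.
library(tidymodels)
library(partykit)
library(ggparty)
library(ggplot2)

party_tree <- as.party(extract_fit_engine(tree_fit))  # convert the rpart fit to a party object

ggparty(party_tree) +
  geom_edge() +
  geom_edge_label() +          # split values on the edges
  geom_node_splitvar() +       # split variable in each inner node
  geom_node_plot(              # one small plot per terminal node
    gglist = list(
      geom_boxplot(aes(y = depth)),
      theme_minimal()
    )
  )
```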
