
Summarizing and visualizing the regression-based statistical tests #51

Open · 2 tasks

michaelgruenstaeudl opened this issue Oct 16, 2024 · 2 comments

@michaelgruenstaeudl (Owner) commented Oct 16, 2024

Overview

The purpose of this issue/document is two-fold:

  • to summarize the regression analyses of the project in textual form
  • to summarize the results of these regression analyses in both textual and visual form

Regression analyses

The regression analyses are conducted through this script: https://github.com/michaelgruenstaeudl/PACVr/blob/master/inst/extdata/scripts_for_figures_tables/PREP_RegressionAnalysis_RunAnalyses.R

The input data for this script is found here: https://github.com/michaelgruenstaeudl/PACVr/tree/master/inst/extdata/scripts_for_figures_tables/input/data_for_regression_analyses_2024_10_16.rds

Textual summary of analyses and results

METHODS
Regression-based statistical tests
Regression models that implement either linear regressions or decision trees were used to test the hypotheses of this investigation. These models employed predictor variables that are also evaluated in our group-based statistical tests as well as control variables (hereafter ‘covariates’) that may have an additional, confounding impact on sequencing depth and sequencing evenness.
The regression models on sequencing depth used the separation into the four structural partitions of a plastid genome and the separation between the coding and the non-coding sections of each plastome as predictor variables, with each category within these predictor variables defined as independent. The effects of these predictor variables on sequencing depth were modeled as a decision tree. Total genome size and the exact ratio of coding versus non-coding plastome sections were specified as covariates.
The regression models on sequencing evenness, by comparison, used the assembly quality indicators (i.e., the number of ambiguous nucleotides and the number of IR mismatches) and the identity of the sequencing platform employed as predictor variables, with each sequencing platform category defined as independent. The effects of each assembly quality indicator on sequencing evenness were modeled as a linear regression, and the effect of the sequencing platform on sequencing evenness was modeled as a decision tree. Total genome size and the exact ratio of coding versus non-coding plastome sections were used as covariates. The identity of the assembly software tool employed, by contrast, was not used as a predictor variable due to its high proportion of missing data.
Outliers within any predictor or control variable category were removed from the dataset prior to model scoring.
All regression-based models were implemented in R via the R package tidymodels (Kuhn and Wickham 2020).
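
To make the model setup above easier to follow, here is a minimal sketch of how such models can be specified with tidymodels/parsnip. The column names used below (depth, evenness, partition, coding_status, genome_size, coding_ratio, n_ambiguous, n_IR_mismatches) are placeholders, not the actual variable names in our dataset.

```r
# Sketch: specifying the linear-regression and decision-tree models via tidymodels.
# All column names below are placeholders for the variables described in the Methods text.
library(tidymodels)

reg_data <- readRDS("data_for_regression_analyses_2024_10_16.rds")

# Linear regression: assembly quality indicators -> sequencing evenness (plus covariates)
lm_spec <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")

lm_fit <- lm_spec %>%
  fit(evenness ~ n_ambiguous + n_IR_mismatches + genome_size + coding_ratio,
      data = reg_data)

# Decision tree: categorical predictors -> sequencing depth (plus covariates)
tree_spec <- decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("regression")

tree_fit <- tree_spec %>%
  fit(depth ~ partition + coding_status + genome_size + coding_ratio,
      data = reg_data)

tidy(lm_fit)   # coefficient estimates and p-values for the linear model
```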

RESULTS
Regression-based statistical tests
The results of our regression-based statistical tests on sequencing depth indicated that both the separation into the four structural partitions of a plastid genome and the separation between the coding versus the non-coding sections of each plastome had a significant effect on, and a moderate explanatory ability for, the observed variability in sequencing depth.
The results of our regression-based statistical tests on sequencing evenness indicated that the identity of the sequencing platform employed had a significant effect on the observed variability in sequencing evenness. The linear regression regarding the effect of the assembly quality indicators on sequencing evenness indicated that only the number of ambiguous nucleotides had a significant impact on sequencing evenness, corroborating the results of our correlation coefficient analysis among the group-based statistical tests.


Is the above summary complete?

  • Yes
  • No

BIBTEX

@Manual{KuhnAndWickham2020,
  title  = {Tidymodels: a collection of packages for modeling and machine learning using tidyverse principles},
  author = {Max Kuhn and Hadley Wickham},
  url    = {https://www.tidymodels.org},
  year   = {2020},
}

Visual summary of results

Premise

The results of the above regression analyses are difficult to interpret for a non-statistician because they are primarily textual. For our planned manuscript, it would be great if we could generate graphical visualizations of these results, especially ones that are simple, colorful, and intuitive (yes, most manuscript readers need to be treated like children!).

$${\Large \color{red}Start~here!}$$

Improving existing figures and combining them into a collage

Two types of plots are already produced through our code:

  • variable importance plots
    [image: a variable importance plot]

  • decision tree plots
    [image: a decision tree plot]

Both of these plots are a good start, but they are somewhat difficult to interpret for a non-statistician. On the one hand, the existing plots could be further simplified to make them more intuitive; decision tree plots, for example, can be improved over the default settings via the options explained in http://www.milbo.org/rpart-plot/prp.pdf (see the sketch below). On the other hand, alternative ways to visualize regression analysis results exist. Below are examples of such visualizations. Which exact visualizations would make sense for our data is a matter we still have to discuss as a group. If we can adapt and implement some of these code examples (especially example 1!) for our data, we should be able to create similar visualizations.
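
To make the first point concrete, here is a minimal sketch of how the default decision tree plot could be simplified via rpart.plot using options from the linked manual. `tree_fit` is assumed to be a fitted tidymodels decision tree (rpart engine) as in the methods sketch above, and the specific option values are merely suggestions, not settled choices.

```r
# Sketch: simplifying a decision tree plot via rpart.plot (options per the prp manual).
# Assumes `tree_fit` is a fitted parsnip decision_tree model with the rpart engine.
library(tidymodels)
library(rpart.plot)

rpart_obj <- extract_fit_engine(tree_fit)   # pull the underlying rpart object

rpart.plot(
  rpart_obj,
  type        = 2,        # label all nodes; split labels below the nodes
  extra       = 101,      # show fitted value plus number/percentage of observations
  box.palette = "Blues",  # shade nodes by fitted value
  shadow.col  = "gray",   # subtle drop shadow for readability
  branch      = 0.5,      # slanted branches
  tweak       = 1.1       # enlarge text slightly
)
```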

Example 1
This code seems to visualize the results of regression decision trees via the package parttree.
https://paulvanderlaken.com/2020/03/31/visualizing-decision-tree-partition-and-decision-boundaries/
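
For orientation, here is a rough sketch of what the parttree approach from that post could look like on our data. The response and the two numeric predictors (evenness, genome_size, coding_ratio) are placeholders, and the geom_parttree() usage follows the pattern shown in the linked post.

```r
# Sketch: decision-boundary plot for a regression tree via parttree (following the linked post).
# `reg_data` and all column names are placeholders for the project dataset.
library(tidymodels)
library(parttree)
library(ggplot2)

# A tree with exactly two numeric predictors, as required for a 2-D partition plot
tree2d <- decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("regression") %>%
  fit(evenness ~ genome_size + coding_ratio, data = reg_data)

ggplot(reg_data, aes(x = genome_size, y = coding_ratio)) +
  geom_parttree(data = tree2d, aes(fill = evenness), alpha = 0.3) +  # shaded tree partitions
  geom_point(aes(colour = evenness)) +                               # raw observations
  theme_minimal()
```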

Example 2
This code seems to visualize the results of regression decision trees via the package parsnip.
https://www.tmwr.org/explain
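
If I recall that chapter correctly, it builds its explanation plots with the DALEX/DALEXtra packages on top of tidymodels fits; a rough sketch of that route, reusing the placeholder objects from the methods sketch above, might look like this.

```r
# Sketch: permutation-based variable importance via DALEX/DALEXtra, as in the tmwr.org chapter.
# Assumes `lm_fit` and `reg_data` (placeholders) from the tidymodels sketch further above.
library(DALEXtra)

explainer <- explain_tidymodels(
  lm_fit,
  data  = dplyr::select(reg_data, -evenness),  # predictors only
  y     = reg_data$evenness,                   # observed response
  label = "linear regression"
)

importance <- DALEX::model_parts(explainer)    # permutation-based variable importance
plot(importance)
```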

Goal

The production of publication-quality figures that would help a manuscript reader in interpreting the regression analysis results.

@bosonicbw

Comment Regarding Issue #51

Background:

Though I do not have an extensive background in writing R scripts, I have been actively building my skills over the past week and a half to better contribute to this data visualization issue, and I am starting to get a better grip on it!

The visualization in this script currently rests primarily on vip and ggplot2, with minor use of svglite and rplot. However, to obtain more refined visualizations, we can build on the current visualization functions and/or the manipulated data via several potential routes that I have been looking into.

Potential Routes:

  • Modifying the current implementation to incorporate aspects from parttree alongside parsnip with vip

    • As mentioned in the original examples presented in the issue above, the code for modelling the decision trees can be rewritten/extended to use capabilities found natively in the parttree package. However, parttree does not seem to provide a manageable option for plotting variable importance, so I would recommend building on the capabilities natively found in vip, since that package is already utilized, while incorporating the parsnip ecosystem to create better, more modern-looking visualizations (see the sketch after this list).
  • Exporting to RDS and using a Python script for data visualization

    • Another option that seems easily doable would be to export the manipulated data to an RDS file, then have a separate Python script that reads the RDS file using pyreadr and creates modern-looking, accurate data visualizations of decision trees and variable importance plots using the popular libraries seaborn and matplotlib.
  • Maximizing the existing capabilities of ggplot2 (Not Recommended)

    • Another route would be to include no additional packages and simply maximize the capabilities already present in ggplot2 and vip, manually changing things such as colors and fonts to create a custom theme from scratch, since both packages are already used in the script.
      (This route is more labor-intensive and thus less preferred.)
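
To make the first route a bit more concrete, here is a minimal sketch of how a variable importance plot could be drawn from a parsnip fit via vip and then themed with ggplot2. `tree_fit` is a placeholder for a fitted tidymodels decision tree, as in the sketches further above.

```r
# Sketch: variable importance plot from a parsnip fit via vip, themed with ggplot2.
# Assumes `tree_fit` is a fitted parsnip decision_tree (rpart engine), as sketched above.
library(tidymodels)
library(vip)
library(ggplot2)

vip(extract_fit_engine(tree_fit), num_features = 10) +   # importance from the underlying rpart fit
  labs(title = "Variable importance: sequencing depth model") +
  theme_minimal(base_size = 14)
```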

My Next Steps:

Personally, even though I am rather inclined to attempt using Python given my background in the language (and I may still do so in my free time), I believe the most straightforward approach is ultimately the first route mentioned above: modifying the current implementation to incorporate aspects of parttree alongside parsnip with vip. Thus, I will continue getting up to speed with R and gaining a better understanding of the regression analysis script at hand!

I should have my finalized draft by this Friday, November 1st, and I look forward to my next comment then!

@michaelgruenstaeudl (Owner, Author) commented Nov 1, 2024

The visualizations that we already have (i.e., variable importance plots, decision tree plots) are not incorrect or otherwise "bad" in any way. Quite the contrary: they are a great first step. They also do not need to be modern-looking in any particular way. In their current form, they are just a bit difficult for a layperson to interpret (as illustrated by the following example), and I believe that we can generate some valid figures that are easier for a non-expert reader to grasp.

Example: The following collage of variable importance plots correctly displays the variable importance for each analysis performed, but does it convey a clear message?

[image: Variable Importance Collage]

By comparison, I consider the visualization of decision boundaries more intuitive. The parttree output as presented at https://paulvanderlaken.com/2020/03/31/visualizing-decision-tree-partition-and-decision-boundaries/ or the decision boundary plot made via ggplot at https://koalaverse.github.io/homlr/notebooks/09-decision-trees.nb.html is something that a non-expert reader would grasp quickly.

Conceptually, something like the motivating example at https://cran.r-project.org/web/packages/ggparty/vignettes/ggparty-graphic-partying.html would be amazing! After all, that figure shows a decision tree with bar charts for each tree node, and we have the same kind of data in our regression analysis results.
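
For reference, here is a rough sketch of what such a ggparty figure could look like for one of our trees. The conversion via partykit::as.party() and the placeholder response name `depth` are assumptions; since our response is numeric, the per-node plot below uses a boxplot instead of the bar charts shown in the vignette.

```r
# Sketch: decision tree with a small plot per terminal node, via partykit + ggparty.
# Assumes `tree_fit` is a fitted parsnip decision_tree (rpart engine); `depth` is a placeholder name.
library(tidymodels)
library(partykit)
library(ggparty)
library(ggplot2)

party_tree <- as.party(extract_fit_engine(tree_fit))  # convert the rpart fit to a party object

ggparty(party_tree) +
  geom_edge() +
  geom_edge_label() +          # split values on the edges
  geom_node_splitvar() +       # split variable in each inner node
  geom_node_plot(              # one small plot per terminal node
    gglist = list(
      geom_boxplot(aes(y = depth)),
      theme_minimal()
    )
  )
```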
