Summarizing and visualizing the regression-based statistical tests #51
Comment regarding issue #51

Background: Though I do not have an extensive background in writing scripts in R, I have been actively building my skills this past week and a half to better contribute to this data visualization issue, and I am starting to get a better grip on it for this project! The visualization in the current script rests primarily on vip and ggplot2, plus minor features of svglite and rplot. To obtain a more refined visualization structure, however, we can expand on the current visualization functions and/or the manipulated data via various potential routes that I have been looking into.

Potential Routes:
My Next Steps: Personally, even though I am rather inclined to attempt using Python (and I may still do so in my free time) given my background in the language, I believe that the most straightforward approach ultimately is the first route I mentioned above: modifying the current implementation to incorporate aspects from […]. I should have my finalized draft by this Friday, November 1st, and I look forward to my next comment then!
The visualizations that we already have (i.e., variable importance plots, decision tree plots) are not incorrect or otherwise "bad" in any way. Quite the contrary: they are a great first step. They also do not need to be modern-looking in any particular way. In their current form, they are just a bit difficult for a layman to interpret (as illustrated by the following example), and I believe that we can generate some valid figures that are easier for a non-expert reader to grasp.

Example: The following collage of variable importance plots correctly displays the variable importance for each analysis performed, but does it convey a clear message? By comparison, I consider the visualization of decision boundaries more intuitive.

Conceptually, something like the motivating example at https://cran.r-project.org/web/packages/ggparty/vignettes/ggparty-graphic-partying.html would be amazing! After all, that figure shows a decision tree with bar charts for each tree node, and conceptually we have the same data in our regression analysis results.
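For reference, a minimal sketch of what such a node-wise plot could look like with ggparty. This is not our implementation: iris and a conditional inference tree serve as stand-ins for our actual data and regression trees, and the aesthetics are illustrative only.

```r
library(partykit)  # ctree()
library(ggparty)
library(ggplot2)

# Stand-in model: a conditional inference tree on iris; the real input
# would be the decision trees from our regression analyses
py_tree <- ctree(Species ~ Petal.Length + Petal.Width, data = iris)

p <- ggparty(py_tree) +
  geom_edge() +
  geom_edge_label() +     # split values on the edges
  geom_node_splitvar() +  # split variable in each inner node
  # A small bar chart inside every terminal node, as in the vignette
  geom_node_plot(
    gglist = list(
      geom_bar(aes(x = "", fill = Species)),
      theme_void()
    ),
    shared_axis_labels = TRUE
  )
print(p)
```

The key idea is that `geom_node_plot()` accepts an arbitrary list of ggplot2 layers, so any per-node summary of our regression results could be substituted for the bar chart.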
Overview
The purpose of this issue/document is two-fold:
Regression analyses
The regression analyses are conducted through this script: https://github.com/michaelgruenstaeudl/PACVr/blob/master/inst/extdata/scripts_for_figures_tables/PREP_RegressionAnalysis_RunAnalyses.R
The input data for this script is found here: https://github.com/michaelgruenstaeudl/PACVr/tree/master/inst/extdata/scripts_for_figures_tables/input/data_for_regression_analyses_2024_10_16.rds
Textual summary of analyses and results
METHODS
Regression-based statistical tests
Regression models that implement either linear regressions or decision trees were used to test the hypotheses of this investigation. These models employed predictor variables that are also evaluated in our group-based statistical tests as well as control variables (hereafter ‘covariates’) that may have an additional, confounding impact on sequencing depth and sequencing evenness.
The regression models on sequencing depth used the separation into the four structural partitions of a plastid genome and the separation between the coding versus the non-coding sections of each plastome as predictor variables, with each category within these predictor variables defined as independent. The effects of these predictor variables on sequencing depth were modeled as a decision tree. Total genome size and the exact ratio of coding versus non-coding plastome sections were specified as covariates.
The regression models on sequencing evenness, by comparison, used the assembly quality indicators (i.e., the number of ambiguous nucleotides and the number of IR mismatches) and the identity of the sequencing platform employed as predictor variables, with each sequencing platform category defined as independent. The effects of each assembly quality indicator on sequencing evenness were modeled as a linear regression, the effect of the identity of the sequencing platform employed on sequencing evenness as a decision tree. Total genome size and the exact ratio of coding versus non-coding plastome sections were used as covariates. The identity of the assembly software tool employed, by contrast, was not used as a predictor variable due to its high proportion of missing data.
Outliers within any predictor or control variable category were removed from the dataset prior to model fitting.
All regression-based models were implemented in R via the R package tidymodels (Kuhn and Wickham 2020).
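For illustration, a minimal tidymodels sketch of the two model types described above (a linear regression and an rpart-based regression decision tree). mtcars serves as stand-in data here, and the formula variables are placeholders for our actual predictors and covariates from the .rds input file.

```r
library(tidymodels)

# Model specifications: a linear regression and a regression decision tree
lm_spec   <- linear_reg() %>% set_engine("lm")
tree_spec <- decision_tree() %>% set_engine("rpart") %>% set_mode("regression")

# Stand-in data and formula; the real input is the project's
# data_for_regression_analyses .rds file
lm_fit   <- fit(lm_spec,   mpg ~ wt + factor(cyl), data = mtcars)
tree_fit <- fit(tree_spec, mpg ~ wt + factor(cyl), data = mtcars)

tidy(lm_fit)  # coefficient estimates with p-values
```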
RESULTS
Regression-based statistical tests
The results of our regression-based statistical tests on sequencing depth indicated that both the separation into the four structural partitions of a plastid genome and the separation between the coding versus the non-coding sections of each plastome had a significant effect on, and a moderate explanatory ability for, the observed variability in sequencing depth.
The results of our regression-based statistical tests on sequencing evenness indicated that the identity of the sequencing platform employed had a significant effect on the observed variability in sequencing evenness. The linear regression regarding the effect of the assembly quality indicators on sequencing evenness indicated that only the number of ambiguous nucleotides had a significant impact on sequencing evenness, corroborating the results of our correlation coefficient analysis among the group-based statistical tests.
Is the above summary complete?
Visual summary of results
Premise
The results of the above regression analyses are difficult to interpret for a non-statistician, because they are primarily textual. For our planned manuscript, it would be great if we could generate graphical visualizations of these results, especially such that are simple, colorful, and intuitive (yes, most manuscript readers need to be treated like children!).
Improving existing figures and combining them into a collage
Two types of plots are already produced through our code:
variable importance plots
decision tree plots
Both of these plots are a good start, but they are somewhat difficult to interpret for a non-statistician. On the one hand, the existing plots could be further simplified to make them more intuitive; decision tree plots, for example, can be improved over the default settings via the options explained in http://www.milbo.org/rpart-plot/prp.pdf. On the other hand, alternative ways to visualize regression analysis results exist. Below are examples of such visualizations. Which exact visualizations make sense for our data is a matter we still have to discuss as a group. If we can adapt and implement some of these code examples (especially example 1!) for our data, we should be able to create similar visualizations.
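As a sketch of the kind of simplification the rpart.plot/prp documentation describes (mtcars stands in for our data, and the specific option values are examples, not prescriptions):

```r
library(rpart)
library(rpart.plot)

# Stand-in regression tree; the real trees come from our regression script
fit <- rpart(mpg ~ wt + hp + disp, data = mtcars)

# A simplified, more readable variant of the default plot:
#   type = 4    -> label all nodes, with split labels below them
#   extra = 101 -> show n and the percentage of observations per node
#   box.palette -> color nodes by fitted value
#   tweak       -> slightly enlarge all text
rpart.plot(fit, type = 4, extra = 101, box.palette = "GnBu", tweak = 1.1)
```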
Example 1
This code seems to visualize the results of regression decision trees via the package parttree: https://paulvanderlaken.com/2020/03/31/visualizing-decision-tree-partition-and-decision-boundaries/
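For reference, the core of that blog post boils down to something like the following (iris is stand-in data; our own response and predictor columns would replace it, and a 2-D decision-boundary plot requires exactly two numeric predictors):

```r
library(rpart)
library(ggplot2)
library(parttree)

# Stand-in tree with two numeric predictors
fit <- rpart(Species ~ Petal.Length + Petal.Width, data = iris)

p <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width)) +
  geom_parttree(data = fit, aes(fill = Species), alpha = 0.2) +  # shaded partitions
  geom_point(aes(colour = Species))                              # raw data points
print(p)
```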
Example 2
This code seems to visualize the results of regression decision trees via the package parsnip: https://www.tmwr.org/explain
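The linked chapter builds its explanations on DALEX-style explainers wrapped around tidymodels/parsnip fits. A minimal sketch under that assumption (mtcars as stand-in data; the label string is arbitrary):

```r
library(tidymodels)
library(DALEXtra)

# Stand-in parsnip fit; the real models come from our regression script
tree_fit <- decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("regression") %>%
  fit(mpg ~ ., data = mtcars)

explainer <- explain_tidymodels(
  tree_fit,
  data  = dplyr::select(mtcars, -mpg),
  y     = mtcars$mpg,
  label = "regression tree"
)

# Permutation-based variable importance, plotted in DALEX's default style
vi <- model_parts(explainer)
plot(vi)
```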
Goal
The production of publication-quality figures that would help a manuscript reader in interpreting the regression analysis results.