diff --git a/topics/microbiome/tutorials/multivariable-association/tutorial.md b/topics/microbiome/tutorials/multivariable-association/tutorial.md index d334485ca744e5..281a16e1c4df3e 100644 --- a/topics/microbiome/tutorials/multivariable-association/tutorial.md +++ b/topics/microbiome/tutorials/multivariable-association/tutorial.md @@ -13,7 +13,7 @@ tags: level: Introductory zenodo_link: https://zenodo.org/records/12614561 questions: -- How do I find associations between microbial features and specific metadata variables ? +- How do I find associations between microbial features and specific metadata variables? objectives: - Identify statistically significant associations between microbial features and metadata variables (such as clinical conditions, environmental factors, or demographic information) in microbiome data. - Uncover potential biomarkers associated with specific disease states. @@ -24,15 +24,12 @@ contributions: - paulzierep editing: - shiltemann - - - --- # Microbiome Association Detection with MaAsLin2 The importance of identifying associations between microbial features and metadata variables using tools like MaAsLin2 lies in several key areas: -- **Understanding Disease Mechanisms:** These associations can provide insights into how changes in microbial composition may contribute to the development or progression of diseases. This understanding is crucial for advancing knowledge of disease mechanisms. +- **Understanding Disease Mechanisms:** These associations can provide insights into how microbial composition changes may contribute to disease development or progression. This understanding is crucial for advancing knowledge of disease mechanisms. - **Potential Diagnostic Markers:** Identifying microbial biomarkers associated with specific diseases or conditions can potentially lead to the development of diagnostic tests. These tests could aid in earlier detection, more accurate diagnosis, and monitoring of disease progression. 
@@ -42,7 +39,9 @@ The importance of identifying associations between microbial features and metada - **Advancing Microbiome Research:** Building a comprehensive understanding of microbial associations with various factors enhances microbiome research. This knowledge can contribute to broader insights into microbial ecology, evolution, and interactions within the human body and the environment. -In addition to MaAslin2, Galaxy offers several other differential analysis tools that are widely used in both transcriptomics and microbiome studies. These tools are designed to handle different types of data (e.g., RNA-seq, microbial count data), with varying strengths in terms of statistical power, handling of sparsity, and treatment of compositional data. Some of them are mentioned below: +In addition to MaAslin2, Galaxy offers several other differential analysis tools widely used in transcriptomics and microbiome studies. +These tools are designed to handle different types of data (e.g., RNA-seq, microbial count data), with varying strengths in terms of statistical power, +handling of sparsity, and treatment of compositional data. Some of them are mentioned below: | Tool | Strengths | Weaknesses | Comparison to MaAsLin2 | @@ -62,7 +61,7 @@ In addition to MaAslin2, Galaxy offers several other differential analysis tools ![sensitivity and false discovery rate (FDR) across different tools](https://journals.plos.org/ploscompbiol/article/figure/image?size=large&id=10.1371/journal.pcbi.1009442.g004 "Source: sensitivity and false discovery rate (FDR) across different tools"){:width="60%"} - The above figure compares various tools for differential abundance detection (Panel A) and multivariable association detection (Panel B) in microbiome studies, based on sensitivity and false discovery rate (FDR). -- **Sensitivity** measures how well the methods detect true signals ,higher values leads to better performance. 
+- **Sensitivity** measures how well the methods detect true signals; higher values indicate better performance. - **False discovery rate (FDR)** measures the proportion of false positives among detected signals (lower FDR is better). - MaAsLin2 is the clear standout for both differential abundance detection and multivariable association detection, showing high sensitivity and maintaining a low FDR. @@ -73,17 +72,18 @@ In addition to MaAslin2, Galaxy offers several other differential analysis tools MaAsLin2 requires the following input files: -- **Taxonomy (or features) file** : \ +- **Taxonomy (or features) file**: \ This file is tab-delimited.\ Formatted with features as columns and samples as rows.\ - The transpose of this format is also okay.\ + The transposition of this format is also okay.\ Possible features in this file include microbes, genes, pathways, etc. - **Metadata file** : \ This file is tab-delimited.\ Formatted with features as columns and samples as rows.\ - The transpose of this format is also okay. + The transposition of this format is also okay. -The Taxonomy file can contain samples not included in the metadata file (or vice versa). For both cases, those samples not included in both files will be removed from the analysis. Also the samples do not need to be in the same order in the two files. +The Taxonomy file can contain samples not included in the metadata file (or vice versa). In both cases, samples that are not present in both files will be removed from the analysis. +Also, the samples do not need to be in the same order in the two files. > > @@ -167,23 +167,23 @@ Now we will find significant associations between microbial features( taxonomy f > - *"Random effects"*: `c5:subject` > - *"Reference"*: `diagnosis,CD` > -> Keep rest of the default values as it is. +> Keep the rest of the default values as they are. {: .hands_on} # Understanding parameters in the tool Lets now understand the role of each parameter in the tool. -1. 
**Interactions:Fixed effects** : Fixed effects are the factors in your model that you want to study and draw conclusions about. These are the variables you hypothesize have a direct and consistent influence on the outcome. For example, you are studying how different diets affect gut microbiome composition, then diet would be a fixed effect because you’re specifically interested in understanding how different diets influence the microbiome. You might also include other fixed effects like age and gender to control for their impact. +1. **Interactions:Fixed effects**: Fixed effects are the factors in your model that you want to study and draw conclusions about. These are the variables you hypothesize have a direct and consistent influence on the outcome. For example, if you are studying how different diets affect gut microbiome composition, then diet would be a fixed effect because you’re specifically interested in understanding how different diets influence the microbiome. You might also include other fixed effects like age and gender to control for their impact. -2. **Random effects** : In some studies, like those following people over time or studying families, samples from the same group can be similar. MaAsLin2 helps handle this by letting researchers choose a grouping factor. This helps make sure the statistical analysis is more accurate. For example, setting random_effects = "Subject_ID" helps control for the correlation between samples that come from the same individual. +2. **Random effects**: In some studies, like those following people over time or studying families, samples from the same group can be similar. MaAsLin2 helps handle this by letting researchers choose a grouping factor. This helps make sure the statistical analysis is more accurate. For example, setting random_effects = "Subject_ID" helps control for the correlation between samples that come from the same individual. -3. 
**Reference** : It allows researchers to establish a baseline or standard category against which other categories are compared, helping to interpret and understand the effects of different variables on microbial features. +3. **Reference**: It allows researchers to establish a baseline or standard category against which other categories are compared, helping to interpret and understand the effects of different variables on microbial features. > - > - In MaAslin2, reference level is must for variables with more than two distinct kind of values. - > - Reference for a variable with more than two levels is provided as a string of `variable,reference`. - > - Reference for more than one variable having more than two levels each is provided as a string of `variable1,reference1,variable2,reference2` . - > - Example, both diagnosis and site variable have more than two levels hence reference can be provided as `diagnosis,CD,site,Cedars-Sinai`. + > - In MaAsLin2, a reference level is required for variables with more than two distinct levels. + > - Reference for a variable with more than two levels is provided as a string of `variable,reference`. + > - Reference for more than one variable having more than two levels each is provided as a string of `variable1,reference1,variable2,reference2`. + > - For example, both the diagnosis and site variables have more than two levels, so the reference can be provided as `diagnosis,CD,site,Cedars-Sinai`. {: .comment} **Additional options** : @@ -224,11 +224,11 @@ Finally, this adjustment factor is used to normalize the counts in each sample, - The transform to apply to the datasets. - This is done to make the data more suitable for the linear models used in MaAslin2, helping to improve the accuracy and reliability of the results. - Options: \ - 1. LOG : The log transformation applies the natural logarithm (log base e) to the data. Used When your data has a wide range of values or + 1. 
LOG: The log transformation applies the natural logarithm (log base e) to the data. Used when your data has a wide range of values or is heavily skewed, such as microbiome abundance data where some taxa are much more abundant than others. 2. LOGIT: The logit transformation is used for data that represent proportions or probabilities, where the values lie between 0 and 1. It is defined as logit(x) = log(x / (1 - x)). Used when dealing with data that represents proportions, such as relative abundances that are expressed as fractions or percentages.\ The logit transformation is only applicable to data within the open interval (0, 1), so values exactly at 0 or 1 need to be adjusted (e.g., adding a small constant like 0.001). - 3. Arcsine Square Root Transformation (AST) : It is a statistical transformation used primarily on proportion or percentage data.\ + 3. Arcsine Square Root Transformation (AST): It is a statistical transformation used primarily on proportion or percentage data.\ The transformation starts by taking the square root of the proportion value. This step reduces the impact of extreme values.\ Next, it applies the arcsine function (the inverse of the sine function) to the square root result. The arcsine function helps to normalize the distribution further.\ Used when you are working with proportion data, such as relative abundances in microbiome studies, where the values are bounded between 0 and 1. @@ -238,52 +238,52 @@ Used when you are working with proportion data, such as relative abundances in m - Options: \ 1. Linear Model (LM): Determines how changes in metadata are associated with changes in the taxonomy data. 2. Compositional Proportional Linear Model (CPLM): used for analyzing compositional data, where the taxa abundances are proportions or percentages that sum to 1. - 3. Zero-Inflated Count Model (ZCIP):used when there are many zero counts in the microbiome data. It handles datasets where a large number of taxa are absent in may samples. + 3. 
Zero-Inflated Count Model (ZICP): used when there are many zero counts in the microbiome data. It handles datasets where a large number of taxa are absent in many samples. 4. Negative Binomial Model (NEGBIN): used for count data where there is overdispersion (variance exceeds the mean). - 5. Zero-Inflated Negative Binomial Model (ZIND) : combines features of both zero-inflation and negative binomial models, useful for count data with both excess zeros and overdispersion. + 5. Zero-Inflated Negative Binomial Model (ZINB): combines features of both zero-inflation and negative binomial models, useful for count data with both excess zeros and overdispersion. 10. **correction or adjustment methods** [ Default: "BH" ] : - When performing numerous statistical tests simultaneously, like testing the association of many microbial taxa with various metadata variables, the risk of finding false positives increases. - Correction methods help control this risk to ensure that the results are reliable and that significant findings are not due to random chance. -- This is done by computing the q-value,which is a measure of how many false positives are expected among the significant results. +- This is done by computing the q-value, which is a measure of how many false positives are expected among the significant results. - Options:\ - 1. Benjamini & Hochberg(BH)(aka false discovery rate(fdr)): A common method used for FDR correction. It ranks the p-values from smallest to largest and adjusts them base on their rank and the total number of tests. - 2. Benjamini & Yekutieli(BY) : Similar to Benjamini-Hochberg but includes a correction factor that accounts for the correlation between tests. - 3. Bonferroni correction : Divides the significance threshold (alpha level) by the number of tests performed and then compare each p-value to this adjusted significance level to determine if it is statistically significant. + 1. 
Benjamini & Hochberg(BH)(aka false discovery rate(fdr)): A common method used for FDR correction. It ranks the p-values from smallest to largest and adjusts them based on their rank and the total number of tests. + 2. Benjamini & Yekutieli(BY): Similar to Benjamini-Hochberg but includes a correction factor that accounts for the correlation between tests. + 3. Bonferroni correction: Divides the significance threshold (alpha level) by the number of tests performed and then compares each p-value to this adjusted significance level to determine if it is statistically significant. 4. Hochberg : It is similar to the Bonferroni correction but is often more powerful, meaning it has better statistical power to detect true effects while controlling for false positives. 5. Hommel: controls the Family-Wise Error Rate (FWER) by adjusting p-values in a step-down fashion, starting from the smallest p-value and progressively increasing the threshold. It is more powerful than Bonferroni while maintaining strict error control. 6. Holm: controls the Family-Wise Error Rate (FWER) by sequentially adjusting p-values from smallest to largest, comparing ch step. This stepwise approach is less conservative than the Bonferroni correction, offering greater statistical power.\ **FWER** is the probability of finding at least one false positive among all the tests performed, assuming all null hypotheses are true. **FWER Control** is used to minimize the risk of incorrectly claiming significant results when there are none, thus maintaining the overall reliability of the results.\ -For more information on correction methods , [click here](https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/p.adjust). +For more information on correction methods, [click here](https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/p.adjust). -11. **standardize** : Apply z-score so continuous metadata are on the same scale [ Default: TRUE ] -12. 
**plot_heatmap** : Generate a heatmap for the significant associations [ Default: TRUE ] -13. **heatmap_first_n** : In heatmap, plot top N features with significant associations [ Default: 50 ] -14. **plot_scatter** : Generate scatter plots for the significant associations [ Default: TRUE ] -15. **cores** : The number of R processes to run in parallel [ Default: 1 ] +11. **standardize**: Apply z-score so continuous metadata are on the same scale [ Default: TRUE ] +12. **plot_heatmap**: Generate a heatmap for the significant associations [ Default: TRUE ] +13. **heatmap_first_n**: In heatmap, plot top N features with significant associations [ Default: 50 ] +14. **plot_scatter**: Generate scatter plots for the significant associations [ Default: TRUE ] +15. **cores**: The number of R processes to run in parallel [ Default: 1 ] # Reading Output Files -The tool generate the following five major files: +The tool generates the following five major files: - **Data output files** 1. `residuals.rds` This file contains a data frame with residuals for each feature. 2. `significant_results.tsv` Provides the most important output from MaAsLin2 which is the list of significant associations. 3. `all_results.tsv` - Same format as significant_results.tsv, but include all association results (instead of just the significant ones). + Same format as significant_results.tsv, but includes all association results (instead of just the significant ones). - **Visualization output files** 4. `heatmap.pdf` This file contains a heatmap of the significant associations. - ![heatmap](../../images/heatmap_maaslin2.png "heatmap of significatn associations") + ![heatmap](../../images/heatmap_maaslin2.png "heatmap of significant associations") 5. `plots :` A plot is generated for each significant association. Scatter plots are used for continuous metadata. Box plots are for categorical data. - Data points plotted are after normalization, filtering, and transform. 
+ Data points plotted are after normalization, filtering, and transformation. > > @@ -295,8 +295,8 @@ The tool generate the following five major files: > > - Observe how setting the reference value as `CD` for the categorical variable `diagnosis` in MaAsLin2 implies that this reference level will be used as the baseline for comparison against other levels of the variable, i.e, `nonIBD` and `UC`. > > - The effects of other levels will be interpreted relative to this reference level, helping to understand their impact on microbial features. > > - The **colors** of the heatmap represent the magnitude and direction of associations between microbial features and metadata variables. - > > - **Color Intensity** : The intensity of the color indicates the strength of the association. Darker or more vivid colors usually represent stronger associations. - > > - **Color Hue** : The hue (e.g., red, blue) typically indicates the direction of the association. For instance, red represents positive associations (where an increase in the metadata variable is associated with an increase in the microbial feature) and another color blue represents negative associations (where an increase in the metadata variable is associated with a decrease in the microbial feature).\ + > > - **Color Intensity**: The intensity of the color indicates the strength of the association. Darker or more vivid colors usually represent stronger associations. + > > - **Color Hue**: The hue (e.g., red, blue) typically indicates the direction of the association. 
For instance, red represents positive associations (where an increase in the metadata variable is associated with an increase in the microbial feature) and blue represents negative associations (where an increase in the metadata variable is associated with a decrease in the microbial feature).\ > > > > For example, if you look for `Bifidobacterium longum` in the heatmap, you'll notice that its occurrence in the human gut is least affected by the individual's age and shows a neutral effect in relation to their diagnosis of UC (Ulcerative Colitis) and non-IBD (non-Inflammatory Bowel Disease). > > 2. The significant.tsv file shows statistically significant associations between microbial features and metadata variables that meet a specified threshold (in our case, the default `Maximum significance = 0.25`). It includes effect sizes, p-values, and adjusted p-values (q-values) to indicate the strength, direction, and reliability of each association. This file helps identify meaningful relationships in the microbiome data. @@ -310,15 +310,15 @@ The tool generate the following five major files: The study explores the enhancement of microbiome research through the incorporation of dietary data. The research emphasizes that integrating detailed dietary information with microbiome analyses provides a more comprehensive understanding of how diet influences gut microbiota composition and function. By applying advanced techniques in nutri-metaomics, the study aims to link specific dietary patterns with microbial changes, revealing insights into the interactions between diet, the microbiome, and health outcomes. This approach improves the ability to identify diet-related biomarkers and tailor personalized nutrition interventions based on microbial profiles.\ MaAsLin2 was used to assess how specific dietary patterns influence the abundance and diversity of gut microbiota by integrating detailed dietary data with microbiome profiles. 
\ MaAsLin2 was set up with the following parameters: \ -1. **normalization** : TMM +1. **normalization**: TMM 2. **transform**: LOG -3. **correction** : BH -4. **analysis_method** : LM -5. **max_significance** : 0.25 (default significance threshold) -6. **min_abundance** : 0.0001 -7. **min_prevalence** : 0.1 +3. **correction**: BH +4. **analysis_method**: LM +5. **max_significance**: 0.25 (default significance threshold) +6. **min_abundance**: 0.0001 +7. **min_prevalence**: 0.1 8. **fixed effects**: Age, gender, and other characteristics of the participants as well as dietary data were added as fixed effects. -9. **random effects** : as participant samples from two timepoints were included, the participant identification number was added as a random effect.\ +9. **random effects**: as participant samples from two time points were included, the participant identification number was added as a random effect.\ All models were adjusted for gender. \ Results with a false-discovery rate (FDR) lower than 0.25 were considered significant. @@ -330,15 +330,15 @@ MaAsLin2 was set up with the following parameters: 2. **random effects**: Subject ID was specified as a random effect due to multiple samples from the same subject 3. **min_prevalence**: The minimum prevalence threshold was set to 0.1, indicating that features must be present in at least 10% of the samples to be included. 4. **transform**: LOG transformed -5. **Analysis method** : The general linear “LM” model was used. -6. **Correction method** : The Benjamini-Hochberg procedure was used to correct P values -7. **Normalization method**:A Centered Log-Ratio (CLR) normalization approach was used instead of default normalization methods - - +5. **Analysis method**: The general linear “LM” model was used. +6. **Correction method**: The Benjamini-Hochberg procedure was used to correct P values +7. 
**Normalization method**: A Centered Log-Ratio (CLR) normalization approach was used instead of default normalization methods - [**The infant gut resistome is associated with E. coli and early-life exposures** ](https://link.springer.com/article/10.1186/s12866-021-02129-x):\ -The study investigated how the infant gut resistome—the collection of antibiotic resistance genes (ARGs) in the gut microbiome—associates with E. coli and early-life exposures using MaAsLin2. The analysis, which utilized additive boosting of generalized linear models for feature reduction, revealed significant associations between ARGs and E. coli presence, as well as early-life factors such as antibiotic use and other exposures. Key parameters included CLR normalization of compositional abundance data, no standardization of continuous variables, and a strict significance threshold (q-value < 0.01) using Benjamini-Hochberg correction. This approach highlighted how early exposures influence the resistome and its relationship with E. coli in the infant gut. - +The study investigated how the infant gut resistome—the collection of antibiotic resistance genes (ARGs) in the gut microbiome—associates with E. coli and early-life exposures using MaAsLin2. +The analysis, which utilized additive boosting of generalized linear models for feature reduction, revealed significant associations between ARGs and E. coli presence, as well as early-life factors such +as antibiotic use and other exposures. Key parameters included CLR normalization of compositional abundance data, no standardization of continuous variables, and a strict significance threshold +(q-value < 0.01) using Benjamini-Hochberg correction. This approach highlighted how early exposures influence the resistome and its relationship with E. coli in the infant's gut. # Conclusion