From 246e11780df40f7c52b215e85a5a05c408c3f56d Mon Sep 17 00:00:00 2001 From: doserjef Date: Thu, 12 Oct 2023 20:25:15 -0400 Subject: [PATCH] fix msAbund tests --- tests/testthat/test-msAbund.R | 4 +- vignettes/glmm.Rmd | 52 ++++++------ vignettes/glmm.html | 149 ++++++++++++++++++---------------- vignettes/references.bib | 22 +++++ 4 files changed, 128 insertions(+), 99 deletions(-) diff --git a/tests/testthat/test-msAbund.R b/tests/testthat/test-msAbund.R index d55cb1e..ad69550 100644 --- a/tests/testthat/test-msAbund.R +++ b/tests/testthat/test-msAbund.R @@ -575,8 +575,8 @@ y <- dat$y X <- dat$X coords <- dat$coords -covs <- list(cov.1 = X[, 1, 2], - cov.2 = X[, 1, 3]) +covs <- list(cov.1 = apply(X[, , 2], 1, mean, na.rm = TRUE), + cov.2 = apply(X[, , 3], 1, mean, na.rm = TRUE)) data.list <- list(y = y, covs = covs) diff --git a/vignettes/glmm.Rmd b/vignettes/glmm.Rmd index 6cecb73..7a84378 100644 --- a/vignettes/glmm.Rmd +++ b/vignettes/glmm.Rmd @@ -89,7 +89,7 @@ Following the classic GLM framework, we allow for variation in the mean $\mu_j$ where $\bm{\beta}$ is a vector of regression coefficients for a set of covariates $\bm{x}_j$ (including an intercept). When working with positive integer counts and using the Poisson or negative binomial distributions, we use a log link function. For Gaussian data, we use the identity link function, such that the right hand side of the previous equation simplifies to $\mu_j$ and covariates are directly related to the mean. Note that while not shown, unstructured random intercepts and slopes can be included in the equation for expected abundance. This may for instance be required for accommodating some sorts of "blocks", such as when sites are nested in a number of different regions. -To complete the Bayesian specification of the model, we assign Gaussian priors for the regression coefficients ($\bm{\beta}$), a uniform prior for the negative binomial dispersion parameter (when applicable), and an inverse-Gamma prior for the Gaussian variance parameter (when applicable). +To complete the Bayesian specification of the model, we assign Gaussian priors for the regression coefficients ($\bm{\beta}$), a uniform prior for the negative binomial dispersion parameter $\kappa$ (when applicable), and an inverse-Gamma prior for the Gaussian variance parameter $\tau^2$ (when applicable). ## Fitting univariate GLMMs with `abund()` @@ -100,7 +100,7 @@ abund(formula, data, inits, priors, tuning, n.batch, batch.length, accept.rate = 0.43, family = 'Poisson', n.omp.threads = 1, verbose = TRUE, n.report = 100, n.burn = round(.10 * n.batch * batch.length), n.thin = 1, - n.chains = 1, ...) + n.chains = 1, save.fitted = TRUE, ...) ``` The first argument `formula` uses standard R model syntax to denote the covariates to be included in the model. Only the right hand side of the formula is included. Random intercepts and slopes can be included in the model using `lme4` syntax [@bates2015]. For example, to include a random intercept for different observers in the model to account for observational variability, we would include `(1 | observer)` in `formula`, where `observer` indicates the specific observer for each data point. The names of variables given in the formulas should correspond to those found in `data`, which is a list consisting of the following tags: `y` (count data) and `covs` (covariates). `y` is the vector of count data with length equal to the number of sites in the data set and `covs` is a matrix or data frame with site-specific covariate values. Note the tag `offset` can also be specified to include an offset in the model when using a negative binomial or Poisson distribution. @@ -361,7 +361,7 @@ spAbund(formula, data, inits, priors, tuning, n.batch, batch.length, accept.rate = 0.43, family = 'Poisson', n.omp.threads = 1, verbose = TRUE, n.report = 100, n.burn = round(.10 * n.batch * batch.length), n.thin = 1, - n.chains = 1, ...) + n.chains = 1, save.fitted = TRUE, ...) ``` The arguments to `spAbund()` are very similar to those we saw with `abund()`, with a few additional components. The `formula` and `data` arguments are the same as before, with random slopes and intercepts allowed in both the abundance and detection models. Notice the `coords` matrix in the `data.HOWA` list of data. We did not use this for `abund()`, but specifying the spatial coordinates in `data` is necessary for all spatially explicit models in `spAbundance`. @@ -525,10 +525,10 @@ Note in our spatially-explicit predictions, there is extremely large uncertainy Now consider the case where count data, $y_{i, j}$, are collected for multiple species $i = 1, \dots, I$ (or some other set of multiple variables) at each survey location $j$. We model $y_{i, j}$ analogous to the univariate GLMM, with expected abundance now varying by species and site according to \begin{equation}\label{mu-Abund} - \text{log}(\mu_{i, j}) = \bm{x}_j^\top\bm{\beta}_i, + g(\mu_{i, j}) = \bm{x}_j^\top\bm{\beta}_i, \end{equation} -where $\bm{\beta}_i$ are the species-specific effects of covariates $\bm{x}_j$ (including an intercept) . When $y_{i, j}$ is modeled using a negative binomial distribution, we estimate a separate dispersion parameter $\kappa_i$ for each species. We model $\bm{\beta}_i$ as random effects arising from a common, community-level normal distribution, which often leads to increased precision of species-specific effects compared to single-species models. For example, the species-specific intercept $\beta_{0, i}$ is modeled according to +where $\bm{\beta}_i$ are the species-specific effects of covariates $\bm{x}_j$ (including an intercept) . As in univariate models, for all multivariate GLMMs in `spAbundance` we use a log link function when modeling integer count data with the Poisson or negative binomial distributions, while for Gaussian data, we use the identity link function. When $y_{i, j}$ is modeled using a negative binomial distribution, we estimate a separate dispersion parameter $\kappa_i$ for each species. We model $\bm{\beta}_i$ as random effects arising from a common, community-level normal distribution, which often leads to increased precision of species-specific effects compared to single-species models. For example, the species-specific intercept $\beta_{0, i}$ is modeled according to \begin{equation}\label{betaComm} \beta_{0, i} \sim \text{Normal}(\mu_{\beta_0}, \tau^2_{\beta_0}), @@ -547,7 +547,7 @@ msAbund(formula, data, inits, priors, tuning, n.batch, batch.length, accept.rate = 0.43, family = 'Poisson', n.omp.threads = 1, verbose = TRUE, n.report = 100, n.burn = round(.10 * n.batch * batch.length), n.thin = 1, n.chains = 1, - ...) + save.fitted = TRUE, ...) ``` Notice these are the exact same arguments we saw with `abund()`. We will again use data from the North American Breeding Bird Survey in Pennsylvania, USA, but now we will use data from all six species. Recall this is stored in the `bbsData` object. @@ -565,7 +565,7 @@ ms.formula <- ~ scale(bio2) + scale(bio8) + scale(bio18) + scale(forest) + (1 | obs) ``` -Next we specify the initial values in `inits`. For multivariate GLMMs, we supply initial values for community-level and species-level parameters. In `msAbund()`, we will supply initial values for the following parameters: `beta.comm` (community-level coefficients), `beta` (species-level coefficients), `tau.sq.beta` (community-level variance parameters), `kappa` (species-level negative binomial overdispersion parameters), and `sigma.sq.mu` (random effect variances). These are all specified in a single list. Initial values for community-level parameters (including the random effect variances) are either vectors of length corresponding to the number of community-level parameters in the model (including the intercepts) or a single value if all parameters are assigned the same initial values. Initial values for species level parameters are either matrices with the number of rows indicating the number of species, and each column corresponding to a different regression parameter, or a single value if the same initial value is used for all species and parameters. +Next we specify the initial values in `inits`. For multivariate GLMMs, we supply initial values for community-level and species-level parameters. In `msAbund()`, we will supply initial values for the following parameters: `beta.comm` (community-level coefficients), `beta` (species-level coefficients), `tau.sq.beta` (community-level variance parameters), `kappa` (species-level negative binomial dispersion parameters), `sigma.sq.mu` (random effect variances), and `tau.sq` (species-level Gaussian variance parameters). These are all specified in a single list. Initial values for community-level parameters (including the random effect variances) are either vectors of length corresponding to the number of community-level parameters in the model (including the intercepts) or a single value if all parameters are assigned the same initial values. Initial values for species level coefficients are either matrices with the number of rows indicating the number of species, and each column corresponding to a different regression parameter, or a single value if the same initial value is used for all species and parameters. Similarly, initial values for the species-specific NB dispersion parameter and the Gaussian variance parameter is either a vector with a different initial value for each species, or a single value if the same initial value is used for all species. ```{r} # Number of species @@ -574,9 +574,9 @@ ms.inits <- list(beta.comm = 0, beta = 0, tau.sq.beta = 1, kappa = 1, sigma.sq.mu = 0.5) ``` -In multivariate models, we specify priors on the community-level coefficients (or hyperparameters) rather than the species-level effects, with the exception that we still assign uniform priors for the negative binomial overdispersion parameter. For nonspatial models, these priors are specified with the following tags: `beta.comm.normal` (normal prior on the community-level mean effects), `tau.sq.beta.ig` (inverse-Gamma prior on the community-level variance parameters), `sigma.sq.mu.ig` (inverse-Gamma prior on the random effect variances), `kappa.unif` (uniform prior on the species-specific overdispersion parameters). For all parameters except the species-specific overdispersion parameters, each tag consists of a list with elements corresponding to the mean and variance for normal priors and scale and shape for inverse-Gamma priors. Values can be specified individually for each parameter or as a single value if the same prior is assigned to all parameters of a given type. For the species-specific overdispersion parameters, the prior is specified as a list with two elements representing the lower and upper bound of the uniform distribution, where each of the elements can be a single value if the same prior is used for all species, or a vector of values for each species if specifying a different prior for each species. +In multivariate models, we specify priors on the community-level coefficients (or hyperparameters) rather than the species-level effects, with the exception that we still assign uniform priors for the species-specific negative binomial overdispersion parameter and species-specific Gaussian variance parameter. For nonspatial models, these priors are specified with the following tags: `beta.comm.normal` (normal prior on the community-level mean effects), `tau.sq.beta.ig` (inverse-Gamma prior on the community-level variance parameters), `sigma.sq.mu.ig` (inverse-Gamma prior on the random effect variances), `kappa.unif` (uniform prior on the species-specific overdispersion parameters), `tau.sq.ig` (inverse-gamma prior on the species-specific Gaussian variance parameters). For all parameters except the species-specific NB overdispersion parameters, each tag consists of a list with elements corresponding to the mean and variance for normal priors and scale and shape for inverse-Gamma priors. Values can be specified individually for each parameter or as a single value if the same prior is assigned to all parameters of a given type. For the species-specific overdispersion parameters, the prior is specified as a list with two elements representing the lower and upper bound of the uniform distribution, where each of the elements can be a single value if the same prior is used for all species, or a vector of values for each species if specifying a different prior for each species. -By default, we set the prior hyperparameter values for the community-level means to a mean of 0 and a variance of 100. Below we specify these priors explicitly. For the community-level variance parameters, by default we set the scale and shape parameters to 0.1 following the recommendations of [@lunn2013bugs], which results in a weakly informative prior on the community-level variances. This may lead to shrinkage of the community-level variance towards zero under certain circumstances so that all species will have fairly similar values for the species-specific covariate effect [@gelman2006prior], although we have found multi-species models to be relatively robust to this prior specification. For the negative binomial overdispersion parameters, we by default use a lower bound of 0 and upper bound of 100 for the uniform priors for each species, as we did in the single-species (univariate) models. Below we explicitly set these default prior values for our example. +By default, we set the prior hyperparameter values for the community-level means to a mean of 0 and a variance of 100. Below we specify these priors explicitly. For the community-level variance parameters, by default we set the scale and shape parameters to 0.1 following the recommendations of [@lunn2013bugs], which results in a weakly informative prior on the community-level variances. This may lead to shrinkage of the community-level variance towards zero under certain circumstances so that all species will have fairly similar values for the species-specific covariate effect [@gelman2006prior], although we have found multi-species models to be relatively robust to this prior specification. For the negative binomial overdispersion parameters, we by default use a lower bound of 0 and upper bound of 100 for the uniform priors for each species, as we did in the single-species (univariate) models. Below we explicitly set these default prior values for our example. For the Gaussian variance parameters, we by default set the shape and scale parameter equal to 0.01, which results in a vague prior on the residual variances. ```{r} ms.priors <- list(beta.comm.normal = list(mean = 0, var = 100), @@ -590,7 +590,7 @@ All that's now left to do is specify the number of threads to use (`n.omp.thread ```{r} # Specify initial tuning values ms.tuning <- list(beta = 0.3, beta.star = 0.5, kappa = 0.5) -# Approx. run time: 1.5 min +# Approx. run time: 2 min out.ms <- msAbund(formula = ms.formula, data = bbsData, inits = ms.inits, @@ -615,11 +615,11 @@ summary(out.ms) # summary(out.ms, level = 'both') ``` -We see adequate convergence of most parameters, although we may run the model a bit longer to ensure convergence of the community-level variance paramters. Looking at the community-level values, we see a very strong effect of forest cover on average across the community. This is not too surprising given that we are working with six warbler species that breed in forest habitat. +We see adequate convergence of most parameters, although we may run the model a bit longer to ensure convergence of the community-level variance parameters. Looking at the community-level values, we see a very strong effect of forest cover on average across the community. This is not too surprising given that we are working with six warbler species that breed in forest habitat. ## Posterior predictive checks -We can use the `ppcAbund()` function to again perform posterior predictive checks, and then subsequently summarize the check with a Bayesian p-value using the `summary()` function. +We can use the `ppcAbund()` function to again perform posterior predictive checks, and then subsequently summarize the check with a Bayesian p-value using the `summary()` function. Notice for multivariate models we perform a posterior predictive check separately for each species, and the resulting Bayesian p-values can be summarized for both the overall community and individually for each species. ```{r} ppc.ms.out <- ppcAbund(out.ms, fit.stat = 'freeman-tukey', group = 0) @@ -631,7 +631,7 @@ summary(ppc.ms.out) Model selection (sometimes also called model comparison) proceeds exactly as before using WAIC. We compare our previous model to the same model, but now use a negative binomial distribution. Note that for multivariate models the logical argument `by.sp` allows us to calculate WAIC individually for each species if set to `TRUE`. ```{r} -# Approx. run time: 1.5 min +# Approx. run time: 2 min out.ms.nb <- msAbund(formula = ms.formula, data = bbsData, inits = ms.inits, @@ -667,7 +667,7 @@ out.ms.pred <- predict(out.ms, X.0, ignore.RE = TRUE) str(out.ms.pred) ``` -Notice we now have estimates of relative abundance across the state for each species, which we can use to generate a map of relative abundance for each species. Alternatively, we can generate a map of total relative abundance across all species as a simple measure of "how many birds" we could expect to observe at any given location. +Notice we now have estimates of relative abundance across the state for each species, which we can use to generate a map of relative abundance for each species. Alternatively, we can generate a map of total relative abundance across all species as a simple measure of "how many birds" we could expect to observe at any given location. Of course, this is just one community-level metric we could generate and we could instead generate some form of abundance-weighted richness metric, diversity metric, etc. ```{r, fig.width = 5, fig.height = 5, fig.align = 'center', units = 'in', message = FALSE, warning = FALSE} mu.sum.samples <- apply(out.ms.pred$mu.0.samples, c(1, 3), sum) @@ -687,7 +687,7 @@ ggplot() + title = 'Relative abundance of six warbler species') ``` -If you're familiar with Pennsylvania geography, this map makes a fair amount of sense as total relative abundance of the six species is highest in the North-Central portion of the state (which is heavily forestedand more variable in elevation) and is lowest in the Southeastern portion of the state (which is heavily developed). +If you're familiar with Pennsylvania geography, this map makes a fair amount of sense as total relative abundance of the six species is highest in the North-Central portion of the state (which is heavily forested and more variable in elevation) and is lowest in the Southeastern portion of the state (which is heavily developed). # Latent factor multivariate GLMMs @@ -698,7 +698,7 @@ The latent factor multivariate GLMM is identical to the previously described mul The model is identical to the previously described multivariate GLMM, with the only addition being a species-specific random effect at each site added to the equation for relative abundance. More specifically, we model species-specific relative abundance as \begin{equation} -\text{log}(\mu_{i, j}) = \bm{x}_j^\top\bm{\beta}_i + \text{w}^\ast_{i, j}. +g(\mu_{i, j}) = \bm{x}_j^\top\bm{\beta}_i + \text{w}^\ast_{i, j}. \end{equation} The species-specific random effect $\text{w}^\ast_{i, j}$ is used to account for residual correlations between species by assuming that correlation amongst the species can be explained by species-specific effects of a set of $q$ latent variables. More specifically, we use a factor modeling approach [@lopes2004bayesian] to account for species residual correlations in a computationally efficient manner [@hui2015model]. This approach is ideal for large groups of species, where estimating a full $I \times I$ covariance matrix would be computationally intractable (and/or require massive amounts of data). Specifically, we decompose $\text{w}^\ast_{i, j}$ into a linear combination of $q$ latent variables (i.e., factors) and their associated species-specific coefficients (i.e., factor loadings) according to @@ -707,7 +707,7 @@ The species-specific random effect $\text{w}^\ast_{i, j}$ is used to account for \text{w}^\ast_{i, j} = \bm{\lambda}_i^\top\text{w}_j, \end{equation} -where $\bm{\lambda}_i^\top$ is the $i$\text{th} row of factor loadings from an $I \times q$ loadings matrix $\bm{\Lambda}$, and $\text{w}_j$ is a $q \times 1$ vector of independent latent factors at site $j$. By settng $q << N$, we achieve dimension reduction to efficiently model communities with a large number of species [@lopes2004bayesian; @hui2015model]. The approach accounts for residual species correlations via their species-specific responses to the $q$ factors. We model each latent factor as a standard normal random variable. To ensure identifiability of the latent factors from the latent factor loadings, we fix the upper triangle of the factor loadings matrix to 0 and the diagonal elements to 1. We assign standard normal priors to the lower triangular elements of the factor loadings matrix. All other priors are identical to the multivariate GLMM previously discussed. +where $\bm{\lambda}_i^\top$ is the $i\text{th}$ row of factor loadings from an $I \times q$ loadings matrix $\bm{\Lambda}$, and $\text{w}_j$ is a $q \times 1$ vector of independent latent factors at site $j$. By settng $q << N$, we achieve dimension reduction to efficiently model communities with a large number of species [@lopes2004bayesian; @hui2015model]. The approach accounts for residual species correlations via their species-specific responses to the $q$ factors. We model each latent factor as a standard normal random variable. To ensure identifiability of the latent factors from the latent factor loadings, we fix the upper triangle of the factor loadings matrix to 0 and the diagonal elements to 1. We assign standard normal priors to the lower triangular elements of the factor loadings matrix. All other priors are identical to the multivariate GLMM previously discussed. The constraints on the factor loadings matrix ensure identifiability of the factor loadings from the latent factors, but this also results in important practical considerations when fitting these models (e.g., ordering of the species, initial values). [This vignette](https://www.jeffdoser.com/files/spoccupancy-web/articles/modelconsiderations) on the `spOccupancy` website discusses these (and other) considerations. All the advice applied to factor models fit with detection-nondetection data in `spOccupancy` also apply to factor models fit in `spAbundance`. @@ -720,7 +720,7 @@ lfMsAbund(formula, data, inits, priors, tuning, n.factors, n.batch, batch.length, accept.rate = 0.43, family = 'Poisson', n.omp.threads = 1, verbose = TRUE, n.report = 100, n.burn = round(.10 * n.batch * batch.length), n.thin = 1, n.chains = 1, - ...) + save.fitted = TRUE, ...) ``` For guidance on choosing the number of latent factors, see [this vignette](https://www.jeffdoser.com/files/spoccupancy-web/articles/modelconsiderations) on the `spOccupancy` website. Here we will fit a model with 3 latent factors. @@ -770,7 +770,7 @@ We finish up by specifying the tuning values, where now we specify the initial t ```{r} # Specify initial tuning values ms.tuning <- list(beta = 0.3, beta.star = 0.5, kappa = 0.5, w = 0.5, lambda = 0.5) -# Approx. run time: ~2.6 min +# Approx. run time: 3 min out.lf.ms <- lfMsAbund(formula = ms.formula, data = bbsData, inits = ms.inits, @@ -804,7 +804,7 @@ Notice the Rhat values are only reported for the elements of the factor loadings ## Posterior predictive checks -We again use the `ppcAbund()` function perform a posterior predictive check of our model +We again use the `ppcAbund()` function to perform a posterior predictive check of our model ```{r} ppc.out.lf.ms <- ppcAbund(out.lf.ms, fit.stat = 'freeman-tukey', group = 0) @@ -854,12 +854,12 @@ ggplot() + ## Basic model description -Our final, and most complex, GLMM that we fit in `spAbundance` is a multivariate spatially-explicit GLMM. This model is nearly identical to the latent factor multivariate GLMM, except we give a spatial structure to the latent factors instead of assuming they are independent from each other. By modeling the latent factors with a spatial structure, we will often see such a model have improved predictive performance relative to a latent factor multivariate GLMM [@doser2023joint]. This model again is an abundance-based JSDM, but now it simultaneously accounts for residual correlations between species and spatial autocorrelation. The model we present below is a direct extension of the Gaussian spatial factor NNGP model of @taylor2019spatial, where we now allow for the response to be Poisson or negative binomial (in addition to Gaussian). +Our final, and most complex, GLMM that we fit in `spAbundance` is a multivariate spatially-explicit GLMM. This model is nearly identical to the latent factor multivariate GLMM, except we give a spatial structure to the latent factors instead of assuming they are independent from each other. By modeling the latent factors with a spatial structure, we will often see such a model have improved predictive performance relative to a latent factor multivariate GLMM [@thorson2015spatial; @ovaskainen2016uncovering; @doser2023joint]. This model again is an abundance-based JSDM, but now it simultaneously accounts for residual correlations between species and spatial autocorrelation. The model we present below is a direct extension of the Gaussian spatial factor NNGP model of @taylor2019spatial, where we now allow for the response to be Poisson or negative binomial (in addition to Gaussian). The model is identical to the previously described latent factor multivariate GLMM with the exception that the latent factors are assumed to have a spatial structure to them. More specifically, our model for species-specific relative abundance is again \begin{equation} -\text{log}(\mu_{i}(\bm{s}_j)) = \bm{x}(\bm{s}_j)^\top\bm{\beta}_i + \text{w}^\ast_{i}(\bm{s}_j). +g(\mu_{i}(\bm{s}_j)) = \bm{x}(\bm{s}_j)^\top\bm{\beta}_i + \text{w}^\ast_{i}(\bm{s}_j). \end{equation} Note again that for spatial models we use the notation $\bm{s}_j$ to make it clear that the model is spatially-explicit and relies upon the coordinates ($\bm{s}_j$) at each site $j$ to estimate this spatial pattern. As with the latent factor model, we decompose $\text{w}^\ast_i(\bm{s}_j)$ into a linear combination of $q$ latent variables (i.e., factors) and their associated species-specific coefficients (i.e., factor loadings) according to @@ -868,7 +868,7 @@ Note again that for spatial models we use the notation $\bm{s}_j$ to make it cle \text{w}^\ast_{i}(\bm{s}_j) = \bm{\lambda}_i^\top\textbf{w}(\bm{s}_j). \end{equation} -Now, instead of modeling $\textbf{w}(\bm{s}_j)$ as independent normal latent variables, we assume $\textbf{w}(\bm{s}_j)$ arise from spatial processes, allowing us to account for residual correlation in species-specific abundance. More specifically, each spatial factor $\textbf{w}_r$ for each $r = 1, \dots, q$ is modeled using a Nearest Neighbor Gaussian Process [@datta2016hierarchical], i.e., +Now, instead of modeling $\textbf{w}(\bm{s}_j)$ as independent normal latent variables, we assume $\textbf{w}(\bm{s}_j)$ arise from spatial processes, allowing us to account for both residual species correlations and residual spatial autocorrelation in species-specific abundance. More specifically, each spatial factor $\textbf{w}_r$ for each $r = 1, \dots, q$ is modeled using a Nearest Neighbor Gaussian Process [@datta2016hierarchical], i.e., \begin{equation} \textbf{w}_r \sim \text{Normal}(\bm{0}, \tilde{\bm{C}}_r(\bm{\theta}_r)), @@ -878,7 +878,7 @@ where $\tilde{\bm{C}}_r(\bm{\theta}_r)$ is the NNGP-derived correlation matrix f We assume the same priors and identifiability constraints as the latent factor GLMM. We assign a uniform prior for the spatial decay parameter, $\phi_r$, and the spatial smoothness parameters, $\nu_r$, if using a Matern correlation function. -Notice that this spatial factor modeling approach is the only approach we implement in `spAbundance` for modeling multi-species datasets while accounting for spatial autocorrelation. While we could envision fitting a separate spatial random effect for each species in our multi-species data set (as we implemented in `spOccupancy` in the `spMsPGOcc` function), we prefer (and recommend) using the spatial factor modeling approach as (1) it is far more computationally efficient than fitting a separate spatial effect for each species; (2) it explicitly acknowledges dependence between species; (3) estimating a separate spatial effect for each species would be very difficult to do with multiple rare species in the data set; and (4) even if the species are independent (i.e., there are no residual correlations), the spatial factor modeling approach provides extremely similarly to a model with a separate spatial process for each species [@doser2023joint], while still being substantially faster. +Notice that this spatial factor modeling approach is the only approach we implement in `spAbundance` for modeling multi-species datasets while accounting for spatial autocorrelation. While we could envision fitting a separate spatial random effect for each species in our multi-species data set (as we implemented in `spOccupancy` in the `spMsPGOcc` function), we prefer (and recommend) using the spatial factor modeling approach as (1) it is far more computationally efficient than fitting a separate spatial effect for each species; (2) it explicitly acknowledges dependence between species; (3) estimating a separate spatial effect for each species would be very difficult to do with multiple rare species in the data set; and (4) even if the species are independent (i.e., there are no residual correlations), the spatial factor modeling approach performs extremely similarly to a model with a separate spatial process for each species [@doser2023joint], while still being substantially faster. ## Fitting spatial factor multivariate GLMMs with `sfMsAbund()` @@ -891,7 +891,7 @@ sfMsAbund(formula, data, inits, priors, n.batch, batch.length, accept.rate = 0.43, family = 'Poisson', n.omp.threads = 1, verbose = TRUE, n.report = 100, n.burn = round(.10 * n.batch * batch.length), n.thin = 1, n.chains = 1, - ...) + save.fitted = TRUE, ...) ``` We will again fit the model with three spatial factors, and will use the same covariates/random effects on abundance and detection as we have done throughout the vignette @@ -903,7 +903,7 @@ ms.formula <- ~ scale(bio2) + scale(bio8) + scale(bio18) + scale(forest) + (1 | obs) ``` -Initial values are identical to what we saw with `lfMsAbund()`, but we will also specify the initial values for the spatial decay parameters for each spatial factor (we use an exponential correlation model so we do not need to specify any initial values for the `nu`, the spatial smoothness parameter used when `cov.model = 'matern'`. +Initial values are identical to what we saw with `lfMsAbund()`, but we will also specify the initial values for the spatial decay parameters for each spatial factor (we use an exponential correlation model so we do not need to specify any initial values for `nu`, the spatial smoothness parameter used when `cov.model = 'matern'`. ```{r} # Number of species diff --git a/vignettes/glmm.html b/vignettes/glmm.html index c71cbf2..11202f2 100644 --- a/vignettes/glmm.html +++ b/vignettes/glmm.html @@ -416,12 +416,12 @@

Introduction

  1. Univariate GLMM using abund().
  2. Spatial univariate GLMM using spAbund().
  3. -
  4. Multivariate GLMM using `msAbund().
  5. +
  6. Multivariate GLMM using msAbund().
  7. Multivariate GLMM with residual correlations using lfMsAbund().
  8. -
  9. Spatial multivariate GLMM with residual correlations using sfMsNMix().
  10. +
  11. Spatial multivariate GLMM with residual correlations using sfMsAbund().
-

In this vignette we are only describing spAbundance functionality to fit generalized linear (mixed) models, with separate vignettes on fitting hierarchical distance sampling models and N-mixture models. We fit all models in a Bayesian framework using custom Markov chain Monte Carlo (MCMC) samplers written in C/C++ and called through R’s foreign language interface. Here we will provide a brief description of each model, with full statistical details provided in a separate vignette. As with all model types in spAbundance, we will show how to perform posterior predictive checks as a Goodness of Fit assessment, model comparison and assessment using the Widely Applicable Information Criterion (WAIC), and out-of-sample predictions using standard R helper functions (e.g., predict()). Note that syntax of N-mixture models in spAbundance closely follows syntax for fitting occupancy models in spOccupancy (Doser et al. 2022), and that this vignette closely follows the documentation on the spOccupancy website.

-

Note that when we discussing GLMMs, we use the terms “univariate” and “multivariate” instead of “single-species” and “multi-species” as we do when discussing N-mixture models, hierarchical distance sampling models, and occupancy models. We use these potentially more confusing terms to highlight the fact that the GLMM functionality in spAbundance is not restricted to working only with data on counts of “species”. Rather, the GLMM functionality in spAbundance can be used to model any sort of response that you could imagine fitting in a GLMM. Despite this, we will often use the term “species” when referring to the different response variables that we can model using the GLMM functionality in spAbundance, since modeling patterns in abundance is the primary purpose of the package.

+

In this vignette we are only describing spAbundance functionality to fit generalized linear (mixed) models (GLMMs), with separate vignettes on fitting hierarchical distance sampling models and N-mixture models. We fit all models in a Bayesian framework using custom Markov chain Monte Carlo (MCMC) samplers written in C/C++ and called through R’s foreign language interface. Here we will provide a brief description of each model, with full statistical details provided in a separate vignette. As with all model types in spAbundance, we will show how to perform posterior predictive checks as a Goodness of Fit assessment, model comparison/selection using the Widely Applicable Information Criterion (WAIC), and out-of-sample predictions using standard R helper functions (e.g., predict()). Note that syntax of GLMMs in spAbundance closely follows syntax for fitting occupancy models in spOccupancy (Doser et al. 2022), and that this vignette closely follows the documentation on the spOccupancy website.

+

Note that when we discuss GLMMs, we use the terms “univariate” and “multivariate” instead of “single-species” and “multi-species” as we do when discussing N-mixture models, hierarchical distance sampling models, and occupancy models. We use these potentially less-straightforward terms to highlight the fact that the GLMM functionality in spAbundance is not restricted to working only with data on counts of “species”. Rather, the GLMM functionality in spAbundance can be used to model any sort of response that you could imagine fitting in a GLMM. Despite this, we will often use the term “species” when referring to the different response variables that we can model using the GLMM functionality in spAbundance, since modeling patterns in abundance is the primary purpose of the package.

To get started, we load the spAbundance package, as well as the coda package, which we will use for some MCMC summary and diagnostics. We will also use the stars and ggplot2 packages to create some basic plots of our results. We then set a seed so you can reproduce the same results as we do.

library(spAbundance)
 library(coda)
@@ -430,7 +430,7 @@ 

Introduction

set.seed(111)

Example data set: Six warbler species from the Breeding Bird Survey

-

As an example data set throughout this vignette, we will use count data from the North American Breeding Bird Survey collected in 2018 in Pennsylvania, USA (Pardieck et al. 2020). Briefly, these data consist of the total number of individuals for six bird species (American Redstart, Blackburnian Warbler, Black-throated Blue Warbler, Black-throated Green Warbler, Hooded Warbler, Magnolia Warbler) at 95 40km routes across Pennsylvania. Additional details on the data set can be found on the USGS Science Base website. The data are included as part of the spAbundance package and are loaded with data(bbsData). See the manual page using help(bbsData) for more information.

+

As an example data set throughout this vignette, we will use count data from the North American Breeding Bird Survey collected in 2018 in Pennsylvania, USA (Pardieck et al. 2020). Briefly, these data consist of the total number of individuals for six bird species (American Redstart, Blackburnian Warbler, Black-throated Blue Warbler, Black-throated Green Warbler, Hooded Warbler, Magnolia Warbler) at 95 routes (about 40km long) across Pennsylvania. Additional details on the data set can be found on the USGS Science Base website. The data are included as part of the spAbundance package and are loaded with data(bbsData). See the manual page using help(bbsData) for more information.

# Load the data set.
 data(bbsData)
 # Get an overview of what's in the data
@@ -453,7 +453,7 @@ 

Example data set: Six warbler species from the Breeding Bird Survey

..- attr(*, "dimnames")=List of 2 .. ..$ : NULL .. ..$ : chr [1:2] "X" "Y"
-

The object bbsData is a list that is structured in the format needed for multivariate GLMMs in spAbundance. Specifically, bbsData is a list comprised of the count data for the 5 species (y), covariates to include on either abundance and/or detection (covs), and the spatial coordinates for each site (coords). Note the coordinates are only required for spatailly-explicit GLMMs. The matrix y consists of the count data for all 6 species in the data set, where the rows correspond to species and the columns correspond to sites. For single-species GLMMs, we will only use data for one species (Hooded Warbler; HOWA), so we next subset the bbsData list to only include data from HOWA in a new object data.HOWA.

+

The object bbsData is a list that is structured in the format needed for multivariate GLMMs in spAbundance. Specifically, bbsData is a list comprised of the count data for the six species (y), covariates (covs), and the spatial coordinates for each site (coords). Note the coordinates are only required for spatailly-explicit GLMMs. The matrix y consists of the count data for all 6 species in the data set, where the rows correspond to species and the columns correspond to sites. For single-species GLMMs, we will only use data for one species (Hooded Warbler; HOWA), so we next subset the bbsData list to only include data from HOWA in a new object data.HOWA.

sp.names <- dimnames(bbsData$y)[[1]]
 data.HOWA <- bbsData
 data.HOWA$y <- data.HOWA$y[which(sp.names == "HOWA"), ]
@@ -478,7 +478,7 @@ 

Univariate GLMMs

g(\mu_j) = \boldsymbol{x}_j^\top\boldsymbol{\beta}, \end{equation}\]

where \(\boldsymbol{\beta}\) is a vector of regression coefficients for a set of covariates \(\boldsymbol{x}_j\) (including an intercept). When working with positive integer counts and using the Poisson or negative binomial distributions, we use a log link function. For Gaussian data, we use the identity link function, such that the right hand side of the previous equation simplifies to \(\mu_j\) and covariates are directly related to the mean. Note that while not shown, unstructured random intercepts and slopes can be included in the equation for expected abundance. This may for instance be required for accommodating some sorts of “blocks”, such as when sites are nested in a number of different regions.

-

To complete the Bayesian specification of the model, we assign Gaussian priors for the regression coefficients (\(\boldsymbol{\beta}\)), a uniform prior for the negative binomial dispersion parameter (when applicable), and an inverse-Gamma prior for the Gaussian variance parameter (when applicable).

+

To complete the Bayesian specification of the model, we assign Gaussian priors for the regression coefficients (\(\boldsymbol{\beta}\)), a uniform prior for the negative binomial dispersion parameter \(\kappa\) (when applicable), and an inverse-Gamma prior for the Gaussian variance parameter \(\tau^2\) (when applicable).

Fitting univariate GLMMs with abund()

The abund() function fits univariate abundance models. abund has the following arguments.

@@ -486,8 +486,8 @@

Fitting univariate GLMMs with abund()

n.batch, batch.length, accept.rate = 0.43, family = 'Poisson', n.omp.threads = 1, verbose = TRUE, n.report = 100, n.burn = round(.10 * n.batch * batch.length), n.thin = 1, - n.chains = 1, ...)
-

The first argument formula uses standard R model syntax to denote the covariates to be included in the model. Only the right hand side of the formula is included. Random intercepts and slopes can be included in the model using lme4 syntax (Bates et al. 2015). For example, to include a random intercept for different observers the model to account for observational variability, we would include (1 | observer) in formula, where observer indicates the specific observer for each data point. The names of variables given in the formulas should correspond to those found in data, which is a list consisting of the following tags: y (count data) and covs (covariates). y is the vector of count data with length equal to the number of sites in the data set and covs is a matrix or data frame with site-specific covariate values.

+ n.chains = 1, save.fitted = TRUE, ...)
+

The first argument formula uses standard R model syntax to denote the covariates to be included in the model. Only the right hand side of the formula is included. Random intercepts and slopes can be included in the model using lme4 syntax (Bates et al. 2015). For example, to include a random intercept for different observers in the model to account for observational variability, we would include (1 | observer) in formula, where observer indicates the specific observer for each data point. The names of variables given in the formulas should correspond to those found in data, which is a list consisting of the following tags: y (count data) and covs (covariates). y is the vector of count data with length equal to the number of sites in the data set and covs is a matrix or data frame with site-specific covariate values. Note the tag offset can also be specified to include an offset in the model when using a negative binomial or Poisson distribution.

The data.HOWA list is already in the required format for use with the abund() function. Here we will model abundance as a function of three “bioclim” bioclimatic variables and the proportion of forest and developed land within 5km of the route starting location. We will also include a few variables that we believe may relate to observational variability (e.g., imperfect detection) in the count data. Including such variables in a GLMM is a common approach for modeling relative abundance, particularly when using BBS data (Link and Sauer 2002). Here we include linear and quadratic effects of the day of year, a linear effect of time of day, and a random effect of observer, all of which we think may influence relative abundance. We standardize all continuous covariates by using the scale() function in our model specification (note that standardizing continuous covariates is highly recommended as it helps aid convergence of the underlying MCMC algorithms):

howa.formula <- ~ scale(bio2) + scale(bio8) + scale(bio18) + scale(forest) + 
                   scale(devel) + scale(day) + I(scale(day)^2) + scale(tod) + 
@@ -495,7 +495,7 @@ 

Fitting univariate GLMMs with abund()

The family argument is used to specify the specific family we will use to model the data. Valid options are Poisson, NB (negative binomial), and Gaussian. Here we are working with count data, and so both the Poisson and negative binomial distributions make sense. We will start working with a Poisson distribution, but later we will compare this to a negative binomial distribution to determine the amount of overdispersion in the data.

howa.family <- 'Poisson'

Next, we specify the initial values for the MCMC sampler in inits. abund() (and all other spAbundance model fitting functions) will set initial values by default, but here we will do this explicitly, since in more complicated cases setting initial values close to the presumed solutions can be vital for success of an MCMC-based analysis (for instance, this is the case when fitting distance sampling models in spAbundance). However, for all models described in this vignette (in particular the non-spatial models), choice of the initial values is largely inconsequential, with the exception being that specifying initial values close to the presumed solutions can decrease the amount of samples you need to run to arrive at convergence of the MCMC chains. Thus, when first running a model in spAbundance, we recommend fitting the model using the default initial values that spAbundance provides. The initial values that spAbundance chooses will be reported to the screen when setting verbose = TRUE. After running the model for a reasonable period, if you find the chains are taking a long time to reach convergence, you then may wish to set the initial values to the mean estimates of the parameters from the initial model fit, as this will likely help reduce the amount of time you need to run the model.

-

The default initial values for regression coefficients (including the intercepts) are random values from a standard normal distribution. When fitting a GLMM with a negative binomial distribution, the initial value for the overdispersion parameter is drawn from the prior distribution. When using a Gaussian distribution, the initial value for the variance parameter is a random value between 0.5 and 10. Initial values are specified in a list with the following tags: beta (intercept and regression coefficients), kappa (negative binomial overdispersion parameter), tau.sq (Gaussian variance parameter). For the regression coefficients beta, the initial values are passed either as a vector of length equal to the number of estimated parameters (including an intercept, and in the order specified in the model formula), or as a single value if setting the same initial value for all parameters (including the intercept). Below we take the latter approach. For the negative binomial overdispersion parameter and Gaussian variance parameter, the initial value is simply a single numeric value. For any random effects that are included in the model, we can also specify the initial values for the random effect variances (sigma.sq.mu). By default, these will be drawn as random values between 0.5 and 10. Here we specify the initial value for the random effect to 1.

+

The default initial values for regression coefficients (including the intercepts) are random values from a standard normal distribution. When fitting a GLMM with a negative binomial distribution, the initial value for the overdispersion parameter is drawn from the prior distribution. When using a Gaussian distribution, the initial value for the variance parameter is a random value between 0.5 and 10. Initial values are specified in a list with the following tags: beta (intercept and regression coefficients), kappa (negative binomial overdispersion parameter), tau.sq (Gaussian variance parameter). For the regression coefficients beta, the initial values are passed either as a vector of length equal to the number of estimated parameters (including an intercept, and in the order specified in the model formula), or as a single value if setting the same initial value for all parameters (including the intercept). Below we take the latter approach. For the negative binomial overdispersion parameter and Gaussian variance parameter, the initial value is simply a single numeric value. For any random effects that are included in the model, we can also specify the initial values for the random effect variances (sigma.sq.mu). By default, these will be drawn as random values between 0.5 and 10. Here we specify the initial value for the random effect variance to 1.

inits <- list(beta = 0, kappa = 1, sigma.sq.mu = 1)

We next specify the priors for the regression coefficients, as well as the negative binomial overdispersion parameter. We assume normal priors for regression coefficients. These priors are specified in a list with tags beta.normal (including intercepts). Each list element is then itself a list, with the first element of the list consisting of the hypermeans for each coefficient and the second element of the list consisting of the hypervariances for each coefficient. Alternatively, the hypermeans and hypervariances can be specified as a single value if the same prior is used for all regression coefficients. By default, spAbundance will set the hypermeans to 0 and the hypervariances to 100. For the negative binomial overdispersion parameter, we will use a uniform prior. This prior is specified as a tag in the prior list called kappa.unif, which should be a vector with two values indicating the lower and upper bound of the uniform distribution. The default prior is to set the lower bound to 0 and the upper bound to 100. Recall that lower values of kappa indicate substantial overdispersion and high values of kappa indicate minimal overdispersion. If there is little support for overdispersion when fitting a negative binomial model, we will likely see the estimates of kappa be close to the upper bound of the uniform prior distribution. For the default prior distribution, if the estimates of kappa are very close to 100, this indicates little support for overdispersion in the model, and we can likely switch to using a Poisson distribution (which would also likely be favored by model comparison approaches). For models with random effects, we can also specify the prior for the random effect variance parameter (sigma.sq.mu). We assume inverse-Gamma priors for these variance parameters and specify them with the tags sigma.sq.mu.ig. These priors are set as a list with two components, where the first element is the shape parameter and the second element is the scale parameter. The shape and scale parameters can be specified as a single value or as vectors with length equal to the number of random effects included in the model. The default prior distribution for random effect variances is 0.1 for both the shape and scale parameters. When fitting GLMMs with a Gaussian distribution, the tag tau.sq.ig is used to specify the inverse-Gamma prior for the variance parameter of the Gaussian distribution. The prior is specified as a vector of length two, with the first element being the inverse-Gamma shape parameter and second element being the inverse-Gamma scale parameter. By default, these values are both set to 0.01. Below we use default priors for all parameters, but specify them explicitly for clarity.

priors <- list(beta.normal = list(mean = 0, var = 100),
@@ -507,7 +507,7 @@ 

Fitting univariate GLMMs with abund()

# Total number of MCMC samples per chain batch.length * n.batch
[1] 20000
-

Importantly, we also need to specify a target acceptance rate and initial tuning parameters for the regression coefficients (and the negative binomial overdispersion parameter and any latent random effects if applicable). These are both features of the adaptive algorithm we use to sample these parameters. In this adaptive Metropolis-Hastings algorithm, we propose new values for the parameters from some proposal distribution, compare them to our previous values, and use a statistical algorithm to determine if we should accept the new proposed value or keep the old one. The accept.rate argument specifies the ideal proportion of times we will accept the newly proposed values for these parameters. Roberts and Rosenthal (2009) show that if we accept new values around 43% of the time, then this will lead to optimal mixing and convergence of the MCMC chains. Following these recommendations, we should strive for an algorithm that accepts new values about 43% of the time. Thus, we recommend setting accept.rate = 0.43 unless you have a specific reason not to (this is the default value). The values specified in the tuning argument help control the initial values we will propose for the abundance/detection coefficients and the negative binomial overdispersion parameter. These values are supplied as input in the form of a list with tags beta and kappa. The initial tuning value can be any value greater than 0, but we generally recommend starting the value out around 0.5. These tuning values can also be thought of as tuning “variances”, as it is these values that control the variance of the distribution we use to generate newly proposed values for the parameters we are trying to estimate with our MCMC algorithm. In short, the new values that we propose for the parameters beta, and kappa come from a normal distribution with mean equal to the current value for the given parameter and the variance equal to the tuning parameter. Thus, the smaller this tuning parameter/variance is, the closer our proposed values will be to the current value, and vise versa for large values of the tuning parameter. The “ideal” value of the tuning variance will depend on the data set, the parameter, and how much uncertainty there is in the estimate of the parameter. This initial tuning value that we supply is the first tuning variance that will be used for the given parameter, and our adaptive algorithm will adjust this tuning parameter after each batch to yield acceptance rates of newly proposed values that are close to our target acceptance rate that we specified in the accept.rate argument. Information on the acceptance rates for a few of the parameters in your model will be displayed when setting verbose = TRUE. After some initial runs of the model, if you notice the final acceptance rate is much larger or smaller than the target acceptance rate (accept.rate), you can then change the initial tuning value to get closer to the target rate. While use of this algorithm requires us to specify more arguments than if we didn’t “adaptively tune” our proposal variances, this leads to much shorter run times compared to a more simplistic approach where we do not have an “adaptive” sampling approach, and it should thus save us time in the long haul when waiting for these models to run. For our example here, we set the initial tuning values to 0.5 for beta and kappa. For models with random effects in either the abundance or detection portions of the model, we also need to specify tuning parameters for the latent random effect values (beta.star). We similarly set these to 0.5. Note that for Gaussian GLMMs, we use much more efficient algorithms (Gibbs updates).

+

Importantly, we also need to specify a target acceptance rate and initial tuning parameters for the regression coefficients (and the negative binomial overdispersion parameter and any latent random effects if applicable). These are both features of the adaptive algorithm we use to sample these parameters. In this adaptive Metropolis-Hastings algorithm, we propose new values for the parameters from some proposal distribution, compare them to our previous values, and use a statistical algorithm to determine if we should accept the new proposed value or keep the old one. The accept.rate argument specifies the ideal proportion of times we will accept the newly proposed values for these parameters. Roberts and Rosenthal (2009) show that if we accept new values around 43% of the time, then this will lead to optimal mixing and convergence of the MCMC chains. Following these recommendations, we should strive for an algorithm that accepts new values about 43% of the time. Thus, we recommend setting accept.rate = 0.43 unless you have a specific reason not to (this is the default value). The values specified in the tuning argument help control the initial values we will propose for the abundance/detection coefficients and the negative binomial overdispersion parameter. These values are supplied as input in the form of a list with tags beta and kappa. The initial tuning value can be any value greater than 0, but we generally recommend starting the value out around 0.5. These tuning values can also be thought of as tuning “variances”, as it is these values that control the variance of the distribution we use to generate newly proposed values for the parameters we are trying to estimate with our MCMC algorithm. In short, the new values that we propose for the parameters beta and kappa come from a normal distribution with mean equal to the current value for the given parameter and the variance equal to the tuning parameter (with a transformation for kappa since it can only take positive values). Thus, the smaller this tuning parameter/variance is, the closer our proposed values will be to the current value, and vise versa for large values of the tuning parameter. The “ideal” value of the tuning variance will depend on the data set, the parameter, and how much uncertainty there is in the estimate of the parameter. This initial tuning value that we supply is the first tuning variance that will be used for the given parameter, and our adaptive algorithm will adjust this tuning parameter after each batch to yield acceptance rates of newly proposed values that are close to our target acceptance rate that we specified in the accept.rate argument. Information on the acceptance rates for a few of the parameters in your model will be displayed when setting verbose = TRUE. After some initial runs of the model, if you notice the final acceptance rate is much larger or smaller than the target acceptance rate (accept.rate), you can then change the initial tuning value to get closer to the target rate. While use of this algorithm requires us to specify more arguments than if we didn’t “adaptively tune” our proposal variances, this leads to much shorter run times compared to a more simplistic approach where we do not have an “adaptive” sampling approach, and it should thus save us time in the long haul when waiting for these models to run. For our example here, we set the initial tuning values to 0.5 for beta and kappa. For models with random effects in either the abundance or detection portions of the model, we also need to specify tuning parameters for the latent random effect values (beta.star). We similarly set these to 0.5. Note that for Gaussian GLMMs, we use much more efficient algorithms (Gibbs updates).

tuning <- list(beta = 0.5, kappa = 0.5, beta.star = 0.5)
 # accept.rate = 0.43 by default, so we do not specify it.

We also need to specify the length of burn-in (n.burn), the rate at which we want to thin the posterior samples (n.thin), and the number of MCMC chains to run (n.chains). Note that currently spAbundance runs multiple chains sequentially and does not allow chains to be run simultaneously in parallel across multiple threads, which is something we hope to implement in future package development. Instead, we allow for within-chain parallelization using the n.omp.threads argument. We can set n.omp.threads to a number greater than 1 and smaller than the number of threads on the computer you are using. Generally, setting n.omp.threads > 1 will not result in decreased run times for non-spatial models in spAbundance, but can substantially decrease run time when fitting spatial models (Finley, Datta, and Banerjee 2020). Here we set n.omp.threads = 1.

@@ -686,7 +686,7 @@

Fitting univariate GLMMs with abund()

Thinning Rate: 10 Number of Chains: 3 Total Posterior Samples: 3000 -Run Time (min): 0.247 +Run Time (min): 0.3256 Abundance (log scale): Mean SD 2.5% 50% 97.5% Rhat ESS @@ -705,7 +705,7 @@

Fitting univariate GLMMs with abund()

(Intercept)-obs 3.0363 1.131 1.4259 2.8583 5.7909 1.0341 267

We see the variable with the largest magnitude effect is time of day with a strong positive effect. Since we believe this variable may relate to the probability of detecting HOWA at a location (or the probability HOWA is singing and thus available for detection), this suggests a larger number of HOWA are counted later in the morning relative to early in the morning. There is also a strong positive relationship with forest cover, suggesting larger HOWA relative abundance in more forested areas.

The model summary also provides information on convergence of the MCMC chains in the form of the Gelman-Rubin diagnostic (Brooks and Gelman 1998) and the effective sample size (ESS) of the posterior samples. Here we find all Rhat values are less than 1.1 and the ESS values are decently large for all parameters.

-

We can use the plot() function to generate a simple trace plot of the MCMC chains to provide additional confidence in the convergence (or non-convergence) of the model. The plotting functionality for each model type in spAbundance takes three arguments: x (the resulting object from fitting the model), param (the parameter name that you want to display), and density (a logical value indicating whether to also generate a density plot in addition to the traceplot). To see the parameter names available to use with plot() for a given model type, you can look at the manual page for the function, which for models generated from plot.abund() can be accessed with ?plot.abund.

+

We can use the plot() function to generate a simple trace plot of the MCMC chains to provide additional confidence in the convergence (or non-convergence) of the model. The plotting functionality for each model type in spAbundance takes three arguments: x (the resulting object from fitting the model), param (the parameter name that you want to display), and density (a logical value indicating whether to also generate a density plot in addition to the traceplot). To see the parameter names available to use with plot() for a given model type, you can look at the manual page for the function, which for models generated from abund() can be accessed with ?plot.abund.

# Regression coefficients
 plot(out, param = 'beta', density = FALSE)

@@ -727,7 +727,7 @@

Posterior predictive checks

Bayesian p-value: 0.001 Fit statistic: freeman-tukey -

Here our Bayesian p-value is very close to 0, indicating that the current model does not adequately represent the variability in the observed data. We can further look at a plot of the fitted values versus the true to get a better sense of how our model performed.

+

Here our Bayesian p-value is very close to 0, indicating that the current model does not adequately represent the variability in the observed data. We can further look at a plot of the fitted values versus the trues to get a better sense of how our model performed.

# Extract fitted values
 y.rep.samples <- fitted(out)
 # Get means of fitted values
@@ -736,7 +736,7 @@ 

Posterior predictive checks

plot(data.HOWA$y, y.rep.means, pch = 19, xlab = 'True', ylab = 'Fitted') abline(0, 1)

-

Looking at this plot, we see our model actually does a decent job of identifying locations with high relative abundance, but there are a few sites with low observed relative abundance for which the model seems to be overestimating abundance (i.e., values points the left of 5 on the x-axis in the above plot). This is likely what is causing the low Bayesian p-value.

+

Looking at this plot, we see our model actually does a decent job of identifying locations with high relative abundance, but there are a few sites with low observed relative abundance for which the model seems to be overestimating abundance (i.e., points to the left of 5 on the x-axis in the above plot). This is likely what is causing the low Bayesian p-value.

Model selection using WAIC

@@ -802,7 +802,7 @@

Prediction

Below we form the design matrix and predict relative abundance across the grid.

X.0 <- cbind(1, bio2.pred, bio8.pred, bio18.pred, forest.pred, 
-         devel.pred, day.pred, day.pred^2, tod.pred)
+             devel.pred, day.pred, day.pred^2, tod.pred)
 colnames(X.0) <- c('(Intercept)', 'scale(bio2)', 'scale(bio8)', 'scale(bio18)', 
                    'scale(forest)', 'scale(devel)', 'scale(day)', 
                    'I(scale(day)^2)', 'scale(tod)')
@@ -848,9 +848,9 @@ 

Prediction

Univariate spatial GLMMs

Basic model description

-

When working across large spatial domains, accounting for residual spatial autocorrelation in species distributions can often improve predictive performance of a model, leading to more accurate predictions of species abundance patterns (Guélat and Kéry 2018). We here extend the previous GLMM to incorporate a spatial random effect that accounts for unexplained spatial variation in species abundance across a region of interest (“Model-Based Geostatistics” 1998). Let \(\boldsymbol{s}_j\) denote the geographical coordinates of site \(j\) for \(j = 1, \dots, J\). In all spatially-explicit models, we include \(\boldsymbol{s}_j\) directly in the notation of spatially-indexed variables to indicate the model is spatially-explicit. More specifically, the expected abundance at site \(j\) with coordinates \(\boldsymbol{s}_j\), \(\mu(\boldsymbol{s}_j)\), now takes the form

+

When working across large spatial domains, accounting for residual spatial autocorrelation in species distributions can often improve predictive performance of a model, leading to more accurate predictions of species abundance patterns (Guélat and Kéry 2018). We here extend the previous GLMM to incorporate a spatial random effect that accounts for unexplained spatial variation in species abundance across a region of interest (Diggle 1998). Let \(\boldsymbol{s}_j\) denote the geographical coordinates of site \(j\) for \(j = 1, \dots, J\). In all spatially-explicit models, we include \(\boldsymbol{s}_j\) directly in the notation of spatially-indexed variables to indicate the model is spatially-explicit. More specifically, the expected abundance at site \(j\) with coordinates \(\boldsymbol{s}_j\), \(\mu(\boldsymbol{s}_j)\), now takes the form

\[\begin{equation} -\text{log}(\mu(\boldsymbol{s}_j)) = \boldsymbol{x}(\boldsymbol{s}_j)^{\top}\boldsymbol{\beta} + \text{w}(\boldsymbol{s}_j), +g(\mu(\boldsymbol{s}_j)) = \boldsymbol{x}(\boldsymbol{s}_j)^{\top}\boldsymbol{\beta} + \text{w}(\boldsymbol{s}_j), \end{equation}\]

where \(\text{w}(\boldsymbol{s}_j)\) is a spatial random effect modeled with a Nearest Neighbor Gaussian Process (NNGP; Datta et al. (2016)). More specifically, we have

\[\begin{equation} @@ -867,7 +867,7 @@

Fitting univariate spatial GLMMs with spAbund()

n.batch, batch.length, accept.rate = 0.43, family = 'Poisson', n.omp.threads = 1, verbose = TRUE, n.report = 100, n.burn = round(.10 * n.batch * batch.length), n.thin = 1, - n.chains = 1, ...)
+ n.chains = 1, save.fitted = TRUE, ...)

The arguments to spAbund() are very similar to those we saw with abund(), with a few additional components. The formula and data arguments are the same as before, with random slopes and intercepts allowed in both the abundance and detection models. Notice the coords matrix in the data.HOWA list of data. We did not use this for abund(), but specifying the spatial coordinates in data is necessary for all spatially explicit models in spAbundance.

howa.formula <- ~ scale(bio2) + scale(bio8) + scale(bio18) + scale(forest) +
                   scale(devel) + scale(day) + I(scale(day)^2) + scale(tod) +
@@ -926,24 +926,25 @@ 

Fitting univariate spatial GLMMs with spAbund()

n.burn <- 10000
 n.thin <- 10
 n.chains <- 3
-out.sp <- spAbund(formula = howa.formula,
-                  data = data.HOWA,
-                  inits = inits,
-                  priors = priors,
-                  n.batch = n.batch,
-                  batch.length = batch.length,
-                  tuning = tuning,
-                  cov.model = cov.model,
-                  NNGP = NNGP,
-                  n.neighbors = n.neighbors,
-                  search.type = search.type,
-                  n.omp.threads = n.omp.threads,
-                  n.report = n.report,
-                  family = 'Poisson',
-                  verbose = TRUE,
-                  n.burn = n.burn,
-                  n.thin = n.thin,
-                  n.chains = n.chains)
+# Approx run time: 1.5 minutes +out.sp <- spAbund(formula = howa.formula, + data = data.HOWA, + inits = inits, + priors = priors, + n.batch = n.batch, + batch.length = batch.length, + tuning = tuning, + cov.model = cov.model, + NNGP = NNGP, + n.neighbors = n.neighbors, + search.type = search.type, + n.omp.threads = n.omp.threads, + n.report = n.report, + family = 'Poisson', + verbose = TRUE, + n.burn = n.burn, + n.thin = n.thin, + n.chains = n.chains)
----------------------------------------
     Preparing to run the model
 ----------------------------------------
@@ -1118,7 +1119,7 @@ 

Fitting univariate spatial GLMMs with spAbund()

Thinning Rate: 10 Number of Chains: 3 Total Posterior Samples: 3000 -Run Time (min): 1.5775 +Run Time (min): 1.7418 Abundance (log scale): Mean SD 2.5% 50% 97.5% Rhat ESS @@ -1143,7 +1144,7 @@

Fitting univariate spatial GLMMs with spAbund()

Posterior predictive checks

-

Posterior predictive checks proceed exactly before using the ppcAbund() function.

+

Posterior predictive checks proceed exactly as before using the ppcAbund() function.

ppc.out.sp <- ppcAbund(out.sp, fit.stat = 'freeman-tukey', group = 0)
 summary(ppc.out.sp)

@@ -1244,14 +1245,14 @@ 

Multivariate GLMMs

Basic model description

Now consider the case where count data, \(y_{i, j}\), are collected for multiple species \(i = 1, \dots, I\) (or some other set of multiple variables) at each survey location \(j\). We model \(y_{i, j}\) analogous to the univariate GLMM, with expected abundance now varying by species and site according to

\[\begin{equation}\label{mu-Abund} - \text{log}(\mu_{i, j}) = \boldsymbol{x}_j^\top\boldsymbol{\beta}_i, + g(\mu_{i, j}) = \boldsymbol{x}_j^\top\boldsymbol{\beta}_i, \end{equation}\]

-

where \(\boldsymbol{\beta}_i\) are the species-specific effects of covariates \(\boldsymbol{x}_j\) (including an intercept) . When \(y_{i, j}\) is modeled using a negative binomial distribution, we estimate a separate dispersion parameter \(\kappa_i\) for each species. We model \(\boldsymbol{\beta}_i\) as random effects arising from a common, community-level normal distribution, which often leads to increased precision of species-specific effects compared to single-species models. For example, the species-specific intercept \(\beta0_i\) is modeled according to

+

where \(\boldsymbol{\beta}_i\) are the species-specific effects of covariates \(\boldsymbol{x}_j\) (including an intercept) . As in univariate models, for all multivariate GLMMs in spAbundance we use a log link function when modeling integer count data with the Poisson or negative binomial distributions, while for Gaussian data, we use the identity link function. When \(y_{i, j}\) is modeled using a negative binomial distribution, we estimate a separate dispersion parameter \(\kappa_i\) for each species. We model \(\boldsymbol{\beta}_i\) as random effects arising from a common, community-level normal distribution, which often leads to increased precision of species-specific effects compared to single-species models. For example, the species-specific intercept \(\beta_{0, i}\) is modeled according to

\[\begin{equation}\label{betaComm} - \beta0_i \sim \text{Normal}(\mu_{\beta0}, \tau^2_{\beta0}), + \beta_{0, i} \sim \text{Normal}(\mu_{\beta_0}, \tau^2_{\beta_0}), \end{equation}\]

-

where \(\mu_{\beta0}\) is the community-level abundance intercept, and \(\tau^2_{\beta0}\) is the variance of the intercept across all \(I\) species.

-

We assign normal priors to the community-level (\(\boldsymbol{\mu}_{\beta}\)) mean parameters and inverse-Gamma priors to the community-level variance parameters (\(\boldsymbol{\tau^2}_{\beta}\)).

+

where \(\mu_{\beta_0}\) is the community-level abundance intercept, and \(\tau^2_{\beta_0}\) is the variance of the intercept across all \(I\) species.

+

We assign normal priors to the community-level (\(\boldsymbol{\mu}_{\beta}\)) mean parameters and inverse-Gamma priors to the community-level variance parameters (\(\boldsymbol{\tau^2}_{\beta}\)). We assign independent uniform priors to the species specific dispersion parameters \(\kappa_i\) when using a negative binomial distribution. When fitting a Gaussian GLMM (a linear mixed model or LMM), we assign independent inverse-Gamma priors to the species-specific residual variance parameters \(\tau^2_i\).

Fitting multivariate GLMMs with msAbund()

@@ -1260,7 +1261,7 @@

Fitting multivariate GLMMs with msAbund()

n.batch, batch.length, accept.rate = 0.43, family = 'Poisson', n.omp.threads = 1, verbose = TRUE, n.report = 100, n.burn = round(.10 * n.batch * batch.length), n.thin = 1, n.chains = 1, - ...)
+ save.fitted = TRUE, ...)

Notice these are the exact same arguments we saw with abund(). We will again use data from the North American Breeding Bird Survey in Pennsylvania, USA, but now we will use data from all six species. Recall this is stored in the bbsData object.

# Remind ourselves of the structure of bbsData
 str(bbsData)
@@ -1286,13 +1287,13 @@

Fitting multivariate GLMMs with msAbund()

ms.formula <- ~ scale(bio2) + scale(bio8) + scale(bio18) + scale(forest) + 
                 scale(devel) + scale(day) + I(scale(day)^2) + scale(tod) + 
                 (1 | obs)
-

Next we specify the initial values in inits. For multivariate GLMMs, we supply initial values for community-level and species-level parameters. In msAbund(), we will supply initial values for the following parameters: beta.comm (community-level coefficients), beta (species-level coefficients), tau.sq.beta (community-level variance parameters), kappa (species-level negative binomial overdispersion parameters), and sigma.sq.mu (random effect variances). These are all specified in a single list. Initial values for community-level parameters (including the random effect variances) are either vectors of length corresponding to the number of community-level parameters in the model (including the intercepts) or a single value if all parameters are assigned the same initial values. Initial values for species level parameters are either matrices with the number of rows indicating the number of species, and each column corresponding to a different regression parameter, or a single value if the same initial value is used for all species and parameters.

+

Next we specify the initial values in inits. For multivariate GLMMs, we supply initial values for community-level and species-level parameters. In msAbund(), we will supply initial values for the following parameters: beta.comm (community-level coefficients), beta (species-level coefficients), tau.sq.beta (community-level variance parameters), kappa (species-level negative binomial dispersion parameters), sigma.sq.mu (random effect variances), and tau.sq (species-level Gaussian variance parameters). These are all specified in a single list. Initial values for community-level parameters (including the random effect variances) are either vectors of length corresponding to the number of community-level parameters in the model (including the intercepts) or a single value if all parameters are assigned the same initial values. Initial values for species level coefficients are either matrices with the number of rows indicating the number of species, and each column corresponding to a different regression parameter, or a single value if the same initial value is used for all species and parameters. Similarly, initial values for the species-specific NB dispersion parameter and the Gaussian variance parameter is either a vector with a different initial value for each species, or a single value if the same initial value is used for all species.

# Number of species
 n.sp <- dim(bbsData$y)[1]
 ms.inits <- list(beta.comm = 0, beta = 0, tau.sq.beta = 1, kappa = 1, 
                  sigma.sq.mu = 0.5)
-

In multivariate models, we specify priors on the community-level coefficients (or hyperparameters) rather than the species-level effects, with the exception that we still assign uniform priors for the negative binomial overdispersion parameter. For nonspatial models, these priors are specified with the following tags: beta.comm.normal (normal prior on the community-level mean effects), tau.sq.beta.ig (inverse-Gamma prior on the community-level variance parameters), sigma.sq.mu.ig (inverse-Gamma prior on the random effect variances), kappa.unif (uniform prior on the species-specific overdispersion parameters). For all parameters except the species-specific overdispersion parameters, each tag consists of a list with elements corresponding to the mean and variance for normal priors and scale and shape for inverse-Gamma priors. Values can be specified individually for each parameter or as a single value if the same prior is assigned to all parameters of a given type. For the species-specific overdispersion parameters, the prior is specified as a list with two elements representing the lower and upper bound of the uniform distribution, where each of the elements can be a single value if the same prior is used for all species, or a vector of values for each species if specifying a different prior for each species.

-

By default, we set the prior hyperparameter values for the community-level means to a mean of 0 and a variance of 100. Below we specify these priors explicitly. For the community-level variance parameters, by default we set the scale and shape parameters to 0.1 following the recommendations of (Lunn et al. 2013), which results in a weakly informative prior on the community-level variances. This may lead to shrinkage of the community-level variance towards zero under certain circumstances so that all species will have fairly similar values for the species-specific covariate effect (Gelman 2006), although we have found multi-species models to be relatively robust to this prior specification. For the negative binomial overdispersion parameters, we by default use a lower bound of 0 and upper bound of 100 for the uniform priors for each species, as we did in the single-species (univariate) models. Below we explicitly set these default prior values for our example.

+

In multivariate models, we specify priors on the community-level coefficients (or hyperparameters) rather than the species-level effects, with the exception that we still assign uniform priors for the species-specific negative binomial overdispersion parameter and species-specific Gaussian variance parameter. For nonspatial models, these priors are specified with the following tags: beta.comm.normal (normal prior on the community-level mean effects), tau.sq.beta.ig (inverse-Gamma prior on the community-level variance parameters), sigma.sq.mu.ig (inverse-Gamma prior on the random effect variances), kappa.unif (uniform prior on the species-specific overdispersion parameters), tau.sq.ig (inverse-gamma prior on the species-specific Gaussian variance parameters). For all parameters except the species-specific NB overdispersion parameters, each tag consists of a list with elements corresponding to the mean and variance for normal priors and scale and shape for inverse-Gamma priors. Values can be specified individually for each parameter or as a single value if the same prior is assigned to all parameters of a given type. For the species-specific overdispersion parameters, the prior is specified as a list with two elements representing the lower and upper bound of the uniform distribution, where each of the elements can be a single value if the same prior is used for all species, or a vector of values for each species if specifying a different prior for each species.

+

By default, we set the prior hyperparameter values for the community-level means to a mean of 0 and a variance of 100. Below we specify these priors explicitly. For the community-level variance parameters, by default we set the scale and shape parameters to 0.1 following the recommendations of (Lunn et al. 2013), which results in a weakly informative prior on the community-level variances. This may lead to shrinkage of the community-level variance towards zero under certain circumstances so that all species will have fairly similar values for the species-specific covariate effect (Gelman 2006), although we have found multi-species models to be relatively robust to this prior specification. For the negative binomial overdispersion parameters, we by default use a lower bound of 0 and upper bound of 100 for the uniform priors for each species, as we did in the single-species (univariate) models. Below we explicitly set these default prior values for our example. For the Gaussian variance parameters, we by default set the shape and scale parameter equal to 0.01, which results in a vague prior on the residual variances.

ms.priors <- list(beta.comm.normal = list(mean = 0, var = 100),
                   tau.sq.beta.ig = list(a = 0.1, b = 0.1),
                   sigma.sq.mu.ig = list(a = 0.1, b = 0.1),
@@ -1300,7 +1301,7 @@ 

Fitting multivariate GLMMs with msAbund()

All that’s now left to do is specify the number of threads to use (n.omp.threads), the number of MCMC samples (which we do by specifying the number of batches n.batch and batch length batch.length), the initial tuning variances (tuning), the amount of samples to discard as burn-in (n.burn), the thinning rate (n.thin), and arguments to control the display of sampler progress (verbose, n.report). Note for the tuning variances, we do not need to specify initial tuning values for any of the community-level parameters, as those parameters can be sampled with an efficient Gibbs update. We will also run the model with a Poisson distribution for abundance, which later we will shortly compare to a negative binomial.

# Specify initial tuning values
 ms.tuning <- list(beta = 0.3, beta.star = 0.5, kappa = 0.5)
-# Approx. run time:  1.5 min
+# Approx. run time:  2 min
 out.ms <- msAbund(formula = ms.formula,
                   data = bbsData,
                   inits = ms.inits,
@@ -1442,7 +1443,7 @@ 

Fitting multivariate GLMMs with msAbund()

Thinning Rate: 10 Number of Chains: 3 Total Posterior Samples: 3000 -Run Time (min): 1.6268 +Run Time (min): 1.8479 ---------------------------------------- Community Level @@ -1536,11 +1537,11 @@

Fitting multivariate GLMMs with msAbund()

scale(tod)-MAWA 0.0270 0.2652 -0.4964 0.0259 0.5432 1.0004 1726
# Or more explicitly
 # summary(out.ms, level = 'both')
-

We see adequate convergence of most parameters, although we may run the model a bit longer to ensure convergence of the community-level variance paramters. Looking at the community-level values, we see a very strong effect of forest cover on average across the community. This is not too surprising given that we are working with six warbler species that breed in forest habitat.

+

We see adequate convergence of most parameters, although we may run the model a bit longer to ensure convergence of the community-level variance parameters. Looking at the community-level values, we see a very strong effect of forest cover on average across the community. This is not too surprising given that we are working with six warbler species that breed in forest habitat.

Posterior predictive checks

-

We can use the ppcAbund() function to again perform posterior predictive checks, and then subsequently summarize the check with a Bayesian p-value using the summary() function.

+

We can use the ppcAbund() function to again perform posterior predictive checks, and then subsequently summarize the check with a Bayesian p-value using the summary() function. Notice for multivariate models we perform a posterior predictive check separately for each species, and the resulting Bayesian p-values can be summarized for both the overall community and individually for each species.

ppc.ms.out <- ppcAbund(out.ms, fit.stat = 'freeman-tukey', group = 0)
Currently on species 1 out of 6
Currently on species 2 out of 6
@@ -1578,7 +1579,7 @@

Posterior predictive checks

Model selection using WAIC

Model selection (sometimes also called model comparison) proceeds exactly as before using WAIC. We compare our previous model to the same model, but now use a negative binomial distribution. Note that for multivariate models the logical argument by.sp allows us to calculate WAIC individually for each species if set to TRUE.

-
# Approx. run time:  1.5 min
+
# Approx. run time:  2 min
 out.ms.nb <- msAbund(formula = ms.formula,
                      data = bbsData,
                      inits = ms.inits,
@@ -1631,7 +1632,7 @@ 

Prediction

$ y.0.samples : int [1:3000, 1:6, 1:816, 1] 7 19 7 8 10 8 7 8 4 5 ... $ call : language predict.msAbund(object = out.ms, X.0 = X.0, ignore.RE = TRUE) - attr(*, "class")= chr "predict.msAbund"
-

Notice we now have estimates of relative abundance across the state for each species, which we can use to generate a map of relative abundance for each species. Alternatively, we can generate a map of total relative abundance across all species as a simple measure of “how many birds” we could expect to observe at any given location.

+

Notice we now have estimates of relative abundance across the state for each species, which we can use to generate a map of relative abundance for each species. Alternatively, we can generate a map of total relative abundance across all species as a simple measure of “how many birds” we could expect to observe at any given location. Of course, this is just one community-level metric we could generate and we could instead generate some form of abundance-weighted richness metric, diversity metric, etc.

mu.sum.samples <- apply(out.ms.pred$mu.0.samples, c(1, 3), sum)
 # Average total abundance at each site
 mu.sum.means <- apply(mu.sum.samples, 2, mean)
@@ -1648,7 +1649,7 @@ 

Prediction

labs(fill = '', x = 'Longitude', y = 'Latitude', title = 'Relative abundance of six warbler species')

-

If you’re familiar with Pennsylvania geography, this map makes a fair amount of sense as total relative abundance of the six species is highest in the North-Central portion of the state (which is heavily forestedand more variable in elevation) and is lowest in the Southeastern portion of the state (which is heavily developed).

+

If you’re familiar with Pennsylvania geography, this map makes a fair amount of sense as total relative abundance of the six species is highest in the North-Central portion of the state (which is heavily forested and more variable in elevation) and is lowest in the Southeastern portion of the state (which is heavily developed).

@@ -1658,13 +1659,13 @@

Basic model description

The latent factor multivariate GLMM is identical to the previously described multivariate GLMM except we include an additional component in the model to accommodate residual correlations between species. This type of model is often referred to as an abundance-based joint species distribution model (JSDM) in the statistical ecology literature (Warton et al. 2015). The previously described multivariate GLMM can be viewed as a simplified version of the latent factor multivariate GLMM, where we assume there are no residual correlations between species. In fact, the previously described multivariate GLMM could also more simply be described as a simple GLMM with a bunch of random intercepts and slopes across the different species in the data set. In the statistical literature, the latent factor multivariate GLMM now explicitly accounts for correlations between responses, and thus may formally be considered a “joint” model.

The model is identical to the previously described multivariate GLMM, with the only addition being a species-specific random effect at each site added to the equation for relative abundance. More specifically, we model species-specific relative abundance as

\[\begin{equation} -\text{log}(\mu_{i, j}) = \boldsymbol{x}_j^\top\boldsymbol{\beta}_i + \text{w}^\ast_{i, j}. +g(\mu_{i, j}) = \boldsymbol{x}_j^\top\boldsymbol{\beta}_i + \text{w}^\ast_{i, j}. \end{equation}\]

The species-specific random effect \(\text{w}^\ast_{i, j}\) is used to account for residual correlations between species by assuming that correlation amongst the species can be explained by species-specific effects of a set of \(q\) latent variables. More specifically, we use a factor modeling approach (Lopes and West 2004) to account for species residual correlations in a computationally efficient manner (Hui et al. 2015). This approach is ideal for large groups of species, where estimating a full \(I \times I\) covariance matrix would be computationally intractable (and/or require massive amounts of data). Specifically, we decompose \(\text{w}^\ast_{i, j}\) into a linear combination of \(q\) latent variables (i.e., factors) and their associated species-specific coefficients (i.e., factor loadings) according to

\[\begin{equation}\label{factorModel} \text{w}^\ast_{i, j} = \boldsymbol{\lambda}_i^\top\text{w}_j, \end{equation}\]

-

where \(\boldsymbol{\lambda}_i^\top\) is the \(i\) row of factor loadings from an \(I \times q\) loadings matrix \(\boldsymbol{\Lambda}\), and \(\text{w}_j\) is a \(q \times 1\) vector of independent latent factors at site \(j\). By settng \(q << N\), we achieve dimension reduction to efficiently model communities with a large number of species (Lopes and West 2004; Hui et al. 2015). The approach accounts for residual species correlations via their species-specific responses to the \(q\) factors. We model each latent factor as a standard normal random variable. To ensure identifiability of the latent factors from the latent factor loadings, we fix the upper triangle of the factor loadings matrix to 0 and the diagonal elements to 1. We assign standard normal priors to the lower triangular elements of the factor loadings matrix. All other priors are identical to the multivariate GLMM previously discussed.

+

where \(\boldsymbol{\lambda}_i^\top\) is the \(i\text{th}\) row of factor loadings from an \(I \times q\) loadings matrix \(\boldsymbol{\Lambda}\), and \(\text{w}_j\) is a \(q \times 1\) vector of independent latent factors at site \(j\). By settng \(q << N\), we achieve dimension reduction to efficiently model communities with a large number of species (Lopes and West 2004; Hui et al. 2015). The approach accounts for residual species correlations via their species-specific responses to the \(q\) factors. We model each latent factor as a standard normal random variable. To ensure identifiability of the latent factors from the latent factor loadings, we fix the upper triangle of the factor loadings matrix to 0 and the diagonal elements to 1. We assign standard normal priors to the lower triangular elements of the factor loadings matrix. All other priors are identical to the multivariate GLMM previously discussed.

The constraints on the factor loadings matrix ensure identifiability of the factor loadings from the latent factors, but this also results in important practical considerations when fitting these models (e.g., ordering of the species, initial values). This vignette on the spOccupancy website discusses these (and other) considerations. All the advice applied to factor models fit with detection-nondetection data in spOccupancy also apply to factor models fit in spAbundance.

@@ -1674,7 +1675,7 @@

Fitting latent factor multivariate GLMMs with lfMsAbund()

n.batch, batch.length, accept.rate = 0.43, family = 'Poisson', n.omp.threads = 1, verbose = TRUE, n.report = 100, n.burn = round(.10 * n.batch * batch.length), n.thin = 1, n.chains = 1, - ...)
+ save.fitted = TRUE, ...)

For guidance on choosing the number of latent factors, see this vignette on the spOccupancy website. Here we will fit a model with 3 latent factors.

n.factors <- 3

There are only a few slight differences in how we go about fitting a multivariate GLMM with latent factors compared to one without latent factors. The data format as well as formula remain the same as before.

@@ -1711,7 +1712,7 @@

Fitting latent factor multivariate GLMMs with lfMsAbund()

We finish up by specifying the tuning values, where now we specify the initial tuning variance for the factor loadings lambda as well as the latent factors w. We then run the model for 20,000 iterations with a burn-in of 10,000 samples and a thinning rate of 10, for each of 3 chains, yielding a total of 3000 posterior samples. We will fit the model with a Poisson distribution for abundance.

# Specify initial tuning values
 ms.tuning <- list(beta = 0.3, beta.star = 0.5, kappa = 0.5, w = 0.5, lambda = 0.5)
-# Approx. run time: ~2.6 min
+# Approx. run time: 3 min
 out.lf.ms <- lfMsAbund(formula = ms.formula,
                        data = bbsData,
                        inits = ms.inits,
@@ -1857,7 +1858,7 @@ 

Fitting latent factor multivariate GLMMs with lfMsAbund()

Thinning Rate: 10 Number of Chains: 3 Total Posterior Samples: 3000 -Run Time (min): 2.6156 +Run Time (min): 2.8613 ---------------------------------------- Community Level @@ -1987,7 +1988,7 @@

Fitting latent factor multivariate GLMMs with lfMsAbund()

Posterior predictive checks

-

We again use the ppcAbund() function perform a posterior predictive check of our model

+

We again use the ppcAbund() function to perform a posterior predictive check of our model

ppc.out.lf.ms <- ppcAbund(out.lf.ms, fit.stat = 'freeman-tukey', group = 0)
Currently on species 1 out of 6
Currently on species 2 out of 6
@@ -2080,22 +2081,22 @@

Prediction

Spatial factor multivariate GLMMs

Basic model description

-

Our final, and most complex, GLMM that we fit in spAbundance is a multivariate spatially-explicit GLMM. This model is nearly identical to the latent factor multivariate GLMM, except we give a spatial structure to the latent factors instead of assuming they are independent from each other. By modeling the latent factors with a spatial structure, we will often see such a model have improved predictive performance relative to a latent factor multivariate GLMM (Doser, Finley, and Banerjee 2023). This model again is an abundance-based JSDM, but now it simultaneously accounts for residual correlations between species and spatial autocorrelation. The model we present below is a direct extension of the Gaussian spatial factor NNGP model of Taylor-Rodriguez et al. (2019), where we now allow for the response to be Poisson or negative binomial (in addition to Gaussian).

+

Our final, and most complex, GLMM that we fit in spAbundance is a multivariate spatially-explicit GLMM. This model is nearly identical to the latent factor multivariate GLMM, except we give a spatial structure to the latent factors instead of assuming they are independent from each other. By modeling the latent factors with a spatial structure, we will often see such a model have improved predictive performance relative to a latent factor multivariate GLMM (Thorson et al. 2015; Ovaskainen et al. 2016; Doser, Finley, and Banerjee 2023). This model again is an abundance-based JSDM, but now it simultaneously accounts for residual correlations between species and spatial autocorrelation. The model we present below is a direct extension of the Gaussian spatial factor NNGP model of Taylor-Rodriguez et al. (2019), where we now allow for the response to be Poisson or negative binomial (in addition to Gaussian).

The model is identical to the previously described latent factor multivariate GLMM with the exception that the latent factors are assumed to have a spatial structure to them. More specifically, our model for species-specific relative abundance is again

\[\begin{equation} -\text{log}(\mu_{i}(\boldsymbol{s}_j)) = \boldsymbol{x}(\boldsymbol{s}_j)^\top\boldsymbol{\beta}_i + \text{w}^\ast_{i}(\boldsymbol{s}_j). +g(\mu_{i}(\boldsymbol{s}_j)) = \boldsymbol{x}(\boldsymbol{s}_j)^\top\boldsymbol{\beta}_i + \text{w}^\ast_{i}(\boldsymbol{s}_j). \end{equation}\]

Note again that for spatial models we use the notation \(\boldsymbol{s}_j\) to make it clear that the model is spatially-explicit and relies upon the coordinates (\(\boldsymbol{s}_j\)) at each site \(j\) to estimate this spatial pattern. As with the latent factor model, we decompose \(\text{w}^\ast_i(\boldsymbol{s}_j)\) into a linear combination of \(q\) latent variables (i.e., factors) and their associated species-specific coefficients (i.e., factor loadings) according to

\[\begin{equation} \text{w}^\ast_{i}(\boldsymbol{s}_j) = \boldsymbol{\lambda}_i^\top\textbf{w}(\boldsymbol{s}_j). \end{equation}\]

-

Now, instead of modeling \(\textbf{w}(\boldsymbol{s}_j)\) as independent normal latent variables, we assume \(\textbf{w}(\boldsymbol{s}_j)\) arise from spatial processes, allowing us to account for residual correlation in species-specific abundance. More specifically, each spatial factor \(\textbf{w}_r\) for each \(r = 1, \dots, q\) is modeled using a Nearest Neighbor Gaussian Process (Datta et al. 2016), i.e.,

+

Now, instead of modeling \(\textbf{w}(\boldsymbol{s}_j)\) as independent normal latent variables, we assume \(\textbf{w}(\boldsymbol{s}_j)\) arise from spatial processes, allowing us to account for both residual species correlations and residual spatial autocorrelation in species-specific abundance. More specifically, each spatial factor \(\textbf{w}_r\) for each \(r = 1, \dots, q\) is modeled using a Nearest Neighbor Gaussian Process (Datta et al. 2016), i.e.,

\[\begin{equation} \textbf{w}_r \sim \text{Normal}(\boldsymbol{0}, \tilde{\boldsymbol{C}}_r(\boldsymbol{\theta}_r)), \end{equation}\]

where \(\tilde{\boldsymbol{C}}_r(\boldsymbol{\theta}_r)\) is the NNGP-derived correlation matrix for the \(r^{\text{th}}\) spatial factor. Note that the spatial variance parameter for each spatial factor is assumed to be 1 for identifiability purposes. The vector \(\boldsymbol{\theta}_r\) consists of parameters governing the spatial process for each spatial factor according to some spatial correlation function, as we saw with the single-species GLMM. Thus, for the exponential, spherical, and Gaussian correlation functions, \(\boldsymbol{\theta}_r\) includes a spatial decay parameter \(\phi_r\), while the Matern correlation function includes an additional spatial smoothness parameter, \(\nu_r\).

We assume the same priors and identifiability constraints as the latent factor GLMM. We assign a uniform prior for the spatial decay parameter, \(\phi_r\), and the spatial smoothness parameters, \(\nu_r\), if using a Matern correlation function.

-

Notice that this spatial factor modeling approach is the only approach we implement in spAbundance for modeling multi-species datasets while accounting for spatial autocorrelation. While we could envision fitting a separate spatial random effect for each species in our multi-species data set (as we implemented in spOccupancy in the spMsPGOcc function), we prefer (and recommend) using the spatial factor modeling approach as (1) it is far more computationally efficient than fitting a separate spatial effect for each species; (2) it explicitly acknowledges dependence between species; (3) estimating a separate spatial effect for each species would be very difficult to do with multiple rare species in the data set; and (4) even if the species are independent (i.e., there are no residual correlations), the spatial factor modeling approach provides extremely similarly to a model with a separate spatial process for each species (Doser, Finley, and Banerjee 2023), while still being substantially faster.

+

Notice that this spatial factor modeling approach is the only approach we implement in spAbundance for modeling multi-species datasets while accounting for spatial autocorrelation. While we could envision fitting a separate spatial random effect for each species in our multi-species data set (as we implemented in spOccupancy in the spMsPGOcc function), we prefer (and recommend) using the spatial factor modeling approach as (1) it is far more computationally efficient than fitting a separate spatial effect for each species; (2) it explicitly acknowledges dependence between species; (3) estimating a separate spatial effect for each species would be very difficult to do with multiple rare species in the data set; and (4) even if the species are independent (i.e., there are no residual correlations), the spatial factor modeling approach performs extremely similarly to a model with a separate spatial process for each species (Doser, Finley, and Banerjee 2023), while still being substantially faster.

Fitting spatial factor multivariate GLMMs with sfMsAbund()

@@ -2106,13 +2107,13 @@

Fitting spatial factor multivariate GLMMs with sfMsAbund()

n.batch, batch.length, accept.rate = 0.43, family = 'Poisson', n.omp.threads = 1, verbose = TRUE, n.report = 100, n.burn = round(.10 * n.batch * batch.length), n.thin = 1, n.chains = 1, - ...)
+ save.fitted = TRUE, ...)

We will again fit the model with three spatial factors, and will use the same covariates/random effects on abundance and detection as we have done throughout the vignette

n.factors <- 3
 ms.formula <- ~ scale(bio2) + scale(bio8) + scale(bio18) + scale(forest) +
                 scale(devel) + scale(day) + I(scale(day)^2) + scale(tod) +
                 (1 | obs)
-

Initial values are identical to what we saw with lfMsAbund(), but we will also specify the initial values for the spatial decay parameters for each spatial factor (we use an exponential correlation model so we do not need to specify any initial values for the nu, the spatial smoothness parameter used when cov.model = 'matern'.

+

Initial values are identical to what we saw with lfMsAbund(), but we will also specify the initial values for the spatial decay parameters for each spatial factor (we use an exponential correlation model so we do not need to specify any initial values for nu, the spatial smoothness parameter used when cov.model = 'matern'.

# Number of species
 n.sp <- dim(bbsData$y)[1]
 # Initiate all lambda initial values to 0.
@@ -2333,7 +2334,7 @@ 

Fitting spatial factor multivariate GLMMs with sfMsAbund()

Thinning Rate: 10 Number of Chains: 3 Total Posterior Samples: 3000 -Run Time (min): 7.9458 +Run Time (min): 8.2892 ---------------------------------------- Community Level @@ -2526,7 +2527,7 @@

Prediction

$ y.0.samples : num [1:3000, 1:6, 1:816, 1] 25 3 32 39 14 13 5 12 23 3 ... $ w.0.samples : num [1:3000, 1:3, 1:816] 0.723 -0.745 1.303 1.945 0.632 ... $ mu.0.samples: num [1:3000, 1:6, 1:816, 1] 22.46 4.42 28.69 42.11 16.28 ... - $ run.time : 'proc_time' Named num [1:5] 232.9 118 90.2 0 0 + $ run.time : 'proc_time' Named num [1:5] 243.3 122.7 94.2 0 0 ..- attr(*, "names")= chr [1:5] "user.self" "sys.self" "elapsed" "user.child" ... $ call : language predict.sfMsAbund(object = out.sf.ms, X.0 = X.0, coords.0 = coords.0, verbose = TRUE, n.report = 100, ignore.RE = TRUE) $ object.class: chr "sfMsAbund" @@ -2564,6 +2565,9 @@

References

Datta, Abhirup, Sudipto Banerjee, Andrew O Finley, and Alan E Gelfand. 2016. “Hierarchical Nearest-Neighbor Gaussian Process Models for Large Geostatistical Datasets.” Journal of the American Statistical Association 111 (514): 800–812.

+
+

Diggle, Peter J. 1998. “Model-Based Geostatistics.” Journal of the Royal Statistical Society Series C: Applied Statistics 47 (3): 299–350.

+

Doser, Jeffrey W, Andrew O Finley, and Sudipto Banerjee. 2023. “Joint Species Distribution Models with Imperfect Detection for High-Dimensional Spatial Data.” Ecology 104 (9): e4137.

@@ -2597,8 +2601,8 @@

References

Lunn, David, Christopher Jackson, Nicky Best, Andrew Thomas, and David Spiegelhalter. 2013. “The Bugs Book: A Practical Introduction to Bayesian Analysis.”

-
-

“Model-Based Geostatistics.” 1998. Journal of the Royal Statistical Society Series C: Applied Statistics 47 (3): 299–350.

+
+

Ovaskainen, Otso, David B Roy, Richard Fox, and Barbara J Anderson. 2016. “Uncovering Hidden Spatial Structure in Species Communities with Spatially Explicit Joint Species Distribution Models.” Methods in Ecology and Evolution 7 (4): 428–36.

Pardieck, K. L., D. J. Ziolkowski Jr, M. Lutmerding, V. I. Aponte, and M-A.R. Hudson. 2020. “North American Breeding Bird Survey Dataset 1966–2019.” U.S. Geological Survey Data Release, Https://Doi.org/10.5066/P9J6QUF6.

@@ -2609,6 +2613,9 @@

References

Taylor-Rodriguez, Daniel, Andrew O Finley, Abhirup Datta, Chad Babcock, Hans-Erik Andersen, Bruce D Cook, Douglas C Morton, and Sudipto Banerjee. 2019. “Spatial Factor Models for High-Dimensional and Large Spatial Data: An Application in Forest Variable Mapping.” Statistica Sinica 29: 1155.

+
+

Thorson, James T, Mark D Scheuerell, Andrew O Shelton, Kevin E See, Hans J Skaug, and Kasper Kristensen. 2015. “Spatial Factor Analysis: A New Tool for Estimating Joint Species Distributions and Correlations in Species Range.” Methods in Ecology and Evolution 6 (6): 627–37.

+

Warton, David I, F Guillaume Blanchet, Robert B O’Hara, Otso Ovaskainen, Sara Taskinen, Steven C Walker, and Francis KC Hui. 2015. “So Many Variables: Joint Modeling in Community Ecology.” Trends in Ecology & Evolution 30 (12): 766–79.

diff --git a/vignettes/references.bib b/vignettes/references.bib index 78e3dd6..ea02388 100644 --- a/vignettes/references.bib +++ b/vignettes/references.bib @@ -912,3 +912,25 @@ @article{hui2015model publisher={Wiley Online Library} } +@article{thorson2015spatial, + title={Spatial factor analysis: a new tool for estimating joint species distributions and correlations in species range}, + author={Thorson, James T and Scheuerell, Mark D and Shelton, Andrew O and See, Kevin E and Skaug, Hans J and Kristensen, Kasper}, + journal={Methods in Ecology and Evolution}, + volume={6}, + number={6}, + pages={627--637}, + year={2015}, + publisher={Wiley Online Library} +} + +@article{ovaskainen2016uncovering, + title={Uncovering hidden spatial structure in species communities with spatially explicit joint species distribution models}, + author={Ovaskainen, Otso and Roy, David B and Fox, Richard and Anderson, Barbara J}, + journal={Methods in Ecology and Evolution}, + volume={7}, + number={4}, + pages={428--436}, + year={2016}, + publisher={Wiley Online Library} +} +