Andrew Heiss • 2018-03-08
tl;dr: Use the functions in broomify-amelia.R to run broom::tidy(), broom::glance(), and huxtable::huxreg() on lists of multiply imputed models.
The whole reason I went down the rabbit hole of the mechanics of merging imputed regression results in the previous post was so I could easily report these results in papers and writeups. In political science and economics (and probably other social science disciplines), it’s fairly standard to report many regression models in a side-by-side table, with a column for each model and rows for each coefficient. R packages like stargazer and huxtable make this fairly straightforward.
library(tidyverse)
library(Amelia)
library(stargazer)
library(huxtable)
library(broom)
# Use the africa dataset from the Amelia package
data(africa)
# Build some example models
model_original1 <- lm(gdp_pc ~ trade + civlib, data = africa)
model_original2 <- lm(gdp_pc ~ trade + civlib + infl, data = africa)
Stargazer takes a list of models:
stargazer(model_original1, model_original2, type = "text")
##
## ====================================================================
## Dependent variable:
## ------------------------------------------------
## gdp_pc
## (1) (2)
## --------------------------------------------------------------------
## trade 18.027*** 18.494***
## (1.272) (1.204)
##
## civlib -665.428*** -588.722***
## (185.436) (175.717)
##
## infl -6.336***
## (1.620)
##
## Constant 136.063 166.435*
## (100.409) (94.871)
##
## --------------------------------------------------------------------
## Observations 115 115
## R2 0.653 0.695
## Adjusted R2 0.647 0.687
## Residual Std. Error 352.491 (df = 112) 331.931 (df = 111)
## F Statistic 105.569*** (df = 2; 112) 84.469*** (df = 3; 111)
## ====================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
As does huxtable’s huxreg():
huxreg(model_original1, model_original2) %>%
print_screen()
## ─────────────────────────────────────────────
## ( 1) ( 2)
## ───────────────────────────────
## (Intercept) 136.063 166.435
## (100.409) (94.871)
## trade 18.027 *** 18.494 ***
## (1.272) (1.204)
## civlib -665.428 *** -588.722 **
## (185.436) (175.717)
## infl -6.336 ***
## (1.620)
## ───────────────────────────────
## N 115 115
## R 2 0.653 0.695
## logLik -836.136 -828.709
## AIC 1680.272 1667.418
## ─────────────────────────────────────────────
## *** p < 0.001; ** p < 0.01; * p <
## 0.05.
##
## Column names: names, model1, model2
Stargazer has support for a ton of different model types (see ?stargazer for details), but they’re all hardcoded into stargazer’s internal code and adding more is tricky. Huxtable, on the other hand, doesn’t rely on hardcoded model processing; instead it will display any model that works with broom::tidy() and broom::glance(). The broom package supports way more models than stargazer (including models created with rstan and rstanarm!), and because of this, huxtable is far more extensible: if you can create a tidy() and a glance() function for a type of model, huxtable can use it.
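To get a feel for what huxtable consumes, here are those two calls on one of the models above (I’ll spare you the output; the comments just describe the standard columns broom returns for lm objects):

# One row per coefficient: term, estimate, std.error, statistic, p.value
tidy(model_original1)

# One row of model-level statistics: r.squared, adj.r.squared, logLik, AIC, BIC, etc.
glance(model_original1)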
Also, stargazer was written before R Markdown was really a thing, so it has excellent support for HTML and LaTeX output, but that’s it. Including stargazer tables in an R Markdown document is a hassle, especially if you want to be able to convert it to Word (I’ve written a Python script for doing this; that’s how much extra work it takes). Huxtable, though, was written after the R Markdown and tidyverse revolutions, so it supports piping and can output to HTML, LaTeX, and Markdown (with huxtable::print_md()).
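For example, the same table from above can be piped straight to Markdown (just the call here; the rendered Markdown table is omitted):

# Render the regression table as Markdown instead of printing it to the screen
huxreg(model_original1, model_original2) %>%
  print_md()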
This history is important because it means that models based on multiple imputation will not work with stargazer. Melding all the coefficients across imputations creates nice data frames of model results, but it doesn’t create a model that stargazer can work with. This is unfortunate, especially given how much I use stargazer. However, if we could make a tidy() and a glance() function that could work with a list of multiply imputed models, huxtable would solve all our problems.

So here’s how to solve all your problems :)
First, we’ll impute the missing data in the Africa data set, nest the imputed data in a larger data frame, and run a model on each imputed dataset:
set.seed(1234)
imp_amelia <- amelia(x = africa, m = 5, cs = "country", ts = "year", logs = "gdp_pc", p2s = 0)
models_imputed_df <- bind_rows(unclass(imp_amelia$imputations), .id = "m") %>%
group_by(m) %>%
nest() %>%
mutate(model = data %>% map(~ lm(gdp_pc ~ trade + civlib, data = .)))
models_imputed_df
## # A tibble: 5 x 3
## m data model
## <chr> <list> <list>
## 1 imp1 <tibble [120 × 7]> <S3: lm>
## 2 imp2 <tibble [120 × 7]> <S3: lm>
## 3 imp3 <tibble [120 × 7]> <S3: lm>
## 4 imp4 <tibble [120 × 7]> <S3: lm>
## 5 imp5 <tibble [120 × 7]> <S3: lm>
Before we do anything with the models in models_imputed_df$model, we can define a few functions to extend broom. R’s S3 object system means that a function named whatever.blah() will automatically work when called on objects with the class blah. This is how broom generally works: there are functions named tidy.anova(), tidy.glm(), tidy.lm(), etc. that will do the correct tidying when run on anova, glm, and lm objects. Huxtable also takes advantage of this S3 object system; it will call the appropriate tidy and glance functions based on the class of the models passed to it.
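As a quick illustration of how that dispatch works (this toy blah class is just for demonstration; it isn’t part of the post’s code):

# print() is an S3 generic, so defining print.blah() makes it the method
# R calls for anything with class "blah"
print.blah <- function(x, ...) {
  cat("A blah object with", length(x), "elements\n")
}

thing <- list(a = 1, b = 2, c = 3)
class(thing) <- "blah"
print(thing)
## A blah object with 3 elements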
To make a list of models work with broom, we need to invent a new class of model. In this example I’ve named it melded, but it could be anything. Here are three functions designed to work on melded objects (the code for these is largely based on the previous post about melding coefficients). These functions are also found in broomify-amelia.R, which you can add to your project (maybe someday this could be an actual package, but I don’t see a reason for it yet).
tidy.melded <- function(x, conf.int = FALSE, conf.level = 0.95) {
# Get the df from one of the models
model_degrees_freedom <- glance(x[[1]])$df.residual
# Create matrices of the estimates and standard errors
params <- data_frame(models = x) %>%
mutate(m = 1:n(),
tidied = models %>% map(tidy)) %>%
unnest(tidied) %>%
select(m, term, estimate, std.error) %>%
gather(key, value, estimate, std.error) %>%
mutate(term = fct_inorder(term)) %>% # Order the terms so that spread() keeps them in order
spread(term, value)
just_coefs <- params %>% filter(key == "estimate") %>% select(-m, -key)
just_ses <- params %>% filter(key == "std.error") %>% select(-m, -key)
# Meld the coefficients with Rubin's rules
coefs_melded <- mi.meld(just_coefs, just_ses)
# Create tidy output
output <- as.data.frame(cbind(t(coefs_melded$q.mi),
t(coefs_melded$se.mi))) %>%
magrittr::set_colnames(c("estimate", "std.error")) %>%
mutate(term = rownames(.)) %>%
select(term, everything()) %>%
mutate(statistic = estimate / std.error,
p.value = 2 * pt(abs(statistic), model_degrees_freedom, lower.tail = FALSE))
# Add confidence intervals if needed
  if (conf.int) {
# Convert conf.level to tail values (0.025 when it's 0.95)
a <- (1 - conf.level) / 2
output <- output %>%
mutate(conf.low = estimate + std.error * qt(a, model_degrees_freedom),
conf.high = estimate + std.error * qt((1 - a), model_degrees_freedom))
}
# tidy objects only have a data.frame class, not tbl_df or anything else
class(output) <- "data.frame"
output
}
glance.melded <- function(x) {
# Because the properly melded parameters and the simple average of the
# parameters of these models are roughly the same (see
# https://www.andrewheiss.com/blog/2018/03/07/amelia-tidy-melding/), for the
  # sake of simplicity we just take the average here
output <- data_frame(models = x) %>%
mutate(glance = models %>% map(glance)) %>%
unnest(glance) %>%
summarize_at(vars(r.squared, adj.r.squared, sigma, statistic, p.value, df,
logLik, AIC, BIC, deviance, df.residual),
funs(mean)) %>%
mutate(m = as.integer(length(x)))
# glance objects only have a data.frame class, not tbl_df or anything else
class(output) <- "data.frame"
output
}
nobs.melded <- function(x, ...) {
# Take the number of observations from the first model
nobs(x[[1]])
}
With these three functions, we can now use glance() and tidy() on a list of models with the class melded, like so:
# Extract the models into a vector and make it a "melded" class
models_imputed <- models_imputed_df$model
# Without this, R won't use our custom tidy.melded() or glance.melded() functions
class(models_imputed) <- "melded"
glance(models_imputed)
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## 1 0.6535905 0.6476689 350.2007 110.4054 1.459152e-27 3 -871.7718 1751.544 1762.694
## deviance df.residual m
## 1 14349553 117 5
tidy(models_imputed)
## term estimate std.error statistic p.value
## 1 (Intercept) 111.75580 97.738661 1.143414 2.552007e-01
## 2 trade 18.13329 1.247038 14.541095 5.451100e-28
## 3 civlib -628.06108 182.568297 -3.440143 8.066410e-04
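Because tidy.melded() also accepts conf.int and conf.level, we can ask for melded confidence intervals as well (output omitted; the intervals come from the t-distribution with the residual degrees of freedom, exactly as defined in the function above):

# Melded estimates with 95% confidence intervals
tidy(models_imputed, conf.int = TRUE, conf.level = 0.95)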
Even better, though, is that we can use these imputed models in a huxtable regression table. And, because I included a column named m in glance.melded(), we can also include it in the regression output!
huxreg(model_original1, model_original2, models_imputed,
statistics = c(N = "nobs", R2 = "r.squared", `Adj R2` = "adj.r.squared",
"logLik", "AIC", "m")) %>%
print_screen()
## ────────────────────────────────────────────────────────────
## ( 1) ( 2) ( 3)
## ──────────────────────────────────────────────
## (Intercept) 136.063 166.435 111.756
## (100.409) (94.871) (97.739)
## trade 18.027 *** 18.494 *** 18.133 ***
## (1.272) (1.204) (1.247)
## civlib -665.428 *** -588.722 ** -628.061 ***
## (185.436) (175.717) (182.568)
## infl -6.336 ***
## (1.620)
## ──────────────────────────────────────────────
## N 115 115 120
## R 2 0.653 0.695 0.654
## Adj R 2 0.647 0.687 0.648
## logLik -836.136 -828.709 -871.772
## AIC 1680.272 1667.418 1751.544
## m 5.000
## ────────────────────────────────────────────────────────────
## *** p < 0.001; ** p < 0.01; * p < 0.05.
##
## Column names: names, model1, model2, model3
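And since this is huxtable, getting the final table into a Word document no longer requires an external script. Here’s a quick sketch, assuming a version of huxtable recent enough to include quick_docx() (the file name here is made up):

# Write the combined regression table straight to a Word file
huxreg(model_original1, model_original2, models_imputed,
       statistics = c(N = "nobs", R2 = "r.squared", `Adj R2` = "adj.r.squared",
                      "logLik", "AIC", "m")) %>%
  quick_docx(file = "imputed-models.docx")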