Andrew Heiss • 2018-03-07
Missing data can significantly influence the results of normal regression models, since the default in R and most other statistical packages is to throw away any rows with missing values. To avoid unnecessarily throwing out data, it's helpful to impute missing values. One of the best ways to do this is to build a separate regression model to make predictions that fill in the gaps in the data. These predictions aren't always accurate, though, so it's best to make many iterations of predictions (in imputation parlance, $m$ is the number of imputations done to a dataset). After making $m$ imputed datasets, you can use them by (1) running statistical tests on each imputation individually and then (2) pooling those results into a single number. The excellent Amelia vignette details the theory and mechanics of how to use multiple imputation, and it's a fantastic resource.
There are several packages for dealing with missing data in R, including mi, mice, and Amelia, and Thomas Leeper has a short overview of how to use all three. I'm partial to Amelia, since it's designed to work well with time-series cross-sectional data and can deal with complicated features like country-year observations.
Because Amelia is written by Gary King, et al., it works with Zelig, a separate framework that's designed to simplify modeling in R. With Zelig + Amelia, you can combine all of the $m$ imputations automatically with whatever Zelig uses for printing model results. I'm not a huge fan of Zelig, though, and I prefer using lm(), glm(), stan_glm(), and gang on my own, thank you very much.
However, doing it on my own means there’s a little more work involved with combining coefficients and parameters across imputations. Fortunately, the tidyverse—specifically its ability to store models within data frames—makes it really easy to deal with models based on imputed data. Here’s how to do it using tidy functions. The code for this whole process can be greatly simplified in real life. You technically don’t need all these intermediate steps, though they’re helpful for seeing what’s going on behind the scenes.
We'll start by creating some example imputed data frames with Amelia's built-in africa data. We create 5 imputed datasets, defining countries and years as the cross-section and time-series variables, and we log GDP per capita in the predictive model:
library(tidyverse)
library(Amelia)
library(broom)
set.seed(1234)
data(africa)
imp_amelia <- amelia(x = africa, m = 5, cs = "country", ts = "year", logs = "gdp_pc", p2s = 0)
The resulting object contains a list of data frames, and each imputed dataset is stored in a list slot named "imputations", accessible as imp_amelia$imputations. We can combine these all into one big data frame with bind_rows(), group by the imputation number ($m$), and nest them into imputation-specific rows:
# unclass() is necessary because bind_rows() will complain when dealing with
# lists with the "amelia" class, which is what amelia() returns
all_imputations <- bind_rows(unclass(imp_amelia$imputations), .id = "m") %>%
group_by(m) %>%
nest()
all_imputations
## # A tibble: 5 x 2
## m data
## <chr> <list>
## 1 imp1 <tibble [120 × 7]>
## 2 imp2 <tibble [120 × 7]>
## 3 imp3 <tibble [120 × 7]>
## 4 imp4 <tibble [120 × 7]>
## 5 imp5 <tibble [120 × 7]>
With this nested data, we can use purrr::map() to run models and return tidy summaries of those models directly in the data frame:
models_imputations <- all_imputations %>%
mutate(model = data %>% map(~ lm(gdp_pc ~ trade + civlib, data = .)),
tidied = model %>% map(~ tidy(., conf.int = TRUE)),
glance = model %>% map(~ glance(.)))
models_imputations
## # A tibble: 5 x 5
## m data model tidied glance
## <chr> <list> <list> <list> <list>
## 1 imp1 <tibble [120 × 7]> <S3: lm> <data.frame [3 × 7]> <data.frame [1 × 11]>
## 2 imp2 <tibble [120 × 7]> <S3: lm> <data.frame [3 × 7]> <data.frame [1 × 11]>
## 3 imp3 <tibble [120 × 7]> <S3: lm> <data.frame [3 × 7]> <data.frame [1 × 11]>
## 4 imp4 <tibble [120 × 7]> <S3: lm> <data.frame [3 × 7]> <data.frame [1 × 11]>
## 5 imp5 <tibble [120 × 7]> <S3: lm> <data.frame [3 × 7]> <data.frame [1 × 11]>
Having the models structured like this makes it easy to access coefficients for models from individual imputations, like so:
models_imputations %>%
filter(m == "imp1") %>%
unnest(tidied)
## # A tibble: 3 x 8
## m term estimate std.error statistic p.value conf.low conf.high
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 imp1  (Intercept)    114       97.7       1.17 2.44e- 1    -79.0     308
## 2 imp1  trade           18.1      1.25      14.4  9.65e-28     15.6      20.6
## 3 imp1  civlib        -631      182         -3.46 7.47e- 4   -993      -270
More importantly, we can access the coefficients for all the models, which is essential for combining and averaging the coefficients across all five imputations.
Pooling or melding coefficients from many models is a little trickier than just averaging them all together (as delightfully easy as that would be). Donald Rubin (1987) outlines a set of rules for combining the results from multiply imputed datasets that averages the point estimates and accounts for both within- and between-imputation variance in the standard errors. Rubin's rules are essentially a fancier, more robust way of averaging coefficients and other quantities of interest across imputations.
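To make it clearer what those rules actually do, here's a minimal hand-rolled sketch (my own translation of Rubin's formulas, not anything from Amelia): the pooled estimate is just the mean of the estimates across imputations, while the pooled variance adds a between-imputation penalty to the average within-imputation variance:

# Hand-rolled sketch of Rubin's rules, just to show the logic.
# q: an m-by-k matrix of coefficient estimates (one row per imputation)
# se: an m-by-k matrix of standard errors
rubin_rules <- function(q, se) {
  m <- nrow(q)
  q_bar <- colMeans(q)  # Pooled estimates: average across imputations
  within <- colMeans(se^2)  # Average within-imputation variance
  between <- apply(q, 2, var)  # Between-imputation variance of the estimates
  total <- within + (1 + 1 / m) * between  # Total variance
  list(q.mi = q_bar, se.mi = sqrt(total))
}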
Amelia has a built-in function for using Rubin's rules named mi.meld() that accepts two m-by-k matrices (one for coefficients and one for standard errors), shaped like so:

       coef1  coef2  coefn
imp1       x      x      x
imp2       x      x      x
impn       x      x      x
We can use some dplyr/tidyr magic to wrangle the regression results into this form:
# Create a wide data frame of just the coefficients and standard errors
params <- models_imputations %>%
unnest(tidied) %>%
select(m, term, estimate, std.error) %>%
gather(key, value, estimate, std.error) %>%
spread(term, value)
params
## # A tibble: 10 x 5
## m key `(Intercept)` civlib trade
## * <chr> <chr> <dbl> <dbl> <dbl>
## 1 imp1 estimate 114 -631 18.1
## 2 imp1 std.error 97.7 182 1.25
## 3 imp2 estimate 111 -621 18.1
## 4 imp2 std.error 97.5 182 1.24
## 5 imp3 estimate 122 -645 18.1
## 6 imp3 std.error 95.7 180 1.22
## 7 imp4 estimate 107 -633 18.2
## 8 imp4 std.error 98.3 183 1.25
## 9 imp5 estimate 104 -610 18.2
## 10 imp5 std.error 98.0 183 1.26
# Extract just the coefficients
just_coefs <- params %>%
filter(key == "estimate") %>%
select(-m, -key)
just_coefs
## # A tibble: 5 x 3
## `(Intercept)` civlib trade
## <dbl> <dbl> <dbl>
## 1 114 -631 18.1
## 2 111 -621 18.1
## 3 122 -645 18.1
## 4 107 -633 18.2
## 5 104 -610 18.2
# Extract just the standard errors
just_ses <- params %>%
filter(key == "std.error") %>%
select(-m, -key)
just_ses
## # A tibble: 5 x 3
## `(Intercept)` civlib trade
## <dbl> <dbl> <dbl>
## 1 97.7 182 1.25
## 2 97.5 182 1.24
## 3 95.7 180 1.22
## 4 98.3 183 1.25
## 5 98.0 183 1.26
We can then use these matrices in mi.meld(), which returns a list with two slots, q.mi and se.mi:
coefs_melded <- mi.meld(just_coefs, just_ses)
coefs_melded
## $q.mi
## (Intercept) civlib trade
## [1,] 111.7558 -628.0611 18.13329
##
## $se.mi
## (Intercept) civlib trade
## [1,] 97.73866 182.5683 1.247038
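As a quick sanity check, the hand-rolled rubin_rules() sketch from earlier should return these same numbers:

# Should match mi.meld()'s q.mi and se.mi (hand-rolled sketch from above)
rubin_rules(as.matrix(just_coefs), as.matrix(just_ses))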
Armed with these, we can create our regression summary table with some more dplyr wizardry. To calculate the p-values and confidence intervals, we need to extract the degrees of freedom from one of the imputed models:
model_degree_freedom <- models_imputations %>%
unnest(glance) %>%
filter(m == "imp1") %>%
pull(df.residual)
melded_summary <- as.data.frame(cbind(t(coefs_melded$q.mi),
t(coefs_melded$se.mi))) %>%
magrittr::set_colnames(c("estimate", "std.error")) %>%
mutate(term = rownames(.)) %>%
select(term, everything()) %>%
mutate(statistic = estimate / std.error,
conf.low = estimate + std.error * qt(0.025, model_degree_freedom),
conf.high = estimate + std.error * qt(0.975, model_degree_freedom),
p.value = 2 * pt(abs(statistic), model_degree_freedom, lower.tail = FALSE))
melded_summary
## term estimate std.error statistic conf.low conf.high p.value
## 1 (Intercept) 111.75580 97.738661 1.143414 -81.8105 305.32210 2.552007e-01
## 2 civlib -628.06108 182.568297 -3.440143 -989.6280 -266.49413 8.066410e-04
## 3 trade 18.13329 1.247038 14.541095 15.6636 20.60299 5.451100e-28
Hooray! Correctly melded coefficients and standard errors!
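One caveat: I reused the residual degrees of freedom from a single imputed model above, which is a common shortcut. Rubin (1987) also derives an imputation-adjusted degrees of freedom that grows larger as the imputations agree with each other. If you want to be extra careful, it's quick to compute by hand (again, my own arithmetic based on Rubin's large-sample formula, not an Amelia function):

# Rubin's large-sample adjusted df: (m - 1) * (1 + within / ((1 + 1/m) * between))^2
m <- 5
within <- colMeans(just_ses^2)  # Average within-imputation variance
between <- sapply(just_coefs, var)  # Between-imputation variance
df_rubin <- (m - 1) * (1 + within / ((1 + 1 / m) * between))^2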
But what do we do about the other model details, like $R^2$ and the F-statistic? How do we report those?
According to a post on the Amelia mailing list, there are two ways. First, we can use a fancy method for combining $R^2$ and adjusted $R^2$ described by Ofer Harel (2009). Second, we can just take the average of the $R^2$s from all the imputed models. The results should be roughly the same.
Harel’s method involves two steps:
- In each complete data set, calculate the $R^2$, take its square root ($R$), transform $R$ with a Fisher z-transformation ($Q = \frac{1}{2} \log_e \left( \frac{1 + R}{1 - R} \right)$), and calculate the variance of $R^2$ (which is $\frac{1}{\text{degrees of freedom}}$)
- Meld the resulting $Q$s and variances using Rubin's rules (mi.meld(); this creates $Q_a$), undo the z-transformation ($R_a = \frac{\exp(2 Q_a) - 1}{1 + \exp(2 Q_a)}$), and square the result ($R_a^2$)
That looks complicated, but it's fairly easy with some dplyr magic. Here's how to do it for adjusted $R^2$ (the same process works for regular $R^2$ too):
# Step 1: in each complete data set, calculate R2, take its square root,
# transform it with Fisher z-transformation, and calculate the variance of R2
r2s <- models_imputations %>%
unnest(glance) %>%
select(m, adj.r.squared, df.residual) %>%
mutate(R = sqrt(adj.r.squared), # Regular R
Q = 0.5 * log((R + 1) / (1 - R)), # Fisher z-transformation
se = 1 / df.residual) # R2 variance
r2s
## # A tibble: 5 x 6
## m adj.r.squared df.residual R Q se
## <chr> <dbl> <int> <dbl> <dbl> <dbl>
## 1 imp1 0.643 117 0.802 1.10 0.00855
## 2 imp2 0.648 117 0.805 1.11 0.00855
## 3 imp3 0.656 117 0.810 1.13 0.00855
## 4 imp4 0.647 117 0.804 1.11 0.00855
## 5 imp5 0.644 117 0.803 1.11 0.00855
# Step 2: combine the results using Rubin's rules (mi.meld()), inverse transform
# the value, and square it
# Meld the R2 values with mi.meld()
Q_melded <- mi.meld(as.matrix(r2s$Q), as.matrix(r2s$se))
# Inverse transform Q to R and square it
r2_melded <- ((exp(2 * Q_melded$q.mi) - 1) / (1 + exp(2 * Q_melded$q.mi)))^2
r2_melded
## [,1]
## [1,] 0.6476917
The correctly pooled/melded $R^2$ is thus 0.6477. Neat.
How does this compare to just the average of all the $R^2$s from all the imputations?
r2s_avg <- models_imputations %>%
unnest(glance) %>%
summarize(adj.r.squared_avg = mean(adj.r.squared)) %>%
pull(adj.r.squared_avg)
r2s_avg
## [1] 0.6476689
The incorrectly averaged $R^2$ is 0.6477, which is basically identical to the correctly melded 0.6477. This is probably because the models from the five imputed datasets are already fairly similar; there might be more variance in $R^2$ with data that's less neat. But for this situation, the two approaches are essentially the same. Other model diagnostics like the F-statistic can probably be pooled just with averages as well. I haven't found any specific algorithms for melding them with fancy math.
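For instance, averaging the F-statistic across imputations takes one line, assuming a simple average is good enough here (glance() stores the F-statistic for lm models in its statistic column):

# Average the F-statistic across the five imputed models
models_imputations %>%
  unnest(glance) %>%
  summarize(f_stat_avg = mean(statistic))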
So, in summary: combine the coefficients and standard errors from multiply imputed models with mi.meld(), and combine other model parameters like $R^2$ either with Harel's fancy method or by simply averaging them.