Internal variable selection procedure in mice and the loggedEvents file #627

alexpate30 · 2024-03-12T11:25:13Z

Hello, I have a few questions around how mice selects variables for each imputation model, and the loggedEvents file. I will set the scene and then go into some detail of what I’ve encountered.

I am planning a multiple imputation. We have millions of observations and 6 variables with missing values to impute, with missingness between 10% and 50%. We have approximately 20 predictor variables, each of which will be interacted with a non-linear transformation of a key continuous variable, giving approximately 80-90 dummy predictors in total. I appreciate some manual variable selection may need to be done first, but for now I am trying to figure out what mice is doing under the hood with respect to variable selection in the imputation models.

When running a test imputation (with ~500,000 observations), I find that a large number of predictors (approximately half) are removed from the imputation model for each incomplete variable, which I’m identifying through the loggedEvents output.

According to section 9.1.5 (https://stefvanbuuren.name/fimd/sec-toomany.html), the entries in loggedEvents can signal the following three actions:

Predictor is constant or correlates higher than 0.999 with target
All predictors are removed
Degrees of freedom has become negative

I’m fairly certain that the entries in loggedEvents are not caused by the above three things (although I could be wrong). The predictors do not correlate that strongly with the target, and the degrees of freedom will not be negative given the high number of observations.

I therefore took a look at the source code on GitHub (specifically here: https://github.com/amices/mice/blob/master/R/internal.R and here: https://github.com/amices/mice/blob/master/R/sampler.R), and think it’s probably to do with the “remove.lindep” function. This is applied within the “sampler” function and adds entries to loggedEvents when it is applied. As far as I can tell, this function does two things.

Remove predictors that are highly correlated with the outcome, which is in line with the text in the book above:

  keep <- unlist(apply(xobs, 2, var) > eps)
  keep[is.na(keep)] <- FALSE
  highcor <- suppressWarnings(unlist(apply(xobs, 2, cor, yobs) < maxcor))
  keep <- keep & highcor
  if (all(!keep)) {
    updateLog(
      out = "All predictors are constant or have too high correlation.",
      frame = frame
    )
  }
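As a rough illustration of this first check, the sketch below applies the same two filters (variance above `eps`, correlation with the target below `maxcor`) to toy data in base R. The data, the column names and the threshold values `eps = 1e-04` and `maxcor = 0.99` are assumptions chosen for illustration; this is not the mice implementation itself.

```r
# Toy data: names and thresholds are illustrative assumptions
set.seed(1)
n <- 200
yobs <- rnorm(n)
xobs <- cbind(
  noise    = rnorm(n),       # unrelated to the target
  constant = rep(1, n),      # zero variance
  copy_y   = 2 * yobs + 3    # perfectly correlated with the target
)
eps <- 1e-04
maxcor <- 0.99

# Filter 1: drop (near-)constant predictors
keep <- apply(xobs, 2, var) > eps
keep[is.na(keep)] <- FALSE

# Filter 2: drop predictors that correlate too highly with the target.
# Despite its name, highcor is TRUE when the correlation is BELOW maxcor.
# cor() returns NA (with a warning) for the constant column.
highcor <- suppressWarnings(apply(xobs, 2, cor, yobs) < maxcor)
keep <- keep & highcor

print(keep)
```

Only the `noise` column survives both filters here: `constant` fails the variance check and `copy_y` fails the correlation check.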

Remove predictors that have a linear dependence with each other:

  # correlation between x's
  cx <- cor(xobs[, keep, drop = FALSE], use = "all.obs")
  eig <- eigen(cx, symmetric = TRUE)
  ncx <- cx
  while (eig$values[k] / eig$values[1] < eps) {
    j <- seq_len(k)[order(abs(eig$vectors[, k]), decreasing = TRUE)[1]]
    keep[keep][j] <- FALSE
    ncx <- cx[keep[keep], keep[keep], drop = FALSE]
    k <- k - 1
    eig <- eigen(ncx)
  }
  if (!all(keep)) {
    out <- paste(dimnames(x)[[2]][!keep], collapse = ", ")
    updateLog(out = out, frame = frame)
  }

I believe this removes predictors that are linearly dependent on other predictors, by looking at the smallest eigenvalues of the correlation matrix between the predictors (i.e. something along the lines of principal component analysis?), although I’m now getting out of my depth. I can see the motivation for taking such an approach, but would like to be sure that this is what is happening. The 1999 paper (https://pubmed.ncbi.nlm.nih.gov/10204197/), which outlines steps for selecting variables, also seems to focus on correlation with the target variable rather than with other predictors?
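To make the eigenvalue idea concrete, here is a minimal base R sketch on toy data where one predictor is an exact linear combination of two others. It is a simplified variant written for illustration, not a copy of the mice source: it recomputes the correlation matrix on the retained columns at every step, and the data, column names and `eps = 1e-04` are assumptions.

```r
# Toy data: x3 is an exact linear combination of x1 and x2
set.seed(2)
n <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
xobs <- cbind(x1 = x1, x2 = x2, x3 = x1 + x2)
eps <- 1e-04  # illustrative threshold

keep <- rep(TRUE, ncol(xobs))
names(keep) <- colnames(xobs)

repeat {
  k <- sum(keep)
  if (k <= 1) break
  # Correlation matrix of the predictors that are still retained
  cx <- cor(xobs[, keep, drop = FALSE])
  eig <- eigen(cx, symmetric = TRUE)
  # A (near-)zero smallest eigenvalue signals a (near-)linear dependence
  if (eig$values[k] / eig$values[1] >= eps) break
  # Drop the predictor with the largest absolute loading on that eigenvector
  j <- which.max(abs(eig$vectors[, k]))
  keep[which(keep)[j]] <- FALSE
}

print(names(keep)[!keep])  # predictors flagged as linearly dependent
```

With this toy data the loop should flag `x3` and then stop, because the correlation matrix of `x1` and `x2` alone is well conditioned. The predictors need not be strongly correlated with the target for this to trigger; a dependence among the predictors themselves is enough.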

I was wondering if someone who is a little more familiar with the functionality of mice could help me with a few things:
A) Is my interpretation above close, or am I completely off the mark?
B) Is the implementation of the remove.lindep function documented anywhere, for example in a section of Flexible Imputation of Missing Data that I have missed?
C) Are there any other potential reasons for variables being removed and flagged in the loggedEvents file that could be driving this?

Any help would be very much appreciated. I can try to produce a reproducible example if that would help, but have left it for now as I think the discussion is mostly theoretical.

Many thanks,
Alex

stefvanbuuren · 2024-03-12T14:52:20Z

Maintainer

The predictorMatrix is the main mechanism for selecting variables in mice. mice by itself does no variable selection, apart from removing (almost) duplicate variables for technical reasons. You can switch off this behavior. See #225, #314 and related issues.
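A minimal sketch of what this could look like in practice, assuming the mice package is installed and using its bundled `nhanes` example data. The choice to exclude `age` when imputing `bmi` is purely illustrative. The `eps = 0` argument is an assumption about how the check can be bypassed (passed down through `...`); please verify it against the referenced issues and your installed version.

```r
# Sketch only: runs when the mice package is installed
if (requireNamespace("mice", quietly = TRUE)) {
  library(mice)

  # Start from the default predictor matrix (rows = targets, columns = predictors)
  pred <- make.predictorMatrix(nhanes)
  pred["bmi", "age"] <- 0  # illustrative choice: age is not used to impute bmi

  imp <- mice(nhanes, predictorMatrix = pred, m = 2, maxit = 2,
              seed = 1, printFlag = FALSE)
  print(imp$loggedEvents)  # NULL when nothing was removed or flagged

  # Assumption: eps = 0 is passed down via `...` and skips the
  # linear-dependence check; confirm before relying on it
  imp_nocheck <- mice(nhanes, predictorMatrix = pred, eps = 0,
                      m = 2, maxit = 2, seed = 1, printFlag = FALSE)
}
```

Switching the check off means collinear predictors reach the imputation model unchanged, so it is probably safest to do this only after cleaning up the predictor set by hand.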

2 replies

alexpate30 Mar 12, 2024
Author

Thank you so much for the response; "variable selection" was a poor choice of words on my part. I understand your response to mean that the second part of remove.lindep (the 2nd code block above) is the process of removing near-duplicate variables, and that this is done through the eigenvalues of the correlation matrix?

stefvanbuuren Mar 14, 2024
Maintainer

Yep. It's not very elegant, and it slows down mice when there are many duplicates, but it has prevented lots of crashes.
