Internal variable selection procedure in mice and the loggedEvents file #627
alexpate30 asked this question in Q&A (Unanswered)
Hello, I have a few questions around how mice selects variables for each imputation model, and the loggedEvents file. I will set the scene and then go into some detail of what I’ve encountered.
I am planning a multiple imputation. We have millions of observations and 6 variables with missing values to impute. The levels of missingness are between 10% and 50%. We have approximately 20 predictor variables, each of which will be interacted with a non-linear transformation of a key continuous variable, which will amount to approximately 80-90 dummy predictors in total. I appreciate some manual variable selection may need to be done first, but for now I am trying to figure out what mice is doing under the hood with respect to variable selection in the imputation models.
When running a test imputation procedure (with ~500,000 observations), I find that there are a large number of variables removed from the imputation model for each of the missing variables (approximately half), which I’m identifying through the loggedEvents output.
According to section 9.1.5 (https://stefvanbuuren.name/fimd/sec-toomany.html), the entries in loggedEvents can signal the following three actions:
I’m fairly certain that the entries in loggedEvents are not caused by the above three things (although I could be wrong). The predictors do not correlate that strongly with the target, and the degrees of freedom will not be negative given the high number of observations.
I therefore took a look at the source code on GitHub (specifically here: https://github.com/amices/mice/blob/master/R/internal.R and here: https://github.com/amices/mice/blob/master/R/sampler.R), and think it’s probably to do with the “remove.lindep” function. This is applied within the “sampler” function and adds entries to loggedEvents when it is applied. As far as I can tell, this function does two things.
I believe this is removing predictors that are linearly dependent on other predictors, by way of removing the variables with the smallest contribution to the eigenvalues (i.e. something along the lines of principal component analysis?), although I’m now getting out of my depth. I can see the motivation for taking such an approach, but would just like to be sure this is what is happening. The 1999 paper (https://pubmed.ncbi.nlm.nih.gov/10204197/), which outlines steps for selecting variables, also seems to focus on correlation with the target variable, as opposed to correlation with other predictors?
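To make concrete the kind of eigenvalue-based check I have in mind, here is a minimal sketch in Python/NumPy. It is not the mice source code and may well differ from what remove.lindep actually does; the function name `drop_linearly_dependent`, the threshold `eps` and its default value are my own hypothetical choices. The idea is to look at the correlation matrix of the predictors and, while the ratio of its smallest to its largest eigenvalue is below `eps`, drop the predictor that loads most heavily on the eigenvector of the smallest eigenvalue.

```python
import numpy as np


def drop_linearly_dependent(X, eps=1e-4):
    """Return a boolean mask of the columns of X to keep.

    Illustrative assumption, not the mice implementation: near-constant
    columns are dropped first, then columns are dropped one at a time
    until the predictor correlation matrix is no longer near-singular.
    """
    X = np.asarray(X, dtype=float)
    # Columns with (almost) no variance cannot be standardised; drop them.
    keep = X.var(axis=0, ddof=1) > eps
    while keep.sum() > 1:
        idx = np.flatnonzero(keep)
        corr = np.corrcoef(X[:, idx], rowvar=False)
        # eigh returns eigenvalues in ascending order for symmetric matrices.
        values, vectors = np.linalg.eigh(corr)
        if values[0] / values[-1] >= eps:
            break  # no remaining near-linear dependence
        # Drop the column with the largest absolute loading on the
        # eigenvector belonging to the smallest eigenvalue.
        worst = idx[np.argmax(np.abs(vectors[:, 0]))]
        keep[worst] = False
    return keep
```

If something like this is what happens, then a predictor that is an exact linear combination of others (for example a redundant dummy, or an interaction term that duplicates a main effect) would be flagged regardless of how it correlates with the target, which would fit what I am seeing.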
I was wondering if someone who is a little more familiar with the functionality of mice could help me with a few things:
A) Is my interpretation above close, or am I completely off the mark?
B) Is the implementation of the remove.lindep function documented anywhere, for example in a section of Flexible Imputation of Missing Data that I have missed?
C) Are there any other potential reasons for variables being removed and flagged in the loggedEvents file that could be driving this?
Any help would be much much appreciated. I can try and produce a reproducible example if this would help the situation, but have left it for now as I think the discussion is mostly theoretical.
Many thanks,
Alex