Imputing a subset but keeping all rows #333

AmeBol · 2021-03-15T12:48:36Z

I am quite new to using mice() and am having trouble understanding where I go wrong. I would be very grateful for some advice! In short, my question is why my imputation results differ depending on whether the ignored rows contain NAs or values.

I have a longitudinal dataset that contains multiple questionnaire variables at different time points. Since not all participants attend every data collection time point, I have two types of missing values for the items: (1) individual missing values at random (e.g., a participant forgot to mark an answer; these NAs need to be imputed) and (2) missing values for people who were not measured at that time point, which yield entire rows of NAs (these should not be imputed and should stay NA; later called all_NA).

I used the ignore argument of mice() to exclude those rows from the imputation model via a logical vector. I then tried two variations of the calculation and compared them. I assumed they would give the same result, but that the second procedure would be quicker. Unfortunately, the two approaches give different results, which I do not understand:

1. I exclude all_NA rows from being used in the imputation calculation via ignore. They still get imputed (which is not needed, but I assume this does not influence the imputation since they are ignored for the calculation). Therefore, I later replace those imputed all_NA rows with NAs again. Because the all_NA rows still get imputed, this process takes a long time.
2. To speed up the process of imputation, I insert zeros in the all_NA rows so that those rows do not get imputed. Since they are to be ignored, it should not make a difference for the results. Then I use mice() and ignore as before. Later I insert NAs in the all_NA rows, replacing the inserted zeros.

-> The results of approaches 1 and 2 differ!
I did another test: To test whether ignore works, I inserted "-99" into all rows that should be ignored. The values in my dataset usually vary between 0 and 4, so -99 will have a great influence. With this test run, one of the imputed values came back as "-99".

Why are the imputed individual values not the same even though the training mids are the same? Will the ignored rows only be ignored for the first iteration? How can I solve my issue and which is the correct way to do it? And more generally: Can I trust the ignore function?

I hope this is clear and that somebody can help me out! I read #32 but it did not help in explaining why ignore seems to not do the job. Thank you :)

prockenschaub · 2021-03-22T08:32:42Z

The issue of different results might be related to #313. In short, although the ignored rows are not used in creating the imputation model, they are still imputed. This means that an imputation run that uses the ignore parameter and an imputation run that simply excludes the ignore-rows from the dataset altogether will have different random seeds and thus lead to different imputed values. Only the exact imputed values will differ, though, and the overall performance of the model should stay exactly the same (you can think of it as just running one imputation cycle more after the models have converged, which will give you different draws from the imputation models but the overall conclusion remains the same).
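
A minimal sketch of this point, using the nhanes data that ships with mice rather than the OP's data (the alternating ignore vector is an assumption made purely for illustration): rows flagged in ignore still show up among the stored imputations, so they consume random draws as well.

```r
library(mice)

data <- nhanes

# Illustrative assumption: flag every odd row as ignored
ignore <- as.logical(seq_len(nrow(data)) %% 2)

imp <- mice(data, ignore = ignore, m = 2, maxit = 2, seed = 1,
            printFlag = FALSE)

# Row names of all cells that received imputations for bmi
imputed_rows <- rownames(imp$imp$bmi)

# Rows that are both ignored and missing on bmi
missing_and_ignored <- rownames(data)[ignore & is.na(data$bmi)]

# TRUE if the ignored rows were imputed as well
all(missing_and_ignored %in% imputed_rows)
```

The ignored rows do not contribute to fitting the model, but because they are imputed, the random number stream differs from a run on the filtered data.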

The problem of a "leak" of -99 into the imputed values is more concerning. Could you please check that each row of your data with a -99 in it is set to TRUE in the ignore parameter? If that's the case, could you by any chance provide a reproducible example?

I had a quick test run with simulated data in which I wasn't able to reproduce this error:

library(mice)
#> 
#> Attaching package: 'mice'
#> The following object is masked from 'package:stats':
#> 
#>     filter
#> The following objects are masked from 'package:base':
#> 
#>     cbind, rbind
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

set.seed(42)
n <- 1000L

# Generate fully observed data
x <- rnorm(n)
z <- sample(0:4, n, replace = TRUE)
y <- rnorm(n, x + z / 5, sd = 0.5)

# Simulate missingness completely at random
idx <- sample(1:3, n, replace = TRUE, prob = c(0.25, 0.25, 0.5))
z_miss <- case_when(
    idx == 1 ~ z,             # observed
    idx == 2 ~ NA_integer_,   # missing but should be imputed
    idx == 3 ~ -99L            # missing and should be ignored
  )

# Impute 
data <- data.frame(x, z_miss, y)
ignore <- idx == 3

imp_pmm <- mice(data, ignore = ignore, print = FALSE)
imp_lr <- mice(
    data %>% mutate(z_miss = factor(z_miss, c(0:4, -99))), 
    ignore = ignore, print = FALSE
  )

# Check whether -99 appeared in any of the imputations
table(complete(imp_pmm)[!ignore, "z_miss"])
#> 
#>   0   1   2   3   4 
#>  75  84  98 115 109
table(complete(imp_lr)[!ignore, "z_miss"])
#> 
#>   0   1   2   3   4 -99 
#>  96  86 100 106  93   0

Created on 2021-03-22 by the reprex package (v0.3.0)

You might have come across a special case, however, in which there might indeed be a leak. If that's the case, I'd be super grateful if you could provide an example for me to work through.

stefvanbuuren · 2021-03-22T10:11:56Z

Maintainer

It wasn't obvious to me what problem the OP is trying to solve.

The arguments of mice() have different roles. If you want to

1. Exclude a subset from imputation, then filter the data before calling mice()
2. Fit the imputation model on a subset, and apply it to impute missing values in the complement: ignore
3. Skip imputation of missing data or over-impute observed data: where

These actions have different behaviours. 1 and 2 differ because 2 will impute the removed/ignored rows. 1 and 3 differ because 3 will fit the imputation model on the observed data in all rows.
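
The three options can be compared side by side. The following is only a sketch on the nhanes data; the odd/even split and the object names (imp_filter, imp_ignore, imp_where) are assumptions chosen for illustration.

```r
library(mice)

data <- nhanes
odd <- as.logical(seq_len(nrow(data)) %% 2)

# 1. Filter: odd rows are dropped before imputation and are absent afterwards
imp_filter <- mice(data[!odd, ], m = 1, maxit = 2, seed = 1,
                   printFlag = FALSE)

# 2. ignore: the model is fitted on even rows only, but odd rows are imputed too
imp_ignore <- mice(data, ignore = odd, m = 1, maxit = 2, seed = 1,
                   printFlag = FALSE)

# 3. where: odd rows are not imputed, but their observed values
#    still enter the imputation model
where <- make.where(data)
where[odd, ] <- FALSE
imp_where <- mice(data, where = where, m = 1, maxit = 2, seed = 1,
                  printFlag = FALSE)

nrow(complete(imp_filter))               # number of rows kept by the filter
sum(is.na(complete(imp_ignore)[odd, ]))  # NAs left in the ignored rows
sum(is.na(complete(imp_where)[odd, ]))   # NAs left in the skipped rows
sum(is.na(data[odd, ]))                  # NAs in the odd rows of the raw data
```

Comparing the last two counts shows whether where left the odd rows as they were in the raw data.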

AmeBol · 2021-03-27T11:23:26Z

Author

Thank you very much for the reply!
The explanation that random numbers are still drawn for the ignored rows is very helpful and eased my confusion!

Concerning @stefvanbuuren: I agree that my reasons might not seem obvious, and I bet there are better ways to do it. I am sorry for my inexperience. The script I am using was written by somebody else, so I did not want to make extensive changes. In this script the imputation is performed on a subset of variables (but with the same nrow as the full dataframe). Excluding or deleting the all_NA rows seemed complicated to me, because the nrow of my subset would then no longer match the nrow of my original dataframe. It would therefore be difficult for me (but maybe not for somebody else) to match the imputed subset back into the original dataframe. Simply ignoring them using ignore seemed like a convenient solution.
And because I wanted to avoid the imputation of the ignored rows (cf. 2. in your comment), I decided to temporarily insert values into those rows. Not the most elegant way, but it should do the job.

@prockenschaub: I tried to replicate the -99 issue by constructing a dataframe similar to mine. In doing so I noticed that it probably has something to do with logged events. The dataframe that I imputed has a lot of zeros (and also all-zero rows). Since my dataset is a questionnaire assessing symptoms of one psychological disorder, collinearity definitely is an issue. Here is my example:

library(mice)
library(dplyr)

set.seed(42)
n <- 100L

# Generate basic dataset
x <- sample(0:4, n, replace = TRUE)
y <- sample(0:4, n, replace = TRUE)
z <- sample(0:4, n, replace = TRUE)
k <- sample(0:4, n, replace = TRUE)
l <- sample(0:4, n, replace = TRUE)
t <- sample(0:4, n, replace = TRUE)
s <- sample(0:2, n, replace = TRUE)
u <- sample(0:2, n, replace = TRUE)
daten <- data.frame(x, z, y, k, l, t, s, u)

# My original dataset consists of: 
#    many rows with missings, that should be kept
#    many rows with zeros
#    only very, very few missings

daten[1:20, ] <- NA # generate rows that are missing 
daten[21:40, ] <- 0 # generate rows that are all zero
daten[22 , 2] <- NA # NA in an all-zero row
daten[72 , 4] <- NA


md.pattern(daten, rotate.names = TRUE) # no individual missings in x
ignore <- is.na(daten[, "x"]) #all rows that have missings in x will be ignored


daten_99 <- daten
daten_99[ignore, ] <- -99 
for (i in 1:ncol(daten_99)) {  daten_99[,i] <- as.factor(daten_99[,i])} # My original dataset had been a .sav (SPSS). Might that be a problem?
daten_99 <- mice(daten_99 , m=5, maxit=5, meth="polr", seed=500, ignore=ignore)
#Warning message:
#Number of logged events: 50 
daten_99$imp$z # one imputed number came back as -99
daten_99$imp$k

#completing dataset with recreating all_NA rows:
daten_99 <- mice::complete(daten_99)
daten_99[ignore, ] <- NA #replace the inserted -99 with NA again

stefvanbuuren · 2021-03-27T12:34:27Z

Maintainer

Ah, I now understand what you're trying to do. The following script combines ignore and where:

# impute a subset
# do not touch or impute non-selected rows
# return full data as mids object

# example: impute even rows only
library(mice)
data <- nhanes
odd <- as.logical((1:nrow(data)) %%2)
where <- make.where(data)
where[odd, ] <- FALSE
imp <- mice(data, ignore = odd, where = where, seed = 1, m = 2, print = FALSE)

The even rows are imputed using only the even rows. The odd rows are left untouched.

> complete(imp, 2)
   age  bmi hyp chl
1    1   NA  NA  NA
2    2 22.7   1 187
3    1   NA   1 187
4    3 25.5   1 184
5    1 20.4   1 113
6    3 25.5   2 184
7    1 22.5   1 118
8    1 30.1   1 187
9    2 22.0   1 238
10   2 26.3   1 187
11   1   NA  NA  NA
12   2 26.3   1 187
13   3 21.7   1 206
14   2 28.7   2 204
15   1 29.6   1  NA
16   1 33.2   2 187
17   3 27.2   2 284
18   2 26.3   2 199
19   1 35.3   1 218
20   3 25.5   2 184
21   1   NA  NA  NA
22   1 33.2   1 229
23   1 27.5   1 131
24   3 24.9   1 187
25   2 27.4   1 186
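
The same combination can be applied to data shaped like the OP's, where whole rows are NA. This is a sketch on simulated data, not the original questionnaire; the column names and the all_na vector are assumptions for illustration.

```r
library(mice)

set.seed(42)
n <- 100L
daten <- data.frame(
  x = sample(0:4, n, replace = TRUE),
  y = sample(0:4, n, replace = TRUE),
  z = sample(0:4, n, replace = TRUE)
)
daten[1:20, ] <- NA   # participants not measured at this time point
daten[30, "y"] <- NA  # item-level missing values that should be imputed
daten[72, "z"] <- NA

# Rows in which every item is missing
all_na <- rowSums(!is.na(daten)) == 0

# Do not fit on the all-NA rows (ignore) and do not impute them (where)
where <- make.where(daten)
where[all_na, ] <- FALSE

imp <- mice(daten, ignore = all_na, where = where, m = 2, seed = 1,
            printFlag = FALSE)
completed <- complete(imp, 1)

all(is.na(completed[all_na, ]))  # do the all-NA rows stay NA?
anyNA(completed[!all_na, ])      # are any item-level NAs left?
```

Because the all-NA rows are never imputed, no placeholder values need to be inserted beforehand or replaced afterwards, and the completed data keep the same number of rows as the original.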

stefvanbuuren · 2021-03-27T13:11:50Z

Maintainer

Here's another way to achieve the same thing, this time combining a pre-filter and the special rbind for mids.

# example: impute even rows only, method 2
library(mice)
data <- nhanes
odd <- as.logical((1:nrow(data)) %%2)
imp <- mice(data[!odd, ], seed = 1, m = 2, print = FALSE)
imp <- rbind(imp, data[odd, ])

> complete(imp, 2)
   age  bmi hyp chl
2    2 22.7   1 187
4    3 25.5   1 184
6    3 25.5   2 184
8    1 30.1   1 187
10   2 26.3   1 187
12   2 26.3   1 187
14   2 28.7   2 204
16   1 33.2   2 187
18   2 26.3   2 199
20   3 25.5   2 184
22   1 33.2   1 229
24   3 24.9   1 187
1    1   NA  NA  NA
3    1   NA   1 187
5    1 20.4   1 113
7    1 22.5   1 118
9    2 22.0   1 238
11   1   NA  NA  NA
13   3 21.7   1 206
15   1 29.6   1  NA
17   3 27.2   2 284
19   1 35.3   1 218
21   1   NA  NA  NA
23   1 27.5   1 131
25   2 27.4   1 186

Depending on your needs, you may wish to re-sort the data. In mice 3.13.3, do:

idx <- order(as.numeric(rownames(imp$data)))
imp$data <- imp$data[idx, ]
imp$where <- imp$where[idx, ]
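
If only a completed data frame is needed, rather than a re-sorted mids object, the rows can presumably also be re-ordered after calling complete(). A sketch, assuming the row names are the original numeric row indices as in nhanes:

```r
library(mice)

data <- nhanes
odd <- as.logical(seq_len(nrow(data)) %% 2)

# Method 2 from above: impute the even rows, then append the odd rows
imp <- mice(data[!odd, ], seed = 1, m = 2, print = FALSE)
imp <- rbind(imp, data[odd, ])

# Re-order the completed data by the original row names
completed <- complete(imp, 2)
completed <- completed[order(as.numeric(rownames(completed))), ]

head(rownames(completed))
```

This changes only the completed copy; the mids object itself keeps the appended order.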

prockenschaub · 2021-03-27T13:25:02Z

@AmeBol With regards to the value -99 being imputed in your example, this appears to be because although the value does not appear in any of the rows used for imputation, it is a valid level of the factor variables themselves. For example, just before imputation:

levels(daten_99$z)
[1] "-99" "0"   "1"   "2"   "3"   "4"

You then use a proportional odds logistic regression (polr) to impute that variable, which treats all 6 of those levels (including -99) as potential outcomes. Depending on the associations estimated from the other rows, -99 can therefore be a valid prediction for a row. In your example, the -99 is imputed for a row of all 0 (row 22). Since -99 in this example is the level just below 0, it is probably not uncommon to see that value imputed using polr.
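
The level bookkeeping can be seen in base R alone. A small sketch with made-up values (the vector z below is illustrative, not the OP's data):

```r
# A placeholder value becomes a factor level like any other value
z <- as.factor(c(-99, -99, 0, 1, 2, 3, 4))
ignore <- z == "-99"

levels(z)                       # the placeholder is listed as a level

# Subsetting away the placeholder rows does not remove the level ...
levels(z[!ignore])

# ... unless unused levels are dropped explicitly
levels(droplevels(z[!ignore]))
```

Leaving the all_NA rows as NA and skipping them via where, as in the scripts above, avoids introducing a placeholder level in the first place.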

stefvanbuuren · 2021-03-27T14:23:24Z

Maintainer

Thanks @AmeBol and @prockenschaub.

Closing because I think everything is resolved now.
