Replies: 7 comments
-
The issue of different results might be related to #313. In short, although the ignored rows are not used in creating the imputation model, they are still imputed. This means that an imputation run that uses the
The problem of a "leak" of
I had a quick test run with simulated data in which I wasn't able to reproduce this error:
library(mice)
#>
#> Attaching package: 'mice'
#> The following object is masked from 'package:stats':
#>
#> filter
#> The following objects are masked from 'package:base':
#>
#> cbind, rbind
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
set.seed(42)
n <- 1000L
# Generate fully observed data
x <- rnorm(n)
z <- sample(0:4, n, replace = TRUE)
y <- rnorm(n, x + z / 5, sd = 0.5)
# Simulate missingness completely at random
idx <- sample(1:3, n, replace = TRUE, prob = c(0.25, 0.25, 0.5))
z_miss <- case_when(
idx == 1 ~ z, # observed
idx == 2 ~ NA_integer_, # missing but should be imputed
idx == 3 ~ -99L # missing and should be ignored
)
# Impute
data <- data.frame(x, z_miss, y)
ignore <- idx == 3
imp_pmm <- mice(data, ignore = ignore, print = FALSE)
imp_lr <- mice(
data %>% mutate(z_miss = factor(z_miss, c(0:4, -99))),
ignore = ignore, print = FALSE
)
# Check whether -99 appeared in any of the imputations
table(complete(imp_pmm)[!ignore, "z_miss"])
#>
#> 0 1 2 3 4
#> 75 84 98 115 109
table(complete(imp_lr)[!ignore, "z_miss"])
#>
#> 0 1 2 3 4 -99
#> 96 86 100 106 93 0
Created on 2021-03-22 by the reprex package (v0.3.0)
You might have come across a special case, however, in which there might indeed be a leak. If that's the case, I'd be super grateful if you could provide an example for me to work through.
-
It wasn't obvious to me what problem the OP tries to solve. The arguments of
These actions have different behaviours. 1 and 2 differ because 2 will impute the removed/ignored rows. 1 and 3 differ because 3 will fit the imputation model on the observed data in all rows.
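The differences described above can be sketched with the `nhanes` data that ships with mice. This is a minimal sketch under an assumption: the numbered list of actions did not survive in this thread, so the code takes action 1 to be removing rows before imputation, action 2 to be the `ignore` argument, and action 3 to be the `where` argument. The `skip` vector is a hypothetical row selection chosen only for illustration.

```r
library(mice)

data <- nhanes
# Hypothetical selection: treat the odd rows as the ones to leave out
skip <- rep(c(TRUE, FALSE), length.out = nrow(data))

# Assumed action 1: remove the rows before imputation.
# The model is fitted on the kept rows and only those rows are returned.
imp_removed <- mice(data[!skip, ], m = 2, seed = 1, print = FALSE)

# Assumed action 2: ignore the rows.
# The model is fitted on the kept rows, but the ignored rows are still imputed.
imp_ignore <- mice(data, ignore = skip, m = 2, seed = 1, print = FALSE)

# Assumed action 3: mark the rows as "do not impute" via where.
# The model is fitted on observed data in all rows; skipped rows keep their NAs.
where <- make.where(data)
where[skip, ] <- FALSE
imp_where <- mice(data, where = where, m = 2, seed = 1, print = FALSE)

# Compare how many rows come back and how many NAs remain in the skipped rows
nrow(complete(imp_removed, 1))
colSums(is.na(complete(imp_ignore, 1)[skip, ]))
colSums(is.na(complete(imp_where, 1)[skip, ]))
```

Comparing the three completed data sets side by side is a quick way to check which rows were used for fitting and which rows were filled in.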
-
Thank you very much for the reply!
Concerning @stefvanbuuren: I agree that my reasons might not seem obvious and I bet there are better ways to do it. I am sorry for my inexperience. The script I am using is code somebody else wrote, so I did not want to make extensive changes. In this script the imputation is performed on a subset of variables (but with same
@prockenschaub: I tried to replicate the -99 issue by constructing a dataframe similar to mine. While doing so, I noticed that it probably has something to do with logged events. The dataframe that I imputed has a lot of zeros (and also all-zero rows). Since my dataset is a questionnaire assessing symptoms of one psychological disorder, collinearity definitely is an issue. Here is my example:
-
Ah, I now understand what you're trying to do. The following script combines
# impute a subset
# do not touch or impute non-selected rows
# return full data as mids object
# example: impute even rows only
library(mice)
data <- nhanes
odd <- as.logical((1:nrow(data)) %% 2)
where <- make.where(data)
where[odd, ] <- FALSE
imp <- mice(data, ignore = odd, where = where, seed = 1, m = 2, print = FALSE)
The even rows are imputed using the subset of even rows. Nothing happened with odd rows.
> complete(imp, 2)
age bmi hyp chl
1 1 NA NA NA
2 2 22.7 1 187
3 1 NA 1 187
4 3 25.5 1 184
5 1 20.4 1 113
6 3 25.5 2 184
7 1 22.5 1 118
8 1 30.1 1 187
9 2 22.0 1 238
10 2 26.3 1 187
11 1 NA NA NA
12 2 26.3 1 187
13 3 21.7 1 206
14 2 28.7 2 204
15 1 29.6 1 NA
16 1 33.2 2 187
17 3 27.2 2 284
18 2 26.3 2 199
19 1 35.3 1 218
20 3 25.5 2 184
21 1 NA NA NA
22 1 33.2 1 229
23 1 27.5 1 131
24 3 24.9 1 187
25 2 27.4 1 186
-
Here's another way to achieve the same thing, this time combining a pre-filter and the special
# example: impute even rows only, method 2
library(mice)
data <- nhanes
odd <- as.logical((1:nrow(data)) %% 2)
imp <- mice(data[!odd, ], seed = 1, m = 2, print = FALSE)
imp <- rbind(imp, data[odd, ])
> complete(imp, 2)
age bmi hyp chl
2 2 22.7 1 187
4 3 25.5 1 184
6 3 25.5 2 184
8 1 30.1 1 187
10 2 26.3 1 187
12 2 26.3 1 187
14 2 28.7 2 204
16 1 33.2 2 187
18 2 26.3 2 199
20 3 25.5 2 184
22 1 33.2 1 229
24 3 24.9 1 187
1 1 NA NA NA
3 1 NA 1 187
5 1 20.4 1 113
7 1 22.5 1 118
9 2 22.0 1 238
11 1 NA NA NA
13 3 21.7 1 206
15 1 29.6 1 NA
17 3 27.2 2 284
19 1 35.3 1 218
21 1 NA NA NA
23 1 27.5 1 131
25 2 27.4 1 186
Depending on your needs, you may wish to re-sort the data. In
idx <- order(as.numeric(rownames(imp$data)))
imp$data <- imp$data[idx, ]
imp$where <- imp$where[idx, ]
-
@AmeBol With regards to the value
levels(daten_99$z)
[1] "-99" "0" "1" "2" "3" "4"
You then use a proportional odds logistic regression (polr) to impute that variable, which treats all 6 of those levels (including
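The `levels()` output above can be reproduced with a toy vector. The sketch below is only an illustration: the vector `z` is hypothetical and stands in for a questionnaire item with answers 0 to 4 plus a -99 marker. It shows that converting such a column to a factor keeps -99 as a regular level, and that recoding the marker to `NA` before the conversion is one possible way to keep it out of the level set.

```r
# Hypothetical item: 0-4 are real answers, -99 marks rows meant to be ignored
z <- c(0L, 3L, -99L, 1L, NA, 4L, 2L, -99L)

# Converting directly keeps -99 as an ordinary category
z_fac <- factor(z)
levels(z_fac)

# One option: recode the marker to NA first, then fix the levels explicitly
z_clean <- z
z_clean[which(z_clean == -99L)] <- NA
z_fac_clean <- factor(z_clean, levels = 0:4)
levels(z_fac_clean)
```

Whether recoding is appropriate depends on what the marker is for. If -99 was only inserted as a test value, as in this thread, dropping it from the levels simply means a categorical imputation model has no such category to predict.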
-
Thanks @AmeBol and @prockenschaub. Closing because I think everything is resolved now.
-
I am quite new to using mice(), but find myself having an issue with understanding where I go wrong. I would be very grateful for some advice! In summary, my question is why my imputation results differ depending on whether the ignored rows contain NAs or values.
I have a longitudinal dataset that contains multiple variables of a questionnaire at different time points. Since not all participants attend every data collection time point, I have two types of missing values for the items: (1) individual missings at random (e.g., forgot to mark an answer; those NAs need to be imputed) and (2) missings of people who were not measured at this time point, which produce whole rows of missings (these should not be imputed and should stay NAs; later called all_NA).
I used the ignore "function" of mice() to exclude those rows from the imputation calculation via a logical vector. I then ran two variations of the calculation and compared them. I assumed they would give the same result but that the second procedure would be quicker. Unfortunately, the two approaches give different results, which I do not understand:
ignore. They still get imputed (which is not needed, but I assume this does not influence the imputation since they are ignored for the calculation). Therefore, I later replace those imputed all_NA rows with NAs again. Because the all-NA rows still get imputed, this process takes a long time.
mice() and ignore as before. Later I insert NAs in the all_NA rows, replacing the inserted zeros.
-> The results of approach 1 and 2 differ!
I did another test: To test whether ignore works, I inserted "-99" into all rows that should be ignored. The values in my dataset usually vary between 0 and 4, so -99 will have a great influence. With this test run, one of the imputed values came back as "-99".
Why are the imputed individual values not the same even though the training mids are the same? Will the ignored rows only be ignored for the first iteration? How can I solve my issue and which is the correct way to do it? And more generally: Can I trust the ignore function?
I hope this is clear and that somebody can help me out! I read #32 but it did not help in explaining why ignore seems to not do the job. Thank you :)