Wednesday, July 18, 2018

Random vs Mice Imputation

This post compares random imputation with the mice package, continuing from my January post, Random Imputation. Random imputation is simple, interpretable, and explainable, with no complicated theory behind it. I would like to see whether more elaborate imputation methods do noticeably better, either in the shape of the imputed distribution or in statistical comparisons. The first section demonstrates random imputation on simulated data with values missing completely at random (MCAR - missingness unrelated to the other variables). Then we compare the MICE (Multivariate Imputation by Chained Equations) algorithm against it. Other imputation algorithms will be explored in future posts.
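To make the idea concrete, here is a minimal sketch of random imputation as a helper function (random_impute() is just an illustrative name, not a function from any package): it fills each NA with a random draw from the observed values of the same variable. In the simulation below I instead draw from the known generating distribution, since we happen to know it.
random_impute <- function(v) {
  miss <- is.na(v)
  # replace each NA with a random draw from the observed values of v
  v[miss] <- sample(v[!miss], sum(miss), replace = TRUE)
  v
}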
First, load the necessary libraries
library(tidyverse)
library(mice)
Next, we generate a data frame with three variables. For simplicity, they are uncorrelated and drawn from a uniform distribution.
rm(list = ls())
set.seed(123)
x <- round(runif(10000, 1, 10000))
y <- round(runif(10000, 1, 10000))
z <- round(runif(10000, 1, 10000))
df1 <- data_frame("x" = x, "y" = y, "z" = z)
The correlations are all close to zero.
cor(df1)
##               x           y             z
## x  1.0000000000 -0.01760312 -0.0005817506
## y -0.0176031198  1.00000000  0.0110057126
## z -0.0005817506  0.01100571  1.0000000000
The ampute() function from the mice package will generate a data frame with missing values. (Note that ampute() defaults to a MAR mechanism; because x, y, and z are uncorrelated here, this behaves essentially like MCAR, though mech = "MCAR" could be passed to make it explicit.)
df1_m <- mice::ampute(df1,
              prop = 0.99)
Shown diagrammatically by md.pattern(), four missingness patterns were generated. In the rightmost column, “0” means the pattern contains no missing values and “1” means one variable is missing, with the red square marking the affected variable. For example, the bottom row is the pattern in which x has missing values while y and z are complete. We will use only the amputed x values for our comparison and keep y and z complete.
md.pattern(df1_m$amp)    

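Before moving on, the mads object returned by ampute() can also be inspected directly. The fields below are from the mice documentation, shown here only as a pointer:
df1_m$prop      # proportion of rows that were amputed
df1_m$patterns  # the missingness patterns (0 = amputed, 1 = left complete)
df1_m$mech      # the missingness mechanism ("MAR" by default)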
Using the amputed x column, we construct the working data frame, restoring y and z to their original complete values.
df2 <- data_frame("x" = df1_m$amp$x, "y" = y, "z" = z)
glimpse(df2)
## Observations: 10,000
## Variables: 3
## $ x <dbl> 2876, NA, 4090, 8830, 9405, 457, NA, 8924, NA, 4567, NA, 453...
## $ y <dbl> 3107, 3246, 8703, 3287, 1258, 3563, 9307, 8752, 8202, 219, 6...
## $ z <dbl> 9911, 3023, 4338, 1606, 8230, 2082, 2850, 5463, 5891, 4423, ...
Record the index of each missing value for future comparison.
missings1 <-  which(is.na(df2)) #pre-impute; only x (the first column) has NAs, so these positions are also the row indices
These are the original values that sat in those positions. In effect, they are the counterfactuals for the missing data.
miss1 <- df1$x[missings1]
Since the x variable was drawn from a uniform distribution, we will create a pool of 10,000 random values over roughly the same range to impute into the missing slots.
imps <- round(runif(10000, 0, 10000))
imps <- data_frame(imps)
How many missing values are there in each variable?
sum(is.na(df2$x)) 
## [1] 3152
sum(is.na(df2$y)) 
## [1] 0
sum(is.na(df2$z)) 
## [1] 0
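The same counts can be cross-checked with md.pattern() on the working data frame, which should report 3,152 rows missing x and none missing y or z:
md.pattern(df2)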
Now we randomly impute x; afterwards it contains no missing values.
df2$x <- if_else(is.na(df2$x), imps$imps, df2$x)
sum(is.na(df2$x))
## [1] 0
These are the values that were imputed into the missing slots.
miss2 <- df2$x[missings1]
Is there a significant difference between the “real” values of x and the imputed version of x?
t.test(df1$x, df2$x, paired = TRUE)
## 
##  Paired t-test
## 
## data:  df1$x and df2$x
## t = 0.44899, df = 9999, p-value = 0.6534
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -34.50193  55.00353
## sample estimates:
## mean of the differences 
##                 10.2508
So there is no significant difference between the original and imputed series. (Keep in mind that roughly two-thirds of the pairs are identical, so this overall test is fairly forgiving; the comparison restricted to the imputed positions comes further below.) Let’s visualize the data.
options(scipen = 99)
ggplot() +
  geom_density(aes(x = df1$x), color = "blue") +
  geom_density(aes(x = df2$x), color = "red") +
  labs(title = "Random Imputation - Original vs. Imputed") 

The real and imputed versions of x remain correlated.
cor(df1$x, df2$x)
## [1] 0.6846402
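As a rough sanity check on that number (a back-of-the-envelope argument of mine, not output from the code above): the observed entries of x are carried over unchanged and the imputed entries are independent draws with about the same variance, so the correlation should be close to the proportion of values that were never missing.
# expected correlation is roughly the share of values that were never missing
1 - length(miss1) / nrow(df1)   # about 0.685, close to the observed correlation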
The imputed values are not significantly different from the real values.
#compare the missing values with the real ones
t.test(miss1, miss2, paired = TRUE)
## 
##  Paired t-test
## 
## data:  miss1 and miss2
## t = 0.44895, df = 3151, p-value = 0.6535
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -109.5101  174.5533
## sample estimates:
## mean of the differences 
##                32.52157
Did the imputation make any correct hits (bullseyes)?
bullseye <-  0
bullseye <- if_else(miss1 == miss2, 1, 0)
sum(bullseye)
## [1] 0
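Zero exact hits is roughly what we should expect (again a back-of-the-envelope figure): each imputed value has about a 1-in-10,000 chance of matching its counterfactual, so across 3,152 imputations the expected number of bullseyes is well under one.
# expected number of exact matches under random imputation (rough)
length(miss1) / 10000   # about 0.32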

************************************************

MICE Imputation

We will now start over and do the imputation using the MICE algorithm.
rm(list = ls())
set.seed(123)
x <- round(runif(10000, 1, 10000))
y <- round(runif(10000, 1, 10000))
z <- round(runif(10000, 1, 10000))
df1 <- data_frame("x" = x, "y" = y, "z" = z)
cor(df1)
##               x           y             z
## x  1.0000000000 -0.01760312 -0.0005817506
## y -0.0176031198  1.00000000  0.0110057126
## z -0.0005817506  0.01100571  1.0000000000
df1_m <- ampute(df1,
              prop = 0.99)
md.pattern(df1_m$amp)              

##         x    y    z     
## 89      1    1    1    0
## 3381    1    1    0    1
## 3378    1    0    1    1
## 3152    0    1    1    1
##      3152 3378 3381 9911
df2 <- data_frame("x" = df1_m$amp$x, "y" = y, "z" = z)
missings1 <-  which(is.na(df2)) #pre-impute
miss1 <- df1$x[missings1]
sum(is.na(df2$x))
## [1] 3152
sum(is.na(df2$y))
## [1] 0
sum(is.na(df2$z))
## [1] 0
Now we run the MICE algorithm with its default settings.
df2_imp <- mice(df2) #generate the imputed data sets
missings5 <- df2_imp$imp$x #the imputed values for x, one column per imputation
miss2 <- missings5[, 1] #just use the first imputation set
df2$x[is.na(df2$x)] <- miss2 #impute the data; mice stores these values in row order, so they line up with the NA positions
sum(is.na(df2$x)) #how many NA's left?
## [1] 0
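As an aside, had we not filled the column in by hand, the more conventional way to pull a completed data set out of a mids object is mice::complete(). A minimal sketch with the defaults spelled out (printFlag = FALSE just suppresses the iteration log) might look like this:
imp <- mice(df2, m = 5, maxit = 5, seed = 123, printFlag = FALSE)
df2_complete <- complete(imp, 1)   # the first of the five completed data sets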
ggplot() +
  geom_density(aes(x = df1$x), color = "blue") +
  geom_density(aes(x = df2$x), color = "red") +
  labs(title = "MICE Imputation - Original vs. Imputed")

t.test(miss1, miss2, paired = TRUE)
## 
##  Paired t-test
## 
## data:  miss1 and miss2
## t = -0.65864, df = 3151, p-value = 0.5102
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -187.7664   93.3387
## sample estimates:
## mean of the differences 
##               -47.21383
cor(df1$x, df2$x)
## [1] 0.6052733
bullseye <-  0
bullseye <- if_else(miss1 == miss2, 1, 0)
sum(bullseye)
## [1] 1

************************************************

Now let’s see how random imputation stands up against MICE with a smaller, normally distributed dataset.
rm(list = ls())
set.seed(123)
x <- round(rnorm(200, 100, 30))
y <- round(rnorm(200, 100, 30))
z <- round(rnorm(200, 100, 30))
df1 <- data_frame("x" = x, "y" = y, "z" = z)
cor(df1)
##             x           y           z
## x  1.00000000 -0.02555085 -0.02958507
## y -0.02555085  1.00000000 -0.05119379
## z -0.02958507 -0.05119379  1.00000000
df1_m <- mice::ampute(df1,
              prop = 0.99)
df2 <- data_frame("x" = df1_m$amp$x, "y" = y, "z" = z)
glimpse(df2)
## Observations: 200
## Variables: 3
## $ x <dbl> NA, NA, 147, NA, 104, 151, 114, 62, NA, 87, 137, NA, 112, 10...
## $ y <dbl> 166, 139, 92, 116, 88, 86, 76, 82, 150, 98, 104, 107, 137, 8...
## $ z <dbl> 98, 65, 81, 99, 120, 50, 90, 123, 84, 107, 115, 108, 120, 96...
missings1 <-  which(is.na(df2))
miss1 <- df1$x[missings1]
imps <- round(rnorm(200, 100, 30))
imps <- data_frame(imps)
sum(is.na(df2$x)) 
## [1] 56
sum(is.na(df2$y)) 
## [1] 0
sum(is.na(df2$z)) 
## [1] 0
df2$x <- if_else(is.na(df2$x), imps$imps, df2$x) #impute
sum(is.na(df2$x))
## [1] 0
miss2 <- df2$x[missings1]
t.test(df1$x, df2$x, paired = TRUE)
## 
##  Paired t-test
## 
## data:  df1$x and df2$x
## t = 0.48601, df = 199, p-value = 0.6275
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.766981  4.576981
## sample estimates:
## mean of the differences 
##                   0.905
ggplot() +
  geom_density(aes(x = df1$x), color = "blue") +
  geom_density(aes(x = df2$x), color = "red") +
  labs(title = "Random Imputation - Original vs. Imputed") 

cor(df1$x, df2$x)
## [1] 0.5919782
t.test(miss1, miss2, paired = TRUE)
## 
##  Paired t-test
## 
## data:  miss1 and miss2
## t = 0.4836, df = 55, p-value = 0.6306
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -10.16194  16.62622
## sample estimates:
## mean of the differences 
##                3.232143
bullseye <-  0
bullseye <- if_else(miss1 == miss2, 1, 0)
sum(bullseye)
## [1] 1

************************************************

Now we repeat the exercise with MICE.
rm(list = ls())
set.seed(123)
x <- round(rnorm(200, 100, 30))
y <- round(rnorm(200, 100, 30))
z <- round(rnorm(200, 100, 30))
df1 <- data_frame("x" = x, "y" = y, "z" = z)
cor(df1)
##             x           y           z
## x  1.00000000 -0.02555085 -0.02958507
## y -0.02555085  1.00000000 -0.05119379
## z -0.02958507 -0.05119379  1.00000000
df1_m <- ampute(df1,
              prop = 0.99)
df2 <- data_frame("x" = df1_m$amp$x, "y" = y, "z" = z)
missings1 <-  which(is.na(df2)) #pre-impute
miss1 <- df1$x[missings1]
sum(is.na(df2$x))
## [1] 56
sum(is.na(df2$y))
## [1] 0
sum(is.na(df2$z))
## [1] 0
Again, the MICE algorithm with default settings.
df2_imp <- mice(df2) #generate the imputed data sets
missings5 <- df2_imp$imp$x
miss2 <- missings5[, 1] #just use the first imputation set
df2$x[is.na(df2$x)] <- miss2 #impute the data, keeping the values aligned with the NA positions
sum(is.na(df2$x))
## [1] 0
ggplot() +
  geom_density(aes(x = df1$x), color = "blue") +
  geom_density(aes(x = df2$x), color = "red") +
  labs(title = "MICE Imputation - Original vs. Imputed")

t.test(miss1, miss2, paired = TRUE)
## 
##  Paired t-test
## 
## data:  miss1 and miss2
## t = -0.98135, df = 55, p-value = 0.3307
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -15.210621   5.210621
## sample estimates:
## mean of the differences 
##                      -5
cor(df1$x, df2$x)
## [1] 0.3916698
bullseye <-  0
bullseye <- if_else(miss1 == miss2, 1, 0)
sum(bullseye)
## [1] 1
So random imputation seems to stand up quite well against MICE imputation, at least in these simple simulations. It has no black box around it, at least none opaque to me, and it is certainly more satisfying than listwise deletion (destructive) or median imputation (disfiguring). I haven’t seen this approach discussed in my reading, but then I’m not a statistician.