library(tidyverse)
Many methods have been devised to impute missing values. Simple ones, such as substituting the mean or median of a column, are common, as is deleting any row that contains a missing value. More elaborate methods such as the mice package, Full Information Maximum Likelihood (FIML), and Amelia II are also available.
In my opinion, a more satisfying, straightforward and accurate approach to imputation is to draw random values from the same distribution as the observed data and use them to fill the missing slots.
This method involves the following steps:
1. Visualize the data to determine its likely probability distribution.
2. Generate a random selection of values using the parameters of the probability distribution.
3. Populate the missing values with the generated values.
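The three steps above can be sketched on toy data. This is only an illustration: the vector `x` is made up (it is not the Titanic data), and a normal distribution is assumed as the working distribution.

```r
# Toy data: a hypothetical numeric vector with 5 missing values
set.seed(1)
x <- c(rnorm(95, mean = 50, sd = 10), rep(NA, 5))

# Step 1: look at the observed values; plot(density(obs)) would show the shape
obs <- x[!is.na(x)]

# Step 2: draw one random value per NA, using the mean and sd of the observed data
n_missing <- sum(is.na(x))
draws <- rnorm(n_missing, mean = mean(obs), sd = sd(obs))

# Step 3: fill the missing slots with the draws
x_imp <- x
x_imp[is.na(x_imp)] <- draws

sum(is.na(x_imp))  # number of NAs left after imputation
```

The observed values are left untouched; only the NA positions receive new values.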
As an example, we will use the Titanic survival data, which can be obtained from the Kaggle website. The age column contains many missing values (NAs), which can be imputed using this method.
The first step is to get the classic Titanic data set from Kaggle.
We will use the training set to illustrate.
#get the training data
titan <- read_csv("train.csv")
Next, make the column names more manageable.
colnames(titan) <- c("id", "surv", "clas", "nam", "sex", "age",
                     "sib", "par", "tik", "fare", "cab", "emb")
Plot the distribution of ages in the training data.
#see the distribution of age
p <- ggplot() +
  geom_density(aes(x = age), data = titan)
p
Remember this shape for later. It is a bimodal distribution with one component for children and another, skewed to the right, for adults. To make the adult component look more normal, we take the log of age and plot again.
ggplot() +
  geom_density(aes(x = log(age)), data = titan)
We are going to draw the ages to impute for the NAs from a normal distribution fitted to the bulk of the passengers, who are represented by the normal-looking region of the density curve. This is a more honest way to impute the NAs, since random draws from the normal distribution carry the uncertainty about each missing value, whereas imputing the median or mean assigns every missing slot the same value and carries none.
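The point about uncertainty can be checked on simulated data. The sketch below uses made-up values (not the Titanic ages) with 200 of 1000 entries set to NA, and compares the standard deviation after mean imputation with the one after random draws. All variable names are illustrative.

```r
# Simulated stand-in for an age column, with 200 values knocked out
set.seed(42)
x <- rnorm(1000, mean = 32, sd = 12)
x[sample(1000, 200)] <- NA
obs <- x[!is.na(x)]

# Mean imputation: every NA gets the same value
mean_imp <- ifelse(is.na(x), mean(obs), x)

# Random imputation: every NA gets its own draw from N(mean(obs), sd(obs))
rand_imp <- x
rand_imp[is.na(x)] <- rnorm(sum(is.na(x)), mean = mean(obs), sd = sd(obs))

# Compare the spread of the three versions
spread <- c(observed = sd(obs), mean_imputed = sd(mean_imp), random_imputed = sd(rand_imp))
print(round(spread, 2))
```

Mean imputation adds points with zero deviation from the mean, so it always shrinks the standard deviation; the random draws keep the spread close to that of the observed values.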
Most of the adult curve lies between about 2.5 and 4.3 on the log scale, which corresponds to ages of roughly 12 and 74.
exp(2.5) # 12
## [1] 12.18249
exp(4.3) # 74
## [1] 73.69979
Now we examine the adult data more closely. First, extract the adult ages to see them up close. Note that filter() also drops the rows where age is NA.
ages <- titan %>% select(id, age)
ages <- ages %>% filter(age >= 12 & age <=74)
sum(is.na(ages))
## [1] 0
ggplot() +
  geom_density(aes(x = log(age)), data = ages)
This isn’t exactly a normal curve, but it is close enough that a normal distribution is a reasonable choice to draw our imputed values from.
Get the mean and standard deviation of the adult ages, and the number of NAs in the full age column.
sd(ages$age) # 12
## [1] 12.48277
mean(ages$age) # 32
## [1] 32.26047
sum(is.na(titan$age)) # 177
## [1] 177
Create the values to be imputed into the missing age slots from a normal distribution with the same (rounded) parameters as the observed data.
# tibble() replaces the deprecated data_frame()
ages_imp <- tibble(ages = round(rnorm(177, mean = 32, sd = 12)))
glimpse(ages_imp)
## Observations: 177
## Variables: 1
## $ ages <dbl> 29, 25, 19, 36, 34, 41, 48, 34, 37, 52, 47, 40, 42, 17, 3...
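Because the values are random, each run produces a different set of imputed ages. A minimal sketch, assuming you want the draws to be reproducible: fix the seed before calling rnorm(). The seed value 2024 is arbitrary and the variable names are illustrative. It is also worth checking the range of the draws, since a normal distribution can occasionally produce implausibly low ages.

```r
# Same seed, same draws: set the seed immediately before each call to rnorm()
set.seed(2024)  # arbitrary seed, used here only for illustration
draws_a <- round(rnorm(177, mean = 32, sd = 12))

set.seed(2024)
draws_b <- round(rnorm(177, mean = 32, sd = 12))

identical(draws_a, draws_b)  # checks that the two runs match

# Sanity checks on the simulated ages
range(draws_a)               # smallest and largest simulated age
n_negative <- sum(draws_a < 0)  # count of impossible (negative) ages, if any
n_negative
```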
Convert age to integer (this truncates the fractional ages) and create a copy of the Titanic data for later comparison.
titan$age <- as.integer(titan$age)
titan1 <- titan #for later comparison
Fill each missing age slot with one of the generated values, then check for any remaining NAs.
j <- 1  # index of the next unused imputed value
for (i in seq_len(nrow(titan))) {
  if (is.na(titan$age[i])) {
    titan$age[i] <- ages_imp$ages[j]
    j <- j + 1
  }
}
sum(is.na(titan$age))
## [1] 0
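The same fill can be written without a loop by assigning to the NA positions directly. A minimal sketch on a toy data frame; `titan_toy` and `imp` are hypothetical stand-ins for `titan` and `ages_imp$ages`.

```r
# Hypothetical stand-ins for the real data and the generated values
titan_toy <- data.frame(age = c(22, NA, 38, NA, 35))
imp <- c(29, 41)

# Positions of the missing ages, in row order
na_idx <- which(is.na(titan_toy$age))

# Guard: there must be exactly one imputed value per missing slot
stopifnot(length(na_idx) == length(imp))

# Assign the imputed values to the missing slots in order, as the loop does
titan_toy$age[na_idx] <- imp

titan_toy$age
```

The first NA receives the first generated value, the second NA the second, and so on, which matches the order used by the loop.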
Here’s the imputed plot:
ggplot() +
  geom_density(aes(x = age), data = titan)
Compare this with the shape of the original data, which still contains the NAs:
p
I think that random imputation is a more honest and accurate way to impute, since it carries a measure of uncertainty derived from the original data. The randomly imputed plot is almost identical to the plot of the original data, whereas imputing the median would pile all 177 imputed values onto a single age and distort the density plot. This is a relatively simple and, in my opinion, more satisfying and convincing way to impute missing values.
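The distortion caused by median imputation can be illustrated on simulated data. This is a sketch under stated assumptions, not a result from the Titanic data: the values are simulated from a normal distribution, 20% are treated as missing, and the helper `peak()` is hypothetical. It compares the height of the density peak under each approach.

```r
# Simulated ages (not the Titanic data), with 200 of 1000 treated as missing
set.seed(7)
x <- rnorm(1000, mean = 32, sd = 12)
miss <- sample(1000, 200)
obs <- x[-miss]

# Median imputation: all missing slots get the same value
med_imp <- x
med_imp[miss] <- median(obs)

# Random imputation: each missing slot gets its own draw
rand_imp <- x
rand_imp[miss] <- rnorm(length(miss), mean = mean(obs), sd = sd(obs))

# Hypothetical helper: height of the kernel density estimate at its peak
peak <- function(v) max(density(v)$y)

peaks <- c(observed = peak(obs), median = peak(med_imp), random = peak(rand_imp))
print(round(peaks, 4))
```

Stacking 200 identical values at the median produces a sharp spike in the density curve, while the randomly imputed version stays close to the shape of the observed data.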