Kaggle - Employee Attrition

Introduction

This analysis is based on a dataset published on the Kaggle.com website: (https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset) Data scientists at the IBM corporation designed the dataset to represent human rescources data from a hypothetical company. To download the dataset, sign in to Kaggle, navigate to the above website and download the “WA_Fn-UseC_-HR-Employee-Attrition.csv” file. You will have to sign up for a free Kaggle account if you don’t already have one.

The HR dept at this hypothetical company feels that they may have a problem with excessive employee turnover that deserves further study. The company’s HR data includes the following variables, which I have renamed as follows in the interest of minimizing excessive typing, but preserving the name information:

Age(age), Attrition(attrit), BusinessTravel(bus_trav), DailyRate(day_rate), Department(dept), DistanceFromHome(home_dist), Education(educ), EducationField(ed_field), EmployeeCount(zero-variance, will be deleted), EmployeeNumber(emp_num), EnvironmentSatisfaction(emp_sat), Gender(sex), HourlyRate(hr_rate), JobInvolvement(involv), JobLevel(level), JobRole(role), JobSatisfaction(satis), MaritalStatus(married), MonthlyIncome(month_inc), MonthlyRate(month_rate), NumCompaniesWorked(prev_emp), Over18(zero-variance, will be deleted), OverTime(ot), PercentSalaryHike(perc_sal_up), PerformanceRating(perf_rate), RelationshipSatisfaction(relat_sat), StandardHours(zero-variance, will be deleted), StockOptionLevel(stk_opt_lev), TotalWorkingYears(tot_wk_yrs), TrainingTimesLastYear(train_times), WorkLifeBalance(work_life), YearsAtCompany(yrs_at_co), YearsInCurrentRole(yrs_in_role), YearsSinceLastPromotion(yrs_last_pro), YearsWithCurrManager(yrs_w_man)

We will use concepts utilized by epidemiologists to analyze this data. Epidemiologic reasoning can be used in many other areas besides medicine to gain insight into data and solve related problems. Our primary interest is to determine what variables are most associated with our outcome variable - attrition. Which of these should we address to lessen the impact on the company of losing employees? We will apply the concept of attributable risk and the use of 2x2 tables to help answer this question. In addition, we can use our insight into the data derived from these concepts to predict future employee behavior.
# load the required packages
library(tidyverse)
library(readr) 
library(stringr)
library(caret)
library(purrr)
library(broom)
library(forcats)
library(modelr)
library(knitr)
library(epitools)
options(scipen=999) # no scientific notation 
Read in the data
#get the data
hr <- read_csv("WA_Fn-UseC_-HR-Employee-Attrition.csv")

Data Preparation
#remove zero-variance features
hr <- hr %>% select(-c(EmployeeCount, StandardHours, Over18))
#make more user-friendly variable names
colnames(hr) <- c("age","attrit", "bus_trav", "day_rate", "dept", "home_dist",
                  "educ", "ed_field", "emp_num", "emp_sat",
                  "sex", "hr_rate", "involv", "level", "role", "satis", "married",
                  "month_inc", "month_rate", "prev_emp", "ot",
                  "perc_sal_up", "perf_rate", "relat_sat",
                  "stk_opt_lev", "tot_wk_yrs", "train_times", "work_life",
                  "yrs_at_co", "yrs_in_role", "yrs_last_pro", "yrs_w_man")
Since we will be creating predictive models later on, we need to split the data into a training data set from which we will create the model, and into a test data set from which we will be able to calculate measures of accuracy. 75% of the data will be used to train the model and 25% will be held aside to test against. The test data will be prepared in parallel along with the training set, but will not be examined or analyzed. This allows an honest evaluation of model accuracy.
#split the data
set.seed(123)
inTrain <- createDataPartition(hr$attrit, p = 0.75, list = FALSE)
known <- hr[inTrain,] #the known data, 75%, will be used to train the model
test <- hr[-inTrain,] #the test data will be used to determine the model accuracy
We will separate the numeric from the categorical variables. This makes it easier to perform feature development on each class, with less likelihood of error
#separate the numeric from the categorical variables
known_num <- known %>% select(age, hr_rate, day_rate, month_rate, month_inc,
                           home_dist, emp_num, prev_emp, perc_sal_up, tot_wk_yrs,
                           yrs_at_co, yrs_in_role, yrs_last_pro, yrs_w_man)
known_cat <- known %>% select(attrit, bus_trav, dept, educ, ed_field, emp_sat,
                           sex, involv, level, role,satis,married,ot, perf_rate,
                           relat_sat, stk_opt_lev, train_times, work_life)
Integers need to be converted to floating-point for later computation
#change integers to floating-point                           
known_num[1:14] <- lapply(known_num[1:14], as.numeric)
Further separate the integer categorical variables from the character variables for further batch processing
#separate integer categoricals from character categorical variables
known_cat_int <- known_cat %>% select(educ, emp_sat, involv, level, satis,
                                      perf_rate, relat_sat, stk_opt_lev,
                                      train_times, work_life)
known_cat_chr <- known_cat %>% select(attrit, bus_trav, dept,
                                      ed_field, sex, role, married, ot)
Character categorical variables need to be converted to factors
#factorize known_cat_chr
known_cat_chr[1:8] <- lapply(known_cat_chr[1:8], as.factor)
R doesn’t like factor level names that start with a number - level names will start with X
#create R-friendly names (no leading number)
known_cat_intX <- data.frame(known_cat_int %>% map(make.names))
Assemble all the pieces
#put the categorical data frames together
known_cat_full1 <- known_cat_chr %>% bind_cols(known_cat_intX)

#put everything together
known_cat_full2 <- known_cat_full1 %>% bind_cols(known_num)
Shorter data frame name for easier future use
#more user-friendly dataset name
hr_known <- known_cat_full2
Some algorithms need to have the outcome variable (attrit) as the last variable in the data frame
#send attrit to last column
hr_known <- hr_known[c(2:length(hr_known),1)]
Limiting multicollinearity helps prevent overfitting our model
#remove highly correlated numerical variables
correlation_matrix <- cor(hr_known[, 18:31])
hi_cor <- findCorrelation(correlation_matrix, cutoff = 0.75)
paste("correlation of tot_wk_years and yrs_at_co is", round(cor(hr_known$tot_wk_yrs,  hr_known$yrs_at_co),2) )  # -> 0.59
## [1] "correlation of tot_wk_years and yrs_at_co is 0.59"
#tot_wk_yrs, yrs_at_co are correlated; eliminate tot_wk_yrs
hr_known <- hr_known %>% select(-tot_wk_yrs)
Using binary categorical variables makes interpretation easier
#Binarize bus_trav - now has three levels - will come in handy later for attributable risk
hr_known$bus_trav <- fct_collapse(hr_known$bus_trav, 
    rare = c("Non-Travel", "Travel_Rarely"),
    often = "Travel_Frequently")

#Binarize and factor yrs_last_pro - will come in handy later for attributable risk
hr_known$yrs_last_pro <- ifelse(hr_known$yrs_last_pro >= 5,"long", "short")
hr_known$yrs_last_pro <- factor(hr_known$yrs_last_pro)

#Binarize and factor prev_emp (previous employers) - handy for later attributable risk
hr_known$prev_emp <- ifelse(hr_known$prev_emp > 5, "many", "few")
hr_known$prev_emp <- factor(hr_known$prev_emp)

#Binarize and factor month_rate  - handy for later attributable risk
hr_known$month_rate <- ifelse(hr_known$month_rate > 10000, "high", "low")
hr_known$month_rate <- factor(hr_known$month_rate)

#Binarize and factor home_dist - handy for later attributable risk
hr_known$home_dist <- ifelse(hr_known$home_dist > 20, "far", "near")
hr_known$home_dist <- factor(hr_known$home_dist)
Now we have to prepare the test dataset in exactly the same way that we have prepared the training(known) dataset. It important to note that we have not examined or analyzed the test dataset to preserve its integrity for future comparison.
# Prepare the test dataset  *************************************
test_num <- test %>% select(age, hr_rate, day_rate, month_rate, month_inc,
                           home_dist, emp_num, prev_emp, perc_sal_up, tot_wk_yrs,
                           yrs_at_co, yrs_in_role, yrs_last_pro, yrs_w_man)
test_cat <- test %>% select(attrit, bus_trav, dept, educ, ed_field, emp_sat,
                           sex, involv, level, role,satis,married,ot, perf_rate,
                           relat_sat, stk_opt_lev, train_times, work_life)

#change integers to floating-point                           
test_num[1:14] <- lapply(test_num[1:14], as.numeric)

#separate integer categoricals from character categorical variables
test_cat_int <- test_cat %>% select(educ, emp_sat, involv, level, satis,
                                      perf_rate, relat_sat, stk_opt_lev,
                                      train_times, work_life)
test_cat_chr <- test_cat %>% select(attrit, bus_trav, dept, ed_field, sex, role, married, ot)
#factorize known_cat_chr

test_cat_chr[1:8] <- lapply(test_cat_chr[1:8], as.factor)
  
                            
#create R-friendly names (no leading number)
test_cat_intX <- data.frame(test_cat_int %>% map(make.names))

#put the categorical data frames together
test_cat_full1 <- test_cat_chr %>% bind_cols(test_cat_intX)

#put everything together
test_cat_full2 <- test_cat_full1 %>% bind_cols(test_num)
  

#more user-friendly dataset name
hr_test <- test_cat_full2 

#send attrit to last column
hr_test <- hr_test[c(2:length(hr_test),1)]

#remove highly correlated numerical variables
correlation_matrix <- cor(hr_test[, 18:31])
hi_cor <- findCorrelation(correlation_matrix, cutoff = 0.75)
hr_test <- hr_test %>% select(-tot_wk_yrs)

#Binarize bus_trav - now has three levels
hr_test$bus_trav <- fct_collapse(hr_test$bus_trav, 
    rare = c("Non-Travel", "Travel_Rarely"),
    often = "Travel_Frequently")


#Binarize and factor yrs_last_pro - will come in handy later for attributable risk
hr_test$yrs_last_pro <- ifelse(hr_test$yrs_last_pro >= 5,"long", "short")
hr_test$yrs_last_pro <- factor(hr_test$yrs_last_pro)

#Binarize and factor prev_emp (previous employers) - handy for later attributable risk
hr_test$prev_emp <- ifelse(hr_test$prev_emp > 5, "many", "few")
hr_test$prev_emp <- factor(hr_test$prev_emp)

#Binarize and factor home_dist (previous employers) - handy for later attributable risk
hr_test$home_dist <- ifelse(hr_test$home_dist > 20, "far", "near")
hr_test$home_dist <- factor(hr_test$home_dist)

#Binarize and factor month_rate  - handy for later attributable risk
hr_test$month_rate <- ifelse(hr_test$month_rate > 10000, "high", "low")
hr_test$month_rate <- factor(hr_test$month_rate)
Data Visualization
#*****  Visualization   *************************************
# pare down the known dataset to show jobs vs salary
roles <- hr_known %>% select(role, month_inc, attrit) 

# table of job titles vs salary
role  <-roles %>% 
   dplyr::group_by(role) %>%
   dplyr::summarise(mean_inc = mean(month_inc))
Set a theme for plotting
my_theme <- function(base_size = 12, base_family = "sans"){
  theme_minimal(base_size = base_size, base_family = base_family) +
  theme(
    axis.text = element_text(size = 12),
    axis.text.x = element_text(angle = 0, vjust = 0.5, hjust = 0.5),
    axis.title = element_text(size = 14),
    panel.grid.major = element_line(color = "grey"),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "aliceblue"),
    strip.background = element_rect(fill = "navy", color = "navy", size = 1),
    strip.text = element_text(face = "bold", size = 12, color = "white"),
    legend.position = "right",
    legend.justification = "top", 
    legend.background = element_blank(),
    panel.border = element_rect(color = "grey", fill = NA, size = 0.5)
  )
}

theme_set(my_theme())
ggplot(role, aes(mean_inc,fct_reorder(role,mean_inc))) + 
   geom_point() +
   ggtitle("Employee Monthly Salaries") +
   xlab("Mean Monthly Salary") +
   ylab("Occupation")
The large gap between management and production employees may be a source of dissatisfaction leading to employees leaving the company.
Explanatory Models
An explanatory model will help us identify which predictors show the highest association with the outcome variable(employee attrition). Since our outcome variable(attrit) is binary and categorical, a generalized linear model, rather than an ordinary least squares linear model is appropriate
# GLM via direct model - backward look - explanatory model
m4_glm <- glm(attrit ~., data = hr_known,
              family = binomial)
Feature Selection
We need to find the variables that are statistically significant (probability that the estimates are due to chance is less than 5%), ANDthat have a significant effect (odds Ratio greater than 1.00)
# put summary in tidy form from broom package
m4_glm_tdy <- tidy(m4_glm) 
#sort by p. value to choose those which are significant
m4_glm_tdy %>% arrange(p.value) %>% filter(p.value <=.05)
#calculate odds ratios and sort by odds ratio
m4_glm_odds <- m4_glm_tdy %>%
   mutate(odds_ratio = exp(estimate)) %>% 
   arrange(desc(odds_ratio))
look_m4 <- glance(m4_glm)
deviance <- look_m4$deviance
null_deviance <- look_m4$null.deviance
deg_of_freedom <- look_m4$df.residual
## [1] "deviance is: 536"
## [1] "null deviance is 975"
## [1] "degrees of freedom is 1036"
Since the deviance is much less than the null deviance, our model explains a large proportion of the outcome. Also, since the deviance is less than the degrees of freedom, the model fits the data well and is not overfitted.
Feature Selection
We need to find the variables that are statistically significant (probability that the estimates are due to chance is less than 5%, AND that have a significant effect: Odds Ratio greater than 1.00
library(knitr)
library(kableExtra)
#select those with p-value > 0.05 AND odds ratio > 1.00
m4_glm_odds_p_o <- m4_glm_odds %>% filter(p.value <= 0.05 & odds_ratio > 1.00)
m4_p_o <- m4_glm_odds_p_o %>%
   select(term,p.value,odds_ratio)
m4_tab <- kable(m4_p_o, digits = 4, col.names = c("variable", "p-value", "odds ratio"),
                caption = "Significant Employment Variables")
So, now we know that employees who are working overtime(presumably a lot), have had many previous employments, have to travel often, and/or are male are most likely to be associated with our outcome variable - attrition.
But which of the variables is most deserving of our attention and efforts to mitigate? Addressing this feature first will give us the biggest bang-for-the-buck in terms of retaining employees?
Taking working overtime as an example, we have an “exposed” group (those employees working overtime), and a “non-exposed group” (those not working overtime). Risk is defined as the probability of an outcome (in our case employee attrition) occuring in a specified time period (also called the incidence of the outcome). The authors of the dataset don’t define the time period here, but I will assume it to be one year. So, looking at overtime as the exposure under study, during that year both the exposed and non-exposed group are both exposed to the same background risk (not-overtime) for attrition. The exposed group is exposed to additional risk due to the nature of the exposure. This excess risk is termed “Attributable Risk” since the excess risk of attrition is attributable to overtime. This is a very useful concept since you can now use it to estimate the potential amount of risk that could be eliminated by eliminating the risk factor. Let’s say that overtime has a 20% attributable risk in our work population. If we eliminate working overtime, then, assuming all other risks evenly distributed, the risk of employee attrition would drop 20%. As a practical measure, we can compare the attributable risks of our four identified risk factors and concentrate our efforts at mitigating those with the highest atributable risks.
We can use a simple formula to calculate attributable risk:
Source: Gordis, Epidemiology, 3rd Edition
Attributable Risk for Overtime
# ot
#make a 2x2 table
exposure <- hr_known$ot
outcome <- hr_known$attrit
ot_tab <- epitable("Overtime"= exposure, "Attrition"= outcome, rev = "both")
ot_tab
##         Attrition
## Overtime Yes  No
##      Yes  99 223
##      No   79 702
#incidence of attrit in ot exposed
inc_att_ot <- ot_tab[1,1] / (ot_tab[1,1] + ot_tab[1,2])
#incidence of attrit in ot unexposed
inc_att_notot <- ot_tab[2,1] / (ot_tab[2,1] + ot_tab[2,2])
#attributable risk due to ot
ar_ot <- (inc_att_ot - inc_att_notot)/inc_att_ot # attributable risk due to ot
## [1] "Attributable risk due to overtime =  0.671"
## [1] "67.1 percent of the overall risk is due to overtime"
Attributable Risk for Previous Employers
# prev_emp
#make a 2x2 table
exposure <- hr_known$prev_emp
outcome <- hr_known$attrit
prev_emp_tab <- epitable("Previous Employers"= exposure, "Attrition"= outcome, rev = "both")
prev_emp_tab
##                   Attrition
## Previous Employers Yes  No
##               many  36 143
##               few  142 782
#incidence of attrit in ot exposed
inc_att_prev_emp <- prev_emp_tab[1,1] / (prev_emp_tab[1,1] + prev_emp_tab[1,2])
#incidence of attrit in ot unexposed
inc_att_notprev_emp <- prev_emp_tab[2,1] / (prev_emp_tab[2,1] + prev_emp_tab[2,2])
#attributable risk due to ot
ar_prev_emp <- (inc_att_prev_emp - inc_att_notprev_emp)/inc_att_prev_emp # AR due to prev_emp
## [1] "Attributable risk due to Previous Employers =  0.236"
## [1] "23.6 percent of the overall risk is due to Previous Employers"
Attributable Risk for Business Travel
# bus_trav
#make a 2x2 table
exposure <- hr_known$bus_trav
outcome <- hr_known$attrit
bus_trav_tab <- epitable("Business Travel"= exposure, "Attrition"= outcome, rev = "both")
bus_trav_tab
##                Attrition
## Business Travel Yes  No
##           often  53 164
##           rare  125 761
#incidence of attrit in bus_trav exposed
inc_att_bus_trav <- bus_trav_tab[1,1] / (bus_trav_tab[1,1] + bus_trav_tab[1,2])
#incidence of attrit in ot unexposed
inc_att_notbus_trav <- bus_trav_tab[2,1] / (bus_trav_tab[2,1] + bus_trav_tab[2,2])
#attributable risk due to bus_trav
ar_bus_trav <- (inc_att_bus_trav - inc_att_notbus_trav)/inc_att_bus_trav # attributable risk due to bus_trav
## [1] "Attributable risk due to Business Travel =  0.422"
## [1] "42.2 percent of the overall risk is due to Business Travel"
Attributable Risk for Gender
# sex
#make a 2x2 table
exposure <- hr_known$sex
outcome <- hr_known$attrit
sex_tab <- epitable("Sex"= exposure, "Attrition"= outcome, rev = "both")
sex_tab
##         Attrition
## Sex      Yes  No
##   Male   115 551
##   Female  63 374
#incidence of attrit in ot exposed
inc_att_sex <- sex_tab[1,1] / (sex_tab[1,1] + sex_tab[1,2])
#incidence of attrit in ot unexposed
inc_att_notsex <- sex_tab[2,1] / (sex_tab[2,1] + sex_tab[2,2])
#attributable risk due to sex
ar_sex<- (inc_att_sex - inc_att_notsex)/inc_att_sex # attributable risk due to sex
## [1] "Attributable risk due to Gender =  0.165"
## [1] "16.5 percent of the overall risk is due to Gender"
Now we have good evidence to support directing our attention towards mitigating the amount of and the effects of working overtime. We can expect to have the greatest impact on employee retention if we deal with the effects of overtime work compared to the other risk factors.
Predictive Models
How might we use our newfound information to look forward rather than backward? For example, can we predict whether an employee is likely to leave employment based on the criteria we have in his or her employment record? First, we will use the GLM model to make our predictions. Since we have already split our data into training(known) and testing portions, we can run our model and compare our predictions to the actual outcome data in the test dataset. Then we can calculate the accuracy of our model against this real data.
Generalized Linear Model (GLM)
# GLM via caret package - forward look - predictive model
cctrl2 <- trainControl(method = "LOOCV",
                       classProbs = TRUE,
                       summaryFunction = twoClassSummary)

m4_glm <- train(attrit ~ ., data = hr_known, 
                            method = "glm", 
                            trControl = cctrl2,
                            metric = "ROC", 
                            preProc = c("center", "scale"))

pred_glm <- predict(m4_glm, newdata = hr_test)
cf_mat_glm <- confusionMatrix(pred_glm, hr_test$attrit)
## [1] "GLM Accuracy is 87.2 percent"
We have achieved 87.2 percent accuracy using our GLM model. Might we achieve greater accuracy by making a different model with another algorithm? We will try two other models to test this.
Random Forest (RF)
# Random Forest
cctrl2 <- trainControl(method = "LOOCV",
                       classProbs = TRUE,
                       summaryFunction = twoClassSummary)


m1_rf <- train(attrit ~ ., data = hr_known, 
                            method = "rf", 
                            trControl = cctrl2,
                            metric = "ROC", 
                            preProc = c("center", "scale"),
                            ntree = 20,
                            importance = FALSE)


pred_rf <- predict(m1_rf, newdata = hr_test)
cf_mat_rf <- confusionMatrix(pred_rf, hr_test$attrit)
paste("Random Forest Accuracy is", round(cf_mat_rf$overall[[1]],3)*100, "percent")
## [1] "Random Forest Accuracy is 85.3 percent"
XGBoost (XGB)
# XGBoost
xgbGrid <- expand.grid(nrounds = c(1, 10),
                       max_depth = c(1, 4),
                       eta = c(.1, .4),
                       gamma = 0,
                       colsample_bytree = .7,
                       min_child_weight = 1,
                       subsample = c(.8, 1))

cctrl2 <- trainControl(method = "LOOCV",
                       classProbs = TRUE,
                       summaryFunction = twoClassSummary)

m3_xgb <- train(attrit ~ ., data = hr_known, 
                            method = "xgbTree", 
                            trControl = cctrl2,
                            metric = "ROC", 
                            preProc = c("center", "scale"),
                            tuneGrid = xgbGrid)

pred_xgb <- predict(m3_xgb, newdata = hr_test)
cf_mat_xgb <- confusionMatrix(pred_xgb, hr_test$attrit)
paste("XGBoost Accuracy is", round(cf_mat_xgb$overall[[1]],3)*100, "percent")
## [1] "XGBoost Accuracy is 84.7 percent"
Attrition prediction for single employee
We can now use our models to predict whether a single employee or a group of employees is likely to leave the company. We can look at an employee who has the employment data we used in our study. We will name the employee Employee1 (or emp1 for coding purposes)
The employee has the following data:
Business Travel: Rare
Department: Sales
Education Field: Life Sciences
Gender: Female
Job Role: Executive
Marital Status: Single
OverTime: Yes
Education: 2
Environment Satisfaction: 2
Job Involvement: 3
Job Level: 2
Job Satisfaction: 4

Performance Rating: 3
Relationship Satisfaction: 1
Stock Option Level: 0
Training Times Last Year: 0
Work Life Balance: 1
Age: 41
Hourly Rate: 94
Daily Rate: 1102
Monthly Rate: high
Monthly Income: 5993
Distance From Home: near
Employee Number: 1
Number Companies Worked: many Percent Salary Hike: 11 Years At Company: 6
Years In Current Role: 4
YearsSince Last Promotion: short
Years With Current Manager: 5

Using these criteria, we can make the following prediction:
pred_emp1 <- predict(m4_glm, newdata = emp1)
pred_emp1
## [1] Yes
## Levels: No Yes
## [1] "Is this employee likely to leave? ---> Yes"

No comments:

Post a Comment