Sunday, June 17, 2018

Does Obesity Cause Hypertension?

Asking a blunt question such as the title of this blog post is something you might be surprised to know is rather rare in epidemiologic studies. Most of the time, authors ask research questions looking for associations between suspect exposure variables and disease outcomes. Well, what’s wrong with that? We all know that association doesn’t imply causation, right? 
Well, that’s true when implemented in many published studies. You may feel unsatisfied after reading a paper that offers you a statistically “significant” association between an exposure (or treatment) and an outcome. Associations are touted all the time in news reports and advertising as causes of, links to, and connections between some exposure and a disease, or other outcome. 
But we have the Hill Criteria, nine “viewpoints” to help determine if observed epidemiologic associations are causal. These were were advanced during the 1960’s when tobacco companies were vigorously denying any causal relationship between smoking and lung cancer.
While helpful in a rhetorical sense, there is no rigor to the Hill Criteria and they can be manipulated to the point that they become tools of any clever sophist. 
Let’s say we are studying potential causes of hypertension. In particular we are interested in determining if obesity can cause high blood pressure. We think that there are other potential causes of hypertension such as gender and age, socioeconomic status, genetic causes associated with racial origin, educational status, and other unmeasurable potential causes. We want to know whether something causes hypertension since we can’t design a reasonable intervention to affect it (drug, diet, activity, etc.), if we can’t say with reasonable assurance that it is caused by the thing we are planning to intervene on. 
Using a directed acyclic graph, we can visualise a causal relation between obesity and hypertension(HBP). DAG’s constructed using Dagitty.

Proposed causal effect of obesity on hypertension
Proposed causal effect of obesity on hypertension
We can propose other biologically plausible causes of hypertension
We can propose other biologically plausible causes of hypertension
All of the red arrows indicate open paths of association (confounding) between other unmeasured exposures and the outcome, hypertension. If we were able to “set” the level of obesity exposure to a particular value (obese or not obese) then we would be removing the conditional dependence of obesity on its parents (low activity, genetics and high calorie diet). We would then have a directed acyclic graph that looked like this:
Please read Pearl for the theory on this subject.
This is what we are trying to do in a randomized experiment where the treatments (exposures), are randomly assigned. This removes the confounding spurious associations between obesity and hypertension. Any effect of obesity on hypertension would now be due only to obesity via the direct causal path between the two. For the obvious ethical and practical reasons, we can’t perform an experiment like this. There is however, a vast amount of observational data stored away in government databases, hospital records, and disease and device registries that could be used to infer causal relationships if approached in a way that isolates the exposure of interest from confounding covariates.
For example, the US National Health and Nutrition Examination Study (NHANES) compiles yearly data on a cohort of adults and children via questionnaires, physical examination and laboratory studies. I will select the data from the 2009 study and download the data on body mass (BMI) and blood pressure.
I am going to deal with missing data by assuming that all missing values are missing at random, and will use listwise deletion to deal with the NA’s. I could have used an imputation method (see Random Imputation in my January post), but this would have made the blog post much longer.
library(nhanesA)
library(NHANES) 
data(NHANES) # get the NHANES data
nhanes <- NHANES # construct a data frame
rm(NHANES) # clear the download
nhanes1 <- unique(nhanes) #remove duplicate ID's
We’ll confine the data to 2009-2010 survey period.
#select the 2009 tables
nhanesTables('Q', 2009)
nhanesTables('EXAM', 2009)
nhanesTables('DEMO', 2009)
bp_x <- nhanes("BPX_F")
write.csv(bp_x, "bp_x.csv") #necessary step to remove column labels - write to file and read back
bp <- read.csv("bp_x.csv")
bp <- bp[, -1] #removes the first column inserted by write function above
sys <- data.frame("id" = bp$SEQN, "sys" = bp$BPXSY1) #systolic blood pressure
dias <- data.frame("id" = bp$SEQN, "dias" = bp$BPXDI1) #diastolic blood pressure
nas <- which(is.na(sys$sys)) #remove NA's
sys <- sys[-nas, ]
nas <- which(is.na(dias$dias))
dias <- dias[-nas, ]
bp_tot <- cbind(sys, dias) #bind sys and dias
bp_tot <- bp_tot[, -3] #remove duplicated column
bp_tot$bp <- ifelse(bp_tot$sys > 130 | bp_tot$dias > 80, 1, 0) #catregorize blood pressure - ACA guidelines
bp_tot$id <- as.character(bp_tot$id) # for future join
#BMI
bmi <- nhanes("BMX_F")
## Processing SAS dataset BMX_F      ..
write.csv(bmi, "bmi.csv")
bmi <- read.csv("bmi.csv")
bmi <- data.frame("id" = bmi$SEQN, "bmi" = bmi$BMXBMI)
nas <- which(is.na(bmi$bmi))
bmi <- bmi[-nas, ]
bmi$id <- as.character(bmi$id)
Join the blood pressure and BMI data
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 3.5.1
## Warning: package 'dplyr' was built under R version 3.5.1
hbp <- left_join(bp_tot, bmi, by = "id")
hbp <- na.omit(hbp) #remove NA's
glimpse(hbp)
## Observations: 7,461
## Variables: 5
## $ id   <chr> "51624", "51626", "51627", "51628", "51629", "51630", "51...
## $ sys  <int> 114, 112, 92, 154, 102, 118, 106, 142, 94, 126, 84, 122, ...
## $ dias <int> 88, 62, 36, 70, 50, 82, 60, 62, 38, 62, 50, 76, 12, 68, 4...
## $ bp   <dbl> 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, ...
## $ bmi  <dbl> 32.22, 22.00, 18.22, 42.39, 32.61, 30.57, 13.21, 26.04, 2...
Categorize BMI using WHO guidelines
hbp <- hbp %>% mutate(obese = ifelse(hbp$bmi > 30.00, 1, 0))
For purposes of brevity I will not include the steps to download from NHANES and clean the rest of the data. We will get NHANES data on obesity, gender, race, income, education, blodd pressure and age. After downloading all sections, cleaning the data and renaming the variables, the final tidy data frame is ready to analyze.
## Observations: 4,874
## Variables: 8
## $ id     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, ...
## $ bp     <int> 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1...
## $ obese  <int> 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0...
## $ male   <int> 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0...
## $ black  <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ poor   <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0...
## $ low_ed <int> 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0...
## $ old    <int> 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0...
Notice that we have a tidy data frame with obese present as the treatment (as in a randomized study and bp (blood pressure - has hypertension or not) as the outcome. I will be using the terms treatment and exposure interchangeably. 
Since we are trying to approach the elimination of selection bias that is achievable in a randomized study, an important assumption that needs to be addressed is the conditional independence assumption. That is, given pretreatment covariates X, outcomes Y0, and Y1 (outcomes under no treatment (Y0) and treatment (Y1)), and treatment A, treatment assignment is independent from the potential outcomes. 
(Y0,Y1 A | X 
We are trying to get to the status of a randomized study in order to eliminate as far as possible the conditional dependence of the treatment on any backdoor confounders. We would like to be able to ignore the treatment assignment. The “treatment assignment” can be seen in a mosaic graph of the exposure(treatment) and the outcome. If we have independence of the outcome and treatment, we should see a mosaic consisting of four equal area squares in a 2x2 matrix. In our data there is assymetry in the squares, indicating selection bias in our sample. So we cannot ignore the treatment “assignment” for this data.
If we could achieve balance in a subset of our data, the conditional independence assumption would hold. If the ignorability assumption holds, the average causal effect (ACE) is simply the difference in mean effect in each outcome group (Y0 - Y1). 
The ACE given the set of potential confounding covariates (male, black, poor, old, low_ed) equals the mean difference in the outcome (bp), given the covariates and the treatment for each group (obese vs. non-obese) 
ACE|X = E[Y1|X, A=1] - E[Y0, A=0] 
Treatment assignment becomes irrelevant if we can approach the balance of a study where treatment is randomly assigned. Thus: 
ACE|X = E[Y1|X] - E[Y0|X] rearranging: 
ACE|X = E[Y1 - Y0|X] 
One method (among others) to approximate this condition is to match the subjects based on the probability (the propensity) of having the treatment or not having the treatment. Rubin 
The Propensity Score Theorem is a corollary to the Conditional Independence Assumption: the potential values of the outcome variables are conditionally independent of the treatment when we condition on the list of important covariates. (Y0,Y1 A | X 
The Propensity Score Theorem states that the outcome variables (Y1 and Y0) are conditionally independent of the treatment if we condition on the propensity of each individual to receive treatment given the list of covariates. In order to constrain the probability to between zero and one, we use the logit link, hence a generalized linear model(glm). So if we match the two outcome groups (Y1 and Y0) on the propensity scores of each individual, we can ignore the list of covariates when looking at our causal graph. It is now possible to make a convincing causal statement about our treatment in question (obesity) and calculating its causal effect (ACE). 
For proof see: Pellizzari

So we will make our propensity score model based on probability of treatment (exposure) “assignment”.
psmodel <- glm(obese ~ bp + male + black +
                 poor + low_ed + old,
               family = "binomial",
               data = hbp15)
pscore <- psmodel$fitted.values
First five P-scores:
##     1     2     3     4     5 
## 0.395 0.573 0.308 0.425 0.368
The results of propensity score matching can best be seen graphically. The p-score density plots in the treated and untreated groups show a easily visualised structural difference .
hbp15$pscore <- pscore
ps <- hbp15 %>% dplyr::select(obese, pscore)
ps_0 <- ps %>% filter(obese == 0)
ps_1 <- ps %>% filter(obese == 1)
ggplot() +
  geom_density(aes(x = pscore),
              color = "blue",
               data = ps_0) +
  geom_density(aes(x = pscore),
               color = "red",
               data = ps_1) +
  xlim(.22, .65) +
  labs(title = "Propensity Scores In Obesity Exposure Groups Before Matching",
       subtitle = "Blue = Unexposed, Red = Exposed",
       x = "P-score",
       y = "Density") +
  theme(plot.title = element_text(size = 10, face = "bold"))
 
Now the treatment groups will be matched by propensity score.
library(Matching)
psmatch <- Match(Tr = hbp15$obese,
                 M = 1, 
                 X = pscore,
                 replace = FALSE)
matched <- hbp15[unlist(psmatch[c("index.treated",
                                "index.control")]),]
The completeness of the matching process can be determined by calculating the Standardized Matching Difference (SMD). It should be less than 0.10 to demonstrate adequate matching. 
Bedfore Matching:
#include all vars in xvars except treatment 
xvars <- c("bp","male","black", 
           "poor","low_ed","old","pscore")
library(tableone)
matchedtab0 <- CreateTableOne(vars = xvars,
                              strata = "obese",
                              data = hbp15,
                              test = FALSE)
print(matchedtab0, smd = TRUE)
##                     Stratified by obese
##                      0           1           SMD   
##   n                  3049        1825              
##   bp (mean (sd))     0.34 (0.47) 0.43 (0.50)  0.200
##   male (mean (sd))   0.51 (0.50) 0.46 (0.50)  0.099
##   black (mean (sd))  0.14 (0.35) 0.21 (0.41)  0.174
##   poor (mean (sd))   0.23 (0.42) 0.24 (0.43)  0.026
##   low_ed (mean (sd)) 0.49 (0.50) 0.54 (0.50)  0.097
##   old (mean (sd))    0.36 (0.48) 0.40 (0.49)  0.081
##   pscore (mean (sd)) 0.37 (0.07) 0.39 (0.07)  0.296

After matching:
matchedtab1 <- CreateTableOne(vars = xvars,
                              strata = "obese",
                              data = matched,
                              test = FALSE)
print(matchedtab1, smd = TRUE)
##                     Stratified by obese
##                      0           1           SMD   
##   n                  1825        1825              
##   bp (mean (sd))     0.44 (0.50) 0.43 (0.50)  0.007
##   male (mean (sd))   0.47 (0.50) 0.46 (0.50)  0.021
##   black (mean (sd))  0.20 (0.40) 0.21 (0.41)  0.015
##   poor (mean (sd))   0.22 (0.41) 0.24 (0.43)  0.049
##   low_ed (mean (sd)) 0.55 (0.50) 0.54 (0.50)  0.019
##   old (mean (sd))    0.41 (0.49) 0.40 (0.49)  0.029
##   pscore (mean (sd)) 0.39 (0.07) 0.39 (0.07)  0.009
There has been a marked decrease of SMD’s to below 0.10
#demonstrate the confounding after matching
ps <- matched %>% dplyr::select(obese, pscore)
ps_0 <- ps %>% filter(obese == 0)
ps_1 <- ps %>% filter(obese == 1)
ggplot() +
  geom_density(aes(x = pscore),
               color = "blue",
               data = ps_0) +
  geom_density(aes(x = pscore),
               color = "red",
               data = ps_1) +
  xlim(.22, .65) +
  labs(title = "Propensity Scores In Obesity Exposure Groups After Matching",
       subtitle = "Blue = Unexposed, Red = Exposed",
       x = "P-score",
       y = "Density") +
   theme(plot.title = element_text(size = 10, face = "bold"))
There is an almost exact match of the exposure groups. 
Outcome Analysis
We can now treat our exposure(obesity) as being free of conditional dependence on any of the covariates (as if this was a randomized trial). We can use a glm model to estimate the causal effect size of obesity on hypertension.
#Outcome analysis - use regression for effect size
#exponentiate to change log-odds to odds ratio
library(broom)
## Warning: package 'broom' was built under R version 3.5.1
set.seed(123)
m1 <- glm(bp ~ obese + male + black + poor + low_ed + old,
          family = "binomial",
          data = matched)
or <- broom::tidy(exp(round(coef(m1), 3)))
ci <- exp(confint(m1))

## [1] "Odds ratio of obese =  0.998"
## [1] "2.5% Confidence Limit of odds ratio =  0.872"
## [1] "97.5% Confidence Limit of odds ratio =  1.143"
Since the confidence interval of the obesity odds ratio spans 1.00, we can say that obesity has no demonstrated causal effect on hypertension. 
This is as close as we will get to a randomized trial.
Does our mosaic graph look more symmetrical?
The squares are now almost, but not quite symmetrical. They do lend support for showing that our matching has made a strong advance in supporting our ability to accept the Conditional Independence Assumption 
So what was the price we paid for our increase in accuracy and plausibility?
#How much data we lost by matching
id_exp0_ante <- hbp15 %>% filter(obese == 0)
id_exp1_ante <- hbp15 %>% filter(obese == 1)
id_exp0_post <- matched %>% filter(obese == 0)
id_exp1_post <- matched %>% filter(obese == 1)
lost_treated <- nrow(id_exp1_ante) - nrow(id_exp1_post)
lost_controls <- nrow(id_exp0_ante) - nrow(id_exp0_post)
## [1] "Number of lost obese subjects =  0"
## [1] "Number of lost non-obese subjects(controls) =  1224"
We lost no treated subjects, but we discarded the data from unmatched controls. I think that this is a small sacrifice to allow us to use the same type of reasoning involved with a randomized trial to study the vast amount of data available in observational datasets.