Ozone Production - Data Exploration and Analysis

Data Exploration:


Well, that was a lot of work, but we now have a tidy dataset to work with. The first step is to see what interesting relationships exist among the numerical variables using a pairs plot. ggpairs() needs 1d atomic vectors to work with, and since scaling has attached attributes to the numerical variables, we have to remove them (1d atomic vectors have no attributes except names).
temp <- air3[, c(2:13)]             # store the numerical variables
temp <- data.frame(as.matrix(temp)) # a way to remove attributes
air4 <- air3[, -c(2:13)]            # start a new data frame
air4 <- bind_cols(air4, temp)       # attach the numerical variables
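As an aside, a minimal alternative sketch that skips the matrix round trip: as.numeric() strips every attribute from a vector, and assigning through temp[] keeps the data frame shape.
temp[] <- lapply(temp, as.numeric) # same effect: plain numeric columns, no attributes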

********************************************************


Instead of re-computing the pairs plot, I copied it as a .png and printed it below. It takes about two minutes to process the plot, so you don’t have to wait for it. You can remove the #’s from the code below if you want to compute it.
#ggpairs(temp,
#        lower = list(continuous = wrap("points",
#                                       alpha = 0.004,
#                                       color = "blue",
#                                       size = .5)))

********************************************************


I’m going to concentrate on ozone as our outcome variable. Let’s see how ozone concentration varies with the month.
ggplot(air4) +
  geom_boxplot(aes(x = month, y = exp(lo3))) +
  xlab("Month") +
  ylab("Ozone Concentration")
Ozone concentration is significantly decreased in the eighth month (August). This makes sense, since everyone in Europe is on vacation in August.

********************************************************


Let’s see how ozone varies by the day of the week.
o3_day <- air4 %>%
          group_by(dayofweek) %>% 
          summarize(O3 = exp(mean(lo3)))

ggplot(o3_day) +
  geom_bar(aes(x = dayofweek, y = O3),
           stat = "identity",
           width = .50,
           fill = "#538195") +
  xlab("Day of the Week") +
  ylab("Ozone Concentration")
Ozone decreases on the weekend when there is less driving and manufacturing activity. It increases during the middle of the week.

********************************************************

Now let’s look at benzene.
ggplot(air4) +
  geom_boxplot(aes(x = month, y = exp(lbenz))) +
  xlab("Month") +
  ylab("Benzene Concentration")
Benzene shows a similar monthly pattern.
Now, by the day:
benz_day <- air4 %>%
          group_by(dayofweek) %>% 
          summarize(benzene = exp(mean(lbenz)))

ggplot(benz_day) +
   geom_bar(aes(x = dayofweek, y = benzene),
            stat = "identity",
            width = .50,
            fill = "#538195") +
   xlab("Day of the Week")
There is a similar daily pattern for benzene.

********************************************************

Another reasonable step would be to form an explanatory model for ozone concentration, but first we need to see if there is any autocorrelation to deal with. We’ll determine the autocorrelation for the ozone itself: we can exponentiate back the scaled ozone concentration and form a time series by the day.
#extract the relevant variables
oz <- air4 %>% dplyr::select(day, lo3) %>% mutate(oz_conc = exp(lo3))
#group and summarize by mean O3 concentration
oz_conc <- oz %>% group_by(day) %>% summarize(mean(oz_conc))
#create a time series
oz_ts <- ts(oz_conc[, 2], start = 1, end = 389)
#autocorrelation plot
acf(oz_ts)
#partial autocorrelation plot
pacf(oz_ts)
I interpreted this to mean that mean ozone concentration has a daily autoregressive AR(2) pattern: the ACF tails off gradually while the PACF cuts off after the second lag, which is the classic AR(2) signature. We will need this shortly.
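As a quick sanity check (a sketch, assuming the oz_ts series created above), we can fit an AR(2) directly and confirm that both autoregressive coefficients are well away from zero:
#optional check: fit an AR(2) to the daily series and inspect the coefficients
ar2 <- arima(as.numeric(oz_ts), order = c(2, 0, 0))
ar2 #both AR coefficients should be clearly nonzero if AR(2) is reasonable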
Also, we need information on the variance structure of the data. A generalized least squares model allows us to use the correlation structure and deal with heterogeneity. First we fit an ordinary least squares model to obtain variance information by plotting the residuals against the fitted values.
m1 <- lm(lo3 ~ tmp + no2 + lah + rh + lcogt + lcopt + lbenz + lnmhcpt + lnoxgt + lno2pt + month, data = air4)
air5 <- air4
air5$residuals <- residuals(m1)
air5$fitted <- fitted(m1)
ggplot(air5) + 
  geom_point(aes(x = fitted, y = residuals))
There might be some degree of heterogeneity, but we can check each pollutant variable against the residuals.
air6 <- air5[, c(6:18)]
air6 <- air6 %>% gather(pollutants, concentration, -residuals)
ggplot(air6) +
  geom_point(aes(x = concentration, y = residuals)) +
  facet_wrap(~ pollutants)
These all look like fairly amorphous blobs without any definite heterogeneous structure, so we don’t have to include a variance structure for the pollutants.
In addition, we need to know more about how the residuals vary by time, in this case by the month.
ggplot(air5) +
  geom_boxplot(aes(x = month, y = residuals))
The residuals do have some variation by month, so we can also deal with this in the model.
A generalized least squares model can address the changing variance structure (with month) and the autoregressive pattern (with hour). I will pare the dataset down to the first 70 days to keep the computing time reasonable (about 20 minutes on my system). We’ll make three models: first, a plain gls model, which is the equivalent of an lm (ordinary least squares) model; next, a gls model that addresses the changing monthly variance; and finally, a third model that also addresses the temporal autocorrelation.
seventyday <- air5 %>% dplyr::filter(day <= 70)
m1 <- gls(lo3 ~ tmp + lah + no2 + rh + lcogt + lcopt + lbenz + lnmhcpt + lnoxgt + lno2pt + month,
          data = seventyday)
m2 <- gls(lo3 ~ tmp + rh + lah + no2 + lcogt +
            lcopt + lbenz + lnmhcpt + lnoxgt +
            lnoxpt + lno2pt,
          data = seventyday,
          weights = varIdent(form = ~ 1 | month))
Warning: this model (m3) takes a while to run (about 15 minutes on my system).
#add a variable for each hour of observation
seventyday$hour <- seq(1, nrow(seventyday))
m3 <- gls(lo3 ~ tmp + rh + lah + lcogt +
            lcopt + lbenz + lnmhcpt + lnoxgt +
            lnoxpt + lno2pt + no2,
          data = seventyday,
          weights = varIdent(form = ~ 1 | month),
          correlation = corARMA(p = 2, form = ~ hour))
AIC(m1, m2, m3)
##    df       AIC
## m1 14  1551.194
## m2 15  1369.708
## m3 17 -1197.961
So m3, which has the lowest AIC, is the preferred model. This shows that accounting for heterogeneity and temporal autocorrelation yields a better-fitting model.
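Since m2 and m3 share the same fixed effects, a likelihood ratio test between them is also valid under REML (m1 has a different fixed-effects structure, so it shouldn’t be compared this way); a quick sketch:
anova(m2, m3)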
summary(m3)
## Generalized least squares fit by REML
##   Model: lo3 ~ tmp + rh + lah + lcogt + lcopt + lbenz + lnmhcpt + lnoxgt +      lnoxpt + lno2pt + no2 
##   Data: seventyday 
##         AIC       BIC   logLik
##   -1197.961 -1105.832 615.9805
## 
## Correlation Structure: ARMA(2,0)
##  Formula: ~hour 
##  Parameter estimate(s):
##      Phi1      Phi2 
##  1.072019 -0.179571 
## Variance function:
##  Structure: Different standard deviations per stratum
##  Formula: ~1 | month 
##  Parameter estimates:
##        3        4        5 
## 1.000000 0.916434 1.021868 
## 
## Coefficients:
##                  Value  Std.Error    t-value p-value
## (Intercept)  0.4534933 0.04908615   9.238723  0.0000
## tmp         -0.0237906 0.03726585  -0.638401  0.5233
## rh          -0.0197987 0.02441825  -0.810817  0.4176
## lah          0.0682283 0.04595007   1.484837  0.1378
## lcogt       -0.0285697 0.01199214  -2.382364  0.0173
## lcopt        0.2334718 0.02803608   8.327548  0.0000
## lbenz        0.1784954 0.06537455   2.730350  0.0064
## lnmhcpt     -0.2918614 0.09987381  -2.922301  0.0035
## lnoxgt       0.0532073 0.00694660   7.659472  0.0000
## lnoxpt      -0.9426631 0.05106855 -18.458782  0.0000
## lno2pt      -0.1753994 0.04279561  -4.098537  0.0000
## no2          0.1584252 0.01423643  11.128156  0.0000
## 
##  Correlation: 
##         (Intr) tmp    rh     lah    lcogt  lcopt  lbenz  lnmhcp lnoxgt
## tmp     -0.126                                                        
## rh      -0.188  0.676                                                 
## lah      0.224 -0.058 -0.262                                          
## lcogt   -0.052  0.056 -0.002  0.033                                   
## lcopt   -0.119  0.064 -0.026 -0.322 -0.110                            
## lbenz   -0.215  0.375  0.212 -0.497  0.028  0.289                     
## lnmhcpt  0.090 -0.119  0.085  0.500 -0.064 -0.428 -0.879              
## lnoxgt   0.077 -0.036 -0.061  0.075 -0.079 -0.131 -0.118  0.062       
## lnoxpt  -0.462  0.359  0.451  0.131  0.006 -0.067 -0.125  0.468  0.045
## lno2pt  -0.426  0.065 -0.053 -0.284  0.048 -0.101  0.560 -0.556  0.037
## no2      0.081 -0.055 -0.011  0.030 -0.487 -0.105 -0.094  0.074  0.370
##         lnoxpt lno2pt
## tmp                  
## rh                   
## lah                  
## lcogt                
## lcopt                
## lbenz                
## lnmhcpt              
## lnoxgt               
## lnoxpt               
## lno2pt   0.106       
## no2      0.042  0.039
## 
## Standardized residuals:
##         Min          Q1         Med          Q3         Max 
## -2.71437363 -0.61660816  0.02373259  0.67835571  2.53653802 
## 
## Residual standard error: 0.4143045 
## Degrees of freedom: 1680 total; 1668 residual
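A follow-up check worth running (a minimal sketch): if the corARMA(p = 2) term captured the temporal structure, the ACF of the normalized residuals should show little remaining autocorrelation.
#the normalized residuals account for the fitted correlation structure
acf(residuals(m3, type = "normalized"))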

************************************************

Analysis

Well, now we have a list of atmospheric pollutants that have strong associations with ozone concentration. But all we have is associations. What if we want to intervene somehow to limit the degree of ozone pollution in the urban atmosphere? We would then need convincing evidence for a causal relationship between one or more of these pollutants and ozone. We can make this kind of causal inference if we can accomplish several objectives:
  1. Select a proposed causal variable for the outcome variable. In our case we can ask the question: “Does increasing NO2 concentration cause ozone concentration to rise above potentially toxic levels?” NO2 is my choice as a causal variable since NO2 is a very reactive molecule compared to the other variables in the list.
  2. Eliminate the backdoor paths to the causal variable. We will do this by “setting” the causal variable (no2 - NO2 concentration) at a level of 100 µg/m3. The WHO recommends limiting exposures to less than 40 µg/m3, so setting our toxic level at 100 should be enough to produce a significant public health hazard.
#dividing the dataset based on the WHO-recommended NO2 concentration level
#go back to the original dataset before scaling

ggplot(air2) +
  geom_density(aes(x = no2)) +
  geom_vline(aes(xintercept = 100), color = "red")
air2 <- air2 %>% mutate(toxno2 = factor(if_else(no2 >= 100, 1, 0)))
Next, we separate the exposure groups based on propensity scoring. This is done to approximate the controlled exposure of a randomized experiment. The exposure (toxno2 = 1) is the “treatment”. The propensity (pscore) is the probability that each observation receives the treatment rather than the “control” (toxno2 = 0). The propensity score is denoted for each subject by:
πi = P(A=1|Xi)

If a subject has a propensity score of 0.3, then given their particular covariate values, there is a 30% chance they will be treated. The propensity score is a balancing score: if two subjects have the same propensity score but different sets of covariates, they are equally likely to be treated. If we restrict our analysis to a subpopulation of subjects who have the same propensity score values, then there should be balance between the two treatment groups. In other words, the distribution of covariates in the treated group should be the same as the distribution of covariates in the control group. The propensity score is an allocation probability, which would be known in a randomized trial:
P(A=1|X)=P(A=1)=0.5
for example. In an observational study such as this one it is unknown, but, since it involves observed data (A and X), we can estimate it. To estimate P(A=1|X):
  1. Make the treatment binary (i.e., set a cutoff point).
  2. Fit a logistic regression model: outcome (treatment) A, covariates X.
  3. From that model, get the predicted probability (fitted value) for each subject; this is the propensity score.
psmodel <- glm(toxno2 ~ lo3 + tmp + rh + lah + lcogt + lcopt + lbenz + lnmhcpt + lnoxgt + lnoxpt,
            family = "binomial",
            data = air2)
pscore <- psmodel$fitted.values 
air2$pscore <- pscore
Next, see how well the treated and untreated groups match up.
#extract the data and make a new data frame
ps <- air2 %>% dplyr::select(toxno2, pscore)

ps_0 <- data_frame("toxno2" = ps$toxno2[ps$toxno2 == "0"],
                   "pscore" = ps$pscore[ps$toxno2 == "0"])
ps_1 <- data_frame("toxno2" = ps$toxno2[ps$toxno2 == "1"],
                   "pscore" = ps$pscore[ps$toxno2 == "1"])
ggplot() +
  geom_density(aes(x = pscore),
               color = "blue",
               data = ps_0) +
  geom_density(aes(x = pscore),
               color = "red",
               data = ps_1) +
  xlim(-.20, 1.3)
We have to match these up better to compare.
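As an aside, the same comparison could be drawn more compactly by mapping toxno2 to color instead of building two separate data frames; a minimal equivalent sketch using the ps data frame created above:
ggplot(ps) +
  geom_density(aes(x = pscore, color = toxno2)) +
  scale_color_manual(values = c("0" = "blue", "1" = "red")) +
  xlim(-.20, 1.3)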
#change toxno2 to integer
set.seed(123)
air2$toxno2 <- as.integer(as.character(air2$toxno2))
psmatch <- Matching::Match(Tr = air2$toxno2,
                 M = 1, 
                 X = pscore,
                 replace = FALSE,
                 caliper = 0.2)

matched <- air2[unlist(psmatch[c("index.treated",
                                 "index.control")]),]
xvars <- c("rh", "tmp", "lcogt", "lcopt", "lbenz", "lnmhcpt", "lnoxgt", "lnoxpt", "lo3", "lah")
matchedtab <- tableone::CreateTableOne(vars = xvars,
                              strata = "toxno2",
                              data = matched,
                              test = FALSE)
getsmd <- function(df) { 
  options(digits = 3)
  print("SMD")
  x <- tableone::ExtractSmd(df)
  print(x)
  paste("sum of SMD's = ", round(sum(x[1:length(x)]), 3))
}
getsmd(matchedtab)
## [1] "SMD"
##         1 vs 2
## rh      0.0748
## tmp     0.1121
## lcogt   0.2720
## lcopt   0.3086
## lbenz   0.2479
## lnmhcpt 0.2668
## lnoxgt  0.2012
## lnoxpt  0.0381
## lo3     0.1915
## lah     0.1120
## [1] "sum of SMD's =  1.825"
I tried the matching with several calipers ranging from 0.1 to 0.5; caliper = 0.2 gave the lowest SMD sum I could get. Ideally, all the SMDs (standardized mean differences) should be 0.1 or less, so we haven’t gotten a perfect match, but it’s not too bad. What we are trying to do is approximate a randomized experiment with respect to the measured confounders.
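The caliper search itself was done by hand; a minimal sketch of how it could be automated, reusing the air2, pscore, and xvars objects defined above:
#sketch: compare SMD sums across candidate calipers
for (cal in c(0.1, 0.2, 0.3, 0.4, 0.5)) {
  pm  <- Matching::Match(Tr = air2$toxno2, M = 1, X = pscore,
                         replace = FALSE, caliper = cal)
  mt  <- air2[unlist(pm[c("index.treated", "index.control")]), ]
  tab <- tableone::CreateTableOne(vars = xvars, strata = "toxno2",
                                  data = mt, test = FALSE)
  cat("caliper =", cal, "SMD sum =",
      round(sum(tableone::ExtractSmd(tab)), 3), "\n")
}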
Please see the excellent Coursera course about causal inference for a discussion of propensity score matching.
#Outcome analysis - paired t-test
y_trt <- matched$lo3[matched$toxno2==1]
y_con <- matched$lo3[matched$toxno2==0]
diffy <- y_trt - y_con
t.test(diffy)
## 
##  One Sample t-test
## 
## data:  diffy
## t = 5, df = 1000, p-value = 6e-08
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  0.0365 0.0776
## sample estimates:
## mean of x 
##     0.057
So the difference between the toxic NO2 group (the treatment group) and the non-toxic group (the control group) is significantly different from zero.
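Since lo3 is on a log scale (the earlier exp() back-transformations treat it as the log of concentration), the paired mean difference can also be read as a ratio; a small sketch under that assumption:
#the paired difference of logs is a log ratio; exponentiate to get a
#multiplicative effect (assumes lo3 = log of ozone concentration)
exp(mean(diffy))             #point estimate, about exp(0.057) ≈ 1.06
exp(t.test(diffy)$conf.int)  #95% CI on the ratio scale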

***********************************************

Let’s see how well we have matched.
ps <- matched %>% dplyr::select(toxno2, pscore)

ps_0 <- data_frame("toxno2" = matched$toxno2[matched$toxno2 == "0"],
                   "pscore" = matched$pscore[matched$toxno2 == "0"])
ps_1 <- data_frame("toxno2" = matched$toxno2[matched$toxno2 == "1"],
                   "pscore" = matched$pscore[matched$toxno2 == "1"])
ggplot() +
  geom_density(aes(x = pscore),
               color = "blue",
               data = ps_0) +
  geom_density(aes(x = pscore),
               color = "red",
               data = ps_1) +
  xlim(-.20, 1.3)
Looks like a much better match!
So, does NO2 cause ozone? Well, I think we have some reason to say there might be some chemical mechanism occurring in the atmosphere between NO2 and ozone that results in increased ozone production. This kind of evidence could spark some interest in research that investigates such a mechanism. Also, it could justify some type of intervention that would limit NO2 production as a component of environmental policy.
sessionInfo()
## R version 3.5.0 (2018-04-23)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows >= 8 x64 (build 9200)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base
## 
## other attached packages:
##  [1] bindrcpp_0.2.2  tableone_0.9.3  psych_1.8.4     Matching_4.9-3
##  [5] MatchIt_3.0.2   GGally_1.4.0    nlme_3.1-137    MASS_7.3-49
##  [9] anytime_0.3.1   lubridate_1.7.4 forcats_0.3.0   stringr_1.3.1
## [13] dplyr_0.7.6     purrr_0.2.5     readr_1.1.1     tidyr_0.8.1
## [17] tibble_1.4.2    ggplot2_3.0.0   tidyverse_1.2.1
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.17       lattice_0.20-35    assertthat_0.2.0   rprojroot_1.3-2
##  [5] digest_0.6.15      R6_2.2.2           cellranger_1.1.0   plyr_1.8.4
##  [9] backports_1.1.2    RApiDatetime_0.0.3 survey_3.33-2      evaluate_0.10.1
## [13] httr_1.3.1         pillar_1.2.3       rlang_0.2.1        lazyeval_0.2.1
## [17] readxl_1.1.0       rstudioapi_0.7     Matrix_1.2-14      rmarkdown_1.10
## [21] labeling_0.3       splines_3.5.0      foreign_0.8-70     munsell_0.5.0
## [25] broom_0.4.5        compiler_3.5.0     modelr_0.1.2       pkgconfig_2.0.1
## [29] mnormt_1.5-5       htmltools_0.3.6    tidyselect_0.2.4   reshape_0.8.7
## [33] crayon_1.3.4       withr_2.1.2        grid_3.5.0         jsonlite_1.5
## [37] gtable_0.2.0       magrittr_1.5       scales_0.5.0       cli_1.0.0
## [41] stringi_1.2.3      reshape2_1.4.3     xml2_1.2.0         RColorBrewer_1.1-2
## [45] tools_3.5.0        glue_1.2.0         hms_0.4.2          survival_2.41-3
## [49] rsconnect_0.8.8    parallel_3.5.0     yaml_2.1.19        colorspace_1.3-2
## [53] rvest_0.3.2        knitr_1.20         bindr_0.1.1        haven_1.1.2
