Environ Ecol Stat (2009) 16:75–87 DOI 10.1007/s10651-007-0050-z
Using data augmentation via the Gibbs Sampler to incorporate missing covariate structure in linear models for ecological assessments Edward L. Boone · Keying Ye · Eric P. Smith
Received: 1 October 2004 / Revised: 1 May 2006 / Published online: 16 October 2007 © Springer Science+Business Media, LLC 2007
Abstract Missing covariate values in linear regression models can be an important problem facing environmental researchers. Existing missing value treatment methods such as Multiple Imputation (MI), the EM algorithm and Data Augmentation (DA) assume that both observed and unobserved data come from the same distribution, most commonly a multivariate normal or a conditionally multivariate normal family. These methods do not try to incorporate the missing data mechanism and rely on the assumption of Missing At Random (MAR). We present a DA method which does not rely on the MAR assumption and can model missing data mechanisms and covariate structure. This method utilizes the Gibbs Sampler as a tool for incorporating these structures and mechanisms. We apply this method to an ecological data set that relates fish condition to environmental variables. The presented DA method detects relationships that are not detected when other missing data methods are employed.
Keywords Bayesian methods · Biological monitoring · Data augmentation · Ecological health · Gibbs sampler · Stressor-response · Missing data
E. L. Boone (B)
Department of Statistical Sciences and Operations Research, Virginia Commonwealth University, Richmond, VA 23284, USA
e-mail: [email protected]
K. Ye
Department of Management Science and Statistics, University of Texas at San Antonio, 6900 N Loop 1604 W, San Antonio, TX 78249, USA
E. P. Smith
Department of Statistics, Virginia Tech, Blacksburg, VA 24061, USA
1 Introduction
Missing covariates are a problem familiar to many environmental researchers. Covariates can become missing for a variety of unintended reasons: non-response, equipment failure, lack of collection, etc. The literature provides many methods for dealing with missing covariates: listwise deletion, mean substitution (Little and Rubin 1986; Schafer 1997), the EM algorithm and multiple imputation. Listwise deletion loses the most information, since it simply discards any incomplete observations and thus reduces the sample size. In methods such as mean substitution and the EM algorithm (Dempster et al. 1977), a single value is imputed for each missing datum as if it were the true unobserved value. Ibrahim (1990), Lipsitz and Ibrahim (1996), Xie and Paik (1997), Ibrahim et al. (1999) and Satten and Carroll (2000) all considered various methods for dealing with missing covariates in linear and non-linear parametric models, employing the EM algorithm and variants to incorporate covariate structure. These methods do not directly incorporate any uncertainty associated with the missing value into the analysis. On the other hand, multiple imputation approaches seek to incorporate such uncertainty by imputing missing values drawn from a specified distribution. In this paper, we consider a Bayesian approach to treating missing covariates, and we use the intrinsic properties of a Markov Chain Monte Carlo (MCMC) technique, the Gibbs Sampler, to impute values for the missing covariates. Using the sampling properties of the Gibbs Sampler we can naturally incorporate covariate structure and missing data mechanisms into the imputed values.
Missing at Random (MAR), in the sense of Rubin (1976), is a common requirement for many missing data treatments. MAR is essentially the idea that the unobserved value is not missing because of its value; rather, it is missing due to randomness.
For many data sets this is the common case. However, when measurements are censored because of instrument precision, the data are sometimes missing because the value is too low or too high to be detected by the device. Clearly, in such a situation the data are not missing at random and do not fit the MAR scenario. Incorporating non-MAR missing data mechanisms into a statistical analysis is still challenging. In this paper, we show how such mechanisms can be incorporated into the model via a prior distribution on the missing data parameters.
When multiple imputation (MI) methods are implemented, the assumption is usually made that both observed and unobserved data come from a distribution such as a multivariate normal or conditionally multivariate normal distribution. MI procedures are now included in packages such as SAS, with PROC MI and PROC MIANALYZE (Yuan 2001), and S-Plus (Schimert et al. 2000). The SAS implementation of MI requires the assumption that both the missing and observed data come from a multivariate continuous distribution. To use its MCMC methods, a multivariate normal distribution must be assumed (Yuan 2001). SAS also recommends that the amount of missing data not be too large. The S-Plus implementation allows for combinations of factor level and continuous variables (Schimert et al. 2000) using the approach taken by Schafer (1997). Schafer discussed missing data treatment methods where factor level and continuous variables are present by assuming that, conditional on the factor level variables, the observed and missing data come from a multivariate normal
distribution. Efron (1994) explored a bootstrap implementation of MI which makes no assumptions about the distribution of the missing data or the missing data mechanism. He also noted the importance of modelling the missing data mechanism. Our method makes no assumptions about the distribution of the observed covariates, and it can deal with situations where the missing covariate is not continuous or where a missing data mechanism is present.
2 The method
The method we propose in this article follows from the Data Augmentation (DA) approach, in which parameters are added for the missing values (Tanner 1993) and these parameters (the augmented data) are assigned a distribution. The variability in that distribution represents the uncertainty associated with the missing values. Multiple Imputation is an implementation of this idea in which values are imputed for the missing values, parameter estimates are obtained multiple times, and the estimates are combined using standard formulas (Little and Rubin 1986). Our method does not require these formulas since we can easily marginalize the posterior distribution via the Gibbs Sampler (Gelman et al. 1995). Our focus is on missing covariates, and throughout this paper we assume the response is fully observed. Let $Y_i$ be the $i$th response and $X_{ji}$ be the $j$th covariate for the $i$th observation. Under a linear regression framework with DA as a missing data treatment our model is given by:
$$Y_i \sim N(\mu_i, \sigma^2), \tag{1}$$
where
$$\mu_i = \sum_{j \in P_i} X_{ji}\beta_j + \sum_{j \in P_i^c} Z_{ji}\beta_j, \tag{2}$$
and $\mathrm{Cov}(Y_i, Y_k \mid \beta, \sigma^2) = 0$. In Eq. 2, $Z_{ji}$ is the parameter representing the missing value of the $j$th covariate in observation $i$, $P_i$ is the set indicating which covariates are observed for the $i$th response, and $P_i^c$ is its complement. In this formulation, each $Z_{ji}$ parameter may have a distinct prior distribution $p_{ji}(Z_{ji})$. For notation, we refer to the collection of missing data parameters as $\mathbf{Z}$ and to the corresponding joint prior as $p(\mathbf{Z})$. The model given by (1) and (2) leads to the completed data likelihood $L(\mathbf{Y} \mid \beta, \sigma^2, \mathbf{Z}, \mathbf{X})$, which depends on the missing covariate parameters $\mathbf{Z}$ and the observed covariates $\mathbf{X}$. Using a joint prior distribution $p(\beta, \sigma^2, \mathbf{Z})$ on $\beta$, $\sigma^2$ and $\mathbf{Z}$, we can determine the posterior distribution $p(\beta, \sigma^2, \mathbf{Z} \mid \mathbf{Y}, \mathbf{X})$ via Bayes theorem:
$$p(\beta, \sigma^2, \mathbf{Z} \mid \mathbf{Y}, \mathbf{X}) = \frac{p(\beta, \sigma^2, \mathbf{Z})\, L(\mathbf{Y} \mid \beta, \sigma^2, \mathbf{Z}, \mathbf{X})}{\int p(\beta, \sigma^2, \mathbf{Z})\, L(\mathbf{Y} \mid \beta, \sigma^2, \mathbf{Z}, \mathbf{X})\, d\beta\, d\sigma^2\, d\mathbf{Z}}.$$
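Notice that this posterior distribution depends on the missing covariate parameters $\mathbf{Z}$. Valid inferences on the model parameters $\beta$ and $\sigma^2$ need to be free of $\mathbf{Z}$. We can obtain the desired posterior distribution $p(\beta, \sigma^2 \mid \mathbf{Y}, \mathbf{X})$ via marginalization:
$$p(\beta, \sigma^2 \mid \mathbf{Y}, \mathbf{X}) = \int p(\beta, \sigma^2, \mathbf{Z} \mid \mathbf{Y}, \mathbf{X})\, d\mathbf{Z}. \tag{3}$$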
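In sampling terms, the marginalization in (3) requires no explicit integration: if a Gibbs sampler produces joint draws of the regression parameters and the missing values, keeping only the regression-parameter components yields draws from the marginal posterior. The Python sketch below illustrates this for a toy version of the model in (1) and (2). It is not the authors' implementation. The function name `gibbs_da`, the single-site updates for the coefficients, and the default priors (a N(0, tau2) prior on each coefficient, an Inv-Gamma(a, b) prior on the error variance, and a N(mu_z, s2_z) prior on every missing value, whose normal full conditional is discussed in Section 3) are assumptions made for illustration.

```python
import math
import random

def gibbs_da(y, x, missing, n_iter=2000, burn_in=500,
             tau2=10.0, a=1.5, b=0.5, mu_z=0.0, s2_z=1.0, seed=0):
    """Data-augmentation Gibbs sampler for y_i ~ N(sum_j x[i][j]*beta_j, sigma2).

    x       : list of rows; cells listed in `missing` are replaced by draws
    missing : set of (i, j) index pairs marking missing covariate values
    Priors (illustrative): beta_j ~ N(0, tau2), sigma2 ~ Inv-Gamma(a, b),
                           Z_ji ~ N(mu_z, s2_z)
    """
    rng = random.Random(seed)
    n, p = len(y), len(x[0])
    x = [row[:] for row in x]                  # work on a copy of the design
    for i, j in missing:                       # start missing cells at the prior mean
        x[i][j] = mu_z
    beta, sigma2 = [0.0] * p, 1.0
    draws = []
    for it in range(n_iter):
        # 1. beta_j | others: univariate normal, updated one coefficient at a time
        for j in range(p):
            resid = [y[i] - sum(x[i][k] * beta[k] for k in range(p) if k != j)
                     for i in range(n)]
            prec = sum(x[i][j] ** 2 for i in range(n)) / sigma2 + 1.0 / tau2
            mean = sum(x[i][j] * resid[i] for i in range(n)) / sigma2 / prec
            beta[j] = rng.gauss(mean, math.sqrt(1.0 / prec))
        # 2. sigma2 | others: inverse gamma (reciprocal of a gamma draw)
        sse = sum((y[i] - sum(x[i][k] * beta[k] for k in range(p))) ** 2
                  for i in range(n))
        sigma2 = 1.0 / rng.gammavariate(a + n / 2.0, 1.0 / (b + sse / 2.0))
        # 3. Z_ji | others: normal full conditional for each missing cell
        for i, j in missing:
            r = y[i] - sum(x[i][k] * beta[k] for k in range(p) if k != j)
            denom = sigma2 + s2_z * beta[j] ** 2
            mean = (sigma2 * mu_z + s2_z * beta[j] * r) / denom
            x[i][j] = rng.gauss(mean, math.sqrt(sigma2 * s2_z / denom))
        if it >= burn_in:
            draws.append((beta[:], sigma2))    # Z draws are simply not stored
    return draws
```

Summaries of the returned draws, such as the average of each coefficient, then estimate the marginal posterior without any further reference to the missing values.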
The integral in (3) will be of very high dimension even for low fractions of missing data, since its dimension equals the number of missing values. The Gibbs Sampler gives us an effective way to deal with this high-dimensional integral by generating samples from the posterior distribution. Using these samples we can directly examine the marginal distributions of $\beta$ and $\sigma^2$. The model proposed above assumes that $p(\mathbf{Z})$ is a proper probability distribution (i.e. $\int p(\mathbf{Z})\, d\mathbf{Z} = 1$). Additional information about the missing covariates can be incorporated into the model via $p(\mathbf{Z})$, which is then an informative prior distribution.
3 Incorporation of covariate structures
One common piece of information that we know a priori is the covariate structure. For example, we may know that a covariate $X_j$ is ordinal with $k$ levels. This information can be incorporated into the model by assuming $p(Z_{ji}) = \{\Pr(Z_{ji} = l),\ l = 1, \ldots, k\}$. This is more appropriate than assuming $Z_{ji}$ is approximately normally distributed. In order to perform Gibbs sampling to estimate the parameters $\beta$, $\sigma^2$ and $\mathbf{Z}$, the full conditional posterior distributions are needed. The remainder of this section presents some of the common full conditionals researchers may use in practice.
Suppose that the covariate $X_j$ is a continuous measurement that can be assumed approximately normally distributed with mean $\mu_j$ and variance $\sigma_j^2$. This leads to the following full conditional posterior distribution for $Z_{ji}$:
$$Z_{ji} \mid \text{others} \sim N\left( \frac{\sigma^2 \mu_j + \sigma_j^2 \beta_j (Y_i - x_{-j,i}\beta_{-j})}{\sigma^2 + \sigma_j^2 \beta_j^2},\ \frac{\sigma^2 \sigma_j^2}{\sigma^2 + \sigma_j^2 \beta_j^2} \right), \tag{4}$$
where $Z_{ji} \mid \text{others}$ denotes the conditional distribution of $Z_{ji}$ given all other quantities, $\beta_{-j}$ is the vector $\beta$ without $\beta_j$, and $x_{-j,i}$ is the completed data vector $x_i$ omitting the $j$th covariate. Hence, if $\beta$ is $p \times 1$, then $\beta_{-j}$ is $(p-1) \times 1$, and similarly for $x_{-j,i}$. The distribution in (4) can be easily sampled from using standard software packages. The form of (4) gives insight into how the imputed values are sampled. The full conditional posterior mean is a weighted combination of the prior mean $\mu_j$ and the back-solved regression equation, i.e. $Y_i - x_{-j,i}\beta_{-j}$. If the regression relationship between the response and the $j$th covariate $X_j$ is weak, i.e. $\beta_j \approx 0$, the prior mean and variance will dominate the posterior mean in (4). Hence, choosing an unrepresentative mean and variance will cause leverage points to be sampled and may result in the Gibbs Sampler not converging. The conditional posterior variance of $Z_{ji}$ in (4) shows the effect of choosing a large prior variance $\sigma_j^2$. If $\beta_j = 0$ (i.e. no regression relationship between $Y$ and $X_j$) and $\sigma_j^2 \to \infty$, then $\mathrm{Var}(Z_{ji} \mid \text{others}) \to \infty$,
thus the full conditional posterior is not proper. Therefore, it is not reasonable to let $\sigma_j^2 \to \infty$, as one would in choosing a "vague" or "flat" prior. With these arguments in mind, we recommend that when the researcher has no a priori information about the prior mean and variance of $Z_{ji}$, MAR should be assumed and an empirical prior distribution for $Z_{ji}$ be set.
When an interaction is present and a normal prior distribution is used, we proceed in the same manner as in the case without an interaction term. Suppose that there is an interaction term between $X_j$ and $X_k$, and denote by $\beta_{jk}$ the regression coefficient for the interaction. With $Z_{ji} \sim N(\mu_j, \sigma_j^2)$, the following full conditional posterior distribution of $Z_{ji}$ can be derived:
$$Z_{ji} \mid \text{others} \sim N\left( \frac{\sigma^2 \mu_j + \sigma_j^2 (\beta_j + \beta_{jk} X_k)(Y_i - x_{-j,i}\beta_{-j})}{\sigma^2 + \sigma_j^2 (\beta_j + \beta_{jk} X_k)^2},\ \frac{\sigma^2 \sigma_j^2}{\sigma^2 + \sigma_j^2 (\beta_j + \beta_{jk} X_k)^2} \right).$$
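Both normal full conditionals share one form: the interaction case is (4) with the slope on the missing covariate replaced by the sum of the main-effect coefficient and the interaction coefficient times the interacting covariate. A minimal Python sketch of a single draw follows. The function name and argument names are hypothetical, and `other_fit` is assumed to hold the fitted contribution of all terms that do not involve the missing covariate.

```python
import math
import random

def draw_missing_normal(y_i, other_fit, slope, sigma2, mu_j, s2_j, rng=random):
    """Draw Z_ji from its normal full conditional.

    other_fit : contribution of the other covariates, x_{-j,i} beta_{-j}
    slope     : beta_j, or beta_j + beta_jk * x_k when an interaction is present
    Returns the draw together with the conditional mean and variance.
    """
    denom = sigma2 + s2_j * slope ** 2
    mean = (sigma2 * mu_j + s2_j * slope * (y_i - other_fit)) / denom
    var = sigma2 * s2_j / denom
    return rng.gauss(mean, math.sqrt(var)), mean, var
```

With `slope` equal to zero the function returns the prior mean and prior variance, which is the behaviour that motivates the warning against very large prior variances.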
Again in this case, if $\beta_j = 0$, $\beta_{jk} = 0$ and $\sigma_j^2 \to \infty$, then the full conditional posterior distribution is not proper. Likewise, if $\beta_j = 0$ and $\beta_{jk} = 0$, then the prior mean and variance will dominate. Furthermore, the issues of leverage also apply here. As in the case without an interaction term, if the researcher has no a priori information about the prior mean and variance of $Z_{ji}$, assuming MAR is recommended and empirical prior distribution parameters are preferred.
Suppose that the covariate $X_j$ is a dichotomous measurement. A Bernoulli prior distribution can be placed on $Z_{ji}$ to preserve the dichotomous structure. This is appealing since it ensures that only values which could have been observed for $X_{ji}$ are actually sampled. If we assume $Z_{ji} \sim \text{Bernoulli}(p)$, then the full conditional posterior distribution will be:
$$P(Z_{ji} = k \mid \text{others}) \propto P(Z_{ji} = k) \exp\left\{ -\frac{1}{2\sigma^2} (Y_i - x_{-j,i}\beta_{-j} - k\beta_j)^2 \right\}. \tag{5}$$
This distribution can be easily sampled from using discrete distribution sampling techniques. The normalization constant is obtained by summing (5) over $k = 0$ and $1$. A vague prior distribution for this situation would be $P(Z_{ji} = 0) = P(Z_{ji} = 1) = 1/2$. This reflects the lack of prior information on which value is more likely. In this discrete case, problems with leverage are not present.
Suppose the covariate $X_j$ is an ordinal measurement with $K$ possible values, a straightforward extension of the dichotomous situation. We can place a discrete prior distribution $P(Z_{ji} = k)$ for $k = 1, \ldots, K$ on $Z_{ji}$ to preserve the discrete ordinal structure. This results in the following full conditional posterior distribution for $Z_{ji}$:
$$P(Z_{ji} = k \mid \text{others}) \propto P(Z_{ji} = k) \exp\left\{ -\frac{1}{2\sigma^2} (Y_i - x_{-j,i}\beta_{-j} - k\beta_j)^2 \right\}.$$
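The dichotomous and ordinal full conditionals have the same form and can be handled by one routine: weight each admissible level by its prior probability times the normal likelihood term, normalize, and draw. The Python sketch below is one way to do this under stated assumptions. The function names are hypothetical, the prior is passed as a dictionary from level to probability, and the weights are computed on the log scale purely for numerical stability.

```python
import math
import random

def discrete_full_conditional(y_i, other_fit, beta_j, sigma2, prior):
    """Return normalized P(Z_ji = k | others) for each level k in `prior`.

    prior : dict mapping each admissible level k to its prior probability
    """
    logw = {k: math.log(p) - (y_i - other_fit - k * beta_j) ** 2 / (2.0 * sigma2)
            for k, p in prior.items() if p > 0}
    top = max(logw.values())                    # subtract the max before exponentiating
    w = {k: math.exp(v - top) for k, v in logw.items()}
    total = sum(w.values())                     # normalization constant
    return {k: v / total for k, v in w.items()}

def draw_discrete(probs, rng=random):
    """Draw one level from a dict of probabilities by inverting the CDF."""
    u, cum = rng.random(), 0.0
    for k, p in probs.items():
        cum += p
        if u < cum:
            return k
    return k                                    # guard against rounding in the last step
```

Passing `{0: 0.5, 1: 0.5}` as the prior gives the dichotomous case, and a dictionary with keys 1 to K gives the ordinal case.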
Again this is a discrete distribution, making sampling fast and easy. The normalization constant is determined as in the dichotomous case, except that we sum over $k = 1, \ldots, K$. Likewise, a vague prior distribution for this situation would be
$P(Z_{ji} = k) = 1/K$ for $k = 1, \ldots, K$. This noninformative prior distribution reflects the lack of prior knowledge on the value of $Z_{ji}$. More generally, we can specify any appropriate prior distribution for $Z_{ji}$; however, sampling from the resulting full conditional induced by the prior distribution may be difficult. Techniques such as Metropolis–Hastings sampling may need to be employed to sample from these densities. Gilks et al. (1996) gave an extensive overview of sampling techniques which may be employed.
Missing at Random (MAR) in the sense of Rubin (1976) is a crucial requirement for many methods (Little and Rubin 1986; Schafer 1997). Recall from Rubin (1976) that the missing data mechanism $g_\phi(m \mid Z_{ji})$ relates the value of the variable $Z_{ji}$ to the probability that it is missing, where $m$ is an indicator variable for whether $Z_{ji}$ is missing. In the case of MAR, $g_\phi(m \mid Z_{ji})$ is a constant; hence the probability that a value is missing does not depend on the unobserved value. In our method we do not need to assume MAR: using our prior distribution for $Z_{ji}$ and the missing data mechanism $g_\phi(m \mid Z_{ji})$, we can update our prior information about $Z_{ji}$ via Bayes theorem and arrive at an updated prior $p(Z_{ji} \mid m)$. In practice, we may have additional information to help formulate $p(Z_{ji} \mid m)$. For example, we may know that a specific measuring device is more likely to fail when measurements are higher. We can incorporate this a priori knowledge into $p(Z_{ji} \mid m)$. This kind of missing data mechanism is often called censoring of covariates, which may occur when measuring lifetimes of objects or when the covariate measurement exceeds the observable scale. For an unobservable region $C$, the missing data mechanism in this case is given by:
$$g_\phi(m \mid Z_{ji}) = \begin{cases} 1 & \text{if } Z_{ji} \in C \\ 0 & \text{otherwise.} \end{cases}$$
From this we find the updated prior distribution for $Z_{ji}$ is:
$$p(Z_{ji} \mid m) = \begin{cases} \dfrac{p(Z_{ji})}{\int_C p(Z_{ji})\, dZ_{ji}} & \text{if } Z_{ji} \in C \\ 0 & \text{otherwise.} \end{cases}$$
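This leads to the following full conditional posterior distribution:
$$p(Z_{ji} \mid \text{others}) \propto \begin{cases} \dfrac{p(Z_{ji})}{\int_C p(Z_{ji})\, dZ_{ji}} \exp\left\{ -\dfrac{1}{2\sigma^2} (Y_i - x_{-j,i}\beta_{-j} - Z_{ji}\beta_j)^2 \right\} & \text{if } Z_{ji} \in C \\ 0 & \text{otherwise.} \end{cases}$$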
These distributions can be sampled from using the Metropolis Algorithm (see Gilks et al. 1996). The ability to directly incorporate missing data mechanisms is one of the strengths of the proposed method which gives the researcher a tool for a more appropriate analysis. We should note, however, that in any case where MAR is not assumed and a missing data mechanism is used, the inferences become conditional on the missing data mechanism.
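As an illustration of such a Metropolis step, the Python sketch below draws a censored covariate value under several assumptions that are not taken from the paper: the prior is normal, the unobservable region C is the interval above a known detection limit `lower`, and the proposal is a Gaussian random walk with a fixed step size. The function name is hypothetical. Because the truncation constant in the updated prior does not depend on the value being sampled, it cancels in the acceptance ratio and never has to be computed.

```python
import math
import random

def metropolis_censored(y_i, other_fit, beta_j, sigma2, mu_j, s2_j,
                        lower, z0, n_steps=50, step=0.5, rng=random):
    """Random-walk Metropolis update of Z_ji restricted to C = (lower, inf).

    Target: N(mu_j, s2_j) prior truncated to C times the normal likelihood term.
    z0 is the current value of Z_ji and must lie inside C.
    """
    def log_target(z):
        if z <= lower:                          # outside the unobservable region C
            return -math.inf
        log_prior = -(z - mu_j) ** 2 / (2.0 * s2_j)
        log_lik = -(y_i - other_fit - z * beta_j) ** 2 / (2.0 * sigma2)
        return log_prior + log_lik

    z, lp = z0, log_target(z0)
    if lp == -math.inf:
        raise ValueError("z0 must lie inside the region C")
    for _ in range(n_steps):
        prop = z + rng.gauss(0.0, step)
        lp_prop = log_target(prop)
        # accept with probability min(1, target(prop) / target(z))
        if math.log(rng.random() or 1e-300) < lp_prop - lp:
            z, lp = prop, lp_prop
    return z
```

Inside a Gibbs sampler this function would replace the direct draw for each censored value, with `z0` set to that value's current state.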
4 An environmental example
In this section we consider a data set collected by the Ohio Environmental Protection Agency regarding the benthic health of waterways. Policy makers need an understanding of which variables contribute to or detract from overall benthic health in order to make informed decisions about regulations and policies. Researchers need a similar understanding in order to better direct efforts to improve water quality. The data set consists of benthic health, environmental and chemical variables. The objective is to understand how the environmental and chemical variables affect the response. The response is the Index of Biotic Integrity (IBI), which measures the overall health of the fish community (Ohio EPA 1988). High values of this measure correspond to high diversity, many intolerant fish species present and more organized fish communities. Low values correspond to low diversity, no community structure and only tolerant fish present. To measure environmental health, the Qualitative Habitat Evaluation Index (QHEI) was employed (Ohio EPA 1989). For chemical measures, Dissolved Oxygen (DO), Total Suspended Solids (TSS), pH, and Nitrates (NH3) were collected. The pH measure used is a dichotomous variable indicating whether the pH was below the acceptable limit (pH < 6) or within the acceptable limit (6 ≤ pH ≤ 9). The NH3 measure is an ordinal variable with 0 corresponding to no detectable amount, 1 corresponding to nitrates detectable but below one, 2 corresponding to nitrates between two and three, and 3 corresponding to nitrates larger than three. The data set contains 2,087 observations. These variables were chosen because they represent a variety of classes of problems that influence benthic health.
QHEI represents habitat quality, DO represents the amount of oxygen dissolved in the water, TSS is often associated with the solids discharged from sewage treatment plants and industry, pH is partly determined by acid rain and NH3 is influenced by agricultural runoff. Norton (1999) considered a subset of this data where all missing values were deleted and restricted analysis to the Eastern Corn Belt region of Ohio.
4.1 Analysis using the proposed method
The data set has a large proportion of missing values. Figure 1 shows the percentage of observed values for each covariate. If listwise deletion were used, 46% of the data would be discarded; thus we have a high fraction of missing data. Due to the discrete nature of pH and NH3, standard methods such as multivariate EM and multivariate normal MI are inappropriate. On the other hand, S+ includes the Conditional Gaussian Modelling (impCgm) of Schafer (1997) in its missing data library. These routines allow for some of the variables to be dichotomous. However, these routines are not available for our data set due to degrees of freedom restrictions when conditioning. Table 1 shows the estimated regression coefficients using listwise deletion, mean substitution, PROC MI and S+ impCgm routines. From this table we see that for this model QHEI, DO and NH3 are significantly different from zero for all types of missing value treatments considered. Furthermore, we notice that the various methods produce different estimates and standard errors. Both the PROC MI and S+ impCgm
Fig. 1 Percentage of observed data by covariate (vertical axis: Percent Observed, 0.0 to 1.0; covariates shown: QHEI, NH3, TSS, pH, DO)

Table 1 Coefficient estimates (Est), standard errors (SE) and P-values (P-val) for regression model using listwise deletion, mean substitution, Proc MI and S+ impCgm

Variable   Listwise deletion        Mean substitution        Proc MI                  S+ impCgm
           Est      SE     P-val    Est      SE     P-val    Est      SE     P-val    Est      SE     P-val
QHEI        4.336   0.265  0.000     4.496   0.207  0.000     4.252   0.211  0.000     4.244   0.211  0.000
DO          0.930   0.278  0.001     1.169   0.277  0.001     1.119   0.282  0.000     1.101   0.279  0.000
TSS        −0.035   0.242  0.882    −0.295   0.240  0.219    −0.319   0.232  0.168    −0.340   0.237  0.151
pH          2.914   3.958  0.462     5.549   2.786  0.046     3.669   3.115  0.238     1.239   1.585  0.434
NH3        −4.265   0.407  0.000    −3.565   0.379  0.000    −3.741   0.375  0.000    −3.839   0.362  0.000
estimates were computed using 100 multiple imputations and non-informative prior distributions. In both procedures, the estimates stabilize before 50 imputations.
Using the proposed method we set the following conjugate prior distributions on $\beta$ and $\sigma^2$:
$$\beta_j \sim N(0, 10), \qquad \sigma^2 \sim \text{Inv-Gamma}(3/2, 1/2).$$
All continuous variables were standardized, hence this should be a vague prior distribution for the regression parameters. Recall that the Inv-Gamma(3/2, 1/2) distribution has a mean of one and infinite variance. For the continuous variables QHEI, DO and TSS we chose the prior distribution for the missing data parameters to be $N(0, 1)$, because these are standardized variables. Hence these prior distributions are empirical prior distributions. For the dichotomous variable, pH, we chose a uniform prior of $P(Z_{ji} = 0) = 1/2$. For the ordinal variable NH3 we used a similar uniform prior
distribution $P(Z_{ji} = k) = 1/K$, where $K$ is the number of corresponding categories. Using these prior distributions we determined the appropriate full conditionals and performed the analysis. For our analysis we assumed MAR since we were unaware of any missing data mechanisms. Even though we assume a priori that the missing data are not correlated, from Eq. 4 we see that the imputed values are correlated.
We ran 10 over-dispersed chains of 3,200 samples from the Gibbs Sampler, discarding the first 200 samples from each chain as burn-in. Each of these chains appeared to converge quickly. The remaining 30,000 samples showed mild autocorrelation out to lag 5. We retained all 30,000 samples from which to draw inferences. Figure 2 gives box plots of the samples from the posterior distribution of the regression coefficients generated by the Gibbs Sampler. To assess whether enough samples had been collected we used the potential scale reduction method of Gelman et al. (1995). This method estimates the factor $\hat{R}$ by which the variability might be reduced by continuing sampling. Values near 1 are considered ideal. Gelman et al. (1995) suggest values less than 1.2 are acceptable in practice. In Table 2 we see that for each of the coefficients $\hat{R}$ is very close to 1, suggesting our samples are adequate and further sampling is not required.
4.2 Results
Using this model we find that QHEI, DO, pH and NH3 are significantly different from zero. Standardizing the variables QHEI and DO changes the interpretation of the coefficients. In this case, for each one standard deviation increase in QHEI we can expect an increase in IBI of 4.36 units while all other variables are held constant.
Fig. 2 Boxplots of samples from the posterior distribution of the regression coefficients generated by the Gibbs Sampler (one box per coefficient: QHEI, DO, TSS, pH, NH3)
Table 2 Estimated lower (L95), median (M) and upper (U95) posterior percentiles, probabilities $P^* = \min\{P(\beta_j < 0), P(\beta_j > 0)\}$ and potential scale reduction factor $\hat{R}$ for regression parameters

Variable   L95      M        U95      P*     R-hat
QHEI        3.967    4.364    4.769   0.000  1.001
DO          0.717    1.234    1.749   0.000  1.001
TSS        −0.721   −0.275    0.172   0.113  1.002
pH          2.832    3.784    4.659   0.000  0.999
NH3        −3.749   −3.079   −2.406   0.000  1.000

Prior distribution for $\beta_j$ is $N(0, 10)$
This result agrees with biological expectations since the better the environment, the healthier the fish residing in that environment should be. For one standard deviation increase in DO we can expect a 1.23 unit increase in IBI when all other variables are held constant. This also agrees with biological expectation since fish communities need oxygen in order to thrive. The dichotomous variable pH shows that we can expect a 3.78 unit difference in IBI when pH < 6 versus when 6 ≤ pH ≤ 9 while all other covariates are held constant. This again agrees with biological expectations since low pH makes water tolerable for only a few species, hence reducing the diversity of the community. For each one category increase in the ordinal variable NH3 , we can expect a decrease in IBI of 3.07 units when all other covariates are held constant. Thus high NH3 corresponds to poorer benthic health. Even though TSS is not significant in this model we see the sign of the regression coefficient is negative, corresponding to high values of TSS being associated with decreased benthic health.
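For reference, the potential scale reduction factor reported in Table 2 compares the between-chain and within-chain variances of each scalar parameter, following the general recipe in Gelman et al. (1995). The Python sketch below is an assumed minimal version for one parameter and chains of equal length; the function name is hypothetical and this is not the code used for the analysis.

```python
import math

def potential_scale_reduction(chains):
    """R-hat for one scalar parameter from m chains of equal length n."""
    m, n = len(chains), len(chains[0])
    means = [sum(c) / n for c in chains]
    grand = sum(means) / m
    # B: between-chain variance, scaled by the chain length
    between = n / (m - 1) * sum((mu - grand) ** 2 for mu in means)
    # W: average of the within-chain sample variances
    within = sum(sum((v - mu) ** 2 for v in c) / (n - 1)
                 for c, mu in zip(chains, means)) / m
    var_plus = (n - 1) / n * within + between / n   # pooled variance estimate
    return math.sqrt(var_plus / within)
```

Because of the (n - 1)/n factor, well-mixed chains can give values marginally below 1, as seen for pH in Table 2.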
4.3 Sensitivity and model validation
To determine whether the model is sensitive to the prior distribution specification, we changed the prior distribution for $\beta$ from $N(0, 10I)$ to $N(1, 10I)$. This represents a mild shift in the mean relative to the estimated parameters. We ran 10 over-dispersed chains of 3,200 samples from the Gibbs Sampler. After discarding the first 200 samples from each chain and combining the chains we obtained 30,000 samples from which to draw inferences. Table 3 shows the percentiles of the posterior distribution for each regression coefficient. By comparing Tables 2 and 3 we notice that the percentiles of the posterior distributions agree quite well and that there are no substantive changes in any of the inferences between models. Hence, we can conclude that the inferences are not sensitive to mild shifts in the prior mean. We also ran similar sensitivity studies to determine whether the model is sensitive to shifts in the prior distribution variance. We considered the following prior distributions for $\beta$: $N(0, 10I)$, $N(0, 20I)$ and $N(0, 100I)$, and obtained similar results. Hence the model is not sensitive to shifts in the prior distribution variance.
To validate our model we used posterior predictive performance evaluation to determine whether our model is over- or under-fit. For cross-validation we split the data into fit and hold out data sets. We fit the model to the fit data set and assessed the predictions
Table 3 Estimated lower (L95), median (M) and upper (U95) posterior percentiles, probabilities $P^* = \min\{P(\beta_j < 0), P(\beta_j > 0)\}$ and potential scale reduction factor $\hat{R}$ for regression parameters

Variable   L95      M        U95      P*     R-hat
QHEI        3.972    4.372    4.771   0.000  1.001
DO          0.716    1.232    1.745   0.000  1.001
TSS        −0.714   −0.271    0.175   0.118  1.001
pH          2.841    3.796    4.658   0.000  0.999
NH3        −3.734   −3.072   −2.393   0.000  1.000

Prior distribution for $\beta_j$ is $N(1, 10)$
of the hold out sample. When using a hold out sample of size $r$, Gelman et al. (1995) suggest using the samples to create a statistic:
$$\chi^2_{obs} = \sum_{i=1}^{r} \frac{\left(Y_i - E(Y_i^{pred} \mid \mathbf{Y})\right)^2}{\mathrm{Var}(Y_i^{pred} \mid \mathbf{Y})}, \tag{6}$$
where $E(Y_i^{pred} \mid \mathbf{Y}) = x_i \hat{\beta}$ and $\mathrm{Var}(Y_i^{pred} \mid \mathbf{Y}) = x_i \widehat{\mathrm{Var}}(\beta) x_i^{\top} + \hat{\sigma}^2$, $\hat{\beta}$ is the sample mean vector and $\widehat{\mathrm{Var}}(\beta)$ is the sample variance-covariance matrix of the posterior samples from the Gibbs Sampler. We estimate $\hat{\sigma}^2$ as the average of the posterior samples of $\sigma^2$ from the Gibbs Sampler. The statistic given by (6) compares the variability in the hold out sample with the variability in the model. Using the $\chi^2_{393}$ distribution as a reference distribution we can obtain a P-value.
From the fully observed data a hold out sample of 400 observations was randomly sampled without replacement. We used the Gibbs Sampler to obtain 1,200 samples from the posterior predictive distribution for each of the 400 observations. Using the last 1,000 samples from the chain we obtained $E(Y_i^{pred} \mid \mathbf{Y})$ and $\mathrm{Var}(Y_i^{pred} \mid \mathbf{Y})$ and then calculated (6) for the hold out sample. We repeated this process 100 times to ensure the cross validation did not depend on the hold out sample taken. Of the 100 samples, 15 fell in the $\alpha = 0.05$ rejection region of the $\chi^2_{393}$ reference distribution. This shows the data have slightly more variability than predicted by the model. Further inspection of the data and hold out samples shows that this discrepancy is the result of a few extreme observations in the data set. Any observation whose standardized residual is greater than 3 was deemed an extreme observation. This data set contained three observations of this nature. When these observations were in the hold out sample, the result from (6) fell into the $\alpha = 0.05$ rejection region of the $\chi^2_{393}$ reference distribution.
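The computation in (6) can be sketched in Python as follows. The sketch assumes that posterior predictive draws for each hold out observation are already available as a list of lists (`pred_draws`, a hypothetical name) and estimates the predictive mean and variance directly from those draws rather than from the plug-in formulas. It computes only the statistic, not the P-value.

```python
def chi2_obs(y_holdout, pred_draws):
    """Posterior predictive discrepancy statistic of the form in (6).

    y_holdout  : observed responses in the hold out sample (length r)
    pred_draws : pred_draws[s][i] is the s-th posterior predictive draw for unit i
    """
    s = len(pred_draws)
    stat = 0.0
    for i, y_i in enumerate(y_holdout):
        col = [draw[i] for draw in pred_draws]           # all draws for unit i
        mean = sum(col) / s                              # estimate of E(Y_i^pred | Y)
        var = sum((v - mean) ** 2 for v in col) / (s - 1)  # estimate of Var(Y_i^pred | Y)
        stat += (y_i - mean) ** 2 / var
    return stat
```

A P-value then follows from the upper tail of the chi-square reference distribution, for instance with `scipy.stats.chi2.sf` if SciPy is available.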
5 Conclusions
By comparing values in Table 1 we see that the common missing data methods of listwise deletion, SAS Proc MI and S+ impCgm do not show pH as being a significant variable in the model. In contrast, the method we present does show pH to be significant in the model,
which agrees with biological expectations. This shows that inferences may differ depending on the missing data method chosen. By comparing the estimates in Tables 1 and 2 we also notice that the regression coefficients and P-values agree quite well except on whether pH is significant or not. Results show that the standard error for pH is smaller using our method. Further, notice that the P-values associated with TSS are similar across all methods, suggesting our method is not underestimating the standard errors. The addition of prior information about the dichotomous nature of the pH measurement allowed us to determine whether it is significant or not.
The new method proposed in this article provides researchers a means to incorporate known covariate structures into their analysis when missing covariate data are encountered. There are some unique attributes of this method that should be mentioned. First, the method does not require the assumption of MAR. Second, the only assumptions made are about the missing covariates; we do not need to make any assumptions about the observed covariates. Lastly, the method does not require the standard formulas used to combine the results from the imputations.
In this paper we did not discuss the issue of model selection using this method. Addressing the missing data should not be overlooked when conducting model selection, and future work should investigate model selection using this framework. Li et al. (1991) consider how to perform approximate likelihood ratio tests with multiple imputations. However, in our situation the standard formulas for obtaining the approximate likelihood ratio do not apply. Bayes factors (Gelman et al. 1995) could be created by using the samples from the Gibbs Sampler to evaluate the marginal distribution of the model given the data.
Acknowledgments This research was funded in part by U.S. EPA-Science To Achieve Results (STAR) Grant #RD-83136801-0.
Although the research described in the article has been funded wholly or in part by the U.S. Environmental Protection Agency STAR programs, it has not been subjected to any EPA review and therefore does not necessarily reflect the views of the Agency, and no official endorsement should be inferred.
References
Dempster A, Laird N, Rubin D (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B (Methodological) 39:1–38
Efron B (1994) Missing data, imputation and the bootstrap. J Am Stat Assoc 89:463–474
Gelman A, Carlin J, Stern H, Rubin D (1995) Bayesian data analysis, 2nd edn. Chapman and Hall, Boca Raton
Gilks W, Richardson S, Spiegelhalter D (1996) Markov chain Monte Carlo in practice. Chapman and Hall, London
Ibrahim J (1990) Experimental design for binary data. J Am Stat Assoc 85:753–760
Ibrahim J, Chen M-H, Lipsitz S (1999) Monte Carlo EM for missing covariates in parametric regression models. Biometrics 55:591–596
Li K, Raghunathan T, Rubin D (1991) Large sample significance levels from multiply imputed data sets using moment based statistics and an F reference distribution. J Am Stat Assoc 86:1065–1073
Lipsitz S, Ibrahim J (1996) Conditional model for incomplete covariates in parametric regression models. Biometrika 83:916–922
Little R, Rubin D (1986) Statistical analysis with missing data. John Wiley & Sons, New York
Norton S (1999) Using biological monitoring data to distinguish among types of stress in streams of the Eastern Cornbelt Plains ecoregion. PhD thesis, George Mason University
Ohio-EPA (1988) Biological criteria for the protection of aquatic life: volume II: users manual for biological assessment of Ohio surface waters. State of Ohio Environmental Protection Agency, WQMASWS-6
Ohio-EPA (1989) The Qualitative Habitat Evaluation Index (QHEI): rationale, methods and application. State of Ohio Environmental Protection Agency
Rubin D (1976) Inference and missing data. Biometrika 63:581–592
Satten GA, Carroll RJ (2000) Conditional and unconditional categorical regression models with missing covariates. Biometrics 56:384–388
Schafer J (1997) Analysis of incomplete multivariate data. Chapman and Hall, London
Schimert J, Schafer J, Hesterberg T, Fraley C, Clarkson D (2000) Analyzing data with missing values in S-plus. Insightful Corporation, Seattle
Tanner M (1993) Tools for statistical inference, 2nd edn. Springer-Verlag, New York
Xie F, Paik M (1997) Multiple imputation methods for missing covariates in generalized estimating equations. Biometrics 53:1538–1546
Yuan Y (2001) Multiple imputation for missing data: concepts and new development. Technical Report P267-25, SAS Institute
Author Biographies
Edward L. Boone is Assistant Professor of Statistics at the Department of Statistical Sciences and Operations Research, Virginia Commonwealth University, Richmond, Virginia, USA. His research interests include environmental and ecological statistics, statistical methods for Quantitative Trait Loci and experimental design methodology.
Keying Ye is Professor of Statistics at the Department of Management Science and Statistics, University of Texas at San Antonio, Texas, USA. His research interests include environmental and ecological statistics, statistical modeling in GIS data, statistical analysis in biomedical studies, experimental design and clustering methodologies. He has also worked in the field of Bayesian statistics, including methodological development and applications.
Eric P. Smith is a Professor in the Statistics Department at Virginia Tech.