Evaluating the Relationship Between Ecological and Habitat Conditions Using Hierarchical Models

Edward L. BOONE, Keying YE, and Eric P. SMITH

Understanding the relationship between an index of biological community state and habitat is important for policy makers and environmental managers. A common approach to modeling this relationship is to use regression. However, this simple method becomes complicated when the data are clustered and have both within-cluster and between-cluster spatial correlation. This article proposes a Bayesian hierarchical model that incorporates both types of spatial correlation. This model yields both an understanding of the within-cluster relationships and an overall relationship between these variables. We apply this method to evaluate the relationship between the index of biotic integrity (a common measure of fish condition) and the qualitative habitat evaluation index (a common measure of habitat quality). This method allows us to show that there is a relationship between the biological community state and habitat and that this relationship varies across river basins, while accounting for the within- and between-cluster spatial correlations.

Key Words: Bayesian methods; Biological monitoring; Ecological health; Spatial statistics; Stressor-response.
Edward L. Boone is Assistant Professor, Department of Mathematics and Statistics, University of North Carolina, Wilmington, NC 28403 (E-mail: [email protected]). Keying Ye is Associate Professor, and Eric P. Smith is Professor, Department of Statistics, Virginia Tech, Blacksburg, VA 24061. ©2005 American Statistical Association and the International Biometric Society. Journal of Agricultural, Biological, and Environmental Statistics, Volume 10, Number 2, Pages 131–147. DOI: 10.1198/108571105X45922

1. INTRODUCTION

Environmental data pose many challenges to data analysts such as issues of scale, missing values, irregular sampling, and below-detection measurements. One problem that has not received much attention is a scale issue, namely that data are clustered by nature. For example, in aquatic studies, watersheds often have attributes that induce similarity among samples. The same attributes often result in differences between watersheds. In cases like these, it is often more informative to draw inferences from the combined data that ignore the watershed effect. These combined data allow researchers to make decisions based on the full data rather than about an individual cluster. In addition to clustering, environmental data often have spatial correlation present within and between
clusters and any inferences about these data should account for this spatial correlation. Bayesian methodology gives us the ability to combine information across clusters while accounting for both within- and between-cluster spatial correlations.

Geostatistical models and lattice models are widely used in the literature. Both Cressie (1993) and Haining (1990) provided excellent coverage of these topics. Geostatistical models directly incorporate spatial correlation between two points into the model via a function of a distance metric. For example, if we are interested in a random variable $Y$ at various sites, then $Y(s_i)$, $Y$ at site $i$, is correlated with $Y(s_j)$, $Y$ at site $j$, through a function $C$ of the distance between the two sites: $\sigma[Y(s_i), Y(s_j)] = C(|s_i - s_j|)$, where $|\cdot|$ is Euclidean distance. Note that $C$ must be chosen carefully so that the resulting variance-covariance matrix is positive definite. This is in contrast to lattice models, which use neighbors to induce the spatial correlation. When distances are available and meaningful, the geostatistical approach may be reasonable. When this is not the case, lattice correlation structures may be acceptable.

The basic underlying idea in the lattice correlation structures we will be using is autocorrelation. Specifically, we will be interested in the simultaneous autocorrelation structure. For example, if we are interested in a random variable $Y$ at a specific site $s_i$, denoted by $Y(s_i)$, this variable is a function of its neighbors:

$$Y(s_i) = \mu(s_i) + \sum_{j=1}^{n} b_{ij}\,[Y(s_j) - \mu(s_j)] + \epsilon(s_i),$$

where $\mu(s_i)$ is the mean of $Y$ at site $s_i$, the $b_{ij}$ are parameters governing the autocorrelation, and $\epsilon(s_i)$ is an error term.

This article presents a Bayesian hierarchical model using geostatistical spatial correlation structures at the lower level (site level), and lattice correlation structures at the second level (river basin level). The structure of the hierarchical model allows for combining information across clusters and thus drawing coherent inferences while respecting the spatial correlation present in the data. We estimate the model using Markov chain Monte Carlo techniques.

Section 2 describes a dataset collected by the Ohio Environmental Protection Agency concerning benthic health of Ohio's waterways that motivated this work. Section 3 introduces our proposed model and Section 4 discusses the prior distributions and the full conditionals needed to sample from the posterior distribution. Fitting the model to the data and discussion of the results are in Section 5. Also in Section 5 we consider how to test various neighborhood structures using Bayes factors. Section 6 contains the conclusions.
2. THE DATA

To measure biological condition or health, the state of Ohio employs the Index of Biotic Integrity (IBI) (Ohio EPA 1988). This index is a sum of metrics derived from the presence, abundance, and health of fish. High values of the IBI correspond to healthier fish populations and, conversely, low IBI corresponds to degraded biological health. When the IBI is high, researchers observe organized communities of insectivores, carnivores, and intolerant fish. In low IBI areas, however, little community organization, few species, and a large percentage of physical anomalies are observed (Ohio EPA 1988). In addition, the Ohio EPA employs the Qualitative Habitat Evaluation Index (QHEI) to measure habitat quality (Ohio EPA 1989). High values of QHEI correspond to high habitat quality that is supportive of aquatic life. QHEI is composed of various scores associated with sediment, riffle, cover, riparian, and pool quality.

The data we use in this study were collected in 19 of the 23 river basins in Ohio. These basins are shown in Figure 1. An overall sample size of 765 was collected. The sample sizes available in each river basin varied from a minimum of 5 observations in the Huron River basin to 60 observations in the Cuyahoga River basin. Both IBI and QHEI were standardized to the grand mean. Table 1 shows the summary statistics for standardized IBI and QHEI by river basin. From Table 1 we see that IBI, QHEI, and the correlation between IBI and QHEI do not appear to be homogeneous across river basins. The dataset also includes scaled longitude and latitude measures from which we created a Euclidean distance, d, between the sampling points. Table 1 also shows that min(d) and max(d) differ across river basins.

Figure 1. Map of Ohio's Waterways. The river basins are as follows: A = Hocking, B = Scioto, C = Grand, D = Maumee, E = Sandusky, F = Central Tributaries, G = Ashtabula, H = Southeast Tributaries, I = Little Miami, J = Huron, K = Rocky, L = Great Miami, M = Chagrin, N = Portage, O = Muskingum, P = Mahoning, Q = Cuyahoga, R = Black, and S = Mill Creek.

Ohio's 23 river basins provide a natural clustering in terrain and urbanization to account for the heterogeneity in each river's environment. For example, the Great Miami river basin
(L) includes two large cities, Dayton and Cincinnati. In contrast, the Hocking river is primarily rural and forested. The Cuyahoga (Q), Rocky (K), and Chagrin (M) are relatively short in length and flow through the greater Cleveland area. Hence, we cannot assume that habitat in each river basin will have homogeneous relationships with biological health. Using a hierarchical model to combine information across river basins will allow researchers to better understand how habitat relates to the overall biological health in Ohio's waterways while still permitting the river basins to be heterogeneous. Norton (1999) performed a cluster analysis on a subset of these data and found that benthic health clusters on river basins. More clustering results were observed by Lipkovich (2002).

Table 1. Summary Statistics on Standardized QHEI and IBI Across River Basins. Corr is the Pearson correlation between IBI and QHEI; min(d) and max(d) are the minimum and maximum distances present in the river basins.

| River basin | Label | IBI mean | IBI SD | QHEI mean | QHEI SD | Corr | min(d) | max(d) | n_i |
|---|---|---|---|---|---|---|---|---|---|
| Hocking | A | .340 | .940 | −.259 | .870 | .543 | .024 | 9.556 | 51 |
| Scioto | B | .599 | 1.006 | .385 | 1.106 | .599 | .357 | 18.547 | 51 |
| Grand | C | .943 | .780 | .264 | .817 | .702 | .032 | 5.019 | 24 |
| Maumee | D | −.526 | .654 | −.453 | 1.066 | .160 | .030 | 14.731 | 48 |
| Sandusky | E | −.071 | .719 | −.145 | .918 | .240 | .007 | 8.111 | 32 |
| Central | F | .873 | 1.008 | .371 | .607 | .540 | .008 | 12.011 | 39 |
| Ashtabula | G | .370 | .816 | .029 | 1.189 | .822 | .051 | 2.019 | 9 |
| Southeast | H | −.257 | 1.029 | −.256 | .777 | .389 | .001 | 9.350 | 57 |
| Little Miami | I | .519 | .920 | .489 | .786 | .549 | .009 | 9.261 | 57 |
| Huron | J | −1.051 | .679 | −1.038 | .541 | .653 | .063 | .785 | 5 |
| Rocky | K | −.071 | .913 | .166 | .631 | .334 | .005 | 4.308 | 38 |
| Great Miami | L | .570 | .780 | .468 | .822 | .251 | .119 | 15.270 | 51 |
| Chagrin | M | .311 | .679 | .494 | .636 | .404 | .001 | 4.277 | 42 |
| Portage | N | −.490 | .688 | −1.064 | 1.201 | .717 | .001 | 7.139 | 43 |
| Muskingum | O | .035 | 1.046 | −.152 | 1.247 | .480 | .003 | 15.309 | 53 |
| Mahoning | P | −.829 | .520 | −.113 | .888 | .055 | .007 | 8.811 | 53 |
| Cuyahoga | Q | −.684 | .929 | .063 | .889 | .177 | .001 | 6.073 | 60 |
| Black | R | −.311 | .545 | .067 | .915 | .392 | .011 | 4.549 | 43 |
| Mill Creek | S | −1.322 | .407 | −.781 | 1.349 | .549 | .084 | 1.783 | 9 |

Within each basin, rivers and streams flow through cities, forests, agricultural areas, and so on. Thus it is likely that observations near each other will exhibit spatial correlation. For example, two observations taken from the same riffle area are likely to be correlated because the local environment is similar. This spatial correlation needs to be accounted for in order to draw valid inferences. Benthic health between basins may also be correlated. As mentioned earlier, the Cuyahoga, Rocky, and Chagrin rivers all have much of their length encompassed by the Cleveland metropolitan area. Hence, the relationship of habitat to benthic health may be similar in these rivers and we need to accommodate this between-cluster correlation in our analysis. Furthermore, we note that the river basins in Ohio form a highly irregular lattice structure.

To explore whether spatial correlation may be present in the lower levels we employ the classical empirical semivariogram estimate of IBI (Cressie 1993). Due to the low number of observations in some of the river basins we pooled all the data by simply ignoring the basin in which the data were collected. This pooling allows us to initially explore the spatial
correlation at the site level. Figure 2 shows a plot of the estimated semivariogram. From this we see that the semivariogram increases to an asymptote as the distance increases, suggesting that spatial correlation may be present at the within-river basin level. Any modeling should attempt to incorporate this correlation in order to make valid inferences. Note that because the data are pooled, the Euclidean distances are larger than the max(d) reported in Table 1.

Figure 2. Classical empirical semivariogram estimate of IBI for pooled data.
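As an illustration of the classical (method-of-moments) empirical semivariogram referred to above, the following minimal sketch bins all pairs of sites by distance and averages half the squared differences within each bin. The function name, the binning scheme, and the toy coordinates are assumptions made for illustration only; they are not the authors' code or data.

```python
import numpy as np

def empirical_semivariogram(coords, values, bin_edges):
    """Classical estimator: for each distance bin, half the mean squared
    difference between values at all site pairs whose distance falls in the bin."""
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    i, j = np.triu_indices(len(values), k=1)            # all unordered pairs
    dist = np.linalg.norm(coords[i] - coords[j], axis=1)
    sqdiff = (values[i] - values[j]) ** 2
    which = np.digitize(dist, bin_edges) - 1             # bin index of each pair
    n_bins = len(bin_edges) - 1
    gamma = np.full(n_bins, np.nan)                      # NaN marks empty bins
    counts = np.zeros(n_bins, dtype=int)
    for b in range(n_bins):
        mask = which == b
        counts[b] = mask.sum()
        if counts[b] > 0:
            gamma[b] = 0.5 * sqdiff[mask].mean()
    return gamma, counts

# Toy example (hypothetical sites on a line, not the Ohio data)
toy_coords = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
toy_values = [0.0, 1.0, 3.0]
toy_gamma, toy_counts = empirical_semivariogram(toy_coords, toy_values, [0.5, 1.5, 2.5])
```

A semivariogram that rises with distance and then levels off, as in Figure 2, is the pattern one would read as a sign of short-range spatial correlation.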
3. HIERARCHICAL MODELS

An attraction of hierarchical models is their ability to combine information across studies or clusters. In our application we wish to combine regressions in each river basin to understand parameters at both the river basin level and the state-wide level. The data are further complicated by the fact that they are collected across space and hence may exhibit spatial correlation at both the within- and between-river basin level. Using the Euclidean distances of points within river basins and the neighborhood structure of the river basins we can create a model that reflects the within- and between-river basin correlations. In this section we introduce the model; the prior distributions and the resulting full conditionals needed to estimate the model via the Gibbs sampler are discussed in Section 4.

Our hierarchical model for a vector of sites $s$ will be as follows:

$$\begin{aligned} Y(s) &\sim N(X(s)\beta, \Sigma),\\ \beta &\sim N(Z\gamma, \Gamma),\\ \gamma &\sim N(a, A), \end{aligned}$$
where $Y(s)$ and $X(s)$ are the response vector and the design matrix, respectively, observed at the spatial coordinates $s$, and $\Sigma$ is the covariance matrix for the first level, which depends on the distances between the observations within each cluster or basin. For the second level, $Z$ is the design matrix, $\gamma$ is the overall regression parameter, and $\Gamma$ is the between-cluster covariance matrix. The third level contains the hyperparameters $a$ and $A$ corresponding to the prior mean and variance, respectively, of the $\gamma$ vector.

For our application we organize our data in a format that allows for ease of computation and interpretation. Let $Y(s,i)$, $X(s,i)$, and $\beta(i) = (\beta_{0,i}, \beta_{1,i})'$ be the response vector, the design matrix, and the regression vector, respectively, for cluster $i$. We organize the response vector $Y(s)$, design matrix $X(s)$, and mean vector $\beta$ as follows:

$$Y(s) = \begin{pmatrix} Y(s,1)\\ Y(s,2)\\ \vdots\\ Y(s,p) \end{pmatrix},\qquad X(s) = \begin{pmatrix} X(s,1) & 0 & \cdots & 0\\ 0 & X(s,2) & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & X(s,p) \end{pmatrix},\qquad \beta = \begin{pmatrix} \beta(1)\\ \beta(2)\\ \vdots\\ \beta(p) \end{pmatrix},$$

where the 0's are zero blocks of the appropriate dimensions. We will use the exponential covariance structure at the first level, which is given by

$$\Sigma(\rho_i)_{jk} = \begin{cases} \sigma_i^2\, e^{-3 d_{jk}/\rho_i} & \text{if } s_j, s_k \in B(i),\\ 0 & \text{otherwise}, \end{cases}$$

where $B(i)$ denotes the sites $s$ in cluster $i$, and $\sigma_i^2$ and $\rho_i$ denote the variance and correlation parameters, respectively, for the $i$th cluster. In this parameterization $d_{jk}$ is the distance between observations at locations $s_j$ and $s_k$, and $\rho_i$ corresponds to the practical range (Schabenberger and Pierce 2001), which is the minimum distance for which the correlation between observations $j$ and $k$ is less than .05 (at $d_{jk} = \rho_i$ the correlation is $e^{-3} \approx .0498$). This covariance structure also allows for ease of computation and interpretation of the parameter $\rho_i$. Because points that lie in different clusters are not correlated via Euclidean distance, $\Sigma$ is a block diagonal matrix with each block corresponding to the covariance matrix for one cluster, denoted by $\Sigma(i)$. We will define $Z$, $\gamma$, $\sigma^2_\gamma$, and $\Gamma$ as follows:
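$$Z = \begin{pmatrix} 1 & 0 & 1 & 0 & \cdots & 1 & 0\\ 0 & 1 & 0 & 1 & \cdots & 0 & 1 \end{pmatrix}',\qquad \gamma = \begin{pmatrix} \gamma_0\\ \gamma_1 \end{pmatrix},\qquad \sigma^2_\gamma = \begin{pmatrix} \sigma^2_{\gamma_0}\\ \sigma^2_{\gamma_1} \end{pmatrix}.$$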
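The first-level covariance and the second-level design matrix defined above can be sketched numerically as follows. This is only an illustration under stated assumptions: the helper names (`exp_cov`, `block_diag`, `second_level_design`), the coordinates, and the parameter values are hypothetical, and each basin is assumed to contribute an intercept and one slope.

```python
import numpy as np

def exp_cov(coords, sigma2, rho):
    """Exponential covariance sigma2 * exp(-3 d / rho) for the sites of one cluster."""
    diff = coords[:, None, :] - coords[None, :, :]
    d = np.linalg.norm(diff, axis=-1)                    # pairwise Euclidean distances
    return sigma2 * np.exp(-3.0 * d / rho)

def block_diag(blocks):
    """Stack per-cluster covariance blocks on the diagonal; zeros elsewhere."""
    n = sum(b.shape[0] for b in blocks)
    out = np.zeros((n, n))
    pos = 0
    for b in blocks:
        k = b.shape[0]
        out[pos:pos + k, pos:pos + k] = b
        pos += k
    return out

def second_level_design(p):
    """Z of shape (2p, 2): a 2 x 2 identity repeated for each of the p clusters."""
    return np.kron(np.ones((p, 1)), np.eye(2))

# Hypothetical two-cluster example
coords_a = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
coords_b = np.array([[5.0, 5.0], [5.0, 6.0]])
Sigma = block_diag([exp_cov(coords_a, 1.5, 2.0), exp_cov(coords_b, 0.8, 1.0)])
Z = second_level_design(2)
prior_mean_beta = Z @ np.array([0.1, 0.4])               # (g0, g1, g0, g1)
```

In this sketch the correlation between the two sites of the first cluster that are exactly `rho = 2.0` apart equals `exp(-3)`, which is the practical-range interpretation given in the text.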
To model the correlation structure between the river basins we impose the following covariance structure on $\beta_0 = (\beta_{0,1}, \ldots, \beta_{0,p})'$:

$$\Gamma(\phi_0) = \sigma^2_{\gamma_0}\,(I - \phi_{\gamma_0} N)^{-1}(I - \phi_{\gamma_0} N')^{-1},$$

and on $\beta_1 = (\beta_{1,1}, \ldots, \beta_{1,p})'$:

$$\Gamma(\phi_1) = \sigma^2_{\gamma_1}\,(I - \phi_{\gamma_1} N)^{-1}(I - \phi_{\gamma_1} N')^{-1}.$$

Furthermore, $\beta_{0,i}$ and $\beta_{1,j}$ are uncorrelated for all $i, j$. Here $N$ is the neighborhood matrix for the lattice process, $\phi_{\gamma_0}$ and $\phi_{\gamma_1}$ are the parameters governing the spatial correlation, and $\sigma^2_{\gamma_0}$ and $\sigma^2_{\gamma_1}$ are the variances corresponding to $\gamma_0$ and $\gamma_1$, respectively. The neighborhood matrix, $N$, is defined by $N_{ij} = 1$ if clusters $i$ and $j$ are neighbors, and $N_{ij} = 0$ otherwise. This spatial structure is known as a simultaneous autoregressive (SAR) model for lattice data (Cressie 1993). We denote by $\Gamma(\phi)$ the second-stage covariance matrix, where the rows and columns of $\Gamma(\phi)$ correspond to the rows of $\beta$.

By calculating the marginal distribution, the resulting model for vectors of observations $Y(s,i)$ and $Y(s',i')$ has the following covariance structure:

$$\operatorname{cov}(Y(s,i), Y(s',i')) = \begin{cases} X(s,i)\,\Gamma_{i,i'}\,X(s',i')' + \Sigma(i)_{s,s'}, & \text{if } i = i',\\ X(s,i)\,\Gamma_{i,i'}\,X(s',i')', & \text{if } i \neq i', \end{cases}$$

where $X(s,i)$ and $X(s',i')$ are the corresponding covariate matrices and $\Gamma_{i,i'}$ is the second-level covariance structure between $\beta(i)$ and $\beta(i')$. From this covariance structure we see that the process is not second-order stationary. The covariance between two points depends on more than just the difference between the sites. It also depends on whether the points are in the same group and which group they are in. This situation does not lend itself well to the nonstationary methods available in the literature, since we have edges in the middle of the overall domain.

In the special case where $X(s,i)$ is a vector of ones this becomes an additively separable spatial structure, $C(d,i,i') = C_L(i,i') + I_{\{i=i'\}}\,C_G(d)$, where $C_L(i,i')$ is the covariance between clusters $i$ and $i'$ imposed by the lattice structure, $I_{\{i=i'\}}$ is an indicator function taking the value one when $i = i'$ and zero otherwise, and $C_G(d)$ is the geostatistical covariance structure. In this case, we have a lattice structure added to a truncated geostatistical structure. When $X(s,i)$ is not a vector of ones the structure no longer remains strictly additive. The covariance is also a function of the covariates at each site, as will be the case in our example. Because this model does not have a separable covariance structure, we cannot decompose it into a simpler model.
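To make the second-level structure concrete, the sketch below builds a SAR covariance matrix from a small neighborhood matrix and evaluates the between-basin part of the marginal covariance for two sites. The three-basin chain, the parameter values, and the function names are hypothetical; the design row for a site is assumed to be (1, x), with x the standardized habitat score.

```python
import numpy as np

def sar_cov(N, phi, sigma2):
    """SAR covariance: sigma2 * (I - phi N)^{-1} (I - phi N')^{-1}."""
    I = np.eye(N.shape[0])
    left = np.linalg.inv(I - phi * N)
    right = np.linalg.inv(I - phi * N.T)
    return sigma2 * left @ right

def cross_basin_cov(x_s, x_t, G0, G1, i, j):
    """(1, x_s) diag(G0[i, j], G1[i, j]) (1, x_t)': the lattice-induced covariance
    between a site in basin i and a site in basin j. Intercepts and slopes are
    uncorrelated, so the 2 x 2 block between beta(i) and beta(j) is diagonal."""
    return G0[i, j] + x_s * x_t * G1[i, j]

# Hypothetical chain of three basins: 0 -- 1 -- 2
N = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
G0 = sar_cov(N, phi=0.3, sigma2=0.2)     # covariance of the intercepts
G1 = sar_cov(N, phi=-0.1, sigma2=0.04)   # covariance of the slopes

c_same_x = cross_basin_cov(1.0, 1.0, G0, G1, 0, 1)
c_other_x = cross_basin_cov(1.0, -2.0, G0, G1, 0, 1)
```

Because the second term depends on the covariate values at both sites, two pairs of sites in the same pair of basins can have different covariances, which is the non-separability noted in the text. For two sites in the same basin, the first-level term would be added to this quantity.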
4. COMPUTATION

To estimate the posterior distribution of the model parameters we use the Gibbs sampling method with Metropolis steps for any sampling from a nonstandard distribution. More detail on these and other Markov chain Monte Carlo methods was given by Gilks, Richardson, and Spiegelhalter (1996). We will use the samples from the posterior distribution to draw inferences about the model parameters.
4.1 PRIOR DISTRIBUTIONS

To be consistent, we used the same prior distributions for each of the river basins. For the first layer parameters we chose the following prior distributions:

$$\sigma_i^2 \sim \text{Inv-Gamma}(\nu/2, 1/2),\qquad \rho_i \sim \text{Exponential}(c).$$

The prior distribution for $\sigma_i^2$ can also be viewed as an Inv-$\chi^2$ distribution with $\nu$ degrees of freedom. If $\nu = 2$, then the distribution is proper with infinite variance, making it a reasonable vague prior distribution for $\sigma_i^2$. The $\rho_i$ parameter was chosen to have an exponential prior distribution in order to reflect our prior belief that each cluster is second-order stationary and hence correlations at large distances are rare, and to reflect the domain of the parameter. Prior distributions for the second stage parameters are:

$$\gamma \sim N(a, A),\qquad \sigma^2_{\gamma_i} \sim \text{Inv-Gamma}(\nu/2, 1/2),\qquad \phi_{\gamma_i} \sim \text{Uniform}(1/\xi_1, 1/\xi_n),$$

where $\xi_1$ and $\xi_n$ are the minimum and maximum eigenvalues of $N$, respectively. These bounds are required to ensure that the resulting covariance matrix is positive definite (Cressie 1993). Recall that $\phi_{\gamma_i}$ is the parameter that controls the correlation associated with $\gamma_i$ in the second level. If $\phi_{\gamma_i} = 0$, then there is no correlation structure present or the correlation structure is misspecified. However, if $\phi_{\gamma_i} \neq 0$, we cannot make any statements about the correlation structure based solely on $\phi_{\gamma_i}$.
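The support of the uniform prior on the SAR parameter and a single draw from each prior can be sketched as below. This is a minimal illustration, assuming a symmetric 0/1 neighborhood matrix and assuming that `c` is the mean of the exponential prior (consistent with the factor $e^{-\rho_i/c}$ that appears in the full conditional of Section 4.2); the function names and the toy matrix are hypothetical.

```python
import numpy as np

def phi_bounds(N):
    """Return (1/xi_1, 1/xi_n), with xi_1 and xi_n the smallest and largest
    eigenvalues of a symmetric neighborhood matrix N."""
    xi = np.linalg.eigvalsh(N)           # eigenvalues in ascending order
    return 1.0 / xi[0], 1.0 / xi[-1]

def draw_priors(N, nu, c, rng):
    """One draw of (sigma2, rho, phi) from the priors described in the text."""
    lo, hi = phi_bounds(N)
    # Inv-Gamma(nu/2, 1/2): reciprocal of a Gamma(shape nu/2, rate 1/2) variate,
    # i.e. the reciprocal of a chi-square variate with nu degrees of freedom
    sigma2 = 1.0 / rng.gamma(shape=nu / 2.0, scale=2.0)
    rho = rng.exponential(scale=c)       # assumption: c is the prior mean
    phi = rng.uniform(lo, hi)
    return sigma2, rho, phi

# Hypothetical chain of three basins: 0 -- 1 -- 2
N_toy = np.array([[0.0, 1.0, 0.0],
                  [1.0, 0.0, 1.0],
                  [0.0, 1.0, 0.0]])
lo_toy, hi_toy = phi_bounds(N_toy)
rng_toy = np.random.default_rng(0)
sigma2_toy, rho_toy, phi_toy = draw_priors(N_toy, nu=1, c=1.0, rng=rng_toy)
```

For any value of `phi` strictly inside these bounds the matrix I − φN is positive definite, so the SAR covariance built from it is well defined.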
4.2 FULL CONDITIONAL POSTERIOR DISTRIBUTIONS

We estimate the model via the Gibbs sampler, which requires full conditional posterior distributions. Let $n_i$ denote the number of observations in cluster $i$ and $n_c$ the number of clusters. The model and prior distributions lead to the following full conditionals.

First-layer parameters:

$$\beta \mid \cdot \sim N\{\mu_{\beta|\cdot}, \Sigma_{\beta|\cdot}\},$$

where

$$\mu_{\beta|\cdot} = [X(s)'\Sigma(\rho)^{-1}X(s) + \Gamma(\phi)^{-1}]^{-1}\,[X(s)'\Sigma(\rho)^{-1}Y(s) + \Gamma(\phi)^{-1}Z\gamma],$$

$$\Sigma_{\beta|\cdot} = [X(s)'\Sigma(\rho)^{-1}X(s) + \Gamma(\phi)^{-1}]^{-1},$$

and

$$\sigma_i^2 \mid \cdot \sim \text{Inv-Gamma}\left(\frac{n_i + \nu}{2},\ \frac{(Y(s,i) - X(s,i)\beta(i))'\,\Sigma(\rho_i)^{-1}\,(Y(s,i) - X(s,i)\beta(i)) + 1}{2}\right),$$

$$\pi(\rho_i \mid \cdot) \propto e^{-\rho_i/c}\,|\Sigma(\rho_i)|^{-1/2}\exp\left\{-\tfrac{1}{2}(Y(s,i) - X(s,i)\beta(i))'\,\Sigma(\rho_i)^{-1}\,(Y(s,i) - X(s,i)\beta(i))\right\}.$$

Second-layer parameters:

$$\gamma \mid \cdot \sim N\{[Z'\Gamma(\phi)^{-1}Z + A^{-1}]^{-1}[Z'\Gamma(\phi)^{-1}\beta + A^{-1}a],\ [Z'\Gamma(\phi)^{-1}Z + A^{-1}]^{-1}\},$$

$$\sigma^2_{\gamma_i} \mid \cdot \sim \text{Inv-Gamma}\left(\frac{n_c + \nu}{2},\ \frac{(\beta - Z\gamma)'\,\Gamma(\phi_i)^{-1}\,(\beta - Z\gamma) + 1}{2}\right),$$

$$\pi(\phi_i \mid \cdot) \propto |\Gamma(\phi_i)|^{-1/2}\exp\left\{-\tfrac{1}{2}(\beta - Z\gamma)'\,\Gamma(\phi_i)^{-1}\,(\beta - Z\gamma)\right\},$$

where $\pi(x \mid \cdot)$ denotes the conditional distribution of $x$ given all other quantities. The full conditionals for $\phi$ and $\rho$ are not standard distributions, so they are sampled with Metropolis steps. Choosing the step parameter for the Metropolis method can be difficult. To determine each Metropolis step we ran a chain of 2,000 samples from the Gibbs sampler, used the last 1,000 samples to compute the variance of the unique samples, and used one-third of this variance as the step value in the remaining computations. These samples were not used to draw any inferences.
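A random-walk Metropolis update for a range parameter $\rho_i$, of the kind used inside the Gibbs sampler, might be sketched as follows. The sketch assumes a Gaussian likelihood for the residuals of one cluster, an exponential prior with mean `c`, and a normal proposal with variance `step_var`; all function and variable names, and the toy data, are hypothetical rather than the authors' implementation.

```python
import numpy as np

def log_cond_rho(rho, resid, dist, sigma2, c):
    """Log of the unnormalized full conditional of rho for one cluster:
    exponential prior times the Gaussian likelihood of the residuals."""
    if rho <= 0:
        return -np.inf                                   # outside the parameter domain
    Sigma = sigma2 * np.exp(-3.0 * dist / rho)
    sign, logdet = np.linalg.slogdet(Sigma)
    if sign <= 0:
        return -np.inf                                   # numerically singular matrix
    quad = resid @ np.linalg.solve(Sigma, resid)
    return -rho / c - 0.5 * logdet - 0.5 * quad

def metropolis_step(rho, step_var, rng, **kw):
    """One random-walk Metropolis update; returns (new value, accepted flag)."""
    prop = rho + rng.normal(scale=np.sqrt(step_var))
    log_ratio = log_cond_rho(prop, **kw) - log_cond_rho(rho, **kw)
    if np.log(rng.uniform()) < log_ratio:
        return prop, True
    return rho, False

# Toy run: five hypothetical sites on a line, simulated residuals
rng_mh = np.random.default_rng(1)
x = np.array([0.0, 0.5, 1.0, 2.0, 4.0])
dist_toy = np.abs(x[:, None] - x[None, :])
resid_toy = rng_mh.normal(size=5)
rho_now, rho_draws = 1.0, []
for _ in range(200):
    rho_now, _ = metropolis_step(rho_now, 0.25, rng_mh, resid=resid_toy,
                                 dist=dist_toy, sigma2=1.0, c=1.0)
    rho_draws.append(rho_now)
```

Proposals at or below zero have log density of minus infinity and are always rejected, so the chain stays in the domain of the parameter. A tuning rule such as the one described in the text would set `step_var` from the variance of a pilot run.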
5. DATA ANALYSIS

To analyze the data we need to set values for the prior parameters. For the prior parameters for $\gamma$ we chose $a = 0$, a two-row vector of 0's, and $A$ a 2 × 2 matrix with prior variance $A_{ii} = 100$ on the diagonal and 0's off the diagonal, to reflect our lack of information about where $\gamma$ lies. For all of the variance parameters $\sigma_i^2$ and $\sigma^2_{\gamma_i}$ we chose the prior parameter $\nu = 1$ to achieve a prior distribution with infinite variance, reflecting minimal influence. Each of the prior distributions for $\rho$ was assigned a value of $c = 1$ to reflect a small prior practical range. Finally, for $\phi_i$, the bounds of the uniform prior were $1/\xi_1 = -.3632$ and $1/\xi_n = .2502$, where $\xi_1$ and $\xi_n$ are the minimum and maximum eigenvalues of $N$, respectively.

Using the Gibbs sampler, we obtained 10 over-dispersed chains of 3,500 samples (Gilks et al. 1996). The first 1,000 samples were discarded from each chain as burn-in samples. Autocorrelation was present in each of the chains out to lag five. Hence the effective posterior distribution sample size is smaller than the 25,000 samples collected. We will use all 25,000 samples to draw inferences. To assess whether enough samples from the Gibbs sampler have been collected we use the potential scale reduction method in Gelman, Carlin, Stern, and Rubin (1995). Using this method we estimate the factor $\hat{R}$ by which the variability might be reduced by continuing sampling. Values "near" 1 are considered ideal and Gelman et al. (1995) suggested values less than 1.2 are acceptable in practice. Due to the fact
that the computation of $\hat{R}$ depends on variance calculations and that our samples are dependent, we used a thinning interval of 10 to make the samples for this computation "pseudo-independent." Tables 2 and 3 show the summaries of these posterior samples. In Table 3, we see that for each of the coefficients, $\hat{R}$ is very close to 1. For all 84 parameters the maximum $\hat{R}$ was 1.0267, indicating that the samples obtained are adequate and further sampling is not required.

Table 2. Estimated Lower, Median, and Upper Posterior Percentiles, Probabilities P* = min{P(β1i < 0), P(β1i > 0)}, Potential Scale Reduction Factor, and Sample Sizes for the River Basins.

| River basin | Label | L95 | M | U95 | P* | R̂ | n_i |
|---|---|---|---|---|---|---|---|
| Hocking | A | .240 | .474 | .709 | .0001 | 1.0009 | 51 |
| Scioto | B | .208 | .404 | .604 | .0001 | 1.0000 | 51 |
| Grand | C | .197 | .447 | .711 | .0006 | 1.0008 | 24 |
| Maumee | D | −.016 | .142 | .304 | .0385 | 1.0000 | 48 |
| Sandusky | E | .083 | .235 | .394 | .0015 | 1.0004 | 32 |
| Central | F | .372 | .614 | .858 | .0001 | 1.0003 | 39 |
| Ashtabula | G | .126 | .430 | .732 | .0041 | 1.0004 | 9 |
| Southeast | H | .138 | .422 | .689 | .0020 | 1.0004 | 57 |
| Little Miami | I | .312 | .543 | .780 | .0001 | 1.0005 | 57 |
| Huron | J | .030 | .416 | .822 | .0185 | 1.0004 | 5 |
| Rocky | K | .019 | .348 | .663 | .0196 | 1.0004 | 38 |
| Great Miami | L | .028 | .218 | .418 | .0122 | 1.0001 | 51 |
| Chagrin | M | .151 | .375 | .608 | .0008 | 1.0006 | 42 |
| Portage | N | .271 | .407 | .542 | .0001 | 1.0000 | 43 |
| Muskingum | O | .227 | .411 | .595 | .0001 | 1.0003 | 53 |
| Mahoning | P | −.036 | .137 | .316 | .0615 | 1.0006 | 53 |
| Cuyahoga | Q | .078 | .307 | .544 | .0042 | 1.0006 | 60 |
| Black | R | .116 | .281 | .450 | .0005 | 1.0001 | 43 |
| Mill Creek | S | .037 | .232 | .469 | .0108 | 1.0001 | 9 |

Table 3. Estimated Lower, Median, and Upper Posterior Percentiles, Probabilities P* = min{P(ψi < 0), P(ψi > 0)}, and Potential Scale Reduction Factor.

| ψ | L95 | M | U95 | P* | R̂ |
|---|---|---|---|---|---|
| γ0 | −.237 | .005 | .223 | .475 | 1.0007 |
| γ1 | .228 | .364 | .542 | .001 | 1.0007 |
| σγ0 | .313 | .455 | .704 | .000 | 1.0010 |
| σγ1 | .139 | .206 | .326 | .000 | 1.0001 |
| φγ0 | −.208 | .081 | .237 | .275 | 1.0020 |
| φγ1 | −.354 | −.088 | .236 | .342 | 1.0049 |

Figure 3 shows boxplots for each of the mean parameters, by river basin. Recall that both IBI and QHEI are standardized, hence the intercept $\beta_0$ in each river basin level model corresponds to the mean level of standardized IBI. The boxplots for $\beta_0$ show that the river basins appear to have different mean levels of IBI.

Figure 3. Boxplots of the intercept and slope parameter samples obtained from the Gibbs sampler. The river basins are as follows: A = Hocking, B = Scioto, C = Grand, D = Maumee, E = Sandusky, F = Central Tributaries, G = Ashtabula, H = Southeast Tributaries, I = Little Miami, J = Huron, K = Rocky, L = Great Miami, M = Chagrin, N = Portage, O = Muskingum, P = Mahoning, Q = Cuyahoga, R = Black, and S = Mill Creek.

To specifically test the difference in mean IBI between two river basins $i$ and $j$ we can create a new variable $\eta = \beta_{0i} - \beta_{0j}$. We can estimate the marginal posterior distribution for this new parameter $\eta$ by subtracting the samples obtained from the Gibbs sampler for $\beta_{0i}$ and $\beta_{0j}$ and use these differences to estimate the posterior probability $P^* = \min\{P(\eta > 0), P(\eta < 0)\}$ that the posterior
distribution of η is concentrated near 0. If the probability P* is small, then the distribution is not concentrated near 0. If the probabilities are moderate, then there is evidence that the river basins are not different with respect to mean IBI. For example, we can test whether the mean level of IBI is the same in both the Maumee (D) and Great Miami (L) river basins. These two river basins together comprise most of the waterways in western Ohio. The Maumee drains northwest Ohio's water into Lake Erie and the Great Miami drains southwest Ohio's water into the Ohio River. We created the variable η for this situation and estimated P* ≈ .0001. This evidence suggests that the Maumee and Great Miami river basins differ in mean IBI. This is an example of the type of question regulators can pose and answer with this model. The naive analysis assumes that there is no spatial correlation within or between the two basins, whereas our analysis incorporates this information.

Table 2 contains the marginal posterior summaries, which show that the slopes seem to vary across basins. The column P* = min{P(β1i < 0), P(β1i > 0)} in Table 2 gives us the Bayesian p value for testing whether the posterior distribution for the parameter is
concentrated near 0. Also, 95% Bayesian intervals of those parameters are included in Table 2. From this we see that in the Maumee (D) and Mahoning (P) river basins the relationship between IBI and QHEI is not significant at the α = .05 level.

Figure 4 shows the boxplots of the variance and covariance parameters. Even though we view these as nuisance parameters for our analysis we can gain some insight into the spatial structure which may exist in the basins. Recall that, in our parameterization, ρi corresponds to the minimum distance for which the correlation between two points is less than .05, also known as the practical range. For river basins with the marginal posterior distribution of ρi concentrated near 0, the practical range is near 0, hence only points very close together may be spatially correlated. In contrast, river basins with the marginal posterior distribution of ρi extending out beyond 0 may have spatial correlation present. In our application the spatial correlation parameters are nuisance parameters, hence we are not interested in testing their values.

Figure 4. Boxplots of the variance parameter samples obtained from the Gibbs sampler. The river basins are as follows: A = Hocking, B = Scioto, C = Grand, D = Maumee, E = Sandusky, F = Central Tributaries, G = Ashtabula, H = Southeast Tributaries, I = Little Miami, J = Huron, K = Rocky, L = Great Miami, M = Chagrin, N = Portage, O = Muskingum, P = Mahoning, Q = Cuyahoga, R = Black, and S = Mill Creek.

We also notice from Figures 3 and 4 that parameter estimates for the river basins with low sample sizes, Ashtabula (G), Huron (J), and Mill Creek (S), have a relatively large degree
of variability. This is an attractive feature because it shows that the model preserves the uncertainty associated with each river basin. Essentially, this reflects the sampling error in each river basin. The smaller the sample size, the larger we would expect the sampling error to be.

Table 3 shows the estimated posterior percentiles for the second layer parameters, which correspond to the overall relationship. As expected, the posterior distribution for the second layer intercept γ0 is centered near 0 because the data were centered to the overall mean. The posterior distribution for the combined slope parameter γ1 is centered around .364. This indicates that a one standard deviation increase in QHEI (state-wide scale) would be associated with a .364 standard deviation increase in IBI (state-wide scale). This combined information allows researchers and policy makers to make decisions about IBI on a state-wide level. The lower level parameters give researchers the ability to understand how QHEI is related to IBI in each river basin.

Tables 3 and 4 also show the percentiles of the marginal posterior distribution for the SAR parameters φγ0 and φγ1. Both of these distributions have a large amount of probability mass on both sides of 0, which indicates that the correlation between basins is not strong in this situation. This could be due to the correlation structure chosen or to the absence of any correlation structure.

Table 4. Estimated Lower, Median, and Upper Posterior Percentiles for Second Layer Parameters Under Prior Variance Values Aii = 10 and Aii = 10,000.

| Aii | ψ | L95 | M | U95 |
|---|---|---|---|---|
| 10 | γ0 | −.223 | .002 | .214 |
| 10 | γ1 | .226 | .363 | .534 |
| 10 | σγ0 | .315 | .457 | .704 |
| 10 | σγ1 | .139 | .207 | .327 |
| 10 | φγ0 | −.209 | .089 | .239 |
| 10 | φγ1 | −.351 | −.088 | .224 |
| 10,000 | γ0 | −.222 | .003 | .215 |
| 10,000 | γ1 | .230 | .363 | .530 |
| 10,000 | σγ0 | .313 | .458 | .706 |
| 10,000 | σγ1 | .139 | .207 | .326 |
| 10,000 | φγ0 | −.202 | .087 | .244 |
| 10,000 | φγ1 | −.348 | −.087 | .230 |
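The two posterior summaries used throughout this section, the tail probability P* and the potential scale reduction factor, can both be computed directly from the sampler output. The sketch below is a minimal illustration on simulated draws; the function names are hypothetical, and the $\hat{R}$ formula is the basic between/within-chain variance version, which may differ in detail from the exact variant the authors computed.

```python
import numpy as np

def p_star(samples):
    """min{P(x > 0), P(x < 0)} estimated from posterior draws."""
    samples = np.asarray(samples, dtype=float)
    return min(np.mean(samples > 0), np.mean(samples < 0))

def psrf(chains):
    """Potential scale reduction factor for an array of shape (chains, draws)."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    W = chains.var(axis=1, ddof=1).mean()            # mean within-chain variance
    B = n * chains.mean(axis=1).var(ddof=1)          # between-chain variance
    var_plus = (n - 1) / n * W + B / n               # pooled variance estimate
    return np.sqrt(var_plus / W)

# Simulated stand-ins for sampler output (not the paper's draws)
rng_sum = np.random.default_rng(2)
beta0_i = rng_sum.normal(0.5, 0.1, size=25000)       # intercept draws, basin i
beta0_j = rng_sum.normal(-0.5, 0.1, size=25000)      # intercept draws, basin j
eta_p = p_star(beta0_i - beta0_j)                    # difference in mean IBI

good = rng_sum.normal(size=(10, 2500))[:, ::10]      # 10 chains, thinned by 10
bad = good + np.arange(10)[:, None]                  # chains stuck at different levels
r_good, r_bad = psrf(good), psrf(bad)
```

With these simulated inputs, `eta_p` is essentially zero because the two sets of draws barely overlap, `r_good` is close to 1, and `r_bad` is far above the 1.2 threshold mentioned in the text.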
5.1 SENSITIVITY

Many of the prior distributions were chosen to be relatively uninformative. For example, the Uniform distribution was used for the second layer correlation parameters φ, and the Inv-χ² distribution with df = 1 was selected to obtain infinite prior variance for all variance parameters. The prior parameter that will have the largest influence on any inferences is the variance chosen for γ. In order to assess the sensitivity of the inferences to the prior distribution parameters we varied the prior variance for γ. Any sensitivity to this prior variance value will be mainly manifested in the parameters γ0 and γ1. Recall, γ ∼ N(0, A), where A is a 2 × 2 matrix
E. L. BOONE, K. YE, AND E. P. SMITH
144
Table 5. Estimated Lower, Median, and Upper Posterior Percentiles, Probabilities P∗ = min{P(ψi
0)} and Potential Scale Reduction Factor Neighborhood Matrix N∗ ψ
L95
M
U95
P∗
γ0 γ1 σγ0 σγ1 φγ0 φγ1
−.269 .250 .307 .132 −.527 −.435
.018 .357 .455 .201 −.256 −.057
.280 .474 .696 .312 .226 .509
.437 .001 .000 .000 .159 .421
ˆ R
1.0023 1.0000 1.0007 1.0000 1.0019 1.0079
with prior variance of Aii = 100 on the diagonal and 0’s on the off diagonal. By changing the prior variance to both Aii = 10 and Aii = 10,000 then examining the results we can determine the influence of this parameter on our inferences. For each of these scenarios we ran 10 chains of 3,500 samples from the posterior distribution discarding the first 1,000 as burn-in samples. Table 4 shows the posterior percentiles under each scenario. From here we can see that the second layer parameters are not very sensitive to the value of Aii . When further comparing the values in Tables 4 and 3 we see the inferences are very similar. In all the scenarios γ1 is clearly significant and the estimate is stable for the various values of Aii . To explore the sensitivity of the inferences due to the choice of neighborhood matrix we created a new neighborhood matrix N∗ where we restrict the river basins which are neighbors to only those rivers that drain into Lake Erie. The choice of this neighborhood structure comes from the fact that many of river basins feeding into Lake Erie are spanned by the Cleveland and Toledo Metropolitan areas and hence may be correlated. Furthermore, only river basins with adjoining edges will be considered neighbors. We then fit the model for this new neighborhood matrix using the method described earlier. Because we are only interested in the sensitivity of the inferences to neighborhood matrix we ran 10 chains of 3,500 samples from the posterior distribution discarding the first 1,000 in each as burnin samples. Using these samples we estimated the marginal posterior percentiles for the second layer parameters. Table 5 shows the estimated posterior percentiles for the second layer parameters using neighborhood matrix N∗ . From Table 5 we see that the spatial correlation parameter φγ0 using N∗ has marginal posterior distribution concentrated below 0. This shows the sensitivity of the analysis to the neighborhood matrix. 
The interval for γ0 is slightly wider, by a factor of 1.16, and γ1 is smaller by a factor of 0.71. Thus, the estimates depend on the choice of neighborhood matrix. This is a common problem in spatial statistics (see Bullock 2002). The two models here, one with neighborhood matrix N and the other with N∗, are not nested. Furthermore, they have exactly the same number of parameters. Hence, any form of likelihood ratio test or penalized method such as AIC or BIC is inappropriate. In the next section we examine whether Bayesian testing can differentiate these models. Bayes factors allow us to perform an appropriate test to determine which neighborhood matrix is more likely.
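The construction of N∗ described above can be sketched in a few lines. This is only an illustration under stated assumptions: the four-basin adjacency matrix, the Lake Erie flags, and the function name `restrict_neighborhood` are hypothetical and are not taken from the Ohio data. The rule it encodes is the one in the text: two basins are neighbors only if they share an edge and both drain into Lake Erie.

```python
def restrict_neighborhood(edge_adjacent, in_subset):
    """Keep a neighbor link only when both basins belong to the subset.

    edge_adjacent: symmetric 0/1 matrix, 1 where two basins share an edge.
    in_subset:     one boolean per basin (here: drains into Lake Erie).
    """
    n = len(edge_adjacent)
    return [
        [
            edge_adjacent[i][j] if (in_subset[i] and in_subset[j]) else 0
            for j in range(n)
        ]
        for i in range(n)
    ]


# Hypothetical 4-basin layout (not the Ohio basins).
N_edges = [
    [0, 1, 1, 0],
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 1, 1, 0],
]
# Hypothetical flags: basins 0 and 1 drain into Lake Erie, basins 2 and 3 do not.
drains_to_erie = [True, True, False, False]

N_star = restrict_neighborhood(N_edges, drains_to_erie)
for row in N_star:
    print(row)
```

Basins outside the subset end up with all-zero rows, so under this sketch they have no second-level neighbors at all.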
5.2 BAYES FACTORS FOR TESTING MODELS Given two models M1 and M0, the Bayes factor of M1 against M0 is given by
\[ B_{10} = \frac{p(D \mid M_1)}{p(D \mid M_0)}, \]
where
\[ p(D \mid M_i) = \int L(D \mid \theta_i, M_i)\, p(\theta_i \mid M_i)\, d\theta_i . \]
Berger (1985) gave a good discussion of Bayes factors, and Kass and Raftery (1995) gave interpretations for Bayes factors. In our case we wish to test which neighborhood matrix is most appropriate for our situation. Let M1 be the model using the fully connected neighborhood matrix N, M2 be the model using the Lake Erie connected neighborhood matrix N∗, and M3 be the model having no second-level correlation structure. Using the samples from the Gibbs sampler we approximate p(D|Mi) by
\[ p(D \mid M_i) \approx \frac{1}{r} \sum_{k=1}^{r} L(D \mid \theta_{i,k}, M_i)\, p(\theta_{i,k} \mid M_i), \tag{5.1} \]
where r is the number of samples and θi,k is the kth sample from the joint posterior distribution. We computed the quantity in Equation (5.1) for each model using 25,000 samples from the posterior distribution obtained from the Gibbs sampler. Using these we formed the Bayes factors of interest, and the results are B12 ≈ 5.9, B13 ≈ 1.0, and B32 ≈ 5.0. Using the interpretations of Kass and Raftery (1995), we see that the Bayes factor B13 ≈ 1.0 gives minimal to no evidence that the fully connected second-level neighborhood structure is better than the structure with no second-level spatial correlation. Furthermore, B12 ≈ 5.9 provides evidence that the fully connected neighborhood structure is better than the Lake Erie connected neighborhood structure. From B32 ≈ 5.0 we see that there is positive evidence that the structure with no second-level spatial correlation is better than the Lake Erie neighborhood structure. Because M1 and M3 are both better supported than M2, and we cannot determine that M1 is clearly better than M3, we suggest choosing M3 in order to have a parsimonious model. Due to the low number of basins and the highly irregular lattice structure, we found it difficult to accurately determine the spatial correlation structure at the second layer for this dataset. It may be that M3 is favorable because the Euclidean distance metric used in the first level is misspecified.
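The computation in Equation (5.1) and the reading of the resulting Bayes factors can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and working on the log scale is an assumption made here because products of likelihood and prior values easily underflow for real datasets. The sketch simply mirrors the average in Equation (5.1) as stated, and the evidence categories use the Bayes factor cut points of 3, 20, and 150 from Kass and Raftery (1995).

```python
import math


def log_mean_exp(log_terms):
    """Return log((1/r) * sum_k exp(log_terms[k])) without underflow."""
    m = max(log_terms)
    return m + math.log(sum(math.exp(t - m) for t in log_terms) / len(log_terms))


def log_evidence(samples, log_lik, log_prior):
    """Log of the average in Equation (5.1) taken over the posterior draws.

    log_lik(theta) and log_prior(theta) are user-supplied functions
    (hypothetical placeholders for a model's log-likelihood and log-prior).
    """
    return log_mean_exp([log_lik(th) + log_prior(th) for th in samples])


def bayes_factor(log_ev_a, log_ev_b):
    """Bayes factor of model a against model b from two log approximations."""
    return math.exp(log_ev_a - log_ev_b)


def kass_raftery_label(b):
    """Evidence category for a Bayes factor b (Kass and Raftery 1995)."""
    if b < 1:
        return "favors the other model"
    if b <= 3:
        return "not worth more than a bare mention"
    if b <= 20:
        return "positive"
    if b <= 150:
        return "strong"
    return "very strong"


# Bayes factors reported in the text.
for name, b in [("B12", 5.9), ("B13", 1.0), ("B32", 5.0)]:
    print(name, b, kass_raftery_label(b))
```

Applied to the reported values, the labels agree with the interpretation given in the text: B12 and B32 fall in the "positive" category, and B13 falls in the lowest category.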
6. CONCLUSIONS The hierarchical model presented here allows researchers to combine information across studies which have spatial correlation within and between the studies. The model does not require each study to have similar variances or spatial correlation structure. Although we have considered only the linear regression case, the model could be extended to
the generalized linear model framework. This extension would be straightforward using the Gibbs sampling technique and would give researchers the ability to consider a greater variety of datasets. Furthermore, we need not restrict ourselves to spatial data settings. Time series data that occur across space could also be analyzed with this type of model, which would incorporate both the temporal and spatial correlation. For our application, this model allowed us to estimate γ, which can be interpreted as the statewide estimate of the average relationship. The model also allowed us to find differences in mean IBI between river basins, specifically the Great Miami (L) and Maumee (D) river basins. Furthermore, the parameters σγ0 and σγ1 in Table 3 are large compared to the respective mean values γ0 and γ1, indicating that there is variability between river basins in both the mean IBI and the regression relationship between IBI and QHEI. We feel that any management or policy decisions about biological health in Ohio’s waterways should be made at the river basin level rather than at the statewide level. All inferences depend on the spatial correlation structure chosen. Choosing another neighborhood matrix N or geostatistical covariance structure may change the results. The conditional autoregressive (CAR) model could be an alternative to the SAR model for modeling the correlations between the river basins. Although our data do not support a second-stage correlation structure, the method presented allows for the modeling and detection of this type of structure. Given the model uncertainty mentioned here, we could use Bayesian model averaging to average the results across various spatial correlation structures and obtain a result that does not depend on the correlation structure chosen. This is appealing because the spatial correlation parameters are nuisance parameters, and averaging would reduce our concern about correlation structure misspecification.
Each model requires approximately 120 hours of computation time. Hence, the computational expense of computing the posterior model probabilities for even a small number of models is prohibitive.
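For the three models already fitted, the weights that Bayesian model averaging would use follow from the Bayes factors by normalization. The sketch below is illustrative only and rests on assumptions not stated in the article: equal prior model probabilities, and the rounded values B12 ≈ 5.9 and B32 ≈ 5.0 with M2 as the common reference. The function name `posterior_model_probs` is hypothetical.

```python
def posterior_model_probs(bf_vs_ref, prior=None):
    """Posterior model probabilities from Bayes factors against one reference model.

    bf_vs_ref: dict mapping model name -> Bayes factor of that model against
               the reference (the reference itself has Bayes factor 1).
    prior:     dict of prior model probabilities; equal priors if omitted
               (an assumption of this sketch).
    """
    names = list(bf_vs_ref)
    if prior is None:
        prior = {m: 1.0 / len(names) for m in names}
    weights = {m: bf_vs_ref[m] * prior[m] for m in names}
    total = sum(weights.values())
    return {m: w / total for m, w in weights.items()}


# Rounded Bayes factors from Section 5.2, each taken against M2.
bf_vs_m2 = {"M1": 5.9, "M2": 1.0, "M3": 5.0}
probs = posterior_model_probs(bf_vs_m2)
for model, p in probs.items():
    print(model, round(p, 3))
```

Under these assumptions M2 receives a weight of about 0.08, with the remainder split between M1 and M3. The expensive step remains fitting each candidate model, so the normalization itself adds essentially nothing to the cost noted above.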
ACKNOWLEDGMENTS We thank the referees and editor for comments on the draft of this article. This research was funded in part by U.S. EPA-Science To Achieve Results (STAR) Grant #RD-83136801-0. Although the research described in the article has been funded wholly or in part by the U.S. Environmental Protection Agency’s STAR programs, it has not been subjected to any EPA review and therefore does not necessarily reflect the views of the Agency, and no official endorsement should be inferred.
[Received March 2004. Revised October 2004.]
REFERENCES
Berger, J. O. (1985), Statistical Decision Theory and Bayesian Analysis (2nd ed.), New York: Springer.
Bullock, B. P. (2002), “Diameter Distributions of Juvenile Stands of Loblolly Pine (Pinus taeda L.) With Different Planting Densities,” unpublished Ph.D. dissertation, Virginia Polytechnic Institute and State University, Blacksburg, VA.
Cressie, N. A. C. (1993), Statistics for Spatial Data, New York: Wiley.
Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. (1995), Bayesian Data Analysis, Boca Raton, FL: Chapman Hall and CRC Press.
Gilks, W. R., Richardson, S., and Spiegelhalter, D. J. (1996), Markov Chain Monte Carlo in Practice, London: Chapman and Hall.
Haining, R. (1990), Spatial Data Analysis for the Social and Environmental Sciences, Cambridge: Cambridge University Press.
Kass, R., and Raftery, A. (1995), “Bayes Factors,” Journal of the American Statistical Association, 90, 773–795.
Lipkovich, I. A., Smith, E. P., and Ye, K. (2002), “Evaluating the Impact of Environmental Stressors on Benthic Microinvertebrate Communities via Bayesian Model Averaging,” Case Studies in Bayesian Statistics (vol. VI), pp. 267–283.
Lipkovich, I. (2002), “Bayesian Model Averaging and Variable Selection in Multivariate Ecological Models,” unpublished Ph.D. dissertation, Virginia Polytechnic Institute and State University, Blacksburg, VA.
Norton, S. B. (1999), “Using Biological Monitoring Data to Distinguish Among Types of Stress in Streams of the Eastern Corn Belt Plains Ecoregion,” unpublished Ph.D. dissertation, George Mason University, Fairfax, VA.
Ohio Environmental Protection Agency (1988), Biological Criteria for the Protection of Aquatic Life: Volume II: Users Manual for Biological Assessment of Ohio Surface Waters, State of Ohio Environmental Protection Agency. WQMA-SWS-6.
Ohio Environmental Protection Agency (1989), The Qualitative Habitat Evaluation Index (QHEI): Rationale, Methods and Application, State of Ohio Environmental Protection Agency.
Schabenberger, O., and Pierce, F. J. (2001), Contemporary Statistical Models for the Plant and Soil Sciences, Boca Raton, FL: Chapman Hall and CRC Press.