Bayesian Nonparametric Binary Response Regression Models with Application in Environmental Management

Song S. Qian
Environmental Sciences and Resources
Portland State University
Portland, OR 97207, USA

Michael Lavine
Institute of Statistics and Decision Sciences
Duke University
Durham, NC 27708, USA

Craig A. Stow
Nicholas School of the Environment
Duke University
Durham, NC 27708, USA
(revised) September 9, 1998
Abstract

In environmental management, we often have to deal with binary response variables whose outcome dictates the course of action. This paper introduces a nonparametric Bayesian binary regression model that is more flexible than the commonly used logistic or probit models. Because the model is Bayesian, it can easily be used to combine observed data with our knowledge of the subject to produce site-specific results. Using three examples, this paper shows the potential application of the model in environmental management, and its advantages in terms of flexibility in model specification, robustness to outliers, and realistic interpretation of data.

Keywords: acid deposition, Bayesian inference, Dirichlet process, fish response, Gibbs sampler, lake eutrophication, PCB, risk assessment, salmonid
1 Introduction

Environmental management decisions are often based on evaluating the likely outcome of a binary response. Whether the water quality of a lake or river will meet designated uses under a specific watershed development scenario, or whether certain fishes will survive at a given pollution level, are familiar examples involving binary responses. The anticipated outcome of these binary responses frequently dictates the environmental management strategy chosen. Complex simulation models are sometimes used to predict binary outcomes, and a management strategy that will result in the desired response is chosen based on simulated results. However, simulation models typically do not provide quantitative measures of prediction uncertainty or of the relative probabilities of the two possible consequences. Generalized linear models (GLM) have also been used to predict binary responses (e.g., Reckhow et al., 1987; Zeger et al., 1988). These models have the disadvantage of an inflexible model form. In addition, applications of GLM in environmental and ecological studies have relied heavily on cross-sectional data, restricting their site-specific applicability. In this paper we present a nonparametric Bayesian binary response model that has a flexible form and is applicable to site-specific problems, but that can also use outside information such as cross-sectional data.

In Section 2 we briefly introduce generalized linear models, some developments of nonparametric and Bayesian methods, and the Bayesian nonparametric binary model. We demonstrate the model with three examples: (1) a lake eutrophication study in North Carolina (USA) (North Carolina lakes), (2) fish response to acid deposition in Adirondack lakes (USA) (Adirondack fish), and (3) a risk assessment of polychlorinated biphenyl (PCB) exposure from Lake Michigan (USA) fish consumption (PCB in fish).
2 Methods: Binary Response Regression Models

We consider a binary response variable $Y$ taking values of 1 or 0, and a single explanatory variable $X$. The most commonly used statistical models for this type of data are the generalized linear models:

$$g(\pi_i) = \beta_0 + \beta_1 x_i \tag{1}$$

where $\pi_i$ is the probability of positive response ($Y$ taking the value 1) when the $X$ value is $x_i$, and $g$ is the link function (McCullagh and Nelder, 1989; Nelder and Wedderburn, 1972). Logistic and probit functions are two commonly used link functions. The logistic function is defined as:

$$g(\pi) = \log\frac{\pi}{1 - \pi} \tag{2}$$

and the probit function is the inverse of a Normal cumulative distribution function:

$$g(\pi) = \Phi^{-1}(\pi) \tag{3}$$

Regardless of the link function used, the parameters of model (1) ($\beta_0$ and $\beta_1$) are estimated by maximum likelihood through an iteratively reweighted least-squares method. The probability of positive response in a logistic regression (2) is:

$$\pi = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}} \tag{4}$$
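As a minimal sketch of the two link functions above, the following Python snippet evaluates the logistic link (2), its inverse (4), and the Normal cdf that inverts the probit link (3). The symbols for the probability and the coefficients, and the coefficient values themselves, are assumptions made here for illustration; they are not estimates from the paper.

```python
import math

def logit(p):
    """Logistic link, equation (2): log(p / (1 - p))."""
    return math.log(p / (1.0 - p))

def inv_logit(eta):
    """Inverse logistic link, equation (4), written in a numerically simple form."""
    return 1.0 / (1.0 + math.exp(-eta))

def normal_cdf(z):
    """Standard Normal cdf, i.e. the inverse of the probit link (3)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Hypothetical coefficients, chosen only for illustration.
beta0, beta1 = -3.0, 40.0
for x in (0.02, 0.075, 0.15):
    eta = beta0 + beta1 * x          # linear predictor of model (1)
    print(x, round(inv_logit(eta), 3), round(normal_cdf(eta), 3))
```

Both inverse links map the linear predictor to a value in (0, 1) along a sigmoid curve, which is the structural constraint the nonparametric model is meant to relax.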
Equation (4) is the cumulative distribution function (cdf) of a logistic distribution. For the probit regression (3), the probability of positive response is given by the cdf of a Normal distribution. In general, a binary response regression model can be summarized as:

$$\pi = F(\eta(X)) \tag{5}$$

where $F$ represents a cdf and $\eta$ represents a function of the explanatory variables. The function $\eta$ may be linear or nonlinear and may contain unknown parameters.

If prior information is available, Bayesian analyses of the binary response regression with either logit or probit link functions can be used. These methods typically involve the Gibbs sampler (Gelfand et al., 1990) or other Markov chain Monte Carlo sampling algorithms. In the absence of prior information, Zeger and Karim (1991) presented a Gibbs sampling algorithm for logistic regression using flat (or non-informative) prior distributions for the parameters; Albert and Chib (1993) provide Bayesian inference for the standard probit regression using Gibbs sampling; and Czado (1994) provides a hybrid sampling algorithm with Gibbs and Metropolis/rejection sampling steps for a generalized probit regression model. Bayesian methods can provide improved inference by accounting for uncertainty regarding model parameters. However, both the Bayesian and non-Bayesian approaches are parametric and constrain the structural form of the functional relationship between the predictor $X$ and the probability $\pi$ to the typical sigmoid form indicated in (4). As such, these parametric approaches fail to convey uncertainty regarding model structure. Hastie and Tibshirani (1990) presented a nonparametric logistic regression model to remove these constraints in $\eta$, but not in $F$.

In this study, a nonparametric Bayesian binary regression model is developed of the form:

$$\pi = f(X) \tag{6}$$

where $f$ is an isotonic nonparametric function, and $0 \le f(X) \le 1$. In a nonparametric and Bayesian setting, we estimate the joint distribution of $(f_1, \ldots, f_n \mid Y)$, where $Y$ represents the binary data, and $f_i$ are the values of $f$ at $x_i$, $i = 1, \ldots, n$. (The $x_i$'s are ordered.) Here $n$ is the number of distinct $x$ values, and we will use $N$ to represent the total sample size.
Since $f$ is bounded below by 0 and above by 1, the transformation from $f_1, \ldots, f_n$ to $s_1, \ldots, s_{n+1}$ is one to one, where $s_i = f_i - f_{i-1}$ for $i = 1, \ldots, n+1$, and $f_0 = 0$ and $f_{n+1} = 1$ are the lower and upper bounds of a cdf. Therefore, the joint distribution of $f_1, \ldots, f_n$ can be described through the distribution of $s_1, \ldots, s_{n+1}$. Since $s_i \in (0, 1)$ and $\sum_{i=1}^{n+1} s_i = 1$, a plausible multivariate probability distribution describing the joint distribution of $\{s_1, \ldots, s_{n+1}\}$ is the Dirichlet distribution (Wilks, 1962), the multivariate version of a beta distribution.

We use the Dirichlet process (Ferguson, 1973; Antoniak, 1974) to describe the prior distribution of $\{s_1, \ldots, s_{n+1}\}$ through two pieces of information: $G_0$, the prior expected shape of $f$ (our best "guess" of what the relationship should look like), and $\alpha$, our confidence in $G_0$. One might interpret $\alpha$ as a measure of faith in the prior guess ($G_0$) measured in units of numbers of observations (Ferguson, 1973). When a prior is derived from cross-sectional data, $\alpha$ is often selected based on the sample size and our judgment of how relevant the cross-sectional data set is to the site-specific problem, as illustrated in the North Carolina lakes example (Section 3.1). Let $DP(G_0, \alpha)$ be the Dirichlet process prior of $f$, which means that the prior distribution of $\{s_1, \ldots, s_{n+1}\}$ is a Dirichlet distribution with parameters $(d_1, \ldots, d_{n+1})$, where $d_i = \alpha (G_{0,i} - G_{0,i-1})$ and $G_{0,i}$ is the value of $G_0$ at $x_i$ (with $G_{0,0} = 0$ and $G_{0,n+1} = 1$).

An important property of a Dirichlet distribution is that the conditional distribution of $s_i$, given the total $s_i + s_{i+1}$ of two neighbouring increments, is a rescaled beta distribution:

$$\frac{s_i}{s_i + s_{i+1}} \;\Big|\; s_i + s_{i+1} \;\sim\; \mathrm{Beta}(d_i, d_{i+1}) \tag{7}$$

or, since $s_i = f_i - f_{i-1}$ and $s_i + s_{i+1} = f_{i+1} - f_{i-1}$,

$$\frac{f_i - f_{i-1}}{f_{i+1} - f_{i-1}} \;\Big|\; f_{i-1}, f_{i+1} \;\sim\; \mathrm{Beta}(d_i, d_{i+1}) \tag{8}$$
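The construction of the Dirichlet prior described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the logistic curve used for the prior expected shape, its coefficients, the predictor values, and the symbol `alpha` for the precision parameter are all hypothetical, and the Dirichlet draw uses the standard normalised-gamma construction rather than anything specified in the paper.

```python
import math
import random

def prior_shape(x, b0=-3.0, b1=40.0):
    """Hypothetical prior expected shape G0: a logistic curve in x."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def dirichlet_params(xs, alpha, g0=prior_shape):
    """d_i = alpha * (G0_i - G0_{i-1}), with G0_0 = 0 and G0_{n+1} = 1."""
    g = [0.0] + [g0(x) for x in xs] + [1.0]
    return [alpha * (g[i] - g[i - 1]) for i in range(1, len(g))]

def draw_prior_curve(d, rng):
    """Draw s ~ Dirichlet(d) via normalised gamma variates; return f_1..f_n."""
    gam = [rng.gammavariate(di, 1.0) for di in d]
    total = sum(gam)
    f, acc = [], 0.0
    for g_i in gam[:-1]:          # the last increment only closes the gap to 1
        acc += g_i / total
        f.append(acc)
    return f

rng = random.Random(1)
xs = [0.01, 0.03, 0.05, 0.08, 0.12]   # ordered, distinct predictor values (made up)
d = dirichlet_params(xs, alpha=20.0)
f = draw_prior_curve(d, rng)
print([round(v, 3) for v in f])
```

Because the increments are positive and sum to one, every draw is a nondecreasing curve between 0 and 1, and the parameters `d` always sum to `alpha`.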
The posterior distribution of $(f_1, \ldots, f_n \mid Y)$ is estimated by using the Gibbs sampler (Geman and Geman, 1984; Gelfand and Smith, 1990; Gelfand et al., 1990; Smith and Roberts, 1993). The full set of conditional distributions is $\mathrm{pr}(f_i \mid f_{-i}, Y)$, $i = 1, \ldots, n$. Using the conditional probability formula, we have:

$$\begin{aligned} \mathrm{pr}(f_i \mid f_{-i}, Y) &\propto \mathrm{pr}(f_i \mid f_{-i})\, \mathrm{pr}(Y \mid f_i, f_{-i}) \\ &\propto \mathrm{pr}(f_i \mid f_{i-1}, f_{i+1})\, \mathrm{pr}(Y_i \mid f_i) \end{aligned} \tag{9}$$
where $f_{-i} = \{f_j,\ j \ne i\}$, $Y$ is the vector of observations and $Y_i$ is the observation vector at $x_i$. The distribution of $(f_i - f_{i-1})/(f_{i+1} - f_{i-1})$ is the beta distribution defined in equation (8). Therefore, the first factor on the right hand side of equation (9) can be generated from a rescaled beta distribution. The second factor is the likelihood:

$$\mathrm{pr}(Y_i \mid f_i) = \prod_{j=1}^{m_i} f_i^{y_{ij}} (1 - f_i)^{1 - y_{ij}},$$
where $y_{ij}$ (= 0 or 1) is the $j$th observed response and $m_i$ is the total number of observations at $x_i$. Since the maximum of the likelihood is reached when $f_i = \sum_j y_{ij} / m_i = \hat{f}_i$, which is the observed relative frequency of success, one way to generate $f_i$ is by using the rejection method (Devroye, 1986). That is, generate $f_i$ according to the first factor in equation (9) and accept the generated value with probability

$$\frac{\prod_{j=1}^{m_i} f_i^{y_{ij}} (1 - f_i)^{1 - y_{ij}}}{\prod_{j=1}^{m_i} \hat{f}_i^{\,y_{ij}} (1 - \hat{f}_i)^{1 - y_{ij}}} \tag{10}$$
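The sampling scheme just described can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names, the `max_tries` safeguard, and all numeric inputs are hypothetical. Each update proposes a value from the rescaled beta distribution in (8) and accepts it with the probability in (10), where the data at each distinct predictor value are summarised as `k` successes out of `m` observations.

```python
import random

def likelihood(f, k, m):
    """Likelihood kernel f^k (1 - f)^(m - k) for k successes in m observations."""
    return (f ** k) * ((1.0 - f) ** (m - k))

def update_fi(i, f, d, k, m, rng, max_tries=100_000):
    """One rejection-sampling update of f[i]; f carries f[0] = 0 and f[n+1] = 1."""
    lo, hi = f[i - 1], f[i + 1]
    f_hat = k / m if m > 0 else 0.5        # any value works when there are no data
    l_max = likelihood(f_hat, k, m)        # maximum of the likelihood
    for _ in range(max_tries):
        # Proposal from the prior conditional (8): a rescaled beta draw.
        cand = lo + (hi - lo) * rng.betavariate(d[i - 1], d[i])
        # Accept with the probability given in (10).
        if rng.random() <= likelihood(cand, k, m) / l_max:
            return cand
    raise RuntimeError("no candidate accepted; the rejection rate is too high")

def gibbs_sweep(f, d, ks, ms, rng):
    """Update f_1..f_n in turn, each conditional on its current neighbours."""
    for i in range(1, len(f) - 1):
        f[i] = update_fi(i, f, d, ks[i - 1], ms[i - 1], rng)
    return f

rng = random.Random(0)
d = [2.0, 1.0, 1.0, 1.0]             # hypothetical Dirichlet parameters d_1..d_4 (n = 3)
ks, ms = [0, 2, 5], [6, 5, 5]        # hypothetical successes and observations at x_1..x_3
f = [0.0, 0.25, 0.5, 0.75, 1.0]      # starting curve, including the fixed end points
for _ in range(200):
    f = gibbs_sweep(f, d, ks, ms, rng)
print([round(v, 3) for v in f])
```

Because every proposal lies between the two neighbouring values, the curve stays monotone throughout the run without any extra constraint handling.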
Gelfand and Kuo (1991) presented an auxiliary variable algorithm for the same binary regression problem. The auxiliary variables enable a Gibbs sampler that can generate posterior samples without using the rejection method. However, since the number of auxiliary variables is $n \times n$, our algorithm is guaranteed to be more efficient if the rejection rate is not too large (say, less than 50%). It is not difficult to see that the closer the prior is to the posterior, the smaller the rejection rate will be.

The method presented here is a special case of the semiparametric binary regression model of Newton et al. (1996), in which $F$ is modeled by using a modified Dirichlet process and $\eta$ is a linear function of multiple predictors. When there is only one predictor, this method is fully nonparametric. Our method provides a much simpler and more straightforward computing algorithm for a single-predictor problem, a common situation in environmental management. Another advantage of our method is the ease of identifying the prior model, as illustrated in the next section.
3 Examples

Predicting phosphorus or nitrogen concentrations and classifying lakes according to trophic level are common in lake modeling. Whether a lake-specific mechanistic model or a statistical model is used, it is always necessary to compare the model prediction to a set of criteria (e.g., Vollenweider, 1968) to classify a lake's trophic state. These criteria usually come from cross-sectional data. When a lake is classified as eutrophic, we typically do not know how much confidence we can put in this statement. Many sources of uncertainty may affect the final results. The criteria used may have originated from cross-sectional data that do not represent the particular lake in question, and the model used for predicting phosphorus concentration has an inherent uncertainty. Lake managers may find an assessed probability that the lake is eutrophic, or that fish will live in a lake, more useful. With such a probability, a decision maker can balance the costs of alternative management options against the risk of having an algal bloom or no fish in a lake. Using a probabilistic interpretation of risk could lead readily into a broader decision framework (Berger, 1985). However, it is difficult to obtain this probability using conventional mechanistic or empirical lake water quality models, since information on model error is rarely available to users.

In the following three examples, we emphasize the specification of the prior model, i.e., eliciting the prior expected shape and the precision parameter. Data used in this paper are available from the lead author upon request (send email to: [email protected]).
3.1 Chlorophyll a standard violation in North Carolina (USA) lakes

We used the nonparametric binary regression model to estimate the probability of algal blooms for lakes in North Carolina, USA. In the summer of 1981, chlorophyll a (Chla) and total phosphorus (TP) were measured in 63 lakes and reservoirs in North Carolina (Reckhow, 1993). Chla is an indicator of the amount of phytoplankton in the water body. The State of North Carolina's definition of an algal bloom is a Chla concentration higher than 40 µg/L. In most North Carolina lakes, TP is the nutrient responsible for algal growth and is used as a predictor of the Chla concentration. We used TP as the predictor variable and transformed the Chla response such that the response is 0 if Chla < 40 µg/L and 1 if Chla ≥ 40 µg/L. We assume that the probability ($P$) that Chla ≥ 40 µg/L is a function of the in-lake total phosphorus concentration:

$$P = f(TP) \tag{11}$$

It is reasonable to assume that $f$ is monotonically increasing because phosphorus is usually the limiting nutrient. Because $P$ is a probability, it is bounded between 0 and 1. We have no reason to prefer any specific model of $f$. Hence, the nonparametric model is used.
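The transformation of the Chla measurements into a binary response can be sketched as below. The helper names and the measurements are made up for illustration (they are not the 1981 survey data); the snippet also collapses the records to distinct TP values, giving the number of violations and the number of observations at each value, which is the form the likelihood in Section 2 works with.

```python
from collections import defaultdict

CHLA_STANDARD = 40.0  # µg/L, the threshold used to define the binary response

def violation_indicator(chla):
    """Return 1 if Chla >= 40 µg/L, else 0."""
    return 1 if chla >= CHLA_STANDARD else 0

def group_by_tp(records):
    """Collapse (TP, Chla) pairs to sorted distinct TP values, with the number
    of violations and the number of observations at each value."""
    counts = defaultdict(lambda: [0, 0])
    for tp, chla in records:
        counts[tp][0] += violation_indicator(chla)
        counts[tp][1] += 1
    xs = sorted(counts)
    return xs, [counts[x][0] for x in xs], [counts[x][1] for x in xs]

# Made-up measurements (TP in mg/L, Chla in µg/L).
records = [(0.02, 8.0), (0.02, 12.5), (0.04, 31.0), (0.11, 40.0), (0.11, 55.2)]
xs, ks, ms = group_by_tp(records)
print(xs, ks, ms)   # [0.02, 0.04, 0.11] [0, 0, 2] [2, 1, 2]
```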
We developed the Dirichlet process prior model from a regional cross-sectional data set obtained from the U.S. Environmental Protection Agency's National Eutrophication Survey (EPA-NES). The EPA-NES was undertaken in the early 1970s and involved a one-year trophic state survey of about 700 lakes nationwide. Since North Carolina is in the southeast of the U.S., only the southeast regional data set was used. The linear logistic regression model fitted to the southeast regional data was used as the expected shape of $f(TP)$ (Fig. 1). Since the precision parameter $\alpha$ can be regarded as the information we have in the prior model measured as a number of data points, the value of $\alpha$ is often selected to be close to the effective prior sample size (Ferguson, 1973). However, in this example, because the southeast regional data set included few North Carolina lakes, the precision parameter is believed to be much smaller than the sample size of the data set (over 400). Two values were chosen for the precision parameter, corresponding to low ($\alpha = 20$) and high ($\alpha = 60$) confidence in the prior expected shape. Since the total sample size of the North Carolina data set is 63, $\alpha = 20$ represents our belief that the prior model is only about one third as important as the North Carolina data, and $\alpha = 60$ means that we believe that the prior information is as important as the information in the North Carolina data set.

The posterior distribution of $f(TP)$ (Fig. 1) differs from the prior expected shape, depending on which prior precision parameter is used. A high $\alpha$ yielded a posterior model that is much closer to the prior than the posterior model developed from a low $\alpha$. However, we note that the difference appears mainly in the region with few observations (TP > 0.05 mg/L).

Fig. 1 about here

The linear logistic regression model fitted to the data from the 63 North Carolina lakes provides a contrast to the posterior model fitted using the low precision parameter value (Fig. 2). Because only four of the lakes surveyed had TP concentrations higher than 0.1 mg/L, the high probability of standard violation in the linear logistic model fitted to the North Carolina data is somewhat misleading. The 50 or so data points with TP concentrations less than 0.05 mg/L indicate that the probability of standard violation is very close to 0 when TP is in this range. Since the linear logistic model is constrained to be sigmoid, those same points also force the probability of standard violation for TP > 0.05 mg/L to be very close to 1, even though there are few data in that region. The posterior nonparametric Bayesian model uses the power of the prior model to overcome this problem. To illustrate this point, both models were refitted with the two points having the highest TP values removed. The linear logistic regression predictions are the same with or without these two data points, whereas under the nonparametric Bayesian posterior model the predicted probabilities of standard violation for these two points move closer to the prior model.

Fig. 2 about here
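The relative weights attached to the two precision values in this example follow from simple arithmetic. Writing the precision parameter as $\alpha$ (the symbol is an assumption here) and the North Carolina sample size as $N = 63$, the ratio of prior "observations" to site-specific observations is:

```latex
\[
\frac{\alpha}{N} = \frac{20}{63} \approx 0.32
\qquad\text{and}\qquad
\frac{\alpha}{N} = \frac{60}{63} \approx 0.95 ,
\]
```

which matches the interpretation of the low value as roughly one third of the weight of the data and the high value as roughly equal weight.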
3.2 Brook trout in Adirondack lakes

When cross-sectional data are not available, expert opinions are often solicited for decision-making purposes (Savage, 1971; Shafer, 1986). As is the case with cross-sectional data, expert opinions reflect collective behavior, and site-specific details may not be included. When site-specific data are available, one should combine them with expert opinions to yield a more reliable prediction. This example uses data taken from a study of fish response to lake acidification (N = 75) (Reckhow, 1987, 1988b; Reckhow et al., 1987; Lavine, 1994).

Adirondack lakes (in New York State, USA) historically supported brook trout populations. Experimental evidence indicates that acid precipitation is a likely cause of the current absence of brook trout in many Adirondack lakes. The probability that a lake continues to support brook trout was modeled as a logistic function of the lake's pH and calcium concentration by Reckhow (1987, 1988b). Prior opinions were elicited from an expert, and data were collected from a large number of lakes in the Adirondack region.

Reckhow (1987, 1988b) performed an empirical Bayesian analysis of a linear logistic regression using pH and calcium concentration as the predictors. The priors of the parameters were constructed from a fisheries expert's answers to questions like: "Given 100 lakes in the Adirondacks that have supported brook trout populations in the past, and if all 100 lakes have pH = 5.6 and calcium concentration = 130 µeq/L, what number do you now expect to continue to support the brook trout population?" This question was repeated twenty times with a variety of pH-calcium pairs to yield 20 predicted responses.

In this study, we use pH as the only predictor variable, since calcium concentration has been shown to play a less important role in predicting fish presence (Reckhow, 1987, 1988b). The prior model used in this study was a linear regression of the non-infinite logits of the expert's responses fit as a function of pH, as in the 0 model of Lavine (1994). In Lavine (1994), the sensitivity of the posterior model to the specification of prior distributions was studied. When two classes of prior models were applied to the same data set with a logistic model, the posterior predictions were constrained to resemble the shape of the cumulative distribution function of a logistic distribution for all prior models.

The data show that when pH is higher than 6, there appears to be no relationship between fish presence and pH. In fact, many water quality standards consider the "normal" pH level of a water body to be between 6 and 9. The logic would be that if the pH level is normal in a given lake, then the presence of brook trout should not be affected by the pH; other factors are controlling. Comparing the posterior models (Fig. 3) with the posterior predictions using the parametrically fitted models in Lavine (1994), we see that the nonparametric model fits the data better. The posterior models (Fig. 3) show that the probability of presence of brook trout is constant when pH is higher than 6. This constant (about 0.78) is closer to the recorded fraction of lakes supporting brook trout (0.65, or 30 out of 46 lakes with pH level above 5.5) than to the prior estimate of 1.0.

Fig. 3 about here
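One way to turn elicited answers of this kind into a prior expected shape is sketched below: convert each answer to a proportion, take logits where they are finite, and fit a least-squares line in pH. This is a rough illustration only. The elicitation answers are invented, the function names are hypothetical, and the sketch is not claimed to reproduce the prior used in the paper or in Lavine (1994).

```python
import math

def fit_logit_line(ph_values, expected_counts, n_lakes=100):
    """Least-squares line through the finite logits of elicited proportions."""
    pts = []
    for ph, count in zip(ph_values, expected_counts):
        p = count / n_lakes
        if 0.0 < p < 1.0:                      # the logit is infinite at 0 and 1
            pts.append((ph, math.log(p / (1.0 - p))))
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def prior_expected_shape(ph, intercept, slope):
    """Back-transform the fitted line to a probability of trout presence."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * ph)))

# Invented answers to "how many of 100 lakes would still support brook trout?"
ph = [4.5, 5.0, 5.5, 6.0, 6.5, 7.0]
counts = [5, 20, 50, 80, 95, 100]       # the answer 100 has an infinite logit and is skipped
b0, b1 = fit_logit_line(ph, counts)
print(round(b0, 2), round(b1, 2), round(prior_expected_shape(6.0, b0, b1), 2))
```

A prior built this way approaches 1 at high pH, which is the kind of prior estimate the posterior in this example is able to pull away from when the data disagree.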
3.3 Risk assessment of PCB exposure from consuming Lake Michigan (USA) fish

Human exposure to polychlorinated biphenyls (PCBs) from consuming Great Lakes fishes is a continuing health concern, especially for pregnant women and young children (Jacobson and Jacobson, 1993, 1996). In response, fish consumption advisories have been issued by various state and local agencies to caution the public about possible risks associated with eating contaminated fish. The state of Wisconsin (USA) issued an advisory for Lake Michigan fishes containing five consumption categories based on fish PCB concentrations. According to this advisory, fish can be eaten without restriction if concentrations are below 0.05 mg/kg; consumption should be restricted to no more than one meal per week if concentrations are between 0.05 and 0.20 mg/kg; when concentrations are between 0.20 and 1.00 mg/kg, consumption should be limited to no more than one meal per month; concentrations between 1.00 and 1.90 mg/kg result in a six-meal-per-year restriction; and people are advised not to eat any fish with PCB concentration greater than 1.90 mg/kg (Wisconsin DH & DNR, 1997).
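The five consumption categories can be written as a small lookup, which also makes explicit the binary response analysed in this example (whether a fish exceeds a given advisory level). The function names are hypothetical, and the treatment of values that fall exactly on a category boundary is an assumption, since the advisory text as quoted does not settle it.

```python
# Upper PCB concentration bounds (mg/kg) and the advice attached to each category.
ADVISORY_CATEGORIES = [
    (0.05, "unrestricted"),
    (0.20, "at most one meal per week"),
    (1.00, "at most one meal per month"),
    (1.90, "at most six meals per year"),
    (float("inf"), "do not eat"),
]

ADVISORY_LEVELS = [0.05, 0.20, 1.00, 1.90]   # the four category boundaries

def advisory_category(pcb_mg_per_kg):
    """Return the advice for a fish with the given PCB concentration.
    Assumption: a value exactly on a boundary falls in the stricter category."""
    for upper, advice in ADVISORY_CATEGORIES:
        if pcb_mg_per_kg < upper:
            return advice
    raise ValueError("PCB concentration must be a number")

def exceeds(pcb_mg_per_kg, level):
    """Binary response: 1 if the concentration exceeds the advisory level."""
    return 1 if pcb_mg_per_kg > level else 0

print(advisory_category(0.5), [exceeds(0.5, lv) for lv in ADVISORY_LEVELS])
```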
Since anglers cannot easily know the PCB concentration of their catch, the advisory translates these concentration-based consumption categories into fish size ranges for the important recreational species. In this example, we present a size-based probabilistic assessment of PCB exposure from consuming five Lake Michigan salmonids. PCB concentration varies highly among individual fish of the same size and species (Madenjian et al., 1994). We present this variability in terms of the probability that the concentration of a fish of a specific size exceeds a certain advisory level.

Data used in this analysis were collected by the Wisconsin and Michigan Departments of Natural Resources from 1984 to 1994 for five salmonid species: brown trout (Salmo trutta), chinook salmon (Oncorhynchus tshawytscha), coho salmon (Oncorhynchus kisutch), lake trout (Salvelinus namaycush), and rainbow trout (Oncorhynchus mykiss). All data are from skin-on fillets of individual fish, approximating the portion that people eat. The relationship between PCB concentration and fish size is apparent (Fig. 4). Chemical analysis details can be found in Stow (1995).

Fig. 4 about here

There is no information on the functional form of the relationship between the probability of PCB concentration exceeding advisory levels and fish size. In a previous study, Stow and Qian (1998) developed regression models of PCB concentration using fish size as the predictor. These models show that larger fish tend to have higher PCB concentrations. In other words, eating larger fish carries a higher risk of PCB exposure. This assessment is supported by conventional wisdom: if two fish live in the same habitat, the larger one is usually older and will therefore have accumulated more PCB than the smaller (or younger) one.

A vague prior model for this example was developed from the above two sources of information. First, we assume that the relationship is monotonic: the larger the fish, the higher the probability. Second, we assume that the expected shape of the prior model is linear (it takes the value 0.45 at a size of 15 cm and 0.55 at a size of 120 cm). The precision parameter is chosen to be 5, a very small number compared with the sample sizes of the data (from 220 to 452, Fig. 4). The linear relationship of the prior expected shape has no bearing on the posterior models (Fig. 5).
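Written out, the linear prior expected shape described above, with fish size $x$ in cm and the symbol $G_0$ assumed for the prior shape, is the line through the two stated points:

```latex
\[
G_0(x) = 0.45 + (0.55 - 0.45)\,\frac{x - 15}{120 - 15}
       = 0.45 + 0.10\,\frac{x - 15}{105},
\qquad 15 \le x \le 120 .
\]
```

With a precision of 5 against sample sizes of 220 to 452, the prior carries the weight of roughly $5/220 \approx 0.023$ down to $5/452 \approx 0.011$ of the data, which is consistent with its having essentially no influence on the posterior.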
Fig. 5 about here

The posteriors have regions of rapid change. These jumps may reflect dietary shifts of the fish. Madenjian et al. (1998) reported that small lake trout (< 40 cm) eat small alewife (Alosa pseudoharengus, which have an average PCB concentration of 0.2 mg/kg), intermediate-size lake trout (40 to 60 cm) eat alewife and rainbow smelt (Osmerus mordax, whose PCB concentrations ranged from 0.2 to 0.45 mg/kg), and large lake trout (≥ 60 cm) eat large alewife (with an average PCB concentration of 0.6 mg/kg). This behavior is captured in our posteriors.
4 Discussion

We have presented a nonparametric Bayesian regression model for binary response data. Compared to the usual parametric binary regression models, this Bayesian model is more flexible, is easy to implement, and has a low computing cost.

The North Carolina lakes example demonstrates the use of cross-sectional data for eliciting prior information. In this example, the nonparametric model revealed the weakness of the site-specific data, i.e., there are insufficient data for lakes with high TP concentrations. Therefore, prior information makes a significant contribution in assessing the risk of an algal bloom for lakes with high TP concentrations. In contrast, prior information plays an insignificant role in determining the posterior probability of fish presence for lakes with a pH level higher than 6.0. Results from the Adirondack fish example, which illustrates the use of expert opinion as the prior model, show that the probability of fish presence is positively associated with lake pH when the pH level is less than 6, and that there appears to be no relationship between fish presence probability and pH when the pH level is above 6. This result is absent from previous studies in which a linear logistic model is used. Both examples show that the nonparametric model is more flexible in capturing locally persistent patterns in the data.

In the PCB in fish example, a somewhat vague prior model elicited from previous analyses of the same data set was used. We compared the results from this binary response model with results from two regression models (Fig. 6) (Stow and Qian, 1998). The comparison indicates that the choice of statistical method is very important in risk assessment, since different statistical methods in this example led to different conclusions. Figure 6 shows that different models yield different estimated risks (probability of PCB concentration exceeding a certain level). In other words, different advisory category boundaries will result from the three models if the upper boundary of each advisory category is selected based on an acceptable risk. An alternative for assessing the risk of PCB exposure is to calculate the posterior distribution of the length at which a fish exceeds the standard. We will not pursue this since we do not have the probability distribution of fish size.

Fig. 6 about here

In a parametric regression model, the behavior of the regression function in one region is always closely related to the behavior in another region, even though the two regions may be far apart. This link between regions may or may not be appropriate. Nonparametric regression models are usually fitted based on local data; therefore the behavior of the regression model in one region is nearly independent of the behavior in another region.

Site-specific behavior can also be addressed using a non-Bayesian approach. For example, Zeger et al. (1988) introduced the mixed GLM, which explicitly models site-induced heterogeneity in regression parameters.

In the examples, the number of initial runs was taken to be 500,000, and samples were taken one in every 500 iterations thereafter. However, we found that 1,000 initial runs are sufficient and that samples could be taken every 50 iterations with no apparent serial correlation. The rejection rates were less than 10% in the North Carolina lakes example (n = 28) and the Adirondack fish example (n = 75). In the PCB in fish example the rejection rates were high (but less than 50%) for evaluating posterior probabilities that PCB concentrations would exceed 0.05 mg/kg. For the other three PCB advisory levels, rejection rates were less than 25%. The numbers of parameters estimated (n) were 45 for brown trout, 68 for chinook salmon, 44 for coho salmon, 57 for lake trout, and 52 for rainbow trout.
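The burn-in and thinning scheme described above, and the posterior summaries plotted in the figures (median with 5th and 95th percentiles), can be sketched with the Python standard library. The function names are hypothetical, and the integer sequence below merely stands in for the stored draws of one value of the curve.

```python
import statistics

def thin_chain(draws, burn_in, step):
    """Discard the first burn_in draws, then keep every step-th draw."""
    return draws[burn_in::step]

def summarize(samples):
    """Posterior median with 5th and 95th percentiles of one curve value."""
    cuts = statistics.quantiles(samples, n=20, method="inclusive")
    return {"p05": cuts[0], "median": statistics.median(samples), "p95": cuts[-1]}

# Stand-in for a stored chain: 2,000 "draws" that are just their own index.
draws = list(range(2000))
kept = thin_chain(draws, burn_in=1000, step=50)
print(len(kept), kept[0], kept[-1])   # 20 1000 1950
print(summarize(list(range(101))))    # {'p05': 5.0, 'median': 50, 'p95': 95.0}
```

Applying `summarize` to the retained draws at each distinct predictor value gives the three curves needed for a posterior plot.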
Acknowledgments

The authors thank K.H. Reckhow, P. Muller, R.L. Wolpert, E.C. Lamon, and P. Vaas for discussions and suggestions on an earlier draft. We greatly appreciate the helpful comments and suggestions from three referees and the editor.
References Albert, L.H. and Chib S. (1993). Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association 88:669-679. Antoniak, C.E. (1974). Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. Annals of Statistics, 2:1152-1174. Berger, J.O. (1985). Statistical Decision Theory and Bayesian Analysis. Second edition, Springer-Verlag, NY. Cleveland, W.S. (1993) Visualizing Data. Hobart Press, New Jersey. Czado, C. (1994). Bayesian inference of binary regression models with parametric link. Journal of Statistical Planning and Inferences, 41:121-140. Devroye, L. (1986). Non-Uniform Random Variate Generation. Springer-Verlag, New York. Ferguson, T.S. (1973). A Bayesian analysis of some nonparametric problems. Annals of Statistics, 1:209-230.
Gelfand, A.E., and Kuo, L., (1991). Nonparametric Bayesian bioassay including ordered polytomous response. Biometrika, 78:657-666. Gelfand, A.E., and Smith A.F.M., (1990). Sampling-based approaches to calculating 22
marginal densities. Journal of the American Statistical Association, 85:398-409. Gelfand, A.E., Hills S.E., Racine-Poon A., and Smith A.F.M. (1990). Illustration of Bayesian inference in normal data models using Gibbs sampling. Journal of the American Statistical Association, 85:972-985.
Geman, S. and Geman D. (1984). Stochastic relaxation, Gibbs distributions, and Bayesian restoration of images. IEEE Trans. on Pattern Analysis and Machine Intelligence, 6:721-741.
Hastie, T.J. and Tibshirani R.J. (1990). Generalized Additive Models, Chapman and Hall, London. Jacobson, J.L. and Jacobson, S.W. (1993). A 4-year followup study of children born to consumers of Lake Michigan sh. Journal of Great Lakes Research, 19:776-783. Jacobson, J.L. and Jacobson, S.W. (1996). Intellectual impairment in children exposed to polychlorinated biphenyls in utero. The New England Journal of Medicine, 335:783-789. Lavine, M. (1994). An approach to evaluating sensitivity in Bayesian regression analysis. Journal of Statistical Planning and Inference, 40:233-244. Madenjian, C.P., Carpenter, S.R. and Rand, P.S. (1994). Why are the PCB concentrations of salmonine individuals from the same lake so highly variable? Canadian 23
Journal of Fisheries and Aquatic Sciences, 51:800-807.
Madenjian, C.P., Hesselberg, R.J., Desorcie, T.J., Schmidt, L.J., Stedman. R.M., Begnoche, L.J., and Passino-Reader, D.R. (1998). Estimate of net trophic transfer eciency of PCBs to Lake Michigan lake trout from their prey. Environmental Science and Technology, 32:886-891.
McCullagh, P. and Nelder J.A. (1989). Generalized Linear Models (second edition). Chapman and Hall, London. Nelder, J.A. and Wedderburn R.W.M. (1972). Generalized linear models. Journal of the Royal Statistical Society, A, 135:370-384.
Newton, M.A., Czado C., and Chappell R. (1996). Bayesian inference for semiparametric binary regression. Journal of the American Statistical Association, 91:142153. Reckhow, K.H. (1987). Robust Bayes models of sh response to lake acidi cation. In: M.B. Beck (editor) Systems Analysis in Water Quality Management, Pergamon Press, Oxford, pp. 61-72. Reckhow, K.H. (1988a). Empirical models for trophic state in southeastern U.S. lakes and reservoirs. Water Resources Bulletin, 24:723-734. Reckhow, K.H. (1988b). A comparison of robust Bayes and classical estimators for 24
regional lake models of sh response to acidi cation. Water Resource Research, 24:1061-1068. Reckhow, K.H. (1993). A random coecient model for chlorophyll- nutrient relationships in lakes. Ecological Modelling, 70:35-50. Reckhow, K.H., Black R.W., Stockton T.B., Jr., Vogt J.D., and Wood J.G. (1987). Empirical models for sh response to lake acidi cation. Canadian Journal of Fisheries and Aquatic Sciences, 44:1432-1442.
Savage, L.J. (1971). Elicitation of personal probabilities and expectations. Journal of the American Statistical Association, 66:783-801.
Shafer, G. (1986). Savage revisited. Statistical Science, 1:463-485. Smith, A.F.M. and Roberts G.O. (1993). Bayesian computation via the Gibbs sampler and related Markov chain Monte Carlo methods. Journal of the Royal Statistical Society, B, 55:3-23.
Stow, C.A. (1995). Factors related to PCB concentrations in Lake Michigan salmonids. Environmental Science and Technology, 29:522-527. Stow, C.A. and Qian, S.S. (1998). A size-based probabilistic assessment of PCB exposure from Lake Michigan sh consumption. Environmental Science and Technology, 32:2325-2330. 25
Vollenweider, R.A. (1968). The scientific basis of lake and stream eutrophication, with particular reference to phosphorus and nitrogen as eutrophication factors. Technical Report DAS/DSI68.27, Organization for Economic Cooperation and Development, Paris, France.
Wilks, S.S. (1962). Mathematical Statistics. John Wiley and Sons, Inc., New York.
Wisconsin Division of Health and Wisconsin Department of Natural Resources (1997). Important Health Information for People Eating Fish from Wisconsin Waters. Publication No. FH824 97.
Zeger, S.L. and Karim, M.R. (1991). Generalized linear models with random effects: a Gibbs sampling approach. Journal of the American Statistical Association, 86:79-86.
Zeger, S.L., Liang, K.Y., and Albert, P.S. (1988). Models for longitudinal data: a generalized estimation equation approach. Biometrics, 44:1049-1060.
Biographical Sketch
Song Qian is an assistant professor of environmental sciences and resources at Portland State University. He received his Ph.D. and an M.S. from Duke University in environmental science and statistics, respectively. He also has an M.S. in environmental systems engineering from Nanjing University (China) and a B.S. from Tsinghua University in environmental engineering.
Michael Lavine is an associate professor of statistics and of environment. He has joint appointments at Duke University in statistics (Institute of Statistics and Decision Sciences, primary) and the environment (Nicholas School of the Environment, secondary). He received a Ph.D. in statistics from the University of Minnesota, an M.S. from Dartmouth College, and a B.S. from Beloit College (both in mathematics).
Craig Stow is a visiting assistant professor of aquatic sciences in the Nicholas School of the Environment at Duke University. He received a Ph.D. from Duke University in Environmental Modeling, an M.S. from Louisiana State University in Marine Sciences and a B.S. from Cornell University in Environmental Technology.
List of Figures
Figure 1. Posterior models of the North Carolina lakes example. The model in the lower panel was fitted with a precision parameter of 60, and the upper with a precision parameter of 20. The solid line is the posterior median, the dotted line is the prior expected shape, the short dashed line is the 5th percentile, and the long dashed line is the 95th percentile of the posterior probability of standard violation. Dots in the figure are the North Carolina data points.
Figure 2. Comparison of the linear logistic model and the Bayesian model. The solid line is the posterior median (precision parameter of 20); the dotted line is the prior expected shape; the short dashed line is the fitted linear logistic model; the long dashed line is the posterior median estimated with the two highest TP points removed. The box plots are the extrapolated predictions of the distributions of the probability of standard violation for the two removed data points using the Bayesian nonparametric model. The horizontal line inside the box represents the median, the box represents the inter-quartile range, and the whiskers are the 5th and 95th percentiles.
Figure 3. Posterior models of the Adirondack fish example. The model in the lower panel was fitted with a precision parameter of 10, and the upper with a precision parameter of 5. The solid line is the posterior median, the dotted line is the prior expected shape, the short dashed line is the 5th quantile, and the long dashed line is the 95th quantile of the posterior. Dots in the figure are the data points. There is no significant difference between the posterior models fitted using the high and low precision parameter values.
Figure 4. Log transformed PCB concentration versus size of rainbow trout (N = 220), lake trout (N = 396), coho salmon (N = 269), chinook salmon (N = 452), and brown trout (N = 220). The shaded lines are the Wisconsin advisory categories.
Figure 5. Probabilities of exceeding Wisconsin advisory categories (rows) for each of five species (columns) estimated using the Bayesian nonparametric binary regression model. The solid lines are the estimated median of the probabilities, the dotted lines are the 5th and the dashed lines are the 95th quantiles of the estimated probabilities.
Figure 6. Probabilities of exceeding 0.05 mg/kg (solid), 0.20 mg/kg (dotted), 1.00 mg/kg (fine dash) and 1.90 mg/kg (coarse dash), for each of five species estimated using the binary regression (labeled as Binary), nonparametric regression (Nonparametric) and linear regression (Parametric) models. Vertical lines indicate advisory category size boundaries established by the state of Wisconsin (USA). The line type of each vertical line corresponds to the line type of the concentration used to establish that advisory category. All coho are in the 12 meals/year advisory category.
Figure 1: [Plot omitted. Panels: Low, High. x-axis: TP (mg/L), ticks at 0.005, 0.050, 0.500. y-axis: Probability of standard violation, 0.0 to 1.0. Legend: median, prior shape, 5th quantile, 95th quantile.]
Figure 2: [Plot omitted. x-axis: TP (mg/L), ticks at 0.005, 0.050, 0.500. y-axis: Probability of standard violation, 0.0 to 1.0. Legend: posterior median; prior shape; linear logistic; posterior, points removed.]
Figure 3: [Plot omitted. Panels: Low, High. x-axis: pH, 4 to 8. y-axis: Probability of fish present, 0.0 to 1.0. Legend: median, prior shape, 5th quantile, 95th quantile.]
Figure 4: [Plot omitted. Panels: Rainbow, Lake, Coho, Chinook, Brown. x-axis: Length (cm), 20 to 100. y-axis: log PCB concentration (mg/kg), -1.5 to 1.5.]
Figure 5: [Plot omitted. Panel grid with rows 1.90, 1.00, 0.20, and 0.05 mg/kg and columns Brown, Chinook, Coho, Lake, Rainbow. x-axis: Length (cm), 20 to 100. y-axis: Probability, 0.0 to 1.0.]
Figure 6: [Plot omitted. Panels: Binary, Nonparametric, and Parametric for each of Rainbow, Lake, Coho, Chinook, and Brown. x-axis: Length (cm), 20 to 100. y-axis: Probability, 0.0 to 1.0.]