WATER RESOURCES RESEARCH, VOL. 44, W08412, doi:10.1029/2006WR005616, 2008

Bayesian deduction for redundancy detection in groundwater quality monitoring networks

Khalil Ammar,1 Abedalrazq Khalil,2 Mac McKee,3 and Jagath Kaluarachchi1

Received 11 October 2006; revised 13 February 2008; accepted 30 April 2008; published 8 August 2008.

[1] A new methodology for designing a network for monitoring ambient, long-term groundwater quality is presented in this paper. The methodology is based on a sparse Bayesian learning approach known as a relevance vector machine (RVM), which produces probabilistic predictions that quantify the uncertainty in both the data and the model parameters. A reliable and parsimonious network configuration that is pertinent to the physics of the case study, revealed through understanding of the information content of the available data, is sought through application of the RVM. The methodology has been employed to reduce redundancy in the network for monitoring nitrate (NO₃⁻) in the West Bank Palestinian National Authority aquifers to illustrate the potential for use of RVMs in optimal groundwater monitoring and to explore possible trade-offs between different monitoring objectives, e.g., monitoring cost versus uncertainty in groundwater. A sparse monitoring network configuration produced by the RVM-based model indicates that only 32% of the existing monitoring sites in the aquifer are sufficient to characterize the nitrate state. Proof of correctness and accuracy using rigorous statistical tests is presented.

Citation: Ammar, K., A. Khalil, M. McKee, and J. Kaluarachchi (2008), Bayesian deduction for redundancy detection in groundwater quality monitoring networks, Water Resour. Res., 44, W08412, doi:10.1029/2006WR005616.

1 Department of Civil and Environmental Engineering, Utah State University, Logan, Utah, USA.
2 Department of Earth and Environmental Engineering, Columbia University, New York, New York, USA.
3 Utah Water Research Laboratory, Utah State University, Logan, Utah, USA.

Copyright 2008 by the American Geophysical Union. 0043-1397/08/2006WR005616

1. Introduction

[2] The objective of a monitoring network is to gather information to be used for such purposes as characterization of ambient conditions, detection of the existence or location of undesirable conditions, or verification of compliance with regulations [Loaiciga et al., 1992]. Our objective is to test a new methodology for detecting spatial redundancy in a network for monitoring ambient, long-term groundwater quality. Specifically, our purpose is to identify and eliminate from an existing network those monitoring sites that can be considered redundant because they add minimal information. This is carried out in a manner that limits the uncertainty in groundwater quality detection that would result from the arbitrary elimination of monitoring wells. The question, therefore, is how to reduce the size of the existing monitoring network while maintaining efficient collection of information without oversampling, thereby saving cost and time associated with data collection and analysis. Accordingly, a methodology for optimizing the network should reduce the current monitoring network dimension by retaining the most relevant monitoring sites and discarding those that are least relevant. However, a reduction of the dimension of the monitoring network might imply an increase in the uncertainty of the groundwater quality condition being monitored. This trade-off between monitoring cost (in terms of the network dimension) and uncertainty is a concern in the design and operation of the monitoring network. The methodology presented here capitalizes on the main strength of relevance vector machines (RVMs), their ability to generate sparse models, to develop a sparse monitoring network from an already existing one. The method can be used to examine which sampling sites of an existing monitoring network should be retained and which redundant sites should be eliminated, while at the same time quantifying the change in uncertainty of the resulting estimates of groundwater quality.

[3] Several methods have been proposed to address the problem of reducing redundancy in existing monitoring networks. Mogheir and Singh [2002] applied a probabilistic framework using information theory for reducing the number of redundant wells and obtaining an optimal groundwater monitoring network design, to quantify the information (uncertainty) needs, and to provide information-based statistical measures to evaluate the efficiency of the monitoring network. This resulted in classifying the monitoring network into zones of high and low redundant information. This makes possible a reduction in the monitoring network density at high-information zones by eliminating redundant wells or expanding the existing monitoring network for collection of more information at low-information zones. A limitation of the approach of Mogheir and Singh [2002] is its reliance upon the maximum distance lag between wells to reduce the number of redundant wells within zones of high redundancy, which does not reflect the relative importance, in terms of pollutant concentration or impact on uncertainty, of the data obtained from each individual well within the zone. Nunes et al. [2004] proposed space-time models for optimizing a groundwater monitoring network for redundancy reduction. For spatial redundancy reduction they used variance reduction techniques, employing the variance of the estimation error as an indicator of which spatial distribution of the available sampling locations is best and selecting the combination that minimizes the variance. They used the sum of differences between time series to evaluate temporal redundancy reduction. Stations that show larger differences between them over time and that have the best spatial distribution are retained. They solved the optimization models using simulated annealing, adopting entropy to parameterize the simulated annealing algorithm. They found that the subset of stations that best reflects the spatial variability of the state variable includes the least redundant stations and maximizes the relevance of the data collected.

[4] Reed et al. [2000] analyzed spatial redundancies using global mass interpolation. This was done in a case study involving plume interpolation and a genetic algorithm at Hill Air Force Base in northern Utah. Later studies have involved larger, more complex examples. For example, Reed et al. [2001] analyzed spatial redundancy using a local redundancy measure by summing the squared deviations between local concentration estimates attained using data from all available sampling locations and the estimates based on a sampling plan at each location. Reed and Minsker [2004] used both global and local redundancy analysis and proposed the use of genetic algorithms for optimization of long-term groundwater monitoring network design. They combined variance methods and redundancy analysis metrics that utilized geostatistics in combination with a genetic algorithm to design a monitoring network to achieve multiple objectives.

[5] Recent advances in machine learning theory motivate our application of these advanced concepts in water resources management.
Asefa et al. [2004] used an approach from statistical learning theory, support vector machines (SVMs), to design a long-term groundwater head monitoring network in order to reduce spatial redundancy. The SVM method uses a uniquely solvable quadratic optimization problem that minimizes the bound on generalized risk rather than just the mean square error of differences between measured and predicted groundwater head values. The result of the optimization problem was a sparse groundwater monitoring network with the number and locations of monitoring wells defining the potentiometric surface. They showed that the SVM approach is unlike previous approaches where one defined the size of the long-term monitoring network a priori and then let the algorithms systematically search different realizations of the monitoring network in order to optimize an objective function. Another important SVM application by Asefa et al. [2005] was for an optimal groundwater monitoring network for detecting groundwater contamination. They used the SVM method to reproduce the behavior of Monte Carlo-based flow and transport models in the design of a groundwater contamination detection monitoring system and showed that the results obtained were identical with those produced from the physical model. The approach by Asefa et al. [2005] also gave very close estimates of reliabilities and further demonstrated that the results obtained from SVM modeling were better than those obtained from application of artificial neural networks (ANNs). These SVM models gave results that were identical to those obtained from traditional physical models. Other important applications of SVMs in water resources and hydrology are the work on parameter estimation of a conceptual rainfall-runoff model and soil moisture prediction [Gill et al., 2006] and embedding SVMs and ANNs into a genetic algorithm framework to replace numerical models for optimization of possible locations to install pumping wells [Yan and Minsker, 2006].

[6] The good performance demonstrated by learning machines in hydrologic applications in groundwater quality modeling [Khalil et al., 2005a], real-time management of reservoir releases [Khalil et al., 2005b], and modeling of chaotic hydrologic time series [Khalil et al., 2006] motivates the use of these machines in the reduction of redundancy in groundwater quality monitoring networks. The rationale for selecting the RVM modeling approach for this study over SVMs and ANNs is that many studies have shown that RVMs perform better than either SVMs or ANNs in many applications in terms of their sparsity and the accuracy of their predictions [e.g., Tipping, 2001; Khalil et al., 2005a]. Proponents of RVMs propose their use as forecasting tools because they (1) are simple to use; (2) produce sparse predictive models (i.e., reduce redundancy), thereby allowing better generalization performance and avoiding overfitting; (3) infer information contained in the data because of their Bayesian framework; (4) derive accurate probabilistic prediction models (unlike SVMs); and (5) allow computation of confidence intervals for a prediction that account for the combined uncertainties in both the model parameters and the data.

[7] The main contribution of this work is in introducing a new methodology for reducing the spatial redundancy in a groundwater quality monitoring network. This methodology employs a Bayesian framework in the form of the relevance vector machine that quantifies the collective uncertainty in both the data and model parameters, and it generates a model that is sparser than those produced by the SVM method that has been documented in many studies.

2. Case Study

[8] The study area used in this paper covers the West Bank, Palestinian territories, which has an area of approximately 5660 km². The total Palestinian population living in the West Bank is about 2.5 million according to projections made by the Palestinian Central Bureau of Statistics [1997] for the year 2005. Land uses in the West Bank are categorized as urban areas, rural areas, irrigated areas, rain-fed areas, and natural reserved areas (Figure 1). Most of the area of the West Bank consists of rain-fed lands; irrigated areas are limited to about 10,000 ha.

Figure 1. Land use map of the West Bank (between 34°57′E and 35°30′E, and between 31°N and 32°30′N), Palestinian National Authority.

[9] Groundwater is the main water source for all uses in the West Bank. The aquifer system consists of three main basins (see Figure 2). These are delineated according to water divide, both in terms of structure and hydraulic water divide, into the Eastern, Northeastern, and Western basins. This aquifer system is of the Albian-Quaternary geologic age and is composed of karstic and permeable limestone and dolomite. From a modeling perspective, the aquifers in each basin can be defined as the shallow aquifer, the upper aquifer, and the lower aquifer. The study area has many structural features, such as faulting and folding features, which are the main factors that affect the groundwater flow direction and, accordingly, the contaminant transport, and which increase the possibility of groundwater contamination, especially by direct infiltration through the fault. In addition, there is high uncertainty in interaction between aquifers and basins because of limited data.

3. Existing Groundwater Quality Monitoring Network

[10] The present water quality monitoring network in the West Bank consists of a total of 540 existing municipal and agricultural wells and springs; there are no monitoring wells specifically dedicated to sampling water quality. As shown in Figure 2, monitoring wells and springs, distributed according to groundwater basin, are clustered in many areas, especially the agricultural wells that were used as monitoring wells in locations close to irrigated areas. There is no documentation of the justification for selecting these particular sampling sites. The selection of those monitoring sites was probably based on subjective judgment, especially for those sampling sites that were instituted prior to the establishment of the Palestinian Water Authority in 1996. This implies a possibility of the existence of spatial redundancy in the present monitoring network. Sampling is conducted twice a year, in spring and fall. Sampling depths range from springs at ground surface (where the source aquifer outcrops, as in the shallow aquifer of the Northeastern Basin) to depths as much as 464 m below the ground surface (such as in the upper and lower aquifers of the Eastern and Western basins and in the lower aquifer of the Eastern Basin). Most of the shallow wells of 100–150 m depth are agricultural wells. These are generally located close to irrigated areas (e.g., agricultural areas near Jenin, Tulkarem, Qalqilya, Wadi Al Fari'a, and Jericho), while municipal wells are generally deeper. Springs are used for both agricultural and domestic purposes, with some dedicated for either agricultural or domestic use.

Figure 2. Groundwater quality monitoring site locations in the West Bank.

[11] Water quality sampling for chloride and electrical conductivity began in 1967, and sampling was extended in 1984 to cover many additional parameters. Data are available for a variety of water quality parameters, including descriptive determinants (temperature, dissolved oxygen, electrical conductivity, and total dissolved solids), major ions (calcium, potassium, magnesium, sodium, sulfate, nitrate, bicarbonate, and chloride), and microbes (fecal coliform). Given the constituents, the uses of available water, and the impact on water resources in the West Bank and following procedures proposed by Harmancioglu et al. [1999], the principal constituent of concern was determined to be nitrate (nitrate concentrations are given in this paper in units of mg/L of NO₃⁻, not N).

4. Data Preprocessing and Analysis

[12] Basic descriptive analysis of the maximum nitrate concentrations data set revealed that nitrate follows a lognormal distribution. The maximum nitrate concentration was transformed to follow a normal distribution using the log transformation. Nitrate concentration (NO₃⁻) trend analyses conducted in this study have shown continuous groundwater quality degradation in some areas of the West Bank. This result is consistent with previous studies [CH2MHILL, 2001] and indicates that levels of nitrate exceed Palestinian drinking water standards (i.e., maximum contaminant levels of 50 mg/L NO₃⁻) in various locations. These general trends can be categorized according to four main characteristics: (1) depth to water level, a clear negative correlation between nitrate concentration and depth to water level as noted in the lower nitrate concentrations in the deeper aquifers, with some concentrations below the background level (10 mg/L NO₃⁻), and in higher nitrate concentrations for springs compared to wells; (2) seasonal trend, higher nitrate concentration for the spring season associated with higher water levels in comparison to the fall season; (3) land use, higher nitrate concentrations in agricultural, rural, and urban areas where fertilizers and pesticides, leakage from septic tanks and leachate from landfills, and nitrate loading from urban sewage are a probable source of nitrate pollution in each land use category, respectively; and (4) aquifer type, higher nitrate concentrations in the shallow outcropping aquifers compared to deeper and confined aquifers.

Figure 3. RVM structure coupled with monitoring design structure and the associated parameters and hyperparameters. A high proportion of hyperparameters are driven to large values in the posterior distribution, and their corresponding weights are driven to zero, yielding a sparse model. For example, if α₂ peaks to infinity, then the corresponding parameter and the relevance vector encircled by the dotted line will be pruned.

[13] From the analyses of nitrate contamination of West Bank aquifers, the problem of groundwater quality management can be in part described as one of degradation of groundwater quality that will likely worsen and potentially threaten the already severely limited water supplies by preventing their use for most purposes. This suggests a need to build a reliable and cost-effective groundwater quality monitoring network that has minimal redundant monitoring sites and that can better detect contamination and provide information for protection and management of water resources. In other words, what is needed is to select from the existing monitoring network a relevant subset of monitoring sites (i.e., number and location) that allows accurate detection of nitrate concentration in a cost-effective way. To do this, we propose a Bayesian approach on the basis of use of the relevance vector machine.

5. Methodology

[14] The Bayesian framework of the RVM modeling approach is employed in this study to decrease the redundancy in the monitoring sites of an existing groundwater quality monitoring network. Figure 3 illustrates the RVM structure and associated parameters adapted to fit the monitoring design problem. As shown in Figure 3, the main input $\mathbf{x}_n$ to the relevance vector machine consists of the location of the existing monitoring sites according to the source aquifer in each basin. The model output, or target, $y_n$ represents the corresponding maximum NO₃⁻ concentrations at the existing monitoring sites. The data set consists of $N$ input-target pairs $\{\mathbf{x}_n, y_n\}_{n=1}^{N}$, where the targets $y_n$ are assumed to be independent. Therefore, the model prediction of $y_n$ for any input vector $\mathbf{x}_n$ can be given in the form $y_n = f(\mathbf{x}_n; \mathbf{w}) + \varepsilon_n$, where $\varepsilon_n$ is Gaussian noise with zero mean and noise variance $\sigma^2$. The corresponding likelihood is defined as

$$p(\mathbf{y} \mid \mathbf{w}, \sigma^2) = (2\pi\sigma^2)^{-N/2} \exp\left\{ -\frac{1}{2\sigma^2} \lVert \mathbf{y} - \boldsymbol{\Phi}\mathbf{w} \rVert^2 \right\}, \tag{1}$$

where $\mathbf{y} = (y_1 \ldots y_N)^T$ are the targets; $\mathbf{w} = (w_0 \ldots w_N)^T$ are the weights or relevance vector coefficients; and $\boldsymbol{\Phi}$ is the design matrix that contains the response of all basis functions to the inputs, with $\boldsymbol{\Phi} = [\boldsymbol{\phi}(\mathbf{x}_1), \boldsymbol{\phi}(\mathbf{x}_2), \ldots, \boldsymbol{\phi}(\mathbf{x}_N)]^T$, wherein $\boldsymbol{\phi}(\mathbf{x}_n) = [1, K(\mathbf{x}_n, \mathbf{x}_1), K(\mathbf{x}_n, \mathbf{x}_2), \ldots, K(\mathbf{x}_n, \mathbf{x}_N)]^T$. The kernel function $K(\mathbf{x}_n, \mathbf{x}_N)$ is used to introduce nonlinearity in the mapping function. The kernel function defines a set of nonlinear fixed basis functions $\psi_N(\mathbf{x}_n) = K(\mathbf{x}_n, \mathbf{x}_N)$, where the kernel function is centered on each of the $N$ training data points $\mathbf{x}$. Selection of a suitable kernel type that best suits the data is done at this important step (Figure 3). An important feature of the kernel is the kernel width, a smoothing parameter that is also called the window width (width of normal probability function). For the RVM, this is the key parameter for precision and model sparsity. For example, a Gaussian kernel (or radial basis function kernel) width is represented here by the standard deviation parameter $\sigma$ or simply the variance over the noise $\sigma^2$, as shown in the Gaussian kernel equation:

$$K(\mathbf{x}_i, \mathbf{x}_j) = \exp\left( -\frac{\lVert \mathbf{x}_i - \mathbf{x}_j \rVert^2}{\sigma^2} \right). \tag{2}$$
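As a rough illustration of equations (1) and (2), the design matrix can be assembled from the Gaussian kernel as sketched below. The function names, the three site coordinates, and the width value are hypothetical and chosen only for the example; NumPy is assumed to be available.

```python
import numpy as np

def gaussian_kernel(xi, xj, width):
    """Equation (2): K(xi, xj) = exp(-||xi - xj||^2 / width^2)."""
    diff = np.asarray(xi, dtype=float) - np.asarray(xj, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / width**2))

def design_matrix(X, width):
    """Stack the rows phi(x_n) = [1, K(x_n, x_1), ..., K(x_n, x_N)]."""
    X = np.asarray(X, dtype=float)
    # Pairwise squared Euclidean distances between all inputs.
    sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    K = np.exp(-sq_dist / width**2)
    # Leading column of ones carries the bias weight w_0.
    return np.hstack([np.ones((X.shape[0], 1)), K])

# Three hypothetical site coordinates, purely for illustration.
sites = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
Phi = design_matrix(sites, width=1.5)
print(Phi.shape)  # N rows, N + 1 columns
```

Each row of `Phi` corresponds to one monitoring site, and each column after the first corresponds to a kernel centered on one site, which is why pruning a weight amounts to dropping a site from the network.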

[15] The maximum likelihood estimation of $\mathbf{w}$ and $\sigma^2$ from (1) leads to overfitting. To avoid this, Tipping [2001] recommended imposing a zero-mean Gaussian prior distribution over the model weights that is governed by a hyperparameter ($\alpha$) associated with each weight (iteratively estimated from the data), moderating the prior strength. The individual hyperparameters control groups of weights and their associated basis functions $\boldsymbol{\Phi}(\mathbf{x})$, which are associated with each input dimension $\mathbf{x}$ (solid circles in Figure 3). The association of a hyperparameter with each weight is the key feature of the relevance vector machine that is responsible for its sparsity properties. This Gaussian prior distribution is given as

$$p(\mathbf{w} \mid \boldsymbol{\alpha}) = \prod_{i=0}^{N} \mathcal{N}\left(w_i \mid 0, \alpha_i^{-1}\right). \tag{3}$$

[16] On the basis of the previously defined prior and likelihood distributions, the posterior over all unknowns is defined using Bayes' rule:

$$P(\mathbf{w}, \boldsymbol{\alpha}, \sigma^2 \mid \mathbf{y}) = \frac{P(\mathbf{y} \mid \mathbf{w}, \boldsymbol{\alpha}, \sigma^2)\, P(\mathbf{w}, \boldsymbol{\alpha}, \sigma^2)}{P(\mathbf{y})}. \tag{4}$$

The posterior in (4) is intractable; an approximation is obtained by decomposing the posterior to $P(\mathbf{w}, \boldsymbol{\alpha}, \sigma^2 \mid \mathbf{y}) \approx P(\mathbf{w} \mid \mathbf{y}, \boldsymbol{\alpha}, \sigma^2)\, P(\boldsymbol{\alpha}, \sigma^2 \mid \mathbf{y})$. As a consequence, the posterior distribution over the weights becomes analytically solvable. The analytical posterior distribution over the weights is

$$P(\mathbf{w} \mid \mathbf{y}, \boldsymbol{\alpha}, \sigma^2) = \frac{P(\mathbf{y} \mid \mathbf{w}, \sigma^2)\, P(\mathbf{w} \mid \boldsymbol{\alpha})}{P(\mathbf{y} \mid \boldsymbol{\alpha}, \sigma^2)} = (2\pi)^{-(N+1)/2}\, \lvert\boldsymbol{\Sigma}\rvert^{-1/2} \exp\left\{ -\frac{1}{2} (\mathbf{w} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{w} - \boldsymbol{\mu}) \right\}, \tag{5}$$

with posterior covariance $\boldsymbol{\Sigma} = (\sigma^{-2} \boldsymbol{\Phi}^T \boldsymbol{\Phi} + \mathbf{A})^{-1}$, mean $\boldsymbol{\mu} = \sigma^{-2} \boldsymbol{\Sigma} \boldsymbol{\Phi}^T \mathbf{y}$, and $\mathbf{A} = \mathrm{diag}(\alpha_0, \alpha_1, \ldots, \alpha_N)$. The estimated value of the model weights is given by the mean of the posterior distribution (5), which is also the maximum a posteriori (MP) estimate of the weights. The MP estimate of the weights depends on the value of the hyperparameters $\boldsymbol{\alpha}$ and $\sigma^2$. The estimate of these two variables, $\boldsymbol{\alpha}$ and $\sigma^2$, is obtained by maximizing the marginal likelihood.

5.1. Model Training (Learning Process)

[17] The input-target pair data are split into training and testing parts. The RVM model is trained with the training data so that it can accurately predict maximum nitrate concentrations at previously unseen data in the testing process. Relevance vector learning is achieved by maximizing the marginal likelihood of $P(\boldsymbol{\alpha}, \sigma^2 \mid \mathbf{y}) \propto P(\mathbf{y} \mid \boldsymbol{\alpha}, \sigma^2)\, p(\boldsymbol{\alpha})\, p(\sigma^2)$ with respect to $\boldsymbol{\alpha}$ and $\sigma^2$ through integrating out the weights to obtain the marginal likelihood

$$P(\mathbf{y} \mid \boldsymbol{\alpha}, \sigma^2) = \int P(\mathbf{y} \mid \mathbf{w}, \sigma^2)\, P(\mathbf{w} \mid \boldsymbol{\alpha})\, d\mathbf{w} = (2\pi)^{-N/2}\, \left\lvert \sigma^2 \mathbf{I} + \boldsymbol{\Phi} \mathbf{A}^{-1} \boldsymbol{\Phi}^T \right\rvert^{-1/2} \exp\left\{ -\frac{1}{2} \mathbf{y}^T \left( \sigma^2 \mathbf{I} + \boldsymbol{\Phi} \mathbf{A}^{-1} \boldsymbol{\Phi}^T \right)^{-1} \mathbf{y} \right\}. \tag{6}$$

"Learning" becomes the search for the hyperparameter posterior mode. For uniform hyperpriors over $\log(\alpha)$ and $\log(\sigma)$, we only need to maximize $p(\mathbf{y} \mid \boldsymbol{\alpha}, \sigma^2)$ to find $\boldsymbol{\alpha}_{MP}$ and $\sigma^2_{MP}$. Maximizing the marginal likelihood is known as the "type II maximum likelihood" method.

[18] RVM training is done iteratively using the iteratively reweighted least squares method by maximizing the marginal likelihood using a broad prior over the hyperparameters. This allows the posterior probability distribution to concentrate at very large values of $\alpha$, causing convergence as $\alpha$ goes to infinity and as the weights concentrate around zero. This then removes from the existing monitoring network many redundant monitoring sites with low nitrate concentrations or sites that are less informative by assigning a value of zero to the weights and associated basis functions $\boldsymbol{\Phi}(\mathbf{x})$, and corresponding inputs, and considering them to be irrelevant (see the dashed lines in Figure 3). For example, if $\alpha_2$, as shown in Figure 3, peaks to infinity, then the corresponding parameter and vector that are encircled by the dashed lines will be pruned (as a redundant site). As a result, the RVM model includes only the most relevant monitoring sites in terms of high nitrate concentrations or monitoring sites that add valuable information (nonzero weights) in defining the monitoring network (solid circles in Figure 3).

5.2. Model Testing (Validation Process)

[19] Predictions are made on the basis of the posterior distribution over the weights, conditioned on maximizing values $\boldsymbol{\alpha}_{MP}$ and $\sigma^2_{MP}$ obtained at the convergence of the hyperparameter estimation procedure. Having learned from the training values the target $\mathbf{y}$, we now make a prediction for a new unseen target ($y_*$) (testing part) given new input data $\mathbf{x}_*$, where the posterior from the training process is considered the prior for the validation process. Therefore, we can compute the predictive distribution using the integral

$$p(y_* \mid \mathbf{y}, \boldsymbol{\alpha}_{MP}, \sigma^2_{MP}) = \int p(y_* \mid \mathbf{w}, \sigma^2_{MP})\, p(\mathbf{w} \mid \mathbf{y}, \boldsymbol{\alpha}_{MP}, \sigma^2_{MP})\, d\mathbf{w}. \tag{7}$$

This predictive distribution incorporates the uncertainty over the weights $\mathbf{w}$, taking all likely values, having seen $\mathbf{y}$, into account. Since both terms in the integrand are Gaussian, this gives

$$p(y_* \mid \mathbf{y}, \boldsymbol{\alpha}_{MP}, \sigma^2_{MP}) = \mathcal{N}\left(y_* \mid Y_*, \sigma_*^2\right), \tag{8}$$

with testing output ($Y_*$) given by

$$Y_* = \boldsymbol{\mu}^T \boldsymbol{\Phi}(\mathbf{x}_*). \tag{9}$$
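The training and prediction steps of sections 5.1 and 5.2 can be sketched in a few lines of NumPy. This is a minimal illustration, not the MATLAB implementation used in the study: the hyperparameter updates are the direct re-estimation rules of Tipping [2001] (an assumption here, since this paper does not list them), and the function names, pruning threshold, iteration count, and toy data are all hypothetical.

```python
import numpy as np

def posterior(Phi, y, alpha, sigma2):
    """Posterior covariance Sigma and mean mu of the weights (see equation (5))."""
    Sigma = np.linalg.inv(Phi.T @ Phi / sigma2 + np.diag(alpha))
    mu = Sigma @ Phi.T @ y / sigma2
    return Sigma, mu

def log_marginal_likelihood(Phi, y, alpha, sigma2):
    """Logarithm of equation (6), with C = sigma2 * I + Phi A^-1 Phi^T."""
    N = len(y)
    C = sigma2 * np.eye(N) + (Phi / alpha) @ Phi.T
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (N * np.log(2 * np.pi) + logdet + y @ np.linalg.solve(C, y))

def reestimate(Phi, y, alpha, sigma2):
    """One type II maximum likelihood update (rules assumed from Tipping [2001])."""
    Sigma, mu = posterior(Phi, y, alpha, sigma2)
    # gamma_i measures how well weight i is determined by the data.
    gamma = np.clip(1.0 - alpha * np.diag(Sigma), 1e-12, 1.0)
    new_alpha = gamma / np.maximum(mu**2, 1e-300)
    resid = y - Phi @ mu
    new_sigma2 = max(float(resid @ resid) / max(len(y) - gamma.sum(), 1e-12), 1e-12)
    return new_alpha, new_sigma2

def rvm_fit(Phi, y, n_iter=200, prune_at=1e6):
    """Iterate the updates and prune basis functions whose alpha grows very large."""
    keep = np.arange(Phi.shape[1])   # indices of basis functions still in the model
    alpha = np.ones(keep.size)
    sigma2 = max(float(np.var(y)), 1e-6)
    for _ in range(n_iter):
        alpha, sigma2 = reestimate(Phi[:, keep], y, alpha, sigma2)
        live = alpha < prune_at      # a very large alpha forces the weight to zero
        if live.any():
            keep, alpha = keep[live], alpha[live]
    Sigma, mu = posterior(Phi[:, keep], y, alpha, sigma2)
    return keep, mu, Sigma, sigma2

def predict_mean(phi_star, mu):
    """Equation (9); phi_star must hold only the retained basis functions."""
    return float(np.asarray(phi_star, dtype=float) @ mu)

# Toy problem: one informative column and one alternating-sign column.
x = np.arange(6) / 5.0
Phi = np.column_stack([x, np.tile([1.0, -1.0], 3)])
y = 3.0 * x + np.array([0.05, -0.05, 0.02, -0.02, 0.03, -0.03])
keep, mu, Sigma, sigma2 = rvm_fit(Phi, y)
print(keep, mu, sigma2)
```

The indices returned in `keep` play the role of the retained monitoring sites (the relevance vectors); columns that drop out correspond to sites judged redundant.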

5.3. Local Uncertainty

[20] The probabilistic output of the RVM is illustrated by the error bar or predictive variance $\hat{\sigma}^2$ that is obtained. The predictive variance (error bars) consists of two variance components: the estimated noise variance $\sigma^2_{MP}$ (maximum a posteriori) in the data and the variance due to uncertainty in the prediction of weights (uncertainty about the optimal value of the weights reflected by the posterior distribution (4)), represented by the second term of equation (10), $\boldsymbol{\Phi}^T \boldsymbol{\Sigma} \boldsymbol{\Phi}$, with $\boldsymbol{\Phi}$ and $\boldsymbol{\Sigma}$ the basis function and posterior covariance, respectively, as defined for equation (5) [Tipping, 2001]:

$$\hat{\sigma}^2 = \sigma^2_{MP} + \boldsymbol{\Phi}^T \boldsymbol{\Sigma} \boldsymbol{\Phi}. \tag{10}$$
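Equation (10) is a one-line computation once the posterior covariance is known. The sketch below uses made-up numbers for the covariance, the basis response at a new site, and the noise variance, and assumes NumPy.

```python
import numpy as np

def predictive_variance(phi_star, Sigma, sigma2_mp):
    """Equation (10): noise variance plus the weight-uncertainty term phi^T Sigma phi."""
    phi_star = np.asarray(phi_star, dtype=float)
    return float(sigma2_mp + phi_star @ Sigma @ phi_star)

Sigma = np.array([[0.5, 0.0], [0.0, 0.25]])  # hypothetical posterior covariance
phi_star = np.array([1.0, 2.0])              # hypothetical basis response at a new site
var_hat = predictive_variance(phi_star, Sigma, sigma2_mp=0.1)
print(var_hat)
```

The second term grows for inputs far from the retained relevance vectors, which is how the error bars reflect where the sparse network is less informative.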

[21] The local uncertainty of predicted maximum nitrate concentration is then quantified here on the basis of the error bars as the width of the confidence interval (CI) of a specified probability for the predicted maximum nitrate concentration (i.e., the response or output of the RVM) at each individual location of the existing monitoring sites (where, again, the location of a monitoring site is the input to the RVM):

$$\mathrm{CI} = \boldsymbol{\Phi}\mathbf{w} \pm t_{\nu,\alpha_c/2} \left( \hat{\sigma}^2 / n \right)^{1/2}, \tag{11}$$

where $\boldsymbol{\Phi}\mathbf{w}$ is the predicted maximum nitrate concentration (mg/L) (that is, the model output calculated on the basis of the relevance vectors (RVs)), $t_{\nu,\alpha_c/2}$ is the $t$ test statistic at $\nu = n - p$ degrees of freedom and significance level $\alpha_c$ ($\alpha_c = 0.05$ for the 95% confidence level), $n$ is the number of monitoring sites in the existing network configuration, and $p$ is the number of model parameters ($p = 2$ for weight and bias).

5.4. Model Reliability

[22] Reliability of the model is defined in terms of goodness-of-fit statistics, which also reflect the adequacy and significance of the predicted model. These key statistics are bias, root-mean-square error (RMSE), mean absolute error (MAE), index of agreement (IOA), coefficient of efficiency (COE), and squared correlation coefficient ($r^2$). The definition of each of these statistics is given as follows [see Legates and McCabe, 1999]:

$$\mathrm{Bias} = N^{-1} \sum_{t=1}^{N} (y_t - \hat{y}_t)$$

$$\mathrm{RMSE} = \sqrt{ N^{-1} \sum_{t=1}^{N} (y_t - \hat{y}_t)^2 }$$

$$\mathrm{MAE} = N^{-1} \sum_{t=1}^{N} \lvert y_t - \hat{y}_t \rvert$$

$$\mathrm{IOA} = 1 - \frac{ \sum_{t=1}^{N} \lvert y_t - \hat{y}_t \rvert }{ \sum_{t=1}^{N} \left( \lvert \hat{y}_t - E(y_t) \rvert + \lvert y_t - E(y_t) \rvert \right) }$$

$$\mathrm{COE} = 1 - \frac{ \sum_{t=1}^{N} (y_t - \hat{y}_t)^2 }{ \sum_{t=1}^{N} (y_t - E(y_t))^2 }$$

$$r^2 = \left[ \frac{ \sum_{t=1}^{N} (y_t - E(y_t)) (\hat{y}_t - E(\hat{y}_t)) }{ \sqrt{ \sum_{t=1}^{N} (y_t - E(y_t))^2 \, \sum_{t=1}^{N} (\hat{y}_t - E(\hat{y}_t))^2 } } \right]^2$$
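The confidence interval of equation (11) and the goodness-of-fit statistics of section 5.4 can be computed with the standard library alone, as in the sketch below. It uses the standard forms from Legates and McCabe [1999], with the observed mean in the IOA and COE denominators and the squared correlation for r², which is an interpretive assumption. The t statistic is passed in as an argument rather than looked up, and all sample values are invented for illustration.

```python
import math

def confidence_interval(prediction, var_hat, n, t_crit):
    """Equation (11): prediction +/- t_crit * sqrt(var_hat / n).

    t_crit is the t statistic for nu = n - p degrees of freedom; it is supplied
    by the caller so that this sketch needs no statistics library.
    """
    half_width = t_crit * math.sqrt(var_hat / n)
    return prediction - half_width, prediction + half_width

def goodness_of_fit(obs, pred):
    """Bias, RMSE, MAE, IOA, COE, and r^2 for paired observations and predictions."""
    n = len(obs)
    mean_obs = sum(obs) / n
    mean_pred = sum(pred) / n
    err = [o - p for o, p in zip(obs, pred)]
    abs_err = sum(abs(e) for e in err)
    sq_err = sum(e * e for e in err)
    sum_sq_obs = sum((o - mean_obs) ** 2 for o in obs)
    sum_sq_pred = sum((p - mean_pred) ** 2 for p in pred)
    cross = sum((o - mean_obs) * (p - mean_pred) for o, p in zip(obs, pred))
    spread = sum(abs(p - mean_obs) + abs(o - mean_obs) for o, p in zip(obs, pred))
    return {
        "bias": sum(err) / n,
        "rmse": math.sqrt(sq_err / n),
        "mae": abs_err / n,
        "ioa": 1.0 - abs_err / spread,
        "coe": 1.0 - sq_err / sum_sq_obs,
        "r2": cross ** 2 / (sum_sq_obs * sum_sq_pred),
    }

# Invented values, for illustration only.
stats = goodness_of_fit([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
ci = confidence_interval(prediction=10.0, var_hat=4.0, n=16, t_crit=2.0)
print(stats, ci)
```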

In these statistics, $N$ is the sample size, $y_t$ is the observed value, $\hat{y}_t$ is the predicted value, and $E(\cdot)$ denotes the sample mean.

5.5. Spatial Redundancy Analysis

[23] Specification of the spatial redundancy reduction level is achieved by conducting a trade-off analysis between cost of sampling, implicitly represented by the number of monitoring sites; uncertainty of predictions, as represented by the confidence interval of predictions; and model reliability, in terms of goodness-of-fit statistics. The kernel width ($\sigma$ in equation (2)) is the key to redundancy reduction (sparsity), implicitly reflected in the monitoring network context in terms of the accuracy of nitrate prediction. A smaller kernel width causes the bias to be very low because the decision of each point is local (less scatter in the data or repeated measurements), and a perfect fit of training data is easily attainable. However, this will increase the number of monitoring sites retained. In contrast, a large kernel width causes the bias to be high, and the number of resulting monitoring sites will be low (see section 7). The simplicity of the model is that it needs only one parameter, the Gaussian kernel width, to optimize the resulting relevance vectors.

5.6. Spatial Mapping

[24] Spatial mapping is used to better illustrate the RVM model results over the geographic domain of the groundwater aquifers and basins. It is important to mention that RVMs can predict a value at any spatial position and that the spatial interpolation discussed in this section is only for comparison of data and predictions. As long as both the data and predictions are interpolated using the same method, a valid comparison can be made. The methodology used for interpolating these results is the inverse distance weight (IDW) method of ArcGIS of the Environmental Systems Research Institute (ESRI). IDW uses the measured values surrounding the prediction location and assumes that each measured point has a local influence that diminishes with distance. Measured values closest to the prediction location therefore have more influence on the predicted value than those farther away.
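The idea behind inverse distance weighting can be sketched as follows. This is a generic textbook form, not the ArcGIS implementation; the power of 2 and the sample points are assumptions made for the example.

```python
import math

def idw(points, values, target, power=2.0):
    """Inverse distance weighted estimate at `target` from measured (x, y) points."""
    numerator = 0.0
    denominator = 0.0
    for (px, py), value in zip(points, values):
        distance = math.hypot(target[0] - px, target[1] - py)
        if distance == 0.0:
            return value              # target coincides with a measured point
        weight = 1.0 / distance**power
        numerator += weight * value
        denominator += weight
    return numerator / denominator

# Two hypothetical sites with nitrate values of 10 and 30 mg/L.
points = [(0.0, 0.0), (2.0, 0.0)]
values = [10.0, 30.0]
midpoint = idw(points, values, (1.0, 0.0))
near_first = idw(points, values, (0.5, 0.0))
print(midpoint, near_first)
```

The estimate at the midpoint is the plain average, while the estimate closer to the first site is pulled toward that site's value, which is the distance-decay behavior described above.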

6. Application of Relevance Vector Machines

[25] The sparse Bayesian RVM methodology was applied in the West Bank, Palestinian National Authority, as a case study to detect redundancy in the existing groundwater quality monitoring network for NO3. The RVM model used was obtained from the MATLAB implementation of Tipping [2001] (see http://www.research.microsoft.com/mlp/rvm/).

[26] The input and associated target data were split into training and testing sets, with the training set consisting of 78% of the total data; the remaining 22% was used for testing. The next step was selecting the kernel type that best suits the data. The Laplace kernel was adopted in this case study (see section 7.4). For more information about kernel types and usage, refer to work by Shawe-Taylor and Cristianini [2004]. The input data consisted of four inputs: (1) the spatial location of the sampling site in the x direction (x coordinate), (2) the spatial location of the sampling site in the y direction (y coordinate), (3) the source aquifer (shallow aquifer, upper aquifer, or lower aquifer), and (4) the groundwater basin of the site (Eastern Basin, Northeastern Basin, or Western Basin). The target data consisted of the maximum annual nitrate concentration. The model output consisted of the predicted maximum nitrate concentration at both the current sampling points (i.e., the monitoring wells and springs) and the remaining, unsampled points in the study area. Several groundwater monitoring network configurations were obtained, each by running the RVM model with a different kernel width. Each run provided a new network configuration with a new number of monitoring wells. For each configuration and well, an associated distribution of uncertainty in the forecast of maximum nitrate concentration was also provided by the RVM model.

Figure 4. (a) Observed maximum nitrate concentration in the shallow aquifer of the Northeastern Basin (between 35°12′E and 35°22′E, and between 32°N and 32°30′N). (b) Predicted maximum nitrate concentration in the shallow aquifer of the Northeastern Basin. (c) Observed maximum nitrate concentration in the upper aquifer of the Western Basin (northern part) (between 34°57′E and 35°12′E, and between 31°47′N and 32°27′N). (d) Predicted maximum nitrate concentration in the upper aquifer of the Western Basin (northern part).

Figure 5. Spatial distribution of standardized error as a percentage of MCL for a network monitoring configuration system consisting of 172 RVs: (a) shallow aquifer of Northeastern Basin and (b) upper aquifer of Western Basin (northern part).
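The data preparation described in section 6 (four inputs per site and a 78%/22% split) might be set up as in the following sketch. The paper does not state how the aquifer and basin labels were encoded or how the split was drawn, so the integer codes, the random shuffle, and the names `build_inputs` and `split_train_test` are all assumptions for illustration.

```python
import numpy as np

# Hypothetical integer codes for the two categorical inputs
AQUIFERS = {"shallow": 0, "upper": 1, "lower": 2}
BASINS = {"Eastern": 0, "Northeastern": 1, "Western": 2}

def build_inputs(x, y, aquifer, basin):
    """Stack x, y, aquifer code, and basin code into an (n_sites, 4) array."""
    a = np.array([AQUIFERS[k] for k in aquifer], dtype=float)
    b = np.array([BASINS[k] for k in basin], dtype=float)
    return np.column_stack([np.asarray(x, float), np.asarray(y, float), a, b])

def split_train_test(n_sites, train_fraction=0.78, seed=0):
    """Return shuffled index arrays for the training and testing sets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_sites)
    n_train = int(round(train_fraction * n_sites))
    return order[:n_train], order[n_train:]
```

The target vector (maximum annual nitrate concentration) would be indexed with the same two index arrays so that inputs and targets stay aligned.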

7. Results and Discussion

[27] In this section, we present and discuss the results obtained from running the RVM model, highlighting the features of the RVM that explain the results, its inference capability and strengths, and its limitations. As mentioned in section 5, a key feature of the RVM is its capability to capture the information content of the data. In this study, this is reflected in its ability to recognize the pattern of high and low maximum nitrate concentrations and to reproduce comparable contour values, as presented in Figure 4, which illustrates one of the monitoring network configurations produced by the RVM analyses. Figure 4 shows a monitoring network configuration consisting of a total of 172 monitoring sites, which represents about 40% of the training sample size, i.e., 32% of the existing monitoring sites in all aquifers and basins. The observed and predicted maximum nitrate concentrations are shown in Figure 4 for the shallow aquifer of the Northeastern Basin and the tip of the Eastern Basin (Figures 4a and 4b) and for the northern part of the upper aquifer of the Western Basin (Figures 4c and 4d). These maximum nitrate concentration contours were created using the ESRI ArcGIS inverse distance weight tool. Spatial mapping is used only to better illustrate the results obtained. To select the appropriate interpolation method for creating these contours, both the IDW method and the Kriging method were tested and compared in terms of prediction accuracy, i.e., by comparing the measured value with the predicted one. The IDW method gave better predictions at measured locations than the geostatistical Kriging method in this particular study, even though it has acknowledged deficiencies and limitations. However, it should be noted that in principle the RVM model could be used to provide forecasts of contaminant concentration at any point in the aquifer system. A dense grid of such predicted points could be produced and used to supplant any need for geographic information system (GIS) interpolation or to minimize the estimation bias when coupled with GIS. This means of illustration needs to be explored in the future.

7.1. Standardized Error

[28] To show the accuracy of the RVM model predictions in the example in section 7, the difference between observed and predicted maximum nitrate concentrations (residual error) was calculated for all sampling sites and was illustrated spatially using the IDW interpolation method. Example results for the shallow aquifer of the Northeastern Basin and the upper aquifer of the Western Basin (northern part) are shown in Figure 5. To make them more understandable for decision makers and planners, these errors are standardized in terms of the maximum contaminant level (MCL), 50 mg/L NO3. The standardized error ranged approximately from 0.3 to 50% of the MCL (±20 mg/L), with some extreme values of as much as 100% of the MCL (50 mg/L). Note that there is no clear trend or correlation between higher standardized error and higher maximum nitrate concentrations (greater than 100 mg/L), or between lower standardized error and low observed nitrate concentrations, as shown in Figure 4. The main factor that affects the accuracy of prediction is the number and locations of monitoring sites (i.e., the locations of the RVs), as will be discussed further in section 7.3. Reliability in terms of goodness of fit is discussed in sections 7.3 and 7.4.

Figure 6. Spatial distribution of local uncertainty based on a total of 172 monitoring sites for (a) the shallow aquifer of the Northeastern Basin, (b) the upper aquifer of the Eastern Basin (between 35°6′E and 35°30′E, and between 31°N and 32°19′N), and (c) the upper aquifer of the Western Basin (northern part).

Table 1. Selected Cases of RVM Models and Their Related Statistics

Statistic                  83 RVs              172 RVs             394 RVs
                           Training  Testing   Training  Testing   Training  Testing
Bias                       0.00      0.00      0.00      0.00      0.00      0.01
RMSE                       0.22      0.15      0.13      0.08      0.02      0.06
MAE                        0.18      0.11      0.11      0.06      0.02      0.05
IOA                        0.89      0.92      0.97      0.98      1.00      0.99
COE                        0.67      0.74      0.89      0.93      1.00      0.95
Correlation coefficient    0.83      0.86      0.95      0.97      1.00      0.98

7.2. Uncertainty Results

[29] The capability of the RVM model to capture information about uncertainty is illustrated by the plots of confidence intervals given in Figure 6. The local uncertainty was calculated on the basis of equation (11) using the predicted maximum nitrate concentration and the t test statistic (t = 1.96 for n = 540 and infinite degrees of freedom). Some of the resulting local uncertainties are illustrated in Figure 6 for the shallow aquifer of the Northeastern Basin and the upper aquifers of the Eastern and Western basins. As shown in Figure 6, narrower confidence intervals (i.e., smaller uncertainty in the forecast nitrate concentration) were obtained for predictions at the locations recommended by the model as monitoring sites. This indicates greater accuracy at those locations. Wider confidence intervals were observed at unsampled locations, especially in areas with high nitrate concentrations. This implies that the RVM modeling approach succeeded in reducing the local uncertainty at the locations of the relevance vectors, i.e., the desired monitoring sites. Note the consistency between the standardized error and the confidence interval width: for the same location, higher standardized error is associated with wider confidence intervals, and lower standardized error with narrower confidence intervals, as shown in Figures 5 and 6.

7.3. Results of Redundancy Analysis

[30] Table 1 presents the statistics of three selected RVM models having 83, 172, and 394 model-recommended monitoring sites (or relevance vectors).
These correspond to approximately 15, 30, and 70%, respectively, of the total number of existing monitoring sites. As shown in Table 1, the goodness-of-fit measures (RMSE, MAE, IOA, COE, and correlation coefficient) improve as the number of relevance vectors (monitoring sites) in the model increases, while the bias stays at or near zero in all cases. Further, the data in Table 1 imply that the performance of the RVM models begins to level off after a certain number of RVs has been reached.

[31] The performance of the RVM model is illustrated in the predictions shown in Figure 7. This performance reflects the model's capability of generalization. The best prediction performance was achieved by the RVM model based on 394 monitoring sites (RVs), good prediction performance was achieved with 172 monitoring sites, and the worst prediction performance was obtained with 83 monitoring sites. Note how the number and position of the monitoring wells affect the accuracy of the monitoring network. The greater the number of wells, the more accurate the model prediction and the smaller the local uncertainty. It is also important to note the locations of the desired monitoring sites (the stars in Figure 4), which are spatially distributed close to "hot spot" areas of high nitrate concentration. This indicates how the model captured the information content in the data and predicted the groundwater nitrate concentration as accurately as if all available sampling sites had been used. This could be considered a primary advantage of the model.

7.4. Results of Sensitivity Analysis

[32] A sensitivity analysis was conducted to test the sensitivity of RVM model performance to the selection of kernel parameters, i.e., kernel type and kernel width. The kernel type that best suits the structural properties of the data was selected by judging the performance of different kernel types (Gaussian, Laplace, spline, Cauchy, cubic, thin-plate spline, and bubble). Figure 8 presents the performance of the RVM model for the different kernel types. As shown in Figure 8, the best performance in terms of accuracy of prediction was given by the Laplace kernel, followed by the Cauchy kernel.

[33] In the context of monitoring, it is the selection of the kernel width that prescribes the desired quality of the resulting mapping function as well as the number of relevance vectors (i.e., the size of the monitoring network) used in the prediction of nitrate levels.
The kernel width was kept the same for the spring and fall maximum nitrate concentrations to test how the number and location of RVs change between the fall and spring scenarios. The results showed that, for the same kernel width, only minimal differences in the number and location of RVs were obtained. This is because the RVM is a data-driven machine that treats the spring and fall maximum nitrate data as completely new data sets and allocates the RVs according to the information content of each data set. As an example of the results obtained, for a kernel width of 0.7, the number of RVs for the maximum annual nitrate concentration was 394 out of 424 training data points, the number of RVs for the maximum spring nitrate concentration was 400 out of 412 training data points, and the number of RVs for the maximum fall nitrate concentration was 360 out of 379 training data points. These results are shown in Figure 9, where the green triangles (maximum spring nitrate concentration), the black circles (maximum fall nitrate concentration), and the red squares (maximum annual nitrate concentration) coincide in most places. To summarize, the kernel width sensitivity results showed that the optimal monitoring network is robust, in that the monitoring site locations do not change significantly between the spring and fall scenarios. This reflects the capability of the resulting monitoring network design to withstand possible changes in site conditions.
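To make the roles of kernel type and kernel width in section 7.4 concrete, the sketch below evaluates three of the kernels named above in their common textbook forms. The exact parameterization used by the MATLAB implementation may differ, and the function name `kernel_matrix` and the `kind` labels are hypothetical.

```python
import numpy as np

def kernel_matrix(a, b, width, kind="laplace"):
    """Kernel values between rows of a and rows of b.

    Assumed textbook forms, with r = ||a_i - b_j|| / width:
      gaussian: exp(-r^2), laplace: exp(-r), cauchy: 1 / (1 + r^2)
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    r = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2) / width
    if kind == "gaussian":
        return np.exp(-r ** 2)
    if kind == "laplace":
        return np.exp(-r)
    if kind == "cauchy":
        return 1.0 / (1.0 + r ** 2)
    raise ValueError(f"unknown kernel type: {kind}")
```

With a small width, each basis function decays quickly and influences only nearby sites, which is consistent with the larger number of relevance vectors reported for narrow kernels; a large width lets a single basis function cover many sites.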


Figure 7. Actual versus predicted nitrate concentration based on RVM models having (a) 83 RVs, (b) 172 RVs, and (c) 394 RVs.
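The agreement between actual and predicted values in Figure 7 is summarized in Table 1 by six goodness-of-fit statistics. The sketch below computes them under the definitions commonly associated with Legates and McCabe [1999]: IOA as Willmott's index of agreement and COE as the Nash-Sutcliffe coefficient of efficiency. These definitions, the sign convention for bias (predicted minus observed), and the function name `goodness_of_fit` are assumptions, since the formulas are not restated in this part of the paper.

```python
import numpy as np

def goodness_of_fit(observed, predicted):
    """Return bias, RMSE, MAE, IOA, COE, and Pearson r as a dict."""
    o = np.asarray(observed, dtype=float)
    p = np.asarray(predicted, dtype=float)
    err = p - o
    o_mean = o.mean()
    sse = np.sum(err ** 2)
    # Willmott's index of agreement (assumed definition of IOA)
    ioa = 1.0 - sse / np.sum((np.abs(p - o_mean) + np.abs(o - o_mean)) ** 2)
    # Nash-Sutcliffe coefficient of efficiency (assumed definition of COE)
    coe = 1.0 - sse / np.sum((o - o_mean) ** 2)
    return {
        "bias": err.mean(),
        "rmse": np.sqrt(np.mean(err ** 2)),
        "mae": np.mean(np.abs(err)),
        "ioa": ioa,
        "coe": coe,
        "r": np.corrcoef(o, p)[0, 1],
    }
```

Calling the function separately on the training and testing subsets gives one column pair of the kind reported in Table 1 for each network size.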


Figure 8. Kernel type selection based on RVM model performance.

Figure 9. Kernel width and sensitivity of RVM to maximum spring and maximum fall nitrate concentrations in the shallow aquifer of the Northeastern Basin.


Figure 10. Trade-offs between cost (as represented by number of monitoring wells) and uncertainty (represented by confidence interval width) and between cost and accuracy (as represented by coefficient of efficiency).
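The cost axis of Figure 10 counts retained monitoring wells. The same quantity can be expressed as a redundancy reduction, i.e., the fraction of existing sites pruned from the network. The short sketch below works this out for the three network sizes in Table 1, taking 540 as the number of existing sites (the n reported in section 7.2); the function name `redundancy_reduction` is hypothetical.

```python
def redundancy_reduction(n_existing, n_retained):
    """Fraction of existing monitoring sites removed from the network."""
    if n_existing <= 0 or not 0 <= n_retained <= n_existing:
        raise ValueError("need 0 <= n_retained <= n_existing and n_existing > 0")
    return (n_existing - n_retained) / n_existing

if __name__ == "__main__":
    for n_rv in (83, 172, 394):
        pct = 100.0 * redundancy_reduction(540, n_rv)
        print(f"{n_rv} retained sites: {pct:.1f}% of the network pruned")
```

Pairing each reduction level with the corresponding accuracy and confidence interval width gives one point on the trade-off surface.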

7.5. Specification of Redundancy-Level Reduction

[34] Selection of an appropriate monitoring network configuration requires assessment of the trade-off between the number and location of monitoring sites, the cost required to achieve a given accuracy, and the uncertainty in the prediction that results from the selected network. The trade-offs among these three objectives can be represented in a three-dimensional plot, as shown in Figure 10, where each axis represents one objective. The surface created by the intersection of the three axes represents the surface of possible solutions along the three decision axes (cost, uncertainty, and accuracy). The area outside this solution surface, i.e., the flat area, is not part of the solution because the intersection there is in two dimensions and does not take the accuracy dimension into account. As is clear from Figure 10, as the number of monitoring sites increases, both the accuracy and the certainty increase. Figure 10 also shows an example of selecting a monitoring strategy: if 80% accuracy and 13 mg/L uncertainty are desired, then 180 monitoring wells are needed. The lower and upper limits of the decision boundaries are represented by redundancy reduction percentages, i.e., the percentages of monitoring sites pruned from the existing monitoring network. This redundancy reduction is equal to 27% (or (540 − 394)/540) for the 394 monitoring site strategy and roughly 70% (or (540 − 172)/540 ≈ 68%) for the 172 monitoring site strategy. Trade-offs within these two limits can produce different monitoring configuration alternatives that can be judged in terms of their accuracy (as measured by the standardized error expressed as a fraction of the maximum contaminant level) and level of uncertainty.

8. RVM Model Limitations

[35] Every model has limitations that affect its performance. Since RVM basis functions are data-dependent, their required memory and computational effort scale with the square and cube, respectively, of the number of basis functions (M. E. Tipping and A. C. Faul, Fast marginal likelihood maximisation for sparse Bayesian models, in Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics, edited by C. M. Bishop and B. J. Frey, 2003, available at http://research.microsoft.com/conferences/AIStats2003/proceedings/papers.htm). In addition, the RVM model assumes Gaussian behavior in the data and parameters, which was not the case here; therefore, the data were transformed prior to using the RVM model. From a monitoring design perspective, the RVM algorithm does not allow the user to choose the number or location of sampling sites a priori. This limitation stems from the construction of the RVM model, although it can be partly circumvented by adjusting the kernel width parameter. Most important, in terms of monitoring design, the RVM model does not identify locations for new monitoring sites.

9. Conclusions

[36] The modeling results show that RVMs can provide an attractive, simple to use, and straightforward analytical forecasting model for application in groundwater quality monitoring network design in cases where pruning of the network might be desired. The generalization capability of RVMs and the sparse representation that they have shown in this study demonstrate good predictive performance and an ability to identify a reliable network configuration that is pertinent to the information contained in the available monitoring data. This is further reflected by the ability of the RVM approach to prune redundant sampling sites while identifying the most significant sampling locations in terms of their information content and maintaining a realistic sparsity level, that is, not too sparse. In addition, the probabilistic prediction capability of RVMs provided useful information about uncertainty, which the model captures inherently because of its Bayesian framework. The RVM method extracted the most relevant monitoring sites, which captured valuable information about hot spot areas of high nitrate concentration (i.e., relevance vectors are viewed as the fingerprints of the physical characteristics of the aquifer) on the basis of the number and location of existing monitoring sites and their related observed nitrate concentrations.

[37] The analysis presented here also reflects the potential of RVMs for capturing useful information for describing trade-offs in optimal groundwater monitoring network design. This is shown through the incorporation of a multiobjective dimension, illustrated here in the form of trade-offs among cost (e.g., number of monitoring sites), accuracy of the prediction, and uncertainty. This is a potentially important contribution to the decision making process in the management of groundwater quality and for planning, development, or protection of groundwater.

References

Asefa, T., M. Kemblowski, G. Urroz, M. McKee, and A. Khalil (2004), Support vectors-based groundwater head observation networks design, Water Resour. Res., 40, W11509, doi:10.1029/2004WR003304.


Asefa, T., M. Kemblowski, G. Urroz, and M. McKee (2005), Support vector machines (SVMs) for monitoring network design, Ground Water, 43, 413–422, doi:10.1111/j.1745-6584.2005.0050.x.
CH2MHILL (2001), Aquifer modeling, in Water Resources Program Phase III, Task 2, pp. 31–68, U.S. Agency for Int. Dev., Washington, D. C.
Gill, M. K., Y. H. Kaheil, A. Khalil, M. McKee, and L. Bastidas (2006), Multiobjective particle swarm optimization for parameter estimation in hydrology, Water Resour. Res., 42, W07417, doi:10.1029/2005WR004528.
Harmancioglu, N. B., O. Fistikoglu, S. D. Ozkul, V. P. Singh, and M. N. Alpaslan (1999), Water Quality Monitoring Network Design, Water Sci. Technol. Libr., vol. 33, 304 pp., Kluwer Acad., Dordrecht, Netherlands.
Khalil, A., M. N. Almasri, M. McKee, and J. J. Kaluarachchi (2005a), Applicability of statistical learning algorithms in groundwater quality modeling, Water Resour. Res., 41, W05010, doi:10.1029/2004WR003608.
Khalil, A., M. McKee, M. Kemblowski, and T. Asefa (2005b), Sparse Bayesian learning machine for real-time management of reservoir releases, Water Resour. Res., 41, W11401, doi:10.1029/2004WR003891.
Khalil, A., M. McKee, M. Kemblowski, T. Asefa, and L. Bastidas (2006), Multiobjective analysis of chaotic dynamic systems with sparse learning machines, Adv. Water Resour., 29(1), 72–88, doi:10.1016/j.advwatres.2005.05.011.
Legates, D. R., and G. J. McCabe Jr. (1999), Evaluating the use of "goodness-of-fit" measures in hydrologic and hydroclimatic model validation, Water Resour. Res., 35(1), 233–241, doi:10.1029/1998WR900018.
Loaiciga, H., R. J. Charbeneau, L. G. Everett, G. E. Fogg, B. F. Hobbs, and S. Rouhani (1992), Review of ground-water quality monitoring network design, J. Hydraul. Eng., 118(1), 11–37, doi:10.1061/(ASCE)0733-9429(1992)118:1(11).
Mogheir, Y., and V. P. Singh (2002), Application of information theory to groundwater quality monitoring networks, Water Resour. Manage., 16, 37–49, doi:10.1023/A:1015511811686.
Nunes, L. M., M. C. Cunha, and L. Ribeiro (2004), Groundwater monitoring network optimization with redundancy reduction, J. Water Resour. Plann. Manage., 130(1), 33–43.
Palestinian Central Bureau of Statistics (1997), Results of population census of December 1997 in the West Bank and Gaza Strip, Ramallah, West Bank.
Reed, P. M., and B. S. Minsker (2004), Striking the balance: Long-term groundwater monitoring design for conflicting objectives, J. Water Resour. Plann. Manage., 130(2), 140–149, doi:10.1061/(ASCE)0733-9496(2004)130:2(140).
Reed, P., B. Minsker, and A. J. Valocchi (2000), Cost-effective long-term groundwater monitoring design using a genetic algorithm and global mass interpolation, Water Resour. Res., 36(12), 3731–3741, doi:10.1029/2000WR900232.
Reed, P. M., B. S. Minsker, and D. E. Goldberg (2001), A multiobjective approach to cost effective long-term groundwater monitoring using an elitist nondominated sorted genetic algorithm with historical data, J. Hydroinf., 3(2), 71–89.
Shawe-Taylor, J., and N. Cristianini (2004), Kernel Methods for Pattern Analysis, 476 pp., Cambridge Univ. Press, Cambridge, U. K.
Tipping, M. E. (2001), Sparse Bayesian learning and the relevance vector machine, J. Mach. Learning Res., 1, 211–244, doi:10.1162/15324430152748236.
Yan, S., and B. Minsker (2006), Optimal groundwater remediation design using an Adaptive Neural Network Genetic Algorithm, Water Resour. Res., 42, W05407, doi:10.1029/2005WR004303.

 

K. Ammar and J. Kaluarachchi, Department of Civil and Environmental Engineering, Utah State University, Logan, UT 84341, USA. (khalil@cc.usu.edu)
A. Khalil, Department of Earth and Environmental Engineering, Columbia University, New York, NY 10027, USA.
M. McKee, Utah Water Research Laboratory, Utah State University, Logan, UT 84341, USA.
