WATER RESOURCES RESEARCH, VOL. 41, W11401, doi:10.1029/2004WR003891, 2005
Sparse Bayesian learning machine for real-time management of reservoir releases

Abedalrazq Khalil, Mac McKee, Mariush Kemblowski, and Tirusew Asefa
Department of Civil and Environmental Engineering and Utah Water Research Laboratory, Utah State University, Logan, Utah, USA

Received 11 December 2004; revised 5 May 2005; accepted 21 July 2005; published 1 November 2005.
[1] Water scarcity and uncertainties in forecasting future water availabilities present serious problems for basin-scale water management. These problems create a need for intelligent prediction models that learn and adapt to their environment in order to provide water managers with decision-relevant information related to the operation of river systems. This manuscript presents examples of state-of-the-art techniques for forecasting that combine excellent generalization properties and sparse representation within a Bayesian paradigm. The techniques are demonstrated as decision tools to enhance real-time water management. A relevance vector machine, which is a probabilistic model, has been used in an online fashion to provide confident forecasts given knowledge of some state and exogenous conditions. In practical applications, online algorithms should recognize changes in the input space and account for drift in system behavior. Support vector machines lend themselves particularly well to the detection of drift and hence to the initiation of adaptation in response to a recognized shift in system structure. The resulting model will normally have a structure and parameterization that suits the information content of the available data. The utility and practicality of the proposed approach are demonstrated with an application to a real case study involving real-time operation of a reservoir in a river basin in southern Utah.

Citation: Khalil, A., M. McKee, M. Kemblowski, and T. Asefa (2005), Sparse Bayesian learning machine for real-time management of reservoir releases, Water Resour. Res., 41, W11401, doi:10.1029/2004WR003891.
1. Introduction

[2] For millennia, water has been harnessed in support of the achievement of social goals. Nevertheless, it is evident that many efforts to utilize water have been inadequate or misdirected [National Research Council, 2001]. In the future, moreover, available water resources will be subjected to greater pressure in the face of increasing demands. Thus there is a growing need to manage water more intensively in order to achieve an increasingly diverse set of water-related social goals [Postel et al., 1996; Gleick, 1993]. Successful management of river basins will therefore require more systematic, comprehensive, and coordinated approaches, which in turn will need more, and better quality, information about the state of the water resources systems we manage.

[3] Conceptual or physically based models are of importance in the understanding of hydrologic processes and can sometimes be implemented to address water resources information needs. Yet physically based modeling approaches are limited by the multitude and complexity of the processes involved and by a paucity of data. Recent decades have seen a growing interest in data-driven modeling. Such approaches can evolve reliable forecasting models from measured historical data. Artificial neural networks (ANNs) are an example of such models that can capture the behavior of the underlying physical processes and lend
themselves to real-time water systems operation problems [Khalil et al., 2005a, 2005b, 2005c; Moradkhani et al., 2004; Hsu et al., 2002]. Learning machines are characterized by their fundamental ability to deduce models of system behavior from measured data. Without sacrificing accuracy, they may provide a potentially valuable method for reducing the cost of data collection and of modeling complex river basin systems in support of water management information needs [Velickov and Solomatine, 2000; ASCE Task Committee on the Application of ANNs in Hydrology, 2000a, 2000b].

[4] The objective of this paper is to provide a forecasting system that has the ability to recognize drift and to evolve, when needed, a new forecasting machine. In other words, this paper introduces an empirical modeling approach built upon observations of the input and output. This quantitative statistical system is developed to translate weather information and management inputs into reliable forecasts of system response under time-varying and adaptive operational decision policies. A Bayesian extension of learning machines is used to allow characterization of uncertainties in both the model parameters and the data. The formulated framework emerges from an inductive modeling procedure in which the model structure is identified parsimoniously, and it utilizes an efficient learning machine, called a relevance vector machine (RVM), that embodies these characteristics. Effective modeling of any dynamic behavior requires an adaptive paradigm. The required adaptive nature of the model is treated here as an integral dimension of a fully automated decision support system.
This is done to underscore the fact that, from the point of view of a water manager (e.g., a reservoir operator), effective forecasting of basin dynamic behavior will have to be able to detect concept drift and account for new trends in that behavior. For this purpose, unsupervised classification based on the concept of support vector machines (SVMs) is employed. The complexity of the model and the resulting model structure will have a parameterization that is appropriate for the information content of the available data.

[5] This approach should have wide application potential and benefits in water resources research and management [Khalil et al., 2005b, 2005c]. Given that learning static management rules, or neglecting to incorporate forecast uncertainty in the decision process, is not expected to improve system management [Yao and Georgakakos, 2001; Carpenter et al., 2001], the resulting adaptive model will assist decision makers with forecasts accompanied by uncertainty bounds.

[6] Finally, the devised paradigm is designed to function in a self-adaptive manner and should be able to identify and reflect new behavioral characteristics of the system, which, in a broader sense, might be interpreted in physically or operationally meaningful contexts.
2. Model Description

2.1. Background

[7] Suppose we are given a finite number $N$ of samples (training data) $Z_n = (\mathbf{x}_n, t_n)$, $n = 1, \ldots, N$. One can write the targets as a vector $\mathbf{t} = (t_1, \ldots, t_N)^T$, with $t_n \in \mathbb{R}$, and express it as the sum of an approximation vector $\mathbf{y} = (y(\mathbf{x}_1), \ldots, y(\mathbf{x}_N))^T$, where $\mathbf{x} \in \mathbb{R}^d$ is a $d$-dimensional input vector of stimuli, and a zero-mean random error (noise) vector $\mathbf{E} = (e_1, \ldots, e_N)^T$ with $e_n \sim N(0, \sigma^2)$. As a result, the general setting of a predictive model is considered a realization of the following function:

$$t_n = y(\mathbf{x}_n; \mathbf{W}) + e_n \qquad (1)$$

where $\mathbf{W}$ is the parameter vector. Suppose that $p(t_n \mid \mathbf{x}) \sim N(y(\mathbf{x}_n), \sigma^2)$, where this notation denotes a normal distribution over the target $t_n$ with mean $y(\mathbf{x}_n)$ and variance $\sigma^2$. The goal of learning is to estimate an unknown continuous real-valued function $y(\mathbf{x})$ that makes accurate predictions of $t$ for previously unseen values of $\mathbf{x}$. Commonly, in kernel-based machines, $y(\mathbf{x})$ is modeled as a linearly weighted sum of $M$ nonlinear fixed basis functions $\{\phi_1(\mathbf{x}), \ldots, \phi_M(\mathbf{x})\}$:

$$y(\mathbf{x}; \mathbf{W}) = \sum_{i=1}^{M} w_i \phi_i(\mathbf{x}) = \Phi \mathbf{W} \qquad (2)$$
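To make the basis expansion of equation (2) concrete, the following minimal Python sketch builds a design matrix and evaluates $y(\mathbf{x}; \mathbf{W}) = \Phi \mathbf{W}$, using the kernel-based specialization that equation (3) below will introduce (one Gaussian basis function per training point plus a bias term). The kernel choice and `width` value are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(X, xm, width=1.0):
    """k(x, x_m) = exp(-||x - x_m||^2 / (2 width^2)) for every row x of X."""
    return np.exp(-np.sum((X - xm) ** 2, axis=1) / (2.0 * width ** 2))

def design_matrix(X, width=1.0):
    """N x (N+1) design matrix Phi: a bias column of ones followed by one
    column of kernel evaluations per training point (see equation (3))."""
    N = X.shape[0]
    Phi = np.ones((N, N + 1))
    for m in range(N):
        Phi[:, m + 1] = gaussian_kernel(X, X[m], width)
    return Phi

# y(x; W) is then the linearly weighted sum Phi @ W of equation (2).
X = np.random.rand(20, 3)        # placeholder inputs (N = 20, d = 3)
W = np.zeros(X.shape[0] + 1)     # weights, to be inferred as in section 2.2
y = design_matrix(X) @ W
```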
[8] Tipping [2000] has detailed a Bayesian probabilistic approach for learning models of this form, where the training objective involves estimating appropriate values for the parameters (or weights) $\mathbf{W}$. In such models the parameters are generally estimated by minimizing a prespecified discrepancy measure (a process called "empirical risk" minimization). In addition to parameter estimation via empirical risk minimization, one also seeks an optimal selection of model complexity (i.e., the number of parameters).
[9] The purpose of learning is not only to find a function $y(\mathbf{x}; \mathbf{W})$ that can learn the training set, but also to predict future system states (generalization ability). According to Vapnik [1998], generalization from finite data is possible if, and only if, the estimator has limited capacity (i.e., complexity). Therefore a key feature of a learning machine is to enjoy both good generalization accuracy and a sparse formulation (i.e., the machine must not be overparameterized).

[10] Recently, Tipping [2000] introduced relevance vector machines (RVMs), which have already demonstrated performance comparable to traditional algorithms such as back propagation artificial neural networks (ANNs) and support vector machines (SVMs) [Khalil et al., 2005b, 2005c]. Relevance vector machines are inspired by the concept of automatic relevance determination (ARD) [MacKay, 1994; Neal, 1994]. ARD is implemented in problems where there are many input variables, and where potentially only some of the inputs are actually relevant to the prediction of the output variables [MacKay, 1994; Tipping, 2000]. The sparse Bayesian learning (SBL) framework depicted in ARD, and hence RVMs, is based on a combination of likelihood and prior knowledge. Likelihood describes the underlying processes in the training data. Prior models should be utilized to address model complexity and, as a result, to enable generalization to nontraining data [Suykens et al., 2003]. Tipping [2000, 2001] and Tipping and Faul [2003] argued that the most advantageous feature of RVMs lies in their remarkable generalization capabilities, because they generate the sparsest representation in terms of kernel functions, in addition to their ability to characterize uncertainty in the associated predictions.

2.2. Relevance Vector Machines

[11] The RVM model presented here is characterized by the available variables and has the formulation depicted in equation (1). Following equations (1) and (2), the likelihood of the complete data set can be written as

$$p(\mathbf{t} \mid \mathbf{W}, \sigma^2) = (2\pi\sigma^2)^{-N/2} \exp\left(-\frac{1}{2\sigma^2}\,\|\mathbf{t} - \Phi \mathbf{W}\|^2\right) \qquad (3)$$

where $\mathbf{t} = (t_1, \ldots, t_N)^T$, $\mathbf{W} = (w_0, \ldots, w_N)^T$, and

$$\Phi = \begin{bmatrix} 1 & k(\mathbf{x}_1, \mathbf{x}_1) & \cdots & k(\mathbf{x}_1, \mathbf{x}_N) \\ \vdots & \vdots & \ddots & \vdots \\ 1 & k(\mathbf{x}_N, \mathbf{x}_1) & \cdots & k(\mathbf{x}_N, \mathbf{x}_N) \end{bmatrix}$$

where $k(\mathbf{x}_i, \mathbf{x}_n)$ is a prespecified kernel function (e.g., a radial basis function). Without imposing hyperparameters on the weights $\mathbf{W}$, the maximum likelihood solution of equation (3) will suffer from severe overfitting. Therefore imposing additional constraints on the parameters $\mathbf{W}$, by adding a complexity penalty to the likelihood or the error function, is recommended. One way to constrain the weights is to parameterize them with higher-level hyperparameters so as to produce a smooth (or less complex) function. An explicit zero-mean Gaussian prior probability distribution over the weights $\mathbf{W}$, with diagonal covariance $A$, is proposed as follows:
$$p(\mathbf{W} \mid A) = \prod_{i=0}^{N} N\!\left(w_i \mid 0, \alpha_i^{-1}\right) \qquad (4)$$
with $A = \{\alpha_0, \ldots, \alpha_N\}$ a vector of $N + 1$ hyperparameters associated independently with each individual weight. The hyperpriors over $A$ and the noise variance $\sigma^2$ must be defined to complete the specification of the hierarchical prior. A plausible choice for a conjugate family of prior distributions over these parameters is a noninformative (improper) gamma distribution [Berger, 1985]. The gamma conjugate prior is chosen to obtain a closed-form expression for the posterior in a hierarchical Bayesian implementation.

[12] The evidence of the data will allow the posterior probability distribution to concentrate at very large values of $A$; the posterior probability of the associated weight is then concentrated at zero, and one can deem the corresponding inputs irrelevant [Tipping, 2001]. Strictly speaking, therefore, the relevance vectors can be thought of as a representation of the prototypical structure of the data [Tipping, 2000; Li et al., 2002]. A data set's prototypical structure is represented by the subset (i.e., the "relevance vectors") that exhibits the most essential features of the full data set; in other words, one might view relevance vectors as the fingerprints of the data set.

[13] Tipping [2000, 2001] proposed that, by using Bayes' rule, the posterior probability over all unknowns can be computed given the predefined noninformative prior distributions. Then, for new observations, the distribution used to carry out Bayesian inference and produce predictions is

$$p(t_{N+1} \mid \mathbf{t}) = \int p(t_{N+1} \mid \mathbf{W}, A, \sigma^2)\, p(\mathbf{W}, A, \sigma^2 \mid \mathbf{t})\, d\mathbf{W}\, dA\, d\sigma^2 \qquad (5)$$
where $t_{N+1}$ is the corresponding target of a new test point $\mathbf{x}_{N+1}$.

[14] Full analytical computation of this integral is intractable, so an approximation is essential. The posterior $p(\mathbf{W}, A, \sigma^2 \mid \mathbf{t}) = p(\mathbf{t} \mid \mathbf{W}, A, \sigma^2)\, p(\mathbf{W}, A, \sigma^2)/p(\mathbf{t})$ is also intractable because of its normalizing constant, $p(\mathbf{t}) = \int p(\mathbf{t} \mid \mathbf{W}, A, \sigma^2)\, p(\mathbf{W}, A, \sigma^2)\, d\mathbf{W}\, dA\, d\sigma^2$. With $p(\mathbf{W} \mid \mathbf{t}, A, \sigma^2) \propto p(\mathbf{t} \mid \mathbf{W}, \sigma^2)$, we decompose the posterior $p(\mathbf{W}, A, \sigma^2 \mid \mathbf{t})$ to facilitate the solution:

$$p(\mathbf{W}, A, \sigma^2 \mid \mathbf{t}) = p(\mathbf{W} \mid \mathbf{t}, A, \sigma^2)\, p(A, \sigma^2 \mid \mathbf{t}) \qquad (6)$$

where the posterior distribution of the weights is

$$p(\mathbf{W} \mid \mathbf{t}, A, \sigma^2) = \frac{p(\mathbf{t} \mid \mathbf{W}, \sigma^2)\, p(\mathbf{W} \mid A)}{p(\mathbf{t} \mid A, \sigma^2)} = (2\pi)^{-(N+1)/2}\, |\Sigma|^{-1/2} \exp\!\left[-\tfrac{1}{2}(\mathbf{W} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{W} - \boldsymbol{\mu})\right] \qquad (7)$$

The posterior covariance and mean are, respectively, $\Sigma = (\sigma^{-2} \Phi^T \Phi + A)^{-1}$, with $A = \mathrm{diag}(\alpha_0, \alpha_1, \ldots, \alpha_N)$, and $\boldsymbol{\mu} = \sigma^{-2} \Sigma \Phi^T \mathbf{t}$. For prediction purposes, suppose that the hyperparameter posterior $p(A, \sigma^2 \mid \mathbf{t})$ can be approximated by a Dirac delta function at its most probable values (i.e., the mode) $A_{MP}$, $\sigma^2_{MP}$, given that $p(\mathbf{t} \mid A, \sigma^2)\, p(A)\, p(\sigma^2)$ is peaked around its mode:

$$\int p(t_{N+1} \mid A, \sigma^2)\, \delta(A_{MP}, \sigma^2_{MP})\, dA\, d\sigma^2 \simeq \int p(t_{N+1} \mid A, \sigma^2)\, p(A, \sigma^2 \mid \mathbf{t})\, dA\, d\sigma^2 \qquad (8)$$

Markov chain Monte Carlo sampling could be used, without the need for the approximation in equation (8), if an analytical solution is not desired. Tipping [2001] argued that the evidence from several experiments suggests that this predictive approximation is very effective. As a consequence, learning becomes a search for the most probable hyperparameter posterior, i.e., the maximization of $p(A, \sigma^2 \mid \mathbf{t}) \propto p(\mathbf{t} \mid A, \sigma^2)\, p(A)\, p(\sigma^2)$ with respect to $A$ and $\sigma^2$. For uniform hyperpriors, one need only maximize the term $p(\mathbf{t} \mid A, \sigma^2)$, which is a convolution of two normal distributions, namely $p(\mathbf{t} \mid \mathbf{W}, \sigma^2)$ and $p(\mathbf{W} \mid A)$; thus the corresponding variances add as follows:

$$p(\mathbf{t} \mid A, \sigma^2) = \int p(\mathbf{t} \mid \mathbf{W}, \sigma^2)\, p(\mathbf{W} \mid A)\, d\mathbf{W} = (2\pi)^{-N/2}\, |S|^{-1/2} \exp\!\left(-\tfrac{1}{2}\, \mathbf{t}^T S^{-1} \mathbf{t}\right) \qquad (9)$$

where $S = \sigma^2 I + \Phi A^{-1} \Phi^T$. Maximization of this quantity is known as the type II maximum likelihood method [Berger, 1985; Wahba, 1985] or the "evidence for hyperparameter" [MacKay, 1992a]. The solution of equation (9) recovers not only the relevance of each vector but also the estimated noise on the target values. Hyperparameter estimation is conducted using either gradient ascent or expectation maximization; in this paper an iterative gradient ascent algorithm is utilized. For $A$, differentiating equation (9), equating to zero, and rearranging produces the result [MacKay, 1992b]:

$$\alpha_i^{new} = \frac{1 - \alpha_i \Sigma_{ii}}{\mu_i^2} \qquad (10)$$

where $\mu_i$ is the $i$th posterior mean weight, $\Sigma_{ii}$ is the $i$th diagonal element of the posterior weight covariance, and the quantity $\gamma_i \equiv 1 - \alpha_i \Sigma_{ii}$ is a measure of the degree to which the associated parameter $w_i$ is determined by the data. Likewise we obtain the noise variance:

$$\left(\sigma^2\right)^{new} = \frac{\|\mathbf{t} - \Phi \boldsymbol{\mu}\|^2}{N - \sum_i \gamma_i} \qquad (11)$$

where $N$ refers to the number of training data examples.

[15] Repeated updating of these values, along with the posterior $\boldsymbol{\mu}$ and $\Sigma$, is performed until stability is reached. The procedure starts by utilizing the full training data set; vectors are then pruned, as their $\alpha_i$ diverge (driving the corresponding weights to zero), while training proceeds.

[16] The predictive distribution for a new query $\mathbf{x}_{N+1}$ becomes

$$p(t_{N+1} \mid \mathbf{t}, A_{MP}, \sigma^2_{MP}) = \int p(t_{N+1} \mid \mathbf{W}, \sigma^2_{MP})\, p(\mathbf{W} \mid \mathbf{t}, A_{MP}, \sigma^2_{MP})\, d\mathbf{W} \qquad (12)$$

which is readily computed, giving

$$p(t_{N+1} \mid \mathbf{t}, A_{MP}, \sigma^2_{MP}) \sim N\!\left(t_{N+1} \mid y_{N+1}, \sigma^2_{N+1}\right) \qquad (13)$$

with $y_{N+1} = \boldsymbol{\mu}^T \Phi(\mathbf{x}_{N+1})$ and $\sigma^2_{N+1} = \sigma^2_{MP} + \Phi(\mathbf{x}_{N+1})^T\, \Sigma\, \Phi(\mathbf{x}_{N+1})$.
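For concreteness, the update cycle of equations (7), (10), and (11) condenses into a short routine. The following is a minimal, unoptimized Python sketch assuming the design matrix $\Phi$ of equation (3); the initial values, pruning threshold, and convergence test are illustrative choices, not the paper's settings.

```python
import numpy as np

def rvm_fit(Phi, t, max_iter=1000, prune_at=1e9, tol=1e-8):
    """Sparse Bayesian update cycle: posterior statistics (eq. 7), alpha
    re-estimation (eq. 10), noise-variance re-estimation (eq. 11); basis
    functions whose alpha diverges are pruned from the model."""
    N = len(t)
    keep = np.arange(Phi.shape[1])       # indices of surviving basis functions
    alpha = np.ones(len(keep))           # one hyperparameter per weight (eq. 4)
    sigma2 = 0.1 * np.var(t)             # illustrative initial noise variance
    for _ in range(max_iter):
        P = Phi[:, keep]
        Sigma = np.linalg.inv(P.T @ P / sigma2 + np.diag(alpha))   # eq. (7)
        mu = Sigma @ P.T @ t / sigma2
        gamma = 1.0 - alpha * np.diag(Sigma)     # degree each w_i is determined
        alpha_new = gamma / mu ** 2              # eq. (10)
        sigma2 = np.sum((t - P @ mu) ** 2) / (N - gamma.sum())     # eq. (11)
        converged = np.allclose(alpha_new, alpha, rtol=tol)
        alpha = alpha_new
        mask = alpha < prune_at                  # alpha -> inf => weight -> 0
        keep, alpha = keep[mask], alpha[mask]
        if converged:
            break
    P = Phi[:, keep]                             # final posterior over survivors
    Sigma = np.linalg.inv(P.T @ P / sigma2 + np.diag(alpha))
    mu = Sigma @ P.T @ t / sigma2
    return keep, mu, Sigma, sigma2               # relevance vectors + posterior
```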
Figure 1. Optimization of model complexity. (a-c) Interpolation of y = sinc(x) at different Gaussian parameters. (d and e) Number of relevance vectors and marginal likelihood as a function of kernel parameter.

For more details about RVM specifications, interested readers are referred to Tipping [2000, 2001].

2.3. Model Selection

[17] One reason for seeking a solution using a Bayesian approach is that the parameters are evident from the data; nevertheless, the parameters within the kernel specification must be selected to improve the marginal likelihood [Quinonero-Candela and Hansen, 2002]. Both the kernel type and the kernel parameters are influential in defining the design matrix $\Phi$ and in specifying the trade-off between training accuracy and model complexity, a nontrivial aspect of model development illustrated in Figure 1. In general, a more complex machine will learn a training set better than a simpler one because of its higher flexibility (leading to overfitting; Figures 1d and 1e); however, the simpler model may actually be superior in the sense that it generalizes better to new samples. Therefore, to obtain an optimal level of performance, a considerable number of design choices with respect to the RVM kernel must be made.

[18] Cross validation techniques have been used extensively in the literature to explore the trade-off between model complexity and performance. However, they are expensive in terms of computation and data [Law and Kwok, 2001; Moradkhani et al., 2004; Khalil et al., 2005b].

[19] A Bayesian framework has been proposed to achieve model selection goals; however, its application presents considerable difficulties and can be prohibitively slow [MacKay, 1992a, 1992b; Tipping, 2001]. Moreover, there is a notable benefit in considering the case of multiple input-scale parameters, for which traditional cross validation is not an option and considerable performance gains may be achieved [Tipping, 2001]. In this manuscript, the basis function for input $\mathbf{x} \in \mathbb{R}^d$ with corresponding scale parameters $H = (h_1, \ldots, h_d)$ takes the form

$$\Phi(\mathbf{x}) = F\!\left(\sqrt{h_1}\, x_1,\ \sqrt{h_2}\, x_2,\ \ldots,\ \sqrt{h_d}\, x_d\right) \qquad (14)$$
Thus the objective of model selection is to find the optimal kernel parameter set.

[20] It is important to select a model that accounts for the input-space kernel parameters, $h$. A very narrow kernel width results in a diagonal covariance (revealing no structure at all in the data), while a large width implies isotropic noise [Williams, 1999; Yoshua et al., 2003]. The approach adopted for this level of inference is to rank different models by $p(H \mid Z)$, where $Z$ is the training data set and $H$ defines the space of all possible states of nature for the model (e.g., the kernel parameters).

[21] Owing to the assumption of noninformative prior probabilities $p(H)$ for all models, different models are compared and ranked according to the maximal evidence $p(Z \mid H)$, while assuming a Gaussian approximation for $p(Z \mid A_{MP}, \sigma^2_{MP}, H)$ at $A_{MP}$ and $\sigma^2_{MP}$ [Suykens et al., 2002; Van Gestel et al., 2001]. Equivalently, and more straightforwardly, to find the optimal set of kernel parameters $H$, one maximizes the log of the marginal likelihood presented in equation (9) together with the priors over $A$ and $\sigma^2$:

$$p(Z \mid A_{MP}, \sigma^2_{MP}, H) \propto \log p(\mathbf{t} \mid A, \sigma^2, H) + \sum_{i=1}^{N} \log p(\alpha_i) + \log p(\sigma^2) \qquad (15)$$
Figure 2. Novelty detection: samples outside the boundary are viewed as novel.
[22] Since the terms $\sum_{i=1}^{N} \log p(\alpha_i)$ and $\log p(\sigma^2)$ correspond to noninformative priors, the maximization problem requires only the maximization of $\log p(\mathbf{t} \mid A, \sigma^2, H)$, which is equivalent to $-\tfrac{1}{2}\left[N \log 2\pi + \log |S| + \mathbf{t}^T S^{-1} \mathbf{t}\right]$, where $S = \sigma^2 I + \Phi A^{-1} \Phi^T$ and $H$ is responsible for constructing $\Phi$ as seen in equation (14).

[23] Since there is no definitive method for optimizing $H$, a plausible mechanism has been adopted in this manuscript: maximizing the marginal likelihood using adaptive simulated annealing (ASA), a recently developed global optimization algorithm. It is not the objective of this paper to explain different optimization algorithms. The motivation to use ASA is illustrated in the results of Ingber and Rosen [1992], who concluded that ASA strongly outperformed genetic algorithms (GAs) on a set of standard benchmark optimization test functions. The ASA algorithm was developed by Ingber [1989, 1993] to overcome limitations of simulated annealing (SA) for multidimensional optimization problems. Unlike the purely random SA and GA methods, ASA adaptively guides each parameter toward the most promising solution set by establishing an annealing schedule for each parameter. So, in our approach the admissible model is selected by maximizing the marginal likelihood over the training data set using ASA.
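As an illustration of this model-selection loop, the sketch below maximizes the marginal likelihood of equation (9) over the kernel scales $H$ of equation (14), with SciPy's `dual_annealing` standing in for the ASA package (an assumption made for portability). Holding $\alpha$ and $\sigma^2$ fixed inside the objective is a further simplification for brevity; the full scheme re-estimates them with the updates of section 2.2.

```python
import numpy as np
from scipy.optimize import dual_annealing

def neg_log_evidence(h, X, t, alpha=1.0, sigma2=0.01):
    """-log p(t | A, sigma^2, H) from equation (9), for kernel scales h
    (one per input dimension); alpha and sigma2 are held fixed here."""
    Xs = X * np.sqrt(h)                           # eq. (14): scale each input
    d2 = np.sum((Xs[:, None, :] - Xs[None, :, :]) ** 2, axis=-1)
    Phi = np.exp(-0.5 * d2)                       # Gaussian kernel matrix
    N = len(t)
    S = sigma2 * np.eye(N) + Phi @ Phi.T / alpha  # S = sigma^2 I + Phi A^-1 Phi^T
    _, logdet = np.linalg.slogdet(S)
    return 0.5 * (N * np.log(2 * np.pi) + logdet + t @ np.linalg.solve(S, t))

# Maximize the evidence by minimizing its negative over the d kernel scales;
# X and t below are placeholders for the training inputs and targets.
X, t = np.random.rand(50, 3), np.random.rand(50)
res = dual_annealing(neg_log_evidence, bounds=[(1e-3, 1e3)] * X.shape[1],
                     args=(X, t), maxiter=200)
h_opt = res.x                                     # selected scale parameters H
```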
3. Novelty Detection

[24] Learning machine-based predictions depend on encapsulating the hidden contexts in a set of parameters for the sake of successfully predicting a target concept. Changes in the target concept, the hidden context, or the underlying distributions lead to the necessity of revising the current model to account for the new trends in the data. Machines that adapt to dynamic input distributions, or, in other words, are able to detect "concept drift," are a potentially useful facility for adaptive management of river basins. Henceforth this process will be referred to as novelty detection. It is of paramount importance particularly for dynamic data sets, where the distribution of inputs is likely to change over time to a new regime. A typical example is the
nonstationary nature of weather prediction rules that vary with time. Blum [1998] argued that natural variation of the input space scenarios is strongly relevant to practice.

[25] Novelty (abnormality) detection can be used for condition monitoring and fault diagnosis, and it provides insightful information to the decision maker about new trends in the incoming data.

[26] Most of the previous work on this dilemma does not devise a functional algorithm that works on high-dimensional real-world problems [Rasmussen, 2001; Perreault et al., 2000a, 2000b]. Rasmussen [2001] proposed a Bayesian analysis of general linear models for locating change points (e.g., drifts), but it involves a certain amount of subjectivity and cannot be adopted in an online fashion.

[27] One could instead use statistical learning theory [Vapnik, 1995; Scholkopf et al., 2000], which never solves a problem more general than the one actually at hand. The following section presents the concepts embodied in SVMs needed to define an algorithm that functions as an online method and hence detects changes and noise in the new data [Blum, 1998]. It is here that support vector machines, which have received great attention and have been extensively used for pattern recognition, regression estimation, and solution of inverse problems [Scholkopf et al., 2000, 2001], can be effectively applied.

[28] Given a set of examples of input vectors $\{\mathbf{x}_n\}_{n=1}^{N}$, where $\mathbf{x} \in \mathbb{R}^d$, let $\Phi$ be a feature map $\mathbf{x} \to F$, i.e., a map into an inner product (Hilbert) space $F$, where the inner product in the image of $\Phi$ can be computed by evaluating some kernel [Boser et al., 1992; Vapnik, 1995; Scholkopf et al., 1999]. For more information we refer the reader to the literature concerning Reproducing Kernel Hilbert Spaces (RKHS) [Vapnik, 1998]. In the projected feature space, the strategy is to separate the data region by constructing a hyperplane that is maximally distant from the origin, with all data points lying on the opposite side from the origin (see Figure 2). In other words, the SVM training algorithm finds a decision surface with optimal division from the origin of the feature space. This mapping to the feature space corresponds to a highly nonlinear decision boundary in the input space. Accordingly, the returned Heaviside function $f$ will hold the value +1 in a region containing most of the patterns and 0 elsewhere [Scholkopf et al., 2001; Bennett and Campbell, 2000]. This geometric picture corresponds to the following constrained minimization problem (see Figure 2) (for a detailed description of SVMs, interested readers are referred to Asefa et al. [2004], Khadam and Kaluarachchi [2004], and Khalil et al. [2005b, 2005c]):

$$\min_{\mathbf{w}, \boldsymbol{\xi}, \rho} \; J_P(\mathbf{w}, \boldsymbol{\xi}) = \frac{1}{2}\, \mathbf{w}^T \mathbf{w} + \frac{1}{\nu N} \sum_{n} \xi_n - \rho, \quad \text{subject to} \quad \mathbf{w}^T \Phi(\mathbf{x}_n) \geq \rho - \xi_n, \;\; \xi_n \geq 0, \;\; n = 1, \ldots, N \qquad (16)$$
Given the illustration depicted in Figure 2 and equation (16), the decision hyperplane lies a distance $\rho / \|\mathbf{w}\|$ from the origin, where $\mathbf{w}$ is a normal vector to the hyperplane and $\rho$ its offset. By introducing the positive slack variables $\xi_n$, one allows some training error; it follows that $\sum_n \xi_n$ is an upper bound on the number of training errors. Appropriate choice of the constant $\nu$ allows trading off training error
versus model complexity. Therefore one can view $\nu$ as the trade-off term between maximizing the distance from the origin and containing most of the training data in the region created by the hyperplane; it also corresponds to the fraction of outliers in the training data set [Scholkopf et al., 2001; Bennett and Campbell, 2000].

[29] This results in an error-tolerant SVM and is responsible for its good generalization capability. Equation (16) is in the primal space, and its solution is obtained by solving the optimization problem in the dual space [Vapnik, 1998; Klinkenberg and Joachims, 2000].

[30] The Lagrangian form of equation (16) is introduced by using the multipliers $\gamma_n, \beta_n \geq 0$:

$$L(\mathbf{w}, \boldsymbol{\xi}, \rho, \boldsymbol{\gamma}, \boldsymbol{\beta}) = \frac{1}{2}\, \|\mathbf{w}\|^2 + \frac{1}{\nu N} \sum_n \xi_n - \rho - \sum_n \gamma_n \left[\mathbf{w} \cdot \Phi(\mathbf{x}_n) - \rho + \xi_n\right] - \sum_n \beta_n \xi_n \qquad (17)$$

After applying the optimality conditions to equation (17), one obtains the dual problem:

$$\min_{\boldsymbol{\gamma}} \; \frac{1}{2} \sum_{n,j} \gamma_n \gamma_j\, k(\mathbf{x}_n, \mathbf{x}_j), \quad \text{subject to} \quad 0 \leq \gamma_n \leq \frac{1}{\nu N}, \;\; \sum_n \gamma_n = 1 \qquad (18)$$
[31] The solution of equation (18) can be readily obtained by quadratic programming. The sample data with corresponding $\gamma_j \neq 0$ are called support vectors. The resulting decision function will be positive for most of the examples in the training set and negative for novel or outlier data:

$$f(\mathbf{x}) = \mathrm{sign}\!\left(\sum_j \gamma_j\, k(\mathbf{x}_j, \mathbf{x}) - \rho\right), \quad \text{with} \quad \rho = \sum_j \gamma_j\, k(\mathbf{x}_k, \mathbf{x}_j) \;\; \text{for a margin support vector } \mathbf{x}_k \qquad (19)$$
This algorithm will be combined with RVMs to provide a convenient framework for accounting for drifts associated with the system.
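Equations (16)-(19) are the one-class SVM formulation of Scholkopf et al. [2001]; as an illustration using a modern library (scikit-learn postdates this paper and is used here purely as an equivalent), the detector can be exercised as follows, with placeholder arrays and an assumed drift threshold:

```python
import numpy as np
from sklearn.svm import OneClassSVM

X_train = np.random.rand(500, 12)     # placeholder: inputs from past seasons
X_new = np.random.rand(48, 12)        # placeholder: most recent hourly inputs

# nu bounds the fraction of training points treated as outliers (equation (16));
# gamma sets the Gaussian kernel width.  Both values are illustrative.
detector = OneClassSVM(kernel="rbf", nu=0.05, gamma=0.1).fit(X_train)

flags = detector.predict(X_new)       # +1 inside the learned region, -1 novel
if np.mean(flags == -1) > 0.05:       # sustained novelty, not isolated noise
    print("concept drift suspected: trigger retraining of the forecast machine")
```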
4. Application for Real-Time Management

[32] A significant amount of research has focused on the application of expert systems to problems of reservoir management. Neural networks and fuzzy rule-based systems for inferring reservoir system operating rules have been employed for real-time operation and management [Labadie, 2004; Shrestha et al., 1996; Mousavi et al., 2004]. Intelligent reasoning systems, "expert systems," have been applied to real-time operation of reservoirs in a hybrid manner in which one uses inference systems along with rule-based techniques [Armijos et al., 1990]. Shepherd and Ortolano [1996] used a critique expert system approach that learns the operating plan and accommodates operators' individual styles. The common feature of the techniques applied for real-time simulation is the use of prior information and present conditions of the system [Armijos et al., 1990; Hajilal et al., 1998; Jan et al., 1999; Labadie, 2004]. A comprehensive review of additional reservoir system simulation models can be found in works by Yeh [1985], Wurbs [1993], and Simonovic [1992].

[33] In this manuscript, the applicability of the adopted methodologies is tested on the behavior of a dynamic system: the operations of the Piute Reservoir. Note
that this paradigm plays a crucial role in integrated reservoir operation and management, yet full management benefits can only be realized by using this adaptive, reliable forecasting system along with quantification of optimal measures and operational risk-based trade-offs [Yao and Georgakakos, 2001]. Moreover, for a reservoir operated on an on-demand basis, learning static management rules or neglecting to incorporate forecast uncertainty in the decision process is not expected to improve reservoir operations.

4.1. Description of the Study Area

[34] The Sevier River Basin in south central Utah is one of the state's major drainages. A closed river basin, it encompasses 12.5% of the state's total area. The Sevier River Basin has five subwatersheds and is divided into two major divisions, the upper and lower basins, for the administration of water rights. Average annual precipitation ranges from 6.4 to 13.0 inches in the valleys, and the growing season ranges from 60 to 178 days [Utah Board of Water Resources, 2001; Berger et al., 2002]. Most of the surface water runoff comes from snowmelt during the spring and early summer months. The primary use of water in the basin is for irrigation. The average annual amount of water diverted for cropland irrigation is 903,500 acre-feet. Of this amount, approximately 135,000 acre-feet are pumped from groundwater. The irrigation season in the basin generally extends from April to the end of October. About 40 percent of the diversions are return flows from upstream use [Berger et al., 2002]. (For a detailed description of the basin and much of the real-time database utilized in this research, refer to http://www.sevierriver.org.)

4.2. Reservoir Control Simulation

[35] The algorithm adopted in this paper is applied to the Piute Reservoir in the Sevier River Basin, wherein an extensive basin-wide automated system has been installed that records and stores data on an hourly basis to enable real-time information processing. Moreover, Internet-based communications and control systems are in place to allow managers to remotely manipulate all reservoir releases and most canal diversion gates at will.

[36] Piute Reservoir is the principal storage facility used in the upper basin to supply water to numerous canals that divert directly from the Sevier River. Real-time operations of the reservoir must face uncertainties in the form of year-to-year and seasonal variations in losses and gains on the river main stem [Khalil et al., 2005a]. These result in travel times from the reservoir to downstream canal diversion points that are uncertain and vary as a function of the quantity of flow in the river and antecedent flow conditions. To contend with this uncertainty, the Piute Reservoir operator would benefit from a tool that would help decide, on a near real-time basis, how much water to release to meet water orders from canal operators located downstream of the reservoir. In other words, a common requirement for managing a reservoir operated on an "on-demand" basis is the anticipation of the quantity of water that must be released while accounting for losses or gains along the river and changing travel times to each downstream canal diversion point.

[37] Operation of the Piute Reservoir constitutes a case study where fine resolution decisions have to be made to
meet the downstream demands. Reservoir releases and canal diversions in a highly instrumented and controlled river basin, such as the Sevier, could be made on an hourly (or more frequent) basis if sufficient detail is available in the information describing the present and desired future system state. For the operator of the Piute Reservoir, therefore, the desired output of the model is simply the hourly quantity of water that should be released from the reservoir that will be sufficient to meet the needs of nine irrigation canals that divert water from the river downstream of the reservoir. If too little water is released, it is likely that the lower canals will not receive enough water. If too much water is released, some might be spilled to the lower basin; water that is spilled is considered "lost" by the water users, who, in accordance with the complicated system of water rights on the Sevier, are entitled to it. Efficient management decisions about the operation of the reservoir, then, can result in reduction of water losses and improved deliveries to users.
4.3. Identification of Inputs

[38] Piute Reservoir operations are affected by interactions between many water cycle components, such as precipitation, snowmelt, evapotranspiration, infiltration, and groundwater recharge, and anthropogenic influences, such as irrigation diversions. Therefore the information made available to the model should include data that describe current and recent historical flow conditions, various climate indices in the basin, and desired downstream canal diversions.

[39] The collection of hourly data in the Sevier River Basin has been ongoing since 2001. The available data provide approximately 16,000 input-output pairs. The data from the 2001, 2002, and 2003 irrigation seasons were used to build the machine, while data from the 2004 irrigation season were used to check the validity of the machine when functioning under real-world conditions.

[40] This paper supports short-term anticipation of required reservoir releases. Irrigation demands represent the quantities of water that farmers request be delivered to their head gates. Such requests are made one day in advance of the deliveries and are expressed to the reservoir operator by the various canal operators in the form of hourly measurements of downstream canal diversions. Weather information can directly influence the behavior of farmers and canal operators in the basin, so the inclusion of temperature, relative humidity, wind speed, solar radiation, and total precipitation data as predictors can enhance model performance. In addition, continuous historical streamflow data are available on the main stem.

[41] A cross correlation of diversion orders, climate indices, and streamflow with respect to reservoir releases was performed to explore the degree of relevancy and the time dependency among the variables. The qualitative assessment of these results supports the use of eight inputs in the form of diversion orders at the different downstream canals, represented by $O_{t+\Delta t}$, a vector of water orders to be delivered at time $t + \Delta t$.

[42] Three inputs are hourly streamflow measurements made at three points along the river downstream of the reservoir, and one input provides inflow data from Clear Creek, a major gauged tributary approximately one day of travel time downstream of the reservoir. Clear Creek is an example of an uncontrolled tributary stream whose spring and early summer diurnal fluctuations make downstream water management more difficult. The streamflow inputs, $Q_{t-24}$, are averages of the vectors of hourly streamflow at the three main stem gages and the Clear Creek outflow for the 24 hours preceding the prediction time $t$.

[43] The climate index, $CI_t$, is the first principal component of temperature, relative humidity, solar radiation, wind speed, and precipitation, and captures 98 percent of the variance in these variables. This index is calculated over the 24-hour period prior to $t$. The inclusion of the first principal component produced a more accurate model than accounting for each of the climatic variables independently.

[44] The inputs of the model are then expressed as

$$\mathbf{x} = \begin{bmatrix} Q_{t-24} & O_{t+\Delta t} & CI_t \end{bmatrix} \qquad (20)$$

The target output of the model is the quantity of water to be released at the time of prediction.
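A sketch of how the input vector of equation (20) might be assembled each hour follows; the array shapes, variable names, and the use of scikit-learn's PCA for the climate index are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from sklearn.decomposition import PCA

# Climate index: first principal component of the five weather variables,
# fitted on historical hourly records (placeholder data).
hist_climate = np.random.rand(16000, 5)   # temp, RH, radiation, wind, precip
pca = PCA(n_components=1).fit(hist_climate)

def build_input(q_avg_24h, orders_next_day, climate_last_24h):
    """x = [Q_{t-24}, O_{t+dt}, CI_t] per equation (20).
    q_avg_24h:        (4,)  24-hour average flows: 3 main stem gages + Clear Creek
    orders_next_day:  (8,)  diversion orders for the downstream canals
    climate_last_24h: (24, 5) hourly weather observations before time t."""
    ci = pca.transform(climate_last_24h.mean(axis=0, keepdims=True))[0, 0]
    return np.concatenate([q_avg_24h, orders_next_day, [ci]])
```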
4.4. Organization of the Adaptive Model Paradigm

[45] Establishing a reliable dynamic model for providing the necessary information to support advanced water resources decision making has always been an important concern for researchers and field operators. In particular, effective real-time management of the water of the Sevier River Basin is seen by the managers of the irrigation systems in the basin to be extremely important in achieving an optimum allocation of their scarce water resources.

[46] Decision making is normally established as a multiobjective dynamic problem, and the solution is found by solving an optimization problem [Bhattacharya et al., 2003]. Real-time control ensures a close match of the various time-varying requirements of different water uses and may become an effective tool for water managers. In this paper, however, the framework can be viewed as an expert system in which the model attempts to capture knowledge of how the reservoir operators make decisions. Once trained, the framework attempts to forecast the decisions the reservoir operators would make. In other words, the established framework aims to estimate or predict entity states that provide insightful information to decision makers. This level of induction could be obtained by non-Bayesian methods, e.g., feed-forward back propagation artificial neural networks (ANNs), fuzzy logic models, and support vector machines (SVMs), or with Bayesian solutions such as relevance vector machines (RVMs). Owing to their remarkable features and superior performance over non-Bayesian algorithms, RVMs have been adopted here. Since there are a considerable number of design choices for RVMs, after selection of inputs and outputs, the second step of machine learning is to select optimal model specifications. For the objective of a fully automated machine, adaptive simulated annealing (ASA) is used to perform this level of inference (i.e., model selection).

[47] Having built the most plausible model, the second level of the framework is to use the machine as a decision-support system. Hence its performance must be monitored, and the model should be updated when needed. There are two events that might occur that indicate the model should be updated. One is the presence of a new data set that has
not been previously exploited; the other is the case where there is concept drift, i.e., new trends in the data that the machine has not learned. For this purpose, SVMs have been used to detect an abnormality or novelty and thereby trigger the machine to adapt to these events by retraining. Figure 3 [after Kandola, 2001] illustrates the process flow from the raw data to the decision-making level, and shows where the concepts of model building (RVM), model selection (ASA), and novelty detection (SVM) fit in a fully automated data-driven paradigm. The proposed framework will ultimately be integrated into a water resources information management system to be used by the operators of the Sevier River water systems, especially as a prediction tool to help manage their limited water resources.

Figure 3. Model structure and process flow to achieve high-level inference.

[48] It is worth mentioning here that the RVM is a robust and reliable learning machine and can be used confidently to capture the behavior of fairly complicated systems. In spite of the simplicity of the proposed application, the authors are convinced that the suggested paradigm could be adopted to simulate and forecast nonstationary dynamic systems.
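Reduced to code, the Figure 3 flowchart is a forecast-monitor-retrain loop. The sketch below assumes fitted `rvm` and `detector` objects and a `retrain` callback that reruns ASA model selection and RVM training; the windowing logic and drift threshold are illustrative choices.

```python
import numpy as np

def adaptive_loop(rvm, detector, hourly_stream, retrain, drift_frac=0.05):
    """Forecast with the current RVM, accumulate observed input/target pairs,
    flag novel inputs with the one-class SVM, and rebuild both models when
    sustained novelty (concept drift) is detected."""
    window_x, window_t = [], []
    for x, t_obs in hourly_stream:
        y_mean, y_var = rvm.predict(x)       # forecast with uncertainty bounds
        yield y_mean, y_var                  # handed to the reservoir operator
        window_x.append(x)
        window_t.append(t_obs)
        novel = detector.predict(np.asarray(window_x)) == -1
        if novel.mean() > drift_frac:        # sustained novelty => concept drift
            rvm, detector = retrain(np.asarray(window_x), np.asarray(window_t))
            window_x, window_t = [], []
```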
5. Results and Discussion

[49] Reservoir releases must be made to address the conflicting goals of satisfying downstream demands with high certainty while at the same time conserving water in the reservoir for use later in the season. The proposed fully automated machine has been implemented for real-world data describing the Sevier River Piute Reservoir water delivery system and constitutes a major building block toward integrated reservoir management practices.

[50] While ANNs have been extensively employed in water resources planning, management, and simulation [ASCE Task Committee on the Application of ANNs in Hydrology, 2000a, 2000b], the recently celebrated RVM approach offers many potentially advantageous features, especially generalization performance and sparse representation. Therefore the performance of a feed-forward ANN is included for comparison. Figure 4 illustrates the performance of the ANN and RVM models in the training phase.

[51] Figure 4a shows the performance of the RVM when trained using the optimal multiple kernel parameters that resulted from the proposed model selection procedure. The training phase was performed using data from the
2001, 2002, and 2003 irrigation seasons. Again, model selection involves choosing parameter values so that the model output matches the measured behavior of the system as closely as possible. Kernels contain parameters that are generally determined heuristically. Douglas et al. [2000] argued that, for the purpose of using a model for online forecasting, it is desirable to select a single representative parameter set that provides an acceptable trade-off in fitting the different parts of the record. The adaptive simulated annealing (ASA) algorithm is employed to automate this step, in a fashion that accounts for the input data structure by choosing multiple kernel parameters. The machine removes redundant features to improve generalization and utilizes only 43 relevance vectors (RVs) from the full data set used for training. The RVs are a subset of the training data set that is used for prediction; in other words, they are the state vectors that represent the center locations of the kernel functions.

[52] Figure 4b shows the performance of the RVM using only a single kernel parameter (associated with a Gaussian kernel function) obtained by tenfold cross validation; cross validation is prohibitive in the case of multiple parameters. This formulation produces a model having 72 relevance vectors. Thus there is a noticeable influence of multiple kernel parameters on model parsimony. This sparsity is due to the selection of an objective function that maximizes the type II maximum likelihood [Tipping, 2001]. The RVM thus ignores irrelevant inputs to reduce complexity and spurious overfitting, and it can be used to summarize the information by maintaining the major features of the data set via the RVs.

[53] Figure 4c depicts the performance of the ANN. Table 1 reports several efficiency statistics with which to judge the performance of the multiple-kernel RVM, the single-kernel RVM, and the ANN. One may conclude that if the ANN model is cautiously structured against overfitting and is well trained, it can provide performance competitive with RVMs. However, when trained using a scarce data set, RVMs are expected to provide better generalization capabilities. A further advantage of RVMs lies in their ability to provide uncertainty bounds. In addition, unlike the ANN, the structure of the RVM model is dictated by the information content of the data set, and thus the necessity for heuristic assumptions is tempered. For more details about the comparison between these learning machines, the reader is referred to Khalil et al. [2005b].

[54] Figures 5a, 5b, and 5c show performance results of the testing phase for the multiple-kernel RVM, single-kernel RVM, and ANN, respectively. The testing was carried out for the 2004 irrigation season. Note that in the case of the RVM there is a 95% confidence interval associated with the predicted values. Statistics of interest have been evaluated (Table 1) to test machine performance; for more details about goodness-of-fit measures, see David and Gregory [1999] and Willmott et al. [1985]. It is evident in Figure 5 that both the ANN and RVM performance is inadequate to capture the behavior of the system throughout the 2004 irrigation season. Thus adaptation to account for this nonstationarity or novelty in the system behavior is necessary.
Figure 4. Time series plot of actual versus predicted releases with confidence intervals for the training phase: (a) RVM performance with different scale parameters obtained by linking RVM to ASA, (b) RVM with fixed kernel parameter, and (c) ANN.
Table 1. Machine Performance Using Different Statistics(a)

                                 RVM + ASA                              RVM                  ANN
Statistic                    Training  Testing  Adaptive System   Training  Testing   Training  Testing
Correlation coefficient r      0.974    0.901       0.961           0.973    0.893      0.968    0.875
Coefficient of efficiency E    0.948    0.809       0.923           0.947    0.791      0.936    0.756
Bias, cfs                      1.303    5.810       1.386           1.467   10.583      0.311   12.510
Root mean square error, cfs   43.529   55.616      35.217          43.974   58.083     48.190   62.730
Mean absolute error, cfs      30.993   37.889      25.152          31.970   38.554     34.168   41.531
Index of agreement d           0.987    0.944       0.979           0.986    0.938      0.983    0.926

(a) Italics (here, the adaptive system column) indicates best performance.
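For reference, the statistics reported in Table 1 can be computed as in the following sketch; standard definitions of the coefficient of efficiency and Willmott's index of agreement are assumed.

```python
import numpy as np

def table1_statistics(obs, pred):
    """Goodness-of-fit measures of Table 1 for observed vs. predicted releases."""
    err = pred - obs
    E = 1.0 - np.sum(err ** 2) / np.sum((obs - obs.mean()) ** 2)   # efficiency
    d = 1.0 - np.sum(err ** 2) / np.sum(                           # agreement
        (np.abs(pred - obs.mean()) + np.abs(obs - obs.mean())) ** 2)
    return {
        "r": np.corrcoef(obs, pred)[0, 1],   # correlation coefficient
        "E": E,                              # coefficient of efficiency
        "bias_cfs": err.mean(),
        "RMSE_cfs": np.sqrt(np.mean(err ** 2)),
        "MAE_cfs": np.mean(np.abs(err)),
        "d": d,                              # index of agreement
    }
```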
Figure 5. Time series plot of actual versus predicted releases with confidence intervals for the testing phase (2004 irrigation season): (a) RVM performance with different scale parameters obtained by linking RVM to ASA, (b) RVM with fixed kernel parameter, and (c) ANN.

[55] The SVM has been reformulated to provide a new algorithm, in line with Vapnik's principle, for detecting outliers and novelty [Scholkopf et al., 2000]. As illustrated in Figure 2, we construct a hyperplane so as to maximize its distance from the origin while allowing $\nu N$ outliers (refer to equation (17)). The distance $\sum_i \gamma_i k(\mathbf{x}, \mathbf{x}_i) - \rho$ is estimated for each input vector from the hypothetical origin in the feature space. The illustration is modified from Scholkopf and Smola [2002] and Müller et al. [2001]. We used a Gaussian kernel, which has the advantage that the data are always separable from the origin in feature space [Burges, 1998; Cristianini and Shawe-Taylor, 2000; Vapnik, 1995]. The novelty detection model was trained using the 2001, 2002, and 2003 data sets and tested over the 2004 irrigation season data set. Figure 6a represents equation (19), wherein $f(\mathbf{x})$ takes the value +1 in a region capturing most of the data points and 0 to identify outliers and new trends in the data. In other words, Figure 6a depicts the correspondence of the new trends or outliers to the reservoir releases. This triggers the machine to retrain to account for the new data that have not been previously used for training. The abrupt changes from 0 to 1 in the Heaviside function behavior
(i.e., equation (19)) shown in Figure 6a represent noisy input signals that provide insight for the purpose of condition monitoring and fault diagnosis; however, the smooth trends that pass the decision boundary represent concept drift.

Figure 6. Time series plot of actual versus predicted releases with confidence intervals for the full paradigm (RVM plus SVM machine) and novelty detection functions.

[56] Figure 6b illustrates the results of the machine when linked to the SVM, where it has been retrained to account for novelty according to the flowchart depicted in Figure 3. The paradigm triggered adaptation and retraining of the RVM model and the novelty detection model 5 times during the 2004 season. There is a pronounced improvement in the behavior of the paradigm, as reflected in the estimated statistics of interest (see the adaptive system column of Table 1). Because of this adaptation, the number of RVs increased to 54. The model performs remarkably well, and Table 1 provides statistics of interest for the full paradigm; these statistics further show that the new combination achieves a low error rate. The coefficient of efficiency increased from 0.809 to 0.923. For interpretation purposes, a coefficient of efficiency of 0.9 indicates that the machine has a mean square error equal to 10% of the variance in the data.

[57] Figure 7a illustrates the probability plot of the observed versus the predicted reservoir releases. Since
residuals are thought of as an unexplained element of variation, one expects them to have a roughly normal distribution with zero mean. Figure 7b depicts a histogram of the residuals with a superimposed normal density function. It shows an approximately normal distribution of residuals produced by the paradigm. [58] It is known that abundant data that accurately characterize the underlying distribution provide robustness for machine learning applications. To ensure good generalization of the inductive learning algorithm given scarce data, the machine has been built on many bootstrap samples from the original data set in order to explore the implications of the assumptions made about the nature of the data. Figure 8 shows the results of training and testing using different bootstrap samples. The training phase was conducted utilizing data from 2001, 2002, and 2003, while the testing was evaluated over the 2004 data set. Note that adaptations of the framework over 2004 were not carried out in this analysis. The bootstrapping result provides a way to evaluate the significance of some indices and thus draw conclusions about model reliability. From Figure 8 one also could deduce rough confidence bounds [Kandola, 2001], which are more revealing of model performance
than single values [Willmott et al., 1985]. The width of the bootstrapping confidence intervals indicates implicit uncertainty in the machine parameters. A wide confidence interval indicates that the available training data set is inadequate to find a robust parameter set [Kuan et al., 2003]. The RVM model shows fairly narrow confidence bounds in both the testing and training phases, and hence one could conclude that the model is rather robust.

Figure 7. (a) Probability plot of observed versus predicted reservoir releases for the full paradigm. (b) Residuals histogram (2003 irrigation season).
Figure 8. Statistical analysis of different bootstrapping samples.
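A rough version of the bootstrap evaluation behind Figure 8 is sketched below; the resampling scheme, the choice of the efficiency E as the tracked statistic, and the `train_and_predict` callback are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_efficiency(X, t, train_and_predict, n_boot=200):
    """Refit on resampled training pairs and collect the coefficient of
    efficiency E, yielding rough confidence bounds on model performance."""
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(t), size=len(t))   # sample with replacement
        pred = train_and_predict(X[idx], t[idx])     # placeholder callback
        err = pred - t[idx]
        scores.append(1.0 - np.sum(err ** 2)
                      / np.sum((t[idx] - t[idx].mean()) ** 2))
    return np.percentile(scores, [2.5, 97.5])        # ~95% bounds on E
```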
[59] The performance results have demonstrated the successful combination and implementation of Bayesian principles (RVM), model selection (ASA), and novelty detection (SVM). To utilize the model in near real-time, the predicted reservoir releases can be provided to the reservoir operator, and then it is possible for the operator and experts to analyze, judge, and evaluate the results of the machine according to their own knowledge and experience.
6. Summary and Conclusions [60] In this paper, we have proposed an operational approach to a decision aid for managing near real-time reservoir releases in the Sevier River Basin. The principal water problem in the basin is twofold: inadequacy and uncertainty of supplies. In this context, a reliable water supply planning policy, specifically during the dry season, necessitates acceptably accurate anticipations of future water states. The approach presented here constitutes a fully adaptive paradigm within the Bayesian context, which if utilized in an integrated management framework, could offer a significant gain for real-time water resources management. [61] Primarily, the results show that the coupling of relevance vector machines and support vector machines is promising since it allows the model to adapt to changes in the behavior of the basin. Growing evidence that there are streamflow variations during the season and from season to season, as well as shifts in climate indices [Kahya and Dracup, 1993] and shifts in farmer and system operator behavior, has led us to incorporate a novelty detection algorithm in the real-time decision support system. [62] This paper explored the use of unsupervised support vector machines in a sequenced learning technique to recognize behaviors that are outside the norm and then to trigger the RVM to recognize new patterns. Estimating many statistics of interest in conjunction with data display graphics and bootstrapping analysis accomplishes a broad operational evaluation of the model. In summary, the framework described in this manuscript represents a first attempt to exploit the real-time database available on the Sevier River to address the range in information needs of stakeholders and managers and improve on-demand flexibility in reservoir operation. [63] Whereas machine-learning techniques have great potential to be used in decision-support systems, we believe they have not been fully exploited in water-related issues. Strictly speaking, it is reasonable to conclude that one could view the present work as an adaptive paradigm in which different algorithms have been fused to provide better estimates of operational decisions and detect novelty trends. The approach presented uses a concrete paradigm that is mathematically sound with well-behaved computational complexity. Beven and Binley [1992] suggested that many models are overparameterized and therefore result in equifinality. Equifinality is associated with the multiplicity of different possible combinations of values of model parameters. We argue therefore that the model structure proposed in this manuscript was formulated so as to alleviate equifinality, the curse of dimensionality, and ensure parameter uniqueness. Parametrically efficient RVMs (sparseness and parsimony), inclusion of many measures each emphasizing a different aspect of model
behavior in model selection, and the use of structural risk minimization via SVMs together support the robustness and reliability of the proposed paradigm.

[64] Finally, one of the shortcomings of this approach is that, regardless of the parsimony of the model structure reflecting the most dominant characteristics of the system, it is not clear how a meaningful physical interpretation can be extracted from the resulting model definition. In spite of this, we believe that the imposed novelty detection tool ensures the persistence of the basic dominant characteristics and reflects any abnormalities, and one might be able to interpret such events in physically meaningful contexts. In addition, this framework could be fused within an optimization context or a real-time control system to quantify the benefits and risks associated with the decisions made to enhance reservoir operations. Another seemingly unavoidable disadvantage of the algorithm that handles novelty detection is that it detects an abnormality only after new data are available, that is, after the learning algorithm's performance starts to deteriorate [Klinkenberg and Joachims, 2000; Olivier et al., 1999].
References Armijos, A., J. R. Wright, and M. H. Houck (1990), Bayesian inferencing applied to real-time reservoir operations, J. Water Resour. Plann. Manage., 116(1), 38 – 51. ASCE Task Committee on the Application of ANNs in Hydrology, (2000a), Artificial neural networks in hydrology, I: Preliminary concepts, J. Hydrol. Eng., 5(2), 115 – 123. ASCE Task Committee on the Application of ANNs in Hydrology, (2000b), Artificial neural networks in hydrology, II: Hydrologic application, J. Hydrol. Eng., 5(2), 124 – 137. Asefa, T., M. W. Kemblowski, G. Urroz, M. McKee, and A. Khalil (2004), Support vectors – based groundwater head observation networks design, Water Resour. Res., 40, W11509, doi:10.1029/2004WR003304. Bennett, K. P., and C. Campbell (2000), Support vector machines: Hype or hallelujah?, SIGKDD Explor., 2(2), 1 – 13. Berger, B., R. Hansen, and A. Hilton (2002), Using the World-Wide-Web as a support system to enhance water management, paper presented at the 18th ICID Congress and 53rd IEC Meeting, Int. Comm. on Irrig. and Drain., Montreal, Quebec, Canada. Berger, J. O. (1985), Statistical Decision Theory and Bayesian Analysis, 2nd ed., Springer, New York. Beven, K. J., and A. M. Binley (1992), The future of distributed models: Model calibration and predictive uncertainty, Hydrol. Processes, 6, 279 – 298. Bhattacharya, B., A. H. Lobbrecht, and D. P. Solomatine (2003), Neural networks and reinforcement learning in control of water systems, J. Water Resour. Plann. Manage., 129(6), 458 – 465. Blum, A. (1998), On-line algorithms in machine learning, in Developments From A June 1996 Seminar on Online Algorithms: The State of the Art, Lect. Notes Comput. Sci., vol. 1442, edited by A. Fiat and G. J. Woeginger, pp. 306 – 325, Springer, New York. Boser, B. E., I. M. Guyon, and V. N. Vapnik (1992), A training algorithm for optimal margin classifiers, in Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, edited by D. Haussler, pp. 144 – 152, PACM, Pittsburgh, Pa. Burges, C. J. C. (1998), A tutorial on support vector machines for pattern recognition, Data Min. Knowl. Discovery, 2(2), 121 – 167. Carpenter, T. M., P. Konstantine, and A. Georgakakos (2001), Assessment of Folsom Lake response to historical and potential future climate scenarios: 1. Forecasting, J. Hydrol., 249, 148 – 175.
Cristianini, N., and J. Shawe-Taylor (2000), An Introduction to Support Vector Machines, Cambridge Univ. Press, New York.
David, R. L., and M. J. Gregory (1999), Evaluating the use of "goodness-of-fit" measures in hydrologic and hydroclimatic model validation, Water Resour. Res., 35(1), 233–241.
Douglas, P. B., H. V. Gupta, and S. Sorooshian (2000), Toward improved calibration of hydrologic models: Combining the strengths of manual and automatic methods, Water Resour. Res., 36(12), 3663–3674.
Gleik, P. H. (1993), Water in Crisis, Oxford Univ. Press, New York.
Hajilal, M. S., N. H. Rao, and P. B. S. Sarma (1998), Real time operation of reservoir based canal irrigation systems, Agric. Water Manage., 38(2), 103–122.
Hsu, K. L., H. V. Gupta, X. Gao, S. Sorooshian, and B. Imam (2002), Self-organizing linear output map (SOLO): An artificial neural network suitable for hydrologic modeling and analysis, Water Resour. Res., 38(12), 1302, doi:10.1029/2001WR000795.
Ingber, L. (1989), Very fast simulated re-annealing, Math. Comput. Model., 12(8), 967–973.
Ingber, L. (1993), Adaptive simulated annealing: Practice versus theory, Math. Comput. Model., 18(12), 29–57.
Ingber, L., and B. Rosen (1992), Genetic algorithm and very fast simulated annealing: Comparison, Math. Comput. Model., 16(11), 87–100.
Jan, T. K., N.-S. Hsu, W. Chu, S. Wan, and Y.-J. Lin (1999), Real-time operation of Tanshui River reservoirs, J. Water Resour. Plann. Manage., 116(3), 349–361.
Kahya, E., and J. A. Dracup (1993), U.S. streamflow patterns in relation to the El Niño/Southern Oscillation, Water Resour. Res., 29(8), 2491–2504.
Kandola, S. (2001), Interpretable modeling with sparse kernels, Ph.D. thesis, Univ. of Southampton, Southampton, U. K.
Khadam, I. M., and J. J. Kaluarachchi (2004), Use of soft information to describe the relative uncertainty of calibration data in hydrologic models, Water Resour. Res., 40, W11505, doi:10.1029/2003WR002939.
Khalil, A., M. McKee, M. Kemblowski, and T. Asefa (2005a), Basin-scale water management and forecasting using multisensor data and neural networks, J. Am. Water Resour. Assoc., 41(1), 195–208.
Khalil, A., M. N. Almasri, M. McKee, and J. J. Kaluarachchi (2005b), Applicability of statistical learning algorithms in ground water quality modeling, Water Resour. Res., 41, W05010, doi:10.1029/2004WR003608.
Khalil, A., M. McKee, M. Kemblowski, T. Asefa, and L. Bastidas (2005c), Multiobjective analysis of chaotic systems using sparse learning machines, Adv. Water Resour., in press.
Klinkenberg, R., and T. Joachims (2000), Detecting concept drift with support vector machines, in Proceedings of ICML-00, 17th International Conference on Machine Learning, edited by P. Langley, pp. 487–494, Elsevier, New York.
Kuan, M. M., C. P. Lim, and R. F. Harrison (2003), On operating strategies of the fuzzy ARTMAP neural network: A comparative study, Int. J. Comput. Intell. Appl., 3, 23–43.
Labadie, J. W. (2004), Optimal operation of multireservoir systems: State-of-the-art review, J. Water Resour. Plann. Manage., 130(2), 93–111.
Law, M. H., and J. T. Kwok (2001), Applying the Bayesian evidence framework to ν-support vector regression, in Machine Learning: ECML 2001: Proceedings of the 12th European Conference on Machine Learning, pp. 312–323, Springer, New York.
Li, Y., C. Campbell, and M. Tipping (2002), Bayesian automatic relevance determination algorithms for classifying gene expression data, Bioinformatics, 18(10), 1332–1339.
MacKay, D. J. (1992a), Bayesian methods for adaptive models, Ph.D. thesis, Dep. of Comput. and Neural Syst., Calif. Inst. of Technol., Pasadena.
MacKay, D. J. (1992b), Bayesian interpolation, Neural Comput., 4(3), 415–447.
MacKay, D. J. (1994), Bayesian methods for backpropagation networks, in Models of Neural Networks III, edited by E. Domany, J. L. van Hemmen, and K. Schulten, chap. 6, pp. 211–254, Springer, New York.
Moradkhani, H., K. Hsu, H. V. Gupta, and S. Sorooshian (2004), Improved streamflow forecasting using self-organizing radial basis function artificial neural network, J. Hydrol., 295(1–4), 246–262.
Mousavi, J. S., K. Mahdizadeh, and A. Afshar (2004), A stochastic dynamic programming model with fuzzy storage states for reservoir operations, Adv. Water Resour., 27, 1105–1110.
Müller, K.-R., S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf (2001), An introduction to kernel-based learning algorithms, IEEE Trans. Neural Networks, 12(2), 181–201.
National Research Council (2001), Envisioning the Agenda for Water Resources Research in the Twenty-First Century, Natl. Acad. Press, Washington, D. C.
Neal, R. (1994), Bayesian learning for neural networks, Ph.D. thesis, Univ. of Toronto, Toronto, Ontario, Canada.
Olivier, C., V. Vapnik, and J. Weston (1999), Transductive inference for estimating values of functions, Neural Inf. Process. Syst., 12, 421–427.
Perreault, L., J. Bernier, B. Bobée, and E. Parent (2000a), Bayesian change point analysis in hydrometeorological time series, part 1: The normal model revisited, J. Hydrol., 235, 221–241.
Perreault, L., J. Bernier, B. Bobée, and E. Parent (2000b), Bayesian change point analysis in hydrometeorological time series, part 2: Comparison of change-point models and forecasting, J. Hydrol., 235, 242–263.
Postel, S. L., G. C. Daily, and P. R. Ehrlich (1996), Human appropriation of renewable fresh water, Science, 271, 785–788.
Quiñonero-Candela, J., and L. K. Hansen (2002), Time series prediction based on the relevance vector machine with adaptive kernels, paper presented at the International Conference on Acoustics, Speech and Signal Processing, IEEE Signal Process. Soc., Orlando, Fla.
Rasmussen, P. (2001), Bayesian estimation of change points using the general linear model, Water Resour. Res., 37(11), 2723–2731.
Schölkopf, B., and A. J. Smola (2002), Learning With Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, MIT Press, Cambridge, Mass.
Schölkopf, B., C. Burges, and A. J. Smola (1999), Advances in Kernel Methods: Support Vector Learning, MIT Press, Cambridge, Mass.
Schölkopf, B., R. Williamson, A. Smola, J. Shawe-Taylor, and J. Platt (2000), Support vector method for novelty detection, Adv. Neural Inf. Process. Syst., 12, 582–588.
Schölkopf, B., J. Platt, J. Shawe-Taylor, A. Smola, and R. Williamson (2001), Estimating the support of a high-dimensional distribution, Neural Comput., 13(7), 1443–1471.
Shepherd, A., and L. Ortolano (1996), Water supply system operations: Critiquing expert-system approach, J. Water Resour. Plann. Manage., 122(5), 348–355.
Shrestha, B. P., L. Duckstein, and E. Z. Stakhiv (1996), Fuzzy rule-based modeling of reservoir operation, J. Water Resour. Plann. Manage., 122(4), 262–269.
Simonovic, S. P. (1992), Reservoir systems analysis: Closing the gap between theory and practice, J. Water Resour. Plann. Manage., 118(3), 262–280.
Suykens, J. A. K., T. V. Gestel, J. D. Brabanter, B. D. Moor, and J. Vandewalle (2002), Least Squares Support Vector Machines, World Sci., Hackensack, N. J.
Suykens, J. A. K., G. Horvath, S. Basu, C. Micchelli, and J. Vandewalle (Eds.) (2003), Advances in Learning Theory: Methods, Models and Applications, NATO Sci. Ser., Ser. III, vol. 190, IOS Press, Washington, D. C.
Tipping, M. (2000), The relevance vector machine, Adv. Neural Inf. Process. Syst., 12, 652–658.
Tipping, M. E. (2001), Sparse Bayesian learning and the relevance vector machine, J. Mach. Learn. Res., 1, 211–244.
Tipping, M., and A. Faul (2003), Fast marginal likelihood maximization for sparse Bayesian models, paper presented at the Ninth International Workshop on Artificial Intelligence and Statistics, Soc. for Artif. Intell. and Stat., Key West, Fla.
Utah Board of Water Resources (2001), Utah's water resources: Planning for the future, report, 88 pp., Div. of Water Resour., Salt Lake City, Utah.
Van Gestel, T., J. A. K. Suykens, D. Baestaens, A. Lambrechts, G. Lanckriet, B. Vandaele, B. De Moor, and J. Vandewalle (2001), Financial time series prediction using least squares support vector machines within the evidence framework, IEEE Trans. Neural Networks, 12(4), 809–821.
Vapnik, V. N. (1995), The Nature of Statistical Learning Theory, Springer, New York.
Vapnik, V. N. (1998), Statistical Learning Theory, John Wiley, Hoboken, N. J.
Velickov, S., and D. P. Solomatine (2000), Predictive data mining: Practical examples, in AI Methods in Civil Engineering Applications, 2nd Joint Workshop on AI Methods in Civil Engineering Applications, edited by O. Schleider and A. Zijderveld, pp. 3–19.
Wahba, G. (1985), A comparison of GCV and GML for choosing the smoothing parameter in the generalized spline smoothing problem, Ann. Stat., 13(4), 1378–1402.
Williams, C. K. I. (1999), Prediction with Gaussian processes: From linear regression to linear prediction and beyond, in Learning in Graphical Models, edited by M. I. Jordan, pp. 599–621, MIT Press, Cambridge, Mass.
Willmott, C. J., S. G. Ackleson, R. E. Davis, J. J. Feddema, K. M. Klink, D. R. Legates, J. O'Donnell, and C. M. Rowe (1985), Statistics for the evaluation and comparison of models, J. Geophys. Res., 90(C5), 8995–9005.
Wurbs, R. A. (1993), Reservoir-system simulation and optimization models, J. Water Resour. Plann. Manage., 116(1), 52–70.
Yao, H., and A. Georgakakos (2001), Assessment of Folsom Lake response to historical and potential future climate scenarios: 2. Reservoir management, J. Hydrol., 249, 176–196.
Yeh, W. W.-G. (1985), Reservoir management and operations models: A state-of-the-art review, Water Resour. Res., 21(12), 1797–1818.
Yoshua, B., P. Vincent, and J. Paiement (2003), Learning eigenfunctions of similarity: Linking spectral clustering and kernel PCA, Tech. Rep. 1232, Dép. d'Inf. et de Rech. Opérationnelle, Univ. de Montréal, Montreal, Quebec, Canada.
T. Asefa, M. Kemblowski, A. Khalil, and M. McKee, Department of Civil and Environmental Engineering, Utah State University, Logan, UT 84322-8200, USA. ([email protected])