Hydrological Sciences -Journal- des Sciences Hydrologiques, 39, 5, October 1994


A computer-based system for modelling the stage-discharge relationships in steady state conditions

KEVIN D. GAWNE & SLOBODAN P. SIMONOVIC
Department of Civil Engineering, The University of Manitoba, Winnipeg, Manitoba R3T 2N2, Canada

Abstract In order to support the effort of Environment Canada directed to the automation of surface water quantity data management, an Intelligent Decision Support System (IDSS), to be used to create, develop and maintain stage-discharge (S-D) rating curves, is in the final stage of development. Environment Canada is the agency responsible for establishing gauging stations and monitoring streamflow in Canada. Hydraulic data such as stage and discharge are measured at these stations and used for the development of S-D rating curves. These in turn are cross-referenced with continuous water level recordings to generate daily stream discharges. The current procedure for establishing S-D curves used by Environment Canada is subjective, due to the manual labour involved in plotting measurements, analysing data points and deciding whether to alter or update an existing curve. The IDSS computer system is based on the concept of representing rating curves with mathematical models or equations. The paper describes the main characteristics of the modelling module of the IDSS. Linear regression is the modelling technique used within the IDSS, together with power and Box-Cox transformations introduced to achieve linearity of the data. Within the module presented in the paper, the generation of multiple S-D curves is combined with comprehensive statistical and multi-objective analyses in order to select an optimal mathematical model or equation.

Un système informatique pour la modélisation des relations de niveau-débit en régime permanent (A computer-based system for modelling stage-discharge relationships in steady state conditions)

Résumé (translated from the French) In order to support the efforts of Environment Canada to put in place automated management of surface water quantity data, an intelligent decision support system intended to establish and update rating curves has been designed and is currently in the final stage of its development. Environment Canada is the agency responsible for establishing the gauging stations that monitor the discharges of Canada's watercourses. Hydraulic data such as stage and discharge are measured there and used to establish the rating curves. These are in turn used to estimate daily discharges from water levels. The procedure currently in force at Environment Canada for establishing rating curves is subjective, whether in the preliminary analysis of the data, the manual drawing of the rating curves or their updating. A computer system (intelligent decision support system) whose principle is to model rating curves with equations has been proposed. This paper presents the main characteristics of the modelling module of this intelligent decision support system. The modelling technique adopted is linear regression, with power or Box-Cox transformations used to linearize the data. In the module presented here, the generation of multiple rating curves is combined with comprehensive statistical and multi-objective analyses that allow the optimal model or equation to be selected.

Open for discussion until 1 April 1995

INTRODUCTION

A "stage-discharge" curve is a relationship between stream stage (water level) and discharge for a particular section of a stream channel. Continuous water level data are recorded by the Water Resources Branch of Environment Canada. Making continuous discharge measurements is impractical and therefore a stage-discharge (S-D) relationship is commonly used to estimate stream discharges at measured stage values.

To establish a stage-discharge relationship, periodic measurements of stream discharge and corresponding stage are made at each gauging station. In practice these data points are then plotted on linear coordinate paper with stage on the ordinate. Data are plotted using different scales to best represent the stream for various ranges of stage. This is done to maintain a level of accuracy for all flow regimes contained in the rating tables which are derived later from the plotted S-D measurements. A smooth curve is drawn through the points. Graphical procedures are used to determine the zero-flow stage and to extrapolate the curve over higher stage values, since S-D measurements in high water conditions are typically infrequent. The curve is then entered into a computer using a digitizer and a rating table is produced.

S-D measurements rarely form a distinct curve since streams are often influenced by non-ideal conditions. This results in measurement points which deviate significantly from the general trace of the curve. Thus, the technician requires knowledge of the hydraulic and hydrological influences on the S-D relationship. The practice of hand fitting S-D curves is subjective and time consuming. Other methods include mathematical approaches, similar to those discussed herein, that avoid hand curve fitting but lack thorough statistical analysis.
For these reasons, an Intelligent Decision Support System (IDSS) has been developed for creating, using and updating stage-discharge relationships (Simonovic, 1989b; Douglas & Simonovic, 1992; Douglas, 1992). The IDSS is currently being developed on a Sun Microsystems Sparc 1+ Workstation. It combines three major computer-based tools used to support the overall framework of S-D analysis. The data management component of the computer system is mainly concerned with the storage of S-D measurements and other associated information. The Oracle Relational Database Management System (RDBMS) was chosen for this function. An expert system tool, Nexpert Object (Neuron Data, 1991), is used as the main tool for the development of a system integrator (the brain of the system). Within this function, all the activities related to the S-D analysis are coordinated. This function is created to assist the user in several ways: in selecting the user mode; in extracting the data to be used; in performing mathematical modelling of a relationship; in presenting final results; and in deriving the discharge values from measured water levels. Use of an expert system tool establishes an environment suitable for representing human knowledge, expertise and experience. To facilitate the presentation of numerical results in a graphical form, a graphing utility, XGRAPH, is included in the system as the last of the three main tools.

The sub-system discussed in this paper, entitled Automated Statistical Modelling (ASM), constitutes only a portion of the overall IDSS. ASM incorporates mathematical modelling techniques coupled with a statistical analysis of the models to produce relationships for given stage-discharge records. A multi-objective procedure based on a number of criteria is then used to determine the most representative model. Thus, the ASM module of the IDSS is divided into three main components:
(a) generation of multiple S-D mathematical models (automated);
(b) statistical analysis of S-D models (user interactive); and
(c) ranking of S-D models (automated).

Linear regression analysis is the modelling technique used to create the S-D relationship. The analysis requires a linear relationship between the data, and so the system includes a procedure that transforms the data in an attempt to achieve linearity prior to modelling. The power and Box-Cox transformations (Box & Cox, 1964) are used, stepping through predetermined ranges of transformation parameters to generate a number of modelling data sets. A linear regression analysis is performed on each of the data sets.

The ASM module includes a thorough statistical analysis of each model produced, screening out all linear regression models that are statistically incorrect.
That is, all S-D models that do not satisfy the requirements of a linear regression are discarded. The linear regression requires that criteria be met regarding the residuals, which are simply the differences between the model-predicted discharges and the measured discharges. The residuals must be independent, of uniform variance, and follow a normal distribution (Kleinbaum & Kupper, 1978). The analysis includes both automated and user interactive testing procedures.

Finally, the ASM includes a procedure that determines the most representative S-D relationship from the set of models remaining after the residual analysis screening process. The objectives used in the analysis are the values of three calculated statistics, viz. the Standard Error of Estimate (SEE), the sample coefficient of determination ($r^2$), and the number of outliers. It is possible that different models will be deemed optimal on the basis of different statistics. For this reason, a multi-objective programming technique, compromise programming, is incorporated into the system to combine all three objectives in order to select the most representative S-D model.

In general, the IDSS assists the user in reaching the problem solution by:
(i) applying human knowledge, expertise, and experience;
(ii) providing a structured approach, maintaining consistency, and minimizing possible human error; and
(iii) efficiently managing and displaying all information relevant to the specific problem.

As with the whole IDSS, the ASM is developed within a complex computer environment. Expert System (ES) software provides processing control and knowledge representation in the ASM. Storage and retrieval of data are handled by a relational database management system accessed using Structured Query Language (SQL). XGRAPH is also used for frequent presentation of various graphs created by the system. To create a user-friendly environment, an on-line help file manager is run in parallel with the main system.

The main contribution of the research results presented in the paper is the development of an integrated methodology for modelling, testing, and selecting the most appropriate mathematical model of an S-D relationship in steady state conditions. As such, the methodology developed:
• provides a rigorous method of stage-discharge modelling;
• will enhance quality control of the discharge data produced by Environment Canada; and
• provides a powerful training tool that can be used to illustrate a broad range of conditions and the consequences of those conditions for the accuracy of discharge estimates.

The next section of the paper presents the theoretical background of the processes included in the ASM. The following section describes the architecture and functional characteristics of the ASM. The paper ends with the presentation of one case study in order to illustrate the use of the ASM. The S-D relationship for the Red Deer River near the Bindloss station in Alberta, Canada, is developed using the ASM.

THEORETICAL BACKGROUND

Linear regression

A simple linear regression is the modelling technique currently used in the ASM. However, the framework of the IDSS has been designed to accommodate other methods without making major alterations to the overall system (Douglas, 1992; Douglas et al., 1994; DeGagne et al., 1994). The main assumption used in the development of the IDSS is that the S-D relationship can be linearized for certain ranges of stage and/or discharge values. In the case when this is not true (e.g. where there is a shift of control as discharge increases, or where the stream cross-section changes radically) a set of linearized relationships in the form of multiple S-D curves or a segmented S-D curve is suggested (DeGagne et al., 1994).

The ASM modelling technique produces a linear function that best fits the relationship between the independent data ($x$) and the dependent data ($y$). The form of the simple linear regression model is as follows:

$$y_m = b_0 + b_1 x \qquad (1)$$

where $y_m$ is the model prediction of the dependent variable at $x$; $b_0$ is the regression constant; and $b_1$ is the regression coefficient. The method of least squares (Walpole & Myers, 1989) is the basis of the linear regression.

A linear regression analysis requires a linear relationship between the data being modelled. Typically, stream stage-discharge data trace a concave-down shape, i.e. a non-linear relationship. Thus, two types of data transformation are used in an attempt to create a data set that is linearly related: (a) a power transformation; and (b) the Box-Cox transformation. The power transformation is of the form:

$$S_t = S^{t_{pow}} \qquad (2)$$

where $S$ is stage; $S_t$ is transformed stage; and $t_{pow}$ is the transformation exponent. The regression analysis is carried out and the model format is as follows:

$$D_m = b_0 + b_1 S_t \qquad (3)$$

where $D_m$ is the model-predicted discharge value. Numerous stage-discharge models are created by stepping through a range of data transformations and performing a linear regression analysis on each data set. Past hydrological data have suggested that a range of exponents from 1.50 to 3.50 is sufficient. Chester (1986) suggests a particular transformation, an exponent of 2.5, based on an assumption about the channel geometry. Power transformations of the stage data are parametrically evaluated between 1.50 and 3.50 in increments of 0.05, resulting in forty-one rating curves. These models are then forwarded to the next level of the system, viz. a residual analysis of the linear regression models.

The Box-Cox transformation was included in the system with the intent of obtaining a more uniform variance of residuals (Neter et al., 1990), an assumption of linear regression. The general idea of the transformation is to restrict attention to transformations indexed by a parameter $\lambda$. Unlike the power transformation, where the stage (independent) data are transformed, here the discharge (dependent) data are subject to the transformation. The transformation is as follows:

$$D_t = \ln(D), \qquad \lambda = 0.00 \qquad (4)$$

$$D_t = \frac{D^{\lambda} - 1}{\lambda}, \qquad \lambda \neq 0.00 \qquad (5)$$

where $D$ is measured discharge; $D_t$ is transformed discharge; and $\lambda$ is the Box-Cox parameter. As with the power transformation, the $\lambda$ values are varied parametrically. A typical range of $\lambda$ values is from -2.00 to 2.00. However, recent implementation of the transformation for S-D data has suggested a smaller range of $\lambda$ values from 0.00 to 1.00. Thus, by using the same increment (0.05) as applied to the power transformation, twenty-one Box-Cox transformed linear regression models are created. The model for Box-Cox transformed data is of the form:

$$D_{m,t} = b_0 + b_1 S \qquad (6)$$

where $D_{m,t}$ is the model-predicted transformed discharge value. It is important to note that the discharge values predicted by models created from Box-Cox transformed data are in the transformed format. Therefore, the model predictions must be transformed back to obtain discharge in its regular format.
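The model-generation step described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the routine used in the ASM: NumPy's `polyfit` stands in for the least-squares fit, all function names are hypothetical, the stage and discharge values are synthetic, and stage is assumed to be positive so that the powers are defined.

```python
import numpy as np

def fit_line(x, y):
    """Least-squares fit of y = b0 + b1 * x; returns (b0, b1)."""
    b1, b0 = np.polyfit(x, y, 1)
    return b0, b1

def power_models(stage, discharge):
    """One regression per exponent from 1.50 to 3.50 in steps of 0.05 (equations 2 and 3)."""
    exponents = np.round(np.linspace(1.50, 3.50, 41), 2)
    return [(t, *fit_line(stage ** t, discharge)) for t in exponents]

def box_cox(d, lam):
    """Box-Cox transform of discharge (equations 4 and 5)."""
    return np.log(d) if lam == 0.0 else (d ** lam - 1.0) / lam

def inverse_box_cox(dt, lam):
    """Back-transform a transformed discharge to its regular format."""
    return np.exp(dt) if lam == 0.0 else (lam * dt + 1.0) ** (1.0 / lam)

def box_cox_models(stage, discharge):
    """One regression per lambda from 0.00 to 1.00 in steps of 0.05 (equation 6)."""
    lambdas = np.round(np.linspace(0.00, 1.00, 21), 2)
    return [(lam, *fit_line(stage, box_cox(discharge, lam))) for lam in lambdas]

# Synthetic record (an assumption for illustration): discharge = 12 * stage^2.5
stage = np.array([0.5, 0.8, 1.2, 1.7, 2.3, 3.0])
discharge = 12.0 * stage ** 2.5

p_models = power_models(stage, discharge)    # list of (t_pow, b0, b1)
bc_models = box_cox_models(stage, discharge) # list of (lambda, b0, b1)
```

Each tuple holds the transformation parameter and the fitted constants. For a Box-Cox model, a prediction `b0 + b1 * S` is still in transformed units and would be passed through `inverse_box_cox` with the same `lam` before being compared with measured discharge.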

Residual analysis

A linear regression model must comply with three requirements regarding its residuals. The residuals of a linear regression are simply the differences between the predicted and the measured values of the dependent variable, $D_m$ and $D$ respectively in the case of the rating models.

Uniform variance and independence of residuals are the first two requirements of a linear regression. The magnitude and sign of a residual should not be a function of the independent data (stage), nor should the residual be a function of other residuals. That is, an equal number of positive and negative residuals over the range of stage (for the Box-Cox transformation) or transformed stage (for the power transform case) is expected.

The third requirement of a linear regression model is that the residuals follow a normal probability distribution. Thus, the residuals about the transformed discharge measurements are expected to be normally distributed. The skewness and kurtosis coefficients, based on the third and fourth moments of the residuals, provide an indication of a model's deviation from a normal distribution. The skewness coefficient is calculated as:

$$g_1 = \frac{\frac{1}{N}\sum_{i=1}^{N}(e_i - \mu_e)^3}{\left[\frac{1}{N}\sum_{i=1}^{N}(e_i - \mu_e)^2\right]^{3/2}} \qquad (7)$$

where $e_i$ is the residual; and $\mu_e$ is the mean residual. The distribution of $g_1$ is $N(0, 6/N)$, the first term being the expected mean, the second the variance, and $N$ the number of data points. The kurtosis coefficient is calculated as:

$$g_2 = \frac{\frac{1}{N}\sum_{i=1}^{N}(e_i - \mu_e)^4}{\left[\frac{1}{N}\sum_{i=1}^{N}(e_i - \mu_e)^2\right]^{2}} - 3 \qquad (8)$$

where $g_2$ has the distribution $N(0, 24/N)$, with a mean of zero and a variance of $24/N$. For cases with a relatively small number of measurements, use of unbiased forms of equations (7) and (8) is advised.
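The normality screening implied by equations (7) and (8) can be sketched in Python as below. This is an illustrative sketch only: the function names are hypothetical, the moment (biased) forms of the coefficients are assumed with kurtosis taken as excess kurtosis so that its expected value is zero, and the acceptance bounds of three standard deviations about zero are those stated in the system description later in the paper.

```python
import numpy as np

def skewness_kurtosis(residuals):
    """Moment-based skewness g1 and excess kurtosis g2 of the residuals."""
    e = np.asarray(residuals, dtype=float)
    dev = e - e.mean()
    m2 = np.mean(dev ** 2)               # second central moment
    g1 = np.mean(dev ** 3) / m2 ** 1.5
    g2 = np.mean(dev ** 4) / m2 ** 2 - 3.0
    return g1, g2

def passes_normality_screen(residuals, n_sigma=3.0):
    """Accept a model if g1 and g2 lie within n_sigma standard deviations of zero."""
    n = len(residuals)
    g1, g2 = skewness_kurtosis(residuals)
    skew_ok = abs(g1) <= n_sigma * np.sqrt(6.0 / n)
    kurt_ok = abs(g2) <= n_sigma * np.sqrt(24.0 / n)
    return bool(skew_ok and kurt_ok)

# Symmetric residuals pass; one extreme residual among many zeros does not.
symmetric = [-2.0, -1.0, 0.0, 1.0, 2.0]
lopsided = [0.0] * 99 + [100.0]
```

With few measurements the bounds become wide, which is one reason the text advises the unbiased forms of the coefficients for small samples.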

Multi-objective comparison of mathematical models

To determine the most representative model, three statistics are calculated for each linear regression model and used in the multi-objective analysis. The three statistics used are the SEE, the $r^2$ value, and the number of outliers. SEE is a statistic based on the sum of the squares of the residuals and is given as:

$$SEE = \left[\frac{SSE}{n - 2}\right]^{1/2} \qquad (9)$$

where $n$ is the number of S-D measurements and:

$$SSE = \sum_{i=1}^{n} (D_i - D_{m,i})^2 \qquad (10)$$

A minimum SEE value is desirable, as it represents the magnitude of error inherent in the model.

The second statistic used is $r^2$, also termed the sample coefficient of determination. The $r^2$ value expresses the proportion of the total variation in the discharge values that can be accounted for or explained by a linear relationship with the stage values (Walpole & Myers, 1989). As an example, if $r^2 = 0.89$, then 89% of the total variation of the discharge values is described by a linear relationship with stage values. Therefore, the model yielding the $r^2$ value nearest unity would be considered optimal based on this statistic. The statistic is a function of both the regression sum of squares (SSR) and the sum of squares of the residuals (SSE). It is given as:

$$r^2 = \frac{SSR}{SSE + SSR} \qquad (11)$$

where:

$$SSR = \sum_{i=1}^{n} (D_{m,i} - \bar{D})^2 \qquad (12)$$

and $\bar{D}$ is the mean discharge or mean transformed discharge, depending on the method of data transformation used.

The final statistic calculated in the system is the number of outliers, which is defined as the number of model predictions that deviate from the measured discharge by a percentage greater than an allowable error. A prediction is considered an outlier if the following condition is true:

$$\frac{|D_i - D_{m,i}|}{D_i} \times 100 > e_a \qquad (13)$$


where $(D_i - D_{m,i})$ is the $i$th residual and $e_a$ is the allowable error. Currently, the value of the allowable error is subjective. The model resulting in the smallest number of outliers is considered optimal with respect to this statistic.

It is very unlikely (if not impossible) that a single model from the set of all models generated will simultaneously yield the minimum SEE, the maximum $r^2$, and the minimum number of outliers. Therefore, a multi-objective technique named Compromise Programming has been used to find the best compromise between the three criteria and so define the most appropriate model. Compromise Programming is a method appropriately used in a multiple linear objective context (Zeleny, 1974). However, a variation of the method has also been used in the analysis of discrete objective problems similar to that of ranking S-D models.

The method of Compromise Programming identifies solutions which are closest to the ideal solution. They are called compromise solutions and constitute the compromise set. For a more explicit understanding of what is meant by a compromise solution, an ideal solution must be defined and a particular distance measure specified. The ideal solution is defined as the vector $z^* = (z_1^*, z_2^*, \ldots, z_p^*)$ where each $z_i^*$ is the solution of the following problem:

$$\max z_i(x) \quad \text{subject to: } x \in X, \qquad i = 1, 2, \ldots, p$$

where $X$ is a feasible region; $x$ is a set of decision variables; $z_i$ is an objective function (criterion); and $p$ is the number of objective functions. If there were a feasible solution vector, $x^*$, common to all $p$ problems, then this solution would be the optimal one since the non-dominated set (in objective space) would consist of only one point, namely:

$$z^*(x^*) = (z_1^*(x^*), z_2^*(x^*), \ldots, z_p^*(x^*))$$

Obviously, this is most unlikely, and the ideal solution is generally not feasible. However, it can serve as a standard for evaluation of the attainable solutions. Since every decision maker would prefer the ideal point if it were attainable (as long as the individual underlying utility functions are increasing), it can be argued that finding solutions that are as close as possible to the ideal solution is a reasonable surrogate for maximization. The procedure for evaluation of the set of non-dominated points is to measure how close these points come to the ideal solution. One of the most frequently used measures of closeness, and the one used in this work, is a family of metrics, $L_s$, defined as:

$$L_s = \left[\sum_{i=1}^{p} \alpha_i^s \left(z_i^* - z_i(x)\right)^s\right]^{1/s} \qquad (14)$$

where $1 \le s \le \infty$ and $\{\alpha_1, \alpha_2, \ldots, \alpha_p\}$ are a given set of weights used to emphasize the importance of different objectives. Finally, a compromise solution with respect to $s$ is defined as $x^*$ such that:

$$L_s(x^*) = \min L_s(x) = \min \left[\sum_{i=1}^{p} \alpha_i^s \left(\frac{z_i^* - z_i(x)}{z_i^* - z_i^{**}}\right)^s\right]^{1/s} \qquad (15)$$

subject to: $x \in X$

Elements of the vector $x$ in this context are the different S-D models. The statistics introduced earlier and calculated for each S-D model are the values of the objective functions $z_i(x)$. The vector of worst values defines the minimum objective function values, $z^{**}$ (the largest SEE, the smallest $r^2$ value, the largest number of outliers). Following recommendations from the literature on Compromise Programming (Zeleny, 1974), the parameter $s$ is set to 2. By assigning weights to different objectives, the algorithm is able to rank the alternative solutions according to their distance from the ideal point.
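The three statistics and the compromise ranking can be sketched together in Python. This is a hedged illustration rather than the system's own code: the function names, the equal default weights and the example values are assumptions, SEE and the outlier count are negated so that every criterion is maximized as in the definition of the ideal solution, and the distance is scaled by the range between the best and worst values as in equation (15).

```python
import numpy as np

def model_statistics(measured, predicted, allowable_error_pct):
    """Return (SEE, r2, number of outliers) following equations (9) to (13)."""
    d = np.asarray(measured, dtype=float)
    dm = np.asarray(predicted, dtype=float)
    sse = np.sum((d - dm) ** 2)            # sum of squares of the residuals
    ssr = np.sum((dm - d.mean()) ** 2)     # regression sum of squares
    see = np.sqrt(sse / (d.size - 2))
    r2 = ssr / (sse + ssr)
    outliers = int(np.sum(np.abs(d - dm) / d * 100.0 > allowable_error_pct))
    return see, r2, outliers

def compromise_rank(stats, weights=(1.0, 1.0, 1.0), s=2):
    """Rank models by distance from the ideal point; stats rows are (SEE, r2, outliers)."""
    z = np.asarray(stats, dtype=float) * np.array([-1.0, 1.0, -1.0])  # maximize all
    best = z.max(axis=0)    # ideal values z*
    worst = z.min(axis=0)   # worst values z**
    span = np.where(best > worst, best - worst, 1.0)  # guard against ties on a criterion
    scaled = (best - z) / span
    w = np.asarray(weights, dtype=float)
    dist = np.sum((w * scaled) ** s, axis=1) ** (1.0 / s)
    return np.argsort(dist, kind="stable"), dist

# Hypothetical statistics for three candidate models
example_stats = [(1.0, 0.99, 0), (2.0, 0.95, 3), (1.5, 0.97, 1)]
order, dist = compromise_rank(example_stats)
```

In this example the first model is best on all three criteria, so it coincides with the ideal point and has zero distance. In practice the criteria usually disagree, and the weights then control which trade-off is preferred.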

DESCRIPTION OF COMPUTER BASED SYSTEM

The IDSS approach has been implemented to assist Environment Canada in achieving representative stage-discharge relationships. Data management systems constituted the early form of decision support. Recently, decision support has been further refined by integrating human and computer abilities to operate in problem solving situations of less than structured frameworks. Thierauf (1988) emphasizes that an operational decision support system must: (a) concentrate on problem finding and problem solving; (b) incorporate interactive processing; and (c) use a comprehensive systems approach to problem solving.

Simonovic (1992) explains three approaches to IDSS design. They are: (i) a functional approach which is decision process oriented; (ii) a tool-based approach which is tool oriented; and (iii) a combined approach which incorporates components of the previous two approaches. The ASM has been developed using the combined approach. Within the ASM, classical mathematical modelling has been integrated with database management, the graphical presentation of results, and human knowledge and experience through the use of expert system (ES) technology.


The addition of ES technology to decision support broadens the problem solving ability of the system. ES application to the field of operational hydrology has been studied by Simonovic (1990, 1991). The technology is applicable to the S-D analysis, with the ability to make numerous inferences from user actions or preceding system conclusions. It is used to govern the overall ASM system. It is organized in a number of knowledge bases, each of them constituting a distinct task in S-D analysis.

The required input data as well as the developed S-D models are stored and manipulated using the Relational Database Management System (RDBMS). The database is accessed using the standard Structured Query Language (SQL). Classical programming is extensively used in the S-D modelling and in the statistical and multi-objective analyses that are the focus of this paper. Various programs are executed by the ES, producing data files that are retrieved either by the ES or by other programs.

The S-D curve development requires the display of several different types of plots. To facilitate this task, a graphing utility, XGRAPH, is included in the system. Finally, the on-line help utility, XMAN, is run in parallel with the ASM to provide a user-friendly environment. Help files have been developed to provide the user with a description of the various activities of the ASM system as well as information to aid the user in making the required decisions during execution of the system.

Architecture of the ASM

Three basic processes involved in the ASM component of the IDSS, and the processes linking the ASM to the overall IDSS, are shown in Fig. 1. The figure also shows the tools used for each particular process. Following station and data selection, the ASM is initiated by the user selecting "Automated Statistical Modelling" as the modelling procedure. Figure 2 displays the modelling procedure selection stage of curve development. The three main components of the ASM are discussed in the following sections.

[Fig. 1 Architecture of the ASM. Processes shown: Modelling Procedure Selection; ASM (Multiple Linear Regression Analysis, Residual Analysis, Ranking of Rating Models); Model Management Procedure. Tool legend: N - Expert System; F - Classical Programming; G - Graphing Utility; H - Help Utility.]

Linear regression analysis

The ASM contains a FORTRAN program that provides the base information for the entire modelling process. Initially, the program reads the data file, created by the ES, of S-D measurements retrieved from the database. The program creates a large number of rating models by varying the transformation parameters and then performing a linear regression analysis on each data set, as described earlier. The same program includes a subroutine which calculates the three performance statistics discussed earlier, viz. SEE, $r^2$, and number of outliers. The ES displays a window holding the created data file, which contains all the linear regression models, their respective transformations and statistics. The multiple regression procedure is accompanied by a number of text windows (called by the ES) as well as various XMAN help files which provide more detailed information pertaining to the current step of the modelling procedure. Figure 3 displays a portion of the XMAN file on the data transformations.

Statistical testing

The second process of the ASM is statistical testing for the requirements of a linear regression. Three levels of testing make up this process, viz. skewness and kurtosis, residual plot analysis, and probability plot analysis. The ASM system contains the routine "SKEW-KURT" that calculates skewness and kurtosis coefficients for each linear regression model generated. The coefficients provide an indication of a model's deviation from a normal distribution. Three standard deviations have been chosen as acceptable bounds about zero for the skewness and kurtosis coefficients. Thus, the model in question is rejected if the skewness coefficient lies outside $\pm 3(6/N)^{1/2}$ or the kurtosis coefficient lies outside $\pm 3(24/N)^{1/2}$. All models passing the skewness and kurtosis tests are passed to the residual plot analysis for further testing, those yielding unacceptable values being discarded.

[Fig. 2 Modelling procedure selection screen (Red Deer River near Bindloss, Alberta).]

The ES supplies the user with a test status sheet providing the number of S-D models created, the number of models failing the first testing procedure, and the number of models considered normal, based on the skewness and kurtosis tests. The testing procedure is completely automated and is designed to select a maximum of ten models for the next level of testing. Thus, if more than ten models produce acceptable coefficients, the program will select the models yielding the ten best values of the skewness coefficient from that acceptable set. This procedure minimizes the amount of user involvement in the following statistical tests, where the user must analyse plots for each model passing the skewness and kurtosis tests.

Residual plot analysis

The second testing process consists of the user analysing the residual plots for each S-D model passing the skewness and kurtosis tests. Through an analysis of the residuals the user can detect violations of all the assumptions of a linear regression, viz. normality, independence, and uniform variance of residuals. The residual plot, the
