International Workshop ADVANCES IN STATISTICAL HYDROLOGY May 23-25, 2010 Taormina, Italy

DEVELOPMENT OF STAGE-DISCHARGE RATING CURVE IN RIVER USING GENETIC ALGORITHMS AND MODEL TREE by

Bhola N.S. Ghimire(1) and M. Janga Reddy(2)

(1) Research Scholar ([email protected])

(2) Assistant Professor ([email protected])

Department of Civil Engineering, Indian Institute of Technology, Bombay, India

ABSTRACT

Discharge measurement in rivers is a challenging job for hydraulic engineers. A graph of stage versus discharge, or the line fitted through the data points, represents the stage-discharge relationship, also known as the rating curve. The stage-discharge relationship is an approximate method for estimating discharge in rivers and streams. Accurate information about river flow is very important for hydrological applications such as water and sediment budget analysis and the operation and control of water resources projects. Stage is easier to measure than discharge. Even under meticulous observation, the stage-discharge relationship at a particular river cross-section is not necessarily unique, because rivers are often influenced by other factors that are neither always understood nor easy to quantify. In reality, discharge is not a function of stage alone; it also depends on the longitudinal slope of the river, the channel geometry, the bed roughness, etc. However, measuring these parameters at every time step and section is not possible. Hence there is a need to establish an accurate relationship between stage and discharge. Conventional parametric regression methods usually fail to model these relationships. This paper presents the use of genetic algorithms (GA), a search procedure based on the mechanics of natural selection and natural genetics, and the model tree (M5), a data-driven technique for continuous class problems that provides a structural representation of the data and a piecewise linear fit of the classes, to establish the stage-discharge relationship in river hydrology. The results are compared with those of other methods: gene-expression programming (GEP), multiple linear regression (MLR) and the classical stage-discharge rating curve (RC). Model performance is measured with the coefficient of determination and the root mean square error. The results obtained from the GA-based and MT-based models are found to be much better than those of the other methods.

Keywords: Genetic algorithms, Model tree, Gene-expression programming, Multiple linear regression, Rating curve.

1 INTRODUCTION

Hydraulic engineers need discharge measurements in rivers for various purposes, and obtaining them is one of their more challenging jobs. Discharge depends on the nature of rainfall in the catchment area, which is purely stochastic. Because discharge is stochastic, stage varies accordingly. A graph of stage versus discharge, with the line through the data points, represents the stage-discharge relationship, commonly called the rating curve. The rating curve is a fundamental technique in discharge calculation. Accurate information about discharge and stage is very important for hydrological applications such as water resources planning, reservoir operation, sediment handling and hydrologic modelling. Stage can be measured at any time, but measuring discharge needs considerable preparation, which may not be practical. Hence, to predict discharge from measured stage, a specified relation between the two is needed. The stage-discharge relationship at a particular river cross-section, even under conditions of meticulous observation, is not necessarily unique, as rivers are often influenced by factors neither always understood nor easy to quantify (Sefe, 1996). This is because, in reality, discharge is not a function of stage alone. Discharge also depends on the longitudinal slope of the river, the geometry of the channel, the bed roughness, etc. However, measuring these parameters at every time step and section is not feasible, so in practice discharge is usually forced to depend on stage alone. Hence there is a clear need to establish an accurate relationship between discharge and stage. The conventional parametric regression methods usually fail to model these relationships (Habib and Meselhe, 2006).

Habib and Meselhe (2006) distinguished two approaches to stage-discharge modelling: numerical solutions and data-driven techniques. The first approach requires data from sites with accurate boundary conditions. Following the second approach, they developed stage-discharge relationships for coastal low-gradient streams using neural networks and nonparametric regression. Tawfik et al. (1997) introduced an approach based on a multilayer artificial neural network (ANN) for modelling the stage-discharge relationship. The same approach was followed by Jain and Chalisgaonker (2000), Sudheer and Jain (2003) and Bhattacharya and Solomatine (2005). Bhattacharya and Solomatine (2005) used the model tree M5 in addition to ANN to relate stage and discharge in rivers. Peterson-Overleir (2006) introduced a methodology based on the Jones formula and nonlinear regression for situations where the stage-discharge relationship is affected by hysteresis due to unsteady flow. Tyafur and Singh (2006) used ANN and fuzzy logic to model rainfall-runoff laboratory data. Baiamonte and Ferro (2007) obtained relationships for estimating the two coefficients of the stage-discharge equation from experimental runs in flumes characterised by different values of the contraction ratio (ranging from 0.17 to 0.81) and of the flume slope (ranging from 0.5 to 3.5%). Using a compound neural network, Jain (2008) developed an integrated stage-discharge-suspended sediment relationship. Soft-computing techniques such as ANN are widely used in water resources engineering, whereas genetic programming (GP) and GA have been used by only a few researchers. Savic et al. (1999) and Babovic and Keijzer (2002) developed GP models of the rainfall-runoff relationship at different locations.
Dorado et al. (2003) applied GP and ANN to predict runoff from rainfall in urban areas. Giustolisi (2004) used GP to determine the Chezy resistance coefficient for full circular corrugated channels. Cheng et al. (2005) used GA to calibrate a rainfall-runoff model developed from fuzzy methods. Rabunal et al. (2007) used GP and ANN to derive the unit hydrograph for a typical urban basin. Kumar and Reddy (2007) used GA to optimize multipurpose reservoir operation. Sivapragasan et al. (2008) modelled the storage-discharge relationship adopted in the non-linear Muskingum model using GP, an evolutionary algorithm-based modelling approach; comparing the results with those of the particle swarm optimization technique, they found the same optimum values from both techniques. Recently, Aytek and Kisi (2008) used GEP for suspended sediment modelling and Guven and Aytek (2009) used GEP for stage-discharge modelling in American rivers. Another data-driven tool, the model tree (MT), has been used by a few researchers in hydrology. MT has been reported to give better accuracy than ANN in water management problems, rainfall-runoff modelling, canal sedimentation, etc. (Solomatine, 2002; Solomatine and Dulal, 2003; Bhattacharya et al., 2005). Reddy and Ghimire (2009) successfully applied the model tree to suspended sediment load (SSL) estimation in American rivers.

The objective of this article is to support the use of the soft-computing techniques GA and MT in water resources engineering, in particular for modelling the relation between stage and discharge. The model results are compared with those of conventional methods, namely the stage-discharge rating curve (RC) and multiple linear regression (MLR), and with the predictions of a GEP model.

2 MODELLING TECHNIQUES

2.1 Genetic Algorithms (GAs)

Genetic algorithms (GAs) are a class of evolutionary algorithms that use techniques inspired by evolutionary biology to solve a problem. In other words, GAs are population-based search techniques that work on Darwin's principle of "survival of the fittest" (Goldberg, 1989). The idea in all evolutionary algorithms is to evolve a population of candidate solutions to a given problem, using operators inspired by natural genetic variation and natural selection, such as inheritance, mutation, selection and crossover. GAs were invented by John Holland in the 1960s and were developed by Holland and his students and colleagues at the University of Michigan (Goldberg, 1989). In Holland's formulation, a GA is a method for moving from one population of "chromosomes" (e.g., strings of ones and zeros, or "bits") to a new population by using a kind of "natural selection" together with the genetics-inspired operators of crossover, mutation and inversion. Each chromosome consists of "genes" (e.g., bits), each gene being an instance of a particular "allele" (e.g., 0 or 1). The selection operator chooses those chromosomes in the population that will be allowed to reproduce, and on average the fitter chromosomes produce more offspring than the less fit ones. Crossover exchanges subparts of two chromosomes, roughly mimicking biological recombination between two single-chromosome organisms; mutation randomly changes the allele values at some locations in the chromosome; and inversion reverses the order of a contiguous section of the chromosome, thus rearranging the order in which genes are arrayed. Further details about GA can be found in Goldberg (1989).

2.1.1 Elements of GA. In GA, the search starts with an initial set of random solutions known as the population. Each chromosome of the population is evaluated using a fitness function, which measures the success of the chromosome. Based on the fitness values, a set of chromosomes is selected for breeding. To produce a new generation, genetic operators such as crossover and mutation are applied. Parents and offspring are then selected according to their fitness values, and some are rejected so as to keep the population size constant for the new generation. The cycle of evaluation–selection–reproduction continues until an optimal or near-optimal solution is found. The fundamental steps of the algorithm are shown in Figure 1.

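[Figure 1 flowchart: Initial population generation → Evaluate fitness of all individuals in population → Termination criteria met? Yes: stop the search. No: select individuals for next generation → Crossover and mutation → Next generation (return to fitness evaluation).]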

Figure 1 – Schematic diagram of genetic algorithms (Tung et al., 2006)
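The evaluation–selection–reproduction cycle can be sketched in a few lines of Python. This is a minimal illustration on a toy problem (maximising the number of ones in a bit string), not the program used in the study; the function name `evolve`, the toy fitness and every parameter value are assumptions chosen for the example.

```python
import random

def evolve(fitness, n_bits, target, pop_size=20, max_generations=50,
           p_cross=0.8, p_mut=0.02, seed=0):
    """Toy generational GA over bit strings; returns the best chromosome seen."""
    rng = random.Random(seed)
    # Initial population: random bit strings.
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(max_generations):
        # Termination criterion: stop once the target fitness is reached.
        if fitness(best) >= target:
            break
        children = []
        while len(children) < pop_size:
            # Selection: the fitter of two randomly drawn chromosomes breeds.
            p1 = max(rng.sample(pop, 2), key=fitness)
            p2 = max(rng.sample(pop, 2), key=fitness)
            c1, c2 = p1[:], p2[:]
            # Crossover: swap tails after a random locus.
            if rng.random() < p_cross:
                cut = rng.randint(1, n_bits - 1)
                c1, c2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
            # Mutation: flip each bit with a small probability.
            for child in (c1, c2):
                for i in range(n_bits):
                    if rng.random() < p_mut:
                        child[i] = 1 - child[i]
                children.append(child)
        pop = children[:pop_size]            # population size stays constant
        best = max(pop + [best], key=fitness)
    return best

# Toy usage: fitness is the number of ones, so the optimum is all ones.
best = evolve(fitness=sum, n_bits=16, target=16)
```

Keeping the best chromosome seen so far outside the population is a simplification made here so that the returned solution never gets worse between generations.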

Selection. Selection applies pressure on the population in a manner similar to natural selection in biological systems. Before entering the next generation's population, selected chromosomes may undergo crossover or mutation (depending on the crossover and mutation probabilities), in which case the offspring chromosomes are the ones that enter the next generation's population. Poorer-performing individuals (as evaluated by the fitness function) are weeded out, and better-performing, or fitter, individuals have a greater than average chance of passing the information they contain to the next generation. Of the several selection methods available, tournament selection is applied in this study. The tournament selection operator uses roulette selection N times to produce a tournament subset of chromosomes; the best chromosome in this subset is then chosen as the selected chromosome.

Crossover. Crossover allows solutions to exchange information in a way similar to that used by natural organisms undergoing reproduction. In other words, crossover is a genetic operator that combines (mates) two chromosomes (parents) to produce new chromosomes (offspring). The operator randomly chooses a locus and exchanges the subsequences before and after that locus between two chromosomes to create two offspring. The idea behind crossover is that a new chromosome may be better than both parents if it takes the best characteristics from each. Crossover occurs during evolution according to a user-definable crossover probability. For example, if two parents (chromosomes) A and B, each with four genes, form two children (offspring) by exchanging genes after the second gene (Figure 2), the operation is called single-point crossover; if the exchange takes place between two points, it is called two-point crossover. In this study, two-point crossover is used.

[Figure 2: diagram with the crossover point marked]

Figure 2 – Single point cross-over operator
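The selection and crossover operators described above can be illustrated as follows. This is a hedged sketch: the function names and the gene labels `a1`...`b4` are invented for the example, fitness is assumed to be non-negative with larger values meaning fitter, and the cut points are fixed by hand so the result is easy to follow.

```python
import random

def single_point(a, b, cut):
    """Swap the tails of two parents after position `cut`."""
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

def two_point(a, b, lo, hi):
    """Swap the segment [lo, hi) between two parents."""
    return a[:lo] + b[lo:hi] + a[hi:], b[:lo] + a[lo:hi] + b[hi:]

def tournament(pop, fitness, size, rng):
    """Draw `size` chromosomes by roulette wheel and return the fittest one.

    The roulette draw is with replacement and needs at least one
    chromosome with positive fitness.
    """
    weights = [fitness(c) for c in pop]
    subset = rng.choices(pop, weights=weights, k=size)
    return max(subset, key=fitness)

# Two parents with four genes each, as in the example in the text.
A = ["a1", "a2", "a3", "a4"]
B = ["b1", "b2", "b3", "b4"]

# Single-point crossover after the second gene.
child1, child2 = single_point(A, B, cut=2)
# child1 is a1 a2 b3 b4 and child2 is b1 b2 a3 a4.

# Two-point crossover exchanging genes two and three.
mid1, mid2 = two_point(A, B, lo=1, hi=3)
# mid1 is a1 b2 b3 a4 and mid2 is b1 a2 a3 b4.

# Tournament on a toy population of bit strings (fitness = number of ones).
pop = [[0, 0, 1], [1, 1, 1], [1, 0, 0]]
winner = tournament(pop, fitness=sum, size=2, rng=random.Random(1))
```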

Mutation. Mutation randomly changes (flips) the value of single bits within individual strings to maintain the diversity of the population and to help the genetic algorithm escape from a local optimum. It is typically used sparingly. For example, in Figure 3 the parent becomes a new child through mutation of gene number two.

Figure 3 – Mutation operator

2.1.2 Fitness function used in GA. Many fitness functions can be used in GA to estimate parameters. For this study, the root mean square error was minimised. The fitness function is given in Equation (1), where Qoi and Qpi are the values observed in the field and the values predicted by the GA model respectively, n is the total number of observations and F is the error function.

$$\min F = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(Q_{oi} - Q_{pi}\right)^{2}} \qquad (1)$$
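As a small worked example of Equation (1), the function below computes the root mean square error between observed and predicted discharges. The sample values are hypothetical and serve only to show the arithmetic.

```python
import math

def rmse(observed, predicted):
    """Root mean square error between two equally long sequences (Equation 1)."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical discharges in m3/s.
q_obs = [10.0, 20.0, 30.0]
q_pred = [12.0, 18.0, 33.0]
error = rmse(q_obs, q_pred)   # sqrt((4 + 4 + 9) / 3)
```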

2.2 Model Tree (MT)

A model tree is a data-driven technique for continuous class problems that provides a structural representation of the data and a piecewise linear fit of the classes. It is a kind of decision tree that predicts numeric values using linear regression functions at the leaves. The model tree classifies the data according to their similarity and then fits local regression equations, which helps to minimize the model error. Quinlan (1992) and Wang and Witten (1997) explain the technique. The data in this study were processed following the fundamental steps of the Model Tree M5 flow chart (Reddy and Ghimire, 2009). First, the parameter space is split into sub-spaces. Then a linear regression model is built for each sub-space. Information theory is used in splitting the data, which helps to fit an appropriate model. During model formulation, each split follows the decision-tree idea of integrating several models. Finally, computational intelligence techniques are used to obtain possible solutions for each model. The major advantages of model trees over regression trees are: (a) model trees are much smaller than regression trees, (b) the decision strength is clear and (c) the regression functions normally do not involve many variables. Computational requirements for model trees grow rapidly with dimensionality. Tasks involving hundreds of attributes can be handled, which helps to give a better formulation. Tree-based models are built by a divide-and-conquer method. The standard deviation reduction (SDR), given by Equation (2), is the main criterion for selecting the split at each node.

$$SDR = sd(T) - \sum_{i}\frac{|T_i|}{|T|}\, sd(T_i) \qquad (2)$$

where T represents the set of examples that reach the node; Ti represents the subset of examples that have the ith outcome of the potential test (i.e. the sets that result from splitting the node according to the chosen attribute); and sd(.) represents the standard deviation.
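The SDR criterion of Equation (2) can be sketched as below, together with a simple search for the threshold on one attribute that maximises it. The use of the population standard deviation, the helper name `best_split` and the sample data are assumptions made for illustration; an M5 implementation may differ in these details.

```python
import statistics

def sdr(parent, subsets):
    """Standard deviation reduction of a split (Equation 2)."""
    total = len(parent)
    weighted = sum(len(s) / total * statistics.pstdev(s) for s in subsets)
    return statistics.pstdev(parent) - weighted

def best_split(x, y):
    """Return (threshold, sdr) for the split on attribute x with the largest SDR."""
    pairs = sorted(zip(x, y))
    ys = [v for _, v in pairs]
    best = (None, float("-inf"))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between identical attribute values
        gain = sdr(ys, [ys[:i], ys[i:]])
        if gain > best[1]:
            best = ((pairs[i - 1][0] + pairs[i][0]) / 2, gain)
    return best

# Hypothetical stages (m) and discharges (m3/s) with an obvious break.
stage = [1.4, 1.5, 1.6, 2.4, 2.5, 2.6]
flow = [3.0, 4.0, 5.0, 60.0, 70.0, 80.0]
threshold, gain = best_split(stage, flow)
```

With these made-up data the chosen threshold falls midway between the two clusters of stages, which is where splitting reduces the spread of discharge the most.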

Pruning and smoothing. If the generated tree has too many leaves, it may fit the existing data 'too accurately' and overfit, which leads to poor generalization. The tree can be made more robust by simplifying it: merging lower sub-trees into one node is called pruning. Smoothing is the process used to compensate for the sharp discontinuities that occur between adjacent linear models at the leaves of the pruned tree. These discontinuities are a particular difficulty for models constructed from a small number of training samples.

Advantages of model trees. Model trees are in effect a set of local linear models. They may serve as an alternative to ANNs and are often almost as accurate (Solomatine, 2002). They have the following advantages: (a) MT trains much faster than ANN; (b) the results given by a model tree are transparent and easily understood by decision makers; and (c) using pruning, it is easy to generate a range of MTs, from a simple linear regression to a much more accurate but complex combination of local models (many branches and leaves).
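To make the idea of "a set of local linear models" concrete, the sketch below evaluates a tiny hand-written model tree for discharge as a function of stage. Every threshold and coefficient is made up for illustration and is not taken from the models developed in this study.

```python
# Each leaf: (upper stage bound in m, slope, intercept). All values hypothetical.
LEAVES = [
    (1.6, 20.0, -25.0),             # stage <= 1.6:        Q = 20 H - 25
    (2.0, 80.0, -121.0),            # 1.6 < stage <= 2.0:  Q = 80 H - 121
    (float("inf"), 300.0, -561.0),  # stage > 2.0:         Q = 300 H - 561
]

def predict(stage):
    """Pick the leaf whose range contains `stage` and apply its linear model."""
    for upper, slope, intercept in LEAVES:
        if stage <= upper:
            return slope * stage + intercept

q_low = predict(1.5)    # uses the first leaf
q_high = predict(2.5)   # uses the last leaf
```

The rules read directly as if/else conditions on the stage, which is the transparency referred to in advantage (b).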

2.3 Multiple linear regression (MLR)

Many engineering and scientific problems are concerned with determining a relationship between a set of variables. Usually, a single response variable Y (the dependent variable) is expressed as a function of a set of independent variables x1, x2, x3, ..., xn. It can be written as:

$$Y = a_1 x_1 + a_2 x_2 + a_3 x_3 + \dots + a_n x_n + a_0 \qquad (3)$$

where the coefficient ai is the regression coefficient of the ith independent variable (xi), computed by the least squares method. When n = 1, Equation (3) becomes a simple linear regression equation. When n = 2, the function corresponds to a plane in three dimensions, and for n greater than 2 it is a hyperplane in (n+1)-dimensional space. If Yi is the observed dependent variable and Ypi is the value predicted by Equation (3), then the sum of squared errors is given by Equation (4).

$$\sum_{i=1}^{N} e_{yi}^{2} = \sum_{i=1}^{N}\left(Y_i - Y_{pi}\right)^{2} \qquad (4)$$
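For the single-variable case (n = 1), the least squares coefficients have a closed form, which the sketch below uses before evaluating the sum of squared errors of Equation (4). The data points and function names are hypothetical.

```python
def fit_line(x, y):
    """Least squares estimates (a0, a1) for Y = a1 * x + a0."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    a1 = sxy / sxx
    a0 = my - a1 * mx
    return a0, a1

def sum_sq_error(x, y, a0, a1):
    """Sum of squared differences between observed and predicted Y (Equation 4)."""
    return sum((yi - (a1 * xi + a0)) ** 2 for xi, yi in zip(x, y))

# Hypothetical observations.
x = [1.0, 2.0, 3.0, 4.0]
y = [3.1, 4.9, 7.2, 8.8]
a0, a1 = fit_line(x, y)
sse = sum_sq_error(x, y, a0, a1)
```

Any other choice of a0 and a1 gives a larger value of the sum in Equation (4), which is what "least squares" means here.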

2.4 Stage-Discharge Rating Curve (RC)

A stage-discharge rating curve (simply: rating curve, RC) describes the relationship between the water level (stage) at a channel cross-section and the rate of discharge at that section. Ideally, a rating curve describes a unique functional relationship between stage and discharge; it is therefore obtained as a smooth and continuous curve with a reasonable degree of sensitivity. Unfortunately, there cannot be a unique stage-discharge relationship unless the flow is uniform, and because rainfall is stochastic, river flow is not uniform either. Hence the ideal relation between stage and discharge does not hold exactly and serves only as an approximation (Henderson, 1966). When a sufficient number of measured discharges are plotted against the corresponding stages, the resulting relationship represents the integrated effect of a wide range of channel and flow parameters. The combined effect of these parameters, called the control, is usually categorized as permanent or shifting. In a shifting control, the parameters are not fixed and change with time. In a permanent control, the parameters are constant (Subramanya, 2006). A majority of streams and rivers, especially non-alluvial rivers, exhibit permanent control. For the permanent control case, the relationship between stage and discharge is single-valued and is expressed as in Equation (5), which is the equation of a parabola, where Q = discharge in m3/s, G = gauge height (stage) in m, a = a constant which represents the gauge reading corresponding to zero discharge, and β and C are rating curve constants.

$$Q = C\,(G - a)^{\beta} \qquad (5)$$

Traditionally, the best values of a, β and C in Equation (5) for a given range of stage are obtained by the least square error method. Taking logarithms of Equation (5) gives Equation (6).

$$\log Q = \beta \log (G - a) + \log C \qquad (6)$$

or

$$Y = \beta X + c' \qquad (7)$$

Equation (6) has the form of the straight line in Equation (7), where the dependent variable Y = log Q, the independent variable X = log (G − a) and c' = log C. To obtain the best-fit straight line through n observations of X and Y, a regression of the dependent variable on the independent variable is normally carried out. Depending on the nature of the data, two or more straight lines may often be required to fit them. A preliminary analysis of the data can locate the approximate position of the break points for each range of data. The actual break points may then be determined by solving the two equations for Q and G, or graphically. Sometimes the curve changes from a parabolic to a complex curve and vice versa, and sometimes the constants and exponents vary through the range (Guven and Aytek, 2009). It is therefore not easy to find the values of the parameters (a, β and C) in each case, and sometimes it may be impossible to get the true values. In view of this difficulty, this study focuses on optimizing the parameters (a, β and C) of Equation (5) using GA and on developing piecewise linear equations using MT. The methodology gave sufficiently good results in the case studies, and it is believed that it will help solve many practical problems related to stage-discharge relations.
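The traditional log-linear fit of Equations (6) and (7) can be sketched as follows for a known value of a. The data are synthetic, generated without noise from assumed parameter values, so that the fit can be checked against them; the function name is hypothetical.

```python
import math

def fit_rating_curve(stage, discharge, a):
    """Estimate C and beta in Q = C (G - a)^beta by a straight-line fit in log space."""
    X = [math.log10(g - a) for g in stage]      # X = log (G - a)
    Y = [math.log10(q) for q in discharge]      # Y = log Q
    n = len(X)
    mx, my = sum(X) / n, sum(Y) / n
    beta = (sum((x - mx) * (y - my) for x, y in zip(X, Y))
            / sum((x - mx) ** 2 for x in X))    # slope of Equation (7)
    c_prime = my - beta * mx                    # intercept c' = log C
    return 10 ** c_prime, beta

# Synthetic check with assumed parameters (not values from the case studies).
a_true, beta_true, c_true = 1.2, 1.9, 90.0
stage = [1.4, 1.6, 1.9, 2.3, 2.8, 3.4]
discharge = [c_true * (g - a_true) ** beta_true for g in stage]
C, beta = fit_rating_curve(stage, discharge, a=a_true)
```

The sketch requires every stage to exceed a, since the logarithm of (G − a) is otherwise undefined; this is one reason the choice of a matters in practice.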

3 CASE STUDIES

3.1 Stage-Discharge Data

To demonstrate the application of GA and MT, daily time series of stage and discharge were taken from two stations on the Schuylkill River, USA: Berne (Station no. 01470500, Lat. 40º31'21'', Long. 75º59'55'') and Philadelphia (Station no. 01474500, Lat. 39º58'04'', Long. 75º11'20''). The catchment area of Berne station is about 919.45 km2 and that of Philadelphia station is 4902.85 km2. This information was obtained from the USGS website. Data for the period October 01, 2000 to September 30, 2006 were taken for both stations. The first five years of data were used for training and the last year (October 01, 2005 to September 30, 2006) for testing. Some statistical parameters of the training and testing sets for these sites are shown in Table I. The parameters µ, σ, σ/µ, Csx, Xmax and Xmin are the mean, standard deviation, coefficient of variation, skewness, maximum and minimum values respectively. The discharge ranges from 2.125 to 972.014 m3/s at Berne station and from 2.239 to 1484.943 m3/s at Philadelphia station. The corresponding stage ranges are 1.384 to 5.088 m and 1.686 to 3.463 m respectively. The developed models are valid for these ranges.

Table I – Daily statistical parameters of the training and testing data sets for the two stations on the Schuylkill River

Data set   Station (no.)             Basin area (km2)   Data type   µ        σ        σ/µ    Csx     Xmax       Xmin
Training   Berne (01470500)          919.45             Stage*      1.65     0.22     0.13   2.25    3.418      1.384
Training   Berne (01470500)          919.45             Flow*       21.95    26.72    1.22   5.23    399.574    2.125
Training   Philadelphia (01474500)   4902.85            Stage       1.96     0.18     0.09   2.04    3.338      1.686
Training   Philadelphia (01474500)   4902.85            Flow        97.15    111.18   1.14   3.88    1312.07    2.239
Testing    Berne (01470500)          919.45             Stage       1.66     0.32     0.19   5.09    5.088      1.396
Testing    Berne (01470500)          919.45             Flow        24.32    61.88    2.54   11.46   972.014    2.522
Testing    Philadelphia (01474500)   4902.85            Stage       1.98     0.20     0.10   3.39    3.463      1.774
Testing    Philadelphia (01474500)   4902.85            Flow        109.42   147.86   1.35   5.63    1484.943   17.258

*The units of stage and flow are (m) and (m3/s) respectively.


3.2 Development of Models based on conventional methods

The stage-discharge rating curve (RC) and multiple linear regression (MLR) are taken as the conventional methods. The RC is developed in two forms: a simple power equation (RC-1, which does not consider the stage corresponding to zero discharge) and a slightly more complex form (RC-2, which does). The models developed by these methods are shown in Equations (8) to (10) for Berne station and Equations (11) to (13) for Philadelphia station. In developing RC-2, the stage corresponding to zero discharge was fixed with the help of scatter plots of the training period. The reference stages used for this purpose were 1.418, 1.628 and 2.064 for Berne station and 1.765, 1.945 and 2.396 for Philadelphia station. The stages adopted as corresponding to zero discharge are 1.223 m for Berne and 1.645 m for Philadelphia. In developing the MLR models, a single independent variable was used so that performance could be compared with the other models; they therefore reduce to simple linear models, as shown in Equations (10) and (13).

$$Q = 0.441\,H^{7.036} \qquad (8)$$

$$Q = 93.951\,(H - 1.223)^{1.9885} \qquad (9)$$

$$Q = 116.069\,H - 170.356 \qquad (10)$$

$$Q = 0.055\,H^{10.512} \qquad (11)$$

$$Q = 670.039\,(H - 1.645)^{1.841} \qquad (12)$$

$$Q = 602.645\,H - 1084.43 \qquad (13)$$

In Equations (8) to (13), Q is the discharge in m3/s and H is the stage height in m above the reference datum.
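Two small checks on the numbers above can be written in Python. The first uses a common textbook formula for the zero-discharge stage from three reference stages G1, G2, G3 whose discharges are in geometric progression, a = (G1 G3 − G2²)/(G1 + G3 − 2 G2). The text does not state that this formula was used, so that is an assumption; applied to the quoted reference stages it gives values that round to the adopted 1.223 m and 1.645 m. The second part evaluates Equations (8) to (10) for Berne at a single stage, taken here as the mean training stage from Table I.

```python
def zero_flow_stage(g1, g2, g3):
    """Zero-discharge stage from three stages whose discharges satisfy Q2^2 = Q1 * Q3.

    Textbook three-point formula; its use in the study is an assumption.
    """
    return (g1 * g3 - g2 ** 2) / (g1 + g3 - 2 * g2)

a_berne = zero_flow_stage(1.418, 1.628, 2.064)
a_philadelphia = zero_flow_stage(1.765, 1.945, 2.396)

def rc1_berne(h):
    """Equation (8): simple power form."""
    return 0.441 * h ** 7.036

def rc2_berne(h):
    """Equation (9): valid only for stages above 1.223 m."""
    return 93.951 * (h - 1.223) ** 1.9885

def mlr_berne(h):
    """Equation (10): linear form."""
    return 116.069 * h - 170.356

h = 1.65  # mean training stage at Berne (m), from Table I
estimates = {"RC-1": rc1_berne(h), "RC-2": rc2_berne(h), "MLR": mlr_berne(h)}
```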

3.3 Development of Models based on Genetic Algorithms (GAs)

The parameters (a, β and C) of Equation (5) are optimized with GA. First, the parameters are estimated on the training set selected from the whole data. The resulting relation is then used to predict the discharge values in the testing set. The predicted values are compared with the measured values using statistical performance measures such as the coefficient of determination and the root mean square error.
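The coefficient of determination is not defined explicitly in the text. One widely used definition is one minus the ratio of the residual sum of squares to the total sum of squares, and the sketch below assumes that form; other definitions (for example the squared correlation coefficient) give different values when the model is biased.

```python
def r_squared(observed, predicted):
    """Coefficient of determination, assumed here as 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# Hypothetical values: a perfect prediction scores 1.
perfect = r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
partial = r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])   # SS_res = 1, SS_tot = 2
```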

[Figure 4 plot: Fitness (vertical axis, 0 to 600) versus Generations (horizontal axis, 0 to 20); an inset repeats generations 10 to 20 on a fitness scale of 0 to 40.]

Figure 4 – Fitness convergence for Philadelphia station

A function program was written in the Matlab environment to carry out the optimization. The population size was fixed at 200 with a uniform creation function. Tournament selection of size 4 with rank scaling was chosen, the adaptive feasible mutation function was used, and two-point crossover and forward migration were set in the program. The program was run five times, and the parameters giving the best fitness value were recorded for both Berne and Philadelphia stations. A sample fitness history for the training set of Philadelphia station is shown in Figure 4; similar behaviour was observed for Berne station. Figure 4 shows that the function value reached its minimum of 19.63 m3/s at the 14th generation for Philadelphia station. For Berne station, the minimum of 3.67 m3/s was likewise found at the 14th generation. The parameter values at these fitness values are used in the final stage-discharge relations. The values of the parameters (a, β and C) are 1.262, 1.765 and 94.848 for Berne station and 1.695, 1.526 and 630 for Philadelphia station. The explicit formulations of the GA models for Berne and Philadelphia are given in Equations (14) and (15) respectively.

$$Q = 94.848\,(H - 1.262)^{1.765} \qquad (14)$$

$$Q = 630\,(H - 1.695)^{1.526} \qquad (15)$$
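The kind of search described above can be sketched with a small real-coded GA in plain Python. This is not the Matlab program used in the study: the population size, mutation scheme (Gaussian, clipped to bounds), elitism, search bounds and synthetic data are all assumptions, and the tournament here draws its candidates uniformly at random. The synthetic discharges are generated from Equation (14) so that the target parameters are known.

```python
import math
import random

def rmse(params, stage, discharge):
    """Equation (1) applied to Q = C (H - a)^beta with params = [a, beta, C]."""
    a, beta, c = params
    sq = [(q - c * (h - a) ** beta) ** 2 for h, q in zip(stage, discharge)]
    return math.sqrt(sum(sq) / len(sq))

def fit_ga(stage, discharge, bounds, pop_size=60, generations=80, seed=0):
    """Minimise RMSE over (a, beta, C); returns best parameters and fitness history."""
    rng = random.Random(seed)

    def cost(p):
        return rmse(p, stage, discharge)

    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    history = [min(cost(p) for p in pop)]
    for _ in range(generations):
        children = [min(pop, key=cost)[:]]            # elitism: keep the best
        while len(children) < pop_size:
            p1 = min(rng.sample(pop, 4), key=cost)    # tournament of size 4
            p2 = min(rng.sample(pop, 4), key=cost)
            # Two-point crossover: take the genes between two cuts from p2.
            lo_cut, hi_cut = sorted(rng.sample(range(len(bounds) + 1), 2))
            child = p1[:lo_cut] + p2[lo_cut:hi_cut] + p1[hi_cut:]
            # Gaussian mutation, clipped so parameters stay within bounds.
            for i, (lo, hi) in enumerate(bounds):
                if rng.random() < 0.2:
                    step = rng.gauss(0.0, 0.05 * (hi - lo))
                    child[i] = min(hi, max(lo, child[i] + step))
            children.append(child)
        pop = children
        history.append(min(cost(p) for p in pop))
    return min(pop, key=cost), history

# Synthetic stages (m) and discharges from Equation (14).
stage = [1.4, 1.5, 1.7, 2.0, 2.4, 2.9, 3.4]
discharge = [94.848 * (h - 1.262) ** 1.765 for h in stage]
# The upper bound on a is kept below the lowest stage so (H - a) stays positive.
bounds = [(0.5, 1.35), (1.0, 3.0), (10.0, 200.0)]
best, history = fit_ga(stage, discharge, bounds)
```

Because the best individual is copied unchanged into each new generation, the recorded fitness history never increases, which mirrors the convergence behaviour plotted in Figure 4.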

3.4 Development of Models based on Model Tree (MT)

MT models are formulated based on the fitness function given in Equation (4). The minimum number of instances was set to four during formulation. The training and testing sets are the same as those used in the GA model formulation. The logic sets produced by the program are shown in Table II. These logic sets test the time series data fed to the computer and assign the value according to the fitness function.

Table II - Model tree logic sets.

Berne Station (01470500)

Rules: If elseif elseif elseif else end

Ht
