Evolving an Information Diffusion Model Using a Genetic Algorithm for ...


JOURNAL OF HYDROMETEOROLOGY

VOLUME 15

Evolving an Information Diffusion Model Using a Genetic Algorithm for Monthly River Discharge Time Series Interpolation and Forecasting

CHENGZU BAI AND MEI HONG
Research Center of Ocean Environment Numerical Simulation, Institute of Meteorology and Oceanography, People’s Liberation Army University of Science and Technology, and Key Laboratory of Surficial Geochemistry, Ministry of Education, Department of Hydrosciences, School of Earth Sciences and Engineering, State Key Laboratory of Pollution Control and Resource Reuse, Nanjing University, Nanjing, China

DONG WANG Key Laboratory of Surficial Geochemistry, Ministry of Education, Department of Hydrosciences, School of Earth Sciences and Engineering, State Key Laboratory of Pollution Control and Resource Reuse, Nanjing University, Nanjing, China

REN ZHANG Research Center of Ocean Environment Numerical Simulation, Institute of Meteorology and Oceanography, People’s Liberation Army University of Science and Technology, Nanjing, China

LONGXIA QIAN
Research Center of Ocean Environment Numerical Simulation, Institute of Meteorology and Oceanography, People’s Liberation Army University of Science and Technology, Nanjing, China

(Manuscript received 16 November 2013, in final form 17 July 2014)

ABSTRACT

The identification of the rainfall–runoff relationship is a significant precondition for surface–atmosphere process research and operational flood forecasting, especially in inadequately monitored basins. Based on an information diffusion model (IDM) improved by a genetic algorithm, a new algorithm (GIDM) is established for interpolating and forecasting monthly discharge time series; the input variables are the rainfall and runoff values observed during the previous time period. The genetic operators are carefully designed to avoid premature convergence and "local optima" problems while searching for the optimal window width (a parameter of the IDM). In combination with fuzzy inference, the effectiveness of the GIDM is validated using long-term observations. Conventional IDMs are also included for comparison. Twelve gauging stations on the Yellow River and the Yangtze River are discussed, and the results show that the new method can simulate the observations more accurately than traditional IDMs, using only 50% or 33.33% of the total data for training. The low density of observations and the difficulties in information extraction are key problems for hydrometeorological research. Therefore, the GIDM may be a valuable tool for improving water management and providing acceptable input data for hydrological models when available measurements are insufficient.

1. Introduction

Scanty and missing data are insufficient to meet the needs of the hydrological modeling of the physical process. In addition, the rainfall–runoff relationship is one of the most complex hydrologic phenomena to comprehend because of the tremendous spatial and temporal variability of watershed characteristics and precipitation patterns (Sedki et al. 2009). Therefore, how to fill the existing observational data gaps and establish an acceptable model for rainfall–runoff forecasting is a crucial precondition for hydrometeorological research and operational flood forecasting, especially in some undermonitored river basins.

Corresponding author address: M. Hong, Research Center of Ocean Environment Numerical Simulation, Institute of Meteorology and Oceanography, People’s Liberation Army University of Science and Technology, 60 Shuanglong Road, Nanjing 211101, China. E-mail: [email protected]
DOI: 10.1175/JHM-D-13-0184.1
© 2014 American Meteorological Society

DECEMBER 2014

BAI ET AL.

Tremendous efforts have been made over the last few decades to recover missing data and to improve hydrological predictions. Most of the missing data recovery methods, such as kriging interpolation, polynomial interpolation, optimal interpolation, Kalman filtering, the successive corrections method, fractal interpolation, and phase space reconstruction prediction have been widely applied to hydrologic and oceanographic interpolation. However, these methods may not achieve acceptable results when the known data are less than 60% of the total one (Wang et al. 2008). Regarding how to estimate the relationship between rainfall and runoff accurately, many models have been proposed and have obtained many good results. These models can be broadly divided into three groups: regression-based methods, physical models, and artificial neural network (ANN) methods. The first group, which includes autoregressive moving-average models, has been widely used for reservoir design and optimization (Carlson et al. 1970; Chen and Rao 2002; Komorník et al. 2006; Salas 1993). However, this method generally assumes that the observations obey normal distribution or needs to make an assumption of the equation in advance. Therefore, it is very difficult to obtain a reasonable result for a small sample without any information about the population shape. For the physical models (Sorooshian et al. 1993; Todini 1996; Whigham and Crapper 2001), equations of mathematical physics are developed into a popular approach to describe the relationships in physical systems. However, the parameters of these models need to be estimated by minimizing objective functions, which generally lead to groups of unrealistic parameters incorporating both data measurement errors and the errors present in the structure of the model itself; parameter observability conditions could not always be guaranteed either. In recent years, the ANN technique has been of particular interest in operational hydrology. 
It is capable of simulating a nonlinear system that is hard to describe using traditional physical modeling. ANNs and improved ANNs have been applied in many fields of hydrology and water resource research (Alvisi et al. 2006; Cheng et al. 2005; Keskin et al. 2006; Muttil and Chau 2006; Tao et al. 2008; Zounemat-Kermani and Teshnehlab 2008). However, ANNs generally require sufficient samples to set optimal connection weights and thresholds, which may become challenging when data information is incomplete. Aiming at shortcomings of the existing interpolation methods and hydrologic models mentioned above, the information diffusion model (IDM) is introduced into this paper. IDM is an effective method of dealing with


the small-sample issue; it can capture complex nonlinear relationships without detailed knowledge of the physical processes (Huang 1997, 2001). Based on an IDM, an incomplete dataset can be regarded as a piece of fuzzy information; through diffusion, additional information can be extracted by spreading the observations. The diffusion coefficient [simple window width (SWW)] can be easily determined according to nearby criteria (Huang 1997) with incomplete data. Greater reliability for risk assessment and pattern recognition can be achieved using this method (Feng and Huang 2008; Huang 2002; Li et al. 2012). However, the IDM equipped with SWW (SIDM) is unable to precisely analyze meteorological, oceanic, and hydrological data that follow an asymmetrical and abnormal distribution. To solve this issue, under the principle of least-mean-squared error, Xinzhou et al. (2003) proposed the optimal window width (OWW)-based IDM (OIDM), which performs better than SIDM for estimating a nonnormal population. OWW uses the mean value of observations to iteratively compute an approximate result instead of taking all observations into account in one step, which may result in "local optima" and inconsistencies. Genetic algorithms (GAs) are implemented based on the ideas of natural genetics and biological evolution; the algorithm works with a number of solution sets over the search domain rather than a single one so that local optima can effectively be avoided (Goldberg and Holland 1988; Hong et al. 2013b). Hence, this paper presents a new method that obtains globally optimal diffusion coefficients by using a GA to interpolate and forecast monthly discharge time series. To the best of our knowledge, no study has been reported in the hydrological literature that has used the IDM for hydrological modeling.
Therefore, to facilitate our discussion, the principle of information diffusion and the algorithms to calculate the SWW and OWW are explained in section 2; in the same section, the new method to obtain the diffusion coefficients based on the GA is also discussed. Interpolation and prediction of river runoff using the IDM coupled with fuzzy inference are examined in section 3. To substantiate the new method, a step-by-step implementation of IDMs for the monthly river runoff interpolation and forecasting at real gauging stations is presented in section 4. Finally, in section 5, some conclusions are presented and future work is proposed.

2. IDM and the window width

a. Principles of information diffusion and SWW

Information diffusion refers to making an affirmation: when a knowledge sample is given, it can be used to compute a relationship between population and sample. Let $X = \{x_1, x_2, \ldots, x_n\}$ be a given sample (where curly brackets indicate a series of values) and let the universe of discourse be $U$. If and only if $X$ is incomplete, there must be a reasonable information diffusion function $\mu(x_i, u)$, $u \in U$, which can accurately estimate the real relation R1. This is called the principle of information diffusion (Huang 1997).

Let $x_i$ ($i = 1, 2, \ldots, n$) be $n$ independent identically distributed observations drawn from a population with density $p(x)$, $x \in \mathbb{R}^1$. Suppose $\mu(x)$ is a Borel measurable function in $(-\infty, +\infty)$:

$$\hat{f}(x) = \frac{1}{nd} \sum_{i=1}^{n} \mu\left(\frac{x - x_i}{d}\right) \tag{1}$$

is called an information diffusion estimator about $p(x)$, where $\mu(x)$ is the diffusion function and $d$ is the window width or diffusion coefficient (Huang 1997). According to molecule diffusion theory and nearby criteria, Huang (1997) obtained the normal diffusion function as

$$\mu(x) = \frac{1}{d\sqrt{2\pi}} \exp\left(-\frac{x^2}{2d^2}\right) \tag{2}$$

and the $d$ (SWW) as [$b = \max(X)$, $a = \min(X)$]:

$$d = \begin{cases}
0.8146\,(b - a), & n = 5 \\
0.5690\,(b - a), & n = 6 \\
0.4560\,(b - a), & n = 7 \\
0.3860\,(b - a), & n = 8 \\
0.3362\,(b - a), & n = 9 \\
0.2986\,(b - a), & n = 10 \\
2.6851\,(b - a)/(n - 1), & n \ge 11
\end{cases} \tag{3}$$
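The SWW rule of Eq. (3) and a normal diffusion estimator can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: the function names are hypothetical, and the estimator is written in the conventional Gaussian-kernel form, $\hat{f}(x) = \frac{1}{nd\sqrt{2\pi}} \sum_i \exp[-(x - x_i)^2/(2d^2)]$, which is one reading of Eqs. (1) and (2).

```python
import math

# Coefficients of Eq. (3) for small samples (n = 5, ..., 10).
_SWW_COEFF = {5: 0.8146, 6: 0.5690, 7: 0.4560, 8: 0.3860, 9: 0.3362, 10: 0.2986}


def simple_window_width(sample):
    """Simple window width (SWW) of Eq. (3); defined for n >= 5 only."""
    n = len(sample)
    if n < 5:
        raise ValueError("Eq. (3) gives no coefficient for n < 5")
    spread = max(sample) - min(sample)  # b - a
    if n in _SWW_COEFF:
        return _SWW_COEFF[n] * spread
    return 2.6851 * spread / (n - 1)


def diffusion_estimate(x, sample, d):
    """Normal information diffusion estimate of the density at x.

    Assumption: conventional Gaussian-kernel form with window width d.
    """
    n = len(sample)
    total = sum(math.exp(-((x - xi) ** 2) / (2.0 * d * d)) for xi in sample)
    return total / (n * d * math.sqrt(2.0 * math.pi))
```

For a sample of 90 month indices spanning 1 to 179, this rule gives 2.6851 × 178/89 = 5.3702, which agrees with the SWW reported for X in Table 2.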

The SWW-based information diffusion method maximizes the amount of useful data extracted from the sample, thus improving the accuracy of system recognition and natural disaster risk evaluation (Huang 2001; Palm 2007). However, the method is invalid when the population from which observations are drawn does not follow a normal distribution.

b. Optimal window width

To obtain a more accurate information diffusion estimator for the abnormal relationship, based on the principle of least-mean-squared error, Xinzhou et al. (2003) proposed an iterative method to obtain the optimal diffusion coefficient $d$ (OWW), which can be expressed by

$$\left.\begin{aligned}
d_0 &= \frac{\max(X) - \min(X)}{n - 1} \\
d_i^{\,j+1} &= \left\{ \frac{\hat{f}^{\,j}(x_i)\,(d_j)^4}{2n\sqrt{\pi}\,[\hat{f}^{\,j}(x_i + d_j) - 2\hat{f}^{\,j}(x_i) + \hat{f}^{\,j}(x_i - d_j)]^2} \right\}^{0.2} \quad (i = 1, \ldots, n) \\
d_{j+1} &= \mathrm{mean}(d_i^{\,j+1}) \\
\hat{f}^{\,j+1}(x_i) &= \frac{1}{\sqrt{2\pi}\, n\, d_j} \sum_{k=1}^{n} \mu\left(\frac{x_i - x_k}{d_j}\right)
\end{aligned}\right\} \tag{4}$$

where $d_0$ is the initial iterative value; $i$ and $j$ denote the ordinal number of records and iterations, respectively; and max(), min(), and mean() return the largest, smallest, and mean elements in $X$, respectively. When $\varepsilon d_k > |d_k - d_i|$ ($i = 1, 2, \ldots, k - 1$; $\varepsilon \le 0.0001$) is determined, the iterative computations end with the OWW:

$$d = \frac{1}{k - i + 1} \sum_{j=i}^{k} d_j. \tag{5}$$

The IDM in conjunction with the OWW, that is, the OIDM, applies to research data that follow both normal and abnormal distributions and can estimate a real relationship more accurately than a traditional one (Xinzhou et al. 2003). The OWW is obtained using the mean value of observations for iterative computation instead of including all observations in one step; however, the local optima problem (Goldberg and Holland 1988) may emerge. To avoid the problem, GAs are employed.
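The GA search for a single window width, developed in section 2c, can be sketched as follows. The fitness follows Eq. (9) and the search domain is $(0, (b - a)/3]$; everything else is an assumption made for illustration, because the paper defers the genetic operators to Goldberg and Holland (1988) and Hong et al. (2013a): binary tournament selection, arithmetic crossover, Gaussian mutation, elitism, and all parameter values are placeholders, and the function names are hypothetical.

```python
import math
import random


def diffusion_estimate(x, sample, d):
    """Gaussian-kernel form of the normal diffusion estimator (assumed form)."""
    n = len(sample)
    total = sum(math.exp(-((x - xi) ** 2) / (2.0 * d * d)) for xi in sample)
    return total / (n * d * math.sqrt(2.0 * math.pi))


def cal_fitness(d, sample):
    """Eq. (9): sum over records of |d - f(x_i) / (2 n sqrt(pi) [second difference]^2)|."""
    n = len(sample)
    total = 0.0
    for xi in sample:
        f0 = diffusion_estimate(xi, sample, d)
        second_diff = (diffusion_estimate(xi + d, sample, d) - 2.0 * f0
                       + diffusion_estimate(xi - d, sample, d))
        denom = 2.0 * n * math.sqrt(math.pi) * second_diff ** 2
        if denom == 0.0:
            return math.inf  # degenerate candidate, never selected as best
        total += abs(d - f0 / denom)
    return total


def ga_window_width(sample, pop_size=20, generations=30, tol=1e-8, seed=0):
    """Real-coded GA over (0, (b - a)/3]; operators are illustrative choices."""
    rng = random.Random(seed)
    upper = (max(sample) - min(sample)) / 3.0
    lower = 1e-6 * upper  # keeps d strictly positive
    pop = [rng.uniform(lower, upper) for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted((cal_fitness(d, sample), d) for d in pop)
        if scored[0][0] < tol:  # stop when the fitness is small enough
            break
        children = [scored[0][1]]  # elitism: keep the current best
        while len(children) < pop_size:
            p1 = min(rng.sample(scored, 2))[1]  # binary tournament
            p2 = min(rng.sample(scored, 2))[1]
            w = rng.random()
            child = w * p1 + (1.0 - w) * p2  # arithmetic crossover
            if rng.random() < 0.2:  # Gaussian mutation
                child += rng.gauss(0.0, 0.1 * upper)
            children.append(min(max(child, lower), upper))
        pop = children
    return min(pop, key=lambda d: cal_fitness(d, sample))
```

Because the best chromosome is always carried over, the lowest fitness in the population cannot increase from one generation to the next, which is one simple guard against losing a good window width to mutation.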

c. Searching the general optimal window width using the GA

GAs are established based on the ideas of natural genetics and biological evolution; the algorithm works with a number of solution sets over the search domain so that local optima can effectively be avoided (Goldberg and Holland 1988). Hence, this paper uses a GA to search for globally optimal diffusion coefficients. The combination of the GA and IDM for window width searching consists of three major phases.

In the first phase, the GA initializes a population composed of random codes from the search domain $(0, (b - a)/3]$ (Xinzhou et al. 2003), where $b$ and $a$ are the maximum and minimum values of the samples, respectively. Because binary encoding would involve too many variables for such optimization problems, this paper selects a real-coded GA, in which each chromosome is encoded with real numbers rather than binary ones.

The second phase is the evaluation of the fitness of all chromosomes. According to Xinzhou et al. (2003), the window width $d$ can be obtained by

$$d_i^5 = \frac{\hat{f}(x_i)}{2n\sqrt{\pi}\,[\hat{f}''(x_i)]^2}, \tag{6}$$

where $\hat{f}(x)$ is the information diffusion estimator and $x_i$ denotes different records from the sample ($i = 1, 2, \ldots, n$). Motivated by second-order schemes (Flajolet and Sedgewick 1995),

$$\hat{f}''(x_i) = \frac{\hat{f}(x_i + d_i) - 2\hat{f}(x_i) + \hat{f}(x_i - d_i)}{d_i^2}. \tag{7}$$

Then

$$d_i = \frac{\hat{f}(x_i)}{2n\sqrt{\pi}\,[\hat{f}(x_i + d_i) - 2\hat{f}(x_i) + \hat{f}(x_i - d_i)]^2}, \tag{8}$$

which is the criterion for determination of the window width. Thus, different records have their own different window widths. To take all records into account at the same time and search for a single global optimal $d$, we select

$$\mathrm{CalFitness}(d) = \sum_{i=1}^{n} \left| d - \frac{\hat{f}(x_i)}{2n\sqrt{\pi}\,[\hat{f}(x_i + d) - 2\hat{f}(x_i) + \hat{f}(x_i - d)]^2} \right| \tag{9}$$

as the fitness value function.

The third phase is to apply evolutionary processes, such as selection, crossover, and mutation operations, by a GA according to its fitness, which is discussed in Goldberg and Holland (1988) and Hong et al. (2013a). The evolution stops when the fitness is smaller than a predefined value. Finally, the chromosome with the lowest fitness value is adopted as the improved window width (IWW). After this brief overview of IDM and improved IDM techniques, the procedure for interpolating and predicting river discharge based on the IDM is described in the next section.

3. IDM for runoff estimation using fuzzy inference

The IDM coupled with fuzzy inference is an approach that processes samples using a set numerical method (Huang 2002). Let $(X, Y) = \{(x_i, y_i) \mid i = 1, 2, \ldots, n\}$ be a training set of observations on $\mathbb{R}^2$ ($\mathbb{R}$ is the real line), where input $x$ denotes the index of records sorted by chronological order or precipitation data, $y$ is the river discharge, and curly brackets indicate a series of values. Let $U$ be the domain of $x$ and $V$ be the range of $y$. The elements of $U$ will be denoted by $u$ and those of $V$ by $v$. Let

$$\begin{cases} U = \{u_j,\ j = 1, 2, \ldots, t\} \\ V = \{v_k,\ k = 1, 2, \ldots, l\} \end{cases} \tag{10}$$

The Cartesian product $U \times V$ is called the illustrating space, and $u_j$ and $v_k$ are fuzzy sets of $U$ and $V$, respectively. Recalling Eqs. (1) and (2) to deal with their membership functions, the following can be obtained:

$$\mu_U(u_j) = \frac{1}{t\,d_U\sqrt{2\pi}} \sum_{j=1}^{t} \exp\left[-\frac{(u - u_j)^2}{2d_U^2}\right], \quad u \in U \tag{11}$$

and

$$\mu_V(v_k) = \frac{1}{t\,d_V\sqrt{2\pi}} \sum_{j=1}^{t} \exp\left[-\frac{(v - v_k)^2}{2d_V^2}\right], \quad v \in V. \tag{12}$$

An illustrating point is given by $(u_j, v_k)$, and $d_U$ and $d_V$ are window widths that can be obtained from $X$ and $Y$ based on the algorithms discussed in section 2. The information gain of $(u_j, v_k)$ is

$$q_{jk}(x_i, y_i) = \mu_{U_j}(x_i) \times \mu_{V_k}(y_i). \tag{13}$$

Then we have

$$Q_{jk} = \sum_{i=1}^{n} q_{jk}(x_i, y_i), \tag{14}$$

which constitutes the information matrix (Huang 2002)

$$Q = \begin{array}{c|cccc}
 & v_1 & v_2 & \cdots & v_l \\ \hline
u_1 & Q_{11} & Q_{12} & \cdots & Q_{1l} \\
u_2 & Q_{21} & Q_{22} & \cdots & Q_{2l} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
u_t & Q_{t1} & Q_{t2} & \cdots & Q_{tl}
\end{array} \tag{15}$$
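A minimal sketch of how the information matrix of Eqs. (13)–(15) could be accumulated is given below. It assumes that each observation $(x_i, y_i)$ is diffused onto every illustrating point with a normal kernel of width $d_U$ or $d_V$; constant normalization factors are kept simple because they cancel when each column is later divided by its maximum. The function names are hypothetical.

```python
import math


def normal_diffusion(point, obs, d):
    """Information that observation `obs` diffuses to an illustrating `point`."""
    return math.exp(-((point - obs) ** 2) / (2.0 * d * d)) / (d * math.sqrt(2.0 * math.pi))


def information_matrix(xs, ys, U, V, d_u, d_v):
    """Q[j][k] = sum_i mu_Uj(x_i) * mu_Vk(y_i), following Eqs. (13) and (14)."""
    Q = [[0.0] * len(V) for _ in U]
    for x, y in zip(xs, ys):
        mu_u = [normal_diffusion(u, x, d_u) for u in U]
        mu_v = [normal_diffusion(v, y, d_v) for v in V]
        for j, m_j in enumerate(mu_u):
            if m_j == 0.0:
                continue  # this row receives nothing from the observation
            row = Q[j]
            for k, m_k in enumerate(mu_v):
                row[k] += m_j * m_k
    return Q
```

Each row of the result corresponds to one $u_j$ and each column to one $v_k$, matching the layout of Eq. (15).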

FIG. 1. Location of study sites. R1 refers to the Yellow River. R2 refers to the Yangtze River. The numbers 1, 2, 3, 4, 5, and 6 refer to Lijin, Lanzhou, Huayuankou, Zhutuo, Yichang, and Datong stations, respectively. The numbers 7, 8, 9, 10, 11, and 12 refer to Jinan, Zhengzhou, Tangnaihai, Yibin, Wanzhou, and Wuhu stations, respectively.

TABLE 1. The input data of experiment 1 (X: month index counted from January 1951; Y: monthly runoff, $\times 10^8$ m$^3$).

   X       Y       X       Y       X       Y       X       Y       X       Y
  41   21.64     172   36.81     173   41.52     132   57.32      40   26.18
  79   101.5      54   27.22      66   41.21      37   16.34     165     183
 155   43.29       3   26.09      67   98.83      38   15.75     100   19.18
  24   20.06      81   36.29     114       0       1   11.78     140   98.57
   9   84.45      60   27.05      11   41.81     107   19.47     145    15.8
  71   22.06      21   59.72     163   101.5     156   34.55      78   26.96
  44     139     137   22.04     179   29.29      22   39.02      35   39.22
  18    25.4      46    75.8      94   73.12      49   10.39      68   116.2
 153   101.9     111    0.01     127      75     134    8.06     131   94.35
  83   18.27     149   44.73      88   15.84      69   58.58      14   14.26
  87    13.5      91   113.8     113       0     135   55.44      62   22.58
 174   17.13      74   13.86     175      60     109    4.26      51   25.95
  25   12.64      57     105     118    3.72     152   96.69      27   22.63
 147    30.8      12   20.06     142    76.6     106   20.97     112       0
 161   69.37     139   36.69     143   49.51      95    59.1     101    8.84
  90    5.08     125   28.12     177   26.44      13   15.86     104   79.82
 170   22.79     168   42.85     110    1.73      10   75.91       6   25.92
  92   157.2      19   66.69      59   53.14     154   104.7       2   18.89

According to the theory of factor space (Pei-Zhuang 1990), a fuzzy relation matrix $R$,

$$R = \begin{array}{c|cccc}
 & v_1 & v_2 & \cdots & v_l \\ \hline
u_1 & R_{11} & R_{12} & \cdots & R_{1l} \\
u_2 & R_{21} & R_{22} & \cdots & R_{2l} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
u_t & R_{t1} & R_{t2} & \cdots & R_{tl}
\end{array}, \tag{16}$$

can be obtained from an information matrix $Q$ by using

$$\begin{cases}
R = (r_{jk})_{t \times l} = [r(u_j, v_k)]_{t \times l} \\
r_{jk} = Q_{jk}/s_k \\
s_k = \max\limits_{1 \le j \le t} Q_{jk}
\end{cases} \tag{17}$$

To calculate the output fuzzy set $B$, we use $A$ to denote the input fuzzy set,

$$\mu_A(u_j) = \sum_{z=1}^{t} \frac{q}{u_z}, \quad q = \begin{cases} 1, & z = j \\ 0, & z \ne j \end{cases} \tag{18}$$

Using the fuzzy inference formula

$$B = A \circ R, \tag{19}$$

where operator "$\circ$" denotes the maximum–minimum fuzzy composition rule,

$$\mu_B(v_k) = \max_{u_j \in U} \{\min[\mu_A(u_j), r(u_j, v_k)]\}, \quad v_k \in V, \tag{20}$$

where $r(u_j, v_k) \in (0, 1]$; thus, we can obtain

$$\mu_B(v_k) = \max_{u_j \in U} [r(u_j, v_k)]. \tag{21}$$

Finally, the gravity center of the fuzzy set is generated as the output:

$$\tilde{y}_j = \frac{\sum_{k=1}^{l} v_k\, \mu_B(v_k)}{\sum_{k=1}^{l} \mu_B(v_k)}. \tag{22}$$

In general, we use the given sample $(X, Y)$ to construct a relationship between the river discharge and its antecedent values or its meteorological influencing factor in the following form:

$$\tilde{Y} = f(X_i), \tag{23}$$

where $X_i$ is an input vector consisting of $x_1, \ldots, x_i, \ldots, x_m$ and $\tilde{Y}$ is the flow in the next period or the missing measurement. Thus, the value of river runoff occurring at a particular moment can be estimated by the IDM with the help of fuzzy inference.

TABLE 2. The corresponding diffusion coefficients of experiment 1.

         SWW        OWW        IWW
X     5.3702    14.2057     0.6968
Y     5.5210    32.0010     0.3146

FIG. 2. Distribution of input data in experiment 1.

4. Case study

To investigate the effectiveness of the proposed model, an IDM improved by a genetic algorithm (GIDM), for runoff prediction and interpolation, experiments divided into two groups have been made. In section 4b, an application of GIDM for monthly discharge reconstruction and forecasting is compared with SIDM and OIDM using the same sparse observed data at Lijin station and neighboring stations. In section 4c, more validation at other gauging stations is discussed.

FIG. 3. Comparison of observed runoff (solid blue line) and interpolated runoff forced by SIDM (solid black line), OIDM (solid red line), and GIDM (solid green line) at gauge Lijin from (a) January 1951 to December 1965, (b) January 1966 to December 1980, (c) January 1981 to December 1995, and (d) January 1996 to December 2010 using 50% of the data.

TABLE 3. The RMSE, R, E, and MAPE values for different models (50% data, where boldface font indicates the best performance).

Time                 Model      RMSE        R        E     MAPE
Jan 1951–Dec 1965    SIDM    31.4551   0.7784   0.6019    34.14
                     OIDM    31.9570   0.7675   0.5891    43.85
                     GIDM    21.0446   0.9138   0.8218    31.50
Jan 1966–Dec 1980    SIDM    23.6351   0.7947   0.6205    30.43
                     OIDM    24.7525   0.7847   0.5838    40.80
                     GIDM    20.4493   0.8580   0.7159    28.69
Jan 1981–Dec 1995    SIDM    21.7965   0.7413   0.5448    33.43
                     OIDM    21.5027   0.7465   0.5570    45.58
                     GIDM    15.7308   0.8740   0.7629    27.88
Jan 1996–Dec 2010    SIDM    12.6430   0.7502   0.5522    37.65
                     OIDM    12.6026   0.7453   0.5551    45.61
                     GIDM     8.9801   0.8816   0.7741    23.00

FIG. 4. As in Fig. 3, but for 33.33% of the data.

a. Study area and data

The Tangnaihai, Lanzhou, Zhengzhou, Huayuankou, Jinan, and Lijin stations on the Yellow River, along with the Yibin, Zhutuo, Wanzhou, Yichang, Wuhu, and Datong stations on the Yangtze River, have been selected for this study (see Fig. 1). The Yangtze River, the longest river in China with a total length of 6380 km and a drainage basin of $1.8 \times 10^6$ km$^2$, covers 20% of China’s land area. The surface runoff of the Yangtze River is about $9.616 \times 10^{11}$ m$^3$, which accounts for 36% of the total runoff in China. The Yellow River is equally vital in China’s hydrological cycle, with a mainstream length of 5464 km and an area of 752 443 km$^2$. It originates from the Tibetan Plateau and flows eastward, crossing six Chinese provinces and two autonomous regions on its course to Bo Hai. The Lijin station is the farthest downstream on the Yellow River and is the master regulation station for river discharge and sediment. Because of its importance, Lijin is selected to evaluate in detail the performance of GIDM for simulating changes in runoff, with the other stations as auxiliary evaluation sites. The monthly runoff data from January 1951 to December 2010 are published by the Yellow River Conservancy Commission (YRCC) and in the Bulletins of Chinese River Sediment compiled by the Ministry of Water Resources. Precipitation data from January 1981 to December 2010 are collected from the China Meteorology Administration.

FIG. 5. Distribution of input data in experiment 3.

TABLE 4. The RMSE, R, E, and MAPE values for different models (33.3% data, where boldface font indicates the best performance).

Time                 Model      RMSE        R        E     MAPE
Jan 1951–Dec 1965    SIDM    31.1543   0.7901   0.6094    41.86
                     OIDM    30.9841   0.7859   0.6137    52.71
                     GIDM    23.3636   0.8881   0.7804    36.38
Jan 1966–Dec 1980    SIDM    26.7887   0.7354   0.5125    34.90
                     OIDM    25.8631   0.7447   0.5456    47.45
                     GIDM    22.1373   0.8312   0.6971    30.90
Jan 1981–Dec 1995    SIDM    22.8579   0.7082   0.4994    43.98
                     OIDM    20.7114   0.7690   0.5890    52.87
                     GIDM    17.3896   0.8517   0.7103    36.88
Jan 1996–Dec 2010    SIDM    13.0969   0.7309   0.5195    47.99
                     OIDM    12.6384   0.7435   0.5525    56.53
                     GIDM    10.9313   0.8208   0.6653    40.45

b. Experiments estimating monthly runoff time series at Lijin

TABLE 5. The input data of experiment 3 (X: rainfall; Y: flow).

    X       Y        X       Y        X       Y        X       Y
  531    5.49       68    7.54      450   12.43        4    8.88
  314   22.71     3866   39.64     1215    0.00      390    9.62
  188   12.24       11   35.09      999   22.10     1571   68.57
  275    6.64        5   15.05      110    7.23      180    3.89
  422    4.50     2082    8.62     2869   11.85      764    1.93
 1555   36.69       46   12.51     3869   40.98        2    4.63
    0   14.76      125   10.77      169   15.86      240   28.77
   58   11.92      290    1.91      994   30.27     1824   42.85
  160   20.11      237    3.29      109    2.67      272   27.99
 1615   17.01     1408    0.00     2394   89.19       56   16.04
  290   13.19     1202   53.84     1234    5.97      171    2.84
  474    6.29      143    0.04      974    3.63      129    0.24
    0    1.14      324   10.34     1610   20.60     2414   35.89
 2872   30.27      589   40.69     1188   33.96       82   17.34
  861   13.32       95    1.30      403    1.40        2    7.34
   65   13.20     2248   23.09      912    0.42       99   15.32
  881   38.57      138   12.45       25   11.20     1103   17.52
 1528    5.79      265   12.08      105   12.56     1989   24.53
  467   33.96      222    1.42      226    5.03        3    2.55
   51    0.84       97    7.26      348   22.87      192   26.96
   99   22.86      246    6.43     1773   38.03      656    3.70
  205    1.04      297    2.65       89    0.77      535    7.44
 1740    0.00      225    7.19       86    8.79      654   92.94
 1281   10.18     1481    2.08        3    3.67      565   11.65
  575   20.73      184   13.50      135   10.77      396   14.84
  207    0.09      917    1.72      191   55.44      206    5.85
   30    9.92       60    1.05      684    0.49      235   22.15
 1634   17.89      684   21.32      159    0.19       86    4.64
  101    4.96       65    4.85      719    2.49       87   17.13

TABLE 6. The corresponding diffusion coefficients of experiment 3.

          SWW        OWW        IWW
X     90.3361    79.7712    46.0498
Y      2.1700     6.0754     1.2359

FIG. 6. Comparison between predicted and measured flow values at Lijin and rainfall data at Jinan (33.33% data).

1) EXPERIMENT 1: INTERPOLATING RUNOFF TIME SERIES USING 50% OF THE TOTAL DATA

A real example of the monthly runoff data ($\times 10^8$ m$^3$) from January 1951 to December 1965, taken at Lijin station, is presented to illustrate the step-by-step implementation of different IDMs.

Step 1. Let records measured from January 1951 to December 1965 be $(X, Y) = \{(1, 11.78), (2, 18.89), \ldots, [(1965 - 1951 + 1) \times 12 = 180, 2.759]\}$.

Step 2. Based on the Monte Carlo method, 50% of the observation data are pseudorandomly selected as the input data (see Table 1), and the remaining 50% are treated as missing data.

Step 3. Calculate the window width of IDM according to the input and the algorithm described in section 2. The values of SWW, OWW, and IWW are listed in Table 2.

Step 4. In the information diffusion technique, the selection of appropriate illustrating points is crucial for successful implementation because it provides the basic information about the system. Through a statistical analysis of the data series, the illustrating space can be well established. The input data are analyzed with respect to their distribution in Fig. 2. The more elements an interval contains, the more illustrating points should be placed in it. Therefore, the illustrating space (where curly brackets indicate a series of values) is designed as

$$U \times V = \begin{cases}
U = \{1, 2, \ldots, 180,\ \Delta = 1\} \\
V = \begin{cases}
v_1 = \{0, 0.5, \ldots, 39.5,\ \Delta = 0.5\} \\
v_2 = \{40, 41, \ldots, 79,\ \Delta = 1\} \\
v_3 = \{80, 82, \ldots, 90,\ \Delta = 2\} \\
v_4 = \{91, 92, \ldots, 110,\ \Delta = 1\} \\
v_5 = \{112, 116, \ldots, 200,\ \Delta = 4\}
\end{cases}
\end{cases} \tag{24}$$

Step 5. Calculate the final output using Eqs. (11)–(22).

To measure the consistency of SIDM, OIDM, and GIDM, the monthly runoff of years 1966–80, 1981–95, and 1996–2010 is reviewed following the experiment discussed above. Figure 3 displays the aggregated time series of observed and interpolated runoffs. It can be observed that the GIDM outperforms the other two models. For example, in Fig. 3a, SIDM and OIDM show obvious undersimulations in July 1951, June 1955, November 1958, October 1961, and July 1964, as well as oversimulations in March 1956, January 1957, and October 1965, whereas the GIDM agrees well with the observations at these times. Although some discrepancies exist between observed and simulated data using GIDM (e.g., from June to November 1953), the general tendency is acceptable, considering the limited number of training samples.

The root-mean-square error (RMSE; Wang et al. 2009; Nayak et al. 2004), the coefficient of correlation R (Wang et al. 2009), the Nash–Sutcliffe efficiency coefficient E (Nash and Sutcliffe 1970), and the mean absolute percentage error (MAPE; Hu et al. 2001) are employed as objective functions to calibrate the model. Table 3 shows the RMSE, R, E, and MAPE values for different models. It is clear from Table 3 that GIDM performs better than the traditional SIDM and OIDM. For example, in the years 1966–80, considering a high value of $145.2000 \times 10^8$ m$^3$ and a very low value of $0.4692 \times 10^8$ m$^3$ at the Lijin gauging station, the GIDM with an RMSE value of $20.4493 \times 10^8$ m$^3$ performed satisfactorily in the interpolation. Moreover, the GIDM obtained the best R, E, and MAPE statistics of 0.8580, 0.7159, and 28.69, respectively. Coefficient R evaluates the linear correlation between the observed and computed flow, E evaluates the capability of the model in predicting flow values deviating from the mean, and MAPE measures the mean absolute percentage error of the forecast. Therefore, according to the values in Table 3, it can be concluded that the GIDM has reliable robustness and consistency.
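The four evaluation indices can be written compactly as below. The formulas are the standard definitions of RMSE, Pearson correlation, Nash–Sutcliffe efficiency, and MAPE; the paper does not state how months with zero observed runoff enter MAPE, so skipping them here is an assumption, and the function names are hypothetical.

```python
import math


def rmse(obs, sim):
    """Root-mean-square error between observed and simulated series."""
    return math.sqrt(sum((o - s) ** 2 for o, s in zip(obs, sim)) / len(obs))


def pearson_r(obs, sim):
    """Coefficient of linear correlation R."""
    mo, ms = sum(obs) / len(obs), sum(sim) / len(sim)
    cov = sum((o - mo) * (s - ms) for o, s in zip(obs, sim))
    var_o = sum((o - mo) ** 2 for o in obs)
    var_s = sum((s - ms) ** 2 for s in sim)
    return cov / math.sqrt(var_o * var_s)


def nash_sutcliffe(obs, sim):
    """Nash-Sutcliffe efficiency E = 1 - SSE / variance of observations."""
    mo = sum(obs) / len(obs)
    sse = sum((o - s) ** 2 for o, s in zip(obs, sim))
    return 1.0 - sse / sum((o - mo) ** 2 for o in obs)


def mape(obs, sim):
    """Mean absolute percentage error (%).

    Assumption: records with zero observed value are skipped,
    because the percentage error is undefined there.
    """
    terms = [abs(o - s) / abs(o) for o, s in zip(obs, sim) if o != 0]
    return 100.0 * sum(terms) / len(terms)
```

Lower RMSE and MAPE and higher R and E indicate a better fit, which is the sense in which the GIDM rows of Table 3 are the best in each period.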

2) EXPERIMENT 2: INTERPOLATING RUNOFF TIME SERIES USING 33.33% OF THE TOTAL DATA

Subsequent to experiment 1, in step 2 only 33.33% of the monthly discharge data are selected as the input, with the remaining 66.67% used for testing. Figure 4 shows a plot of observed and reconstructed discharges using different models. It can be observed that the GIDM still correlates well with the recorded discharges, although there are some slight oversimulations and undersimulations. Table 4 compares the models in terms of various performance statistics. For example, in the years 1981–95, the GIDM reduced the RMSE and MAPE of the SIDM interpolation by about 23.92% and 16.14%, respectively; improvements of the results regarding R and E were approximately 20.23% and 40.26%, respectively. The RMSE and MAPE values obtained by the GIDM decreased by 16.04% and 30.24% compared with the OIDM, while the R and E values increased by 10.75% and 20.59%. Overall, GIDM is able to obtain better accuracy in terms of different evaluation measures. In addition, discharges have been interpolated using sparser datasets for training, and all model simulations gradually deteriorate. This is because the samples contain less information about the river runoff for modeling runoff values.

TABLE 7. Step 5 table: the fuzzy relation matrix R.

                v1 (0.0)      v2 (0.5)      v3 (1.0)      v4 (1.5)      v5 (2.0)      v6 (2.5)      v7 (3.0)      ...  v63 (118)  v64 (122)
u1 (0)          0.3631        0.3781        0.3934        0.4088        0.4242        0.4396        0.4548        ...  0.0004     0.0004
u2 (1)          0.3693        0.3845        0.3999        0.4154        0.4310        0.4465        0.4619        ...  0.0004     0.0004
...
u841 (840)      0.0946        0.0975        0.1004        0.1034        0.1064        0.1094        0.1124        ...  0.0003     0.0003
u842 (841)      0.0967        0.0995        0.1024        0.1054        0.1083        0.1113        0.1143        ...  0.0003     0.0003
u843 (842)      0.0988        0.1016        0.1045        0.1074        0.1104        0.1133        0.1162        ...  0.0002     0.0002
u844 (843)      0.1010        0.1038        0.1067        0.1096        0.1125        0.1154        0.1183        ...  0.0002     0.0002
u845 (844)      0.1033        0.1061        0.1089        0.1118        0.1147        0.1176        0.1204        ...  0.0002     0.0002
u846 (845)      0.1056        0.1084        0.1113        0.1141        0.1170        0.1199        0.1227        ...  0.0002     0.0002
...
u4300 (4299)    1.78×10^-230  3.73×10^-230  7.81×10^-230  1.63×10^-229  3.40×10^-229  7.07×10^-229  1.46×10^-228  ...  0.0000     0.0000
u4301 (4300)    9.05×10^-231  1.90×10^-230  3.98×10^-230  8.32×10^-230  1.73×10^-229  3.60×10^-229  7.46×10^-229  ...  0.0000     0.0000

TABLE 8. Forecasting performance indices of models for Lijin on average for the period from January 2009 to December 2010 (33.33% data, 30 times, where boldface font indicates the best performance).

Model      RMSE        R        E      MAPE
SIDM     7.4143   0.8658   0.7698   45.1697
OIDM     9.2753   0.8605   0.6648   46.9517
GIDM     6.7190   0.8653   0.8011   31.4092
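Steps 2 and 4 of the experiments, the pseudorandom choice of a training subset and the construction of a nonuniform illustrating space, can be sketched as follows. The grid shown is the one given for experiment 3 in Eq. (25); the helper names, the seed, and the use of Python's `random` module as the pseudorandom source are assumptions made for illustration.

```python
import random


def grid(start, stop, step):
    """Evenly spaced points from start to stop inclusive."""
    count = int(round((stop - start) / step)) + 1
    return [start + i * step for i in range(count)]


def illustrating_space_experiment3():
    """U and V as laid out in Eq. (25): V is denser where the data are denser."""
    U = grid(0, 4300, 1)
    V = (grid(0.0, 9.5, 0.5)    # v1 to v20
         + grid(10, 19, 1)      # v21 to v30
         + grid(20, 48, 2)      # v31 to v45
         + grid(50, 122, 4))    # v46 to v64
    return U, V


def pick_training_indices(n_total, fraction, seed=0):
    """Pseudorandomly select a fraction of the record indices for training."""
    rng = random.Random(seed)
    k = round(n_total * fraction)
    return sorted(rng.sample(range(n_total), k))
```

With 348 monthly records (January 1981 to December 2009) and a fraction of one-third, the selection keeps 116 records, the number of (X, Y) pairs listed in Table 5; with 180 records and a fraction of 0.5 it keeps 90, as in Table 1.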

3) EXPERIMENT 3: 12- AND 24-MONTH LEAD TIME FORECASTING

A new framework is proposed using IDM to investigate the relationship between upstream rainfall and predicted discharges at Lijin station. Assuming that the discharges at Lijin from January to December 2010 are ungauged, the only observations we have are the broken time series of monthly flow data from January 1981 to December 2009 measured at Lijin station and broken precipitation data at neighboring Jinan (see Fig. 1) station from January 1981 to December 2010. Step 1. Let records from January 1981 to December 2009 be denoted by (X, Y) 5 f(13:3, 8:73), (7, 7:11), . . . , (8:6, 8:79)g, where X and Y are referred to as the rainfall and flow data and curly brackets indicate a series of values, respectively. Step 2. Based on the Monte Carlo method, 33.33% of the observation data (see Table 5) are pseudorandomly selected for training the window widths, with the remaining 66.67% as missing data. Step 3. Calculate the window widths. The values of SWW, OWW, and IWW are listed in Table 6. Step 4. The corresponding illustrating space (where curly brackets indicate a series of values) with respect to the input distribution (see Fig. 5) is designed as 8 U 5 fui j ui 5 0, 1, . . . , 4300, D 5 1g > > > 8 > > y 1220 5 f0, 0:5, . . . , 9:5, D 5 0:5g > > > > < > > < y 21230 5 f10, 11, . . . , 19, D 5 1g . U 3V 5 > V 5y > > > y 31245 5 f20, 22, . . . , 48, D 5 2g > > > > > > > : : y 46264 5 f50, 54, . . . , 122, D 5 4g (25)


FIG. 7. As in Fig. 6, but for rainfall data at Zhengzhou.

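Step 5. Calculate the final predicted values using Eqs. (17)–(22). Taking the GIDM as an example, first calculate the fuzzy relationship matrix $R$ using Eq. (17) (see Table 7). Then, for the amount of precipitation at Jinan in June 2010, $u_{842} = 841$, Eq. (18) gives

$$
\mu_A(841) = \frac{0}{0} + \frac{0}{1} + \cdots + \frac{0}{840} + \frac{1}{841} + \cdots + \frac{0}{4300} .
$$

Here and in the next expression, each term is written in fuzzy-set notation: the numerator is the membership grade and the denominator is the element it belongs to, so the terms are not quotients.

Second, we use Eq. (20) and $R$ to calculate $\mu_B(v_k)$:

$$
\mu_B(v_k) = \frac{0.0967}{0} + \frac{0.0995}{0.5} + \frac{0.1024}{1} + \cdots + \frac{0.0003}{118} + \frac{0.0003}{122} .
$$

Finally, by Eq. (22), the runoff at Lijin in June 2010 is calculated:

$$
\tilde{v}_j = \frac{0.0967 \times 0 + 0.0995 \times 0.5 + 0.1024 \times 1 + \cdots + 0.0003 \times 122}{0 + 0.5 + 1 + \cdots + 122} = 26.5343 .
$$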
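The data handling in Steps 2, 4, and 5 can be sketched as follows: a pseudorandom choice of one-third of the records for training, the discrete spaces $U$ and $V$ of Eq. (25), and the singleton input $\mu_A(841)$. This is only an illustrative sketch. The function names, the fixed seed, and the record count of 348 months (a gap-free January 1981 to December 2009 series) are assumptions, and the window-width training and the fuzzy inference of Eqs. (17)–(22) are not reproduced.

```python
import random

def select_training_indices(n_records, fraction=1 / 3, seed=0):
    """Step 2 (sketch): pseudorandomly pick a fraction of the record indices."""
    rng = random.Random(seed)            # seed is an arbitrary assumption
    k = round(n_records * fraction)
    return sorted(rng.sample(range(n_records), k))

def build_u():
    """Rainfall space U = {0, 1, ..., 4300} with step 1, as in Eq. (25)."""
    return list(range(0, 4301))

def build_v():
    """Runoff space V: 64 points with steps 0.5, 1, 2, and 4, as in Eq. (25)."""
    v = [0.5 * i for i in range(20)]     # v_1..v_20:  0, 0.5, ..., 9.5
    v += list(range(10, 20))             # v_21..v_30: 10, 11, ..., 19
    v += list(range(20, 50, 2))          # v_31..v_45: 20, 22, ..., 48
    v += list(range(50, 126, 4))         # v_46..v_64: 50, 54, ..., 122
    return v

def singleton_membership(u_grid, value):
    """Membership grade 1 at the observed value and 0 elsewhere (Step 5 input)."""
    return [1.0 if u == value else 0.0 for u in u_grid]

if __name__ == "__main__":
    train = select_training_indices(348)          # 348 is a hypothetical record count
    u_grid, v_grid = build_u(), build_v()
    mu_a = singleton_membership(u_grid, 841)
    print(len(train), len(u_grid), len(v_grid), mu_a.index(1.0))
```

Because `u_grid` starts at 0, the value 841 sits at zero-based index 841, which corresponds to $u_{842}$ in the one-based indexing of the text.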

Following the steps discussed above, the SIDM, OIDM, and GIDM runoff forecasts at Lijin can be obtained (see Fig. 6). Figure 6 shows that the variation of runoff at Lijin is closely related to changes in upstream precipitation. Rainfall and runoff nevertheless show slightly different tendencies in places, which suggests that the river flow responds not only to rainfall but also to other physical factors, such as evaporation and soil moisture, and to intensive human activities. In addition, Fig. 6 indicates a good match between the model output and the observed runoff, especially for peak discharge forecasting with the GIDM, which means the new method may be used as an operational flood forecasting tool.

To validate the effectiveness of SIDM, OIDM, and GIDM for runoff forecasting, a 24-month lead time prediction has also been made for the period from January 2009 to December 2010. The above-mentioned experiment is repeated 30 times, each time using different rainfall data from Zhengzhou (see Fig. 1) and runoff data from Lijin for training. Table 8 shows the average performance of the different IDMs, and Fig. 7 presents the monthly hydrograph of observed and simulated river runoff for Lijin in the first experiment. They confirm that the variation of runoff at Lijin is sensitive to changes in precipitation at upstream stations and that the GIDM performed better than the other two models: the GIDM improved the performance of


FIG. 8. Comparison of observed runoff and interpolated runoff forced by SIDM, OIDM, and GIDM at gauges (a) Huayuankou (from January 1951 to December 1970) and (b) Yichang (from January 1991 to December 2010) for 33.33% of the data.

traditional models by about 0.6%–33% in terms of different evaluation criteria.
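As a rough check on the quoted range, the relative change of each GIDM index against the SIDM and OIDM values in Table 8 can be computed as below. How the authors derived the 0.6%–33% range is not stated, so the relative-change definition used here is an assumption, as are the names `TABLE8`, `LOWER_IS_BETTER`, and `improvement`.

```python
# Average indices for Lijin copied from Table 8.
TABLE8 = {
    "SIDM": {"RMSE": 7.4143, "R": 0.8658, "E": 0.7698, "MAPE": 45.1697},
    "OIDM": {"RMSE": 9.2753, "R": 0.8605, "E": 0.6648, "MAPE": 46.9517},
    "GIDM": {"RMSE": 6.7190, "R": 0.8653, "E": 0.8011, "MAPE": 31.4092},
}

# Error measures improve when they decrease; R and E improve when they increase.
LOWER_IS_BETTER = {"RMSE", "MAPE"}

def improvement(baseline, candidate, metric):
    """Relative improvement (percent) of `candidate` over `baseline` (assumed definition)."""
    b = TABLE8[baseline][metric]
    c = TABLE8[candidate][metric]
    change = (b - c) / b if metric in LOWER_IS_BETTER else (c - b) / b
    return 100.0 * change

if __name__ == "__main__":
    for baseline in ("SIDM", "OIDM"):
        for metric in ("RMSE", "R", "E", "MAPE"):
            gain = improvement(baseline, "GIDM", metric)
            print(f"GIDM vs {baseline:4s} {metric:4s}: {gain:6.2f}%")
```

Under this assumed definition, the smallest and largest gains over the OIDM are those for R and MAPE, at roughly 0.6% and 33%, whereas the GIDM's R is essentially level with that of the SIDM.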

c. More interpolation and forecasting experiments at other stations

Subsequent to experiments 2 and 3, the three IDMs have been used to interpolate monthly river discharges and to estimate the rainfall–runoff relationship at five other gauges. In particular, the runoff curves from Huayuankou and Yichang are given (see Figs. 8, 9) because these stations are at the midstream of two different main rivers in China and have different physiographical factors, such as catchment area and underlying surface. Although there are some obvious underestimations in Fig. 8a, the results indicate that the GIDM provides better interpolation and prediction performance than traditional IDMs. Moreover, the analysis in Figs. 8b and 9 supports the consistency of the new method. The estimated variations of discharge at the Lanzhou, Zhutuo, and Datong stations are summarized in Tables 9 and 10. According to the values in these tables, the GIDM obtained the best RMSE, R, E, and MAPE statistics for the interpolations and predictions at Lanzhou, Zhutuo, and Datong. In summary, the GIDM shows considerable promise for the interpolation and prediction of river runoff from incomplete information.

FIG. 9. Comparison of observed and simulated runoff at gauges (a) Yichang (from January to December 2010) and (b) Huayuankou (from January 2009 to December 2010) for prediction using different IDMs (33.33% data).

TABLE 9. Interpolating performance indices of models for different stations (33.33% data, where boldface font indicates the best performance).

Station    Model    RMSE        R         E         MAPE
Lanzhou    SIDM     25.6159     0.8779    0.6772    41.86
Lanzhou    OIDM     24.4267     0.8732    0.6819    52.71
Lanzhou    GIDM     19.9595     0.9068    0.8071    34.38
Zhutuo     SIDM     207.9402    0.6986    0.4869    33.16
Zhutuo     OIDM     198.6825    0.7075    0.5184    45.07
Zhutuo     GIDM     182.4174    0.7897    0.6338    30.31
Datong     SIDM     355.7384    0.6941    0.4894    43.10
Datong     OIDM     321.3408    0.7536    0.5772    51.82
Datong     GIDM     289.6885    0.8347    0.6961    34.19

TABLE 10. Forecasting performance indices of models for different stations (33.33% data, where boldface font indicates the best performance).

Time                 Station    Model    RMSE        R         E         MAPE
Jan 2009–Dec 2010    Lanzhou    SIDM     6.7708      0.8520    0.7532    39.98
Jan 2009–Dec 2010    Lanzhou    OIDM     6.7846      0.8475    0.6618    50.34
Jan 2009–Dec 2010    Lanzhou    GIDM     5.7547      0.8800    0.7833    32.84
Jan 2009–Dec 2010    Zhutuo     SIDM     51.1659     0.7339    0.6068    33.79
Jan 2009–Dec 2010    Zhutuo     OIDM     48.8880     0.7328    0.5369    45.94
Jan 2009–Dec 2010    Zhutuo     GIDM     44.8858     0.8180    0.6564    30.90
Jan 2009–Dec 2010    Datong     SIDM     158.6694    0.6760    0.4766    41.32
Jan 2009–Dec 2010    Datong     OIDM     143.7688    0.7340    0.5622    49.67
Jan 2009–Dec 2010    Datong     GIDM     120.7107    0.8130    0.6780    32.77

5. Conclusions

A new algorithm for reconstructing and forecasting river discharges from incomplete data has been proposed in this paper. The algorithm is constructed to improve the coefficient of the IDM so that more information can be extracted from sparse data. Conventional IDMs are also included for comparison. Monthly runoff data from the Lijin, Lanzhou, Huayuankou, Zhutuo, Yichang, and Datong gauging stations, together with upstream rainfall data, are employed to train and validate the different IDMs. The results support the following conclusions about the potential of the new method:

1) The GIDM is appropriate for estimating the relationship between monthly runoff and upstream rainfall from scanty data, which is crucial for flood prevention and water management in unmeasured basins.

2) With sparse observations, the GIDM is an operational tool for long-term interpolation of river runoff, which may provide acceptable input data for hydrologic modeling of physical processes; traditional time series approaches need much more information to obtain an acceptable result.

3) On average, the GIDM can improve traditional IDM interpolation and prediction by about 10%–40% in terms of the different performance criteria.

Although the new IDM is sufficient to model the runoff time series, its results are still not acceptable when samples are too sparse to permit effective simulation and forecasting. Future research should therefore focus on establishing a more efficient diffusion function, on estimating the relationship between runoff and other meteorological factors, and on reducing the computational time needed to search for the optimal parameters of IDMs, so as to improve the accuracy of hydrological simulation and to achieve better operation and management of the various engineering systems.

Acknowledgments. The authors are very grateful to Dr. Joe Turk and the anonymous reviewers for their valuable comments and constructive suggestions, which helped us significantly in improving the quality of the paper. This research was supported by the Chinese National Natural Science Fund (Grants 41375002, 41075045, 51190091, and 41071018) and the Chinese National Natural Science Fund of Jiangsu Province (BK2011123), the Program for New Century Excellent Talents in University (NCET-12-0262), the China Doctoral Program of Higher Education (20120091110026), the Qing Lan Project, the Elite Young Teachers Program, and the Excellent Disciplines Leaders in Midlife-Youth Program of Nanjing University.
