water Article

The Value of Hydrologic Information in Reservoir Outflow Decision-Making

Kebing Chen 1, Shenglian Guo 1,*, Shaokun He 1, Tao Xu 2, Yixuan Zhong 1 and Sirui Sun 3

1 State Key Laboratory of Water Resources and Hydropower Engineering Science, Wuhan University, Wuhan 430072, China; [email protected] (K.C.); [email protected] (S.H.); [email protected] (Y.Z.)
2 China Yangtze Power Co., Ltd., Yichang 443133, China; [email protected]
3 Middle Changjiang River Bureau of Hydrology and Water Resources Survey, Wuhan 430012, China; [email protected]
* Correspondence: [email protected]; Tel.: +86-27-6877-3568

Received: 22 August 2018; Accepted: 26 September 2018; Published: 1 October 2018

 

Abstract: The controlled outflows from a reservoir are highly dependent on the decisions made by the reservoir operators, who mainly rely on available hydrologic information, such as past outflows, reservoir water level and forecasted inflows. In this study, the Random Forests (RF) algorithm is used to build a reservoir outflow simulation model in order to evaluate the value of hydrologic information. The Three Gorges Reservoir (TGR) in China is selected as a case study. As input variables of the model, the classic hydrologic information is divided into past, current and future information. Several simulation models are established based on combinations of these three groups of information. The influence and value of hydrologic information in reservoir outflow decision-making are evaluated from two perspectives: one is the simulation results of the different models, and the other is the importance ranking of the input variables in the RF algorithm. Simulation results demonstrate that the proposed model is able to reasonably simulate the outflow decisions of TGR. It is shown that past outflow is the most important information and that the forecasted inflows are more important in the flood season than in the non-flood season for reservoir operation decision-making.

Keywords: reservoir operations; hydrologic information; data mining; random forests; decision-making; three gorges reservoir

1. Introduction

With the impact of population growth, urbanization and industrialization, reservoirs play a vital role in regulating water resources by altering the spatial and temporal distribution of natural runoff. The management of reservoirs is often performed by human decision-makers, who are able to combine various hydrologic information, such as past outflows, reservoir water level and forecasted inflows, with predefined rules. For reservoir operation decision-makers, it is difficult to evaluate which hydrologic information is the most important. To rank hydrologic information and judge its value, we try to understand how outflow decisions are made by analyzing historical reservoir operation data with an outflow simulation model.

Attempts to use data-mining techniques to extract knowledge from data for better reservoir operation have gained much popularity in recent years. Bessler et al. [1] extracted the operating rules for a reservoir in the U.K. using the decision tree algorithm, linear regression and an evolutionary algorithm. They found that the decision tree algorithm, because of its visible interpretation, was more understandable to reservoir operators and easier to apply in the real world. Hejazi et al. [2] used information theory to

Water 2018, 10, 1372; doi:10.3390/w10101372

www.mdpi.com/journal/water


understand operators' release decisions by investigating historical reservoir release data in the U.S. and revealed strong ties between release decisions and hydrologic information, especially current inflows and previous releases. Corani et al. [3] used a Lazy Learning algorithm to reproduce human decisions in regulating Lake Lugano and achieved high accuracy.

Random Forests (RF), a tree-based algorithm, employs an ensemble prediction of decision trees and usually outperforms a single tree. While retaining the visible interpretation of the decision tree, RF can be exploited to rank the importance of the input variables in explaining the selected output behavior [4]. Caruana and Niculescu-Mizil [5] conducted a large-scale empirical comparison and showed that the RF algorithm achieved excellent performance compared to various data-mining algorithms. In the field of water resources management, Li et al. [6] built an RF model for the prediction of lake water level and concluded that RF could provide information or simulation scenarios for water management and decision-making. Albers et al. [7] evaluated the relative importance of contributing discharges for significant flood events in Canada using RF and showed that RF is an exciting new method of analysis for hydrology. Yang et al. [8] used the Classification and Regression Tree algorithm and the RF algorithm to simulate the controlled outflows from nine major reservoirs in California, and concluded that the reservoir storage volume, seasonality and downstream river stage were extremely important variables for operating the reservoirs in California. Sultana et al. [9] used the RF model to assess business interruption in Germany due to floods and found that the water level was the most important variable influencing business interruption. Tillman et al. [10] used RF classification analysis to investigate the relationship between suspended sediment and salinity in the upper Colorado River basin and concluded that no simple source could explain the relationship between them.

Following the pioneering study of Reference [2], hydrologic information in reservoir operation can be divided into three kinds, namely past, current and future information. That study used past inflows, past storages, past releases, current inflows and forecasted inflows as variables, and established the dependence of reservoir release decisions on each of the five variables individually. Inspired by the work of Reference [8], in which all three kinds of hydrologic information are used to simulate California reservoir outflow operation, we use different combinations of hydrologic information to build outflow simulation models with the RF algorithm in this study. We rank hydrologic information and judge its value from two perspectives: one is the simulation results of the different models, and the other is the importance ranking of the input variables in the RF algorithm. Further, we compare the results of the two perspectives so that they verify each other.

The objectives of this study are to build reservoir outflow simulation models based on the RF algorithm with different combinations of hydrologic information, and to evaluate the influence and value of hydrologic information in outflow decision-making. The rest of the paper is organized as follows: the case study and selected hydrologic information are described first; then the methodology used to build the simulation models is introduced; the results and discussion are presented in the following sections; and finally the conclusions are given.

2. Case Study and Selected Data

The Three Gorges Reservoir (TGR) is an essential backbone project in the development and harnessing of the Yangtze River in China and the world's largest power station in terms of installed capacity (22,500 MW). The TGR has been in operation for more than a decade since 2003 and has accumulated a large amount of reservoir operation data [11]. Ma et al. [12] investigated the hourly operation of TGR in the non-flood season by data mining to improve hydropower generation. Until now, no effort has been undertaken to analyze TGR daily operation for different time periods, such as the flood season and the non-flood season.

In order to build reservoir outflow simulation models, the TGR operation data are categorized into model inputs (decision variables) and output (target variable). After discussion with the decision-maker of TGR, the current model inputs include most of the important hydrologic information used in real-world operation. Similar to Reference [2], we view hydrologic information in reservoir operation


as three different kinds, namely past, current and future information. The types of model inputs and output are summarized as follows:

(1) Past information

It is clear that past outflow is a classic indicator of reservoir operation. Since reservoir operators may refer to information older than the past day, we consider outflow information from the past 1–3 days, i.e., $Q_{t-1}$, $Q_{t-2}$, $Q_{t-3}$.

(2) Current information

The current information also contains three variables: the month of the year (M), which captures the influence of seasonality on reservoir operation; and the reservoir water level (RWL) and the water level at the downstream flood control point (DWL), which are widely used as indicators for guiding reservoir outflow decision-making.

(3) Future information

The forecasted 1-day, 2-day and 3-day inflows, i.e., $I_{t+1}$, $I_{t+2}$, $I_{t+3}$, are the actual predicted values, which are renewed every day in real-world operation. According to the operational inflow forecasting scheme of TGR, the upstream and tributary flows are routed to the reservoir by the Muskingum method [13], and the precipitation records in the interval basin are transformed into runoff with different hydrologic models, such as the unit hydrograph [14] and the Xinanjiang model [15]. The sum of these flow components is the forecasted inflow of TGR.

(4) Model output

The model output is the average outflow of the next day, $Q_{t+1}$.

A summary of the input variables and the output variable is listed in Table 1. A schematic map illustrating the past, current and future hydrologic information is shown in Figure 1.

In practice, reservoir operators may rely on all three kinds of information or on a combination of some of them under certain circumstances and time periods [2]. Considering the actual operation of TGR, the current information, especially the reservoir water level, is indispensable. Since TGR is typically operated to serve different purposes in different periods, we split the data into two parts to further investigate variations in reservoir operation between the flood season (from 1 June to 30 September) and the non-flood season. The case where the data of the whole year are used is also retained as a benchmark. As shown in Table 2, each row represents a combination of information, and each column indicates the time period from which the data set is taken. Therefore, we have nine scenarios for analyzing and building outflow simulation models, in which scenarios 1–6 have six input variables while scenarios 7–9 have nine input variables.

The data set of TGR covers 9 years from 1 June 2008 to 31 May 2017. We use the data from 1 June 2008 to 31 May 2015 for training and cross-validation, and the remainder (1 June 2015 to 31 May 2017) as the test period. These data are downloaded from the Database of TGR.

Table 1. Detailed information of model input and output variables.

| Information | Input/Output Variable Names | Abbr. | Unit | Resolution |
|---|---|---|---|---|
| Past | past 1-day outflow | $Q_{t-1}$ | m³/s | Daily |
| Past | past 2-day outflow | $Q_{t-2}$ | m³/s | Daily |
| Past | past 3-day outflow | $Q_{t-3}$ | m³/s | Daily |
| Current | month | M | - | Monthly |
| Current | reservoir water level | RWL | m | Daily |
| Current | downstream water level | DWL | m | Daily |
| Future | forecasted 1-day inflow | $I_{t+1}$ | m³/s | Daily |
| Future | forecasted 2-day inflow | $I_{t+2}$ | m³/s | Daily |
| Future | forecasted 3-day inflow | $I_{t+3}$ | m³/s | Daily |
| Future | tomorrow average outflow | $Q_{t+1}$ | m³/s | Daily |
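The variable definitions in Table 1 can be sketched as a small routine that turns a chronological daily record into input/output samples. This is only an illustrative sketch: the record layout, the key names (`outflow`, `rwl`, `dwl`, `forecast`) and the function `build_samples` are assumptions made here, not the authors' data format, and the numbers are synthetic.

```python
from datetime import date, timedelta


def build_samples(days):
    """Turn a chronological list of daily records into (date, features, target).

    Each record is a dict with hypothetical keys: 'date', 'outflow' (m3/s),
    'rwl' (m), 'dwl' (m) and 'forecast', the (1-day, 2-day, 3-day) inflow
    forecast issued on that day (m3/s).
    """
    samples = []
    # Day t needs three earlier days for the lags and one later day for the target.
    for t in range(3, len(days) - 1):
        today = days[t]
        i1, i2, i3 = today["forecast"]
        features = {
            "Q_t-1": days[t - 1]["outflow"],  # past information
            "Q_t-2": days[t - 2]["outflow"],
            "Q_t-3": days[t - 3]["outflow"],
            "M": today["date"].month,         # current information
            "RWL": today["rwl"],
            "DWL": today["dwl"],
            "I_t+1": i1,                      # future information
            "I_t+2": i2,
            "I_t+3": i3,
        }
        target = days[t + 1]["outflow"]       # model output Q_t+1
        samples.append((today["date"], features, target))
    return samples


# Synthetic illustration only; these numbers are not TGR data.
days = [
    {
        "date": date(2008, 6, 1) + timedelta(days=i),
        "outflow": 10000 + 100 * i,
        "rwl": 145.0,
        "dwl": 40.0,
        "forecast": (12000, 12500, 13000),
    }
    for i in range(6)
]
samples = build_samples(days)
```

With six synthetic days, only two days have both three earlier outflows and a following-day outflow, so two samples are produced.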


Table 2. Designed nine scenarios for building outflow simulation models.

| Combination of Information | All Year | Flood Season | Non-Flood Season |
|---|---|---|---|
| Past + Current | 1 | 2 | 3 |
| Current + Future | 4 | 5 | 6 |
| Past + Current + Future | 7 | 8 | 9 |
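The scenario design of Table 2 can be written down as a lookup from scenario number to input variables and time period. The sketch below is a minimal illustration under assumed names (`SCENARIOS`, `is_flood_season`, `in_period`, `is_training_day` are hypothetical); the flood season and the training window follow the dates stated in the text.

```python
from datetime import date

PAST = ["Q_t-1", "Q_t-2", "Q_t-3"]
CURRENT = ["M", "RWL", "DWL"]
FUTURE = ["I_t+1", "I_t+2", "I_t+3"]

# Rows of Table 2 (combinations) and columns of Table 2 (time periods).
COMBINATIONS = [PAST + CURRENT, CURRENT + FUTURE, PAST + CURRENT + FUTURE]
PERIODS = ["all_year", "flood", "non_flood"]

# Scenarios are numbered row by row, as in Table 2.
SCENARIOS = {}
number = 1
for inputs in COMBINATIONS:
    for period in PERIODS:
        SCENARIOS[number] = {"inputs": inputs, "period": period}
        number += 1


def is_flood_season(day):
    """Flood season runs from 1 June to 30 September."""
    return 6 <= day.month <= 9


def in_period(day, period):
    """Return True if the day belongs to the given time period."""
    if period == "all_year":
        return True
    flood = is_flood_season(day)
    return flood if period == "flood" else not flood


def is_training_day(day):
    """Training and cross-validation window: 1 June 2008 to 31 May 2015."""
    return date(2008, 6, 1) <= day <= date(2015, 5, 31)
```

For example, `SCENARIOS[5]` selects the current and future variables for the flood season, and `in_period(date(2016, 10, 1), "flood")` is false because October lies outside the flood season.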

Figure 1. Schematic map illustrating the past, current and future hydrologic information.

3. Methodology

3.1. Random Forests Algorithm

In order to establish the reservoir outflow simulation model, namely, the regression model between the above hydrologic input and output variables, we used the Random Forests (RF) algorithm, which can build classification or regression models between input and output variables. As a white-box and nonparametric tree-based data-mining technique, RF is an ensemble of multiple decision trees. As shown in Figure 2a, the tree-like structures are composed of decision nodes, branches, and leaves, which form a cascade of rules leading to classes or numerical values. The tree is obtained by partitioning at each decision node with a proper splitting criterion.

The decision trees in classification RF will eventually divide the whole training data set space into multiple classes. Each class consists of a set of rules that splits the decision variable spaces. The decision trees in regression RF take the average of the target variable values (numerical values) in each class and store the corresponding splitting rules. For regression, the common splitting criterion is to minimize the summation of relative errors in Equation (1) [16].

$$\arg\min\big(RE(d)\big) = \arg\min\left[\sum_{l=1}^{L}\left(y_l-\bar{y}_L\right)^2+\sum_{r=1}^{R}\left(y_r-\bar{y}_R\right)^2\right] \qquad (1)$$

where $y_l$ and $y_r$ are the target variable values in the left and right branches of the decision node, which contain $L$ and $R$ values respectively; $\bar{y}_L$ and $\bar{y}_R$ are the means of the target variable in the resulting branches; and $d$ is the splitting rule of the decision node.

The building procedure of the RF from decision trees is shown in Figure 2b and is described briefly below [6].

Step 1: For each decision tree in the RF, a random subset of the training data set is used. In this way, the training set for each tree is not the same.

Step 2: When constructing decision nodes, the splitting variable of each decision tree is picked from a random subset of all input variables. Step 1 and Step 2 introduce randomness. These two steps make the RF algorithm less prone to over-fitting and give it good resistance to noise.


Step 3: The final output of RF is obtained from the averaged results of each decision tree.

The main parameters to adjust when using RF for regression are estimator and depth. The former is the number of trees in the forest. The larger, the better, but also the longer it will take to compute. In addition, it is noted that results will stop getting significantly better beyond a critical number of trees. The depth of a decision tree is the length of the longest path from the root to a leaf. Large values of depth will lead to fully grown trees, which have a more complicated structure and may over-fit the data. In order to evaluate the regression model and then determine the parameters, we use the explained variance regression score in Equation (2).

$$\text{Explained variance}(y_{tar}, y_{out}) = 1 - \frac{\mathrm{Var}\{y_{tar}-y_{out}\}}{\mathrm{Var}\{y_{tar}\}} \qquad (2)$$

where $y_{tar}$ is the corresponding target output, $y_{out}$ is the output of RF, and Var is the variance. The best possible score is 1.0; lower values are worse.

As mentioned above, the RF algorithm has two main advantages that make it suitable for analyzing reservoir operation data and attractive to decision-makers. First, RF is a nonparametric algorithm, and each path from the top decision node to a leaf can be interpreted as an if-then-else rule, which can provide visible physical interpretation. This visible interpretation stands in contrast to other data mining methods, such as neural networks, which act as a black box from which it cannot be derived how the prediction is achieved. Reservoir operators can judge the quality of the outflow simulation model by analyzing these if-then-else rules. Furthermore, RF provides a measure of the relative importance of input variables, which can help reservoir operators to rank hydrologic information and judge its value quantitatively.
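The explained variance score of Equation (2) can be checked by hand with a few lines of code. The sketch below is a plain-Python illustration using the population variance; the function names are chosen here for illustration and are not taken from the paper.

```python
def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def explained_variance(y_tar, y_out):
    """Equation (2): 1 - Var{y_tar - y_out} / Var{y_tar}."""
    residuals = [t - o for t, o in zip(y_tar, y_out)]
    return 1.0 - variance(residuals) / variance(y_tar)


# A perfect simulation and a simulation with a constant offset both score 1.0,
# because the variance of a constant residual is zero.
perfect = explained_variance([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
offset = explained_variance([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
# Always predicting the mean of the target explains none of the variance.
mean_only = explained_variance([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])
```

One property worth noting is that this score does not penalize a constant bias, whereas a measure built on squared errors, such as NSE, does.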

Figure 2. Demonstration of (a) decision tree structure and (b) RF algorithm.
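To make the splitting criterion of Equation (1) concrete, the following sketch searches for the threshold on a single input variable that minimizes the sum of squared deviations in the two resulting branches. It is a simplified illustration, not the implementation used in the study: a real regression tree repeats this search over several candidate variables and recursively over the branches, and the function names and data are invented here.

```python
def sum_squared_deviation(values):
    """Sum of squared deviations of the values from their own mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Return (threshold, cost) minimizing Equation (1) for one input variable."""
    pairs = sorted(zip(x, y))
    best_threshold, best_cost = None, float("inf")
    for k in range(1, len(pairs)):
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # equal x values cannot be separated by a threshold
        threshold = (pairs[k - 1][0] + pairs[k][0]) / 2
        left = [yv for _, yv in pairs[:k]]    # samples with x below the threshold
        right = [yv for _, yv in pairs[k:]]   # samples with x above the threshold
        cost = sum_squared_deviation(left) + sum_squared_deviation(right)
        if cost < best_cost:
            best_threshold, best_cost = threshold, cost
    return best_threshold, best_cost


# Synthetic example: two clearly separated groups of target values.
threshold, cost = best_split([1, 2, 3, 10, 11, 12], [5, 6, 5, 20, 21, 20])
```

In this synthetic example the best threshold falls midway between 3 and 10, where each branch contains target values that are close to their branch mean.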

3.2. Statistical Measurements of Model Performance

In order to mathematically quantify and compare the performance of the outflow simulation models, we select three statistical measurements [8], namely, root mean square error (RMSE), Nash-Sutcliffe model efficiency (NSE), and Normalized Peak Flow Difference ($\Delta Q_p$). The formulas of these statistical measurements are as follows [17,18]:

$$RMSE = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(Q_{obs,i}-Q_{sim,i}\right)^2} \qquad (3)$$

$$NSE = 1 - \frac{\sum_{i=1}^{N}\left(Q_{obs,i}-Q_{sim,i}\right)^2}{\sum_{i=1}^{N}\left(Q_{obs,i}-\bar{Q}_{obs}\right)^2} \qquad (4)$$

$$\Delta Q_p = \frac{Q_{obs,m}-Q_{sim,m}}{Q_{obs,m}} \times 100\%, \quad m = \arg\max\left(Q_{obs,i}\right),\ i \in \{1, 2, \ldots, N\} \qquad (5)$$

where $Q_{obs}$ and $Q_{sim}$ are the observed and simulated outflow, respectively; $\bar{Q}_{obs}$ is the mean of the observed outflow during the test period; $m$ is the time period when the maximum outflow happens during the test period; and $N$ is the total number of days during the test period.
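Equations (3) to (5) translate directly into code. The sketch below is a plain-Python illustration with synthetic outflows; the function names are chosen here for readability and are not part of the original study.

```python
import math


def rmse(q_obs, q_sim):
    """Equation (3): root mean square error."""
    n = len(q_obs)
    return math.sqrt(sum((o - s) ** 2 for o, s in zip(q_obs, q_sim)) / n)


def nse(q_obs, q_sim):
    """Equation (4): Nash-Sutcliffe model efficiency."""
    mean_obs = sum(q_obs) / len(q_obs)
    numerator = sum((o - s) ** 2 for o, s in zip(q_obs, q_sim))
    denominator = sum((o - mean_obs) ** 2 for o in q_obs)
    return 1.0 - numerator / denominator


def peak_flow_difference(q_obs, q_sim):
    """Equation (5): normalized peak flow difference in percent."""
    m = max(range(len(q_obs)), key=lambda i: q_obs[i])  # day of the observed peak
    return (q_obs[m] - q_sim[m]) / q_obs[m] * 100.0


# Synthetic outflows in m3/s, for illustration only.
q_obs = [10000.0, 20000.0, 30000.0, 20000.0]
q_sim = [11000.0, 19000.0, 27000.0, 21000.0]
example_rmse = rmse(q_obs, q_sim)
example_nse = nse(q_obs, q_sim)
example_dqp = peak_flow_difference(q_obs, q_sim)
```

Note that Equation (5) as written is signed: it is positive when the simulation underestimates the observed peak, as in this synthetic example.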

4. Results

4.1. Candidate Model Parameters and Importance of Input Variables

In this study, to build a simple RF structure that avoids over-fitting, the estimator is chosen from 3, 4, ..., 9, 10, and the depth is chosen from 3, 4, 5, 6. For tuning these two parameters, we adopt a grid search approach, which considers all 32 candidate parameter combinations (8 estimators × 4 depths), and a K-fold cross-validation method (K = 5) for judging the score (explained variance regression score) of each combination. The higher the score, the better the candidate parameter combination. From these 32 RF regression models with different parameter combinations, we try to choose a suitable one as the selected reservoir outflow simulation model.

We use the shuffled training data (2008–2015) for cross-validation and calculate the nine scenarios separately. During the cross-validation process, for each of the nine scenarios, we record the importance scores of the input variables from the RF algorithm. The variable importance scores are shown in Figure 3, in which the ordinates are logarithmic. Comparing these scenarios, we find that $Q_{t-1}$ is the most important variable, and that the importance of $I_{t+1}$ increases significantly when past information is not used. Moreover, comparing the influence of future information between scenarios 8 and 9 yields some interesting findings. During flood season (scenario 8), $I_{t+1}$, $I_{t+2}$ and $I_{t+3}$ are more important, and their importance decreases as the forecast lead time increases. However, $I_{t+1}$, $I_{t+2}$ and $I_{t+3}$ are of nearly the same importance during non-flood season (scenario 9). From the above importance scores of input variables, we find that $Q_{t-1}$ (past information) is the most important variable, that $I_{t+1}$ (future information) becomes the most important variable when past information is absent, and that the forecasted inflow is more important for TGR decision-making during flood season.
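The tuning procedure described above (32 parameter combinations, each scored by shuffled 5-fold cross-validation) can be outlined as follows. This is a schematic sketch only: `score_fn` is a hypothetical callback that would fit an RF model on the training indices and return its explained variance score on the validation indices, and the dummy scorer at the end exists purely to exercise the loop.

```python
import itertools
import random

ESTIMATORS = range(3, 11)  # 3, 4, ..., 10 trees
DEPTHS = range(3, 7)       # 3, 4, 5, 6
GRID = list(itertools.product(ESTIMATORS, DEPTHS))  # 8 x 4 = 32 candidates


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (training, validation) index lists for shuffled K-fold CV."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, folds[i]


def grid_search(n_samples, score_fn, k=5):
    """Return the mean cross-validation score of every candidate combination."""
    results = {}
    for estimator, depth in GRID:
        scores = [
            score_fn(estimator, depth, training, validation)
            for training, validation in k_fold_indices(n_samples, k)
        ]
        results[(estimator, depth)] = sum(scores) / len(scores)
    return results


def dummy_score(estimator, depth, training, validation):
    """Stand-in scorer (not a real model): prefers depth 4 and fewer trees."""
    return -abs(depth - 4) - 0.01 * estimator


results = grid_search(23, dummy_score)
best_combination = max(results, key=results.get)
```

Keeping the fold assignment fixed (same seed) for every candidate means that all 32 combinations are compared on identical data splits.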

Figure 3. Variable importance scores in different scenarios.

4.2. Selected Parameters andand Simulation Results 4.2. Selected Parameters Simulation Results Figure 4 plots the the rank of cross-validation scores. higherscore, score,namely, namely, Figure 4 plots rank of cross-validation scores.The Thehigher higher rank rank means means higher the better candidate parameter combinations.ItItisisobserved observedthat that the the satisfying satisfying results the better candidate parameter combinations. resultsare areobtained obtained depth is four.Moreover, Moreover, considering depth willwill leadlead to a complex tree structure, which whenwhen the the depth is four. consideringmore more depth to a complex tree structure, may over-fit the training data. Therefore, the depth of four is an appropriate value. As for estimator, which may over-fit the training data. Therefore, the depth of four is an appropriate value. As for there there is no obvious difference for different parameters and no unified best choices forchoices nine scenarios. estimator, is no obvious difference for different parameters and no unified best for nine The estimator between 7 and 10 can get good results. So, for reducing calculation time and a scenarios. The estimator between 7 and 10 can get good results. So, for reducing calculation timebetter and a of hydrologic information from different scenarios, we choose estimator = 7 instead of bettercomparison comparison of hydrologic information from different scenarios, we choose estimator = 7 instead different parameter values of each scenario. of different parameter values of each scenario. We regard the RF regression model, which has fixed parameters (four and seven), as the selected We regard the RF regression model, which has fixed parameters (four and seven), as the selected reservoir outflow simulation model. To examine the impact of different hydrologic information on reservoir outflow simulation model. 
To examine the impact of different hydrologic information on model’s predictive performance and reflect the value of information, we test the predictive capability model’s predictive and reflect the of information, we test the predictive capability of the reservoirperformance outflow simulation model onvalue the hold-out dataset (2015–2017). Since hold-out data of the reservoir outflow model on theand hold-out dataset (2015–2017). Since hold-out have never been usedsimulation in any training process cross-validation, they are considered here asdata an haveindependent never been test used in anywhich training process and cross-validation, they considered here as period, can fairly evaluate the performance of the are models. For scenarios 1, an 4 independent fairly evaluate the 2017. performance the models. 1, 4 and 7, thetest testperiod, period which is fromcan 1 June 2015 to 31 May For otherofscenarios, onlyFor partscenarios of the data and 7, the (either test period from or 1 June 2015 to 31 May 2017.The Forcomputed other scenarios, part of the data series floodisseason non-flood season) is used. statisticsonly are summarized in seriesTable (either flood season or non-flood season) is used. can Thebe computed areifsummarized 3. According to Reference [19], model simulation judged asstatistics satisfactory NSE is greaterin 0.50. The statistical performances of the simulatedcan outflows are satisfactory for allif nine Tablethan 3. According to Reference [19], model simulation be judged as satisfactory NSEscenarios is greater of NSE in Table 3 ranges from 0.572 to 0.965. After comparison of these than since 0.50. the Thevalues statistical performances of the simulated outflows are satisfactory for allnine ninescenarios, scenarios are twooffindings: sincethere the values NSE in Table 3 ranges from 0.572 to 0.965. After comparison of these nine scenarios, there(1) are Splitting two findings: the data into two parts has no improvement on the model’s performance. 
Compared (1)

with scenario 1, scenarios 2 and 3 do not obviously improve the performance of RMSE, NSE and

Splitting the data into two parts has no improvement on the model’s performance. Compared △Qp in three different time periods. For scenarios 4 to 9, the result is also the same. with scenario 1, scenarios 2 and 3 do not obviously improve the performance of RMSE, NSE and (2) The future information is effective in a particular scenario and time period. The observed and 4Qpsimulated in three different time periods. For scenarios 4 to 9, the result is also the same. reservoir outflows of scenarios 1, 4 and 7 are shown in Figure 5. From Table 3 and (2) The Figure future5,information is effective in a particular and time period. Theslightly observed and we can observe that scenario 1 (withoutscenario future information) performs poorer simulated reservoir outflows of scenarios 1, 4 and 7 are shown in Figure 5. From Table 3 and than the best scenario 7 (with all information). Both of them are far better than scenario 4 Figure 5, we past can observe that scenario 1 (without information) performs slightly poorer (without information). Comparing statisticalfuture performances of scenarios 1 and 7, scenario 7 thanhas theobviously best scenario 7 (with information). them are far better than scenario 4 increasing moreall during flood seasonBoth thanof non-flood season. There is no significant difference these two during non-flood Further, based on the values of scenario NSE, the 7 (without past between information). Comparing statisticalseason. performances of scenarios 1 and 7, scenarios 1 and 7 perform better during non-flood season, while scenario 4 performs much better has obviously increasing more during flood season than non-flood season. There is no significant during flood season. difference between these two during non-flood season. 
Further, based on the values of NSE, the From scenarios and 7facts, perform better during non-flood season, while scenario 4 performs much these1three we can see that there are identical results with the importance of input better during flood season. variables. Namely, past outflow is the most important information, and future information will play a more prominent role during flood season, especially in scenario 4 (without past information).

These three facts agree with the importance ranking of the input variables: past outflow is the most important information, and future information plays a more prominent role during the flood season, especially in scenario 4 (without past information).
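To make the scenario comparison concrete, the following minimal sketch shows how the three information groups could be combined into the inputs of scenarios 1, 4 and 7. The variable names follow the symbols used in the text; the dictionary layout, the helper `select_inputs` and the sample record are hypothetical and only illustrate the idea.

```python
# Input groups as named in the text; the data layout is illustrative only.
PAST = ["Q_t-1", "Q_t-2", "Q_t-3"]      # past outflows
CURRENT = ["M", "RWL", "DWL"]           # current information
FUTURE = ["I_t+1", "I_t+2", "I_t+3"]    # forecasted inflows

SCENARIO_INPUTS = {
    1: PAST + CURRENT,            # without future information
    4: CURRENT + FUTURE,          # without past information
    7: PAST + CURRENT + FUTURE,   # all information
}


def select_inputs(record, scenario):
    """Return the input vector of one daily record for a given scenario."""
    return [record[name] for name in SCENARIO_INPUTS[scenario]]


# Placeholder values for a single day (not real TGR data).
record = {
    "Q_t-1": 21000.0, "Q_t-2": 20500.0, "Q_t-3": 20000.0,
    "M": 7, "RWL": 150.0, "DWL": 65.0,
    "I_t+1": 24000.0, "I_t+2": 26000.0, "I_t+3": 25000.0,
}

for scenario in (1, 4, 7):
    print(scenario, select_inputs(record, scenario))
```

Scenarios 2, 3, 5, 6, 8 and 9 use the same input sets on seasonal subsets of the data, so they need no extra entries here.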

Water 2018, 10, 1372

Table 3. Statistical measurements between the observed and simulated outflows.

| Scenarios | All year RMSE (m³/s) | All year NSE | All year $\Delta Q_p$ | Flood season RMSE (m³/s) | Flood season NSE | Flood season $\Delta Q_p$ | Non-flood season RMSE (m³/s) | Non-flood season NSE | Non-flood season $\Delta Q_p$ |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1225 | 0.959 | 1.9 | 1864 | 0.899 | 1.9 | 717 | 0.950 | 2.2 |
| 2 and 3 | 1181 | 0.962 | 2.5 | 1764 | 0.909 | 2.5 | 732 | 0.948 | 6.5 |
| 4 | 2525 | 0.829 | 7.7 | 3239 | 0.696 | 7.7 | 2077 | 0.587 | 25.9 |
| 5 and 6 | 2506 | 0.832 | 5.0 | 3143 | 0.714 | 5.0 | 2116 | 0.572 | 28.8 |
| 7 | 1141 | 0.965 | 0.9 | 1718 | 0.915 | 0.9 | 690 | 0.954 | 5.4 |
| 8 and 9 | 1195 | 0.961 | 2.4 | 1794 | 0.906 | 2.4 | 729 | 0.949 | 4.9 |
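The statistics reported in Table 3 can be reproduced in principle with a few lines of code. The sketch below uses the standard definitions of RMSE and NSE; the peak error is written as the relative difference between the simulated and observed peak flows in percent, which is an assumption about the form of $\Delta Q_p$ rather than a definition taken from this section. The two short series are made up for illustration.

```python
import math


def rmse(obs, sim):
    """Root mean square error, in the units of the series (m3/s here)."""
    return math.sqrt(sum((o - s) ** 2 for o, s in zip(obs, sim)) / len(obs))


def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit, 0 is no better than
    predicting the observed mean."""
    mean_obs = sum(obs) / len(obs)
    ss_res = sum((o - s) ** 2 for o, s in zip(obs, sim))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot


def peak_error_pct(obs, sim):
    """Assumed form of the peak error: |max(sim) - max(obs)| / max(obs) * 100."""
    return abs(max(sim) - max(obs)) / max(obs) * 100.0


# Made-up daily outflows (m3/s), not TGR records.
observed = [10000.0, 12000.0, 20000.0, 30000.0, 25000.0, 15000.0]
simulated = [10500.0, 11500.0, 21000.0, 28500.0, 25500.0, 15500.0]

print(round(rmse(observed, simulated), 1))
print(round(nse(observed, simulated), 3))
print(round(peak_error_pct(observed, simulated), 1))
```

Applying the same three functions to the flood-season and non-flood-season subsets of a series gives the seasonal columns.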

Figure 4. The rank of cross-validation scores for candidate parameter combinations.

Figure 5. Comparison of observed and simulated reservoir outflows.

5. Discussion

5.1. The Impact of Splitting Data Set by Prior Knowledge

Affected by the monsoon climate and precipitation, 60–80% of the annual inflow of TGR is concentrated in the flood season (June to September) [20]. During the flood season, flood control is dominant among the several utilization functions. Figure 6 shows the kernel distribution of $I_{t+1}$ in the training period (2008–2015) as a violin plot, which reveals a huge difference in inflows between the flood season and the non-flood season.

It would be natural to expect that dividing yearly data sets into seasonal data sets improves model performance. However, Table 3 shows that splitting the data into two parts brings no significant improvement. To explain this, we explored the structure of the outflow simulation models. The visible physical interpretation of tree-based algorithms makes it easy to understand how the outflow simulation model makes the outflow decision.

Taking scenario 4 as an example, which has the poorest performance among scenarios 1, 4 and 7, Figure 7 shows the top of seven decision trees in the outflow simulation model, which reveals the first rule used to make the outflow decision. All seven regression trees split first on the value of $I_{t+1}$, with thresholds between 15,050 and 17,700 m³/s. As shown in Figure 6, when the data are split at 15,050 or 17,700 m³/s, the corresponding months fall mainly into the flood season (June to September) and the non-flood season (other months) of the TGR. Because the trees already recover the seasonal split from $I_{t+1}$, dividing the data by season beforehand adds little. The result indicates that the RF algorithm can effectively extract human experience from historical reservoir operation data. Based on the above discussion, we emphasize the importance of the current time (flood season or non-flood season) for TGR decision-making.
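The first rule of a regression tree is the threshold that best separates the target values into two groups. As a simplified illustration of that idea (a single split on a single feature, not the RF model used in this study), the sketch below searches for the inflow threshold that minimises the summed squared error of outflow in the two child nodes. The twelve monthly records are synthetic and were built with a seasonal contrast; the function name `best_split` is hypothetical.

```python
def best_split(x, y):
    """Return the threshold on x that minimises the summed squared error of y
    in the two child nodes (the usual regression-tree criterion)."""

    def sse(values):
        if not values:
            return 0.0
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    best_threshold, best_cost = None, float("inf")
    levels = sorted(set(x))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for low, high in zip(levels, levels[1:]):
        threshold = (low + high) / 2.0
        left = [yi for xi, yi in zip(x, y) if xi <= threshold]
        right = [yi for xi, yi in zip(x, y) if xi > threshold]
        cost = sse(left) + sse(right)
        if cost < best_cost:
            best_threshold, best_cost = threshold, cost
    return best_threshold


# Synthetic records: (month, forecasted inflow, outflow), flows in m3/s.
records = [
    (1, 5000, 6000), (2, 4800, 5900), (3, 6000, 6500), (4, 8000, 8500),
    (5, 12000, 13000), (6, 18000, 17000), (7, 30000, 27000),
    (8, 28000, 26000), (9, 24000, 22000), (10, 14000, 12000),
    (11, 9000, 8000), (12, 6000, 6200),
]
inflow = [r[1] for r in records]
outflow = [r[2] for r in records]

threshold = best_split(inflow, outflow)
high_months = sorted(r[0] for r in records if r[1] > threshold)
print(threshold, high_months)
```

On these synthetic records the months above the threshold are June to September, which mirrors how a split on forecasted inflow can act as an implicit season indicator.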

Figure 6. The kernel distribution of forecasted 1-day inflow in different months.

Figure 7. The top of seven decision trees in the outflow simulation model of scenario 4.

5.2. Past Information Is the Most Important Information

We try to explain why past information is the most important. One explanation is that past information is known and accurate, while future information is forecasted with uncertainty [21]. Compared with past information, current information cannot determine the outflow alone: outflows will be quite different under different inflow situations even when M, RWL and DWL are the same. However, if there is no flood process, keeping the past outflow would not be a bad choice.

Another, more interesting explanation is that past outflow does not only contain past information. Let us imagine how the operator of TGR made the outflow decision yesterday. Yesterday, the operator already had an inflow forecast, and so naturally took both the forecast and the state of the reservoir into consideration when making the outflow decision. Past outflow therefore contains much more information than its surface meaning suggests. The above analysis is our speculation about why past information is the most important. To support it quantitatively, Figure 8 shows the correlation between all variables as a heat map. The first three variables, which describe past outflow, have the closest correlation with the output, reservoir outflow. The variables ranking from fourth to sixth are the future information.

Figure 8. Correlation matrix between all input variables and outflow.
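The correlation between past outflow and the current outflow can be computed directly from a daily series. The following sketch assumes that the heat map is based on the Pearson correlation coefficient, which is not stated in this section, and uses a made-up flood hydrograph; `lagged_pairs` is a hypothetical helper.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def lagged_pairs(series, lag):
    """Return (Q_{t-lag}, Q_t) as two aligned lists."""
    return series[:-lag], series[lag:]


# Made-up daily outflows (m3/s) over one flood event, not TGR records.
outflow = [8000, 8500, 9500, 12000, 16000, 22000, 27000, 30000,
           29000, 25000, 20000, 15000, 11000, 9000, 8200]

correlations = {lag: pearson(*lagged_pairs(outflow, lag)) for lag in (1, 2, 3)}
for lag, r in correlations.items():
    print(f"corr(Q_t-{lag}, Q_t) = {r:.3f}")
```

The absolute values depend entirely on the sample series and say nothing about TGR; the point is only that each lagged outflow is a separate input column whose correlation with the output can be ranked against the other inputs.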

5.3. Future Information in Particular Scenario and Time Period

Let us consider why future information plays a more prominent role during the flood season. Forecasted inflow is of great importance to reservoir release decisions under high hydrologic uncertainty, which is a general conclusion given by Reference [2]. Figure 6 shows that the inflow of TGR has high uncertainty during the flood season, especially from July to September, when the kernel distribution is nearly a line. It is therefore easy to understand why future information plays the most important role in scenario 4 during the flood season. Because scenario 4 lacks the forecast information carried implicitly by past outflow ($Q_{t-1}$, $Q_{t-2}$ and $Q_{t-3}$), future information has a great impact on improving the performance of the outflow simulation model.

The change in forecasting accuracy supports the importance ranking of the input variables. The importance of $I_{t+1}$, $I_{t+2}$ and $I_{t+3}$ decreases significantly with lead time in the flood season. Figure 9 shows that the coefficient of determination ($R^2$) between observed and forecasted inflows falls more markedly with lead time in the flood season, and the values of $R^2$ in the non-flood season are higher than those in the flood season. From this point of view, to make full use of future information, we suggest that the operators of TGR improve forecasting accuracy, especially in the flood season.

[Figure 9 panels: forecasted inflow (vertical axis) against observed inflow (horizontal axis) for $I_{t+1}$, $I_{t+2}$ and $I_{t+3}$. The panels carry fitted $R^2$ labels of 0.996, 0.9946, 0.9765, 0.9707, 0.9434 and 0.9085.]

Figure 9. Scatter plots of observed and forecasted inflows: (a) flood season; (b) non-flood season.
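The $R^2$ attached to a scatter plot of this kind is usually that of a least-squares line through the points. The sketch below computes it in that way, which is an assumption about how the figure was produced, and applies it to made-up forecasts whose errors grow with lead time; the function name `linear_fit_r2` and all numbers are illustrative.

```python
def linear_fit_r2(observed, forecast):
    """R2 of the least-squares line forecast = a + b * observed."""
    n = len(observed)
    mean_x = sum(observed) / n
    mean_y = sum(forecast) / n
    sxx = sum((x - mean_x) ** 2 for x in observed)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(observed, forecast))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2
                 for x, y in zip(observed, forecast))
    ss_tot = sum((y - mean_y) ** 2 for y in forecast)
    return 1.0 - ss_res / ss_tot


# Made-up observed inflows (m3/s) and forecasts at lead times of 1-3 days.
observed = [20000.0, 25000.0, 32000.0, 41000.0, 38000.0, 30000.0]
forecasts = {
    1: [20500.0, 24500.0, 32500.0, 40500.0, 38500.0, 29500.0],
    2: [21500.0, 23500.0, 33500.0, 39000.0, 39500.0, 28500.0],
    3: [23000.0, 22000.0, 35000.0, 37500.0, 41000.0, 27000.0],
}

r2 = {lead: linear_fit_r2(observed, f) for lead, f in forecasts.items()}
for lead, value in r2.items():
    print(f"lead {lead} d: R2 = {value:.3f}")
```

Running the same calculation separately on flood-season and non-flood-season days gives the kind of comparison discussed above.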

5.4. The Practical Application of This Study

From the point of view of downstream water users, the controlled outflows from an upstream reservoir depend strongly on the decisions made by the reservoir operators rather than on a natural inflow process. To establish proper and useful water management plans, downstream water users need to understand the operation pattern of the upstream reservoir and, beyond that, build models to estimate its outflows. Our reservoir outflow simulation model meets these needs, and its visible physical interpretation can further help water users understand the operation pattern of the upstream reservoir.

From the reservoir operators' point of view, the simulation model contains their experience, and they can make corrections based on the model output. The corrected value can be used as the daily outflow decision in real-world reservoir operation, so the simulation model will be a useful tool for reservoir operators. In addition to the simulation model, evaluating hydrologic information can help them too. Reservoir operators need hydrologic information to make outflow decisions. From the statistical measurements of the outflow simulation models and the importance analysis of the input variables, we infer the relationship between different groups of hydrologic information and observed outflow. We suggest that reservoir operators of TGR pay close attention to the value of future information, especially in the flood season. In addition, the importance of forecasted inflow clearly decreases as the forecast period lengthens during the flood season. To secure the value of future information, improved forecasting accuracy and rolling forecasts should be provided to reservoir operators.

From the researchers' point of view, we need to close the gap between theoretically optimal and real-world reservoir operation. Many theoretically optimal operations for TGR are based on operating rules, which contain different hydrologic information variables [22–26]. Usually, these variables are selected according to the researchers' experience. However, which variables should be recommended and selected? In this study, we first show that TGR is operated differently in the flood season and the non-flood season; thus, more realistic seasonal operating rules should be established. Second, we suggest that operating rules should contain the previous outflow, which has the strongest ties with outflow decisions. Last, we show that forecasted inflow is of great importance to reservoir outflow decisions in the flood season, so forecasted inflow is highly recommended for inclusion in flood control operating rules.

6. Conclusions

In this study, the RF algorithm was proposed to build a reservoir outflow simulation model for TGR in China. Different simulation models were established based on the combinations of three groups of hydrologic information.
The influences and value of hydrologic information for reservoir outflow decision-making were evaluated. The following findings can be drawn:

(1) The statistical performances of simulation results demonstrate that the RF algorithm can reasonably simulate outflow decisions. The RF with visible physical interpretation and variables importance measure is suitable and helpful for evaluating the value of hydrologic information.

(2) The past outflow is the most important information for reservoir operator decision-making. The forecasted inflow is more important during flood season than non-flood season in outflow decision-making.

(3) The proposed reservoir outflow simulation model is useful for downstream water users and operators of TGR. The value analysis of hydrologic information will help reservoir operators and theoretical optimization researchers of TGR make better use of hydrological information in practice and study.

Author Contributions: Conceptualization and Software, K.C. and S.G.; Data Curation, T.X. and S.S.; Formal Analysis, S.H. and Y.Z.; Writing-Original Draft Preparation, K.C.; Writing-Review & Editing, S.G.

Funding: This paper was funded by the National Key R&D Plan of China (Grant No. 2016YFC0402206) and the National Natural Science Foundation of China (Grant No. 51539009; 51879192).

Acknowledgments: The authors are very grateful to the China Yangtze Power Co., Ltd. and Middle Changjiang River Bureau of Hydrology and Water Resources Survey for providing valuable data.

Conflicts of Interest: The authors declare no conflict of interest.

References

1. Bessler, F.T.; Savic, D.A.; Walters, G.A. Water reservoir control with data mining. J. Water Res. Plan. Manag. 2003, 129, 26–34.
2. Hejazi, M.I.; Cai, X.; Ruddell, B.L. The role of hydrologic information in reservoir operation – Learning from historical releases. Adv. Water Resour. 2008, 31, 1636–1650.
3. Corani, G.; Rizzoli, A.E.; Salvetti, A.; Zaffalon, M. Reproducing human decisions in reservoir management: The case of lake Lugano. In Information Technologies in Environmental Engineering; Springer: Berlin, Germany, 2009; pp. 252–263.
4. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
5. Caruana, R.; Niculescu-Mizil, A. An empirical comparison of supervised learning algorithms. In Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, USA, 25–29 June 2006; ACM: Pittsburgh, PA, USA, 2006; pp. 161–168.
6. Li, B.; Yang, G.S.; Wan, R.R.; Dai, X.; Zhang, Y.H. Comparison of Random Forests and other statistical methods for the prediction of lake water level: A case study of the Poyang lake in China. Hydrol. Res. 2016, 47, 69–83.
7. Albers, S.J.; Dery, S.J.; Petticrew, E.L. Flooding in the Nechako river basin of Canada: A Random Forest modeling approach to flood analysis in a regulated reservoir system. Can. Water Resour. J. 2016, 41, 250–260.
8. Yang, T.; Gao, X.; Sorooshian, S.; Li, X. Simulating California reservoir operation using the classification and regression-tree algorithm combined with a shuffled cross-validation scheme. Water Resour. Res. 2016, 52, 1626–1651.
9. Sultana, Z.; Sieg, T.; Kellermann, P.; Müller, M.; Kreibich, H. Assessment of business interruption of flood-affected companies using Random Forests. Water 2018, 10, 1049.
10. Tillman, F.D.; Anning, D.W.; Heilman, J.A.; Buto, S.G.; Miller, M.P. Managing salinity in upper Colorado river basin streams: Selecting catchments for sediment control efforts using watershed characteristics and Random Forests models. Water 2018, 10, 676.
11. Zhang, J.; Feng, L.; Chen, L.; Wang, D.; Dai, M.; Xu, W.; Yan, T. Water compensation and its implication of the Three Gorges Reservoir for the river-lake system in the middle Yangtze river, China. Water 2018, 10, 1011.
12. Ma, C.; Lian, J.J.; Wang, J.N. Short-term optimal operation of Three-gorge and Gezhouba cascade hydropower stations in non-flood season with operation rules from data mining. Energy Convers. Manag. 2013, 65, 616–627.
13. Cunge, J.A. On the subject of a flood propagation computation method (Muskingum method). J. Hydraul. Res. 1969, 7, 205–230.
14. Nash, J. The form of the instantaneous unit hydrograph. Int. Assoc. Sci. Hydrol. Publ. 1957, 3, 114–121.
15. Ren-Jun, Z. The Xinanjiang model applied in China. J. Hydrol. 1992, 135, 371–381.
16. Hancock, T.; Put, R.; Coomans, D.; Vander Heyden, Y.; Everingham, Y. A performance comparison of modern statistical techniques for molecular descriptor selection and retention prediction in chromatographic QSRR studies. Chemom. Intell. Lab. Syst. 2005, 76, 185–196.
17. Yin, J.B.; Guo, S.L.; He, S.K.; Guo, J.L.; Hong, X.J.; Liu, Z.J. A copula-based analysis of projected climate changes to bivariate flood quantiles. J. Hydrol. 2018, 566, 23–42.
18. Yin, J.B.; Guo, S.L.; Liu, Z.J.; Yang, G.; Zhong, Y.X.; Liu, D.D. Uncertainty analysis of bivariate design flood estimation and its impacts on reservoir routing. Water Resour. Manag. 2018, 32, 1795–1809.
19. Moriasi, D.N.; Arnold, J.G.; Van Liew, M.W.; Bingner, R.L.; Harmel, R.D.; Veith, T.L. Model evaluation guidelines for systematic quantification of accuracy in watershed simulations. Trans. ASABE 2007, 50, 885–900.
20. Wu, X.S.; Guo, S.L.; Yin, J.B.; Yang, G.; Zhong, Y.X.; Liu, D.D. On the event-based extreme precipitation across China: Time distribution patterns, trends, and return levels. J. Hydrol. 2018, 562, 305–317.
21. Wu, X.S.; Wang, Z.L.; Guo, S.L.; Liao, W.L.; Zeng, Z.Y.; Chen, X.H. Scenario-based projections of future urban inundation within a coupled hydrodynamic model framework: A case study in Dongguan city, China. J. Hydrol. 2017, 547, 428–442.
22. Liu, X.Y.; Guo, S.L.; Liu, P.; Chen, L.; Li, X.A. Deriving optimal refill rules for multi-purpose reservoir operation. Water Resour. Manag. 2011, 25, 431–448.
23. Guo, S.L.; Chen, J.H.; Li, Y.; Liu, P.; Li, T.Y. Joint operation of the multi-reservoir system of the Three Gorges and the Qingjiang cascade reservoirs. Energies 2011, 4, 1036–1050.
24. Li, Y.; Guo, S.L.; Guo, J.L.; Wang, Y.; Li, T.Y.; Chen, J.H. Deriving the optimal refill rule for multi-purpose reservoir considering flood control risk. J. Hydro-Environ. Res. 2014, 8, 248–259.
25. Mu, J.; Ma, C.; Zhao, J.Q.; Lian, J.J. Optimal operation rules of Three-Gorge and Gezhouba cascade hydropower stations in flood season. Energy Convers. Manag. 2015, 96, 159–174.
26. Zhou, Y.L.; Guo, S.L.; Xu, C.Y.; Liu, P.; Qin, H. Deriving joint optimal refill rules for cascade reservoirs with multi-objective evaluation. J. Hydrol. 2015, 524, 166–181.

© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
