Physics and Chemistry of the Earth 50–52 (2012) 34–43
Physics and Chemistry of the Earth journal homepage: www.elsevier.com/locate/pce
Infilling of missing rainfall and streamflow data in the Shire River basin, Malawi – A self organizing map approach

F.D. Mwale a,⇑, A.J. Adeloye a, R. Rustum b

a School of Built Environment, Heriot Watt University, Riccarton, Edinburgh EH14 4AS, UK
b School of Built Environment, Heriot Watt University, Dubai Campus, United Arab Emirates
Article info

Article history: Available online 28 September 2012

Keywords: Hydrometeorological data infilling; Malawi; Rainfall and streamflow; Self-organizing maps
Abstract

A major requirement for the assessment, development and sustainable use of water resources is the availability of good quality hydrological time series data of sufficiently long duration. However, it is not uncommon to find data that are riddled with gaps, characterized by questionable quality and short durations. Sometimes, the data are just not available. Such situations are most prevalent in developing countries and the consequence is a high degree of uncertainty in the assessed characteristics of water management schemes and ultimately their ineffectual performance. Thus dealing with these problems is an important exercise in hydrological analyses. This paper focuses on the multivariate infilling of gaps for rainfall and streamflow data in the Shire River basin in Malawi, using a self organizing map (SOM) approach, which is a form of unsupervised artificial neural network. The results show that this approach can produce reliable estimates of hydro-meteorological data, thus offering promise for reducing the uncertainties associated with the use of insufficient data for water resources assessment. © 2012 Elsevier Ltd. All rights reserved.
1. Introduction

Effective planning, management and control of water resources systems require data on relevant hydrometeorological variables such as rainfall, stream flow, evapotranspiration and temperature (Khalil et al., 1998). The importance of good quality data of sufficiently long duration for these operations cannot be over-emphasized, as demonstrated by Adeloye (1990, 1996) for the case of water resources planning. However, it is not uncommon to find data records that exhibit some form of deficiency through inadequate length, dubious quality or the presence of gaps and discontinuities. Such situations are more prevalent in developing countries (Gyau-Boake and Schultz, 1994; Ilunga and Stephenson, 2005; Adeloye, 2011). This deficiency arises from a number of factors, including temporary absence of observers, malfunctioning of monitoring equipment and lack of financial resources. The consequence of using such data is uncertainty and ineffectual performance of water resource systems (Adeloye, 1996, 2011). To redress this problem, the records must be augmented using one of the many available techniques. This paper focuses on data augmentation with respect to rainfall and stream flow data for the Shire River basin in Malawi using self-organizing map (SOM) artificial neural networks, as part of an on-going flood risk analysis study in the Lower Shire River floodplain.

⇑ Corresponding author. E-mail address: [email protected] (F.D. Mwale).
1474-7065/$ - see front matter © 2012 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.pce.2012.09.006

In the next section, some of the available data augmentation methods are briefly reviewed, emphasizing their relative strengths and limitations. Further detail about the SOM is then given, followed by the methodology, results and conclusions in that order.

2. Infilling methods

A number of methods are adopted in infilling missing data, depending on the length of the gaps, the availability of hydrometeorological data from neighboring stations, the season of the missing values, the climatic region under consideration, the knowledge and expertise of the person responsible for correcting the data, the length of the existing data record, and the importance of prediction and hence the performance required of the model used for infilling (Gyau-Boake and Schultz, 1994; Khalil et al., 1998; Rees, 2008). These methods range from simple interpolation to complex statistical methods. When dealing with an auto-series (data from the station for which infilling is to be made), approaches such as simple arithmetic averages (Linacre, 1992; Dinpashoh et al., 2011) and linear interpolation techniques (Yawson et al., 2005) have normally been used. Linear interpolation involves drawing a straight line between two data points, one immediately before the gap and the other immediately after it, and reading the missing value from this line. However, it is probably more common to use surrounding stations acting as donor sites, which can be combined using either weighted averages (e.g. Gyau-Boake and Schultz, 1994; Kumambala, 2010) or linear regression (Abatzoglou et al., 2009; Dastorani
et al., 2010). The weighting factor may take the form of correlation coefficient or a ratio of areas or distance between the station for which values are to be transferred and the donor stations (Adeloye and Rustum, 2012; Mohamoud and Parmar, 2006; Mohamoud, 2008). In linear regression, the variable at the incomplete site becomes the dependent variable while the observations at the donor sites become the independent variables. While most of these traditional methods offer simplicity, there are some challenges. For example, Rees (2008) observes that serial interpolation techniques are only suitable in stable periods i.e. periods having neither flood events nor significant rainfall. In addition, their application is also limited to short lengths of the gap (Hydrology Project and Technical Assistance Training Module (SWDP) #39). For periods with variable flows and longer sequences of missing data, regression analysis and other forms of hydrological modeling are recommended (Rees, 2008; Hydrology Project and Technical Assistance Training Module (SWDP) #39). However, both conceptual and physically based hydrological modeling while offering accurate estimates can be quite resource-intensive for rapid application to a large number of stations (Harvey et al., 2010). For example, the estimation of runoff depends on meteorological (precipitation, evapotranspiration, etc.) factors and catchment characteristics (slope, land use, soil moisture, soil infiltration capacity) for which data may not always be available. This would mean that such models cannot be effectively calibrated thus precluding their use for prediction. Besides, model calibration requirements may constrain portability between catchments (Harvey et al., 2010). Another important aspect in some of these traditional infilling methods is an implicit or explicit assumption of linearity between variables (Khalil et al., 1998), which may not be true. 
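As a point of reference for the traditional methods reviewed above, the following minimal sketch shows linear interpolation across a gap in a single series and a weighted average of donor stations. The function names, the toy arrays and the weights are illustrative assumptions and are not taken from the cited studies.

```python
import numpy as np

def interpolate_gaps(series):
    """Fill NaN gaps with a straight line between the nearest known values."""
    y = np.asarray(series, dtype=float).copy()
    idx = np.arange(y.size)
    known = ~np.isnan(y)
    # np.interp holds the end values constant outside the known range.
    y[~known] = np.interp(idx[~known], idx[known], y[known])
    return y

def donor_weighted_average(donors, weights):
    """Weighted average of donor-station values at each time step.

    donors: (time, stations) array that may contain NaN.
    weights: one weight per station (e.g. a correlation or distance factor).
    Donors that are missing at a time step are left out of that step's average.
    """
    d = np.asarray(donors, dtype=float)
    w = np.broadcast_to(np.asarray(weights, dtype=float), d.shape)
    avail = ~np.isnan(d)
    num = np.where(avail, d * w, 0.0).sum(axis=1)
    den = np.where(avail, w, 0.0).sum(axis=1)
    # No estimate is possible where every donor is missing.
    return np.where(den > 0, num / np.where(den > 0, den, 1.0), np.nan)

flow = [10.0, np.nan, np.nan, 16.0]
filled = interpolate_gaps(flow)

donors = np.array([[2.0, 4.0], [np.nan, 6.0], [np.nan, np.nan]])
estimate = donor_weighted_average(donors, weights=[0.75, 0.25])
```

The last row of `donors` has no available value, so the estimate for that time step stays missing. Donor-based methods cannot help when the gaps at neighboring stations coincide.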
Since the augmentation procedure is normally based on data from neighboring stations, Adeloye (2009) observes that reconstruction of a data set using regression methods may not be feasible when the predictor is missing. Besides, classical regression methods normally analyze one predictand (or dependent variable) at a time; for a large number of variables, developing a separate predictive regression equation for each can be time consuming (Rustum, 2009).

The above limitations of traditional approaches to data infilling have fueled the increasing attention being given to data-driven models, with artificial neural networks (ANNs) being the most widely used. Their appeal has been well documented, e.g. by Thirumalaiah and Deo (1998), Dawson and Wilby (1998), Kneale et al. (2004), Lekkas et al. (2004) and Minns and Hall (2004), and stems from their ability to model complex nonlinear patterns, their ability to work without a priori knowledge of the underlying process, and their robustness to missing data during training or calibration. ANNs with various configurations have been successfully applied to a wide array of problems, including the infilling of stream flow and precipitation data, e.g. Ilunga and Stephenson (2005), Coulibaly and Evora (2007) and Ogwueleka and Ogwueleka (2009). Not only do ANNs address some of the challenges faced with the traditional methods, their performance has also been shown to be significantly better (Dastorani et al., 2010; Starrett et al., 2010). In particular, several recent studies (e.g. Rustum and Adeloye, 2007; Kalteth and Hjorth, 2009) have found that the self organizing map (SOM), an unsupervised ANN, performed better than the multi-layer perceptron artificial neural networks (MLP-ANNs) most widely used in water resources. SOMs are also very robust to missing data during training (Malek et al., 2008), whereas MLP-ANNs require a complete data set for training. Thus, if data are missing, an off-line pre-processing step to provide estimates of the data in the input space is mandatory before the training of MLP-ANNs can proceed (Rustum and Adeloye, 2007).

As a result of the above attributes, this study has used the SOM for augmenting data in the Shire basin. Common to all the data available in this flood-plain basin is the considerable proportion
of missing data, which would naturally preclude the use of MLP-ANNs or other regression-based methods. Indeed, where such noise is present in the data, feed-forward MLP-ANNs have been known to give unrealistic results (Rustum et al., 2007). In contrast, unsupervised ANNs, typified by the SOM, have distinct input and output layers but no specific input or output variables, since all the variables in the input vectors are also contained in each node of the output layer. Rather, the SOM approach involves clustering a high-dimensional array in the input layer into a smaller, usually 2-dimensional, array in the output layer. Each node of the output layer (whose contents are also termed the code vector) thus holds exactly the same variables as each of the input vectors. These nodal variables, however, represent the essential features of the closely related input vectors that have been clustered around the node, thus making any inherent correlations between the vectors in the array much more visible. Because the SOM clusters the data, it is able to handle missing data in the input vectors, as well as provide robust estimates for such missing values by equating them to their corresponding values in the features (or code vector) of the output node "closest" to the incomplete input vector. The way this is done, especially with respect to evaluating the "closeness" between an input vector and the cluster centers, is fully explained in Section 3.2.

In the next section, further details about the SOM are provided, emphasizing its use in multivariate prediction. This is followed by the case study description and analysis. The results are then presented and discussed, and finally the main conclusions are presented.

3. Self organizing maps

SOMs are a competitive, unsupervised form of artificial neural network pioneered by the Finnish professor Teuvo Kohonen (Kohonen et al., 1996).
They provide a means of compressing multi-dimensional data onto a lower-dimensional discrete map, usually of two dimensions, although higher dimensions are possible but less common (Haykin, 1999). They also cluster input patterns in such a way that similar patterns are represented by the same output neuron or by one of its neighbors. The information in a SOM is stored in such a way that any topological relationships within the training set are maintained. This implies that the SOM translates the statistical dependencies between the data into geometric relationships, therefore maintaining the most important topological and metric information contained in the original data (Rustum, 2009).

3.1. Basics of the SOM

The SOM (also called feature map or Kohonen map) is one of the most widely used artificial neural network algorithms (Kohonen et al., 1996). It is usually presented as a two-dimensional grid or map whose units (nodes or neurons) become tuned to different input data patterns. Its algorithms are based on unsupervised competitive learning, which means that training is entirely data driven and the neurons or nodes on the map compete with each other (Alhoniemi et al., 1999). The principal goal of the SOM is to transform an incoming signal pattern of arbitrary dimension into a two-dimensional discrete map. It involves clustering the input patterns in such a way that similar patterns are represented by the same output neuron, or by one of its neighbors (Back et al., 1998). In this way, the SOM can be viewed as a tool for reducing the amount of data by clustering, thus converting complex, nonlinear statistical relationships between high dimensional data into simple relationships on a low
dimensional display (Kohonen et al., 1996). This mapping preserves the most important topological and metric relationships of the original data elements, implying that not much information is lost during the mapping.

As remarked earlier, the SOM consists of two layers: the multidimensional input layer and the competitive or output layer. These layers are fully interconnected, as illustrated in Fig. 1. The output layer consists of M neurons arranged in a two-dimensional grid of nodes. Each node or neuron i (i = 1, 2, ..., M) is represented by an n-dimensional weight (or reference, or code) vector Wi = [wi1, ..., win], where n is the dimension of each input vector, i.e. the maximum number of variables in the input vector. In other words, each neuron in the output layer of the SOM contains exactly the same set of variables contained in the input vectors and thus, unlike in MLP-ANNs, variables in the SOM are not partitioned into input or output variables. Garcia and Gonzalez (2004) offer guidance on determining the optimum number of neurons, which is:
$$M = 5\sqrt{N} \tag{1}$$
where N is the total number of data samples. Once M is known, the number of rows and columns in the SOM can be determined. A guideline by Garcia and Gonzalez (2004) on the dimensions of M is that:
$$\frac{l_1}{l_2} = \sqrt{\frac{e_1}{e_2}} \tag{2}$$
where l1 and l2 are the number of rows and columns respectively, e1 is the biggest eigenvalue of the training data set and e2 is the second biggest eigenvalue.
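The two guidelines can be combined to suggest a grid shape. The sketch below assumes that the eigenvalues in Eq. (2) are those of the covariance matrix of a complete (gap-free) training matrix; the text does not state which matrix is meant, and the function name `som_grid_shape` is hypothetical.

```python
import numpy as np

def som_grid_shape(data):
    """Suggest (rows, cols) from M = 5*sqrt(N) and l1/l2 = sqrt(e1/e2)."""
    x = np.asarray(data, dtype=float)
    n_samples = x.shape[0]
    m_units = 5.0 * np.sqrt(n_samples)                   # Eq. (1)
    # Two largest eigenvalues of the covariance matrix (an assumption, see above).
    eig = np.sort(np.linalg.eigvalsh(np.cov(x, rowvar=False)))[::-1]
    ratio = np.sqrt(eig[0] / eig[1])                     # Eq. (2): l1 / l2
    # l1 * l2 = M and l1 / l2 = ratio, hence l2 = sqrt(M / ratio).
    cols = max(1, int(round(np.sqrt(m_units / ratio))))
    rows = max(1, int(round(m_units / cols)))
    return rows, cols

# Toy data: two uncorrelated variables whose variances differ by a factor of 4.
a = np.tile(np.arange(20.0), 20)
b = np.repeat(np.arange(20.0), 20)
shape = som_grid_shape(np.column_stack([2.0 * a, b]))

# Eq. (1) for the 11201 daily vectors used in Section 4.2 of the paper.
m_paper = round(5 * np.sqrt(11201))
```

With 400 toy samples Eq. (1) gives M = 100, and the eigenvalue ratio of 4 gives l1/l2 = 2, so the rounded suggestion is a 14 by 7 grid. For the 11201 vectors of this study, Eq. (1) gives roughly 529 units.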
3.2. Training the SOM

The multi-dimensional input data are first standardized by deducting the mean and then dividing the result by the standard deviation. To start, the neurons in the output layer are seeded with randomly generated, standardized values. A standardized input vector is then chosen at random and presented to each of the individual neurons of the SOM for comparison with their code vectors in order to identify the code vector most similar to the presented input vector. The identification uses the Euclidian distance, which is defined as:

$$D_i = \sqrt{\sum_{j=1}^{n} m_j (x_j - w_{ij})^2}, \quad i = 1, 2, \ldots, M \tag{3}$$

where Di is the Euclidian distance between the input vector and the code vector i; xj is the jth element of the current input vector; wij is the jth element of the code vector i; n is the dimension of the input vector; and mj is the so called "mask", which is used to include in (mj = 1), or exclude from (mj = 0), the calculation of the Euclidian distance the contribution of a given element xj of the input vector. This is very useful where the input vector contains missing elements because all that needs to be done is to set the mask (mj) to zero for such elements. In this way, the SOM is able to handle missing values in the input vector without any problem.

The neuron whose vector most closely matches the input data vector (i.e. for which Di is minimum) is chosen as the winning node or best matching unit (BMU). The code vectors of this winning node and those of its adjacent neurons are then adjusted to match the input data using Eq. (4), thus bringing the code vectors further into agreement with the input vector (Vesanto et al., 2000).

$$w_i(t+1) = w_i(t) + \alpha(t)\, h_{ci}(t)\, [x(t) - w_i(t)] \tag{4}$$

where t denotes time, α(t) is the learning rate at t, hci(t) is the neighborhood function centered in the winner unit c at time t and all the other variables are as defined previously. In this manner each node in the map internally develops the ability to recognize input vectors similar to itself. This characteristic is referred to as self-organizing, because no external information is supplied to lead to a classification (Penn, 2005). The process of comparison and adjustment continues until the optimal number of iterations is reached or the specified error criteria are attained.

Both the learning rate and the neighborhood function affect the learning effectiveness of the SOM and must be chosen carefully. In particular, the learning rate decreases monotonically with increased number of iterations as in the following equation:

$$\alpha(t) = \alpha_0 \left(0.005/\alpha_0\right)^{t/T} \tag{5}$$

where α0 is the initial learning rate and T is the training length (Vesanto et al., 2000), thus forcing the weight vector to converge. In general, best results are obtained by setting T = 250/N^0.5 (SOM toolbox for Matlab 5 – see http://www.cis.hut.fi). The neighborhood function is normally chosen to be Gaussian centered in the winner unit c, such that:

$$h_{ci}(t) = \exp\left(-\|r_c - r_i\|^2 / (2\sigma^2(t))\right) \tag{6}$$

where rc and ri are the positions of nodes c and i on the SOM grid and σ(t) is the neighborhood radius. Like the learning rate α(t), σ(t) also decreases monotonically as the number of iterations increases.

The quality of the trained SOM is measured by the total average quantization error and total topographic error. The quantization error is:

$$q_e = \frac{1}{N} \sum_{i=1}^{N} \|X_i - W_c\| \tag{7}$$

where qe is the quantization error, Xi is the ith data sample or vector, Wc is the prototype vector of the best matching unit for Xi and ||·|| denotes the Euclidian distance (Eq. (3)). The topographic error is:

$$t_e = \frac{1}{N} \sum_{i=1}^{N} u(X_i) \tag{8}$$

where u(Xi) is a binary integer equal to 1 if the first and second best matching units for Xi are not adjacent units; otherwise it is zero.

Fig. 1. The architecture of SOM. Source: Fei et al. (2006).

Fig. 2. Prediction of missing components of the input vector using SOM: the known values of the input vector drive the BMU search, and the missing values ("?") are predicted from the BMU. Source: Rustum and Adeloye (2007).

The SOM can be used for many practical tasks, such as the reduction of the amount of training data for model identification, nonlinear interpolation and extrapolation (i.e. prediction), generalization and compression of information for easy transmission
(Kangas and Simula, 1995; Kohonen et al., 1996; Tananaki et al., 2007).

3.3. Use of the SOM for prediction

Once the SOM has been fully and effectively trained as described above, it is ready to be used for prediction. The application of the SOM for data record infilling, the main purpose of this study, is illustrated in Fig. 2 (see also Rustum and Adeloye, 2007). As evident in Fig. 2, more than one variable may need to be predicted in a single input vector; in the illustration, three variables of the input vector need to be predicted. This ability to predict multiple variables simultaneously is what makes the SOM a much more versatile tool than classical regression. The multivariate prediction using the SOM proceeds as follows:

i. Decide on the variables needing prediction in the input vector. These will be variables that are unavailable because they are actually missing (e.g. missing river stage and discharge), resulting in a depleted input vector. In the schematic of Fig. 2, the input vector has three variables missing, which are represented by "?".
ii. Determine the Euclidian distance, D, of the depleted vector from each of the nodes of the output layer of the trained SOM using Eq. (3). In doing this, the mask, mj, of each of the unavailable variables is set to zero while mj is set to unity for all the other variables in the input vector.
iii. Examine all the D's for the minimum and hence isolate the SOM's BMU for the depleted input vector. It should be noted that while the input vector in step (i) above is depleted, i.e. has variables missing, the BMU identified here is a node of a trained SOM and hence has the full complement of variables.
iv. Replace the missing values of the input vector by their corresponding values in the BMU identified in step (iii) above.

4. Methodology

4.1. Study area and data

The infilling exercise focuses on gauging and rainfall stations within the Shire River basin of Malawi, as the Lower Shire floodplain for which the flood risk assessment is to be conducted falls within this basin. The Shire River basin is a sub-basin of the Lake Malawi/Shire River basin and lies below Lake Malawi (Fig. 3). It is drained by the Shire River, the only outlet of Lake Malawi. The catchment size quoted for the Shire River basin varies across the literature because of differences in the values given for the catchment size of the Lake Malawi/Shire River basin at the confluence with the Zambezi River and for that of the Lake Malawi catchment alone. For example, Shela (2000) gives a value of 150,000 km² as the Lake Malawi/Shire River basin catchment size, whilst according to Beilfuss and Dos Santos (2001) the catchment size is 154,000 km². Similarly, according to Shela the Lake Malawi catchment alone is 126,500 km², while Beilfuss and Dos Santos put this value at 126,550 km². Thus the Shire River basin may be considered to be between 23,000 km² and 27,500 km² in size at its confluence with the Zambezi.

The infilling was applied to average daily values of water level, stream flow and rainfall from seven main river gauging stations and 16 rainfall stations in the basin (see Fig. 3). Although the basin extends into Mozambique, only stations that fall within Malawi were used. In addition, the stations used herein are those whose data were readily available. The river stations are Mangochi, Liwonde, Chikwawa, Chiromo, Tengani and Nsanje on the Shire and Sinoya on the Ruo. Of these, only Mangochi, Liwonde, Chikwawa, Chiromo and Sinoya have both flow and water level data; the others (Tengani and Nsanje) have water level data only. The rainfall stations are Nsanje, Makhanga, Ngabu, Chikwawa, Nchalo, Neno, Mwanza, Mimosa, Thyolo, Bvumbwe, Chileka, Chichiri, Makoka, Chingale, Balaka and Mangochi. The rainfall data were sourced from the Department of Climate Change and Meteorological Services, while flow and water level data were provided by the Ministry of Irrigation and Water Development. The rainfall data were collected using manual rainfall gauges. Similarly, water levels are manually collected using staff gauges. The water levels are then converted to flow using rating curves, but only when the former are available. The historical periods of data used in this study are presented in Table 1, from which it is clear that these are non-uniform.

A further feature of the information presented in Table 1 is that several of the records have missing data and, although this is not evident in Table 1, some of the missing periods overlap, thus making the use of traditional prediction approaches such as regression impossible. For example, it is common to predict discharge from the water level and a rating function. However, when the water level data at a station are missing, the rating curve is useless. In contrast, the prediction function of the SOM is unencumbered by such missing values, as explained previously, as long as there are some values in an input vector, and its application will naturally result in the simultaneous prediction of the water level and the flow. The SOM being developed here thus represents a more complete approach for discharge estimation than the use of a traditional rating curve.

4.2. Application of SOM

The analysis period is taken as 1978–2008, as this period has a substantial overlap of data. Data are arranged in columns, with each column representing a variable to be infilled, e.g. Mangochi daily rainfall, Chikwawa daily flow, etc. In this exercise, there are 28 variables in total, which constitute a single input vector. There are 11201 such vectors in total, corresponding to the number of daily observations (complete and incomplete) in the record. Entries without data are recorded as NaN (Not a Number) to meet Matlab requirements. A SOM toolbox developed at the Laboratory of Information and Computer Science (CIS) at Helsinki University of Technology (http://www.cis.hut.fi/projects/somtoolbox) was used in the MATLAB environment (Mathworks Inc.). A batch training algorithm was adopted. Based on the multivariate relationship that exists between rainfall and runoff data, all data, i.e. flow, water level and rainfall data, constituting 28 variables, were first trained together. This is referred to as Case 1.

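The masked BMU search of Eq. (3), the update of Eqs. (4)–(6) and the replacement step of Section 3.3 can be sketched on a NaN-coded data matrix laid out as described above (rows are daily vectors, columns are variables). This is a minimal sketch under stated assumptions, not the SOM Toolbox implementation: it uses the sequential update of Eq. (4) rather than the batch algorithm adopted in the study, the neighborhood radius schedule and the default parameter values are assumed, the data are synthetic, and all function names are hypothetical.

```python
import numpy as np

def masked_bmu(x, codebook):
    """Index of the best matching unit using only observed elements (Eq. 3)."""
    mask = ~np.isnan(x)                       # m_j = 1 where x_j is observed
    diff = codebook[:, mask] - x[mask]
    return int(np.argmin(np.sqrt((diff ** 2).sum(axis=1))))

def train_som(data, rows, cols, n_iter=2000, alpha0=0.5, seed=0):
    """Sequential SOM training on standardized data that may contain NaN."""
    rng = np.random.default_rng(seed)
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    codebook = rng.standard_normal((rows * cols, data.shape[1]))
    sigma0 = max(rows, cols) / 2.0            # assumed initial neighborhood radius
    for t in range(n_iter):
        x = data[rng.integers(data.shape[0])]
        mask = ~np.isnan(x)
        if not mask.any():
            continue                          # nothing observed in this vector
        c = masked_bmu(x, codebook)
        alpha = alpha0 * (0.005 / alpha0) ** (t / n_iter)   # Eq. (5)
        sigma = sigma0 * (0.5 / sigma0) ** (t / n_iter)     # assumed radius decay
        dist2 = ((grid - grid[c]) ** 2).sum(axis=1)
        h = np.exp(-dist2 / (2.0 * sigma ** 2))             # Eq. (6)
        # Eq. (4), applied to the observed elements only.
        codebook[:, mask] += alpha * h[:, None] * (x[mask] - codebook[:, mask])
    return codebook

def infill(data, codebook):
    """Replace each NaN by the corresponding element of the vector's BMU."""
    filled = data.copy()
    for k, x in enumerate(data):
        miss = np.isnan(x)
        if miss.any() and not miss.all():
            filled[k, miss] = codebook[masked_bmu(x, codebook), miss]
    return filled

# Synthetic example: three correlated variables with about 10% of entries removed.
rng = np.random.default_rng(1)
base = rng.standard_normal(500)
raw = np.column_stack([base, 2.0 * base + 1.0, -base])
raw += 0.05 * rng.standard_normal(raw.shape)
gappy = raw.copy()
gappy[rng.random(raw.shape) < 0.1] = np.nan   # NaN marks a missing entry

mean, std = np.nanmean(gappy, axis=0), np.nanstd(gappy, axis=0)
z = (gappy - mean) / std                      # standardize; NaN stays NaN
codebook = train_som(z, rows=6, cols=6)
filled = infill(z, codebook) * std + mean     # back to original units
```

Observed entries pass through unchanged and only the NaN entries are replaced. A vector with no observed value at all cannot be matched to a BMU and is left as it is, which mirrors the condition in the text that some values must be present in an input vector.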
However, Kalteth and Berndtsson (2007) investigated the ability of the SOM to interpolate rainfall data in a region of Iran with high spatial and temporal variability. They found that results improved with subsequent training on data that were homogeneous. Therefore, two more scenarios were investigated, and these are called Cases 2 and 3 respectively.

Fig. 3. The Shire River basin and location of some gauging and rainfall stations.

Table 1
Record length of data used.

Station        Record length    Proportion of missing data (%)
Flow
  Mangochi     1956–2008        7.0
  Liwonde      1948–2008        2.8
  Chikwawa     1977–1998        11.7
  Chiromo      1953–1998        5.7
  Sinoya       1980–1990        23.0
Water level
  Mangochi     1953–1996        0.3
  Liwonde      1980–2010        2.1
  Chikwawa     1980–2003        10.0
  Chiromo      1970–2009        14.4
  Sinoya       1962–2002        2.4
  Tengani      1970–2006        23.4
  Nsanje       1960–2003        41.5
Rainfall
  Nsanje       1973–2009        8.4
  Makhanga     1953–2010        3.2
  Ngabu        1981–2010        0.0
  Chikwawa     1979–2010        10.7
  Nchalo       1981–2010        1.6
  Neno         1980–2009        14.0
  Mwanza       1935–2006        17.1
  Mimosa       1958–2010        2.5
  Thyolo       1962–2010        0.2
  Bvumbwe      1953–2010        0.6
  Chileka      1961–2010        0.7
  Chichiri     1981–2010        0.0
  Makoka       1981–2010        0.7
  Chingale     1952–2010        15.1
  Balaka       1981–2010        1.4
  Mangochi     1961–2010        0.02

In Case 2, flow and water level data were trained separately from rainfall data. This resulted in two data sets for independent training: 12 variables of flow and level data and 16 variables of rainfall, each with 11201 input vectors. Case 3 dealt only with rainfall. Rainfall stations were split into three clusters based on findings by Ngongondo et al. (2011a) and Ngongondo et al. (2011b). According to Ngongondo et al. (2011a), rainfall in Malawi is highly variable, with spatial correlations being highest only within 20 km of a station. In the southern region, in which the study area falls, there are three homogeneous rainfall regions (Ngongondo et al., 2011b) and these were used. The regions are the predominantly semi-arid, low lying Shire valley that occupies the southern arm of the Malawi Rift Valley, with an average altitude of 84 m above sea level (cluster 1); the southern highlands, with an altitude of above 1000 m above sea level (cluster 2); and areas along Lake Malawi, the upper Shire River basin and the surrounding medium altitude and plain areas, with an average altitude of 632 m above sea level (cluster 3). Based on the information in Table 1, cluster 1 comprises the Nsanje, Makhanga, Ngabu, Chikwawa and Nchalo stations. Stations in cluster 2 are Neno, Mwanza, Mimosa, Thyolo, Bvumbwe, Chileka, Chichiri and Makoka, while Chingale, Balaka and Mangochi fall in cluster 3. The performance of the SOM was assessed through the coefficient of determination (R²) and visual inspection of time series plots.

5. Results and discussion

The performance of the SOM for Case 1 is summarized in Table 2. It is evident from Table 2 that in Case 1, where flow, water level and rainfall are trained together, SOM predictions of rainfall are unsatisfactory while its modeling skill on flow and water level data is fairly good. While all gauge stations register R² values equal to or above 0.8, except for Sinoya flow for which the value is 0.48, the resulting R² values on rainfall data range from 0.38 to 0.60.

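The text does not state which formula was used for R². The sketch below assumes the squared Pearson correlation between observed and SOM-predicted values, evaluated only where an observation exists; another common definition is 1 − SSE/SST, which would give different values for biased predictions. The function name and the toy arrays are illustrative.

```python
import numpy as np

def r_squared(observed, predicted):
    """Squared Pearson correlation over time steps where both values exist."""
    obs = np.asarray(observed, dtype=float)
    pred = np.asarray(predicted, dtype=float)
    ok = ~np.isnan(obs) & ~np.isnan(pred)
    r = np.corrcoef(obs[ok], pred[ok])[0, 1]
    return float(r ** 2)

obs = np.array([1.0, 2.0, np.nan, 4.0, 5.0])
pred = np.array([1.5, 2.5, 3.0, 4.5, 5.5])    # constant offset from the observations
score = r_squared(obs, pred)
```

Under this definition a prediction with a constant offset still scores 1, so a check of the time series plots alongside R², as done in the study, remains useful.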
Table 2
SOM performance in terms of R² based on Case 1.

Flow and water level data      Rainfall data
MangochiL     0.85             NsanjeR       0.47
LiwondeL      0.84             MakhangaR     0.47
ChikwawaL     0.85             NgabuR        0.49
ChiromoL      0.81             ChikwawaR     0.52
SinoyaL       0.79             NchaloR       0.58
TenganiL      0.88             NenoR         0.38
NsanjeL       0.86             MwanzaR       0.56
MangochiF     0.80             MimosaR       0.46
LiwondeF      0.88             ThyoloR       0.56
ChikwawaF     0.83             BvumbweR      0.60
ChiromoF      0.87             ChilekaR      0.48
SinoyaF       0.48             ChichiriR     0.55
                               MakokaR       0.45
                               ChingaleR     0.42
                               BalakaR       0.46
                               MangochiR     0.55

As indicated earlier, Kalteth and Berndtsson (2007) found that the predictive ability of SOM was affected by the correlation in the data set. In Case 1, the existence of little or no correlation between flow/water level and the rainfall data is quite evident from the resulting component planes of flow, water level and rainfall data (Fig. 4). A component plane shows the values of one variable as determined by each unit in the map (Vesanto et al., 2000) and therefore each component plane can be thought of as a slice of a SOM (Adeloye et al., 2011). Component planes are color or gray shaded in a two dimensional lattice. Light colors are indicative of areas in which the variable has high values whilst dark colors illustrate low values. Thus component planes help to visually identify relationships, in terms of correlations, between the variables involved in the analysis. For example, if the color gradients of two
planes are parallel, that is an indication of high positive correlation; anti-parallel gradients imply negative correlation between the variables. From Fig. 4, it can be observed that the coloration of the flow and water level component planes is similar except for Sinoya flow, suggesting that all gauge stations are correlated except for the latter. This is not unexpected, as all the other stations lie on the Shire River whilst Sinoya is a gauge station on the Shire's tributary, the Ruo. Fig. 4 also shows that the flow and water level component planes are quite distinct from the rainfall component planes, with the rainfall planes being similar to one another. These results therefore suggest that there might be little correlation between flow/water level and rainfall, which could affect the predictive ability of the SOM on the data.

Fig. 4. Component planes of flow, water level and rainfall data resulting from Case 1 (U-matrix and one component plane per variable).

To improve the results, especially on rainfall, separate trainings were carried out: one on flow and level data together and the other on rainfall data (Case 2). The respective performances are summarized in Table 3. As seen in Table 3, the agreement between observed values and those predicted by the SOM has improved for both flow/level and rainfall data. For example, the SOM has produced a much improved performance for flow and level at all gauging stations, with R² generally in excess of 0.92, except for Sinoya flow. As in Case 1, the Sinoya flow has the lowest value, 0.81, but this is still quite good. The resulting component planes for flow/water level data are shown in Fig. 5a. As established earlier, similarities do exist between these variables. Similarities between some component planes are more conspicuous. For example, flow and water levels at Mangochi and Liwonde appear strongly related. Similarly, water levels at Chikwawa and those at Chiromo appear to be correlated. For other variables, it is visually difficult to determine the similarity between them.

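Where visual comparison of component planes is inconclusive, one option is to compute the correlation between the planes directly, since each plane is one column of the trained codebook. The sketch below is an illustration under that assumption, with a small hypothetical codebook; it is not a step reported by the authors.

```python
import numpy as np

def component_plane_correlation(codebook):
    """Pearson correlation between component planes (columns of the codebook)."""
    return np.corrcoef(np.asarray(codebook, dtype=float), rowvar=False)

# Hypothetical 4-unit map with three variables.
codebook = np.array([[0.0, 0.1, 3.0],
                     [1.0, 2.1, 2.0],
                     [2.0, 4.1, 1.0],
                     [3.0, 6.1, 0.0]])
corr = component_plane_correlation(codebook)
```

Here the first two planes have parallel gradients (correlation of 1) and the third runs anti-parallel to the first (correlation of −1), matching the visual reading of parallel and anti-parallel color gradients.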
Despite apparent differences between some component planes, the SOM results in predicting flow and water levels are very satisfactory and did not justify further investigation. These results from Case 2 were therefore adopted for infilling of flow and water level data. The range of R² values for rainfall, when trained separately from flow and water levels, also rose to 0.46–0.71. Despite this improvement, however, the overall SOM prediction of rainfall was still adjudged unsatisfactory, prompting the investigation in Case 3. The resulting component planes for the Case 2 rainfall SOM are shown in Fig. 5b. Again, these component planes exhibit some similarities, although on closer inspection some differences are also apparent. For example, the Nsanje, Makhanga, Ngabu, Chikwawa and Nchalo rainfall stations are correlated with one another and different from the rest. These stations all lie in the Lower Shire valley, where rainfall is low and quite erratic, ranging between 650 mm and 750 mm per annum. Unlike the flow and water level data, however, these apparently small differences across the rainfall component planes have not necessarily translated into better performance in terms of R². The results for Case 3 are summarized in Table 4, from which it is apparent that more satisfactory results on rainfall can be achieved when rainfall stations are further grouped into the clusters identified by Ngongondo et al. (2011b). With the exception of Mwanza (0.65), R² values between 0.70 and 0.93 are achieved. The modeling skill of the SOM is most satisfactory in cluster 3, whose stations attain R² of 0.90 or higher. The resulting component planes for this case are shown in Fig. 6(a)–(c).

Table 3
SOM performance in terms of R² based on Case 2.

Flow and water level      Rainfall
Variable     R²           Variable     R²
MangochiF    0.92         NsanjeR      0.52
LiwondeF     0.97         MakhangaR    0.62
ChikwawaF    0.95         NgabuR       0.56
ChiromoF     0.95         ChikwawaR    0.55
SinoyaF      0.81         NchaloR      0.65
MangochiL    0.94         NenoR        0.51
LiwondeL     0.97         MwanzaR      0.54
ChikwawaL    0.94         MimosaR      0.62
ChiromoL     0.95         ThyoloR      0.68
SinoyaL      0.94         BvumbweR     0.71
TenganiL     0.96         ChilekaR     0.63
NsanjeL      0.96         ChichiriR    0.62
                          MakokaR      0.62
                          ChingaleR    0.46
                          BalakaR      0.64
                          MangochiR    0.58

Table 4
SOM performance in terms of R² based on Case 3.

Flow and water level data      Rainfall
Variable     R²                Cluster      Variable     R²
MangochiF    0.92              Cluster 1    NsanjeR      0.86
LiwondeF     0.97                           MakhangaR    0.90
ChikwawaF    0.95                           NgabuR       0.82
ChiromoF     0.95                           ChikwawaR    0.76
SinoyaF      0.81                           NchaloR      0.77
MangochiL    0.94              Cluster 2    NenoR        0.70
LiwondeL     0.97                           MwanzaR      0.65
ChikwawaL    0.94                           MimosaR      0.75
ChiromoL     0.95                           ThyoloR      0.77
SinoyaL      0.94                           BvumbweR     0.79
TenganiL     0.96                           ChilekaR     0.77
NsanjeL      0.96                           ChichiriR    0.76
                                            MakokaR      0.76
                               Cluster 3    ChingaleR    0.90
                                            BalakaR      0.92
                                            MangochiR    0.93
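The R² scores reported in Tables 3 and 4 compare observed values with those predicted by the SOM. The exact formula is not restated in this section, so the sketch below assumes that R² denotes the squared Pearson correlation coefficient, evaluated only on time steps where both an observation and a prediction exist. The function name and the sample values are hypothetical.

```python
import numpy as np

def r_squared(observed, predicted):
    """Squared Pearson correlation over time steps with both values present.

    Assumption: R2 is taken as the squared correlation coefficient;
    NaN marks a missing value in either series.
    """
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    valid = ~np.isnan(observed) & ~np.isnan(predicted)
    if valid.sum() < 2:
        raise ValueError("need at least two paired values")
    r = np.corrcoef(observed[valid], predicted[valid])[0, 1]
    return float(r ** 2)

# Illustrative values only (not data from the Shire River basin).
observed = np.array([1.0, 2.0, np.nan, 4.0, 5.0])
predicted = np.array([1.1, 1.9, 3.0, 4.2, 4.8])
score = r_squared(observed, predicted)
```

The third time step is skipped because the observation is missing, which mirrors how performance can only be assessed where observed data exist.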
Fig. 5. Component planes for flow and water level (a) and rainfall (b) resulting from Case 2.
Fig. 6. Component planes (a), (b) and (c) for clusters 1, 2 and 3 respectively.
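For readers who want a concrete picture of how a trained SOM yields infilled values, the following minimal sketch shows one common approach: the best matching unit (BMU) is found using only the observed components of a record, and the missing components are read from that unit's codebook vector. This is an illustration under stated assumptions, not the implementation used in this study. The function name and toy codebook are hypothetical, and in practice the variables would normally be normalized before distances are computed.

```python
import numpy as np

def infill_with_som(codebook, sample):
    """Fill NaN entries of `sample` from the best matching unit.

    codebook: array (n_units, n_variables) of trained SOM weights.
    sample:   array (n_variables,) with NaN marking missing values.
    Returns the filled sample and the index of the BMU.
    """
    codebook = np.asarray(codebook, dtype=float)
    sample = np.asarray(sample, dtype=float)
    known = ~np.isnan(sample)
    if not known.any():
        raise ValueError("sample has no observed components")
    # Squared Euclidean distance over the observed components only.
    diff = codebook[:, known] - sample[known]
    bmu = int(np.argmin(np.sum(diff ** 2, axis=1)))
    filled = sample.copy()
    filled[~known] = codebook[bmu, ~known]
    return filled, bmu

# Hypothetical codebook with three units and three variables.
codebook = np.array([
    [1.0, 10.0, 100.0],
    [2.0, 20.0, 200.0],
    [3.0, 30.0, 300.0],
])
sample = np.array([2.1, np.nan, 195.0])
filled, bmu = infill_with_som(codebook, sample)
```

Because the search ignores missing components, a record can be matched as long as at least one variable is observed. The quality of the infilled value then depends on how well the observed variables constrain the missing one.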
Under the best scenarios (Case 2 for flow and water levels, and Case 3 for rainfall), the match between observed and predicted values is very good (Fig. 7). Magnitudes, both high and low, and trends of the observed data are well replicated by the SOM. In addition, the imputed data interpolate well within the original series. For the sake of space, only two variables are shown: water level at Liwonde gauging station and rainfall at Makhanga station. In general, the results show that the SOM is a powerful predictive tool that handles large data sets and high proportions of missing values. However, the quality of prediction depends on the correlation of the data in the training set. In this study, the results also show that the predictive capacity of the SOM in this catchment is better for flow and water level data than for rainfall data. One possible reason is the same issue of data variability raised by Kalteth and Berndtsson (2007). Rainfall is highly variable in this catchment (Ngongondo et al., 2011a). In contrast, flow and water levels for these stations are likely to be correlated, since the gauging stations used in this exercise all lie on the Shire River except for Sinoya, which lies on the tributary. Nonetheless, the fact that flow prediction has been better with the SOM is to be welcomed: although flow data are the preferred data for effective water resources assessment, they are also the most difficult and expensive to measure. They are thus the ones most likely to be missing, and the result of the study reported here offers significant reassurance for data-sparse regions of the world. Similar work conducted by Adeloye and Rustum (2012) in the Osun basin of south west Nigeria, in which runoff and rainfall were trained together, yielded very good results, warranting no further clustering. In the current study, however, the results are consistent with earlier findings by Kalteth and Berndtsson (2007). They confirm the effectiveness of the SOM in prediction, which can be much improved if the variables exhibit low spatial variability and/or high correlations, as was the case for the Osun basin (Adeloye and Rustum, 2012). Where this is not the case, e.g. with the rainfall data for the Shire catchment analyzed here, working with clusters of homogeneous regions has proved useful in improving the predictability of the SOM.

Fig. 7. A comparison of observed and predicted daily water levels at Liwonde (top) and daily rainfall at Makhanga station (below).

6. Conclusion

SOM is a powerful tool for infilling data. It can handle not only large data sets but also data characterized by a high proportion of missing values. Such data are typical of data-scarce developing countries and introduce difficulties and uncertainties in water resources assessment, planning and development in these countries. The attractiveness of SOM lies in its clustering ability, which makes it immune to missing values. The use of traditional approaches in this study would have been tedious considering the number of variables to be infilled. Besides, the relatively high proportion of missing data would also have meant limiting the infilling to periods where enough predictors are available, thus losing valuable information. Despite SOM's powerful modeling skill, though, it is evident from this study that the predictive capacity of SOM is quite dependent on the correlation of the data involved.

References
Abatzoglou, J.T., Redmond, K.T., Edwards, L.M., 2009. Classification of regional climate variability in the state of California. J. Appl. Meteorol. Climatol. 48, 1527–1541.
Adeloye, A.J., 1990. Streamflow data and surface water resource assessment. J. Water Supply Res. Technol. – AQUA 39, 225–236.
Adeloye, A.J., 1996. An opportunity loss model for estimating value of streamflow data for reservoir planning. Water Resour. Manage. 10 (1), 45–79.
Adeloye, A.J., 2009. The relative utility of multiple regression and ANN models for rapidly predicting the capacity of water supply reservoirs. Environ. Model. Software 24 (10), 1233–1240.
Adeloye, A.J., 2011. In: Proceedings of the Symposium HS03 – Risk in Water Resources Management, Melbourne, Australia, IAHS 347, pp. 121–126.
Adeloye, A.J., Rustum, R., 2012. Self-organising map rainfall–runoff multivariate modelling for runoff reconstruction in inadequately gauged basins. Hydrol. Res. 43, 603–617.
Adeloye, A.J., Rustum, R., Kariyama, I.D., 2011. Kohonen self organizing map estimator for reference crop evapotranspiration. Water Resour. Res. 47. http://dx.doi.org/10.1029/2011WR010690.
Alhoniemi, E., Hollmen, J., Simula, O., Vesanto, J., 1999. Process monitoring and modeling using the self-organizing map. Integr. Comput. Aided Eng. 6 (1), 3–14.
Back, B., Sere, K., Hanna, V., 1998. Managing complexity in large database using self organising map. Acc. Manage. Inform. Technol. 8, 191–210.
Belfuss, R., Dos Santons, D., 2001. Patterns of hydrological change in the Zambezi delta, Mozambique. Working Paper #2, Program for the Sustainable Management of Cahora Bassa Dam and the Lower Zambezi Valley, Maputo, Mozambique.
Coulibaly, P., Evora, N.D., 2007. Comparison of neural network methods for infilling missing daily weather records. J. Hydrol. 341, 27–41.
Dastorani, M.T., Moghadamnia, A., Piri, J., Rico-Ramirez, M., 2010. Application of ANN and ANFIS models for reconstructing missing flow data. Environ. Monit. Assess. 166, 421–434.
Dawson, C.W., Wilby, R., 1998. An artificial neural network approach to rainfall runoff modelling. Hydrol. Sci. 43 (1), 47–66.
Dinpashoh, Y., Jhajharia, D., Fakheri-Fard, A., Singh, V.P., Kahya, E., 2011. Trends in reference crop evapotranspiration over Iran. J. Hydrol. 399, 423–433.
Fei, B.K.L., Eloff, J.H.P., Olivier, M.S., Venter, R.H.S., 2006. The use of self-organising maps for anomalous behaviour detection in a digital investigation. Forensic Sci. Int. 162, 33–37.
Garcia, H., Gonzalez, L., 2004. Self-organizing map and clustering for wastewater treatment monitoring. Eng. Appl. Artif. Intell. 17 (3), 215–225.
Gyau-Boake, P., Schultz, G.A., 1994. Filling gaps in runoff time series in West Africa. Hydrol. Sci. 39 (4), 621–636.
Harvey, C.L., Dixon, H., Hannaford, J., 2010. Developing best practice for infilling daily river flow data. In: Kirby, Celia (Eds.), Role of Hydrology in Managing Consequences of a Changing Global Environment. Proceeding of the BHS Third International Symposium, British Hydrological Society, pp. 816–823.
Haykin, S., 1999. Neural Networks: A Comprehensive Foundation. Prentice Hall, New Jersey.
Hydrology Project and Technical Assistance Training Module (SWDP) #39: How to Correct and Complete Discharge Data, New Delhi, India.