
Theor Appl Climatol DOI 10.1007/s00704-010-0267-x

ORIGINAL PAPER

Comparison of classification and clustering methods in spatial rainfall pattern recognition at Northern Iran

Saeed Golian & Bahram Saghafian & Sara Sheshangosht & Hossein Ghalkhani

Received: 23 April 2009 / Accepted: 2 February 2010
© Springer-Verlag 2010

Abstract Pattern recognition is the science of data structure and its classification. Many classification and clustering methods are in use in the pattern recognition field. In this research, rainfall data for a region in Northern Iran are classified with the natural breaks classification method and with a revised fuzzy c-means (FCM) algorithm as a clustering approach. To compare the two methods, the results of the FCM method are hardened. The comparison showed overall agreement between the natural breaks classification and FCM clustering methods. The differences arise from the nature of the two methods: in FCM, the boundaries between adjacent clusters are not sharp, whereas they are abrupt in the natural breaks method. The sensitivity of both methods to rain gauge density was also analyzed. For each rain gauge density, the percentage of boundary region and the hardening error are at a minimum in the first cluster, while the second cluster has the maximum error. Moreover, the number of clusters was sensitive to the number of stations. Since the optimum number of classes is not apparent in classification methods and the boundary between adjacent classes is abrupt, clustering methods such as FCM overcome these deficiencies. The methods were also applied to map an aridity index in the study region, where the results revealed good agreement between the FCM clustering and natural breaks classification methods.

S. Golian (*) Shahrood University of Technology, Shahrood, Iran e-mail: [email protected]
B. Saghafian Science and Research Branch, Islamic Azad University, Tehran, Iran
S. Sheshangosht · H. Ghalkhani Water Research Institute, Tehran, Iran

1 Introduction

Pattern recognition by classification or clustering methods has many applications in meteorological and hydrological studies. For example, it has been applied to investigate temperature trends in the US (Lawson et al. 1981) and to identify areas of central North America with similar cloud frequency behavior (Schulz and Samson 1988). In decision-making for water resource management and planning, clustering annual and monthly rainfall data and extracting regions with different rainfall patterns can be a useful tool for managers and stakeholders. Several definitions of “pattern recognition” have been proposed. For example, Bezdek (1981) stated that “pattern recognition is a search for structure in data”. Schalkoff (1992) defined pattern recognition as the science that concerns the description or classification (recognition) of measurements. Approaches to pattern recognition include neural networks, classification methods, and clustering algorithms. “Classification” belongs to the supervised pattern recognition category, whereas “clustering” refers to unsupervised approaches. Lauzon et al. (2006) used a Kohonen neural network to cluster precipitation fields. They also used the mean rainfall in each cluster and the upstream flows as inputs to a lumped rainfall-runoff model and simulated the downstream flow. The results demonstrated the relevance of the proposed clustering method, which produces groups of precipitation fields that agree with the global climatological features affecting the study region.

S. Golian et al.

Ramachandra Rao and Srinivas (2006; 2008) used hybrid clustering algorithms for regionalization analysis. The watershed was initially clustered by means of agglomerative hierarchical clustering algorithms such as single linkage, complete linkage, and Ward's algorithm. Each derived cluster was then refined with a partitional clustering procedure such as the K-means algorithm. The regions given by the clustering algorithms were, in general, not statistically homogeneous in terms of runoff generation mechanism. Hoffman and Hargrove (2005) discussed clustering methods and their ability to delineate eco-regions (eco-regionalization). They used a geographic multivariate clustering method with the K-means algorithm as a type of quantitative regionalization method. Maps of nine characteristics, including elevation, plant-available water capacity, soil organic matter, total soil nitrogen, depth to a seasonally high water table, mean precipitation during the growing season, mean solar insolation during the growing season, degree-day heat sum during the growing season, and degree-day cold sum during the non-growing season, were generated at 1 km resolution. Using these maps and the K-means clustering method, they divided the United States into as many as 3,000 eco-regions. Kulkarni and Kripalani (1998) used the FCM method to classify seasonal (June through September) patterns of percentage departure from normal rainfall over India for the period 1871–1994. The dominant modes of spatio-temporal variability in the Indian monsoon rainfall were identified. Monthly rainfall data for June through September were obtained for 306 stations over 124 years, and spatial averages were prepared for 51 uniform blocks of 2.5° latitude by 2.5° longitude. Using the FCM method, the most dominant rainfall patterns were classified into four clusters, and the spatio-temporal characteristics of each cluster were analyzed.
Osaragi (2002) proposed a spatial data classification method based on the minimization of information loss and compared the results with five other classification methods. He applied each method to seven different data sets from the Digital Mesh Statistics compiled by the Statistics Bureau and Statistics Center of Japan. Each data set was classified into nine classes. Then, the ratio L of information loss of each method was compared with that of the other methods. The results of the numerical analysis showed that the natural breaks method was the most effective classification method. Claggett et al. (2004) assessed development pressure and land-use changes in the Baltimore–Washington, DC, region, exploring the utility of two modeling approaches for forecasting future development trends and patterns. The study area was divided into five classes representing percent area of urban land by using the Jenks optimization algorithm to identify breakpoints between classes that minimize the sum of the variance for each class. The output data from the two modeling approaches were divided into five classes of development pressure ranging from “very low” to “very high”.

Comparisons between classification and clustering methods have not been reported in previous studies. In this paper, we use a fuzzy clustering algorithm to classify annual rainfall data over the period 1975 to 2008 in northern Iran, and the results are compared with a hard classification method. The optimum number of clusters is derived through the fuzzy clustering method. Unlike in the hard classification method, the boundary between adjacent clusters is not sharp, and a boundary region is introduced in the [1/c, (c−1)/c] interval, where c is the number of clusters.

2 Methodology

Suitable rainfall data were available from 1975 to 2008 for some 25 rain gauge stations in the study region. Data filling, where required, was performed using a multivariate regression method between adjacent stations. Next, statistical tests were conducted for all stations in the study area. For hydrological and water resources time series at common time scales (e.g., monthly or annual), most statistical analyses rest on a set of fundamental assumptions: the series is consistent, is trend-free, and constitutes a stochastic process whose random component follows the appropriate probability distribution function (Eischeid et al. 1995). Consistency implies that all the collected data belong to the same statistical population. A trend exists in a data set if there is a significant correlation between the observations and time. Trend or nonstationarity is normally introduced through human activities such as land-use changes or human-induced climate change. The double mass curve is the most widely used technique for consistency testing. This test revealed that the data were consistent for all rain gauges. In general, randomness in a hydrological time series means that the data arise from natural causes. If there is no randomness, then the series is persistent; this persistence is normally quantified in terms of the serial correlation coefficient (McMahon and Mein 1986). The nonparametric Spearman rank-order correlation test was used to investigate the existence of long-term trends in the data sets. An outlier test was also carried out. Outliers are data points that depart significantly from the trend of the remaining data. The retention, modification, or deletion of these outliers can significantly affect the statistical parameters computed from the data. Results showed that three stations had trends and outlier values at the 5% significance level. These stations were removed from further analysis.
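As a rough illustration of the trend test described above, the following sketch computes the Spearman rank-order correlation between an annual series and its time index, together with a t statistic. The function names and the statistic t = ρ·sqrt((n−2)/(1−ρ²)) are assumptions made for illustration; the paper does not state how its test was computed.

```python
import math

def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_trend(series):
    """Spearman rho between a series and time, plus an assumed t statistic."""
    n = len(series)
    time_ranks = [float(i + 1) for i in range(n)]
    value_ranks = average_ranks(series)
    mean = (n + 1) / 2.0  # mean rank, also valid when ties are averaged
    cov = sum((a - mean) * (b - mean) for a, b in zip(time_ranks, value_ranks))
    var_t = sum((a - mean) ** 2 for a in time_ranks)
    var_v = sum((b - mean) ** 2 for b in value_ranks)
    if var_v == 0.0:
        return 0.0, 0.0  # constant series: no trend can be measured
    rho = max(-1.0, min(1.0, cov / math.sqrt(var_t * var_v)))
    if abs(rho) == 1.0:
        return rho, math.copysign(math.inf, rho)
    t_stat = rho * math.sqrt((n - 2) / (1.0 - rho ** 2))
    return rho, t_stat
```

The t statistic would then be compared with a critical value for n−2 degrees of freedom at the chosen significance level (5% in the text).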
In the third step, the nonparametric run test was applied to check randomness. This test is described by McGhee (1985), among others. The run test showed that the data of two stations were not random at the 5% two-tailed significance level. Mean annual precipitation was retained for the remaining stations in the study area. The spatial distribution of the rainfall fields was determined by inverse distance weighting (IDW), in which the value of a variable at an unsampled point is obtained from the values at adjacent points by the following relationship:

$$Z^{*} = \frac{\sum_{i=1}^{n} Z_i / d_i^{a}}{\sum_{i=1}^{n} 1 / d_i^{a}} \qquad (1)$$
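The inverse distance weighting of Eq. 1 can be sketched for a single unsampled point as follows. This is a minimal illustration rather than the authors' implementation: the function name `idw_estimate`, the planar (x, y) coordinates, and the default power of 2 are assumptions.

```python
import math

def idw_estimate(stations, target, power=2.0):
    """Estimate the value at `target` from (x, y, z) stations using Eq. 1."""
    weighted_sum = 0.0
    weight_total = 0.0
    for x, y, z in stations:
        d = math.hypot(x - target[0], y - target[1])
        if d == 0.0:
            return z  # the target coincides with a station
        w = 1.0 / d ** power  # weight 1 / d_i^a
        weighted_sum += w * z
        weight_total += w
    return weighted_sum / weight_total

# Two hypothetical gauges at equal distance from the target contribute equally.
print(idw_estimate([(0.0, 0.0, 100.0), (2.0, 0.0, 300.0)], (1.0, 0.0)))  # 200.0
```

Raising `power` concentrates the weight on the nearest gauges, which matches the role of the exponent in Eq. 1.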

In Eq. 1, Z* is the estimated quantity, Zi is the observed quantity at the i-th station, di is the distance between the i-th station and the unsampled point, “a” is a power usually between 1 and 3, and n is the number of sampled points involved in the interpolation. The power “a” influences the accuracy of the estimates: adjacent points are given greater weights when “a” is increased. The interpolation was carried out with a 500-m pixel size. In the IDW method, the weights of the sampled points are determined by their distance to the unsampled point, while their position and distribution have no effect on the estimate. To study the dependence of the results on station density, four station density scenarios with different distances from the region boundary were considered. Thus, 45, 33, 22, and 15 stations were involved in scenarios 1 to 4, respectively.

To investigate the annual rainfall patterns, a clustering and a classification method were applied. Clustering algorithms can be divided into hard clustering and fuzzy clustering. In hard clustering, each feature vector is assigned to one of the clusters with a degree of membership equal to one. This is based on the assumption that feature vectors can be divided into non-overlapping clusters with well-defined boundaries between them. Fuzzy clustering allows a feature vector to belong to all the clusters simultaneously with a certain degree of membership in the [0, 1] interval, which means that the cluster boundaries overlap each other. Generally, all clustering methods are designed to maximize within-group similarity and to minimize between-group similarity. To achieve this purpose, some measure of similarity or distance between pairs of observations/objects must be established. The most commonly used distance measure is the Euclidean distance (Bunkers et al. 1996).

In this study, the fuzzy c-means method described by Bezdek (1981) is used as the pattern clustering method, with the Euclidean distance as the measure of similarity. The natural breaks method is used for classification. It is realized with the Jenks optimization method (Jenks 1967), in which the boundary values are determined in such a way that the average squared deviation in each class is minimized.

2.1 Clustering with FCM algorithm

The determination of the number of clusters is the most important issue in clustering algorithms. Here, we use the cluster validity index (CVI) criterion proposed by Fukuyama and Sugeno (1989):

$$S(c) = \sum_{k=1}^{N} \sum_{i=1}^{c} (\mu_{ik})^{m} \left( \lVert x_k - v_i \rVert^{2} - \lVert v_i - \bar{x} \rVert^{2} \right) \qquad (2)$$

where N is the number of data to be clustered, c is the number of clusters (c ≥ 2), $x_k$ is the k-th data point (usually a vector), $\bar{x}$ is the average of $x_1, x_2, \ldots, x_N$, $v_i$ is the vector expressing the center of the i-th cluster, $\lVert \cdot \rVert$ is the norm, $\mu_{ik}$ is the grade of membership of the k-th data point to the i-th cluster, and m is an adjustable weight (usually m = 1.5 to 3). The number of clusters, c, is determined so that S(c) reaches a minimum as c increases. It is also imposed that:

$$\sum_{i=1}^{c} \mu_{ik} = 1 \qquad (3)$$

which means that the memberships of a chosen input feature vector over all the c fuzzy clusters should sum to 1.0. The procedure for determining the cluster centers and the grade of membership of the k-th data point to the i-th cluster is as follows (Sugeno and Yasukawa 1993):

1. Set t (iteration index) to unity.
2. Set an initial vector of cluster centers: $V^{0} = (v_1, v_2, \ldots, v_c)$.
3. Calculate the membership matrix $U^{t}_{c \times N}$ from the vector of cluster centers determined in the previous step (“c” is the number of clusters and “N” is the number of data):

$$\mu_{ik} = \frac{1}{\sum_{j=1}^{c} \left( \dfrac{\lVert x_k - v_i \rVert}{\lVert x_k - v_j \rVert} \right)^{\frac{2}{m-1}}} \qquad (4)$$

4. Calculate the new vector of cluster centers $V^{t} = (v_1^{t}, \ldots, v_c^{t})$ from the matrix $U^{t}_{c \times N}$:

$$v_i^{t} = \frac{\sum_{k=1}^{N} (\mu_{ik})^{m} x_k}{\sum_{k=1}^{N} (\mu_{ik})^{m}} \qquad (5)$$

5. If $\lVert V^{t} - V^{t-1} \rVert \le \varepsilon$, then stop; otherwise set t = t + 1 and go to step 3.
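The five steps above, together with the validity index of Eq. 2, can be sketched for scalar data such as one rainfall value per cell. This is an illustrative sketch under stated assumptions, not the revised FCM algorithm used in the paper: the names `fcm_1d` and `validity_index`, the defaults m = 2 and ε = 1e-6, and the use of the largest center shift as the norm in step 5 are all assumptions.

```python
def fcm_1d(data, centers, m=2.0, eps=1e-6, max_iter=500):
    """Fuzzy c-means on scalar data, following steps 1-5 and Eqs. 4-5."""
    centers = list(centers)  # step 2: initial cluster centers
    c = len(centers)
    for _ in range(max_iter):
        # Step 3: membership matrix u[i][k] from the current centers (Eq. 4).
        u = [[0.0] * len(data) for _ in range(c)]
        for k, x in enumerate(data):
            dist = [abs(x - v) for v in centers]
            hits = [i for i, d in enumerate(dist) if d == 0.0]
            if hits:
                # The point sits exactly on a center: full membership there.
                for i in hits:
                    u[i][k] = 1.0 / len(hits)
                continue
            for i in range(c):
                total = sum((dist[i] / dist[j]) ** (2.0 / (m - 1.0))
                            for j in range(c))
                u[i][k] = 1.0 / total
        # Step 4: new centers as membership-weighted means (Eq. 5).
        new_centers = []
        for i in range(c):
            num = sum((u[i][k] ** m) * x for k, x in enumerate(data))
            den = sum(u[i][k] ** m for k in range(len(data)))
            new_centers.append(num / den)
        # Step 5: stop once no center moves by more than eps (assumed norm).
        shift = max(abs(a - b) for a, b in zip(new_centers, centers))
        centers = new_centers
        if shift <= eps:
            break
    return centers, u

def validity_index(data, centers, u, m=2.0):
    """Fukuyama-Sugeno index S(c) of Eq. 2 for scalar data."""
    mean = sum(data) / len(data)
    s = 0.0
    for i, v in enumerate(centers):
        for k, x in enumerate(data):
            s += (u[i][k] ** m) * ((x - v) ** 2 - (v - mean) ** 2)
    return s
```

Running `fcm_1d` for c = 2, 3, ... and evaluating `validity_index` each time gives the sequence of S(c) values from which the number of clusters is chosen.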


Fig. 1 Digital elevation model (DEM) and the location of rain gauges network in the study area

Fig. 2 Four rain gauge scenario networks


Fig. 3 Map of mean annual rainfall

Fig. 4 Map of mean annual temperature

Table 1 Number of clusters and annual precipitation depth (in millimeters) of cluster centers for different rain gauge network densities

Cluster      45 stations   33 stations   22 stations   15 stations
Cluster 1    237.0         238.1         228.9         225.9
Cluster 2    439.5         451.2         392.6         361.8
Cluster 3    562.5         561.2         501.39        484.0
Cluster 4    748.2         745.2         595.4         592.4
Cluster 5    –             –             704.2         713.7
Cluster 6    –             –             819.5         821.5

Once the procedure stops, the cluster validity index, S(c), is calculated for the given c. This procedure is repeated for different numbers of clusters. If S(c) increases with c, then the optimum number of clusters (equal to c−1) has been obtained.

2.2 Classification with natural breaks algorithm

Natural breaks is a common method for spatial data classification and is based on the Jenks optimization algorithm introduced in 1967. In general, the optimization minimizes within-class variance and maximizes between-class variance in an iterative series of calculations. Optimization is achieved when the goodness of variance fit (GVF) quantity is maximized. The GVF value is calculated as follows (Dent 1996):

1. Calculate the arithmetic mean ($\bar{X}$) of the variable, and calculate the sum of the squared deviations between the observation values ($x_i$) and the mean, i.e., $\sum (x_i - \bar{X})^{2}$. This value is called “squared deviations, array mean” (SDAM).
2. Calculate the arithmetic mean within each class ($\bar{Z}_c$). For each class, calculate the sum of the squared deviations between the observation values ($x_i$) and the class's arithmetic mean, $(x_i - \bar{Z}_c)^{2}$. Finally, the sum over all classes is determined by $\sum \sum (x_i - \bar{Z}_c)^{2}$. This value is called “squared deviation, class means” (SDCM).
3. Calculate the GVF:

$$\mathrm{GVF} = \frac{\mathrm{SDAM} - \mathrm{SDCM}}{\mathrm{SDAM}} \qquad (6)$$
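The GVF calculation of Eq. 6 can be sketched for a given set of class breaks as shown below. The function name `gvf`, the convention that a value equal to a break falls in the lower class, and the sample values are assumptions for illustration. The sketch only evaluates one candidate grouping; it does not perform the iterative Jenks search.

```python
def gvf(values, breaks):
    """Goodness of variance fit (Eq. 6) for classes split at `breaks`."""
    mean = sum(values) / len(values)
    sdam = sum((x - mean) ** 2 for x in values)  # squared deviations, array mean
    # Each value goes to the first class whose upper bound it does not exceed.
    bounds = sorted(breaks) + [float("inf")]
    classes = [[] for _ in bounds]
    for x in values:
        for idx, upper in enumerate(bounds):
            if x <= upper:
                classes[idx].append(x)
                break
    sdcm = 0.0  # squared deviations, class means
    for members in classes:
        if members:
            class_mean = sum(members) / len(members)
            sdcm += sum((x - class_mean) ** 2 for x in members)
    return (sdam - sdcm) / sdam

values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
print(gvf(values, [5.0]))  # break between the two groups, GVF close to 1
print(gvf(values, [2.5]))  # poorly placed break, lower GVF
```

A search over candidate breaks that keeps the grouping with the largest GVF would correspond to the optimization that the natural breaks method carries out.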

The method first specifies an arbitrary grouping of the numeric data. SDAM is a constant and does not change unless the data change. The mean of each class is computed, and the SDCM is calculated. Observations are then moved from one class to another in an effort to reduce the SDCM and thereby increase the GVF statistic. This process continues until the GVF value no longer increases.

For each cluster in each scenario, the sum of the grades of membership of all cells in the study region is calculated, and the result is divided by the total number of cells in the area. Thus, the mean grade of membership to a specific cluster in each scenario is determined with the FCM clustering method. The mean grade of membership of cells located in boundary regions is also obtained. For this purpose, all cells with a grade of membership greater than (c−1)/c are assumed to belong to that cluster, cells with a grade of membership between 1/c and (c−1)/c belong to the boundary region, and cells with a grade of membership of less than 1/c do not belong to that cluster, where c is the number of clusters for each scenario. The following abbreviations are used hereafter:

MMFFuzzyC_i: Mean grade of membership to the i-th cluster for all cells in the study area in the fuzzy clustering method.
MMFFuzzyC_B_i: Mean grade of membership of boundary cells to the i-th cluster in the fuzzy clustering method.
MMFFuzzyC_WB_i: Mean grade of membership of non-boundary cells to the i-th cluster in the fuzzy clustering method.
Error_i: Hardening error in the fuzzy clustering method, which is in fact the mean grade of membership of cells with a grade of membership of less than 1/c.
MMFClass_i: Mean grade of membership to the i-th class in the classification method. For each month, the study area is partitioned with the natural breaks classification method into the same number of classes as the number of clusters obtained with the clustering method. The mean grade of membership to the i-th class (which takes a value of 0 or 1) is calculated.

Fig. 5 Comparison of FCM and natural breaks classification methods for different rain gauge network densities

Fig. 6 Comparison of FCM and natural breaks methods with 45 rain gauges

Fig. 7 Comparison of FCM and natural breaks methods with 33 rain gauges

To compare the classification and clustering methods in each month, the mean grades of membership to the i-th class/cluster are plotted together.
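The membership thresholds and the mean grade of membership described above can be sketched as follows. The function names are hypothetical. Hardening is assumed here to mean assigning each cell to the cluster with its largest grade of membership, which the text does not spell out. The thresholds 1/c and (c−1)/c are taken from the definitions above.

```python
def classify_cell(memberships):
    """Label each grade of one cell using the 1/c and (c-1)/c thresholds."""
    c = len(memberships)
    lower, upper = 1.0 / c, (c - 1.0) / c
    labels = []
    for grade in memberships:
        if grade > upper:
            labels.append("member")      # the cell belongs to this cluster
        elif grade >= lower:
            labels.append("boundary")    # the cell lies in the boundary region
        else:
            labels.append("non-member")  # the cell does not belong to it
    return labels

def harden(memberships):
    """Assumed hardening rule: index of the cluster with the largest grade."""
    return max(range(len(memberships)), key=lambda i: memberships[i])

def mean_membership(cells, i):
    """Mean grade of membership to cluster i over all cells (MMFFuzzyC_i)."""
    return sum(cell[i] for cell in cells) / len(cells)

# Hypothetical grades of membership for two cells and c = 3 clusters.
cells = [[0.80, 0.15, 0.05], [0.50, 0.40, 0.10]]
print([classify_cell(cell) for cell in cells])
print([harden(cell) for cell in cells])
print(mean_membership(cells, 0))
```

With c = 3 the boundary region covers grades from 1/3 to 2/3, so the second cell is a boundary cell for clusters 1 and 2 even though hardening assigns it to cluster 1.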

Fig. 8 Comparison of FCM and Natural Breaks methods with 22 rain gauges

Fig. 9 Comparison of FCM and Natural Breaks methods with 15 rain gauges

Fig. 10 Comparison of FCM and Natural Breaks methods for De Martonne

The methodology was further applied to evaluate the aridity condition in the study region. The De Martonne aridity index is expressed by:

$$I = \frac{P}{T + 10} \qquad (7)$$

where I is the De Martonne aridity index, P is the annual precipitation in millimeters, and T is the annual temperature in degrees Celsius. The De Martonne climate classification is as follows: 0