Recent Researches in Applied Information Science
Design and implementation of a clustering model for river sectors based on biotope characteristics

DANA SIMIAN 1, DANIEL HUNYADI 1, ANGELA BĂNĂDUC 2
1 Department of Informatics
2 Department of Environmental Sciences
Faculty of Sciences
University “Lucian Blaga” of Sibiu
Victoriei 10 Str., Sibiu
ROMANIA
Abstract: - The aim of this paper is to introduce models for clustering river sectors based on biotope characteristics. Cluster analysis is carried out using hierarchical and non-hierarchical methods in order to obtain the optimum number of classes. The implementation is made using RapidMiner.

Key-Words: - cluster analysis, hierarchical clustering, non-hierarchical clustering, biotope characteristics

1 Introduction
Clustering river sectors based on biotope and/or biocoenosis characteristics has applications in the management of lotic ecosystems. The goal of the present paper is to propose an efficient model for river sector clustering based on some biotope characteristics and to choose an adequate data mining system for implementing this model.

Ecological studies and analyses usually require complex classification tasks: classification of sites based on wet deposit loads [13], classification of waters based on their quality [15], classification of climate regions [12] and habitats [2, 4], species classification [1], etc. Biodiversity evaluation also involves classification and clustering [3, 6, 9, 14].

The classification techniques most frequently used in ecology are based on cluster analysis, using hierarchical or non-hierarchical methods with various partitioning or agglomerating methods based on similarity or dissimilarity measures ([1-5, 8, 9, 14]). Usually agglomerative clustering and the associated dendrogram are used; the main disadvantage of this technique is that it is unable to choose the optimum number of clusters automatically. For a specified type of data corresponding to a given classification problem, the results are influenced by the specific clustering algorithm, the similarity measures and the arbitrarily determined number of classes [9, 12, 13]. On the other hand, the results of a given clustering or classification technique depend strongly on the type of data. Techniques that work well for some types of data may give unsatisfactory results for others. Therefore many aspects must be taken into consideration when solving a classification problem from the ecology domain: data analysis, choice of the variables to be used in the clustering analysis, choice of classification techniques and evaluation of the classification results from an ecological point of view.

Our main aim is to build a model for river sector classification capable of automatically giving the number of classes, their components and the hierarchy of the biotope attributes. The case study considered is the Vişeu River watershed, a second order tributary of the Danube River, located in the northern part of the Romanian territory. The Vişeu River has its springs in the Rodna Mountains, a length of 80 km, a watershed surface of 1,606 km2 and a multiannual average flow of 30.7 m3/s at the confluence with the Tisa River.

The rest of the article is organized as follows. In Section 2 we define the model specification; we introduce two models and characterize them. Implementation details are given in Section 3. The results for the considered case study are presented in Section 4. The interpretation of the results from a computational and an ecological point of view, conclusions and further directions of study can be found in Section 5.
ISBN: 978-1-61804-089-3
2 Model specification
2.1 Input data
The input data of our model are biotope variables describing the physical characteristics of river sectors (width, depth), substratum types (mud, sand, gravel, pebbles, cobbles, boulders and large boulders, vegetable debris), diversity of pools, riffles, runs and bends, bank vegetation (coniferous, deciduous, alder trees, willow trees, herbaceous vegetation, banks with no vegetation), channel modification and land use. In the case of the Vişeu River watershed, these variables were measured for 24 river sectors, chosen in relation to the main confluences and to a visual assessment of the biotope characteristics. The data were centralized in an Excel document.
2.2 Cluster analysis
Cluster analysis is a descriptive data mining method ([5, 7, 10]) whose purpose is to cluster the observation data into groups that meet the internal cohesion and external separation conditions, i.e. the groups are homogeneous inside but heterogeneous from one group to another. In the following we describe the stages of the cluster analysis carried out for the river sectors considered. In the Excel data sheets the observations are placed on rows and the columns represent the variables.

2.2.1 Choice of variables
The variables in a cluster analysis must be chosen so as to achieve the stated objectives ([7]). It is important to reduce the data matrix, retaining only the relevant attributes. After analyzing the input data, we concluded that the relevant biotope variables for our problem are:
- average and maximum riverbed width: RWA and RWM, expressed in meters,
- average and maximum depth: DA and DM, expressed in meters,
- substratum types, expressed as percentages of the surface in transverse sections, formed of mud, sand, gravel, pebbles, cobbles, boulders and large boulders,
- altitude and channel modification, i.e. deviation from the natural state: CM%, expressed as a percentage.
Therefore, all the variables taken into account are continuous.

2.2.2 Clustering methods
There are two types of clustering methods: hierarchical and non-hierarchical. Hierarchical clustering gives the groups in a tree-like structure (dendrogram), each level of the structure corresponding to a fixed number of groups. The method most used in existing ecological studies is hierarchical agglomerative clustering, which forms the groups using a bottom-up strategy that combines the existing clusters starting from the leaves and moving to the root. The main disadvantage of this method is that it does not give the optimal number of groups. From a computational point of view, the complexity order of an agglomerative clustering algorithm is at least quadratic.

Another hierarchical clustering method is divisive clustering, which adopts a top-down strategy for the construction of the clustering structure through recursive partitioning, from the root to the branches. Initially all the elements are situated in a single cluster. The method is more efficient if we do not need the whole hierarchy, and it is capable of giving the optimum number of clusters. It is more complex than agglomerative clustering methods because it requires a second, flat clustering algorithm as an inner function. In our model we chose the divisive clustering method together with k-means (a non-hierarchical clustering method).

Non-hierarchical methods need the number of clusters as input data and use an optimality criterion in order to give a unique partition of the data into the given number of groups. The optimality criterion can be the maximization of internal cohesion for the specified number of groups.

Regardless of the clustering method, the distance between clusters must be computed at each step. Several distances can be defined ([5, 7, 10]). The distance chosen in our model is the centroid distance, i.e. the distance between two clusters is defined as the distance between their centroids. Many criteria are available for deciding which attribute is more relevant for data grouping at some level ([5, 7, 10]). For our model we use two criteria: information gain and gain ratio.

Most of the cluster analyses developed for ecology problems use non-standardized data. From a computational point of view, data normalization is required to prevent a subset of the data from dominating the analysis because of its unit of measure. In our model we consider two cases:
1. Non-standardized data
2. Standardized data, using min-max normalization for each attribute (column).
The standardized value X* of the attribute X is:
$$X^{*} = \frac{X - X_{\min}}{X_{\max} - X_{\min}} \tag{1}$$
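As an illustration of formula (1), the following minimal Python sketch rescales each attribute (column) to the range [0, 1]. The function name, the handling of constant columns and the sample values are assumptions made for this example; they are not taken from the case study data.

```python
def min_max_normalize(rows):
    """Rescale every column of `rows` to [0, 1] following formula (1)."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    normalized = []
    for row in rows:
        new_row = []
        for value, low, high in zip(row, lows, highs):
            span = high - low
            # A constant column has span 0; map it to 0.0 instead of dividing by zero
            # (an assumption, formula (1) is undefined in that case).
            new_row.append((value - low) / span if span else 0.0)
        normalized.append(new_row)
    return normalized


# Hypothetical values for three sectors: riverbed width (m) and sand (%).
sectors = [
    [4.0, 10.0],
    [12.0, 30.0],
    [20.0, 60.0],
]
for row in min_max_normalize(sectors):
    print(row)
```

After this step every attribute contributes to the distance computation on the same scale, regardless of whether it was measured in meters or in percentages.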
As output of the k-means method we obtain the contents of the clusters and the decision rules.

2.2.3 Model definition
The proposed model is defined in Fig. 1. The output data are enclosed in ellipses. We also chose two types of result visualization:
- visualization of the cluster contents as a list of elements, together with the decision rules for cluster construction
- visualization of the equivalent decision tree

Fig. 1 – Proposed clustering model

3 Model implementation
We chose RapidMiner (RM) for the implementation of our model. The main reasons which recommend RapidMiner for this purpose are:
- It is one of the most powerful open-source systems for data mining.
- It includes a large collection of modular operators for the design and processing of complex data mining problems.
- Knowledge discovery and data mining processes are represented by means of operator trees. The leaves of the tree correspond to the simplest steps of the modelled process, the interior nodes correspond to the abstract steps and the root corresponds to the whole process.
- The input data, the output data and many setting parameters are defined for each operator.
- All RapidMiner processes are described using XML.
- It has a user friendly interface.
- It supports a flexible arrangement and rearrangement of operators.
- It allows data import from 18 formats (including Excel, CSV, XML, Access, AML, ARFF, XRFF, SPSS, Stata, Sparse, DBase, C4.5).
- It offers many types of output data visualization, thereby providing an easier understanding and interpretation of the results.

In our model the input data are given in Excel format, therefore we need to use the RM import operator ReadExcel. For the hierarchical clustering we use the TopDownClustering operator. The KMeans operator solves the non-hierarchical clustering task, offering the clusters as output data, and the ClusterModel operator gives the cluster contents in the form of a list. The representation of the clustering solution as a decision tree is realized using the DecisionTree operator. The implemented processes and the practical results are presented in the next section.

4 Practical results
The input data are imported from Excel format. Data normalization is performed directly in the Excel file using formula (1). The standardized input data, after the import process, are shown in Fig. 2. The first two columns identify the river sectors using an ID and a name.
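The two attribute relevance criteria named in Section 2.2.2, information gain and gain ratio, can be illustrated with a minimal Python sketch for one continuous attribute split at a threshold. It uses the standard entropy-based textbook definitions, which may differ in detail from what the DecisionTree operator computes internally; the function names and the sample values are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of cluster labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def split_scores(values, labels, threshold):
    """Return (information gain, gain ratio) for the split `value <= threshold`."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    parts = [part for part in (left, right) if part]
    weights = [len(part) / len(labels) for part in parts]
    # Information gain: entropy before the split minus weighted entropy after it.
    remainder = sum(w * entropy(p) for w, p in zip(weights, parts))
    gain = entropy(labels) - remainder
    # Gain ratio: information gain divided by the entropy of the split sizes.
    split_info = -sum(w * log2(w) for w in weights)
    ratio = gain / split_info if split_info > 0 else 0.0
    return gain, ratio


# Hypothetical normalized sand percentages and cluster labels for six sectors.
sand = [0.0, 0.0, 0.1, 0.9, 1.0, 1.0]
cluster = [0, 0, 0, 1, 1, 1]
print(split_scores(sand, cluster, threshold=0.5))
print(split_scores(sand, cluster, threshold=0.05))
```

A threshold that separates the clusters cleanly gets the highest score under both criteria, while an unbalanced split is penalized further by the gain ratio.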
Fig. 2 – Standardized input data

The chains of processes are presented in Fig. 3, 5 and 6. Figure 3 illustrates the chain for hierarchical divisive clustering. The output data of this process, represented in Fig. 4, is the optimal number of clusters.
Fig. 3 – TopDown clustering
Fig. 4 – Output data of hierarchical clustering

The non-hierarchical clustering process, which returns the clusters and the decision rules, is represented in Fig. 5.

Fig. 5 – Non-hierarchical process 1

The decision tree representation of the solution is obtained using the process illustrated in Fig. 6.

Fig. 6 – Non-hierarchical process 2

RM makes the construction of the clusters easier to understand by offering tables and diagrams with the centroid distances used in the k-means method, as illustrated in Fig. 7 and Fig. 8. We used both the gain ratio and the information gain criteria and the results were the same.

Fig. 7 – Graphic of centroid distances for non-standardized data

Fig. 8 – Graphic of centroid distances for standardized data

The results obtained in the case of non-standardized and standardized data are presented in Fig. 9 and Fig. 10, respectively.

Fig. 9 – Clustering results for non-standardized data

Fig. 10 – Clustering results for standardized data

5 Conclusions and further directions of study
In this paper we proposed two models for river sector clustering based on biotope factors. The two models differ in the form of the input data: standardized and non-standardized. To our knowledge, no clustering models of this kind exist for river sector classification. Using a hierarchical divisive clustering method followed by the k-means method we obtain the optimum number of clusters, the cluster membership and a hierarchy of the most relevant attributes. Data standardized using the min-max normalization
method were not used before in such ecological clustering problems.

The results show that the two models rank the attributes differently. In the case of non-standardized data, the first attribute that differentiates the data is Channel Modification (CM) and the second one is cobble. In the case of standardized data, the first attribute that differentiates the data is sand and the second one is RWM. From a computational point of view, the centroid diagrams readily explain the difference. After normalization, the attribute with a large number of values equal to the maximum or the minimum of its range becomes the most relevant (it has a larger information gain or gain ratio).

An analysis from an ecological point of view shows that both clustering solutions are correct. On the upper course of the river the substrate is dominantly lithologic, the channel is not modified and the river width is smaller. On the middle and lower course the width grows, the sand percentage grows too and the frequency of channel modification increases.

The following conclusions and further directions of study result from this analysis. The normalization of data influences the results. In our case study both clustering solutions are correct from an ecological point of view, but we cannot conclude that this holds for other rivers. A further problem to study is the choice of the best normalization method for a specific case study. On the other hand, our conjecture is that a correlation may exist between the attributes which are responsible for the data grouping in the standardized and the non-standardized case. If we could prove this, then the two models could be used to obtain candidate correlated attributes and thus to restrict the search for correlated attributes. The correlation can then be confirmed or refuted using known specific methods. Another direction of study consists in applying the proposed methods to other case studies to assess their level of generality.
Acknowledgements: The third author was supported by the project POSDRU/89/1.5/S/63258 from the European Social Fund.

References:
[1] V. Adriaenssens, P.F.M. Verdonschot, P.L.M. Goethals, N. De Pauw, Application of clustering techniques for the characterization of macroinvertebrate communities to support river restoration management, Aquatic Ecology, 41, 2007, pp. 387-398.
[2] C.A. Boys, M.C. Thoms, A large-scale, hierarchical approach for assessing habitat associations of fish assemblages in large dryland rivers, Hydrobiologia, 572, 2006, pp. 11-31.
[3] R.D. Brown, Y.C. Martin, An Evaluation of Structural Descriptors and Clustering Methods for Use in Diversity Selection, SAR and QSAR in Environmental Research, Vol. 8, 1998, pp. 23-39.
[4] R.J. Diaz, M. Solan, R.M. Valente, A review of approaches for classifying benthic habitats and evaluating habitat quality, Journal of Environmental Management, 73, 2004, pp. 165-181.
[5] B.S. Everitt, Cluster Analysis, Third ed., John Wiley & Sons Inc., New York, 1993.
[6] D.P. Faith, P.A. Walker, Environmental diversity: the best-possible use of surrogate data for assessing the relative biodiversity of sets of areas, Biodiversity Conservation, 5, 1996, pp. 399-415.
[7] P. Giudici, S. Figini, Applied Data Mining for Business and Industry, Second ed., John Wiley & Sons Ltd., UK, 2009.
[8] P.L.M. Goethals, Special issue “Ecological informatics applications in water management”, Aquatic Ecology, 41, 2007, pp. 371-372.
[9] R. Kent, Y. Carmel, Evaluation of five clustering algorithms for biodiversity surrogates, Ecological Indicators, 11, 2011, pp. 896-901.
[10] D.T. Larose, Discovering Knowledge in Data. An Introduction to Data Mining, John Wiley & Sons Inc., New Jersey, 2005.
[11] I. Mierswa, M. Wurst, R. Klinkenberg, M. Scholz, T. Euler, Yale (now: RapidMiner): Rapid Prototyping for Complex Data Mining Tasks, Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2006), 2006.
[12] Y. Rhee, Y. Im, G.J. Carbone, J.R. Jensen, Delineation of climate regions using in-situ and remotely-sensed data for the Carolinas, Remote Sensing of Environment, 112, 2008, pp. 3099-3111.
[13] V. Simeonov, H. Puxbaum, S. Tsakovski, C. Sarbu, M. Kalina, Classification and receptor modeling of wet precipitation data from central Austria, Environmetrics, 10, 1999, pp. 137-152.
[14] A. Trakhtenbrot, R. Kadmon, Environmental cluster analysis in representing regional species diversity, Conservation Biology, 20, 2005, pp. 1087-1098.
[15] I.M.H. Yeung, Multivariate analysis of the Hong Kong Victoria water quality data, Environmental Monitoring and Assessment, 59, 1999, pp. 331-342.