Journal of Computing Technologies (2278 – 3814) / # 8 / Volume 4 Issue 7

Clustering methods and algorithms in data mining: Concepts and a study

Y. Sunil Raj¹, A. Pappu Rajan², S. Charles³, S. Antony Joseph Raj⁴
¹,²,⁴Assistant Professor; ¹,³,⁴Department of Computer Science; ²St. Joseph's Institute of Management; ¹,²,³St. Joseph's College (Autonomous), Sacred Heart College (Autonomous)
[email protected] [email protected] [email protected] [email protected]

Abstract— A fundamental operation in data mining is the partitioning of objects into groups. Clustering is an unsupervised learning task that identifies a finite set of categories, expressed as clusters, to describe the data. It is based on the principle of maximizing intra-class similarity and minimizing inter-class similarity. This paper presents the concepts of data mining and clustering, a survey in a nutshell of the cluster analysis techniques and methods already available in data mining, and information about existing research on clustering in data mining.

Keywords— Data Mining, Clustering

Introduction

Data mining is a family of techniques that transforms raw data into actionable facts. It is interdisciplinary and can be defined in different ways [1]. In the database management industry, data analysis has mainly evolved around a number of large data repositories or data warehouses; in this view, a data mart is a collection of data repositories that can be processed to yield results. A number of data mining functionalities are used to specify the patterns to be found in the mining process, which is part of pattern analysis in mining [3]. These functionalities are characterization, discrimination, the mining of frequent patterns, associations, correlations, classification, regression, and outlier analysis.

This paper aims to provide a wide survey of clustering analysis. In general, the aim of clustering is to find intrinsic structures in data and to organize them into useful subgroups for further study and analysis. Clustering as a data mining technique has its roots in many application areas, such as biology, image pattern recognition, security, business intelligence and Web search. Cluster analysis is one of the data mining techniques used to gain insight into the data distribution, or a preprocessing step for other mining algorithms that work on the identified clusters [2]. Clustering is an unsupervised learning task meant for identifying a finite set of clusters to describe the data. The properties of clusters can be analysed to determine cluster profiles that differentiate one cluster from another [2, 3].

Any cluster should exhibit two properties: low inter-class similarity and high intra-class similarity. A good clustering method produces high-quality clusters whose objects are similar to one another within the same cluster and dissimilar to the objects in other clusters [3, 4]. The quality of a clustering result depends on the similarity measure used by the algorithm, its ability to discover hidden patterns, and its implementation.

I. PRELIMINARIES

The basic concept of clustering is the partitioning of a large set of objects into small subsets. Each subset is a distinct cluster, such that objects are grouped together on the basis of intra-class and inter-class similarity. In principle, maximizing intra-class similarity and minimizing inter-class similarity yields more effective clustering. Clusters are formed so that objects within a cluster have high similarity to the other objects in the same cluster and are dissimilar to objects in other clusters. Similarity and dissimilarity are assessed from the attribute values of the objects using distance measures, such as the Euclidean or Manhattan distance between two objects with numeric attributes.

Data mining tools process data to discover interesting patterns or knowledge [10, 3]. Knowledge discovery has several steps; one of them is mining, which can be applied to huge volumes of data. Data sources include databases, data warehouses, the Web, other data repositories, and data streamed into the system dynamically. The data may take various forms, such as spatial, hypertext, multimedia, quantitative, or specially structured data. Relational databases, known as among the richest information repositories, are commonly used to study data mining. Clustering is a dynamic and challenging field of data mining that focuses on finding methods for effective and efficient cluster analysis in large databases [2, 3]. The next part discusses popular and widely used clustering methods.

II. BASIC CLUSTERING METHODS

This paper concentrates on a set of clustering techniques already in use. The many available mining methods can be classified into four major categories.
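The distance measures and the intra-/inter-class similarity principle described above can be made concrete with a small sketch. The data, function names and cluster assignments below are illustrative only, not taken from the paper:

```python
import math

def euclidean(p, q):
    # Straight-line distance between two numeric objects
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    # City-block distance: sum of absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(p, q))

# Two toy clusters of 2-D points (hypothetical data)
cluster_a = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0)]
cluster_b = [(8.0, 8.0), (8.5, 9.0), (9.0, 8.0)]

def avg_intra(cluster, dist):
    # Average pairwise distance within one cluster (intra-class)
    pairs = [(p, q) for i, p in enumerate(cluster) for q in cluster[i + 1:]]
    return sum(dist(p, q) for p, q in pairs) / len(pairs)

def avg_inter(c1, c2, dist):
    # Average distance between objects of different clusters (inter-class)
    return sum(dist(p, q) for p in c1 for q in c2) / (len(c1) * len(c2))

# A good clustering keeps intra-class distance far below inter-class distance
assert avg_intra(cluster_a, euclidean) < avg_inter(cluster_a, cluster_b, euclidean)
print(avg_intra(cluster_a, euclidean), avg_inter(cluster_a, cluster_b, euclidean))
```

Here low average intra-cluster distance corresponds to high intra-class similarity, and high average inter-cluster distance corresponds to low inter-class similarity.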

© 2015 JCT. All Rights Reserved


Fig. 1.0. Classification of Clustering Methods — Partitioning (K-Means, K-Medoids); Hierarchical (CURE, BIRCH); Density Based (DBSCAN, DENCLUE); Grid Based (STING, CLIQUE).

Figure 1.0 shows the various types of clustering methods in data mining. The categories are partitioning, hierarchical, density based and grid based.

3.1. Partitioning Method
In partitioning, each object must belong to exactly one group, depending on the level of partitioning. Given a set of n objects, a partitioning method constructs k partitions of the data, each partition being an exclusive cluster, such that k ≤ n. Most partitioning methods are distance based. When the k partitions are first created, this is the initial partitioning. An iterative relocation technique then improves the partitioning by moving objects from one cluster to another. In general, under a partitioning technique, the objects in the same cluster are close to each other while different clusters are far from each other, depending on the level of partitioning and on the form and nature of the data repository. Most applications of the partitioning method adopt popular heuristic methods such as the k-means and k-medoids algorithms, which progressively improve the clustering quality because they follow a greedy approach. These heuristic methods are mostly applied to small and medium sized databases; for larger data, they can be extended to newer, faster heuristic partitioning methods.

3.1.1. K-Means
K-means is one of the most popular clustering algorithms in metric spaces. Initially, k cluster centroids are selected at random; the algorithm then reassigns all points to their nearest centroids and recomputes the centroids of the newly assembled groups [4, 8]. The iterative relocation continues until the criterion function converges. The k-means centroids are easily influenced by noise and outliers, because even a small number of such data points can shift a centroid considerably. Other weaknesses are sensitivity to initialization, entrapment in local optima, poor cluster descriptors, and an inability to deal with clusters of arbitrary shape, size and density.

3.1.2. K-Medoids
Unlike k-means, in the K-Medoids or Partitioning Around Medoids (PAM) algorithm [3, 4] a cluster is represented by its medoid, the most centrally located object in the cluster. Medoids are more resistant to outliers and noise than centroids. PAM begins by randomly selecting an object as the medoid for each of the k clusters. Then, each of the non-selected objects is grouped with the medoid to which it is most similar. PAM iteratively replaces one of the medoids by one of the non-medoid objects whenever this improves the clustering. PAM is an expensive algorithm in the context of finding the medoids, since it compares each medoid against each item of the whole data set in every iteration.

3.2. Hierarchical Method
The hierarchical method creates a hierarchical decomposition of the given set of data objects. It can be classified as agglomerative or divisive. The agglomerative method uses a bottom-up strategy: it typically starts with each object forming its own cluster and iteratively merges clusters into larger ones until all objects form a single cluster. The merging step first finds the closest clusters according to a similarity measure and then combines them into a single cluster. The divisive method uses a top-down strategy: it starts with all objects in a single cluster as the root of the hierarchy and then divides the root into several sub-clusters. The difficulty in this method lies in selecting the points of merge and split, because new cluster formation is based on the selected point of the respective cluster. Once a merge or split is done, it can neither be undone nor can objects be swapped between clusters. To overcome this drawback, multiphase clustering techniques such as the BIRCH and Chameleon algorithms can be used. These algorithms partition objects hierarchically using a tree structure of micro- and macro-clusters through leaf and non-leaf nodes, and then apply other clustering algorithms to create a macro-clustering on the micro-clusters. Chameleon explores dynamic modeling in hierarchical clustering.

3.2.1. CURE
Clustering Using REpresentatives (CURE) is an agglomerative method with two novel steps [11]. First, clusters are represented by a fixed number of well-scattered points instead of a single centroid. Second, the representatives are shrunk toward their cluster centers by a constant factor. In each iteration, the clusters with the closest representatives are merged. The use of multiple representatives allows CURE to deal with arbitrary-shaped clusters of different sizes, while the shrinking dampens the effects of outliers and noise. CURE improves scalability through a combination of random sampling and partitioning.
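A toy sketch of the bottom-up agglomerative merging described in Section 3.2 follows. It uses single-link merging over small illustrative data; CURE additionally uses a fixed number of scattered, shrunken representatives per cluster, which this minimal version omits:

```python
import math

def closest_pair(clusters):
    # Find the two clusters whose members are closest (single-link distance)
    best = (0, 1, float("inf"))
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            d = min(math.dist(p, q) for p in clusters[i] for q in clusters[j])
            if d < best[2]:
                best = (i, j, d)
    return best

def agglomerate(points, k):
    # Bottom-up strategy: start with singleton clusters,
    # repeatedly merge the closest pair until k clusters remain
    clusters = [[p] for p in points]
    while len(clusters) > k:
        i, j, _ = closest_pair(clusters)
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters

# Hypothetical data: two well-separated groups of three points each
points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
print(sorted(len(c) for c in agglomerate(points, 2)))  # two clusters of three points
```

The pairwise scan makes this sketch quadratic per merge; CURE's representative points and sampling exist precisely to avoid this cost on large databases.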

3.2.2. BIRCH
This algorithm introduces a novel hierarchical data structure, the CF-tree, for compressing the data into many small sub-clusters and then performs clustering on this compressed summary of the data set [10]. Sub-clusters are represented by compact summaries, called cluster features (CF), that are stored in


the leaves. The non-leaf nodes store the sums of the CFs of their children. A CF-tree is built dynamically and incrementally, requiring a single scan of the data set. An object is inserted into the closest leaf entry. Two input parameters control the maximum number of children per non-leaf node and the maximum diameter of the sub-clusters stored in the leaves. BIRCH can thus create structures for varying parameter values, which influences clustering performance and is also connected with the main memory of the system. BIRCH is a fast algorithm, but it has a few drawbacks: sensitivity to the data order and an inability to deal with non-spherical clusters, since the cluster diameter determines the boundary of a cluster.

3.3. Density based clustering method
Distance-based methods, such as the partitioning algorithms above, tend to create spherical clusters. Density based clustering methods instead use a notion of density for the formation of clusters, and can therefore model clusters of arbitrary shape, such as oval or S-shaped clusters [8]. The general idea is to continue growing a cluster as long as the density of its neighbourhood exceeds some threshold. Density based methods divide a set of data objects into multiple exclusive clusters or into a hierarchy of clusters. DBSCAN and OPTICS are popular density based algorithms. Challenges remain, such as high dimensional spaces, which reduce any clustering tendency, and the choice of the various input parameters.

3.3.1. DBSCAN
This algorithm seeks core objects whose neighbourhood contains at least M input points [7]. The skeleton of a cluster is defined by the set of core objects with overlapping neighbourhoods. The boundaries of clusters are represented by the non-core points lying inside the neighbourhoods of core objects. DBSCAN can discover arbitrary-shaped clusters and is insensitive to outliers; its complexity on the input data is O(N²).
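A minimal sketch of the core-object test at the heart of DBSCAN follows. The data, parameter values and function names are illustrative only, not from the paper or the original algorithm's implementation:

```python
import math

def region_query(points, p, eps):
    # All points within distance eps of p (p's eps-neighbourhood, including p)
    return [q for q in points if math.dist(p, q) <= eps]

def core_objects(points, eps, min_pts):
    # A core object has at least min_pts points in its eps-neighbourhood
    return [p for p in points if len(region_query(points, p, eps)) >= min_pts]

# Toy data: one dense group of four points plus an isolated outlier
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
cores = core_objects(points, eps=0.2, min_pts=3)
print(cores)  # the outlier (5.0, 5.0) is not a core object
```

The naive neighbourhood scan above gives the O(N²) behaviour mentioned in the text; replacing `region_query` with a lookup in a spatial index is what brings the complexity down.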
Complexity can be improved to O(N log N) if a spatial index data structure is used.

3.3.2. DENCLUE
DENCLUE (DENsity-based CLUstEring) uses an influence function to describe the impact of a point on its neighbourhood. Clusters are determined using density attractors, the local maxima of the overall density function; a grid structure is used to compute the sum of the influence functions. The complexity of this algorithm is O(N). It can find arbitrary-shaped clusters and offers noise resistance and insensitivity to the data ordering, but it suffers from sensitivity to its input parameters.

3.4. Grid based clustering method
Grid based clustering uses a multi-resolution grid structure. It takes a space-driven approach, partitioning the embedded space into cells independently of the distribution of the input objects. The advantage of this method is fast processing time, which depends mainly on the number of cells and the dimensionality rather than on the number of objects. Two representative grid based algorithms, STING and CLIQUE, are used to store


grid cells over a high dimensional data space.

3.4.1. STING
This algorithm works with numerical attributes as a multi-resolution clustering technique [14]. Rectangular cells are constructed, and statistical information such as the mean, maximum and minimum is pre-computed and stored within each cell [15]. Different levels of cells are introduced corresponding to different resolutions, and the parameters of bottom-level cells are used to produce the parameters of higher-level cells. Query processing begins at a first layer consisting of a small number of cells. For each cell in this layer, pertinence is checked by computing a confidence interval; irrelevant cells are identified and removed, and the process repeats iteratively until the bottom layer is reached [16].

3.4.2. CLIQUE
CLIQUE [17] is a subspace partitioning algorithm that searches for clusters throughout the data space. It divides each dimension into non-overlapping intervals called cells. A cell is dense if the number of objects that map to it exceeds a threshold; otherwise the cell is sparse. All cells are obtained by partitioning every dimension into intervals of equal length.

V. CONCLUSION
This paper discusses different types of clustering methods: partitioning, hierarchical, density based and grid based clustering. The features of these methods, together with the positive and negative attributes of the corresponding algorithms, have been discussed in detail. A researcher can take any one of these clustering methods and apply it to real-time data to group the data easily; only after the data are grouped can a mining technique be used to solve the problem. Reaching a clustering solution in a minimum number of steps is a major challenge for new research, because recent data formats and volumes are too numerous to manage and to solve easily with minimum criteria.

References
[1] X. Wu, V. Kumar, J. R. Quinlan, J. Ghosh, Q. Yang, H. Motoda, G. J. McLachlan, A. Ng, B. Liu, P. S. Yu, Z. H. Zhou, M. Steinbach, D. J. Hand, D. Steinberg, "Top 10 Algorithms in Data Mining", Knowledge and Information Systems, Vol. 14, pp. 1-37, 2007.
[2] I. Guyon, U. V. Luxburg, R. C. Williamson, "Clustering: Science or Art?", Proc. NIPS Workshop on Clustering Theory, 2009.
[3] Jiawei Han, Micheline Kamber, "Data Mining: Concepts and Techniques", 3rd Edition, pp. 383-422, 2014.
[4] Arun K. Pujari, "Data Mining Techniques", pp. 114-147.
[5] Ji He, Man Lan, Chew-Lim Tan, Sam-Yuan Sung, Hwee-Boon Low, "Initialization of Cluster Refinement Algorithms: A Review and Comparative Study", Proc. International Joint Conference on Neural Networks, Budapest, 2004.
[6] P. Arabie, L. J. Hubert, G. De Soete, "Clustering and Classification", World Scientific, 1996.
[7] Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise", Proc. 2nd International Conference on Knowledge Discovery and Data Mining, 1996.


[8] M. R. Anderberg, "Cluster Analysis for Applications", Academic Press, New York, 1973, pp. 162-163.
[9] G. Biswas, J. Weinberg, D. H. Fisher, "ITERATE: A Conceptual Clustering Algorithm for Data Mining", IEEE Transactions on Systems, Man, and Cybernetics, Vol. 28C, pp. 219-230.
[10] Tian Zhang, Raghu Ramakrishnan, Miron Livny, "BIRCH: An Efficient Data Clustering Method for Very Large Databases", Proc. ACM SIGMOD International Conference on Management of Data, pp. 103-114, June 4-6, 1996, Montreal, Quebec, Canada.
[11] Sudipto Guha, Rajeev Rastogi, Kyuseok Shim, "CURE: An Efficient Clustering Algorithm for Large Databases", Proc. ACM SIGMOD International Conference on Management of Data, pp. 73-84, June 1-4, 1998, Seattle, Washington, United States.
[12] Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel, Jörg Sander, "OPTICS: Ordering Points to Identify the Clustering Structure", Proc. ACM SIGMOD International Conference on Management of Data, pp. 49-60, May 31-June 3, 1999, Philadelphia, Pennsylvania, United States.
[13] George Karypis, Eui-Hong Han, Vipin Kumar, "Chameleon: Hierarchical Clustering Using Dynamic Modeling", Computer, Vol. 32, No. 8, pp. 68-75, August 1999.
[14] Wei Wang, Jiong Yang, Richard Muntz, "STING: A Statistical Information Grid Approach to Spatial Data Mining", Proc. 23rd International Conference on Very Large Data Bases, 1997.
[15] Ilango, V. Mohan, "A Survey of Grid Based Clustering Algorithms", International Journal of Engineering Science and Technology, Vol. 2(8), 2010.
[16] R. Agrawal et al., "Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications", Proc. ACM SIGMOD International Conference on Management of Data, 1998.
[17] Venish Raja, Pappu Rajan, "The Next Big Thing in Computing is Cloud Computing: An Introduction, Concept and Issues", International Journal of Research in Computer Application & Management, Vol. 3, Issue 11, pp. 7-11, 2013.
[18] Pappu Rajan, G. Prakash Raj, Rosario Vasantha Kumar P. J., "A Study on Application of Data and Web Mining Techniques to Enrich User Experience in Libraries and Online Book Stores", International Journal of Research in Computer Application & Management, ISSN 2231-1009, Vol. 3, Issue 8, 2013.
[19] Pappu Rajan, Rosario Vasantha Kumar P. J., Jothi Kumar, "Green ICT Services and Issues: Nano, Grid and Cloud Computing", International Journal of Research in Computer Application and Management, ISSN 2231-1009, Vol. 5, Issue 1, January 2015.
[20] A. Pappu Rajan, S. P. Victor, "Features and Challenges of Web Mining Systems in Emerging Technology", International Journal of Current Research, ISSN 0975-833X, Vol. 4, Issue 7, pp. 066-070, July 2012.
[21] A. Pappu Rajan, "A Study on Security Threat Awareness among Students Using Social Networking by Applying Data Mining Techniques", International Journal of Research in Commerce, IT & Management, ISSN 2231-5756, Vol. 3, Issue 9, 2013.
[22] Rosario Vasantha Kumar P. J., Pappu Rajan, "RFID Technology Use in Library and Information Centers", Journal of Computing Technologies, ISSN 2278-3814, Vol. 4, Issue 3, March 2015.
[23] Pappu Rajan, G. Prakash Raj, Rosario Vasantha Kumar P. J., Bastin Jesu Raj, "A Study on Smart Phone User Behaviour and Security Awareness", International Journal of Exclusive


Management Research, ISSN 2249-2585, Print ISSN 2249-8672, Vol. 5, Issue 3, March 2015, pp. 1-10.
[24] Pappu Rajan A., Victor S. P., "Utilizing Data Mining Application Techniques to Involve a Business Analytics for Retail Outlets", International Journal of Current Research, ISSN 0975-833X, Vol. 4, Issue 11, pp. 196-204, November 2012.
[25] Pappu Rajan A., Victor S. P., "Data Mining Techniques for Preprocessing Process in Web Log Mining", International Journal of Current Research, ISSN 0975-833X, Vol. 4, Issue 11, pp. 139-144, November 2012.
