ISSN 2321-8665, Vol. 02, Issue 05, May 2014, Pages: 0319-0324
A Survey on Cluster Techniques for Purpose Oriented Data Mining

MAHESH KANDAKATLA¹, LOKANATHA C. REDDY², DR. VAIBHAV BANSAL³
¹Research Scholar, Dept. of CSE, OPJS University, Rajasthan, India, E-mail: [email protected]
²Professor, Dept. of CS, School of Science & Technology, Dravidian University, Kuppam, AP, India, E-mail: [email protected]
³Associate Professor, Dept. of CSE, OPJS University, Rajasthan, India, E-mail: [email protected]
Abstract: Database mining has gained a great deal of attention in the database community owing to its wide applicability in the retail, financial, and telecom sectors for improving marketing strategy. Analysis of past transaction data provides valuable information on customer buying behavior and thus improves the quality of business decisions (such as what to put on sale and how to customize marketing programs, to name a few). Cluster analysis can be used to find interesting segments in the business sector. This program of study continues personal research and professional practice in the applied field of computer science, particularly in the area of end-user systems accessibility with high reliability for decision making when several attractive choices exist. Because of the complexity of real-world data, however, each clustering algorithm has its own strengths and weaknesses. The goal of this survey is to provide a comprehensive review of different clustering techniques in data mining.
Clustering partitions data instances into subsets so that similar instances are grouped together, while dissimilar instances fall into different groups. The instances are thereby organized into an efficient representation that characterizes the population being sampled. The output of cluster analysis is thus the set of groups forming a partition structure of the data set.
Keywords: Database, Retail, Financial, Telecom, Clustering.

I. WHY CLUSTERING
Clustering is one of the most difficult techniques in the knowledge discovery process. Clustering a huge amount of data is a difficult task because the goal is to find a suitable partition in an unsupervised manner (i.e., without any prior knowledge), attempting to maximize intra-cluster similarity, so that clusters remain cohesive, while minimizing the similarity between different clusters.
Fig.1. Cluster Formation [1]
Fig.2. Clustering Process

II. MOTIVATION
Data mining involves six common categories of tasks:
– Anomaly detection (outlier/change/deviation detection): the identification of unusual data records, which may be data errors that require further investigation.
– Association rule learning (dependency modeling): searches for relationships between variables.
– Clustering [2][3][4]: the task of discovering groups and structures in the data that are in some way or another "similar", without using known structures in the data.
– Classification: the task of generalizing known structure to apply to new data. For instance, an e-mail program may attempt to classify an e-mail as "legitimate" or as "spam".
– Regression: attempts to find a function that models the data with the least error.
– Summarization: provides a more compact representation of the data set, including visualization and report generation.

In this paper, various clustering techniques are analyzed. Cluster analysis, or clustering, is a technique of grouping data into different groups, i.e., sets of objects, so that the data in each cluster share similar trends and patterns. A good clustering technique produces high-quality clusters with high intra-cluster similarity and low inter-cluster similarity. The quality of a clustering result depends both on the similarity measure used by the method and on its implementation; the quality of the clusters produced by a clustering technique is also measured by its ability to discover some or all of the hidden patterns.

Fig.3. Data Flow Representation

III. GENERAL TYPES OF CLUSTERS
A. Well-Separated Clusters
If the clusters are sufficiently well separated, any clustering technique performs well. A well-separated cluster is a set of points such that any point in the cluster is closer to every other point in the cluster than to any point not in the cluster.

Fig.4. Well-Separated Clusters

B. Center-Based Clusters
A cluster is a set of points such that a point in the cluster is closer (or more similar) to the "center" of its own cluster than to the center of any other cluster.

Fig.5. Center-Based Clusters

C. Contiguous Clusters (Nearest Neighbour or Transitive)
A cluster is a set of points such that a point in the cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster.

Fig.6. Contiguous Clusters (8 Contiguous Clusters)

D. Density-Based Clusters
A cluster is a dense region of points separated from other high-density regions by low-density regions. This type is used when the clusters are intertwined or irregular, and when noise and outliers are present.

Fig.7. Density-Based Clusters (6 Density-Based Clusters)

E. Conceptual Clusters
Clusters that share some common property or represent a particular concept.

Fig.8. Conceptual Clusters (2 Overlapping Circles)
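The intra-cluster versus inter-cluster similarity criterion that runs through Sections II and III can be quantified. The following minimal Python/NumPy sketch, our own illustration rather than anything prescribed by this survey, scores a labeled partition by cohesion (mean distance of points to their own centroid, lower is better) and separation (mean distance between centroids, higher is better); the function name and the particular measures are our own choices.

```python
import numpy as np

def cohesion_separation(X, labels):
    """Score a partition: cohesion = mean point-to-own-centroid
    distance; separation = mean pairwise centroid distance."""
    clusters = np.unique(labels)
    centroids = np.array([X[labels == c].mean(axis=0) for c in clusters])
    # Cohesion: how tightly each cluster gathers around its centroid
    cohesion = float(np.mean([
        np.linalg.norm(X[labels == c] - centroids[i], axis=1).mean()
        for i, c in enumerate(clusters)
    ]))
    # Separation: how far apart the cluster centers lie
    separation = float(np.mean([
        np.linalg.norm(a - b)
        for i, a in enumerate(centroids) for b in centroids[i + 1:]
    ]))
    return cohesion, separation
```

A good partition in the sense of Section II combines low cohesion values with high separation values; established measures such as the silhouette coefficient combine both aspects into a single score.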
IV. PARTITIONING CLUSTERING ALGORITHMS
Partitioning clustering algorithms, such as K-means, K-medoids, PAM, CLARA and CLARANS, assign objects into k clusters (where k is a predefined cluster number) and iteratively reallocate objects to improve the quality of the clustering result. K-means is the most popular and easiest-to-understand clustering algorithm [5]. The K-means algorithm is very sensitive to the choice of the initial centroids; different centroids may produce significant variations in the clustering results. Another disadvantage of K-means is that there is no general theoretical answer to finding the optimal number of clusters for any given data set. A simple answer is to compare the results of multiple runs with different values of k and select the best one according to a given criterion, but when the data size is large, the multiple runs of K-means and the comparison of the clustering results after each run can be very time consuming.

K-medoids calculates the medoid of the objects in each cluster; the procedure of the K-medoids algorithm is quite similar to that of K-means. However, the K-medoids clustering algorithm is very sensitive to outliers, which may seriously influence the clustering results. To address this drawback, several K-medoids-based methods have been proposed, for example PAM (Partitioning Around Medoids) by Kaufman and Rousseeuw [6]. PAM inherits the features of the K-medoids clustering algorithm and adds a medoid-swap mechanism to produce better clustering results. PAM is more robust than K-means in handling noise and outliers, since medoids are less influenced by outliers. With an O(k(n-k)^2) processing cost for every iteration of the swap (where k is the number of clusters and n the number of items in the data set), it is clear that PAM performs well only on small data sets and does not scale to large ones. PAM is embedded in statistical analysis systems such as SAS, R and S+.

To enable applications to large data sets, CLARA (Clustering LARge Applications) applies PAM to multiple sampled subsets of a data set and, for each sample, can produce better clustering results than PAM on larger data sets. However, the efficiency of CLARA depends on the sample size, and a locally optimal clustering of the samples may not be the globally optimal clustering of the full data set. Ng and Han [7] abstract the medoid search in PAM and CLARA as searching for k subgraphs in a graph of n points and, based on this understanding, propose a PAM-like clustering algorithm called CLARANS (Clustering Large Applications based upon RANdomized Search). Whereas PAM searches the full graph and CLARA searches some random subgraphs, CLARANS randomly samples a set and selects k medoids by climbing subgraph "mountains": it selects the neighboring objects of the current medoids as candidates for new medoids, and it samples subsets to verify the medoids multiple times in order to avoid bad samples. Obviously, this repeated sampling for medoid verification is time consuming, which prevents CLARANS from clustering very large data sets in an acceptable time.
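As a concrete illustration of the "run K-means several times and pick the best k by a criterion" strategy described above, the following sketch uses scikit-learn with the silhouette coefficient as the selection criterion. The library choice, the data generator, and all parameter values are our own assumptions for illustration; the survey itself does not prescribe an implementation.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with four well-separated blobs
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.8, random_state=42)

best_k, best_score = None, -1.0
for k in range(2, 8):
    # n_init=10 restarts K-means from different random centroids,
    # softening its sensitivity to the initial configuration
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    score = silhouette_score(X, km.labels_)
    if score > best_score:
        best_k, best_score = k, score

print(f"selected k = {best_k}, silhouette = {best_score:.3f}")
```

As the text notes, this brute-force selection becomes expensive for large data sets, since each candidate k requires a full K-means run plus an evaluation pass.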
V. HIERARCHICAL CLUSTERING ALGORITHMS
Hierarchical methods cluster the training data into a tree of clusters. This tree structure is known as a dendrogram (Fig.9). It represents a sequence of nested clusters built top-down or bottom-up. The root of the tree represents one cluster containing all data points, while at the leaves of the tree there are n clusters, each containing one data point. By cutting the tree at a desired level, a clustering of the data points into disjoint groups is obtained. Hierarchical clustering algorithms divide into two categories, agglomerative and divisive, and both have been applied to document clustering.

Fig.9. Tree Structure of Training Data (Dendrogram)

AGNES (AGglomerative NESting) adopts an agglomerative strategy to merge clusters: it initially treats every object as a cluster, then merges clusters into upper-level clusters step by step according to a given agglomerative criterion, until all objects form a single cluster. The similarity between two clusters is measured by the similarity of the closest pair of data points in the two clusters, i.e., single link. DIANA (DIvisive ANAlysis) adopts the opposite strategy: it initially puts all objects in one cluster, then splits it into multi-level clusters until each cluster contains only one object. The merging/splitting decisions are critical in AGNES and DIANA; moreover, with an O(n^2) computational cost, their application does not scale to very large data sets.

Zhang et al. [8] proposed an efficient hierarchical clustering method, BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), to deal with the above problems. BIRCH summarizes an entire data set into a CF-tree, a multi-level compression structure, and then runs a hierarchical clustering algorithm on the CF-tree to obtain the clustering result. Its linear scalability is good: it clusters with a single scan, and the quality is further improved by a few additional scans. It is an efficient clustering technique for arbitrarily shaped clusters. However, BIRCH is sensitive to the input order of the data objects and can only deal with numeric data, which limits its stability and scalability in real-world applications. CURE uses a set of representative points to describe the boundary of a cluster in its hierarchical algorithm [9]; however, as the complexity of cluster shapes increases, the number of representative points increases dramatically in order to maintain precision. CHAMELEON [10] employs a multilevel graph-partitioning algorithm on the k-nearest-neighbor graph, which can produce better results than CURE on complex cluster shapes for spatial data sets, but the high complexity of the algorithm prevents its application to higher-dimensional data sets.
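Below is a minimal sketch of the two hierarchical approaches discussed above, again assuming scikit-learn as the implementation (library, data and parameters are our own illustrative choices): AgglomerativeClustering with single linkage approximates AGNES's closest-pair merging, and Birch builds the CF-tree summary described in [8].

```python
from sklearn.cluster import AgglomerativeClustering, Birch
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.7, random_state=0)

# AGNES-style bottom-up merging; linkage='single' merges the two
# clusters whose closest pair of points is nearest (single link)
agnes = AgglomerativeClustering(n_clusters=3, linkage="single").fit(X)

# BIRCH summarizes the data into a CF-tree in one scan; `threshold`
# bounds the radius of each CF subcluster, `branching_factor` the
# fan-out of the tree; the leaf entries are then clustered
birch = Birch(threshold=0.5, branching_factor=50, n_clusters=3).fit(X)

print(agnes.labels_[:10])
print(birch.labels_[:10])
```

Note that the agglomerative variant computes pairwise distances and so inherits the O(n^2) scaling discussed above, whereas Birch touches the raw data only during its single scan.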
A. Density-Based Methods
Most partitioning methods cluster objects based on the distance between objects, whereas density-based methods can find arbitrarily shaped clusters. The general idea is to continue growing a given cluster as long as the density, i.e., the number of objects or data points in the neighborhood, exceeds some threshold; such methods can also be used to filter out noise and outliers. More precisely, for every point of a cluster, the neighborhood of a given radius must contain at least a minimum number of points, i.e., the density in the neighborhood has to reach some threshold. Distance-based concepts, by contrast, implicitly assume that clusters have spherical or regular shapes; density-based clustering instead finds clusters of arbitrary shape in spatial databases with noise, forming clusters as maximal sets of density-connected points. The core notions in density-based clustering are density-reachability and density-connectivity. In addition, two input parameters are required: Eps, the radius, and MinPts, the minimum number of points needed to form a cluster. The algorithm starts with an arbitrary point that has not yet been visited; its ε-neighborhood is retrieved, and if it contains sufficiently many points, a cluster is started, otherwise the point is labeled as noise. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) was proposed to adopt density-reachability and density-connectivity for handling arbitrarily shaped clusters and noise [11]. However, DBSCAN is very sensitive to the parameters Eps (radius) and MinPts (threshold density), because the user is expected to estimate Eps and MinPts before exploring the clusters. DENCLUE (DENsity-based CLUstEring) is a distribution-based algorithm [12] that performs well on clustering large data sets with high noise; it is also significantly faster than existing density-based algorithms, but it needs a large number of parameters. OPTICS is good at identifying arbitrarily shaped clusters, but its non-linear complexity typically makes it applicable only to small or medium data sets [13].
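A minimal sketch of the DBSCAN behavior described above, assuming scikit-learn as the implementation (eps corresponds to Eps and min_samples to MinPts; the data set and parameter values are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved half-moons: arbitrarily shaped clusters that
# centroid-based methods such as K-means cannot separate
X, _ = make_moons(n_samples=400, noise=0.07, random_state=0)

# eps is the neighborhood radius (Eps); min_samples is the density
# threshold (MinPts); points in sparse regions get the label -1 (noise)
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
print(f"clusters: {n_clusters}, noise points: {int(np.sum(db.labels_ == -1))}")
```

Changing eps or min_samples even slightly can merge or split these clusters, which is exactly the parameter sensitivity the text attributes to DBSCAN.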
B. Grid-Based Methods
Grid-based clustering methods quantize the object space into a finite number of cells that form a grid structure, and they answer clustering-oriented queries over multi-level grid structures: the upper level of the grid stores the summary information of its next level, so the grids form cells between the connected levels, as illustrated in Fig.10. The main advantage of this approach is its fast processing time, which is independent of the number of data objects and depends only on the number of cells in each dimension of the quantized space. Many grid-based methods have been proposed, such as STING (STatistical INformation Grid approach) [14], MAFIA [15], CLIQUE [16], and the combined grid-density-based technique WaveCluster. Grid-based methods are efficient, with a complexity of O(N), but their primary issue is how to decide the size of the grid cells, which largely depends on the user's experience.

Fig.10. The Grid-Cell Structure of Grid-Based Clustering Methods

Based on the above review, we can conclude that applying clustering algorithms to detect groupings in real-world data mining applications is still a challenge, because most existing clustering algorithms cope inefficiently with arbitrarily shaped distributions in extremely large, high-dimensional data sets. Table I summarizes the advantages and disadvantages of the surveyed techniques.
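To make the grid idea tangible, here is a toy, self-contained sketch, entirely our own construction and far simpler than STING or WaveCluster: points are quantized into grid cells, cells containing at least density_threshold points are marked dense, and neighboring dense cells are merged by flood fill. The merging step runs over cells, not points, which mirrors the claim that processing time depends on the number of cells per dimension rather than on the number of data objects.

```python
import numpy as np
from collections import defaultdict
from itertools import product

def grid_cluster(X, n_cells=20, density_threshold=5):
    """Toy grid-based clustering (illustrative only): quantize points
    into an n_cells-per-dimension grid, keep dense cells, and merge
    neighboring dense cells into clusters."""
    mins, maxs = X.min(axis=0), X.max(axis=0)
    ids = np.floor((X - mins) / (maxs - mins + 1e-12) * n_cells).astype(int)
    ids = np.clip(ids, 0, n_cells - 1)

    # Count the points falling into each cell, keep the dense ones
    counts = defaultdict(int)
    for cell in map(tuple, ids):
        counts[cell] += 1
    dense = {c for c, n in counts.items() if n >= density_threshold}

    # Flood fill over neighboring dense cells (offsets in {-1,0,1}^d)
    cluster_of, label = {}, 0
    for start in dense:
        if start in cluster_of:
            continue
        stack = [start]
        while stack:
            c = stack.pop()
            if c in cluster_of:
                continue
            cluster_of[c] = label
            for d in product((-1, 0, 1), repeat=len(c)):
                nb = tuple(np.add(c, d))
                if nb in dense and nb not in cluster_of:
                    stack.append(nb)
        label += 1

    # Points falling in sparse cells are treated as noise (-1)
    return np.array([cluster_of.get(tuple(c), -1) for c in ids])
```

Choosing n_cells is this sketch's version of the grid-size problem mentioned above: too coarse a grid merges distinct clusters, while too fine a grid fragments them.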
TABLE I. Clustering Techniques: Advantages and Disadvantages

| Clustering Type | Cluster Name | Advantages | Disadvantages |
|---|---|---|---|
| Partitional | K-means | Simple; the most popular algorithm | Sensitive to outliers; centroids are not meaningful in most problems; very sensitive to the initial configuration, and the obtained partition is often only suboptimal, i.e., not the globally best partition |
| Partitional | PAM | Works satisfactorily for small datasets; robust to outliers | The number of clusters must be predetermined; not efficient for clustering medium and large datasets |
| Partitional | CLARA | Applicable to large data sets | Sensitive to outliers |
| Partitional | CLARANS | Handles outliers effectively | High cost |
| Density based | DBSCAN | Can detect and handle clusters of various (arbitrary) shapes and sizes; suitable for large databases | Building a spatial index for finding clusters of different shapes is time-consuming; less applicable to high-dimensional data sets; proper initial values of Eps and MinPts are not easy to determine, and adjusting these parameters is not easy when the number of samples changes |
| Density based | OPTICS | Good for data sets with a large amount of noise; nonparametric, i.e., does not require the user to input parameters | Handles only small or medium data sets because of its non-linear complexity |
| Density based | DENCLUE | Solid mathematical foundation; faster in computation; performs well on large datasets with high noise; more effective in discovering clusters of varied shapes | Needs a large number of parameters |
| Density based | RDBC | An improvement over DBSCAN that yields superior results; handles noise effectively | — |
| Hierarchical agglomerative | BIRCH | Scales linearly: finds a good clustering with a single scan and improves the quality with a few additional scans | Handles only numeric data; sensitive to the order of the data records |
| Hierarchical agglomerative | CURE | Robust to outliers; appropriate for handling large datasets | Ignores information about the interconnectivity of objects in two clusters |
| Hierarchical agglomerative | ROCK | Does not require a-priori specification of the number of clusters; appropriate for large datasets | Space complexity depends on the initialization of local heaps |
| Hierarchical agglomerative | CHAMELEON | Appropriate for large datasets; able to identify noise data while clustering; robust to outliers | Not applicable to high dimensions |
| Hierarchical agglomerative | S-link | Does not need the number of clusters to be specified | Sensitive to outliers |
| Hierarchical agglomerative | Ave-link | Considers all members of a cluster rather than a single point | Produces clusters with the same variance |
| Hierarchical agglomerative | Com-link | — | Has problems with hyper-spherical and convex-shaped clusters |
| Hierarchical agglomerative | BKMS | Better performance than K-means; not strongly affected by outliers | Cost varies; a termination condition needs to be satisfied |
| Grid | STING | Low computational cost | Does not define an appropriate level of granularity; cannot handle varying densities |
| Grid | WaveCluster | Resistant to noise; allows parallelization and multiresolution; high-quality clusters; works well in relatively high-dimensional spatial data; successful outlier handling; dimensionality reduction | Cost varies |
| Grid | CLIQUE | Scalability; insensitive to noise | Prone to problems with high-dimensional clusters |
VI. CONCLUSION
The process of data mining separates information from a large data set and transforms it into an understandable form. Clustering plays an important role in data mining applications and analysis. Clustering can be performed by different families of algorithms, such as grid-based, hierarchical, partitioning and density-based algorithms. Grid-based clustering uses a finite number of cells that form a grid structure; hierarchical clustering is connectivity-based; and partitioning algorithms are centroid-based. These clustering techniques produce efficient clusters at low cost compared to other approaches. In this paper we have classified clustering algorithms; one of the foremost purposes of these algorithms is to minimize disk I/O operations, consequently reducing time complexity. We have stated the attributes, advantages and disadvantages of the algorithms, and finally compared them: Table I illustrates the advantages and disadvantages of the clustering techniques.

VII. REFERENCES
[1] R. Saranya and P. Krishnakumari, "Clustering with Multi View Point-Based Similarity Measure using NMF", International Journal of Scientific Research and Management (IJSRM), Volume 1, Issue 6, 2013.
[2] Pavel Berkhin, "A Survey of Clustering Data Mining Techniques", pp. 25-71, 2002.
[3] Pradeep Rai and Shubha Singh, "A Survey of Clustering Techniques", International Journal of Computer Applications, October 2010.
[4] M. Vijayalakshmi and M. Renuka Devi, "A Survey of Different Issues of Different Clustering Algorithms Used in Large Data Sets", International Journal of Advanced Research in Computer Science and Software Engineering, pp. 305-307, 2012.
[5] J. McQueen, "Some Methods for Classification and Analysis of Multivariate Observations", Proc. of the 5th Berkeley Symposium on Mathematics, Statistics and Probability, Volume 1, 1967, pp. 281-298.
[6] L. Kaufman and P. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley and Sons, New York, NY, 1990.
[7] R. Ng and J. Han, "Efficient and Effective Clustering Methods for Spatial Data Mining", in Proceedings of the 20th Conference on VLDB, Santiago, Chile, 1994, pp. 144-155.
[8] Tian Zhang, Raghu Ramakrishnan and Miron Livny, "BIRCH: An Efficient Data Clustering Method for Very Large Databases", Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, Montreal, Quebec, 1996.
[9] S. Guha, R. Rastogi and K. Shim, "CURE: An Efficient Clustering Algorithm for Large Databases", Proc. of the ACM SIGMOD Conference, 1998.
[10] G. Karypis, E.-H. Han and V. Kumar, "CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling", IEEE Computer, 32, pp. 68-75, 1999.
[11] Chen Ning, Chen An and Zhou Long-xiang, "An Incremental Grid Density-Based Clustering Algorithm", Journal of Software, Vol. 13, No. 1, 2002.
[12] M. Parimala, Daphne Lopez and N. C. Senthilkumar, "A Survey on Density Based Clustering Algorithms for Mining Large Spatial Databases", International Journal of Advanced Science and Technology, Vol. 31, June 2011.
[13] M. Ankerst, M. M. Breunig, H.-P. Kriegel and J. Sander, "OPTICS: Ordering Points To Identify the Clustering Structure", in Proceedings of the ACM SIGMOD Conference, 1999, pp. 49-60.
[14] Wei Wang, Jiong Yang and Richard Muntz, "STING: A Statistical Information Grid Approach to Spatial Data Mining", Department of Computer Science, University of California, Los Angeles.
[15] S. Goil, H. Nagesh and A. Choudhary, "MAFIA: Efficient and Scalable Subspace Clustering for Very Large Data Sets", Technical Report, Northwestern University, 1999.
[16] L. Parsons, E. Haque and H. Liu, "Subspace Clustering for High Dimensional Data: A Review", SIGKDD Explorations, Vol. 6, pp. 90-105, 2004.
Author's Profile:
Mahesh Kandakatla earned an M.Tech. in CSE from JNTU Hyderabad. Presently, he is a research scholar at OPJS University, Rajasthan, India. His active research interests include data mining, network security and ad-hoc networks.
Dr. Lokanatha C. Reddy earned an M.Sc. (Maths) from the Indian Institute of Technology, New Delhi; an M.Tech. (CS) with Honours from the Indian Statistical Institute, Kolkata; and a Ph.D. (CS) from Sri Krishnadevaraya University, Anantapur. He earlier worked at KSRM College of Engineering, Kadapa; at the Indian Space Research Organization (ISAC), Bangalore; and as the Head of the Computer Centre at Sri Krishnadevaraya University, Anantapur. Presently, he is a Professor of Computer Science at the Dravidian University, India. His active research interests include real-time computation, distributed computation, digital image processing, pattern recognition, networks, data mining, digital libraries and machine translation.