Vol 19, No. 8;Aug 2012
Performance Comparison between k-Means and Fuzzy C-Means Algorithms using Arbitrary Data Points
Dr. T. VELMURUGAN
Associate Professor, PG and Research Department of Computer Science, D.G.Vaishnav College, Chennai-600106, India. Tel: +91- 9381032070 E-Mail: [email protected]
Abstract
Clustering is one of the most important exploratory data analysis methods and is widely used in many real-time applications. Most clustering algorithms have proved their efficiency in solving different kinds of problems on various data sets. Partition-based clustering algorithms are simple to implement, which makes it easy to test their performance and clustering quality. This research work analyzes the performance of two such algorithms, namely k-Means and Fuzzy C-Means. The performance comparison is carried out by clustering arbitrarily distributed data points. Differently shaped sets of arbitrarily distributed data points are given as input to the algorithms, and the number of data points in each cluster and the time complexity are the outputs. The comparison is based on the computational time of the two algorithms. The experimental results show that the performance of the k-Means algorithm is better than that of the Fuzzy C-Means algorithm.
Keywords: k-Means Clustering, Fuzzy C-Means Clustering, Arbitrary data points, Cluster analysis
1. Introduction
Data clustering is an unsupervised data analysis and data mining technique that offers refined and more abstract views of the inherent structure of a data set by partitioning it into a number of disjoint or overlapping (fuzzy) groups. Hundreds of clustering algorithms have been developed by researchers from many scientific and other disciplines. The development of clustering methods is highly interdisciplinary: contributions have been made, for example, by psychologists, biologists, statisticians, social scientists, and engineers. There is a huge number of clustering applications in many different fields, such as biological sciences, life sciences, medical sciences, behavioral and social sciences, earth sciences, engineering and information, policy and decision sciences, to mention just a few. This emphasizes the importance of data clustering as a key technique of data mining and knowledge discovery, pattern recognition and statistics. Among the varieties of clustering algorithms, partition-based algorithms have the advantage of being able to incorporate knowledge about the global shape or size of clusters by using appropriate prototypes and distance measures in the objective function [9][10]. The intention of this paper is to present a performance comparison between two partition-based clustering methods, namely k-Means and Fuzzy C-Means (FCM). To do so, the time complexity of the algorithms is taken for analysis. The analysis is based on the clustering result quality of the algorithms. Arbitrarily distributed data points are given as input to the algorithms. The rest of the paper is organized as follows. Section 2 describes the two clustering algorithms and their basic concepts. Section 3 tabulates and discusses the experimental results of the chosen algorithms. Section 4 summarizes the experimental results. Finally, Section 5 contains the conclusions.
2. Materials and Methods
Data mining is the process of extracting features, discovering patterns and clustering data from large volumes of raw data. Handling such data archives creates a growing need for rapid processing. Clustering is one of the most powerful techniques in data mining research. Clustering is the process of grouping similar data into groups called clusters, so that the objects in the same cluster are more similar to each other than to the objects in other clusters [7][8]. It is a useful approach in data mining for identifying hidden patterns and revealing underlying knowledge in large data collections. This research work is carried out to compare the
performance of k-Means and FCM clustering algorithms based on the quality of the clustering results, as stated in [9][10]. The basic ideas and concepts are explored in the following sections.
2.1 The k-Means Algorithm
The k-Means algorithm is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. The procedure follows a simple and easy way to classify a given data set into a certain number of clusters (assume k clusters) fixed a priori [2] [3]. The main idea is to define k centroids, one for each cluster. These centroids should be placed carefully, because different locations lead to different results. The better choice is therefore to place them as far away from each other as possible. The next step is to take each point belonging to the given data set and associate it with the nearest centroid. When no point is pending, the first step is completed and an initial grouping is done. At this point it is necessary to re-calculate k new centroids as barycenters of the clusters resulting from the previous step. After obtaining these k new centroids, a new binding has to be done between the same data points and the nearest new centroid. A loop has thus been generated. As a result of this loop, the k centroids change their location step by step until no more changes occur; in other words, the centroids do not move any more. Finally, this algorithm aims at minimizing an objective function, in this case a squared error function. The objective function is

$$J = \sum_{j=1}^{k} \sum_{i=1}^{n} \left\| x_i^{(j)} - c_j \right\|^2 \qquad (1)$$

where $\left\| x_i^{(j)} - c_j \right\|^2$ is a chosen distance measure between a data point $x_i^{(j)}$ and the cluster centre $c_j$, and $J$ is an indicator of the distance of the n data points from their respective cluster centers. The algorithm is composed of the following steps:
1. Place k points into the space represented by the objects that are being clustered. These points represent initial group centroids.
2. Assign each object to the group that has the closest centroid.
3. When all objects have been assigned, recalculate the positions of the k centroids.
4. Repeat Steps 2 and 3 until the centroids no longer move. This produces a separation of the objects into groups from which the metric to be minimized can be calculated.
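The four steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in the experiments: the function names (`k_means`, `assign`, `mean_point`, `squared_distance`), the `seed` and `max_iter` parameters, the choice of k random data points as initial centroids, and the use of squared Euclidean distance are all assumptions made for the example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean (barycenter) of a non-empty list of points."""
    n = len(points)
    return tuple(sum(column) / n for column in zip(*points))


def assign(points, centroids):
    """Step 2: index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)),
            key=lambda j: squared_distance(p, centroids[j]))
        for p in points
    ]


def k_means(points, k, max_iter=100, seed=0):
    """Hypothetical helper returning (centroids, labels, J)."""
    rng = random.Random(seed)
    # Step 1 (assumed strategy): use k distinct data points as initial centroids.
    centroids = [tuple(p) for p in rng.sample(points, k)]
    for _ in range(max_iter):
        labels = assign(points, centroids)
        # Step 3: recompute each centroid as the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Assumption: an empty cluster keeps its previous centroid.
            new_centroids.append(mean_point(members) if members else centroids[j])
        # Step 4: stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    labels = assign(points, centroids)
    # Objective of Eq. (1): sum of squared distances to the assigned centroid.
    J = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
    return centroids, labels, J


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    centroids, labels, J = k_means(data, k=2)
    print(centroids, labels, J)
```

The sketch stops when the recomputed centroids equal the previous ones, which mirrors Step 4; `max_iter` is only a safety bound added for the example.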
The algorithm is significantly sensitive to the initial randomly selected cluster centers. The k-Means algorithm can be run multiple times to reduce this effect. k-Means is a simple algorithm that has been adapted to many problem domains, and it is a good candidate for work on randomly generated data points. One of the most popular heuristics for solving the k-Means problem is based on a simple iterative scheme for finding a locally minimal solution [1][9]. This algorithm is often called the k-Means algorithm.
2.2 The Fuzzy C-Means Algorithm
Traditional clustering approaches generate partitions; in a partition, each pattern belongs to one and only one cluster. Fuzzy clustering extends this notion to associate each pattern with every cluster using a membership function. The most widely used clustering algorithm implementing the fuzzy philosophy is FCM, initially developed by Dunn and later generalized by Bezdek [4], who proposed a generalization by means of a family of objective functions. Although this algorithm proved to be less accurate than others, its fuzzy nature and ease of implementation made it very attractive to many researchers, who proposed various improvements and applications. The basic structure of the FCM algorithm is discussed below. The FCM algorithm is a method of clustering which allows one piece of data to belong to two or more clusters. This method is frequently used in pattern recognition [5]. It is based on minimization of the following objective function:
$J_m = \sum_{i=1}^{N} \sum_{j=1}^{C} u_{ij}^{m} \left\| x_i - c_j \right\|^2$, 1≤m