Journal of Intelligent & Fuzzy Systems 26 (2014) 229–238 DOI:10.3233/IFS-120730 IOS Press

A dissimilarity measure based fuzzy c-means (FCM) clustering algorithm Usman Qamar∗ Department of Computer Engineering, College of Electrical and Mechanical Engineering, National University of Sciences and Technology (NUST), Islamabad, Pakistan

Abstract. By definition, objects belonging to the same cluster must be highly similar, while objects belonging to different clusters should be highly dissimilar. Likewise, the cluster validity indices used to analyse clustering results are based on the same two properties of a cluster: compactness (intra-cluster similarity) and separation (inter-cluster dissimilarity). However, most clustering algorithms developed so far focus only on minimising the within-cluster distance, and almost all ignore the second property of a cluster, namely producing highly dissimilar clusters. This paper proposes a dissimilarity measure and incorporates it into the fuzzy c-means (FCM) clustering algorithm, a well-known and widely used data clustering algorithm, to analyse the benefit of considering this second property. We also introduce an effective new way of incorporating the effect of such a measure into a clustering algorithm. Experimental results on both synthetic and real datasets show the better performance attained by the new improved fuzzy c-means in comparison to the classical fuzzy c-means algorithm.

Keywords: Clustering, fuzzy c-means, cluster validity indices, dissimilarity measure

1. Introduction

Clustering is the process of assigning a set of records to subsets, called clusters, such that records in the same cluster are similar and records in different clusters are quite distinct. Clustering has a wide range of applications in engineering, computer science and medical science. A large number of clustering algorithms exist, divided into different types. Partitional clustering is one of the most commonly used techniques: a set of data points is assigned to C partitions (clusters), usually by optimising some objective function based on a similarity measure. Partitional clustering is further divided into two types, hard and fuzzy clustering. The difference between

∗ Corresponding author. Dr. Usman Qamar, Department of Computer Engineering, College of Electrical and Mechanical Engineering, National University of Sciences and Technology (NUST), H-12, Islamabad, Pakistan. E-mail: [email protected].

hard and fuzzy clustering is that in hard clustering a data point belongs to exactly one cluster at a time, while in fuzzy clustering it can be a member of many clusters simultaneously. Internal validation measures are most commonly used to evaluate the results of clustering algorithms. These are often based on two criteria: compactness and separation [1]. Compactness measures the similarity of data within a cluster, typically via the variance of, or the distances between, the data points within the cluster. Separation measures the distance between different clusters and is expected to be maximal; the distance measure can be the centre-to-centre distance of clusters or the distance between data points of different clusters. Although clustering and cluster validity measures thus have two goals, almost all partitional clustering algorithms use only one of them, the similarity measure. This paper puts an effort toward integrating the second goal, the dissimilarity measure,

1064-1246/14/$27.50 © 2014 – IOS Press and the authors. All rights reserved


into a clustering algorithm. This is done by first proposing a dissimilarity term and then an effective way of integrating this term into an existing partitional clustering algorithm, fuzzy c-means. The rest of the paper is organised as follows: Section 2 gives a brief summary of the fuzzy c-means clustering algorithm and some of its improved versions. Section 3 presents the proposed dissimilarity term and its integration into fuzzy c-means; this section also explains how the proposed improvement works. Section 4 summarises the validity indices used to compare clustering results. Results of the experiments on synthetic and real data are presented and discussed in Section 5. Finally, concluding remarks are given in Section 6.

2. Background

Here we provide some background on FCM and on other improved clustering algorithms based on it. These improved algorithms are presented to motivate the technique, discussed later in this paper, used to integrate the dissimilarity term into fuzzy c-means clustering.

2.1. Fuzzy c-means clustering

The fuzzy c-means (FCM) algorithm, sometimes also referred to as the fuzzy k-means algorithm, was developed by Dunn, 1974 [2] and improved by Bezdek, 1981 [3]. It extends the k-means clustering algorithm by embedding a fuzzy membership matrix into its objective function. Let X = {X_1, X_2, ..., X_n} be a collection of n objects, where each object X_i is represented as [x_{i,1}, x_{i,2}, ..., x_{i,m}] and m is the number of numerical attributes. To partition the dataset X into k clusters, represented by C = [c_1, c_2, ..., c_k]^T, a k × m matrix containing all cluster centres, the objective function used for FCM is

J_a(X, C, U) = \sum_{j=1}^{k} \sum_{i=1}^{n} (u_{i,j})^a D_{i,j}, \quad 1 \le a < \infty \qquad (1)

where U is an n × k matrix known as the fuzzy membership matrix and u_{i,j} is the degree of membership that the ith object of the given data has with the jth cluster centre c_j. Here, the square of the Euclidean norm is used as the distance (similarity) measure:

D_{i,j} = \sum_{l=1}^{m} (x_{i,l} - c_{j,l})^2, \quad 1 \le i \le n, \; 1 \le j \le k \qquad (2)

subject to the following condition:

\sum_{j=1}^{k} u_{i,j} = 1, \quad u_{i,j} \in (0, 1], \quad 1 \le i \le n \qquad (3)

To optimise the FCM objective function based on intra-cluster similarity, its value must be minimised. The minimisation problem is solved by iteratively calculating the cluster centres C and the fuzzy membership matrix U with the following equations until the change in U is negligible:

c_{j,l} = \frac{\sum_{i=1}^{n} (u_{i,j})^a \, x_{i,l}}{\sum_{i=1}^{n} (u_{i,j})^a}, \quad 1 \le j \le k, \; 1 \le l \le m \qquad (4)

u_{i,j} = \frac{1}{\sum_{l=1}^{k} \left( D_{i,j} / D_{i,l} \right)^{1/(a-1)}}, \quad 1 \le i \le n, \; 1 \le j \le k \qquad (5)

2.2. Competitive agglomerative clustering

Competitive agglomeration [4] combines the advantages of both hierarchical and partitional clustering. This is done by adding another term to the fuzzy k-means objective function:

J(X, C, U) = \sum_{j=1}^{k} \sum_{i=1}^{n} (u_{i,j})^2 D_{i,j} - \alpha \sum_{j=1}^{k} \left( \sum_{i=1}^{n} u_{i,j} \right)^2 \qquad (6)

subject to the condition in Equation (3). The cluster centres C are calculated with Equation (4); U, however, is calculated as

u_{i,j} = \frac{1/D_{i,j}}{\sum_{l=1}^{k} 1/D_{i,l}} + \frac{\alpha(z)}{D_{i,j}} \left( N_j - \frac{\sum_{l=1}^{k} (1/D_{i,l}) \, N_l}{\sum_{l=1}^{k} 1/D_{i,l}} \right), \quad N_j = \sum_{y=1}^{n} u_{y,j} \qquad (7)

where z is the iteration number and

\alpha(z) = \eta_0 \exp(-z/\tau) \, \frac{\sum_{j=1}^{k} \sum_{i=1}^{n} (u_{i,j})^2 D_{i,j}}{\sum_{j=1}^{k} \left( \sum_{i=1}^{n} u_{i,j} \right)^2} \qquad (8)
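The alternating updates in Eqs. (4) and (5) are easy to prototype. The following is a minimal illustrative sketch assuming NumPy; the function name `fcm` and parameters such as `tol` and `seed` are our own choices, not from the paper.

```python
import numpy as np

def fcm(X, k, a=2.0, max_iter=100, tol=1e-5, seed=None):
    """Plain fuzzy c-means: alternate Eq. (4) (centres) and Eq. (5)
    (memberships) until the change in U is negligible."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    # Random membership matrix whose rows sum to 1, as required by Eq. (3).
    U = rng.random((n, k))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(max_iter):
        W = U ** a
        C = (W.T @ X) / W.sum(axis=0)[:, None]           # Eq. (4)
        # Eq. (2): squared Euclidean distance of every point to every centre.
        D = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        D = np.fmax(D, 1e-12)                            # avoid division by zero
        # Eq. (5): ratio[i, j, l] = D[i, j] / D[i, l]; sum over l.
        ratio = (D[:, :, None] / D[:, None, :]) ** (1.0 / (a - 1.0))
        U_new = 1.0 / ratio.sum(axis=2)
        if np.abs(U_new - U).max() < tol:
            U = U_new
            break
        U = U_new
    return C, U
```

On two well-separated blobs this converges in a handful of iterations and assigns each blob a near-crisp membership column.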


2.3. Agglomerative fuzzy k-means clustering

Another clustering effort discussed here, known as agglomerative fuzzy k-means clustering [5], uses a similar method to competitive agglomerative clustering, but with a modified additional term in the objective function:

J(X, C, U) = \sum_{j=1}^{k} \sum_{i=1}^{n} u_{i,j} D_{i,j} + \lambda \sum_{j=1}^{k} \sum_{i=1}^{n} u_{i,j} \log u_{i,j} \qquad (9)

subject to the condition in Equation (3). Equation (4) is used to calculate the cluster centres C, while U is calculated as

u_{i,j} = \frac{\exp(-D_{i,j}/\lambda)}{\sum_{l=1}^{k} \exp(-D_{i,l}/\lambda)} \qquad (10)

The value of λ starts from a low value such as 0.1 and increases linearly as the iterations progress.

3. Proposed algorithm

To introduce the effect of the second objective of clustering, namely keeping the dissimilarity between objects of different clusters at a maximum, we first propose a dissimilarity term. This term equals the average distance between an object of a cluster and the centres of the remaining clusters, and is calculated as

B_{i,j} = \frac{1}{k-1} \sum_{p=1, \, p \ne j}^{k} \sum_{l=1}^{m} (x_{i,l} - c_{p,l})^2, \quad 1 \le i \le n, \; 1 \le j \le k \qquad (11)

where B_{i,j} measures the dissimilarity that the ith object, as a member of the jth cluster, has with the remaining clusters. In fuzzy clustering an object is a member of many clusters at a time. The dissimilarity term attains its maximum value when the object is considered a member of its true cluster, and its minimum value when the object is considered a member of the farthest (most dissimilar) cluster. The next task is to integrate the effect of this term into FCM in an effective manner. Studying the clustering algorithms discussed above, i.e. fuzzy c-means, competitive agglomerative clustering and agglomerative fuzzy k-means clustering, we observe that the last two add a term to the objective function of FCM for improvement. One possible way of incorporating the proposed term is therefore to include it in the objective function of FCM. But further study of the


algorithms reveals that the additional term to FCM ultimately appears in the calculation of the fuzzy membership matrix U. Each of these algorithms iteratively calculates the cluster centres C and the fuzzy membership matrix U to optimise its objective function, and the equation for the cluster centres C is identical across all of them. Ultimately, all of the above algorithms try to find a better U in order to improve their clustering results. In fuzzy clustering, the fuzzy membership matrix U contains the degree-of-membership values of all objects to their clusters; these values express the degree of similarity that an object has with each cluster. To integrate the dissimilarity term into FCM, we add a new step to the FCM algorithm after the calculation of U. In this step we first calculate the dissimilarity values of all objects with respect to all clusters in matrix form; this dissimilarity matrix B has the same n × k dimensions as U. We then multiply the membership value of the ith object to the jth cluster by the dissimilarity value of the ith object with respect to all clusters except cluster j, to include the benefit of the dissimilarity measure. Multiplying the values of U by some number violates the condition imposed on U in Equation (3); this violation is handled by renormalising the values of U to the range [0, 1]. Multiplying a larger dissimilarity value with the membership value of an object increases its resulting membership value more than multiplying a smaller dissimilarity value does. Since the total membership of an object equals one, normalisation adjusts the membership values of an object by subtracting membership values towards dissimilar clusters

Fig. 1. Randomly initialized cluster centers.


Fig. 2. After 1st iteration (modified FCM).

Fig. 3. After 2nd iteration (modified FCM).

and adding them to the most similar clusters. This leaves the fuzzy membership matrix in a state where it contains the desired increased membership values of objects to their true clusters.

3.1. Algorithm (improved fuzzy c-means clustering)

1) Randomly initialise the cluster centers
2) Repeat
   a. Compute U using (5)
   b. Compute B using (11)
   c. Multiply B with U to calculate the resultant U
   d. Normalise U to fulfil the condition in (3)
   e. Compute C using (4)
3) Until the change in U is small
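Steps 2b–2d above can be sketched as follows. This is an illustrative fragment assuming NumPy and the squared Euclidean distance of Eq. (2); the function name `dissimilarity_step` is ours, not the paper's.

```python
import numpy as np

def dissimilarity_step(U, X, C):
    """One pass of steps 2b-2d: build B per Eq. (11), weight U by it
    element-wise, and renormalise rows so Eq. (3) holds again."""
    n, k = U.shape
    # Squared Euclidean distance of every object to every centre (Eq. (2)).
    D = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    # Eq. (11): B[i, j] = mean distance from object i to all centres except c_j,
    # computed as (row total minus own distance) / (k - 1).
    B = (D.sum(axis=1, keepdims=True) - D) / (k - 1)
    U = U * B                           # step 2c: reward memberships whose
                                        # alternatives are far away
    U /= U.sum(axis=1, keepdims=True)   # step 2d: restore row sums to 1
    return U
```

With two objects sitting exactly on two distant centres and uniform initial memberships, the step sharpens each row towards the object's own centre, as the paper's worked example describes.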

Fig. 4. After 3rd iteration (modified FCM).

Fig. 5. After 4th iteration (modified FCM).

3.2. Working

To explain the working of the proposed improvement, an example is presented. There are nine data points in three clusters, with three data points per cluster; the initial cluster center positions are shown in Fig. 1. Figures 2, 3 and 4 show the fuzzy membership values of the data points for the 1st, 2nd and 3rd iterations, both before the dissimilarity values are multiplied into U and after the resulting change in U. It can be observed in the figures that the changes in the membership values meet expectations: the membership values of an object towards dissimilar clusters are subtracted and added to its membership values towards the most similar clusters, with the amount added or subtracted based on the similarity or dissimilarity towards each cluster. Figure 5


shows the final cluster center positions and the fuzzy membership values of the data points for FCM and modified FCM. It is clear from the final results that the new membership values are crisper and are obtained in fewer iterations by modified FCM compared to traditional FCM.

4. Cluster validity indices

Two main types of validity indices are used to validate clustering results. The first is external validity measures, which compute how similar the clusters returned by the clustering algorithm are to the benchmark classifications. The second is internal validity measures, which use the clustering result itself to assess validity. We selected the Rand Index (RI) [6] as the external validity measure, and three internal validity indices, the Partition Coefficient (PC) [3], Coefficient Entropy (CE) [7] and Index I [8], to validate our experimental results.

4.1. Rand index

The Rand index computes how similar the clusters returned by the clustering algorithm are to the benchmark classifications. One can also view the Rand index as a measure of the percentage of correct decisions made

Fig. 6. Synthetic dataset from a mixture of four Gaussian distributions having uniform densities (a) Σ (b) 2Σ (c) 3Σ (d) 4Σ (e) 5Σ (f) 6Σ (g) 7Σ (h) 8Σ (i) 9Σ (j) 10Σ.


by the algorithm. It can be computed using the following formula:

RI = \frac{TP + TN}{TP + FP + FN + TN} \qquad (12)

where TP is the number of true positives, TN the number of true negatives, FP the number of false positives, and FN the number of false negatives. The value of the Rand index ranges from 0 to 1, with higher values indicating higher accuracy.
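Eq. (12) can be evaluated directly by counting agreements over all object pairs. A small self-contained sketch (the function name `rand_index` is ours):

```python
from itertools import combinations

def rand_index(labels_true, labels_pred):
    """Eq. (12): fraction of object pairs on which clustering and benchmark
    agree. A pair grouped together in both is a TP; a pair separated in
    both is a TN."""
    tp = tn = fp = fn = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true and same_pred:
            tp += 1
        elif not same_true and not same_pred:
            tn += 1
        elif same_pred:
            fp += 1
        else:
            fn += 1
    return (tp + tn) / (tp + tn + fp + fn)
```

Note the index is invariant to relabelling the clusters, since only pairwise co-membership is compared.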

4.2. Partition coefficient

PC measures the amount of overlap between clusters. Its value ranges from 1/k to 1: the nearer the value is to 1, the crisper the clustering results, while a value near 1/k exposes the fuzziness of the clustering results.

PC = \frac{1}{n} \sum_{j=1}^{k} \sum_{i=1}^{n} u_{i,j}^2 \qquad (13)

4.3. Coefficient entropy

CE measures the fuzziness of the clustering results. A value nearer to 0 indicates good clustering results, while a value near its upper bound, \log_a k, indicates the inability of the clustering algorithm to produce good results.

CE = -\frac{1}{n} \sum_{j=1}^{k} \sum_{i=1}^{n} u_{i,j} \log_a(u_{i,j}) \qquad (14)

4.4. Index I

This index measures separation by the maximum distance between cluster centres, and compactness by the sum of distances between objects and their cluster centre. A larger value of Index I indicates more accurate clustering results.

I = \left( \frac{1}{k} \cdot \frac{\sum_{x \in X} d(x, c)}{\sum_{i} \sum_{x \in C_i} d(x, c_i)} \cdot \max_{i,j} d(c_i, c_j) \right)^p \qquad (15)
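The two membership-based indices, Eqs. (13) and (14), are straightforward to compute from U. A sketch assuming NumPy (function names are ours; the minus sign in CE follows Eq. (14)):

```python
import numpy as np

def partition_coefficient(U):
    """Eq. (13): mean squared membership; 1 = crisp, 1/k = maximally fuzzy."""
    return float((U ** 2).sum() / U.shape[0])

def coefficient_entropy(U, base=np.e):
    """Eq. (14): membership entropy; 0 = crisp, log_a(k) = maximally fuzzy."""
    V = np.fmax(U, 1e-12)  # clamp to avoid log(0) on crisp memberships
    return float(-(V * np.log(V)).sum() / (np.log(base) * U.shape[0]))
```

For a fully crisp U the PC equals 1 and the CE is (numerically) 0; for a uniform U over k = 2 clusters, PC = 1/2 and CE = log 2.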

Fig. 7. (a) Rand index vs data variance. (b) PC vs data variance. (c) CE vs data variance. (d) I index vs data variance.


5. Experimental results

5.1. Experiment no. 1

In this experiment, 1320 synthetic data points drawn from the mixture of four Gaussian distributions with uniform densities given below are tested, with the variance scaled from Σ to 10Σ. The experiment is repeated 50 times for every variance value and average values are taken to obtain stable results.

0.25 \, \mathcal{N}\!\left( \begin{bmatrix} 1.0 \\ 1.0 \end{bmatrix}, \begin{bmatrix} 0.02 & 0.005 \\ 0.005 & 0.02 \end{bmatrix} \right) + 0.25 \, \mathcal{N}\!\left( \begin{bmatrix} 2.5 \\ 2.5 \end{bmatrix}, \begin{bmatrix} 0.02 & 0.0 \\ 0.0 & 0.02 \end{bmatrix} \right) + 0.25 \, \mathcal{N}\!\left( \begin{bmatrix} 2.0 \\ 4.0 \end{bmatrix}, \begin{bmatrix} 0.02 & 0.005 \\ 0.005 & 0.02 \end{bmatrix} \right) + 0.25 \, \mathcal{N}\!\left( \begin{bmatrix} 4.0 \\ 2.0 \end{bmatrix}, \begin{bmatrix} 0.02 & 0.005 \\ 0.005 & 0.02 \end{bmatrix} \right)

The data at each variance level are shown in Fig. 6(a) to Fig. 6(j) and the results are given in Fig. 7(a) to Fig. 7(d). Figure 7(a) contains the results for the external validity index (the Rand measure), which shows that the accuracy

Fig. 8. Synthetic dataset from a mixture of four Gaussian distributions having multiple densities (a) Σ (b) 2Σ (c) 3Σ (d) 4Σ (e) 5Σ (f) 6Σ (g) 7Σ (h) 8Σ (i) 9Σ (j) 10Σ.


of both algorithms is almost the same under changing variance in the case of uniform-density clusters. The internal validity indices PC, CE and I in Fig. 7(b), Fig. 7(c) and Fig. 7(d), however, show a clear improvement achieved by modified FCM. The PC and CE indices additionally show that the crispness of the clustering results of modified FCM increases as the variance of the input data increases. Index I shows that modified FCM retains more capability than FCM to recover the cluster structure even when the data overlap.

5.2. Experiment no. 2

Here experiment no. 1 is repeated, but this time with 1100 synthetic data points and the multi-density clusters given by

0.11 \, \mathcal{N}\!\left( \begin{bmatrix} 1.0 \\ 1.0 \end{bmatrix}, \begin{bmatrix} 0.02 & 0.005 \\ 0.005 & 0.02 \end{bmatrix} \right) + 0.22 \, \mathcal{N}\!\left( \begin{bmatrix} 2.5 \\ 2.5 \end{bmatrix}, \begin{bmatrix} 0.02 & 0.0 \\ 0.0 & 0.02 \end{bmatrix} \right) + 0.33 \, \mathcal{N}\!\left( \begin{bmatrix} 2.0 \\ 4.0 \end{bmatrix}, \begin{bmatrix} 0.02 & 0.005 \\ 0.005 & 0.02 \end{bmatrix} \right) + 0.44 \, \mathcal{N}\!\left( \begin{bmatrix} 4.0 \\ 2.0 \end{bmatrix}, \begin{bmatrix} 0.02 & 0.005 \\ 0.005 & 0.02 \end{bmatrix} \right)

The data at each variance level are shown in Fig. 8(a) to Fig. 8(j) and the results in Fig. 9(a) to Fig. 9(d). In the case of multi-density clusters, the Rand index in Fig. 9(a) shows some improvement in the accuracy of modified FCM as the variance of the data increases. The internal validity indices show a trend similar to the uniform-density case, but the gains in PC and CE of modified FCM over FCM are larger for multi-density clusters than for uniform-density clusters. It can therefore be added that the performance of the proposed method improves further when the clusters have multiple densities.
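The paper does not publish its data generator, but a mixture like the one in experiment no. 1 can be sampled as follows. This sketch assumes NumPy; the function name `make_mixture`, the `seed`/`scale` parameters, and the equal-sized component sampling (matching the 0.25 weights of experiment no. 1) are our assumptions.

```python
import numpy as np

def make_mixture(n=1320, scale=1.0, seed=None):
    """Draw n points from the four-component Gaussian mixture of
    experiment no. 1, with every covariance multiplied by `scale`
    (1 through 10 in the paper's variance sweep)."""
    rng = np.random.default_rng(seed)
    cov = np.array([[0.02, 0.005], [0.005, 0.02]])
    components = [
        ([1.0, 1.0], cov),
        ([2.5, 2.5], np.diag([0.02, 0.02])),
        ([2.0, 4.0], cov),
        ([4.0, 2.0], cov),
    ]
    X, y = [], []
    for label, (mean, c) in enumerate(components):
        size = n // len(components)  # equal 0.25 mixing weights
        X.append(rng.multivariate_normal(mean, scale * c, size=size))
        y.extend([label] * size)
    return np.vstack(X), np.array(y)
```

Sampling with `scale` ranging over 1 to 10 reproduces the kind of increasingly overlapped data shown in Fig. 6.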

Fig. 9. (a) Rand index vs data variance. (b) PC vs data variance. (c) CE vs data variance. (d) I index vs data variance.


Fig. 10. (a) Well separated compact clusters. (b) Well separated compact clusters with noise. (c) Multi density clusters. (d) Sub-clusters. (e) Skew distributed clusters.

Table 1. Determined number of clusters using different validity indices

5.3. Experiment no. 3

A central problem in clustering is finding the right number of clusters in the data. In this experiment the modified FCM is compared with FCM for finding the optimal number of clusters on datasets covering the different aspects indicated below, as well as on some real datasets.

1) Well separated compact clusters (Fig. 10(a))
2) Well separated compact clusters with noise (Fig. 10(b))
3) Multi density clusters (Fig. 10(c))
4) Sub-clusters (Fig. 10(d))
5) Skew distributed clusters (Fig. 10(e))
6) Iris dataset
7) Wine dataset

Results are shown in Table 1. For the two datasets well separated compact and well separated compact with noise, all three validity indices recover the true number of clusters for both algorithms. In the case of multi density and skew distributed clusters, FCM fails to find the true number of clusters, while modified FCM finds the true clusters in multi density using the PC index and in skew

Dataset (true # of clusters)              Algorithm      I   PC  CE
Well separated compact (5)                FCM            5   5   5
                                          Modified FCM   5   5   5
Well separated compact with noise (5)     FCM            5   5   5
                                          Modified FCM   5   5   5
Multi-density (3)                         FCM            2   2   2
                                          Modified FCM   2   3   2
Sub-cluster (5)                           FCM            5   3   3
                                          Modified FCM   5   3   3
Skew distributed (3)                      FCM            2   2   2
                                          Modified FCM   3   3   2
Iris (3)                                  FCM            2   2   2
                                          Modified FCM   3   2   2
Wine (3)                                  FCM            2   2   2
                                          Modified FCM   3   2   2

Bold values indicate best results.

distributed using both the PC and I indices. In the case of sub-clusters, both algorithms find the actual number of clusters using the I index. Modified FCM again shows its effectiveness when tested on the two real-world datasets, Iris and Wine: FCM fails to discover the true number of clusters, while modified FCM finds it using the I index.


6. Conclusion

In this paper, an effort was made to include in fuzzy c-means the effect of the second objective of both clustering and its validity indices. To this end, we first proposed a dissimilarity term that measures the dissimilarity between a data point and the clusters; this term equals the mean Euclidean distance between the data point and the remaining cluster centres. We then integrated the effect of this dissimilarity term into the fuzzy c-means algorithm by adding a new step in which a dissimilarity matrix is calculated based on the second objective of clustering, namely maximising the inter-cluster distance. The dissimilarity matrix is multiplied with the fuzzy membership matrix, and the resultant fuzzy membership matrix is normalised to fulfil the condition imposed on it. This produces a new fuzzy membership matrix that is closer to the desired one. Experiments were carried out on synthetic datasets covering different aspects of data and on the real datasets Iris and Wine. The outcomes of the validity indices confirmed the effectiveness of integrating the dissimilarity measure into the fuzzy c-means algorithm. It can be concluded that modified FCM outperforms the

traditional FCM in the case of multi density, skew distributed clusters and sparse data.

References

[1] Y. Liu, Z. Li, H. Xiong, X. Gao and J. Wu, Understanding of internal clustering validation measures, in: 10th IEEE International Conference on Data Mining, 2010, pp. 911–916.
[2] J.C. Dunn, A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters, Journal of Cybernetics 3(3) (1974), 32–57.
[3] J.C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum Press, New York, 1981.
[4] H. Frigui and R. Krishnapuram, Clustering by competitive agglomeration, Pattern Recognition 30(7) (1997), 1109–1119.
[5] M.J. Li, M.K. Ng, Y.-M. Cheung and J.Z. Huang, Agglomerative fuzzy k-means clustering algorithm with selection of number of clusters, IEEE Transactions on Knowledge and Data Engineering 20(11) (2008), 1519–1534.
[6] W.M. Rand, Objective criteria for the evaluation of clustering methods, Journal of the American Statistical Association 66(336) (1971), 846–850.
[7] J.C. Bezdek, Cluster validity with fuzzy sets, Journal of Cybernetics 3 (1974), 58–73.
[8] U. Maulik and S. Bandyopadhyay, Performance evaluation of some clustering algorithms and validity indices, IEEE Transactions on Pattern Analysis and Machine Intelligence 24(12) (2002), 1650–1654.
