Intra-cluster Similarity Index Based on Fuzzy Rough Sets for Fuzzy C-Means Algorithm

Fan Li, Fan Min, and Qihe Liu

School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, 610051, P.R. China
{lifan,minfan,qiheliu}@uestc.edu.cn
Abstract. Cluster validity indices are used to evaluate the quality of fuzzy partitions. In this paper, we propose a new index that uses concepts of Fuzzy Rough sets to evaluate the average intra-cluster similarity of the fuzzy clusters produced by the fuzzy c-means algorithm. Experimental results show that, compared with several well-known cluster validity indices, the proposed index yields a more desirable estimate of the cluster number.

Keywords: Fuzzy c-means algorithm, Fuzzy Rough sets, Intra-cluster similarity, Cluster validity index.
1 Introduction
Cluster analysis, which reveals the structure existing in a given data set (a set of patterns), can be viewed as the problem of dividing the data set into a few compact subsets. The fuzzy c-means (FCM) algorithm [1] for cluster analysis has been the dominant approach in both theoretical and practical applications of fuzzy techniques for the last two decades. The aim of FCM is to partition a given set of data points (patterns) $X = \{x_1, x_2, \ldots, x_n\} \subset \mathbb{R}^p$ into $c$ clusters represented as fuzzy sets $F_1, F_2, \ldots, F_c$. The FCM objective function has the form

$$J_m(U, V) = \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^{m} \, \|x_j - v_i\|^2, \tag{1}$$

where $v_i$ is the centroid of the fuzzy cluster $F_i$, $\|\cdot\|$ is a certain distance function, the exponent $m > 1$ is a fuzzifier, $u_{ij} = F_i(x_j)$ is the membership value of $x_j$ belonging to $F_i$, satisfying $\sum_{i=1}^{c} u_{ij} = 1$ ($j = 1, 2, \ldots, n$) and $0 < \sum_{j=1}^{n} u_{ij} < n$ ($i = 1, 2, \ldots, c$), $U = [u_{ij}]$ is the partition matrix, and $V = \{v_1, v_2, \ldots, v_c\}$ is the set of all cluster centroids. FCM iteratively updates $U$ and $V$ to minimize $J_m(U, V)$ until a certain termination criterion has been satisfied. In FCM, a fuzzy partition is denoted as $(U, V)$.

In FCM, if $c$ is not known a priori, a cluster validity index must be used to evaluate the quality of fuzzy partitions for different values of $c$ and to find the optimal cluster number. In the most cited indices, e.g. the Xie-Beni index [2] and the Fukuyama-Sugeno index [3], the intra-cluster similarity of a fuzzy partition is estimated by using distances between data points and cluster centroids. But this approach is not effective for large values of $c$, because $\lim_{c \to n} \|x_j - v_i\|^2 = 0$ (see [4,5]). To overcome this shortcoming, the Kwon index [5] was proposed, and another kind of index has been proposed in recent years [6,7]. This kind of index only considers the inter-cluster proximity, which is evaluated by the membership values of each data point belonging to all fuzzy clusters, whereas the distance function is not taken into account.

G. Wang et al. (Eds.): RSKT 2008, LNAI 5009, pp. 316–323, 2008. © Springer-Verlag Berlin Heidelberg 2008

In this paper, we propose a new method to assess the intra-cluster similarity of a fuzzy cluster by using the concepts of Fuzzy Rough sets. The intra-cluster similarity index of a fuzzy partition obtained from FCM is defined as the average intra-cluster similarity of all fuzzy clusters. Experimental results indicate that the proposed index can find the correct cluster number and is reliable in comparison with several well-known cluster validity indices.
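As a concrete reading of the objective function in Eq. (1) and of the two constraints on the partition matrix, the following minimal sketch evaluates $J_m(U, V)$ for a toy partition. It assumes the Euclidean norm (the text only requires "a certain distance function"); the function names `fcm_objective` and `is_fuzzy_partition` and the toy data are illustrative and do not come from the paper.

```python
from math import dist


def fcm_objective(points, centroids, memberships, m=2.0):
    """J_m(U, V) = sum_i sum_j u_ij^m * ||x_j - v_i||^2 (Eq. 1).

    memberships[i][j] is u_ij, the membership of point j in cluster i.
    The Euclidean norm is an assumption of this sketch.
    """
    total = 0.0
    for i, v in enumerate(centroids):
        for j, x in enumerate(points):
            total += memberships[i][j] ** m * dist(x, v) ** 2
    return total


def is_fuzzy_partition(memberships, tol=1e-9):
    """Check the constraints stated with Eq. 1:
    every column of U sums to 1, and every row sum lies strictly in (0, n)."""
    c = len(memberships)
    n = len(memberships[0])
    columns_ok = all(
        abs(sum(memberships[i][j] for i in range(c)) - 1.0) <= tol
        for j in range(n)
    )
    rows_ok = all(0.0 < sum(row) < n for row in memberships)
    return columns_ok and rows_ok


# Toy data (hypothetical): three points on a line, two centroids.
points = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0)]
centroids = [(0.0, 0.0), (4.0, 0.0)]
U = [[1.0, 0.5, 0.0],
     [0.0, 0.5, 1.0]]

objective = fcm_objective(points, centroids, U, m=2.0)
valid = is_fuzzy_partition(U)
```

For this toy partition only the middle point contributes: $0.5^2 \cdot 1 + 0.5^2 \cdot 9 = 2.5$.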
2 Basic Concepts
The concepts of Fuzzy Rough sets, which were proposed by Dubois and Prade [8,9], aim at extending classical Rough set theory [10,11] to fuzzy information systems. Let $U$ be a nonempty set of objects. A fuzzy binary relation $R$ on $U$ is called a $T$-similarity relation if $R$ satisfies: (1) Reflexivity: $R(x, x) = 1$, $\forall x \in U$; (2) Symmetry: $R(x, y) = R(y, x)$, $\forall x, y \in U$; and (3) $T$-transitivity: $R(x, z) \ge T(R(x, y), R(y, z))$, $\forall x, y, z \in U$, where $T$ is a t-norm.

Definition 1. Let $F$ be a fuzzy subset of $U$ and $R$ a $T$-similarity relation. The $R$-lower approximation and $R$-upper approximation of $F$, denoted by the two fuzzy sets $\underline{R}(F)$ and $\overline{R}(F)$ respectively, are defined as:

$$\underline{R}(F)(x) = \inf_{y \in U} \max\{1 - R(x, y), F(y)\}, \tag{2}$$

$$\overline{R}(F)(x) = \sup_{y \in U} \min\{R(x, y), F(y)\}. \tag{3}$$

The pair $(\underline{R}(F), \overline{R}(F))$ is called a Fuzzy Rough set. The above definitions were generalized in [12], where the $R$-lower approximation and $R$-upper approximation of $F$ are defined as:

$$\underline{R}(F)(x) = \inf_{y \in U} I_T(R(x, y), F(y)), \tag{4}$$

$$\overline{R}(F)(x) = \sup_{y \in U} T(R(x, y), F(y)), \tag{5}$$

where $I_T$ is the residuation implication of $T$, i.e. $I_T(a, b) = \sup\{c \in [0, 1] : T(a, c) \le b\}$ for every $a, b \in [0, 1]$.
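On a finite universe the infimum and supremum in Eqs. (2)–(5) reduce to a minimum and a maximum, so the approximations can be computed directly. The sketch below is only an illustration under that assumption: the relation `R`, the fuzzy set `F`, and all function names are made up for the example, and the Łukasiewicz t-norm is used as the instance of $T$ because its residuation implication has the simple closed form $\min\{1, 1 - a + b\}$.

```python
def t_lukasiewicz(a, b):
    """Lukasiewicz t-norm: T_L(a, b) = max(0, a + b - 1)."""
    return max(0.0, a + b - 1.0)


def i_lukasiewicz(a, b):
    """Residuation implication of T_L: min(1, 1 - a + b)."""
    return min(1.0, 1.0 - a + b)


def lower_approximation(R, F, implication):
    """Eq. (4) on a finite universe: min over y of I(R(x, y), F(y))."""
    n = len(F)
    return [min(implication(R[x][y], F[y]) for y in range(n)) for x in range(n)]


def upper_approximation(R, F, t_norm):
    """Eq. (5) on a finite universe: max over y of T(R(x, y), F(y))."""
    n = len(F)
    return [max(t_norm(R[x][y], F[y]) for y in range(n)) for x in range(n)]


# Hypothetical three-object universe: R is reflexive, symmetric,
# and T_L-transitive; F is an arbitrary fuzzy subset.
R = [[1.0, 0.8, 0.1],
     [0.8, 1.0, 0.2],
     [0.1, 0.2, 1.0]]
F = [0.9, 0.6, 0.1]

# Generalized approximations, Eqs. (4)-(5), with T = T_L.
lower = lower_approximation(R, F, i_lukasiewicz)
upper = upper_approximation(R, F, t_lukasiewicz)

# Eqs. (2)-(3) are obtained by passing max(1 - a, b) and min instead.
lower_dp = lower_approximation(R, F, lambda a, b: max(1.0 - a, b))
upper_dp = upper_approximation(R, F, min)
```

Because `R` is reflexive, each lower approximation stays below `F` and each upper approximation stays above it, which gives a quick sanity check on any implementation.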
In general, the distance between two data points can quantify their similarity: a longer distance indicates a lower degree of similarity, and vice versa. This intuition can be used to construct a fuzzy binary relation that reflects the whole structure of the given data set. Based on this relation, we can construct the lower approximations of $F_1, F_2, \ldots, F_c$ and use these approximations to estimate the quality of the corresponding fuzzy partition.
3 Proposed Intra-cluster Similarity Index
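Definition 2. Let $X = \{x_1, x_2, \ldots, x_n\} \subset \mathbb{R}^p$ be a given set of data points. A fuzzy binary relation $S$ on $X$ is defined as: $\forall x_i, x_j \in X$,

$$S(x_i, x_j) = 1 - \frac{\|x_i - x_j\|}{d_{\max}}, \tag{6}$$

where $d_{\max} = \max_{i,j}\{\|x_i - x_j\|\}$.

Proposition 1. $S$ is a $T_L$-similarity relation, where $T_L$ is the Łukasiewicz t-norm: $T_L(a, b) = \max\{0, a + b - 1\}$ for every $a, b \in [0, 1]$.

Reflexivity and symmetry are immediate from Eq. 6, and $T_L$-transitivity follows from the triangle inequality $\|x_i - x_k\| \le \|x_i - x_j\| + \|x_j - x_k\|$ after dividing by $d_{\max}$.

For the Łukasiewicz t-norm $T_L$, $I_{T_L}(a, b) = \min\{1, 1 - a + b\}$ for every $a, b \in [0, 1]$. Let $F_i$ be a fuzzy cluster of $X$. By Eq. 4, we have:

$$\underline{S}(F_i)(x_i) = \inf_{x_j \in X} \min\{1, 1 - S(x_i, x_j) + u_{ij}\}. \tag{7}$$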
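The relation of Definition 2 and the three properties claimed in Proposition 1 can be checked numerically on a small data set. This is a sketch under stated assumptions: the Euclidean norm is used, at least two distinct points are required so that $d_{\max} > 0$, and the names `similarity_relation` and `is_tl_similarity` as well as the sample points are hypothetical.

```python
from itertools import product
from math import dist


def similarity_relation(points):
    """S(x_i, x_j) = 1 - ||x_i - x_j|| / d_max (Eq. 6), Euclidean norm assumed.

    Requires at least two distinct points so that d_max > 0.
    """
    d = [[dist(p, q) for q in points] for p in points]
    d_max = max(max(row) for row in d)
    return [[1.0 - value / d_max for value in row] for row in d]


def is_tl_similarity(S, tol=1e-12):
    """Check reflexivity, symmetry and T_L-transitivity of a relation matrix."""
    n = len(S)
    reflexive = all(abs(S[i][i] - 1.0) <= tol for i in range(n))
    symmetric = all(
        abs(S[i][j] - S[j][i]) <= tol for i in range(n) for j in range(n)
    )
    transitive = all(
        S[i][k] >= max(0.0, S[i][j] + S[j][k] - 1.0) - tol
        for i, j, k in product(range(n), repeat=3)
    )
    return reflexive and symmetric and transitive


# Hypothetical sample: the largest pairwise distance is 5 (first to last point).
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 4.0)]
S = similarity_relation(points)
holds = is_tl_similarity(S)
```

The most distant pair receives similarity 0 and every point has similarity 1 with itself, so $S$ always spans the full range $[0, 1]$ on a given data set.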
$\underline{S}(F_i)(x_i)$ can be seen as the certainty degree of the event that a data point in $X$ belongs to the fuzzy cluster $F_i$ according to the similarity between this data point and $x_i$. Intuitively, since $S$ reflects the structure of the data set, we can estimate the intra-cluster similarity of each fuzzy cluster based on this concept.

Definition 3. Let $X = \{x_1, x_2, \ldots, x_n\} \subset \mathbb{R}^p$ be a given set of data points, and $F = \{F_1, F_2, \ldots, F_c\}$ a fuzzy partition of $X$. $\forall F_i \in F$, the intra-cluster similarity of $F_i$ is defined as:

$$IS(F_i) = \frac{1}{|\underline{S}(F_i)|} \sum_{x \in B_i} \underline{S}(F_i)(x), \tag{8}$$

where $B_i = \{x \in X \mid F_i(x) \ge l\}$ is the $l$-level set of $F_i$, $\frac{1}{c} \le l < 1$, and $|A| = \sum_{x_i \in X} A(x_i)$ is the cardinality of a fuzzy set $A$.

Generally speaking, $B_i$ contains the “important” data points of the fuzzy cluster $F_i$. $IS(F_i)$ is the ratio of the sum of the membership values of those “important” data points in the $S$-lower approximation of $F_i$ to the cardinality of the $S$-lower approximation of $F_i$.

Definition 4. Let $X = \{x_1, x_2, \ldots, x_n\} \subset \mathbb{R}^p$ be a given set of data points, and $F = \{F_1, F_2, \ldots, F_c\}$ a fuzzy partition of $X$. The intra-cluster similarity index (IS) of $F$ is defined as:

$$IS(F) = \frac{1}{c} \sum_{i=1}^{c} IS(F_i). \tag{9}$$
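Definitions 2 to 4 can be combined into a short computation of $IS(F)$ from the data points and a partition matrix. The sketch below is one possible reading, not a reference implementation: it assumes the Euclidean norm, evaluates the lower approximation of Eq. (7) at a point indexed by `k` (kept separate from the cluster index), defaults to the level $l = 1/c$, and returns 0 for a cluster whose lower approximation has zero cardinality, a convention the text does not specify. All function names and the toy data are hypothetical.

```python
from math import dist


def similarity_relation(points):
    """Eq. (6): S = 1 - distance / d_max (Euclidean norm assumed, d_max > 0)."""
    d = [[dist(p, q) for q in points] for p in points]
    d_max = max(max(row) for row in d)
    return [[1.0 - value / d_max for value in row] for row in d]


def lower_approximation(S, u_i):
    """Eq. (7): value at point k is min over j of min(1, 1 - S[k][j] + u_i[j])."""
    n = len(u_i)
    return [
        min(min(1.0, 1.0 - S[k][j] + u_i[j]) for j in range(n))
        for k in range(n)
    ]


def cluster_similarity(S, u_i, level):
    """Eq. (8): lower-approximation mass on the level set B_i over its total mass."""
    low = lower_approximation(S, u_i)
    cardinality = sum(low)
    if cardinality == 0.0:
        return 0.0  # assumed convention; not specified in the text
    in_level_set = sum(low[k] for k in range(len(u_i)) if u_i[k] >= level)
    return in_level_set / cardinality


def is_index(points, memberships, level=None):
    """Eq. (9): average of IS(F_i) over the c clusters."""
    c = len(memberships)
    if level is None:
        level = 1.0 / c  # smallest level allowed by Definition 3
    S = similarity_relation(points)
    return sum(cluster_similarity(S, row, level) for row in memberships) / c


# Toy 1-D data (hypothetical): two tight groups far apart.
points = [(0.0,), (1.0,), (9.0,), (10.0,)]
crisp = [[1.0, 1.0, 0.0, 0.0],
         [0.0, 0.0, 1.0, 1.0]]
fuzzy = [[0.9, 0.8, 0.2, 0.1],
         [0.1, 0.2, 0.8, 0.9]]

is_crisp = is_index(points, crisp)
is_fuzzy = is_index(points, fuzzy)
```

For the crisp partition the lower approximation vanishes outside each cluster, so the index reaches its maximum of 1; for the fuzzy partition the mass that leaks onto low-membership points lowers it to 0.85.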
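Table 1. Three existing validity indices for FCM

Index   Functional description
XB      $\dfrac{J_m(U, V)}{n \, \min_{i \ne j} \|v_i - v_j\|^2}$
FS      $J_m(U, V) - \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^{m} \, \|v_i - \bar{v}_1\|^2$
K       $\dfrac{J_m(U, V) + \frac{1}{c} \sum_{i=1}^{c} \|v_i - \bar{v}_2\|^2}{\min_{i \ne j} \|v_i - v_j\|^2}$

$\bar{v}_1 = \frac{1}{c} \sum_{i=1}^{c} v_i$, $\bar{v}_2 = \frac{1}{n} \sum_{i=1}^{n} x_i$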
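The three functional descriptions in Table 1 translate into a few lines each. The following sketch follows the formulas as they are read from the table, with the Euclidean norm assumed; the function names and the toy partition are illustrative only. All three indices are minimized when selecting $c$.

```python
from math import dist


def j_m(points, centroids, U, m):
    """FCM objective J_m(U, V) of Eq. (1), Euclidean norm assumed."""
    return sum(
        U[i][j] ** m * dist(x, v) ** 2
        for i, v in enumerate(centroids)
        for j, x in enumerate(points)
    )


def min_separation(centroids):
    """Smallest squared distance between two distinct centroids."""
    return min(
        dist(a, b) ** 2
        for idx, a in enumerate(centroids)
        for b in centroids[idx + 1:]
    )


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length tuples."""
    return tuple(sum(column) / len(vectors) for column in zip(*vectors))


def xie_beni(points, centroids, U, m):
    """XB row of Table 1."""
    return j_m(points, centroids, U, m) / (len(points) * min_separation(centroids))


def fukuyama_sugeno(points, centroids, U, m):
    """FS row of Table 1; the reference point is the mean of the centroids."""
    v_bar = mean_vector(centroids)
    spread = sum(
        U[i][j] ** m * dist(v, v_bar) ** 2
        for i, v in enumerate(centroids)
        for j in range(len(points))
    )
    return j_m(points, centroids, U, m) - spread


def kwon(points, centroids, U, m):
    """K row of Table 1; the reference point is the mean of the data."""
    x_bar = mean_vector(points)
    penalty = sum(dist(v, x_bar) ** 2 for v in centroids) / len(centroids)
    return (j_m(points, centroids, U, m) + penalty) / min_separation(centroids)


# Toy partition (hypothetical): three points on a line, two centroids.
points = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0)]
centroids = [(0.0, 0.0), (4.0, 0.0)]
U = [[1.0, 0.5, 0.0],
     [0.0, 0.5, 1.0]]

xb = xie_beni(points, centroids, U, 2.0)
fs = fukuyama_sugeno(points, centroids, U, 2.0)
k = kwon(points, centroids, U, 2.0)
```

The penalty term in `kwon` does not shrink as $c$ approaches $n$, which is how the K index counters the monotone decrease of $J_m$ noted in the Introduction.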
$IS(F)$ is the average intra-cluster similarity of all fuzzy clusters in the fuzzy partition $F$. A large value of $IS(F)$ indicates a good intra-cluster similarity of the fuzzy partition $F$.

In [13], two validity indices, $DB_r$ and $D_r$, are also defined in a rough-fuzzy framework. The two indices extend the traditional Davies-Bouldin index and Dunn index (see [14]), respectively, both of which are used for crisp clustering. The main differences between the two indices and $IS(F)$ are as follows. Firstly, $DB_r$ and $D_r$ are used for the Rough-Fuzzy c-means algorithm (a variation of the Rough c-means algorithm [15]), whereas $IS(F)$ is used for FCM. Since a crisp set can be viewed as a special case of a fuzzy set, $IS(F)$ can be used for crisp clustering as well. Secondly, $DB_r$ and $D_r$ use the distance between each data point and the corresponding cluster center to evaluate the intra-cluster similarity, whereas $IS(F)$ uses the concept of the lower approximation to do so. This concept can be interpreted based on Zadeh’s possibility theory [9]. Finally, two parameters, $w_{low}$ and $w_{up}$, must be assigned in $DB_r$ and $D_r$. These two parameters correspond to the relative importance of the lower and upper approximations respectively. In $IS(F)$, the threshold value $l$ must be assigned. But in general, deciding the value of $l$ is easier than deciding the values of $w_{low}$ and $w_{up}$.
4 Experimental Results
In order to evaluate the performance of the proposed index (IS), we applied IS and several well-known cluster validity indices, including the extended Xie-Beni index (XB) [2], the Fukuyama-Sugeno index (FS) [3] and the extended Kwon index (K) [5], to fuzzy partitions obtained from FCM for two data sets. The functional descriptions of these three indices are listed in Table 1. The first data set is a synthetic data set, which is shown in Fig. 1. It consists of five clusters with 10 data points per cluster. The second one is the IRIS data set from the UCI repository of machine learning databases [16], which describes different categories of irises with four features. There are three classes in this data set: Setosa, Versicolor and Virginica, with 50 samples per class. It is known that the two classes Versicolor and Virginica overlap substantially, while the class Setosa is linearly separable from the other two. Thus, the most suitable cluster number is two or three. For these data sets, we ran FCM for different values of c (c = 2–9). For a particular c and data set, FCM started from the same initial partition and
Fig. 1. Synthetic data set (scatter plot; both axes range from 0 to 5)
Table 2. Preferable values of c for the Synthetic data set chosen by each index (c = 2–9)

m     1.5  1.6  1.7  1.8  1.9  2.0  2.1  2.2  2.3  2.4  2.5
XB    5    5    5    5    5    5    5    5    5    5    5
FS    9    9    7    5    5    5    5    5    5    5    4
K     5    5    5    5    5    5    5    5    4    4    4
IS    5    5    5    5    5    5    5    5    5    5    5
Table 3. Preferable values of c for the IRIS data set chosen by each index (c = 2–9)

m     1.5  1.6  1.7  1.8  1.9  2.0  2.1  2.2  2.3  2.4  2.5
XB    2    2    2    2    2    2    2    2    2    2    2
FS    9    4    5    5    5    5    5    5    5    5    5
K     2    2    2    2    2    2    2    2    2    2    2
IS    2    2    2    2    2    2    2    2    2    2    2
Table 4. Values of the four indices on the Synthetic data set for c = 2–9

(a) m = 1.5
c     2         3         4         5         6         7         8         9
XB    0.362     0.161     0.079     0.045     0.528     0.466     0.498     0.392
FS    55.647    -79.486   -192.881  -231.212  -232.163  -235.497  -231.379  -246.722
K     18.366    8.388     4.508     3.029     36.637    32.527    34.406    29.577
IS    0.864     0.817     0.894     0.957     0.874     0.833     0.807     0.824

(b) m = 1.7
c     2         3         4         5         6         7         8         9
XB    0.371     0.152     0.063     0.043     0.365     0.436     0.461     0.368
FS    63.161    -67.492   -193.353  -226.713  -225.784  -226.840  -221.323  -221.530
K     18.809    7.954     3.685     2.914     24.984    31.159    32.741    26.066
IS    0.793     0.748     0.832     0.893     0.814     0.758     0.713     0.671

(c) m = 2.0
c     2         3         4         5         6         7         8         9
XB    0.337     0.124     0.047     0.037     0.342     0.353     0.335     0.289
FS    59.517    -56.816   -182.982  -200.951  -193.478  -188.133  -179.649  -175.988
K     17.101    6.563     2.844     2.625     24.817    27.196    25.975    24.622
IS    0.709     0.670     0.750     0.753     0.686     0.625     0.609     0.593

(d) m = 2.5
c     2         3         4         5         6         7         8         9
XB    0.275     0.087     0.028     0.024     0.230     0.202     0.164     0.168
FS    49.525    -39.217   -136.591  -136.372  -122.279  -109.660  -102.443  -104.169
K     14.012    4.716     1.937     2.011     20.219    18.412    16.733    20.062
IS    0.619     0.609     0.611     0.643     0.564     0.523     0.500     0.466
Table 5. Values of the four indices on the IRIS data set for c = 2–9

(a) m = 1.5
c     2         3         4         5         6         7         8         9
XB    0.062     0.156     0.183     0.507     0.228     0.345     0.548     0.392
FS    -431.455  -515.259  -568.406  -533.600  -554.874  -559.269  -546.903  -577.570
K     9.553     24.812    29.390    83.792    39.040    59.150    96.007    67.790
IS    0.959     0.893     0.840     0.801     0.773     0.737     0.722     0.737

(b) m = 1.7
c     2         3         4         5         6         7         8         9
XB    0.059     0.150     0.174     0.262     0.211     0.314     0.307     0.338
FS    -424.416  -496.773  -542.094  -628.713  -497.527  -498.966  -509.351  -510.541
K     9.153     23.899    28.183    43.251    36.547    54.866    54.274    59.708
IS    0.927     0.846     0.752     0.679     0.658     0.664     0.651     0.629

(c) m = 2.0
c     2         3         4         5         6         7         8         9
XB    0.054     0.137     0.159     0.228     0.175     0.537     0.254     0.308
FS    -401.801  -450.495  -476.000  -544.972  -389.115  -344.699  -389.922  -384.021
K     8.376     21.955    26.020    38.241    31.386    99.514    47.002    57.666
IS    0.877     0.777     0.663     0.578     0.577     0.549     0.568     0.556

(d) m = 2.5
c     2         3         4         5         6         7         8         9
XB    0.044     0.108     0.124     0.162     0.114     0.433     0.197     0.288
FS    -341.890  -344.731  -332.390  -348.558  -213.866  -162.868  -201.132  -149.523
K     6.865     17.709    21.073    28.637    22.871    93.035    42.886    69.877
IS    0.824     0.693     0.585     0.512     0.525     0.554     0.524     0.537
ran for different values of m (m = 1.5–2.5). After each fuzzy partition was obtained, the four indices were computed. In the computation of IS, $l = 1/c$. The results are shown in Tables 2–5. As indicated in Tables 2 and 3, only IS and XB recognize the correct cluster numbers of the two data sets for all values of m. Furthermore, a cluster validity index is considered reliable when it is insensitive to changes in m [4,6]. From this point of view, IS provides more reliable results than the other indices, as shown in Tables 4 and 5, where the optimal value of each index is the minimum for XB, FS and K and the maximum for IS. Thus we can conclude that the proposed index provides the best cluster number estimation for both test data sets.
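The step from index values to a preferred cluster number is a plain arg-min or arg-max over c. The sketch below replays it on the values reported in Table 4(a) (Synthetic data set, m = 1.5); the data are copied from that table, while the names `TABLE_4A`, `MAXIMIZED`, and `preferred_c` are hypothetical helpers introduced for illustration.

```python
C_VALUES = [2, 3, 4, 5, 6, 7, 8, 9]

# Index values from Table 4(a): Synthetic data set, m = 1.5.
TABLE_4A = {
    "XB": [0.362, 0.161, 0.079, 0.045, 0.528, 0.466, 0.498, 0.392],
    "FS": [55.647, -79.486, -192.881, -231.212,
           -232.163, -235.497, -231.379, -246.722],
    "K": [18.366, 8.388, 4.508, 3.029, 36.637, 32.527, 34.406, 29.577],
    "IS": [0.864, 0.817, 0.894, 0.957, 0.874, 0.833, 0.807, 0.824],
}

# IS is maximized; XB, FS and K are minimized.
MAXIMIZED = {"IS"}


def preferred_c(index_name, values, c_values=C_VALUES):
    """Return the c at which the index attains its optimal value."""
    pick = max if index_name in MAXIMIZED else min
    best_position = pick(range(len(values)), key=values.__getitem__)
    return c_values[best_position]


choices = {name: preferred_c(name, values) for name, values in TABLE_4A.items()}
```

The resulting choices (5 for XB, K and IS, and 9 for FS) agree with the m = 1.5 column of Table 2.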
5 Conclusions
The fuzzy c-means (FCM) algorithm is an effective tool for cluster analysis. In FCM, if the cluster number c is not known a priori, a validity index must be used to find the optimal number of clusters. By using the concepts of Fuzzy Rough sets, this paper presents a new intra-cluster similarity index to assess the intra-cluster similarity of fuzzy partitions obtained from FCM. Experimental results show that, compared with several well-known cluster validity indices, the proposed index yields the correct cluster number and is reliable.

In future work, we plan to carry out extensive experiments and theoretical analysis to firmly establish the utility of the proposed index. We also plan to apply the basic ideas described in this paper to the evaluation of the inter-cluster proximity as well as to the cluster validity analysis for crisp clustering algorithms.
Acknowledgement. This work was supported by an information distribution project under grant No. 9140A06060106DZ223, the Program for New Century Excellent Talents in University (NCET-06-0811), and the Young Foundation of UESTC, grant No. L080106015X0748.
References

1. Bezdek, J.C.: Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum, New York (1981)
2. Xie, X.L., Beni, G.: A validity measure for fuzzy clustering. IEEE Trans. Pattern Anal. Mach. Intell. 13(8), 841–847 (1991)
3. Fukuyama, Y., Sugeno, M.: A new method of choosing the number of clusters for the fuzzy c-mean method. In: 5th Fuzzy Systems Symposium, Japan, pp. 247–250 (1989)
4. Pal, N.R., Bezdek, J.C.: On cluster validity for the fuzzy c-means model. IEEE Trans. Fuzzy Syst. 3(3), 370–379 (1995)
5. Kwon, S.H.: Cluster validity index for fuzzy clustering. Electron. Lett. 34(22), 2176–2177 (1998)
6. Kim, D., Lee, K.H., Lee, D.: On cluster validity index for estimation of the optimal number of fuzzy clusters. Pattern Recognition 37(10), 2561–2574 (2004)
7. Kim, Y., Kim, D., et al.: A cluster validation index for GK cluster analysis based on relative degree of sharing. Information Sciences 168(1–4), 225–242 (2004)
8. Dubois, D., Prade, H.: Rough fuzzy sets and fuzzy rough sets. Internat. J. General Systems 17(2–3), 191–209 (1990)
9. Dubois, D., Prade, H.: Putting rough sets and fuzzy sets together. In: Slowinski, R. (ed.) Intelligent Decision Support: Handbook of Applications and Advances of the Rough Sets Theory, pp. 203–222. Kluwer, The Netherlands (1992)
10. Pawlak, Z.: Rough sets. International J. Comp. Inform. Science 11, 341–356 (1982)
11. Pawlak, Z.: Some Issues on Rough Sets. In: Peters, J.F., Skowron, A., Grzymała-Busse, J.W., Kostek, B., Świniarski, R.W., Szczuka, M. (eds.) Transactions on Rough Sets I. LNCS, vol. 3100, pp. 1–58. Springer, Heidelberg (2004)
12. Morsi, N.N., Yakout, M.M.: Axiomatics for fuzzy rough set. Fuzzy Sets Syst. 100(1–3), 327–342 (1998)
13. Mitra, S., Banka, H., Pedrycz, W.: Rough-fuzzy collaborative clustering. IEEE Trans. Syst., Man, Cybern. B, Cybern. 36(4), 795–805 (2006)
14. Bezdek, J.C., Pal, N.R.: Some New Indexes of Cluster Validity. IEEE Trans. Syst., Man, Cybern. B, Cybern. 28(3), 301–315 (1998)
15. Lingras, P., Yan, R., West, C.: Comparison of conventional and rough k-means clustering. In: Wang, G., et al. (eds.) RSFDGrC 2003. LNCS (LNAI), vol. 2639, pp. 130–137. Springer, Heidelberg (2003)
16. UCI Repository of machine learning databases, http://www.ics.uci.edu/~mlearn/MLRepository.html