WCCI 2012 IEEE World Congress on Computational Intelligence, June 10-15, 2012, Brisbane, Australia

FUZZ-IEEE

Cluster Validity For Kernel Fuzzy Clustering

Timothy C. Havens

James C. Bezdek, Marimuthu Palaniswami

Department of Computer Science and Engineering, Michigan State University, East Lansing, MI 48824 USA. Email: [email protected]

Department of Electrical and Electronic Engineering, University of Melbourne, Parkville, Victoria 3010, Australia. Email: [email protected], [email protected]

Abstract—This paper presents cluster validity for kernel fuzzy clustering. First, we describe existing cluster validity indices that can be directly applied to partitions obtained by kernel fuzzy clustering algorithms. Second, we show how validity indices that take dissimilarity (or relational) data D as input can be applied to kernel fuzzy clustering. Third, we present four propositions that allow other existing cluster validity indices to be adapted to kernel fuzzy partitions. As an example of how these propositions are used, five well-known indices are formulated. We demonstrate several indices for kernel fuzzy c-means (kFCM) partitions of both synthetic and real data.

Index Terms—cluster validity, kernel clustering, fuzzy clustering.

I. INTRODUCTION

Clustering or cluster analysis is a form of exploratory data analysis in which data are separated into groups or subsets such that the objects in each group share some similarity. Clustering has been used as a pre-processing step to separate data into manageable parts [1], as a knowledge discovery tool [2], for indexing and compression [3], etc., and there are many good books that describe its various uses [4-8]. The most popular use for clustering is to assign labels to unlabeled data—data for which no pre-existing grouping is known. Any field that uses or analyzes data can utilize clustering; the problem domains and applications of clustering are innumerable.

Finding good clusters involves more than just separating objects into groups; three major problems comprise clustering, and all three are equally important for most applications. Tendency: are there clusters in the data and, if so, how many? Partitioning: which objects should be grouped together (and to what degree)? Validity: which partition is best?

The main contribution of this paper is to provide methods for assessing cluster validity for partitions obtained from kernelized fuzzy clustering algorithms, such as kernel fuzzy c-means (kFCM). We first discuss existing indices that can be directly applied with no modification. Then we show how to apply indices that take dissimilarity data D as input. We then prove four propositions which allow other indices to be adapted to kernel fuzzy partitions and demonstrate the use of these propositions by adapting five well-known indices, including the popular Fukuyama-Sugeno and Xie-Beni indices. Because we are specifically addressing fuzzy clustering in kernel spaces, we will use the kFCM algorithm for partitioning. Note, however, that our validity methods would work with partitions produced by any kernel fuzzy (and crisp, for that matter) clustering algorithm.


Section II presents the necessary background and describes related work. The kernel validity indices are proposed in Section III. We present empirical results in Section IV, and Section V provides a short summary and discusses future work.

II. BACKGROUND AND RELATED WORK

Consider a set of n objects O = {o_1, ..., o_n}, e.g., patients at a hospital, bass players in afro-Cuban bands, or wireless sensor network nodes. Each object is typically represented by numerical object or feature-vector data of the form X = {x_1, ..., x_n} ⊂ R^d, where the coordinates of x_i provide feature values (e.g., weight, length, insurance payment, etc.) describing object o_i. A c-partition of the objects is defined as a set of nc values {u_ij}, where each value represents the degree to which object o_i is in the jth cluster. The c-partition is often represented as an n × c matrix U = [u_ij]. There are three main types of partitions: crisp, fuzzy (or probabilistic), and possibilistic [9, 10]. Crisp partitions of the unlabeled objects are non-empty, mutually-disjoint subsets of O such that the union of the subsets equals O. The set of all non-degenerate (no zero columns) crisp c-partition matrices for the object set O is

M_{hcn} = \{ U \in \mathbb{R}^{n \times c} \mid u_{ij} \in \{0,1\}\ \forall j,i;\ 0 < \sum_{i=1}^{n} u_{ij} < n\ \forall j;\ \sum_{j=1}^{c} u_{ij} = 1\ \forall i \},   (1)

where u_ij is the membership of object o_i in cluster j; the partition element u_ij = 1 if o_i is labeled j and is 0 otherwise. When the columns of U are considered as vectors in R^n, we denote the jth column as u_j.

Fuzzy (or probabilistic) partitions are more flexible than crisp partitions in that each object can have membership in more than one cluster. Note that if U is probabilistic, say U = P = [p_ik], then p_ik is interpreted as the posterior probability p(k|o_i) that o_i is in the kth class. Since this paper focuses on fuzzy partitions, we do not specifically address this difference. However, we stress that most, if not all, of the indices described here can be directly applied to probabilistic c-partitions produced by the Gaussian Mixture Model / Expectation Maximization (GMM/EM) algorithm, which is the most popular way of finding probabilistic clusters. The set of all fuzzy c-partitions is

M_{fcn} = \{ U \in \mathbb{R}^{n \times c} \mid u_{ij} \in [0,1]\ \forall j,i;\ 0 < \sum_{i=1}^{n} u_{ij} < n\ \forall j;\ \sum_{j=1}^{c} u_{ij} = 1\ \forall i \}.   (2)

Each row of the fuzzy partition U must sum to 1, thus ensuring that every object has unit total cluster membership (\sum_j u_{ij} = 1).
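The constraints in (1) and (2) are straightforward to verify in code. The following is a minimal sketch, assuming NumPy; the function name is ours, not from the paper.

```python
import numpy as np

def is_fuzzy_partition(U, tol=1e-9):
    """Check the M_fcn constraints of Eq. (2) for an n x c matrix U."""
    U = np.asarray(U, dtype=float)
    n, c = U.shape
    in_unit_interval = np.all((U >= -tol) & (U <= 1 + tol))        # u_ij in [0, 1]
    rows_sum_to_one = np.allclose(U.sum(axis=1), 1.0)              # each object has total membership 1
    col_sums = U.sum(axis=0)
    no_degenerate_clusters = np.all((col_sums > tol) & (col_sums < n - tol))  # 0 < sum_i u_ij < n
    return bool(in_unit_interval and rows_sum_to_one and no_degenerate_clusters)

# Example: a valid fuzzy 2-partition of 3 objects
U = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])
print(is_fuzzy_partition(U))  # True
```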

A. Fuzzy c-means (FCM)

One of the most popular methods for finding fuzzy partitions is FCM [9]. The FCM algorithm is generally defined as the constrained optimization of the squared-error distortion

J_m(U, V) = \sum_{j=1}^{c} \sum_{i=1}^{n} u_{ij}^m \|x_i - v_j\|_A^2,   (3)

where U is an (n × c) partition matrix, V = {v_1, ..., v_c} is a set of c cluster centers in R^d, m > 1 is the fuzzification constant, and ||·||_A is any inner-product A-induced norm, i.e., \|x\|_A = \sqrt{x^T A x}. Typically, the Euclidean norm (A = I) is used, but there are many examples where the use of another norm-inducing matrix has been shown to be effective, e.g., using A = S^{-1}, the inverse of the sample covariance matrix. Only the Euclidean norm is used in this paper; we ease the notation by dropping the subscript (A = I) in what follows.

The FCM/AO algorithm approximates solutions to (3) using alternating optimization (AO) [11]. Other approaches to optimizing the FCM model include genetic algorithms, particle swarm optimization, etc. The FCM/AO approach is by far the most common and is the only algorithm used here; we further ease the notation by dropping the "/AO". Algorithm 1 outlines the steps of FCM/AO. There are many ways to initialize FCM; we choose c objects randomly from the data set itself to serve as the initial cluster centers, which seems to work well in almost all cases, but any initialization method that adequately covers the object space and does not produce identical initial centers would work. The alternating steps of FCM in Eqs. (4) and (5) are iterated until the algorithm terminates; termination is declared when there are only negligible changes in the cluster center locations, more explicitly, when max_{1 \le j \le c} \|v_{j,new} - v_{j,old}\|^2 \le \varepsilon, where \varepsilon is a pre-determined constant.

Algorithm 1: FCM/AO
Input: X, c, m, \varepsilon
Output: U, V
Initialize V
while max_{1 \le k \le c} \|v_{k,new} - v_{k,old}\|^2 > \varepsilon do
    u_{ij} = \left[ \sum_{k=1}^{c} \left( \frac{\|x_i - v_j\|}{\|x_i - v_k\|} \right)^{\frac{2}{m-1}} \right]^{-1}, \forall i, j   (4)
    v_j = \frac{\sum_{i=1}^{n} (u_{ij})^m x_i}{\sum_{i=1}^{n} (u_{ij})^m}, \forall j   (5)
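Algorithm 1 maps almost line-for-line onto array operations. The sketch below is our own illustration of the alternating updates (4) and (5) in NumPy, not the authors' implementation; the random-object initialization mirrors the strategy described above.

```python
import numpy as np

def fcm(X, c, m=2.0, eps=1e-3, max_iter=300, rng=None):
    """FCM/AO sketch: X is (n, d); returns U (n, c) and V (c, d)."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    V = X[rng.choice(n, size=c, replace=False)].copy()   # initial centers drawn from the data
    for _ in range(max_iter):
        V_old = V.copy()
        # Eq. (4): u_ij = [ sum_k (||x_i - v_j|| / ||x_i - v_k||)^(2/(m-1)) ]^(-1)
        dist = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2)   # (n, c)
        dist = np.fmax(dist, 1e-12)                                    # guard against divide-by-zero
        ratio = (dist[:, :, None] / dist[:, None, :]) ** (2.0 / (m - 1.0))
        U = 1.0 / ratio.sum(axis=2)                                    # rows sum to 1
        # Eq. (5): v_j = sum_i u_ij^m x_i / sum_i u_ij^m
        Um = U ** m
        V = (Um.T @ X) / Um.sum(axis=0)[:, None]
        if np.max(np.linalg.norm(V - V_old, axis=1) ** 2) <= eps:      # termination test
            break
    return U, V
```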

B. Kernel fuzzy c-means

Now consider a non-linear mapping function \phi: x \to \phi(x) \in \mathbb{R}^{d_K}, where d_K is the dimensionality of the transformed feature vector \phi(x). With kernel clustering, we do not explicitly transform x; we simply represent the dot product \phi(x_i) \cdot \phi(x_l) = \kappa(x_i, x_l) by a kernel function \kappa. The kernel function \kappa can take many forms, with the polynomial \kappa(x_i, x_l) = (x_i^T x_l + 1)^p and the radial-basis-function (RBF) \kappa(x_i, x_l) = \exp(-\sigma \|x_i - x_l\|^2) being two of the most popularly used. Given a set of n feature vectors X, we construct an n × n kernel matrix K = [K_{ij} = \kappa(x_i, x_j)]_{n \times n}. The kernel matrix K represents all pairwise dot products of the feature vectors in the transformed high-dimensional space—the Reproducing Kernel Hilbert Space (RKHS).

Given a kernel function \kappa, kernel FCM (kFCM) can be generally defined as the constrained minimization of

J_m(U) = \sum_{j=1}^{c} \sum_{i=1}^{n} u_{ij}^m \|\phi(x_i) - \phi(v_j)\|^2,   (6)

where, like FCM, U \in M_{fcn} and m is the fuzzification parameter. kFCM approximately solves the optimization problem in (6) by computing iterated updates of

d_\kappa(x_i, v_j) = \|\phi(x_i) - \phi(v_j)\|^2,   (7)

u_{ij} = \left[ \sum_{k=1}^{c} \left( \frac{d_\kappa(x_i, v_j)}{d_\kappa(x_i, v_k)} \right)^{\frac{1}{m-1}} \right]^{-1}, \forall i, j,   (8)

where d_\kappa(x_i, v_j) is the kernel distance between input datum x_i and cluster center v_j. Like FCM, the cluster centers are linear combinations of the feature vectors,

\phi(v_j) = \frac{\sum_{l=1}^{n} u_{lj}^m \phi(x_l)}{\sum_{l=1}^{n} u_{lj}^m}.   (9)

Equation (7) cannot be computed directly, but by using the identity K_{ij} = \kappa(x_i, x_j) = \phi(x_i) \cdot \phi(x_j), denoting \tilde{u}_j = u_j^m / \sum_i u_{ij}^m where u_j^m = (u_{1j}^m, u_{2j}^m, \ldots, u_{nj}^m)^T, and substituting (9) into (7), we get

d_\kappa(x_i, v_j) = \frac{\sum_{l=1}^{n}\sum_{s=1}^{n} u_{lj}^m u_{sj}^m \, \phi(x_l)\cdot\phi(x_s)}{\left(\sum_{l=1}^{n} u_{lj}^m\right)^2} + \phi(x_i)\cdot\phi(x_i) - 2\,\frac{\sum_{l=1}^{n} u_{lj}^m \, \phi(x_l)\cdot\phi(x_i)}{\sum_{l=1}^{n} u_{lj}^m}
                   = \tilde{u}_j^T K \tilde{u}_j + e_i^T K e_i - 2\,\tilde{u}_j^T K e_i
                   = \tilde{u}_j^T K \tilde{u}_j + K_{ii} - 2\,(\tilde{u}_j^T K)_i,   (10)

where e_i is the n-length unit vector with the ith element equal to 1. This formulation of kFCM is equivalent to that proposed in [12] and is identical to relational FCM if the standard dot-product kernel \kappa(x_i, x_l) = \langle x_i, x_l \rangle is used [13].
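Because everything runs through K, a kFCM implementation never needs \phi(x) explicitly. The following is a minimal sketch, assuming NumPy, of the update loop built on Eqs. (8) and (10); the RBF kernel builder is included only as one possible choice of \kappa, and all function names are ours, not from the paper.

```python
import numpy as np

def rbf_kernel(X, sigma=1.0):
    """K_ij = exp(-sigma * ||x_i - x_j||^2), one common kernel choice."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-sigma * np.clip(d2, 0.0, None))

def kfcm(K, c, m=2.0, eps=1e-3, max_iter=300, rng=None):
    """Kernel FCM sketch: K is (n, n); only U is returned since the centers live in the RKHS."""
    rng = np.random.default_rng(rng)
    n = K.shape[0]
    U = rng.dirichlet(np.ones(c), size=n)            # random fuzzy partition to start
    for _ in range(max_iter):
        U_old = U.copy()
        Um = U ** m
        Ut = Um / Um.sum(axis=0, keepdims=True)      # columns are tilde-u_j = u_j^m / sum_i u_ij^m
        # Eq. (10): d_k(x_i, v_j) = u~_j^T K u~_j + K_ii - 2 (u~_j^T K)_i
        quad = np.einsum('ij,ik,kj->j', Ut, K, Ut)   # (c,) values u~_j^T K u~_j
        cross = K @ Ut                               # (n, c) values (K u~_j)_i
        d = quad[None, :] + np.diag(K)[:, None] - 2.0 * cross
        d = np.fmax(d, 1e-12)
        # Eq. (8): u_ij = [ sum_k (d_ij / d_ik)^(1/(m-1)) ]^(-1)
        ratio = (d[:, :, None] / d[:, None, :]) ** (1.0 / (m - 1.0))
        U = 1.0 / ratio.sum(axis=2)
        if np.max(np.abs(U - U_old)) < eps:          # same termination rule as used in Section IV
            break
    return U
```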

C. Some cluster validity indices



Cluster validity indices all attempt to answer the question: which partition, among a set of candidate partitions found by clustering algorithm(s), is best for this data set? Choosing the best partition from several candidates implicitly identifies the appropriate c from a set of different c values (say c = 2, 3, ..., 10). In this paper, our experiments focus on choosing the appropriate c, which is arguably the most popular use of validity indices.

Validity indices fall into two major categories: internal and external. Internal indices take as input only the data and the results of the clustering algorithm (typically the partition U and/or the cluster centers V). External indices compare the partition against some external criteria, e.g., known class labels or must-link / cannot-link constraints. References [14-19] describe studies of fuzzy validity indices that use both theoretical analysis and empirical results for comparison.

We now outline several validity indices, which are described in detail in [18].^1 Indices with the superscript (+) indicate the preferred partition by their maximum value, while those with the superscript (-) indicate the preferred partition by their minimum. Some of these indices use the minimized value of J_m(U, V) at Eq. (3). Note that J_2(U, V) is J_m(U, V) for m = 2.

^1 Note that, because we are page-limited, we do not include the original references for all the indices here. However, reference [18] is a comprehensive source for finding the original references. We do indicate the inventor of each index in parentheses, where appropriate.

1) Internal indices based on only U:

• Partition coefficient (Bezdek):

V_{PC}^{(+)} = \frac{1}{n} \sum_{j=1}^{c} \sum_{i=1}^{n} u_{ij}^2   (11)

• Partition entropy (Bezdek):

V_{PE}^{(-)} = -\frac{1}{n} \sum_{j=1}^{c} \sum_{i=1}^{n} u_{ij} \log u_{ij}   (12)

• Modified partition coefficient (Dave):

V_{MPC}^{(+)} = 1 - \frac{c}{c-1}\left(1 - V_{PC}^{(+)}\right)   (13)

• Kim:

V_{KI}^{(-)} = \frac{2}{c(c-1)} \sum_{j=1}^{c-1} \sum_{k=j+1}^{c} \sum_{i=1}^{n} h_i \cdot (u_{ij} \wedge u_{ik}),   (14)

h_i = -\sum_{l=1}^{c} u_{il} \log u_{il}

2) Indices based on (U, V) and X:

• Fukuyama and Sugeno:

V_{FS}^{(-)} = J_m(U, V) - \sum_{j=1}^{c} \sum_{i=1}^{n} u_{ij}^m \|v_j - \bar{v}\|^2,   (15)

\bar{v} = \frac{1}{c} \sum_{j=1}^{c} v_j

• Generalized Xie-Beni (Xie, Beni, Pal, and Bezdek):

V_{XB}^{(-)} = \frac{J_m(U, V)}{n \, \min_{j \neq k}\{\|v_j - v_k\|^2\}}   (16)

• Kwon:

V_{K}^{(-)} = \frac{J_2(U, V) + \frac{1}{c}\sum_{j=1}^{c} \|v_j - \bar{x}\|^2}{\min_{j \neq k}\{\|v_j - v_k\|^2\}},   (17)

\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i   (18)

• Tan-Sun:

V_{TS}^{(-)} = \frac{J_2(U, V) + \frac{1}{c(c-1)}\sum_{j=1}^{c}\sum_{k=1}^{c} \|v_j - v_k\|^2}{\min_{j \neq k}\{\|v_j - v_k\|^2\} + 1/c}   (19)



• PCAES (Wu and Yang):

V_{PCAES}^{(+)} = \sum_{j=1}^{c} \sum_{i=1}^{n} u_{ij}^2 / u_M - \sum_{j=1}^{c} \exp\left( -\min_{k \neq j}\{\|v_j - v_k\|^2\} / \beta_T \right),   (20)

u_M = \min_{j} \left\{ \sum_{i=1}^{n} u_{ij}^2 \right\}, \quad \beta_T = \frac{1}{c} \sum_{j=1}^{c} \|v_j - \bar{x}\|^2,

where \bar{x} is calculated by (18).

3) External indices: External indices compare the resulting partition against "ground-truth" labels or some other external criteria, typically class labels. While these indices can be very useful for measuring the match between partitions and the external criteria, say for the purposes of comparing clustering algorithms or eliciting the behavior of a clustering algorithm, we dissuade their use for cluster validity. External indices are naturally biased towards ground truth; hence, they can miss the "true" grouping of the data, which may not match the class labels. In Section IV, we show the values for the generalized Rand index V_Rand proposed in [20], which compares a fuzzy partition U against a reference partition U_ref. For our experiments, we set the reference partition to the crisp partition that represents the known class labels.

4) Other indices: Other indices compare partitions against dissimilarity values, produce visualizations which suggest the number of clusters, etc. In this paper, we present values for the Correlation Cluster Validity (CCV) family of indices, specifically CCVp and CCVs [21]. The CCV indices first induce a partition dissimilarity

D_U = [1]_n - \frac{U U^T}{\max_{ij}\{(U U^T)_{ij}\}},   (21)

where [1]_n is the n × n matrix where each element is 1.^2 The partition dissimilarity is then compared against the data dissimilarity matrix D, where D_{il} = \|x_i - x_l\|. Two statistical correlation measures are used for this comparison.

^2 In [21] the authors represent the partition matrix as a c × n matrix, while in this paper U is n × c.



• Pearson (CCVp):

V_{CCVp}^{(+)} = \frac{\langle A, B \rangle^2}{\|A\|^2 \|B\|^2},   (22)

where A_{ij} = D_{ij} - \bar{D}_{ij} and B_{ij} = [D_U]_{ij} - [\bar{D}_U]_{ij}. The matrices \bar{D} and \bar{D}_U are n × n matrices in which every entry is the average value of D and D_U, respectively.

• Spearman (CCVs):

V_{CCVs}^{(+)} = \frac{\langle r, r^* \rangle^2}{\|r\|^2 \|r^*\|^2},   (23)

where r_k is the rank of the kth element of the n(n-1)/2 off-diagonal upper-triangular values of D and r_k^* is the rank of the kth element of the corresponding values in D_U.

The CCV indices are unique in that they work not only with vector data X, but also with pure relational data D (where the data we start with is D and we do not have access to X). Hence, CCV falls into a category of indices for relational data. Another form of validity that we do not focus on here, due to space limitations, is visualization. A popular visual cluster validity method is the VCV method proposed in [22, 23]. The VCV indices could easily be applied to kernel clustering by the method we will show for the CCV indices in Section III-A.

III. KERNEL CLUSTER VALIDITY

Kernel clustering algorithms take a kernel matrix K as an additional input and produce at least a partition U. Hence, any validity index that only takes U as input, such as V_PC and V_PE, can be used directly with kernel-based algorithms (or, for that matter, any clustering algorithm that produces a partition matrix). We will show results of these types of indices in Section IV. The matrix K provides an additional source of information about the input data, since it indirectly represents a transformation of X into a distance matrix D. In view of this, it may be that incorporating K into validity indices improves their utility for assessing cluster validity. However, indices that take U and X (which would be \phi(X) for kernel clustering) as input are not directly applicable to kernel clustering because we do not have access to the high-dimensional projection \phi(X). In Section III-B, we prove four propositions that show how these types of indices can be adapted for use with kernel clustering outputs. First, we discuss an existing kernel clustering validity index and then adapt the CCV indices with a simple transformation.

To our knowledge, the only cluster validity index specifically designed (to date) for kernel clustering is the PK index proposed in [24], which is very similar to the CCV method. The PK index compares a proximity matrix P to the kernel matrix K, with the idea that if P and K have similar structure then the partition is good. The proximity matrix is computed by

P_{il} = \sum_{j=1}^{c} \min\{u_{ij}, u_{lj}\}, \quad i, l = 1, \ldots, n,   (24)

and the validity index is computed as

V_{PK}^{(+)} = \frac{\langle K, P \rangle_F}{\sqrt{\langle K, K \rangle_F \langle P, P \rangle_F}},   (25)

where \langle \cdot, \cdot \rangle_F indicates the Frobenius inner product. Notice the similarity between (25) and the CCV equations, (22) and (23). There is, however, a serious drawback to the PK validity index: it compares P directly to the kernel matrix K. This operation is fine for kernel matrices that have a constant diagonal, like the RBF kernel, but the index fails for kernels that do not have a constant diagonal, like the dot-product and polynomial kernels. For these kernels, the Frobenius inner product in the numerator of (25) can be dominated by sub-blocks of K that have large diagonal (and subsequently large off-diagonal) values. For this reason, we do not recommend V_PK as a validity index. Instead, we recommend the adaptation of the CCV indices, described next.

A. Adapting indices that take relational data D as input

It is well known and easily proven that the Euclidean distance between two kernel representations of vectors can be computed by

\|\phi(x_i) - \phi(x_l)\|^2 = K_{ii} + K_{ll} - 2 K_{il}.   (26)

Hence, any validity index that takes D = [D_{il} = \|\phi(x_i) - \phi(x_l)\|] as input, such as CCV, can be adapted to kernel clustering partitions by computing

D = \left[ D_{il} = \sqrt{K_{ii} + K_{ll} - 2 K_{il}} \right]   (27)

and then applying CCV directly to D (and D_U).
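As a concrete illustration of Section III-A, the sketch below (ours, assuming NumPy; variable and function names are not from the paper) builds D from K via (27), builds the partition dissimilarity D_U via (21), and scores the pair with the Pearson form of CCV in (22).

```python
import numpy as np

def kernel_distance_matrix(K):
    """D_il = sqrt(K_ii + K_ll - 2 K_il), Eq. (27)."""
    diag = np.diag(K)
    d2 = diag[:, None] + diag[None, :] - 2.0 * K
    return np.sqrt(np.clip(d2, 0.0, None))

def partition_dissimilarity(U):
    """D_U = [1]_n - U U^T / max_ij (U U^T)_ij, Eq. (21)."""
    G = U @ U.T
    return 1.0 - G / G.max()

def ccv_pearson(D, DU):
    """V_CCVp, Eq. (22): squared cosine between the mean-centered matrices."""
    A = D - D.mean()
    B = DU - DU.mean()
    return np.sum(A * B) ** 2 / (np.sum(A * A) * np.sum(B * B))
```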

B. Adapting indices that take (U, V) and X as input

Indices such as Xie-Beni and PCAES are not as easily adapted to kernel clustering because they take X as input. However, we can transform these indices by noticing a common property among them: all of the indices depend on weighted Euclidean distances involving X and V (and combinations thereof). Examining equations (15) through (20) reveals that there are four quantities that must be adapted: i) the objective function J_m(U); ii) \|\phi(v_j) - \phi(v_k)\|^2; iii) \|\phi(v_j) - \phi(\bar{v})\|^2; and iv) \|\phi(v_j) - \phi(\bar{x})\|^2. The following four propositions show how each of these can be calculated using the partition U and the kernel matrix K.

Proposition 1. For a given kernel function \phi: x \to \phi(x) \in \mathbb{R}^{d_K}, the kFCM objective can be formulated as

J_m(U) = \sum_{j=1}^{c} \sum_{i=1}^{n} u_{ij}^m \|\phi(x_i) - \phi(v_j)\|^2
       = \sum_{i=1}^{n} K_{ii} \sum_{j=1}^{c} u_{ij}^m - \sum_{j=1}^{c} (\tilde{u}_j^m)^T K \tilde{u}_j^m
       = \mathrm{diag}(K)^T \sum_{j=1}^{c} u_j^m - \mathrm{trace}\left[ (\tilde{U}^m)^T K \tilde{U}^m \right],   (28)

\tilde{U}^m = [\tilde{u}_1^m, \ldots, \tilde{u}_c^m], \quad \tilde{u}_j^m = \frac{(u_{1j}^m, \ldots, u_{nj}^m)^T}{\sqrt{\sum_{i=1}^{n} u_{ij}^m}}.

Proof: Expanding J_m(U) gives

J_m(U) = \sum_{j=1}^{c} \sum_{i=1}^{n} u_{ij}^m \left[ \phi(x_i)\cdot\phi(x_i) + \phi(v_j)\cdot\phi(v_j) - 2\,\phi(x_i)\cdot\phi(v_j) \right].   (29)

Using (9) to substitute for \phi(v_j) and substituting K_{ir} = \phi(x_i)\cdot\phi(x_r) into (29) gives

J_m(U) = \sum_{j=1}^{c} \sum_{i=1}^{n} u_{ij}^m \left[ K_{ii} + \frac{\sum_{k=1}^{n}\sum_{l=1}^{n} u_{kj}^m u_{lj}^m \, \phi(x_k)\cdot\phi(x_l)}{\sum_{k=1}^{n}\sum_{l=1}^{n} u_{kj}^m u_{lj}^m} - 2\,\frac{\sum_{k=1}^{n} u_{kj}^m \, \phi(x_k)\cdot\phi(x_i)}{\sum_{k=1}^{n} u_{kj}^m} \right]
       = \sum_{i=1}^{n} K_{ii} \sum_{j=1}^{c} u_{ij}^m + \sum_{j=1}^{c} \left[ \frac{\sum_{k=1}^{n}\sum_{l=1}^{n} u_{kj}^m u_{lj}^m \, \phi(x_k)\cdot\phi(x_l)}{\sum_{k=1}^{n} u_{kj}^m} - 2\,\frac{\sum_{k=1}^{n}\sum_{l=1}^{n} u_{kj}^m u_{lj}^m \, \phi(x_k)\cdot\phi(x_l)}{\sum_{k=1}^{n} u_{kj}^m} \right]
       = \sum_{i=1}^{n} K_{ii} \sum_{j=1}^{c} u_{ij}^m - \sum_{j=1}^{c} \frac{(u_j^m)^T K u_j^m}{\sum_{k=1}^{n} u_{kj}^m}.

Substituting \tilde{U}^m into the above equation finishes the proof.

Proposition 2. The squared Euclidean distance between two cluster centers is

\|\phi(v_j) - \phi(v_k)\|^2 = (\hat{u}_j^m)^T K \hat{u}_j^m + (\hat{u}_k^m)^T K \hat{u}_k^m - 2\,(\hat{u}_j^m)^T K \hat{u}_k^m,   (30)

where \hat{u}_j^m = (u_{1j}^m, \ldots, u_{nj}^m)^T / \sum_{i=1}^{n} u_{ij}^m.

Proof: We follow the same expansion as in the proof of Proposition 1; using (9) to substitute for \phi(v_j) and \phi(v_k) gives

\|\phi(v_j) - \phi(v_k)\|^2 = \frac{\sum_{i=1}^{n}\sum_{l=1}^{n} u_{ij}^m u_{lj}^m \, \phi(x_i)\cdot\phi(x_l)}{\left(\sum_{l=1}^{n} u_{lj}^m\right)^2} + \frac{\sum_{i=1}^{n}\sum_{l=1}^{n} u_{ik}^m u_{lk}^m \, \phi(x_i)\cdot\phi(x_l)}{\left(\sum_{l=1}^{n} u_{lk}^m\right)^2} - 2\,\frac{\sum_{i=1}^{n}\sum_{l=1}^{n} u_{ij}^m u_{lk}^m \, \phi(x_i)\cdot\phi(x_l)}{\sum_{i=1}^{n}\sum_{l=1}^{n} u_{ij}^m u_{lk}^m}
                            = \frac{(u_j^m)^T K u_j^m}{\left(\sum_{l=1}^{n} u_{lj}^m\right)^2} + \frac{(u_k^m)^T K u_k^m}{\left(\sum_{l=1}^{n} u_{lk}^m\right)^2} - 2\,\frac{(u_j^m)^T K u_k^m}{\sum_{i=1}^{n}\sum_{l=1}^{n} u_{ij}^m u_{lk}^m}.

Substituting \hat{u}^m into the above equation finishes the proof.
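Propositions 1 and 2 can be sanity-checked numerically: with the linear kernel K = XX^T the map \phi is the identity, so the kernel expressions must agree with direct vector computations. A short check along those lines (our own, assuming NumPy; not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))                       # toy data, n=30, d=4
U = rng.dirichlet(np.ones(3), size=30)             # random fuzzy 3-partition
m = 2.0
K = X @ X.T                                        # linear kernel, phi(x) = x

Um = U ** m
V = (Um.T @ X) / Um.sum(axis=0)[:, None]           # explicit centers, Eq. (9) with phi = identity

# Proposition 1: J_m(U) from K equals J_m(U, V) computed from X and V
Ut = Um / np.sqrt(Um.sum(axis=0, keepdims=True))   # columns of tilde-U^m
Jm_kernel = np.diag(K) @ Um.sum(axis=1) - np.trace(Ut.T @ K @ Ut)
Jm_vector = np.sum(Um * np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2) ** 2)
print(np.isclose(Jm_kernel, Jm_vector))            # True

# Proposition 2: ||v_j - v_k||^2 from K equals the direct squared distance
Uh = Um / Um.sum(axis=0, keepdims=True)            # columns of hat-u_j^m
G = Uh.T @ K @ Uh                                  # G_jk = (hat-u_j)^T K (hat-u_k)
VV_kernel = np.diag(G)[:, None] + np.diag(G)[None, :] - 2.0 * G
VV_vector = np.linalg.norm(V[:, None, :] - V[None, :, :], axis=2) ** 2
print(np.allclose(VV_kernel, VV_vector))           # True
```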

Proposition 3. The squared Euclidean distance between a cluster center \phi(v_j) and \phi(\bar{v}) = \frac{1}{c}\sum_{j=1}^{c} \phi(v_j) is

\|\phi(v_j) - \phi(\bar{v})\|^2 = (\hat{u}_j^m)^T K \hat{u}_j^m + (\hat{u}^m)^T K \hat{u}^m - 2\,(\hat{u}_j^m)^T K \hat{u}^m,   (31)

where \hat{u}^m = \frac{\sum_{j=1}^{c} u_j^m}{\sum_{i=1}^{n}\sum_{j=1}^{c} u_{ij}^m}.

Proof: Using (9) to expand the expression of \phi(\bar{v}) gives

\phi(\bar{v}) = \frac{1}{c}\sum_{j=1}^{c} \frac{\sum_{i=1}^{n} u_{ij}^m \phi(x_i)}{\sum_{i=1}^{n} u_{ij}^m} = \frac{\sum_{i=1}^{n} \left( \frac{1}{c}\sum_{j=1}^{c} u_{ij}^m \right) \phi(x_i)}{\sum_{i=1}^{n} \frac{1}{c}\sum_{j=1}^{c} u_{ij}^m}.

Comparing this equation to (9) shows that \phi(\bar{v}) can be considered a cluster center defined by the membership vector (1/c)\sum_{j=1}^{c} u_j^m. Thus, the same process used to prove Proposition 2 can be applied here.

Proposition 4. The squared Euclidean distance between a cluster center \phi(v_j) and \phi(\bar{x}) = \frac{1}{n}\sum_{i=1}^{n} \phi(x_i) is

\|\phi(v_j) - \phi(\bar{x})\|^2 = (\hat{u}_j^m)^T K \hat{u}_j^m + \mathbf{1}_n^T K \mathbf{1}_n - 2\,(\hat{u}_j^m)^T K \mathbf{1}_n,   (32)

where \mathbf{1}_n is the n-length vector with each element equal to 1/n.

Proof: We can write \phi(\bar{x}) as

\phi(\bar{x}) = \frac{\sum_{i=1}^{n} (\mathbf{1}_n)_i \, \phi(x_i)}{\sum_{i=1}^{n} (\mathbf{1}_n)_i},

which shows that \phi(\bar{x}) can also be considered a cluster center, of sorts, with the membership vector \mathbf{1}_n. Following the same steps as in Proposition 2 finishes this proof.

Let us denote the quantities in Propositions 1-4, respectively, as J_m(U), VV(j,k;m) = \|\phi(v_j) - \phi(v_k)\|^2, VV(j;m) = \|\phi(v_j) - \phi(\bar{v})\|^2, and VX(j;m) = \|\phi(v_j) - \phi(\bar{x})\|^2. We can now reformulate the indices at (15)-(20) in terms of these four quantities:

V_{FS}^{(-)} = J_m(U) - \sum_{j=1}^{c} \sum_{i=1}^{n} u_{ij}^m \, VV(j;m)   (33)

V_{XB}^{(-)} = \frac{J_m(U)}{n \, \min_{j \neq k}\{VV(j,k;m)\}}   (34)

V_{K}^{(-)} = \frac{J_2(U) + \frac{1}{c}\sum_{j=1}^{c} VX(j;m)}{\min_{j \neq k}\{VV(j,k;m)\}}   (35)

V_{TS}^{(-)} = \frac{J_2(U) + \frac{1}{c(c-1)}\sum_{j=1}^{c}\sum_{k=1}^{c} VV(j,k;m)}{\min_{j \neq k}\{VV(j,k;m)\} + 1/c}   (36)

V_{PCAES}^{(+)} = \sum_{j=1}^{c} \sum_{i=1}^{n} u_{ij}^2 / u_M - \sum_{j=1}^{c} \exp\left( -\min_{k \neq j}\{VV(j,k;m)\} / \beta_T \right),   (37)

u_M = \min_{j} \left\{ \sum_{i=1}^{n} u_{ij}^2 \right\}, \quad \beta_T = \frac{1}{c} \sum_{j=1}^{c} VX(j;m).

Remark 1. The kernel reformulations in Eqs. (33)-(37) are equivalent to their vector-data counterparts at (15)-(20). If the standard dot-product kernel K = X X^T is used, then the indices that take U and K as input return the same values as their respective counterparts that take U and X as input. The advantage of the reformulations is that they work with any kernel matrix K.
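To make the reformulations concrete, the following sketch (ours, assuming NumPy; not code from the paper) computes the kernel Xie-Beni index (34) from a kFCM output (U, K) via Propositions 1 and 2, together with the partition coefficient (11), which needs only U. The Fukuyama-Sugeno, Kwon, Tan-Sun, and PCAES reformulations follow the same pattern using Propositions 3 and 4.

```python
import numpy as np

def kernel_validity(U, K, m=2.0):
    """Return (V_PC, kernel V_XB) for a fuzzy partition U (n x c) and kernel matrix K (n x n)."""
    n, c = U.shape
    Um = U ** m
    s = Um.sum(axis=0)                               # s_j = sum_i u_ij^m

    # Proposition 1: J_m(U) = diag(K)^T sum_j u_j^m - trace((U~^m)^T K U~^m)
    Ut = Um / np.sqrt(s)[None, :]
    Jm = np.diag(K) @ Um.sum(axis=1) - np.trace(Ut.T @ K @ Ut)

    # Proposition 2: VV(j,k;m) = ||phi(v_j) - phi(v_k)||^2 from the hat-u vectors
    Uh = Um / s[None, :]
    G = Uh.T @ K @ Uh
    VV = np.diag(G)[:, None] + np.diag(G)[None, :] - 2.0 * G

    V_PC = np.sum(U ** 2) / n                        # Eq. (11), uses U only; larger is better
    off_diag = VV[~np.eye(c, dtype=bool)]
    V_XB = Jm / (n * off_diag.min())                 # Eq. (34); smaller is better
    return V_PC, V_XB
```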

TABLE I: Data Sets

Name          n    d   Classes
iris          150  5   3
glass         214  10  6
dermatology   366  34  6
ionosphere    351  35  2
ecoli         336  8   8
sonar         208  61  2
wdbc          569  31  2
wine          178  14  3

indices that take U and K as input will return the same value as their respective counterparts that take U and X as input. The advantage of the reformulations is that they work with any kernel matrix K. IV. E XPERIMENTS We tested the validity indices for both synthetic and real data. For each data set, we ran kFCM for each integer c, cmin ≤ c ≤ cmax . We then stored the preferred number of clusters (at the respective optimum) for each validity index. We did this for 31 different tests, each with a different initialization. The kFCM iterations were terminated when maxij {|(unew )ij − (uold )ij |} < 10−3 or when the number of iterations exceeded 10,000 (this max iteration termination criteria was never reached in all tests). For all data sets, unless noted, cmin = 2 and cmax = 10. A. Synthetic data Figure 1 shows plots of 8 synthetic data sets used to study the behavior of the cluster validity indices. Data set 10D-10 in view (e) is a 10-dimensional data set and view (e) shows the top two PCA components of this data set. View (e) suggests that there are 10 clusters in 10D-10. For 10D-10, cmin = 5 and cmax = 15. For 2D-15 and 2D-17 in views (b,c), cmin = 10 and cmax = 20. The ‘uniform’ data set is composed of 500 draws from a uniform distribution in the 2D unit box. This data set contains 1 cluster (or no clusters, depending on your viewpoint). We tested against this dataset to see how the validity indices would behave for data that do not contain clear cluster structure. Table II contains the results for the synthetic data tests. The last row of the table shows the total number of times that each validity index voted c (most often) as equal to the number of classes. For example, the entry 6(21) for the Rand index applied to partitions of the 2 curves data, which has the preferred value c = 2, means that this index chose c = 6 most, viz., 21 times in 31 trials. As another example, 5(31) for the Rand index, data set 2D-5, means that the Rand index chose the 5-partition of X in all 31 trials—a perfect score. And, for example, the 4 in the total row in the Rand column means that the Rand index chose the preferred number of clusters more times in 31 trials than for any other value of c in 4 of the 10 data sets. The 2D-5, 2D-17, and 10D-10 data have separated, somewhat-spherical clusters. For these data sets, most of the validity indices chose c as the number of classes. The

exceptions to this were the FS, PCAES, CCVp, and CCVs indices (CCVp did choose 10 clusters for 10D-10). In the synthetic data sets overall, the FS, PCAES, CCVp, and CCVs indices had the weakest performance in terms of choosing c to match the number of classes, with PCAES failing for all data sets and FS, CCVp, and CCVs matching on 1 test. Interestingly, CCVs was the only index to indicate 3 clusters for the 3 lines data set. This is a good example that there is no perfect index; every index will fail (or sometimes succeed) for some data set. For the 4 rings data, all the indices “fail,” suggesting 2, 7, or 10 clusters (albeit with very little variance). The failure in this data set is kFCM, which fails to partition this dataset in the apparent 4 clusters (even with several other kernel choices). Interestingly, when we used hard kernel c-means on the 4 rings data, the apparent c = 4 partition was found and was the clear choice among the validity indices. We will further investigate this phenomenon in the future. The best performing index overall for the 10 synthetic data sets was PC. However, this index has the serious drawback that VP C asymptotically increases to 1 as c → n. We alleviated this symptom by limiting the maximum number of clusters cmax ; however, in practice, you may not have a good idea of how large (or small) cmax should be. Furthermore, most real data sets do not contain well-separated, compact clusters and we now turn to testing on some real data. B. Real data The 8 data sets used here are available from the UCI Machine Learning Repository [25]. Table I shows the name of the data sets, the number of objects n, the feature dimensions d, and the number of physically labeled classes (Note that the number of labeled classes may or may not correspond to the number of clusters defined by any algorithm in any data set). Table III contains the results of testing the validity indices on the real data. The last row of this table shows the number of times that each index voted the number of classes as the preferred number of clusters. In contrast to the synthetic data, for the real data the CCVs index was the most successful at matching the number of classes with its choice of c. We hesitate to say that this makes it the “best” index as the cluster structure of these data could be completely different than the class structure. But this does show the utility of using different indices as each has its own strengths and weaknesses. Again, the PC, MPC, and PE indexes are also somewhat successful at matching the number of classes with their choice of c. The Rand index is also moderately successful, but it uses the class labels in its determination; hence, its choice of c will be always be biased towards the number of classes. Like the synthetic data tests, the FS and PCAES indices fail to match c to the number of classes in most tests—they are successful in 1 test each. However, the FS index is the only index to successfully choose 6 clusters for the glass data set—no other index was even close in this regard.

[Fig. 1: Synthetic data sets used to test cluster validity indices. Axes are the first two features (the top two PCA coefficients for 10D-10). Panels: (a) 2D-5, n = 5,000, c = 5; (b) 2D-15, n = 7,500, c = 15; (c) 2D-17s, n = 1,700, c = 17; (d) 10D-10, n = 1,000, c = 10; (e) 2D-3, n = 1,000, c = 3; (f) 3 lines, n = 600, c = 3; (g) 2 curves, n = 1,000, c = 2; (h) 4 rings, n = 1,000, c = 4.]

TABLE II: Preferred Number of Clusters For Several Validity Indices On Synthetic Data Sets

Data Set   classes  Rand(+)  XB(-)    K(-)     FS(-)    TS(-)    PCAES(+) PC(+)    PE(-)    KI(-)    MPC(+)   CCVp(+)  CCVs(+)
2D-5       5        5 (31)   5 (31)   5 (31)   6 (10)   5 (31)   10 (9)   5 (31)   5 (31)   5 (31)   5 (31)   4 (31)   4 (31)
2D-15      15       16 (12)  14 (13)  14 (13)  17 (8)   14 (13)  18 (8)   16 (12)  16 (12)  16 (12)  16 (12)  10 (31)  19 (9)
2D-17s     17       17 (11)  17 (11)  17 (11)  19 (11)  17 (11)  19 (9)   17 (11)  17 (11)  17 (11)  17 (11)  10 (19)  14 (6)
10D-10     10       10 (30)  10 (30)  10 (30)  10 (30)  10 (30)  11 (16)  10 (30)  10 (30)  10 (30)  10 (30)  10 (30)  6 (9)
2D-3       3        3 (31)   3 (31)   3 (31)   6 (9)    3 (31)   2 (15)   3 (31)   3 (31)   3 (31)   3 (31)   4 (23)   6 (9)
3 lines    3        5 (16)   6 (13)   6 (13)   10 (23)  6 (13)   10 (10)  6 (11)   2 (31)   10 (17)  10 (12)  4 (16)   3 (15)
2 curves   2        6 (21)   8 (11)   10 (11)  10 (24)  8 (11)   8 (17)   8 (14)   2 (31)   10 (23)  10 (23)  4 (31)   7 (7)
4 rings    4        10 (31)  7 (29)   7 (29)   10 (31)  7 (29)   7 (26)   2 (31)   2 (31)   7 (29)   7 (21)   7 (29)   10 (28)
uniform    ?        -        9 (31)   9 (31)   10 (31)  9 (31)   4 (31)   2 (31)   2 (31)   4 (31)   4 (31)   4 (31)   9 (7)
Total               4        4        4        1        4        0        4        5        4        4        1        1
Bold indicates that the number of clusters chosen by an index equals the preferred number of classes more times than any other choice of c. Numbers in parentheses indicate the number of trials out of 31 in which an index indicated the preferred number of clusters.

TABLE III: Preferred Number of Clusters For Several Validity Indices On Real Data Sets

Data Set     classes  Rand(+)  XB(-)    K(-)     FS(-)    TS(-)    PCAES(+) PC(+)    PE(-)    KI(-)    MPC(+)   CCVp(+)  CCVs(+)
iris         3        3 (25)   2 (31)   2 (31)   4 (7)    2 (31)   2 (30)   2 (31)   2 (31)   2 (31)   2 (31)   2 (31)   3 (20)
glass        6        10 (22)  3 (30)   3 (30)   6 (15)   4 (30)   3 (28)   2 (31)   2 (31)   2 (31)   3 (30)   3 (30)   3 (30)
dermatology  6        10 (31)  2 (26)   2 (23)   10 (31)  2 (31)   10 (7)   2 (31)   2 (31)   2 (31)   2 (8)    5 (6)    6 (6)
ionosphere   2        2 (31)   10 (7)   7 (6)    10 (31)  2 (31)   2 (9)    2 (31)   2 (31)   2 (31)   2 (12)   2 (10)   2 (6)
ecoli        8        4 (15)   3 (31)   3 (31)   10 (20)  3 (31)   3 (22)   2 (31)   2 (31)   2 (31)   3 (31)   3 (16)   2 (31)
sonar        2        2 (31)   2 (24)   2 (22)   10 (31)  2 (30)   4 (7)    2 (31)   2 (31)   2 (31)   2 (11)   5 (9)    9 (5)
wdbc         2        3 (31)   2 (31)   2 (31)   9 (30)   8 (30)   9 (31)   2 (31)   2 (31)   10 (30)  2 (31)   2 (31)   2 (29)
wine         3        5 (28)   2 (31)   2 (31)   7 (8)    8 (12)   2 (26)   2 (31)   2 (31)   9 (16)   2 (31)   2 (31)   2 (14)
Total                 3        2        2        1        2        1        3        3        2        3        2        4
Bold indicates that the number of clusters chosen by an index equals the preferred number of classes more times than any other choice of c. Numbers in parentheses indicate the number of trials out of 31 in which an index indicated the preferred number of clusters.

V. CONCLUSIONS

We showed how to adapt several popular cluster validity indices for use with kernel clustering. The four propositions given in this paper allow many well-known cluster validity indices to be directly formulated for use in kernel fuzzy clustering. We demonstrated how to use these propositions to reformulate five popular indices. Furthermore, we showed how validity indices that take dissimilarity data D as input, such as CCV, can be adapted. We showed the application of these indices in choosing the best fuzzy c-partition in several synthetic and real data sets with known cluster or class structure, using kernel FCM as the clustering algorithm. Not surprisingly, there was no best cluster validity index; some performed better than others for these data sets, but there were also data sets that stymied the "best" indices and were "solved" by an overall less effective index. For this reason, we stress that in practice one should use many indices with the hope of having them come to a consensus.

In the future, we are going to look at three unanswered questions. i) How can cluster validity indices help choose the best kernel? This question can take two forms: choosing between types of kernels, and setting kernel parameters, e.g., RBF width or polynomial degree.

ii) How can other cluster validity indices be adapted to kernel fuzzy clustering, such as those that use quantities based on sample-based covariance estimates? iii) Are there other, perhaps more specialized, cluster validity indices that could be useful for kernel clustering? In [26], the authors examine validity for shell-shaped clusters, such as those found by fuzzy c-shells. The main reason for using kernels is their ability to define strangely-shaped, non-linear boundaries between clusters or classes. Hence, we believe that indices like those proposed in [26] could be very useful for kernel clustering.

ACKNOWLEDGEMENTS

Havens is supported by the National Science Foundation under Grant #1019343 to the Computing Research Association for the CI Fellows Project. This material is based upon work supported by the Australian Research Council.


REFERENCES

[1] H. Frigui, Advances in Fuzzy Clustering and Feature Discrimination with Applications. John Wiley and Sons, 2007, ch. Simultaneous Clustering and Feature Discrimination with Applications, pp. 285-312.
[2] S. Khan, G. Situ, K. Decker, and C. Schmidt, "GoFigure: Automated Gene Ontology annotation," Bioinf., vol. 19, no. 18, pp. 2484-2485, 2003.
[3] S. Gunnemann, H. Kremer, D. Lenhard, and T. Seidl, "Subspace clustering for indexing high dimensional data: a main memory index based on local reductions and individual multi-representations," in Proc. Int. Conf. Extending Database Technology, Uppsala, Sweden, 2011, pp. 237-248.
[4] A. Jain and R. Dubes, Algorithms for Clustering Data. Englewood Cliffs, NJ: Prentice-Hall, 1988.
[5] L. Kaufman and P. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis. Wiley Blackwell, 2005.
[6] R. Xu and D. Wunsch II, Clustering. Piscataway, NJ: IEEE Press, 2009.
[7] D. A. Johnson and D. W. Wichern, Applied Multivariate Statistical Analysis, 6th ed. Englewood Cliffs, NJ: Prentice Hall, 2007.
[8] J. A. Hartigan, Clustering Algorithms. New York: Wiley, 1975.
[9] J. C. Bezdek, Pattern Recognition With Fuzzy Objective Function Algorithms. New York: Plenum, 1981.
[10] R. Krishnapuram and J. M. Keller, "A possibilistic approach to clustering," IEEE Trans. Fuzzy Systems, vol. 1, no. 2, May 1993.
[11] J. C. Bezdek and R. J. Hathaway, "Convergence of alternating optimization," Neural, Parallel, and Scientific Computations, vol. 11, no. 4, pp. 351-368, Dec. 2003.
[12] Z. Wu, W. Xie, and J. Yu, "Fuzzy c-means clustering algorithm based on kernel method," in Proc. Int. Conf. Computational Intelligence and Multimedia Applications, September 2003, pp. 49-54.
[13] R. J. Hathaway, J. M. Huband, and J. C. Bezdek, "A kernelized non-Euclidean relational fuzzy c-means algorithm," in Proc. IEEE Int. Conf. Fuzzy Systems, 2005, pp. 414-419.
[14] R. Dubes and A. Jain, "Clustering techniques: the user's dilemma," Pattern Recognition, vol. 8, pp. 247-260, 1977.


[15] J. C. Bezdek, M. Windham, and R. Ehrlich, "Statistical parameters of fuzzy cluster validity functionals," Int. J. Computing and Information Sciences, vol. 9, no. 4, pp. 232-336, 1980.
[16] M. Windham, "Cluster validity for the fuzzy c-means clustering algorithm," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 4, no. 4, pp. 357-363, 1982.
[17] N. Pal and J. C. Bezdek, "On cluster validity for the fuzzy c-means model," IEEE Trans. Fuzzy Systems, vol. 3, no. 3, pp. 370-376, 1995.
[18] W. Wang and Y. Zhang, "On fuzzy cluster validity indices," Fuzzy Sets and Systems, vol. 158, pp. 2095-2117, 2007.
[19] M. Halkidi, Y. Batistakis, and M. Vazirgiannis, "Cluster validity checking methods: part II," ACM SIGMOD Record, vol. 31, no. 3, pp. 19-27, 2002.
[20] D. Anderson, J. C. Bezdek, M. Popescu, and J. M. Keller, "Comparing fuzzy, probabilistic, and possibilistic partitions," IEEE Trans. Fuzzy Systems, vol. 18, no. 5, pp. 906-917, October 2010.
[21] M. Popescu, J. M. Keller, J. C. Bezdek, and T. C. Havens, "Correlation cluster validity," in Proc. IEEE Int. Conf. Systems, Man, and Cybernetics, October 2011, pp. 2531-2536.
[22] R. J. Hathaway and J. C. Bezdek, "Visual cluster validity for prototype generator clustering models," Pattern Recognition Letters, vol. 24, pp. 1563-1569, 2003.
[23] J. M. Huband and J. C. Bezdek, Computational Intelligence: Research Frontiers. Berlin / Heidelberg / New York: Springer, June 2008, ch. VCV2 - Visual Cluster Validity, pp. 293-308.
[24] F. Queiroz, A. Braga, and W. Pedrycz, "Sorted kernel matrices as cluster validity indexes," in Proc. IFSA/EUSFLAT Conf., 2009, pp. 1490-1495.
[25] A. Asuncion and D. J. Newman, "UCI machine learning repository," http://www.ics.uci.edu/~mlearn/MLRepository.html, 2007.
[26] R. Krishnapuram, O. Nasraoui, and H. Frigui, "The subsurface density criterion and its applications to linear/circular boundary detection and planar/spherical surface approximation," in Proc. IEEE Int. Conf. Fuzzy Systems, vol. 2, 1993, pp. 725-730.
