A Fuzzy Clustering and Fuzzy Merging Algorithm

Carl G. Looney
Computer Science Department/171
University of Nevada, Reno
Reno, NV 89557
[email protected]
Abstract. Some major problems in clustering are: i) finding the optimal number K of clusters; ii) assessing the validity of a given clustering; iii) permitting the classes to form natural shapes rather than forcing them into normed balls of the distance function; iv) preventing the order in which the feature vectors are read in from affecting the clustering; and v) preventing the order of merging from affecting the clustering. The k-means algorithm is the most efficient and easiest to implement and has known convergence, but it suffers from all of the above deficiencies. We employ a relatively large number K of uniformly randomly distributed initial prototypes and then thin them by deleting any prototype that is too close to another, so as to leave fewer, uniformly distributed prototypes. We then employ the k-means algorithm, eliminate empty and very small clusters, and iterate a process of computing a new type of fuzzy prototype and reassigning the feature vectors until the prototypes become fixed. At that point we merge the K small clusters into a smaller number of larger classes. The algorithm is tested on both simple and difficult data sets. We modify the Xie-Beni validity measure to determine the goodness of the clustering for multiple values of K.
Keywords. fuzzy clustering, fuzzy merging, classification, pattern recognition
1. Introduction
The classification of a set of entities by a learning system is a powerful tool for acquiring knowledge from observations stored in data files. Given a set of feature vectors, a self-organizing process clusters them into classes with similar feature values, either from a top-down (divisive) or bottom-up (agglomerative) process. Classes can also be learned under a supervised training process where each input vector is labeled and the process parameters are adjusted until the process output matches the codeword for the correct label. Nearest neighbor methods use a labeled reference set of exemplar vectors and put any novel input vector into the cluster to which a majority of its nearest labeled neighbors belong. Artificial neural networks [14] also learn this way. Here we investigate an agglomerative self-organizing (clustering) approach.

The major problems in clustering are: i) how do we find the number K of clusters for a given set of vectors {x(q): q = 1,...,Q} when K is unknown? ii) how do we assess the validity of a given clustering of a data set into K clusters? iii) how do we permit clusters to take their own natural shape rather than forcing them into the shapes of normed unit balls? iv) how do we assure that the clustering is independent of the order in which the feature vectors are input? and v) how do we assure the independence of the order in which clusters are merged?

A ball of uniformly distributed vectors has no cluster structure [22]. But if a set of vectors is partitioned into multiple subsets that are compactly distributed about their centers and the centers are relatively far apart, then there is strong cluster structure. We say that such clusters are compact and well-separated. These are criteria for well clustered data sets when the clusters are contained in nonintersecting normed balls. Not all data sets have strong cluster structure.

Section 2 reviews some important clustering algorithms, although none of them solves the major problems given above. Section 3 develops our new type of fuzzy clustering algorithm, while Section 4 extends this agglomerative process by merging an excess of small clusters into fewer larger clusters.
Together the fuzzy clustering and fuzzy merging overcome to a significant extent the problems listed above. Section 5 provides the results of computer runs on various data sets. Section 6 gives some conclusions.
2. Clustering Algorithms
In what follows, {x(q): q = 1,...,Q} is a set of Q feature vectors that is to be partitioned (clustered) into K clusters {Ck: k = 1,...,K}. For each kth cluster Ck its prototype, or center, is designated by c(k). The dimension of the feature vectors x(q) = (x1(q),...,xN(q)) and centers c(k) = (c1(k),...,cN(k)) is N. The feature vectors are standardized independently in each component so the vectors belong to the N-cube [0,1]^N.

2.1 The k-means Algorithm (Forgy [10], 1965 and MacQueen [15], 1967). This is the most heavily used clustering algorithm because it is used as an initial process for many other algorithms [18]. The number K of clusters is input first and K centers are initialized by taking K feature vectors as seeds via c(k) ← x(k) for k = 1,...,K. The main loop assigns each x(q) to the Ck with the closest prototype (minimum distance assignment); averages the members of each Ck to obtain a new prototype c(k); reassigns all x(q) to clusters by minimum distance assignment; and it stops if no prototypes have changed, else it repeats the loop. This is the Forgy version. MacQueen's version recomputes a new prototype every time a feature vector is assigned to a cluster. Selim and Ismail [20] show that convergence is guaranteed to a local minimum of the total sum-squared error SSE. Letting clust[q] = k designate that x(q) is in Ck, we can write
\[ \mathrm{SSE} \;=\; \sum_{k=1}^{K} \; \sum_{\{q:\, \mathrm{clust}[q]=k\}} \| x^{(q)} - c^{(k)} \|^{2} \qquad (1) \]
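For illustration, a minimal sketch of the Forgy version in Python/NumPy follows; the function name, the random seeding and the iteration cap are illustrative assumptions rather than part of the formulation above.

import numpy as np

def forgy_kmeans(X, K, max_iters=100, seed=0):
    """Minimal Forgy-style k-means; X is a Q x N array of standardized feature vectors."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)].copy()   # seed with K feature vectors
    for _ in range(max_iters):
        # Minimum-distance assignment of every x(q) to a cluster.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each prototype as the average of its assigned vectors.
        new_centers = np.array([X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
                                for k in range(K)])
        if np.allclose(new_centers, centers):   # stop if no prototype changed
            break
        centers = new_centers
    sse = np.sum((X - centers[labels]) ** 2)    # Equation (1)
    return labels, centers, sse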
The disadvantages are: i) K must be given (it is often unknown); ii) the clusters formed can be different for different input orders of the feature vectors; iii) no improper clusters are merged or broken up; iv) the cluster averages are not always the most representative vectors for the clusters; and v) the cluster boundaries have the shape of the normed unit ball (usually the Euclidean or Mahalanobis norm, see [14]), which may be unnatural. Peña et al. [18] showed in 1999 that random drawing of the seed (initial) centers is more independent of the ordering than is the Forgy-MacQueen selection method. While the Kaufman method is slightly better than random selection (see [18]), it is computationally much more expensive. Snarey et al. [21] proposed a maximin method for selecting a subset of the original feature vectors as seeds.

2.2 The ISODATA. The ISODATA method of Ball and Hall [2] in 1967 employs rules with parameters for clustering, merging, splitting, creating new clusters and eliminating old ones (Fu [11] abstracted this process into a generalized algorithm in 1982). K is ultimately determined, starting from some user-given initial value K0. A cluster is broken into smaller ones if the maximal spread in a single dimension is too great. Some disadvantages are: i) multiple parameters must be given by the user, although they are not known a priori; ii) a considerable amount of experimentation may be required to get reasonable values; iii) the clusters are ball shaped as determined by the distance function; iv) the value determined for K depends on the parameters given by the user and is not necessarily the best value; and v) a cluster average is often not the best prototype for a cluster.

2.3 Other Basic Methods. Graphical methods (e.g., see [14]) connect any pair of feature vectors if the distance between them is less than a given threshold, which can result in long dendritic clusters. A variation of this constructs a minimal spanning tree and then removes the longest branches to leave connected clusters. Such methods are generally inefficient relative to k-means types of algorithms, and different thresholds can yield very different clusterings. Syntactic methods (e.g., see [14]) hold promise but are not widely applicable.
2.4 Fuzzy Clustering Algorithms. The fuzzy c-means algorithm of Bezdek [3,5] in 1973 (also called the fuzzy ISODATA) uses fuzzy weighting with positive weights {wqk} to determine the prototypes {c(k)} for the K clusters, where K must be given. The weights minimize the constrained functional J, where

\[ J(w_{qk}, c^{(k)}) \;=\; \sum_{k=1}^{K} \sum_{q=1}^{Q} (w_{qk})^{p}\, \| x^{(q)} - c^{(k)} \|^{2} \qquad (2) \]

subject to

\[ \sum_{k=1}^{K} w_{qk} = 1 \ \text{ for each } q, \qquad 0 \le w_{qk} \le 1. \qquad (3) \]

Adjoining the constraints with Lagrange multipliers \(\lambda_q\) gives

\[ J(w_{qk}, c^{(k)}) \;=\; \sum_{k=1}^{K} \sum_{q=1}^{Q} (w_{qk})^{p}\, \| x^{(q)} - c^{(k)} \|^{2} \;+\; \sum_{q=1}^{Q} \lambda_{q} \Big( \sum_{k=1}^{K} w_{qk} - 1 \Big) \qquad (4) \]

whose minimization yields

\[ w_{qk} \;=\; \big( 1 / \| x^{(q)} - c^{(k)} \|^{2} \big)^{1/(p-1)} \Big/ \sum_{r=1}^{K} \big( 1 / \| x^{(q)} - c^{(r)} \|^{2} \big)^{1/(p-1)}, \quad k = 1,...,K, \ q = 1,...,Q \qquad (5) \]

\[ c^{(k)} \;=\; \sum_{q=1}^{Q} W_{qk}\, x^{(q)}, \ k = 1,...,K; \qquad W_{qk} \;=\; w_{qk} \Big/ \sum_{r=1}^{Q} w_{rk}, \ q = 1,...,Q \qquad (6a,b) \]
As p approaches unity, the close neighbors are weighted much more heavily, but as p increases beyond 2, the weightings tend to become more uniform. The fuzzy c-means algorithm allows each feature vector to belong to multiple clusters with various fuzzy membership values. The algorithm is: i) input K (2 ≤ K ≤ Q) and draw initial fuzzy weights {wqk} at random in [0,1]; ii) standardize the initial weights for each qth feature vector over all K clusters via

\[ w_{qk} \;=\; w_{qk} \Big/ \sum_{r=1}^{K} w_{qr} \qquad (7) \]

iii) compute Jold from Equation (2); iv) standardize the weights for each k over q = 1,...,Q via Equation (6b) to obtain Wqk; v) use Equation (6a) to compute new prototypes c(k), k = 1,...,K; vi) update the weights {wqk} from Equation (5); vii) compute Jnew from Equation (2); and viii) if |Jnew - Jold| < ε, then stop, else Jold ← Jnew and go to Step iv above. Convergence to a fuzzy weight set that minimizes the functional is not assured in general for the fuzzy clustering algorithms due to local minima and saddle points of Equation (4).
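For concreteness, a compact sketch of steps i) through viii) in Python/NumPy is given below, assuming the Euclidean norm; the tolerance, iteration cap and random initialization details are illustrative.

import numpy as np

def fuzzy_c_means(X, K, p=2.0, eps=1e-4, max_iters=100, seed=0):
    """Sketch of the fuzzy c-means loop of Equations (2), (5), (6a,b) and (7); X is Q x N."""
    rng = np.random.default_rng(seed)
    w = rng.random((X.shape[0], K))
    w /= w.sum(axis=1, keepdims=True)              # Equation (7): normalize over k for each q
    centers = (w / w.sum(axis=0)).T @ X            # Equations (6a,b): fuzzy-weighted prototypes
    J_old = np.inf
    for _ in range(max_iters):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2) + 1e-12
        w = (1.0 / d2) ** (1.0 / (p - 1.0))        # Equation (5), numerator
        w /= w.sum(axis=1, keepdims=True)          # Equation (5), denominator
        centers = (w / w.sum(axis=0)).T @ X        # Equations (6a,b)
        J_new = np.sum((w ** p) * d2)              # Equation (2)
        if abs(J_new - J_old) < eps:               # step viii): stop when J changes little
            break
        J_old = J_new
    return w, centers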
A fuzzy version of the ISODATA algorithm was developed by Dunn [9] in 1973. Keller et al. [12] published a fuzzy k-nearest neighbor algorithm in 1985 (see [7], 1968, for nearest neighbor rules). The crisp (nonfuzzy) algorithm uses a set S = {x(q): q = 1,...,Q} of labeled feature vectors and assigns any novel input vector x to a class based upon the class to which a majority of its k nearest neighbors belong. The fuzzification of the process utilizes a fuzzy weight wqk for each labeled qth feature vector and each kth class such that for each q the weights sum to unity over the K classes. The gradient based fuzzy c-means algorithm (1994) of Park and Dagher [17] minimizes the partial functionals and constraints given below (also see [14, p. 253]). A single feature vector x(q) is input at a time and any nearby c(k) are moved toward x(q) by a small amount controlled by the learning rate η and the fuzzy weight wqk, unlike the fuzzy c-means algorithm.
\[ J^{(q)}(w_{qk}, c^{(k)}) \;=\; \sum_{k=1}^{K} w_{qk}\, \| x^{(q)} - c^{(k)} \|^{2}, \ q = 1,...,Q; \qquad \sum_{k=1}^{K} w_{qk} = 1 \ \text{ for each } q \qquad (8a,b) \]

\[ \partial J^{(q)}(w_{qk}, c^{(k)}) / \partial w_{qk} = 0; \qquad w_{qk} \;=\; \big\{ 1 / \| x^{(q)} - c^{(k)} \|^{2} \big\} \Big/ \sum_{r=1}^{K} \big\{ 1 / \| x^{(q)} - c^{(r)} \|^{2} \big\} \qquad (9a,b) \]

\[ c_{n}^{(k)} \;\leftarrow\; c_{n}^{(k)} - \eta\, \partial J^{(q)} / \partial c_{n}^{(k)} \;=\; c_{n}^{(k)} + \eta\, (w_{qk})^{2}\, [\, x_{n}^{(q)} - c_{n}^{(k)} \,], \quad n = 1,...,N \qquad (10) \]
The unsupervised fuzzy competitive learning (UFCL) algorithm (Chung and Lee [8], 1994, or see [14]) extends the unsupervised competitive learning (UCL) algorithm. The constrained functional to be minimized and the resulting optimal weights and component adjustments are

\[ J(w_{qk}, c^{(k)}) \;=\; \sum_{q=1}^{Q} \sum_{k=1}^{K} (w_{qk})^{p}\, \| x^{(q)} - c^{(k)} \|^{2}, \ \ 0 < w_{qk} < 1; \qquad \sum_{k=1}^{K} w_{qk} = 1 \ \text{ (for each } q) \qquad (11a,b) \]

\[ w_{qk} \;=\; \big\{ 1 / \| x^{(q)} - c^{(k)} \|^{2} \big\}^{1/(p-1)} \Big/ \sum_{r=1}^{K} \big\{ 1 / \| x^{(q)} - c^{(r)} \|^{2} \big\}^{1/(p-1)} \qquad (12) \]

\[ c_{n}^{(k)} \;\leftarrow\; c_{n}^{(k)} + \eta\, (w_{qk})^{p}\, [\, x_{n}^{(q)} - c_{n}^{(k)} \,] \qquad (n = 1,...,N; \ k = 1,...,K) \qquad (13) \]
A set of prototypes is selected and each feature vector is input, one at a time, and the nearest (winning) prototype is moved a small step toward it according to Equation (13). After all of the feature vectors have been put through this training, the process is repeated until the prototypes become stable (the learning rate η decreases over the iterations).
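A sketch of one UFCL training pass in Python/NumPy is given below. Here the update of Equation (13) is applied to every prototype with its fuzzy weight; in the crisp competitive variant only the winning prototype would move. The learning rate value and its handling are illustrative assumptions.

import numpy as np

def ufcl_epoch(X, centers, p=2.0, eta=0.1):
    """One pass of unsupervised fuzzy competitive learning.
    X: Q x N standardized data; centers: K x N float array of prototypes (updated in place)."""
    for x in X:
        d2 = ((centers - x) ** 2).sum(axis=1) + 1e-12
        w = (1.0 / d2) ** (1.0 / (p - 1.0))
        w /= w.sum()                                        # Equation (12): fuzzy memberships of x
        centers += eta * (w[:, None] ** p) * (x - centers)  # Equation (13): small step toward x
    return centers

In practice the pass is repeated with a gradually decreasing η until the prototypes stabilize, as noted above.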
The mountain method of Yager and Filev [25], 1994, sums Gaussian functions centered on the feature vectors to build a combined "mountain" function over the more densely located points. The peaks are the cluster centers. While clever, this method is very computationally expensive. Chiu's method [6] of 1994 reduces the computation somewhat. The mountain function is actually a fuzzy set membership function.

2.5 Clustering Validity. A clustering validity measure provides a goodness-of-clustering value. Some general measures for fuzzy clusters [16] are the partition coefficient (PC) and the classification entropy (CE) (invented by Bezdek [4]) and the proportional exponent (PE). The uniform distribution functional (UDF) [22] is an improved measure that was proposed by Windham in 1982. These are not as independent of K as is the Xie-Beni [24] clustering validity, which combines compactness and separation measures. The compactness-to-separation ratio is defined by
\[ \xi \;=\; \Big\{ (1/K) \sum_{k=1}^{K} \sigma_{k}^{2} \Big\} \Big/ \{ D_{\min} \}^{2}; \qquad \sigma_{k}^{2} \;=\; \sum_{q=1}^{Q} w_{qk}\, \| x^{(q)} - c^{(k)} \|^{2}, \ k = 1,...,K \qquad (14a,b) \]
where Dmin is the minimum distance between the prototypes (cluster centers). A larger Dmin therefore lowers ξ. Each σ_k² is a fuzzy weighted mean-square error for the kth cluster, which is smaller for more compact clusters. Thus a lower value of ξ means more compactness and greater separation.

We modify the Xie-Beni fuzzy validity measure by summing over only the members of each cluster rather than over all Q exemplars for each cluster. We also take the reciprocal κ = 1/ξ, so that a larger value of κ indicates a better clustering, and we call κ the modified Xie-Beni clustering validity measure.
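A sketch of the modified measure κ in Python/NumPy, summing each σ_k² only over the members of cluster k as just described; the array-based interface is an assumption for the illustration.

import numpy as np

def modified_xie_beni(X, centers, labels, weights):
    """kappa = 1/xi with xi = (mean of sigma_k^2) / Dmin^2; larger kappa means a better clustering."""
    K = len(centers)
    sigma2 = np.empty(K)
    for k in range(K):
        members = (labels == k)
        d2 = ((X[members] - centers[k]) ** 2).sum(axis=1)
        sigma2[k] = np.sum(weights[members, k] * d2)        # Equation (14b), members of Ck only
    diffs = centers[:, None, :] - centers[None, :, :]
    D = np.sqrt((diffs ** 2).sum(axis=2))
    Dmin = D[np.triu_indices(K, k=1)].min()                 # minimum distance between prototypes
    xi = sigma2.mean() / (Dmin ** 2)                        # Equation (14a)
    return 1.0 / xi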
3. A New Fuzzy Clustering Algorithm
The weighted fuzzy expected value (WFEV) of a set of real values x1,...,xP was defined by Schneider and Craig [19] to be a prototypical value that is more representative of the set than is either the average or the median. It uses the two-sided decaying exponential, but we use the bell-shaped Gaussian function instead in Equation (16) below because it is a canonical fuzzy set membership function for the linguistic variable CLOSE_TO_CENTER. After computing the arithmetic mean μ^(0) of the set of real values x1,...,xP as the initial center value, we employ Picard iterations to obtain the new value
\[ \mu^{(r+1)} \;=\; \sum_{p=1}^{P} \alpha_{p}^{(r)}\, x_{p}, \qquad (\sigma^{2})^{(r+1)} \;=\; \sum_{p=1}^{P} \alpha_{p}^{(r)}\, (x_{p} - \mu^{(r)})^{2} \qquad (15a,b) \]

\[ \alpha_{p}^{(r)} \;=\; \exp\!\big[ -(x_{p} - \mu^{(r)})^{2} / (2\sigma^{2})^{(r)} \big] \Big/ \Big\{ \sum_{m=1}^{P} \exp\!\big[ -(x_{m} - \mu^{(r)})^{2} / (2\sigma^{2})^{(r)} \big] \Big\} \qquad (16) \]
The value μ = μ^(∞) to which this process converges is our modified weighted fuzzy expected value (MWFEV). Usually 5 to 8 iterations are sufficient for 4 or more digits of accuracy. For vectors the MWFEV is computed componentwise. The σ² in Equations (15b, 16) is the weighted fuzzy variance (WFV). It also changes over the iterations, and an initial value for the spread parameter σ can be set at about 1/4 of the average distance between cluster centers. An open question currently is that of guaranteed convergence of the MWFEV and WFV. Our preliminary investigation shows empirically that the WFV can oscillate with increasing iterations, but that a subsequence converges (this is also true of the fuzzy c-means weights). Therefore we substitute a fixed nonfuzzy mean-square error for the WFV in our algorithm.
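A minimal sketch of the MWFEV with this substitution (a fixed nonfuzzy mean-square error in place of the iterated WFV); the fixed spread value and the iteration count are assumptions.

import numpy as np

def mwfev(x, n_iters=8, sigma2=None):
    """Modified weighted fuzzy expected value of a 1-D sample x (apply componentwise to vectors)."""
    x = np.asarray(x, dtype=float)
    mu = x.mean()                             # start from the arithmetic mean
    if sigma2 is None:
        sigma2 = x.var() + 1e-12              # fixed nonfuzzy mean-square error as the spread
    for _ in range(n_iters):                  # 5 to 8 Picard iterations usually suffice
        alpha = np.exp(-(x - mu) ** 2 / (2.0 * sigma2))     # Gaussian closeness weights, Equation (16)
        alpha /= alpha.sum()
        mu = np.sum(alpha * x)                # Equation (15a)
    return mu

# Applied componentwise to the five vectors of Figure 1 discussed below, the result stays near the
# dense group of four points and is far less influenced by the outlier (5,1) than the mean is.
pts = np.array([[1, 2], [2, 2], [1, 3], [2, 3], [5, 1]], dtype=float)
center = np.array([mwfev(pts[:, 0]), mwfev(pts[:, 1])])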
Table 1. Number of Initial Cluster Centers.

  N = Dimension of Inputs   Q = Number of Feature Vectors   K = max{6N + 12 log2 Q, Q}
  2,  2,  2                 16,  32,  64                     60,  72,  84
  4,  4,  4                 32,  64, 128                     84,  96, 128
  8,  8,  8                 32,  64, 128                    108, 120, 132
  16, 16, 16                64, 128, 256                    168, 180, 256
  32, 32, 64                128, 256, 512                   276, 288, 512
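The K column of Table 1 follows directly from the rule in its heading; a small check in Python, assuming Q is a power of two so that log2 Q is an integer:

import math

def initial_K(N, Q):
    """Initial number of prototypes: K = max{6N + 12 log2 Q, Q}."""
    return max(6 * N + 12 * int(math.log2(Q)), Q)

print(initial_K(2, 16), initial_K(32, 256), initial_K(64, 512))   # 60, 288, 512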
The MWFEV method weights outliers less but weights the central and more densely distributed vectors more heavily. It provides a more typical vector to represent a cluster. Figure 1 shows the MWFEV versus the mean and the median for a simple 2-dimensional example. The 5 vectors are (1,2), (2,2), (1,3), (2,3) and (5,1). The outlier (5,1) influences the mean (2.2,2.2) and the median (2,2) too strongly. The MWFEV vector (1.503,2.282), however, is influenced only slightly by the outlier in the y-direction and even less in the x-direction.

We first standardize the Q sample feature vectors by affinely mapping all Q values of the same component into the interval [0,1] (we do this independently for each nth component over n = 1,...,N). The standardized vectors thus belong to the unit N-cube [0,1]^N. We circumvent the problem that the order of the input vectors will affect the clustering by using a large number K of uniformly distributed initial prototypes to spread out the initial cluster centers. To cluster the sample {x(q): q = 1,...,Q} we use a relatively large initial number K of uniformly randomly drawn prototypes, thin out the prototypes, assign the vectors to clusters via k-means and delete empty and very small clusters, then do our special type of fuzzy clustering followed by fuzzy merging. The initial K is empirically derived as shown in Table 1 by K = max{6N + 12 log2 Q, Q}. For example, for N = 32 and Q = 256 the computed K would be 288. In the main loop we compute the fuzzy weights and the MWFEV (componentwise) for each kth cluster to obtain the prototype c(k). We also compute the mean-squared error σ_k² for each kth cluster. Then we again assign all of the feature vectors to clusters by the minimum distance assignment. This process is repeated until the fuzzy centers do not change. The fuzzy clustering algorithm follows.
A Fuzzy Clustering Algorithm

Step 1: I = 0; input I0;                               /* Set current and maximum iteration numbers. */
        input N and Q;                                 /* N = no. of components; Q = no. of feature vectors. */
        compute K;                                     /* K = max{6N + 12 log2 Q, Q}. */
        for q = 1 to Q do input x(q);                  /* Read in the Q feature vectors to be clustered. */
        for k = 1 to K do c(k) = random_vector();      /* Generate uniform random prototypes in [0,1]^N. */
        for k = 1 to K do count[k] = 0;                /* Initialize the number of vectors in each cluster. */
        τ = 1.0/K^(1/N);                               /* Threshold for prototypes being too close. */

Step 2: repeat
          dmin = 9999.9;                               /* Parameter for smallest distance between centers. */
          for k = 1 to K-1 do
            for kk = k+1 to K do                       /* Check all distances between prototypes */
              if (||c(k) - c(kk)|| < dmin) then        /* to obtain the minimum such distance. */
                dmin = ||c(k) - c(kk)||; kmin = kk;    /* kmin is the index for possible deletion. */
          if (dmin < τ) then                           /* If the deletion criterion is met, then */
            for k = kmin to K-1 do c(k) = c(k+1);      /* remove the prototype that is too close to another */
                                                       /* by moving all higher prototypes down one place. */
            K = K - 1; again = true;                   /* Decrement K upon each removal and repeat. */
          else again = false;                          /* Do not repeat if the smallest distance is large enough. */
        until (again = false);

Step 3: for i = 1 to 20 do                             /* Do 20 k-means iterations. */
          for k = 1 to K do count[k] = 0;              /* Reset the counts for this pass. */
          for q = 1 to Q do
            k* = nearest(q);                           /* c(k*) is the nearest prototype to x(q). */
            clust[q] = k*;                             /* Assign vector x(q) to cluster k*. */
            count[k*] = count[k*] + 1;                 /* Update the count of cluster k*. */
          for k = 1 to K do
            c(k) = (0,...,0);                          /* Zero out all prototypes (centers) for averaging. */
            for q = 1 to Q do                          /* For each feature vector, check whether it belongs */
              if (clust[q] = k) then c(k) = c(k) + x(q);   /* to cluster k; if so, add it to the sum. */
            c(k) = c(k)/count[k];                      /* Average to get the new prototype (componentwise). */

Step 4: eliminate(p);                                  /* Eliminate clusters of count ≤ p (user-given small p). */

Step 5: for k = 1 to K do                              /* Begin fuzzy clustering. */
          fuzzyweights();                              /* Use Eqn. (16) to compute fuzzy weights in the kth cluster. */
          σ_k² = variance();                           /* Get the variance (mean-square error) of each cluster. */
          c(k) = fuzaverage();                         /* Fuzzy prototypes (Eqn. 15a, componentwise MWFEV). */
        for k = 1 to K do count[k] = 0;                /* Zero out the counts for reassignment. */

Step 6: for q = 1 to Q do                              /* Assign each x(q) to a cluster via minimal distance. */
          k* = nearest(q);                             /* Assign x(q) to the nearest c(k*). */
          clust[q] = k*;                               /* Record this with the index clust[q] = k*. */
          count[k*] = count[k*] + 1;                   /* Increment the count of the assigned cluster. */
        I = I + 1;                                     /* Increment the iteration number. */

Step 7: if (I > I0) or (no prototype changed) then     /* Check the stopping criterion. */
          compute κ;                                   /* Compute the modified Xie-Beni validity measure. */
          merge(); stop;                               /* Call the function to merge and eliminate empty results. */
        else go to Step 5;                             /* Otherwise repeat the fuzzy clustering loop. */
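A sketch of the Step 2 thinning in Python/NumPy, using the same threshold τ = 1/K^(1/N) set in Step 1; the helper name and the list-based bookkeeping are illustrative.

import numpy as np

def thin_prototypes(centers, N):
    """Delete prototypes closer than tau = 1/K^(1/N) to another prototype, leaving fewer,
    more sparsely and uniformly distributed initial centers."""
    centers = [np.asarray(c, dtype=float) for c in centers]
    tau = 1.0 / (len(centers) ** (1.0 / N))           # threshold from the initial K
    again = True
    while again and len(centers) > 1:
        C = np.array(centers)
        D = np.linalg.norm(C[:, None, :] - C[None, :, :], axis=2)
        np.fill_diagonal(D, np.inf)
        i, j = np.unravel_index(D.argmin(), D.shape)  # closest pair of prototypes
        if D[i, j] < tau:
            del centers[max(i, j)]                    # remove one of the pair; K decreases by 1
        else:
            again = False
    return np.array(centers)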
Elimination Algorithm (eliminate(p))

for k = 1 to K-1 do                          /* For each cluster k, eliminate it if its */
  if (count[k] ≤ p) then                     /* count ≤ p (empty clusters also). */
    for i = k to K-1 do                      /* If so, then move all count indices down by 1. */
      count[i] = count[i+1];
      for q = 1 to Q do                      /* Move assignment indices down by 1 */
        if (clust[q] = i+1) then clust[q] = i;   /* for any vector in the (i+1)th cluster. */
    K = K - 1;                               /* Decrement the number of clusters on elimination. */
if (count[K] ≤ p) then K = K - 1;            /* Check whether to eliminate the last cluster center. */
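A sketch of eliminate(p) in Python/NumPy; here the vectors of removed clusters are simply marked unassigned (label -1) and are picked up again at the next minimum-distance reassignment in Step 6. That bookkeeping is an assumption for the illustration and differs from the index shifting used above.

import numpy as np

def eliminate(centers, labels, counts, p):
    """Remove every cluster whose count is <= p (including empty clusters)."""
    keep = np.where(counts > p)[0]                    # surviving cluster indices
    new_index = {int(old): new for new, old in enumerate(keep)}
    labels = np.array([new_index.get(int(k), -1) for k in labels])   # -1: needs reassignment
    return centers[keep], labels, counts[keep]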
4. Fuzzy Merging of Clusters
We first find the R = K(K-1)/2 distances {d(r): r = 1,...,R} between unique pairs of prototypes. For each such distance, the indices k1(r) and k2(r) record the two cluster indices for which the distance d(r) was computed, ordered so that k1(r) < k2(r). Each rth prototypical pair k1(r) and k2(r) is merged only if the following condition is met: d(r) < βD, where D is the WFEV of all of the distances {d(r): r = 1,...,R} and β satisfies 0.0 < β < 1.0 (β = 0.5 is a good first trial value). An optional criterion is: the ball of radius (σ_k1(r) + σ_k2(r))/2 centered on the midpoint vector y = ½[c(k1(r)) + c(k2(r))] between the cluster centers contains at least 25% of the vectors in one cluster and at least 15% of those of the other (these default percentage values can be changed).
Algorithm to Merge Clusters (merge())

Step 1: r = 1;
        for k = 1 to K-1 do                            /* For every one of the R = K(K-1)/2 unique */
          for kk = k+1 to K do                         /* pairs of prototypes, get the distance */
            d[r] = ||c(k) - c(kk)||;                   /* between them and record the indices of */
            k1[r] = k; k2[r] = kk;                     /* the prototype pair (k1[r], k2[r]). */
            r = r + 1;                                 /* Increment r (rth pair), 1 ≤ r ≤ R. */
        for k = 1 to K do Merge[k] = 0;                /* Zero out the merge indices. */

Step 2: D = MWFEV({d[r]});                             /* Find the MWFEV of the d[r] values. */

Step 3: r* = 1;
        for r = 2 to K(K-1)/2 do                       /* Over all inter-prototype distances, */
          if (d[r] < d[r*]) then r* = r;               /* find the pair with least distance for the merge test. */

[Step 4: K1 = 0; K2 = 0;                               /* Optional: zero out counters 1 and 2. */
        y = ½[c(k1[r*]) + c(k2[r*])];                  /* Compute the midway vector between the prototypes. */
        ρ = σ_k1[r*] + σ_k2[r*];                       /* Compute the radius ρ for the ball centered at y. */
        for q = 1 to Q do                              /* Find the feature vectors in the overlapping ball: */
          if (||x(q) - y|| < ρ) then                   /* x(q) must be close to y and also be in either */
            if (clust[q] = k1[r*]) then K1 = K1 + 1;   /* cluster k1[r*] */
            if (clust[q] = k2[r*]) then K2 = K2 + 1;   /* or cluster k2[r*]. */
        P1 = K1/count[k1[r*]];                         /* Compute the proportion of C_k1[r*] points in the ball */
        P2 = K2/count[k2[r*]];]                        /* and the proportion of C_k2[r*] points in the ball (optional). */

Step 5: if (d[r*] < βD)                                /* Test the merging criterion (try β = 0.5 first). */
          [and optionally P1 ≥ 0.25 and P2 ≥ 0.15]     /* Optional overlap test with the default percentages. */
        then merge cluster k2[r*] into cluster k1[r*];
             Merge[k2[r*]] = k1[r*];                   /* Record the merge. */
             eliminate(0);                             /* Eliminate empty clusters (p = 0). */
        if (no pair was merged) then stop;             /* Test for stopping (no previous merging). */
        else go to Step 1;                             /* If no stopping, then repeat the merge test. */
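A sketch of the core merge test (Steps 1 through 3 and 5, without the optional ball criterion of Step 4) in Python/NumPy, reusing the mwfev sketch from Section 3; β and the array-based bookkeeping are assumptions.

import numpy as np
from itertools import combinations

def merge_pass(centers, labels, beta=0.5):
    """Merge the closest pair of prototypes if their distance is below beta times the
    MWFEV of all pairwise prototype distances; returns (centers, labels, merged flag)."""
    pairs = list(combinations(range(len(centers)), 2))
    d = np.array([np.linalg.norm(centers[k1] - centers[k2]) for k1, k2 in pairs])
    D = mwfev(d)                                  # Step 2: MWFEV of the R = K(K-1)/2 distances
    r_star = int(d.argmin())                      # Step 3: closest pair
    k1, k2 = pairs[r_star]
    if d[r_star] < beta * D:                      # Step 5: merge criterion
        labels = np.where(labels == k2, k1, labels)           # fold cluster k2 into k1
        labels = np.where(labels > k2, labels - 1, labels)    # re-index the clusters above k2
        centers = np.delete(centers, k2, axis=0)
        return centers, labels, True
    return centers, labels, False

A full run would recompute the merged cluster's prototype (for example as the MWFEV of its members) and call merge_pass repeatedly until no pair merges or the desired K is reached.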
This algorithm merges successfully without the optional criteria in Steps 4 and 5. When cluster k2(r) is merged with cluster k1(r) we record that as Merge(k2(r)) = k1(r). It is always true that k1[r] < k2[r]. A variation is to save the prototype of the merged cluster that was eliminated (see Step 5). If new feature vectors are to be classified, they can be checked for closeness to each current prototype c(k), but can also be checked against the saved prototypes of the form z(s). If a feature vector is closest to a saved prototype z(s), then the index Merge(s) yields the next cluster to which it belongs. This means that the index Merge(Merge(s)) is to be checked, and so forth. This allows a string of small clusters to be merged to form a longer one, provided that the stringent merge criteria are met, with a final shape that can be different from that of a normed ball.
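A sketch of the chained lookup just described, where Merge maps an eliminated cluster index to the cluster it was merged into; representing it as a dictionary with missing entries meaning "never merged" is an illustrative assumption.

def final_cluster(k, merge):
    """Follow Merge(k), Merge(Merge(k)), ... until a surviving cluster is reached."""
    while k in merge:
        k = merge[k]
    return k

# Example: cluster 7 was merged into 4, and 4 was later merged into 2.
merge = {7: 4, 4: 2}
assert final_cluster(7, merge) == 2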
5. Computer Results
5.1 Tests on Data. To show the effects of ordering on the usual (Forgy [10]) k-means algorithm we used the data file testfz1a.dta shown in Figure 2 and the file testfz1b.dta formed from it by exchanging feature vector number 1 with feature vector number 8 in the ordering. With K = 5 the different results are shown in Figures 3 (testfz1a.dta) and 4 (testfz1b.dta). Other changes in the ordering caused yet other clusterings. The clusterings shown are not intuitive. Our merging algorithm applied to the clusters in both Figures 3 and 4 with β = 0.5 yielded the identical results shown in Figure 5, which agrees more with our intuition of the classes.
Figures 6 and 7 show the respective results of our new fuzzy clustering and fuzzy merging algorithms on the file testfz1a.dta. The results were the same for testfz1b.dta. The modified Xie-Beni validity values were also identical for the two files. This, and other tests, showed that the strategy of drawing a large number of uniformly distributed prototypes and then thinning them out to achieve a smaller number of more sparsely but uniformly distributed prototypes prevented the order of inputs from affecting the final clustering in these cases. The final merged clusters with K = 3 were also intuitive. The initial K was 59, which was reduced to 8 by eliminating empty clusters and then to 5 (very intuitive as seen in Figure 6) by elimination with p = 1. K was reduced to 3 by merging with β = 0.5. Table 2 summarizes the results.
Table 2. Clustering/Eliminating/Merging via the New K-means Algorithm.

  Data File      No. Clusters K   Clusters                 Validity   Comments
  testfz1a.dta   5                {3, 12}                  9.7561     Initial K = 59, deleted to K = 8,
                                  {8, 14, 15}                         eliminated with p = 1 to K = 5
                                  {5, 7, 13}                          (no merging).
                                  {2, 9, 11}
                                  {1, 4, 6, 10}
  testfz1a.dta   3                {2, 5, 7, 9, 11, 13}     9.9123     Initial K = 59, deleted to K = 8, and
                                  {3, 8, 12, 14, 15}                  with p = 1 eliminated to K = 5, then
                                  {1, 4, 6, 10}                       merged to K = 3 (β = 0.5).
The second data set is Anderson's [1] well-known set of 150 iris feature vectors, which are known to be noisy and nonseparable [12]. The data are labeled as K = 3 classes that represent 3 subspecies (Setosa, Versicolor and Virginica) of the iris species. The given sample contains 50 labeled feature vectors from each class for a total of Q = 150. The feature vectors have N = 4 features: i) sepal length; ii) sepal width; iii) petal length; and iv) petal width. Table 3 presents the results of our complete fuzzy clustering and fuzzy merging algorithm. Our modified Xie-Beni clustering validity measure shows that K = 2 classes are best, which coincides with the PC and CE validity values found by Lin and Lee [13]. Lin and Lee also used the PE, which gave the best value for K = 3 rather than K = 2. This yields 3 votes to 1 that K = 2 is the best number of clusters for the iris data set. We note that in each clustering of Table 3 the Setosa class contained the correct number of 50 feature vectors and had the same weighted fuzzy variance (0.192). The second class therefore has two subclasses that are not well separated. Initially, our value for K was 150, but after deletions (due to close prototypes) it was reduced to 57. K was further reduced to 16, 9 and then 7 by elimination with p = 2, 6 and 10, respectively. After 40 fuzzy clustering iterations, the fuzzy merging with β = 0.5 reduced K to 4 as shown in Table 3. This was followed by further fuzzy merging with β = 0.66 to yield K = 3 clusters, and then with β = 0.8 to yield K = 2. The modified Xie-Beni validities are also shown in Table 3.
Table 3. Fuzzy Clustering/Merging the Standardized Iris Data.

  Data File     No. Clusters K   Validity   Cluster Sizes    WFV σ_k
  iris150.dta   4                0.5766     50, 40, 29, 31   0.192, 0.174, 0.182, 0.227
                3                1.1520     50, 51, 49       0.192, 0.248, 0.215
                2                5.8216     50, 100          0.192, 0.323
Our third data set is taken from Wolberg and Mangasarian [23] at the University of Wisconsin Medical School. We randomly selected 200 of the more than 500 feature vectors. As usual, we standardized each of the 30 features separately to be in [0,1]. The vectors are labeled for two classes that we take to be benign and malignant. One label is attached to 121 vectors while the other is attached to 79 vectors. Our radial basis functional link net (neural network) "learned" the two groups of 121 and 79 vectors "correctly". However, the vectors do not naturally fall precisely into these groups because of noise and/or mislabeling (a danger in supervised learning with assigned labels). Table 4 shows that K = 2 (with sizes of 128 and 72) gives the best clustering. The original K was 272, which was not reduced by deletion of close clusters with the computed threshold τ (the N = 30 features provide a large cube). Eliminations with p = 1, 7 and 11, followed by fuzzy clustering and then fuzzy mergings with β = 0.5, 0.66, 0.8 and 0.88, brought K down to 5, 4, 3 and 2, respectively. We conclude from the low modified Xie-Beni validity values that the data contains extraneous noise and does not separate into compact clusters.
Table 4. Results on the Wolberg-Mangasarian Wisconsin Breast Cancer Data.

  Data File     No. Clusters K   Validity   Cluster Sizes
  wbcd200.dta   5                0.1050     30, 53, 69, 22, 26
                4                0.2575     35, 117, 22, 26
                3                0.4114     54, 109, 37
                2                1.1442     128, 72
The fourth data set was from geological data that is labeled for K = 2 classes. Each of the Q = 70 feature vectors has N = 4 features. The data were noisy and the labels were estimates by humans that gave 35 vectors to one class and 35 to the other. Our neural network "learned" the labeled classes "correctly." Table 5 shows some of the results. The initial K was 98, but this was reduced to 33 by deletion of close prototypes and then reduced to 13 and 8 by elimination with p = 1 and p = 4, respectively. After 40 fuzzy clustering iterations, the fuzzy mergings with β = 0.5, 0.66, 0.8 and 0.88 brought it down to the respective K values of 7, 5, 3 and 2. Other paths through deleting, eliminating and merging also yielded the same results.
Table 5. Results on the Geological Labeled Data.

  Data File    No. Clusters K   Validity   Cluster Sizes
  geo70.dta    5                0.6267     14, 26, 7, 16, 7
               3                1.0112     17, 13, 40
               2                2.1545     51, 19
6. Conclusions
Of the many methods available for clustering feature vectors, the k-means algorithm is the quickest and simplest, but it is dependent upon the order of the feature vectors and requires K to be given. We have investigated an approach that makes the clustering independent of the ordering. Our method of deletion to thin out a dense initial set of prototypes leaves fewer but still uniformly distributed prototypes for the initial clusters. Our method of elimination is a continuing process of removing clusters that are either empty or very small and reassigning their vectors to the remaining larger clusters. Before merging, we apply a few tens of iterations of our simplified fuzzy clustering that moves centers into regions of more densely situated feature vectors. These centers more accurately represent the clusters.

From these prototypical centers, we do our fuzzy merging. This is accomplished by first finding the modified weighted fuzzy expected value (MWFEV) of the distances between the fuzzy centers and using a user-selected value β times this MWFEV as the threshold. If the distance between two fuzzy centers is less than the threshold, then the clusters are merged into the first of the two (the pairs of clusters are ordered). Optional criteria can be used in the decision making to merge or not by forming a temporary center at the midpoint of the line between the pair of centers being considered, taking the summed mean-squared error of the two clusters as radius and checking to see whether sufficiently many points from each cluster fall into this new overlapping ball. If so, and if the centers are sufficiently close, then the merging of the two clusters is done.
In our experiments, we have had good success when deleting (thinning) using the threshold τ computed by the fuzzy clustering algorithm. When eliminating empty and small clusters, we find that p = 1 gives a good first pass because there are so many clusters that most are either empty or very small, and we do not want to eliminate too many clusters that could have prototypes in good locations. If there are still too many clusters, then we can use p = 2 or greater after observing the cluster sizes on the screen. The merging should use a value of β = 0.5 on the first try to see what the result is on the screen. The merging can be continued until the modified Xie-Beni validity stops increasing or K = 2 is reached.
The user can learn from and employ the information output to the screen concerning the intermediate results and the modified Xie-Beni validity values so that a particular data set can be processed multiple times to achieve a very good result. This makes the combined fuzzy clustering and fuzzy merging algorithm with user interaction very useful for analyzing data.
References

[1] E. Anderson, "The iris of the Gaspe peninsula," Bulletin American Iris Society, vol. 59, 2-5, 1935.
[2] G. H. Ball and D. J. Hall, "ISODATA: an iterative method of multivariate analysis and pattern classification," Proc. IEEE Int. Communications Conf., 1966.
[3] J. C. Bezdek, Fuzzy Mathematics in Pattern Classification, Ph.D. thesis, Center for Applied Mathematics, Cornell University, Ithaca, N.Y., 1973.
[4] J. C. Bezdek, "Cluster validity with fuzzy sets," J. Cybernetics, vol. 3, no. 3, 58-73, 1973.
[5] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum Press, N.Y., 1981.
[6] S. L. Chiu, "Fuzzy model identification based on cluster estimation," J. Intelligent and Fuzzy Systems, vol. 2, no. 3, 1994.
[7] T. M. Cover, "Estimates by the nearest neighbor rule," IEEE Trans. Information Theory, vol. 14, no. 1, 50-55, 1968.
[8] F. L. Chung and T. Lee, "Fuzzy competitive learning," Neural Networks, vol. 7, no. 3, 539-552, 1994.
[9] J. C. Dunn, "A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters," J. Cybernetics, vol. 3, no. 3, 32-57, 1973.
[10] E. Forgy, "Cluster analysis of multivariate data: efficiency versus interpretability of classifications," Biometrics, vol. 21, 768, 1965.
[11] K. S. Fu, "Introduction," in Applications of Pattern Recognition, edited by K. S. Fu, CRC Press, Boca Raton, 1982.
[12] J. M. Keller, M. R. Gray and J. A. Givens, Jr., "A fuzzy k-nearest neighbor algorithm," IEEE Trans. Systems, Man and Cybernetics, vol. 15, no. 4, 580-585, 1985.
[13] C. T. Lin and C. S. George Lee, Neural Fuzzy Systems, Prentice-Hall, Upper Saddle River, NJ, 1995.
[14] C. Looney, Pattern Recognition Using Neural Networks, Oxford University Press, N.Y., 1997.
[15] J. B. MacQueen, "Some methods for classification and analysis of multivariate observations," Proc. 5th Berkeley Symp. on Probability and Statistics, University of California Press, Berkeley, 281-297, 1967.
[16] N. R. Pal and J. C. Bezdek, "On cluster validity for the c-means model," IEEE Trans. Fuzzy Systems, vol. 3, no. 3, 1995.
[17] D. C. Park and I. Dagher, "Gradient based fuzzy c-means (GBFCM) algorithm," Proc. IEEE Int. Conf. Neural Networks, 1626-1631, 1994.
[18] J. M. Peña, J. A. Lozano and P. Larrañaga, "An empirical comparison of four initialization methods for the K-means algorithm," Pattern Recognition Letters, vol. 20, 1027-1040, Jul. 1999.
[19] M. Schneider and M. Craig, "On the use of fuzzy sets in histogram equalization," Fuzzy Sets and Systems, vol. 45, 271-278, 1992.
[20] S. Z. Selim and M. A. Ismail, "K-means type algorithms: a generalized convergence theorem and characterization of local optimality," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 6, 81-87, 1984.
[21] M. Snarey, N. K. Terrett, P. Willet and D. J. Wilton, "Comparison of algorithms for dissimilarity-based compound selection," J. Mol. Graphics and Modelling, vol. 15, 372-385, 1997.
[22] M. P. Windham, "Cluster validity for the fuzzy c-means clustering algorithm," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 4, no. 4, 357-363, 1982.
[23] W. H. Wolberg and O. L. Mangasarian, "Multisurface method of pattern separation for medical diagnosis applied to breast cytology," Proc. Nat. Academy of Sciences U.S.A., vol. 87, 9193-9196, December 1990.
[24] X. L. Xie and G. Beni, "A validity measure for fuzzy clustering," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 13, no. 8, 841-847, 1991.
[25] R. R. Yager and D. P. Filev, "Approximate clustering via the mountain method," IEEE Trans. Systems, Man and Cybernetics, vol. 24, 1279-1284, 1994.
Figure 1. MWFEV vs. mean and median.
Figure 2. Original sample.
Figure 3. K-means, original order, K = 5.
Figure 4. K-means, 1 and 8 exchanged, K = 5.
Figure 5. Fuzzy merging of k-means clusters.
Figure 6. New method, thinned to K=5.
Figure 7. Fuzzy clustering/merging.