A New Clustering Algorithm Based on Cluster Validity Indices

Minho Kim and R.S. Ramakrishna

Department of Information and Communications, GIST
1 Oryong-dong, Buk-gu, Gwangju 500-712, Republic of Korea
{mhkim,rsr}@gist.ac.kr
Abstract. This paper addresses two of the most important issues in cluster analysis. The first issue pertains to the problem of deciding whether two objects can be included in the same cluster. We propose a new similarity decision methodology that builds on the idea of a cluster validity index. The proposed methodology replaces a qualitative cluster recognition process with a quantitative, comparison-based decision process. It obviates the need for complex parameters, which are a primary requirement in most clustering algorithms. It plays a key role in our new validation-based clustering algorithm, which includes a random clustering part and a complete clustering part. The second issue refers to the problem of determining the optimal number of clusters. The algorithm addresses this question through complete clustering, which also utilizes the proposed similarity decision methodology. Experimental results are provided to demonstrate the effectiveness and efficiency of the proposed algorithm.
1 Introduction

The clustering operation attempts to partition a set of objects into several subsets. The idea is that the objects in each subset are indistinguishable under some criterion of similarity [1], [5], [6], [9]. The core problem in clustering is similarity decision: we look for similarity while deciding whether or not two objects may be included in the same group, i.e., cluster. For the purpose of measuring similarity (or dissimilarity) between two objects, the concept of distance between them is most widely used.

The most common way to arrive at a similarity decision employs thresholding. The easiest way to decide the similarity of two objects is to compare the distance between them with a user-specified threshold value: if the distance is less than the threshold, they can be included in the same cluster; otherwise, they should be placed in two different clusters [9]. Agglomerative/divisive hierarchical clustering algorithms [6] also employ a similar method. In the hierarchical structure resulting from agglomeration/division (for example, a dendrogram), each independent subgraph below a user-specified distance is a cluster, and thus the objects below that subgraph are in the same cluster and can be said to be similar. In partitional clustering algorithms [6], the similarity decision may be affected by the predetermined number
k of clusters. It will also be influenced by the threshold. The threshold and threshold-related parameters play a very important role in cluster analysis. They are usually determined by trial and error, or through very complex processes, and these methodologies tend to be computationally quite involved.

The focus in cluster analysis has shifted to cluster validity indices in recent times. The main objective is to determine the optimal number of clusters [1], [3], [4], [7], [8], [10]. However, the indices are not specifically targeted at similarity decision per se. In this paper, we address the relationship between similarity decision and the cluster validity index, and also propose a validation-based clustering algorithm (VALCLU) centered around the (extracted) relationship. VALCLU consists of two parts: random clustering and complete clustering. The former builds a clustering pool in a random fashion and then iteratively decides the similarity of two objects. The optimal number of clusters is found by complete clustering. We also present a new cluster validity index, VI, which can be used in complete clustering. VALCLU finds the optimal number of clusters without requiring painstakingly determined complex parameters for (similarity) decision making.

The rest of the paper is organized as follows. Section 2 discusses similarity decision. In Section 3, the proposed VALCLU algorithm is described. Experimental results and conclusions are given in Sections 4 and 5, respectively.
2 Similarity Decision

How can we decide the similarity of objects? Let us discuss the problem through an example (see Fig. 1).
Fig. 1. Recognition of similarity between two white objects.
Under what condition(s) can we say that the two objects within the oval are similar and, hence, can be included in the same cluster? To begin with, assume that there are only two white objects within the oval, and a gray object and a black object outside it. It might be difficult to conclude that the two white objects are dissimilar to the gray object and that only the two white objects can be grouped together as similar. Let us consider another situation wherein there are two white objects and a black object (instead of the gray object). It is now possible to decide that the two white objects belong to the same cluster while the black object does not. These examples indicate that an important factor affecting the recognition of a cluster is relativity (relative similarity/dissimilarity). That is, if two objects (the two white objects in Fig. 1) are relatively similar (are located close to each other) compared with the other object (the black one in Fig. 1), one can easily separate the objects.
What kind of adjustment in our recognition affects the decision (of similarity)? The intra-cluster distance of a new cluster generated by merging two objects is greater than those of the two independent objects. In other words, we have to sacrifice compactness when merging two objects. However, merging two objects makes it easier to discriminate objects from one another, since they are thought of as a single (abstract) object (cluster). That is, we gain 'separability' of objects (clusters). This discussion is summarized below.

Observation 1. Two objects may be comprehended as being similar (and hence may be included in the same group of objects) if we sacrifice a (relatively small amount of) compactness. In the process, we gain separability by merging the two objects.

Obs. 1 provides a qualitative methodology for testing similarity. However, we note that its rationale is the same as that of a cluster validity index, which is a quantitative parameter. Therefore, adopting a cluster validity index in Obs. 1 leads to a new quantitative methodology for similarity decision, as outlined below.

Definition 1. Two objects are similar (and hence can be included in the same group of objects) if the corresponding value of a cluster validity index is smaller after merging (them) than before merging. (Here we assume that the optimal value of the validity index is its minimum.)

By virtue of the above definition, we can make similarity decisions by comparing values of the cluster validity index before and after merging. This is the key idea in the VALCLU algorithm we propose in the next section.
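As an illustration, the decision rule of Definition 1 can be sketched in Python. This is only a minimal sketch, not the authors' implementation; `validity_index` stands for any cluster validity index evaluated over the current set of clusters (each cluster being an array of objects), and its minimum is assumed to be optimal.

```python
import numpy as np

def try_merge(clusters, i, j, validity_index):
    """Decide whether clusters i and j are similar (Definition 1):
    commit the merge only if the validity index improves."""
    before = validity_index(clusters)
    candidate = [c for k, c in enumerate(clusters) if k not in (i, j)]
    candidate.append(np.vstack([clusters[i], clusters[j]]))   # virtual merge
    after = validity_index(candidate)
    # smaller index value is assumed optimal; accept the merge only if it decreases
    return (candidate, True) if after < before else (clusters, False)
```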
3 Validation-Based Clustering Algorithm

For purposes of efficiency, we divide the proposed algorithm into two parts according to the way the clusters to be tested are chosen. The random clustering part uses only a small number of clusters, while the complete clustering part uses all the clusters. We note that in Section 2 the unit of the similarity test is an object, whereas in the validation-based clustering algorithm the unit is a cluster, i.e., a group of objects.

3.1 Random Clustering

All the existing cluster-validity-index-based clustering algorithms make use of all the clusters at each level. Thus, the larger the number of clusters at a level, the higher is the complexity of computing the cluster validity index. Moreover, we encounter a large number of singleton clusters if a level is close to the initial state (of cluster analysis). However, in order to make similarity decisions by means of the cluster validity index comparison proposed in Section 2, we need two clusters for merging and at least one cluster for comparison.
As for choosing the two clusters to be merged, we may choose the pair of clusters with minimum distance over the whole set of clusters (i.e., two clusters with global minimum distance). But this is computationally very expensive, and calculating the cluster validity index turns out to be expensive as well, for the same reason. In order to resolve these problems, we propose random clustering, which randomly forms a clustering pool (CP) of size |CP| from the set of all the clusters. Similarity decision as proposed in Section 2 follows thereafter. Ideally, |CP| ≥ 3, from the above considerations. The algorithm is outlined below; the details of EXIT1 and EXIT2 are described later, in Section 3.3.

1. Randomly compose a clustering pool (CP);
2. Calculate index(|CP|);
3. Virtually merge the 2 clusters with minimum distance in CP;
4. Calculate index(|CP| - 1);
5. IF index(|CP| - 1) is optimal THEN
       merge the 2 clusters;
       IF EXIT1 condition THEN GOTO complete clustering;
   ELSE IF EXIT2 condition THEN GOTO complete clustering;
6. GOTO step 1.

3.2 Cluster Validity Indices

Cluster validity indices can be classified into two categories. Ratio-type indices are characterized by the ratio of intra-cluster distance to inter-cluster distance. Summation-type indices are defined as the sum of intra-cluster distance and inter-cluster distance with appropriate weighting factors. However, a ratio-type index cannot be used in the initial state of random clustering, which mainly comprises singleton clusters: the intra-cluster distance of a cluster with one member is 0, so the index value is 0 or ∞ and similarity comparisons are not meaningful. Therefore, only summation-type indices can be used for random clustering. Recently proposed summation-type indices include SD [4] and vsv [7]. (Due to space limitations, details have been omitted.)

Since complete clustering can use ratio-type indices (unlike random clustering), we propose a new cluster validity index in this category. This validity index, VI, is defined in eqn. (1):

VI(n_c) = \frac{1}{n_c} \sum_{i=1}^{n_c} \frac{\max_{k=1,\dots,n_c,\, k \neq i} \{ S_i + S_k \}}{\min_{l=1,\dots,n_c,\, l \neq i} \{ d_{i,l} \}},   (1)

where

S_i = \frac{1}{n_i} \sum_{x \in X_i} \lVert c_i - x \rVert,   d_{i,l} = \lVert c_i - c_l \rVert.

In the equations above, n_c stands for the number of clusters, X_i for the set of objects in cluster i, n_i for the number of objects in X_i, and c_i for a representative of cluster i. The optimal value of the index VI is its minimum value.
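A possible Python rendering of eqn. (1) is sketched below. It is only a sketch under the assumption that each representative c_i is the centroid of its members; the paper itself only speaks of "a representative of cluster i".

```python
import numpy as np

def vi_index(clusters):
    """Compute the VI index of eqn. (1) for a list of clusters,
    each given as an (n_i x d) array of objects; requires n_c >= 2."""
    centers = [c.mean(axis=0) for c in clusters]             # representatives (assumed centroids)
    S = [np.linalg.norm(c - m, axis=1).mean() for c, m in zip(clusters, centers)]
    nc = len(clusters)
    total = 0.0
    for i in range(nc):
        others = [k for k in range(nc) if k != i]
        max_compact = max(S[i] + S[k] for k in others)        # max_{k != i} {S_i + S_k}
        min_sep = min(np.linalg.norm(centers[i] - centers[l]) for l in others)  # min_{l != i} d_{i,l}
        total += max_compact / min_sep
    return total / nc
```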
The proposed index VI can be explained by focusing on three features. First, the term \min_{l=1,\dots,n_c,\, l \neq i}\{d_{i,l}\} in eqn. (1) points to the clusters that need to be merged: a very small value of this term implies a very high VI value. Second, a large \max_{k=1,\dots,n_c,\, k \neq i}\{S_i + S_k\} implies that unnecessary merging has taken place. Finally, the averaging in eqn. (1) combines the total information contained in the current state of the cluster structure and thereby imparts robustness.

3.3 Complete Clustering

Random clustering using the clustering pool has two weaknesses. First, it does not use the full set of clusters when arriving at similarity decisions by computing the cluster validity index. This may lead to wrong decisions and may fail to find the optimal n_c. In order to address this problem, we need to make use of the whole set of clusters as the clustering pool; that is, switching to an algorithm with |CP| = n_c is imperative. The switching point can be taken as n_c = √N, as per a well-known rule of thumb, where N is the total number of data objects. This is the EXIT1 condition of Section 3.1, i.e., exit if n_c < √N. Another weakness of random clustering is that it can be trapped in a local minimum. This can be avoided by merging repeatedly until n_c = 1, irrespective of the similarity decision, and looking for the cluster structure with the optimal value of the cluster validity index. An intuitive way to detect entrapment in a local minimum is to check whether merging has failed repeatedly over a certain number (e.g., $_{n_c}C_{|CP|}$) of tests. This is the EXIT2 condition of Section 3.1.

We now propose the second part of the validation-based clustering algorithm, called complete clustering, by taking the above points into account. The complete clustering algorithm is given below.

[Initialize] nc = # clusters; index_optimal = MAX_VALUE;
WHILE ( nc >= 2 ) {
    Calculate index(nc);              // (|CP| == nc)
    IF index(nc) is better than index_optimal THEN
        index_optimal = index(nc);
        Store the current configuration of clusters;
    Merge two clusters with minimum distance;
    nc--;
}

VALCLU is similar to agglomerative hierarchical clustering algorithms. However, one of the major drawbacks of the latter is the absence of refinement. VALCLU also suffers from this drawback. In order to address this problem, we refine the results of random clustering and complete clustering through the well-known k-means algorithm.
As is well known, the k-means algorithm generally yields acceptable results, given proper initial representatives and the number of clusters, k. Random clustering and complete clustering satisfy both these requirements.
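The complete clustering loop above can be sketched as follows. Again, this is a sketch rather than the original code: the two closest clusters are identified here by centroid distance (an assumption), and the configuration with the best (smallest) index value is kept, e.g., using `vi_index` from Section 3.2.

```python
import numpy as np

def complete_clustering(clusters, validity_index):
    """Merge down to nc = 1, keeping the configuration with the best
    (assumed minimal) index value; returns that configuration."""
    best_value, best_clusters = float("inf"), list(clusters)
    clusters = list(clusters)
    while len(clusters) >= 2:
        value = validity_index(clusters)                     # |CP| == nc
        if value < best_value:
            best_value, best_clusters = value, list(clusters)
        # merge the two clusters whose centroids are closest (assumed distance)
        centers = [c.mean(axis=0) for c in clusters]
        pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda p: np.linalg.norm(centers[p[0]] - centers[p[1]]))
        merged = np.vstack([clusters[i], clusters[j]])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return best_clusters
```

The stored best configuration then supplies both the number of clusters and the initial representatives for the k-means refinement step.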
4 Experimental Results

For the purpose of evaluating the effectiveness of the proposed VALCLU algorithm, five synthetic datasets and one real-world dataset were used. The synthetic datasets are shown in Fig. 2. For the real-world test we used the Iris dataset [2].
Fig. 2. Synthetic datasets: (a) Dataset 1, (b) Dataset 2, (c) Dataset 3, (d) Dataset 4, (e) Dataset 5.
To begin with, for the purpose of evaluating the performance of random clustering, we measured the average number of similarity decision tests for each n_c until two clusters were successfully merged (n1). If we look for two clusters with global minimum distance (dmin), as in agglomerative hierarchical clustering, the number of distance computations for each n_c is n_c(n_c - 1)/2. Thus, the total number of evaluations over the range √N ≤ n_c ≤ N (the same range as that of random clustering) is n2 = (N + 1)N(N - 1)/6 - (√N + 1)√N(√N - 1)/6. On the other hand, for a clustering pool in random clustering, we need to perform n3 = |CP|(|CP| - 1)/2 + (|CP| - 1)(|CP| - 2)/2 = (|CP| - 1)^2 evaluations. To compare random clustering fairly with clustering using the global dmin, we look at n1 and n4 = n2/(n3(N - √N)). Table 1 shows the results. Since any value greater than or equal to 3 can be selected for |CP|, we arbitrarily selected |CP| = 15.

In Table 1, it can be seen that the value n1 for the (random) clustering pool is much smaller than the value n4 for clustering using the global dmin. That is, random clustering is much more efficient than clustering using the entire set of clusters, as in agglomerative hierarchical clustering. Comparing vsv with SD, the latter requires fewer tests than the former in random clustering. In addition, vsv and SD have almost the same clustering error rate (0.026 and 0.024 on average, respectively), where the clustering error rate is the rate at which member data differ from the majority class of the same cluster. Therefore, SD is seen to be the more efficient index for random clustering.

We will now examine the performance of complete clustering in computing the optimal n_c. Here, we adopt the recently proposed index I [8] and the index VI proposed in Section 3.2, as well as the indices used in random clustering. Fig. 3 shows the findings. Note that in Fig. 3, the optimal n_c is not provided for the real data, i.e., the Iris dataset.
Table 1. Comparisons of n1, n4, and N for vsv and SD with respect to various datasets in random clustering.

Dataset      Index    n1        N       n4
Dataset 1    vsv      28.215    500     222.34
             SD        1.033
Dataset 2    vsv      69.673    800     563.93
             SD        1.073
Dataset 3    vsv      34.663    550     268.43
             SD        1.000
Dataset 4    vsv      33.047    1,300   1,477.97
             SD        1.004
Dataset 5    vsv      30.255    1,300   1,477.97
             SD        1.014
Real data    vsv      13.225    150     20.78
             SD        1.040
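The n4 values in Table 1 can be reproduced from the formulas above. The following short check assumes |CP| = 15 and uses the integer part of √N; both are assumptions for illustration, but the resulting figures are in line with the table.

```python
import math

def n4(N, cp=15):
    """Relative cost of global-minimum-distance clustering vs. random clustering."""
    r = math.isqrt(N)                                     # floor of sqrt(N)
    n2 = (N + 1) * N * (N - 1) / 6 - (r + 1) * r * (r - 1) / 6
    n3 = (cp - 1) ** 2
    return n2 / (n3 * (N - r))

print(round(n4(500), 2), round(n4(800), 2))               # ~222.3 and ~563.9, cf. Table 1
```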
Fig. 3. Number of clusters found through complete clustering using various cluster validity indices.
The reason we did not show the optimal n_c for this dataset is that it is debatable [1]. Therefore, in this paper we work with n_c = 2 as well as n_c = 3. From Fig. 3, we see that the index VI matches the optimal n_c almost perfectly, while the others show some mismatches. In other words, the index VI gives the best result in complete clustering.

Table 2. Error rate comparisons between VALCLU and K-means.

           Dataset 1   Dataset 2   Dataset 3   Dataset 4   Dataset 5   Iris
VALCLU     0.0000      0.0000      0.0000      0.0062      0.0135      0.0000
K-means    0.1400      0.0000      0.1233      0.1084      0.1364      0.0200
In Table 2, error rates of the clustering results are provided to demonstrate the labeling performance of VALCLU. The results for the K-means algorithm are included for comparison. It is well known that the clustering result of the K-means algorithm depends on the initialization of its seeds and on the number of clusters. In our evaluation, seeds were randomly initialized. Even though we provide the exact number of clusters, the algorithm yields different error rates on each run. Thus, the error rates of the K-means algorithm (the second row of Table 2) are averages over 30 runs. From Table 2, we can see that the clustering quality of VALCLU surpasses that of the K-means algorithm, and that VALCLU yields identical results on each run.
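For reference, the clustering error rate used in Tables 1 and 2 (the fraction of members that differ from the majority class of their cluster) could be computed as in the following sketch; the input format and labels are purely illustrative.

```python
from collections import Counter

def clustering_error_rate(labels_per_cluster):
    """Fraction of objects whose true class differs from the majority class
    of their cluster; input is a list of per-cluster lists of true labels."""
    total = sum(len(labels) for labels in labels_per_cluster)
    mismatches = sum(len(labels) - Counter(labels).most_common(1)[0][1]
                     for labels in labels_per_cluster)
    return mismatches / total

# e.g. two clusters with true labels: one mislabeled object out of five
print(clustering_error_rate([["a", "a", "b"], ["b", "b"]]))   # 0.2
```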
5 Conclusions

In this paper we have proposed a validation-based clustering algorithm, called VALCLU, that utilizes cluster validity indices, and evaluated its effectiveness and
efficiency. The methodology proposed for deciding the similarity between two clusters (or objects) is based on cluster validity indices. It plays a key role in the two main parts of the validation-based clustering algorithm, viz., random clustering and complete clustering. It can effectively determine whether two clusters can be merged into one cluster through a (quantitative) change in the value of the cluster validity index. It also determines the optimal number of clusters in complete clustering. Experimental results show that random clustering requires much less computation than agglomerative hierarchical clustering. As for similarity decision, several cluster validity indices were evaluated. Experimental results indicate that the index SD is the most efficient for random clustering and that the index VI proposed in this paper shows the best results among the various indices for complete clustering. Also, VALCLU performs better than the well-known K-means algorithm. Further work on various aspects of cluster validity indices is in progress.
Acknowledgement

This work was supported by the Ministry of Education (MOE) through the Brain Korea 21 (BK21) project.
References

1. Bezdek, J.C., Pal, N.R.: Some new indexes of cluster validity. IEEE Trans. Sys., Man, and Cyber., Part B: Cyber. 28(3) (1998) 301-315
2. Blake, C.L., Merz, C.J.: UCI Repository of machine learning databases (http://www.ics.uci.edu/~mlearn/MLRepository.html). Univ. of California, Irvine, Dept. of Info. & Comp. Sci. (1998)
3. Davies, D.L., Bouldin, D.W.: A cluster separation measure. IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI) 1(2) (1979) 224-227
4. Halkidi, M., Vazirgiannis, M.: Quality scheme assessment in the clustering process. European Conf. Principles and Practice of Knowledge Discovery in Databases (PKDD). Lecture Notes in Artificial Intelligence, Vol. 1910 (2000) 265-276
5. Han, J., Kamber, M.: Data Mining: Concepts and Techniques. Morgan Kaufmann (2001)
6. Jain, A.K., Murty, M.N., Flynn, P.J.: Data clustering: a review. ACM Computing Surveys 31(3) (1999) 264-323
7. Kim, D.-J., Park, Y.-W., Park, D.-J.: A novel validity index for determination of the optimal number of clusters. IEICE Trans. Inf. & Syst. E84-D(2) (2001) 281-285
8. Maulik, U., Bandyopadhyay, S.: Performance evaluation of some clustering algorithms and validity indices. IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI) 24(12) (2002) 1650-1654
9. Monmarché, N., Slimane, M., Venturini, G.: On improving clustering in numerical databases with artificial ants. European Conf. Advances in Artificial Life (ECAL). Lecture Notes in Artificial Intelligence, Vol. 1974 (1999) 626-635
10. Schwarz, G.: Estimating the dimension of a model. Annals of Statistics 6(2) (1978) 461-464