Determining the Number of Clusters with Rate-Distortion Curve Modeling

Alexander Kolesnikov¹,* and Elena Trichina²

¹ Arbonaut Ltd., Joensuu, Finland
² Cyber Services and Technologies, Nagra, Kudelski Group, Cheseaux-sur-Lausanne, Switzerland
[email protected], [email protected]

Abstract. In this paper we consider the problem of unsupervised clustering of multidimensional numerical data. We propose a new method for determining the optimal number of clusters in a data set, based on a parametric model of the Rate-Distortion curve. The proposed method can be used in conjunction with any suitable clustering algorithm. It was tested on artificial and real numerical data sets, and the results of the experiments demonstrate empirically not only the effectiveness of the method but also its ability to cope with "difficult" cases where other known methods fail. Keywords: Clustering, Cluster Analysis, Number of Clusters, Rate-Distortion Curve, Parametric Model.

1 Introduction

Estimating the number of clusters in a data set is an important theoretical and practical problem in cluster analysis. Many algorithms have been suggested [2, 6, 10]; however, the most reliable method is yet to be found. Our paper contributes to the quest for finding an optimal number of clusters in an efficient manner, where the criterion of "optimality", while being general enough, takes into account the "natural" structure of multidimensional data sets. The rest of the paper is organized as follows. In Section 2 we formulate the problem, after which we review the known methods in Section 3. Section 4 is dedicated to our solution to the problem: it introduces a new parametric model of the Rate-Distortion curve and derives a validity criterion from this model. In Section 5 we discuss the results of the experiments for a number of interesting data sets and provide comparisons with other methods. In Section 6 we draw conclusions.

2 Problem Formulation

Cluster analysis can be characterized as an attempt to represent a large population (data set) by a smaller number of points (centroids), thus "sacrificing" some of the information in favor of a more economical representation and more efficient processing


of data. A large body of work has been dedicated to clustering validity criteria; the fundamental study of Milligan and Cooper [6] provides a detailed set of references, which can be complemented by more recent surveys [2, 10]. Although finding the most suitable clustering method for a given set of data is an important problem, our paper is not concerned with it: we assume that an algorithm for data clustering is selected taking into account the nature and structure of the data. We concentrate solely on the problem of finding an optimal number of clusters for a given data set and a suitable clustering algorithm. This is not an easy problem, especially for large d-dimensional data sets, where "visual inspection" is unrealistic for any d >> 2.

Let X = {x_1, …, x_N} be a set of N data points, where each x_i = (x_{i,1}, …, x_{i,d}) is a point in a d-dimensional space. The data are clustered into M clusters {C_1, …, C_M}. A cluster C_j is defined by the indices of the data points in this cluster, the centroid c_j, and the number of data points n_j in the cluster. The error caused by clustering can be measured with different criteria, such as within-class variance, Integral Square Error, Mahalanobis distance, or any other metric. As we work with numerical data, we chose the standard within-cluster sum-of-squares W_m as a distortion measure with the Euclidean distance metric:

$$W_m = \sum_{j=1}^{m} \sum_{x_i \in C_j} \| x_i - c_j \|^2 .$$

We will also need the between-cluster sum-of-squares criterion, defined as

$$B_m = \sum_{j=1}^{m} n_j \, \| c_j - \bar{x} \|^2 ,$$

where \bar{x} is the centroid of the data set X.

Our objective is to find an optimal balance between the clustering error and the number of clusters.
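For concreteness, both sums of squares can be computed directly from a given partition. The following NumPy sketch assumes X is an N×d array, labels assigns each point a cluster index, and centroids is an m×d array of cluster centers (all names are illustrative, not from the paper):

```python
import numpy as np

def within_cluster_ss(X, labels, centroids):
    """W_m: total squared Euclidean distance of each point
    to the centroid of its own cluster."""
    return sum(
        np.sum((X[labels == j] - c) ** 2)
        for j, c in enumerate(centroids)
    )

def between_cluster_ss(X, labels, centroids):
    """B_m: sum over clusters of n_j * ||c_j - x_bar||^2,
    where x_bar is the centroid of the whole data set X."""
    x_bar = X.mean(axis=0)
    return sum(
        np.sum(labels == j) * np.sum((c - x_bar) ** 2)
        for j, c in enumerate(centroids)
    )
```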

3 Known Solutions to the Problem of Optimal Number of Clusters

The problem can be solved by introducing a cost function that incorporates the number of clusters and an error measure (so-called optimization-like criteria [10]). There is a number of simple and computationally efficient optimization-like criteria that use quantization errors only; the method proposed in this paper belongs to this group. For comparison, we selected a number of known algorithms based on analysis of a cost function that includes the clustering error and the number of clusters. The popular criterion CH [1] is based on the ratio of the within-class and between-class variances (for uniformity, we use the inverted criterion CH*):

$$M = \arg\min_m \{ CH^*(m) \} = \arg\min_m \left\{ \frac{W_m \cdot (m-1)}{B_m \cdot (N-m)} \right\} . \qquad (1)$$

A modification of the CH* criterion was suggested in [12] as follows:

$$M = \arg\min_m \{ ZXF(m) \} = \arg\min_m \left\{ \frac{W_m \cdot m}{B_m} \right\} . \qquad (2)$$


A more general criterion for d-dimensional data clustering was proposed in [11]:

$$M = \arg\min_m \{ Xu(m) \} = \arg\min_m \left\{ W_m \cdot m^{2/d} \right\} . \qquad (3)$$

A solution to the problem can also be obtained by analyzing the clustering error, or any other criterion in use, as a function of the number of clusters in order to detect an inflection point (i.e., an "elbow" or "knee") [2, 5-8, 10]. These algorithms (aka difference-like criteria) are usually less accurate than methods based on optimization-like criteria [10]. The difference-like criteria, which rely on first and second differences, are sensitive to unavoidable variations in the calculated error.

4 Parametric Model of Rate-Distortion Curve

We approach the problem from another angle. Let us introduce a parametric model of the Rate-Distortion (R-D) curve, where the distortion is defined by some error measure and the rate is the number of clusters m. As the distortion measure for building the R-D curve, we use the within-cluster sum-of-squares W_m. Our assumption is that the function W_m on a logarithmic scale can be approximated by the following linear model with coefficients a and b:

$$\lg(\hat{W}_m) + \frac{2}{a} \lg(m) + b = 0 . \qquad (4)$$

Here, the parameter a is calculated as the coefficient of the linear regression for m ∈ [M_1, M_2]:

$$a = -2 \, \frac{(M_2 - M_1 + 1) \sum_{m=M_1}^{M_2} \lg^2(m) - \Bigl( \sum_{m=M_1}^{M_2} \lg(m) \Bigr)^2}{(M_2 - M_1 + 1) \sum_{m=M_1}^{M_2} \lg(m) \lg(W_m) - \sum_{m=M_1}^{M_2} \lg(m) \sum_{m=M_1}^{M_2} \lg(W_m)} . \qquad (5)$$
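In practice, the regression in Eq. (5) is just the slope of a least-squares line fitted in log-log coordinates: the model (4) gives lg(W_m) = -(2/a)·lg(m) - b, so a follows from the fitted slope. A minimal NumPy sketch (function and argument names are ours, not from the paper):

```python
import numpy as np

def estimate_a(ms, Ws):
    """Estimate the model parameter a of Eqs. (4)-(5) by fitting a
    least-squares line to (lg m, lg W_m) over m in [M1, M2].
    The model has slope -2/a, hence a = -2 / slope."""
    x = np.log10(ms)                  # ms: cluster counts M1..M2
    y = np.log10(Ws)                  # Ws: corresponding W_m values
    slope, _intercept = np.polyfit(x, y, 1)
    return -2.0 / slope
```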

From (4) it follows that the function I_RDC = Ŵ_m · m^(2/a) is constant for a linear model of the R-D curve. Based on this invariant, we introduce the multiplicative cost function RDC_a = W_m · m^(2/a). Although we cannot simultaneously minimize the number of clusters m and the error measure W_m, we can find the pair (m, W_m) for which the value of the cost function RDC_a is smaller than for all other values of m:

$$M = \arg\min_{M_1 \le m \le M_2} \{ RDC_a(m) \} = \arg\min_{M_1 \le m \le M_2} \left\{ W_m \cdot m^{2/a} \right\} . \qquad (6)$$

The minimum of the criterion RDC_a at the point M gives us the best solution in terms of the multiplicative cost function. The proposed parametric criterion RDC_a is similar to the criterion Xu. However, the criterion Xu for a d-dimensional space was derived for the general case from Rate-Distortion theory [11], whereas in our method the parameter a is estimated from the R-D curve constructed for the actual data set X.
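Putting the pieces together, the whole procedure can be sketched in a few lines. The sketch below substitutes scikit-learn's KMeans for the kmlocal package used in the paper's experiments; the function name and default parameters are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

def number_of_clusters_rdc(X, M1=1, M2=30):
    """End-to-end sketch of the RDC_a method: build the R-D curve,
    fit the parameter a, then minimize RDC_a(m) = W_m * m^(2/a), Eq. (6)."""
    ms = np.arange(M1, M2 + 1)
    # Within-cluster sum-of-squares W_m; KMeans exposes it as inertia_.
    W = np.array([KMeans(n_clusters=m, n_init=10).fit(X).inertia_
                  for m in ms])
    slope, _ = np.polyfit(np.log10(ms), np.log10(W), 1)
    a = -2.0 / slope
    rdc = W * ms ** (2.0 / a)         # multiplicative cost function RDC_a
    return ms[np.argmin(rdc)], a
```

As discussed below in Section 5, inspecting the distinct local minima of the rdc array, not only the global minimum, may reveal a hierarchical cluster structure.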


5 Results and Discussion

We tested the proposed algorithm on a 2.3 GHz Pentium 4. For the tests we used artificial and real data sets from [3, 9, 12]; see Tab. 1. The set Lena is the intensity histogram of the classical grayscale image Lena. The results of the experiment are set out for eight test sets. The data have been clustered with the algorithm of [4] (software package kmlocal, version 1.7.2). The Rate-Distortion curves for the test sets are presented in Figs. 2 and 4. Tab. 2 contains the estimated numbers of clusters in the data sets obtained with the tested criteria. Figs. 3 and 5 summarize comparisons of the proposed cost function RDC_a and the criteria Xu, CH*, and ZXF.

We performed two groups of experiments to compare our R-D based model with three known criteria. To start with, we showed that the proposed parametric cost model describes the behavior of the R-D curves reasonably well, as Figs. 2 and 4 illustrate. An important observation is that it is not the approximation itself but the deviation of the R-D curve from its model that is of particular interest, because the most deviated point of the R-D curve gives us the optimal number of clusters in the given data set; see, for example, the R-D curves for the sets S1 and S4 in Fig. 2.

The parameter a can be interpreted as a measure of homogeneity and dimensionality. For more homogeneous data sets like S4, Uniform, and Lena, the value of the parameter a is close to the dimensionality d: a ≈ d. For data with a less uniform distribution in the space, the value of a is smaller than the dimensionality d; compare a = 1.28 and a = 1.82 for the 2-dimensional sets S1 and S4, respectively (see Fig. 1).

Next, we evaluated the performance of the known and the proposed criteria for estimating the number of clusters. Firstly, we compared the ability of the criteria to test the case when the data set has just one cluster against the case with more than one cluster [9]. The criteria CH* and ZXF cannot be used for this purpose because the between-cluster distortion measure degenerates to zero: B_1 ≡ 0. When the criteria Xu and RDC_a have the global minimum at the point M_1 = 1, the data X can be treated as a single cluster; see the results for the set Uniform in Fig. 3. However, if the criterion has a distinct local minimum at some other point M_2, this indicates that the data set X might have the corresponding number of clusters. For example, the criterion RDC_a for the sets S1 and S4 has, besides the global minimum at M_1 = 1, a first local minimum at M_2 = 15, which corresponds to the correct number of clusters. The distinct local minima of the criterion RDC_a can reveal a finer structure of the data, possibly a hierarchical partition of the data into clusters and sub-clusters.

Secondly, we compared the performance of the four criteria with respect to their ability to estimate the optimal number of clusters. The criterion Xu was derived under the assumption that the dimensionality of the clusters is the same as the dimensionality of the data [11]. This criterion works well as long as the assumption holds, as can be seen


from Tab. 2 for the sets S1, S4, Uniform, and Lena. But the method failed for the set Diagonal, where two elongated one-dimensional clusters are located on the diagonal of a 3-dimensional cube.

As for the criteria CH* and ZXF, the number of clusters can be evaluated with these methods mostly for data in a 2-dimensional space (such as S1 and S4) or for data with a ≈ 2 (like Glass and Iris). Indeed, the between-class sum-of-squares B_m is almost constant for m >> 1. In this case the criterion ZXF(m) can be rewritten as follows: ZXF(m) = W_m · m / B_m ≈ W_m · m^{2/2}. Thus, for large m, the criterion ZXF(m) is practically equivalent to the criterion Xu(m) for a 2-dimensional space; for example, compare the criteria Xu and ZXF (aka the WB-index) for the sets S1-S4 given in [12] in Figs. 2e and 2f.

With the proposed algorithm we overcome the problems mentioned above, because the actual dimensionality of the data set is taken into account by the R-D curve based model through the parameter a (see Tab. 2). The proposed parametric criterion RDC_a gives the correct number of clusters for all test sets. However, the applicability of the method depends on the clustering algorithm in use: the partition of the data X into classes should be done with a properly selected algorithm, depending on the structure of the data and the purpose of the clustering.

Table 1. Type, size N, dimensionality d of the test data, and the number of clusters M0 in the data

Set       Type        N     d    M0
S1        Artificial  5000  2    15
S4        Artificial  5000  2    15
Uniform   Artificial  2000  10   1
Diagonal  Artificial  600   3    2
Lena      Real        256   1    −
Iris      Real        150   4    3
Glass     Real        214   9    7
Wdbc      Real        569   30   2

Table 2. The optimal number of clusters M in the data sets obtained with the criteria Xu, CH*, ZXF and with the cost function RDC_a; n/a means that the number of clusters is not available with a certain method. The value of the parameter a is given in the last column.

Set       d    M0   Xu: M   CH*: M   ZXF: M   RDCa: M   a
S1        2    15   15      15       15       1, 15     1.28
S4        2    15   15      15       15       1, 15     1.82
Uniform   10   1    1       2        5        1         10.46
Diagonal  3    2    n/a     n/a      n/a      2         1.07
Lena      1    5    5       n/a      n/a      1, 5      1.03
Iris      4    3    12      3        6        3         1.67
Glass     9    7    n/a     2        7        1, 7      2.52
Wdbc      30   2    n/a     n/a      n/a      2         1.35
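As an illustration of the single-cluster test discussed above, the following sketch applies the criterion to a stand-in for the Uniform set: 2000 points drawn uniformly in a 10-dimensional cube. This is our own re-creation of the set's shape, not the paper's actual data, and scikit-learn's KMeans again replaces kmlocal:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.random((2000, 10))           # hypothetical Uniform-like data

ms = np.arange(1, 21)
W = np.array([KMeans(n_clusters=m, n_init=10).fit(X).inertia_ for m in ms])

slope, _ = np.polyfit(np.log10(ms), np.log10(W), 1)
a = -2.0 / slope                     # for uniform data, expect a close to d
rdc = W * ms ** (2.0 / a)

print(f"a = {a:.2f}, argmin RDC_a = {ms[np.argmin(rdc)]}")
# A global minimum at m = 1 suggests the data form a single cluster.
```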


Fig. 1. Artificial sets S1, d=2, M0=15, a=1.28 (left) and S4, d=2, M0=15, a=1.82 (right)

Fig. 2. Rate-Distortion curves with the parametric models for the artificial data tests

Fig. 3. The results for the criteria Xu, CH*, ZXF and RDCa for the artificial data tests


Fig. 4. Rate-Distortion curves with the parametric models for the real data tests

Fig. 5. The results for the criteria Xu, CH*, ZXF and RDCa for the real data tests

6 Conclusions

We considered the problem of evaluating the optimal number of clusters for multidimensional numerical data. We proposed a new, computationally efficient


algorithm based on modeling the Rate-Distortion curve constructed for a given data set with a suitable clustering algorithm. The cost criterion is derived from the properties of the model, and although it is intrinsically based on the structure of the data, finding the optimal number of clusters does not involve any sophisticated data analysis and relies only on a distortion measure suitable for numerical data. The introduced model parameter can be interpreted as a characteristic of the data dimensionality and homogeneity; hence our method seems to be more powerful than other known methods, as was empirically demonstrated on well-known data sets. The proposed algorithm is simple, computationally efficient, and has demonstrated good results for numerical data sets, both real and artificially generated, especially in comparison with the other tested algorithms. As future work, we would like to investigate whether the proposed method can be extended to other error measures for categorical and mixed data clustering.

References
1. Calinski, T., Harabasz, J.: A Dendrite Method for Cluster Analysis. Communications in Statistics 3, 1–27 (1974)
2. Dimitriadou, E., Dolnicar, S., Weingessel, A.: An Examination of Indexes for Determining the Number of Clusters in Binary Data Sets. Psychometrika 67, 137–160 (2002)
3. Frank, A., Asuncion, A.: UCI Machine Learning Repository. University of California, School of Information and Computer Science, Irvine, CA (2010), http://archive.ics.uci.edu/ml
4. Kanungo, T., Mount, D.M., Netanyahu, N.S., Piatko, C.D., Silverman, R., Wu, A.Y.: An Efficient k-Means Clustering Algorithm: Analysis and Implementation. IEEE Trans. Pattern Analysis and Machine Intelligence 24, 881–892 (2002)
5. Krzanowski, W.J., Lai, Y.T.: A Criterion for Determining the Number of Groups in a Data Set Using Sum-of-Squares Clustering. Biometrics 44, 23–34 (1988)
6. Milligan, G.W., Cooper, M.C.: An Examination of Procedures for Determining the Number of Clusters in a Data Set. Psychometrika 50, 159–179 (1985)
7. Salvador, S., Chan, P.: Determining the Number of Clusters/Segments in Hierarchical Clustering/Segmentation Algorithms. In: Proc. 16th IEEE Int. Conf. on Tools with Artificial Intelligence, pp. 576–584 (2004)
8. Sugar, C.A., James, G.M.: Finding the Number of Clusters in a Data Set: An Information-Theoretic Approach. J. American Statist. Association 98, 750–763 (2003)
9. Tibshirani, R., Walther, G., Hastie, T.: Estimating the Number of Clusters in a Data Set via the Gap Statistic. J. R. Statist. Soc. 63, 411–423 (2001)
10. Vendramin, L., Campello, R.J.G.B., Hruschka, E.R.: Relative Clustering Validity Criteria: A Comparative Overview. Statistical Analysis and Data Mining 3, 209–235 (2010)
11. Xu, L.: Bayesian Ying-Yang Machine, Clustering and Number of Clusters. Pattern Recognition Letters 18, 1167–1178 (1997)
12. Zhao, Q., Xu, M., Fränti, P.: Sum-of-Squares Based Cluster Validity Index and Significance Analysis. In: Kolehmainen, M., Toivanen, P., Beliczynski, B. (eds.) ICANNGA 2009. LNCS, vol. 5495, pp. 313–322. Springer, Heidelberg (2009)
