EITI2002


A Cluster Validity Index for Comparing Non-hierarchical Clustering Methods

Erika Johana Salazar G. [email protected]
Ana Clara Velez [email protected]
Carlos Mario Parra M. [email protected]
Oscar Ortega L. [email protected]

Engineering Faculty, University of Antioquia

Abstract—When trying to discover knowledge in a collection of data, one of the first tasks that arises is to identify groups of similar objects, that is, to carry out cluster analysis in order to obtain data partitions. Several clustering methods can be used for cluster analysis. Yet, for a given data set, each clustering method may identify groups whose member objects differ. Thus, a decision must be made about which clustering method produces the best partition of a given data collection. In order to support such a decision, indexes that measure the quality of a data partitioning must be constructed. The construction of such indexes, known as cluster validity indexes, is especially difficult given that, in real-world data, the underlying distribution is usually unknown and there is not even certainty that a clustering structure exists at all. Under such conditions, it is difficult to find reliable statistics on which to base the index. Several cluster validity indexes have been formulated in the literature, each with strengths and drawbacks relative to the others. In the present study, an alternative cluster validity index is formulated. The proposed index relies on the application of the Karl Pearson statistic to the contingency table derived from comparing the actual clusters against the clusters obtained by a clustering method. Since the statistic is applied to the contingency table rather than to the data themselves, the validity index derived from it is expected to be more reliable across different data conditions, independently of the underlying distribution of the data. An experimental design was devised to determine the comparative performance of the proposed cluster validity index against two indexes previously formulated in the literature. In the experiments, a clustering method was applied to artificial data sets with a controlled non-hierarchical clustering structure.

Index Terms—Cluster Validity, Non-hierarchical Clustering, Contingency Tables.

I. INTRODUCTION

CLUSTERING is a useful technique in the data mining area. By using clustering methods it is possible to find interesting patterns and to partition a given data set into groups of similar objects. This process is known as cluster analysis, and any clustering method from a wide variety of them may be chosen. Depending on the structure of a given data set and on the parameter values passed to the chosen clustering method, we will obtain different results, with different numbers of clusters and different objects belonging to them. Thus, having several data partitions produced with either different clustering methods or different parameter values, we must decide which clustering schema best fits the underlying data set.

The evaluation of clustering results is a process known as cluster validity and is a very important task in cluster analysis. Cluster validity methods use indexes for a quantitative evaluation of clustering results, measuring the quality of the obtained data partitions. There are three approaches to cluster validity [1]. The first one, named external criteria, evaluates the result of a clustering method based on a pre-specified structure imposed on a data set, which reflects the user's intuition about the clustering structure of the data set. The second approach, named internal criteria, evaluates the clustering result in terms of quantities obtained from the data set itself. The third approach, named relative criteria, compares a clustering structure to others obtained from the same clustering method with different parameter values. To choose the optimal clustering schema, two criteria have been proposed [2]:

1) Compactness: members of each cluster should be as close to each other as possible.
2) Separation: the clusters should be widely spaced from each other.

In this article, a new alternative cluster validity index is formulated and compared with two common indexes presented in the literature. The new index, named the Q index, is based on the Karl Pearson statistic, computed on the contingency table derived from the comparison of the actual clusters versus the clusters obtained


by a clustering method. The results given by the k-means clustering method over forty data sets, randomly generated from bivariate normal distributions, are used to compare the new Q index with the other two indexes. An experimental design is applied for this purpose.

The present article is organized as follows. In the next section we describe the problem faced when trying to develop a new cluster validity index. In the third section we present some common indexes known from the literature. In the fourth section a new validity index is proposed; it is compared with two of the known indexes in the fifth section, and a discussion of the results obtained is presented in section six. Finally, in the seventh section we draw conclusions about the results and outline future work.

II. THE PROBLEM

In real-world databases the amount of data is significant and difficult for human experts to analyze. That is why cluster analysis is so important and useful. But this also makes the validation of clustering results difficult, because the underlying data distribution is usually unknown and there is no certainty that a clustering structure exists at all.

Clustering methods always produce a cluster partition of the given data set, regardless of whether real clusters exist in the data distribution. Thus, the construction of cluster validity indexes for a quantitative evaluation of clustering results is a difficult task, and finding reliable statistics for the formulation of such an index is a real problem in the data mining area.

III. RELATED WORK

The literature presents several discussions of cluster validity indexes, some of which have become standard in the data mining community. Here we mention some well-known indexes. Two of them, Hubert's statistic and the Davies-Bouldin index, are used in the following sections for comparison with the proposed index Q. Validity indexes are used in cluster validity methods for a quantitative evaluation of clustering results. They give an indication of the quality of the resulting partitioning; thus, they can only be considered a tool at the disposal of the experts for evaluating clustering results [3]. Some of the most common indexes are:

• Hubert's statistic (Γ):

    Γ = (1/M) Σ_{i=1}^{N−1} Σ_{j=i+1}^{N} X(i, j) Y(i, j)

where M is the number of all pairs of points in the data set (M = N(N − 1)/2, with N the total number of points in the data set), and X(i, j) and Y(i, j) are the (i, j) elements of the matrices X and Y being compared. High values of this index indicate a strong similarity between X and Y. For comparing the partition obtained with a clustering method against the real partition that exists in the data set, X and Y represent those partitions respectively and are defined as:

    X(i, j) = 1 if x_i and x_j belong to different clusters, and 0 otherwise, i, j = 1, ..., N
    Y(i, j) = 1 if y_i and y_j belong to different clusters, and 0 otherwise, i, j = 1, ..., N

• Dunn index [4]: this index attempts to identify compact and well-separated clusters. It is defined by

    D_nc = min_{i=1,...,nc} { min_{j=i+1,...,nc} [ d(C_i, C_j) / max_{k=1,...,nc} diam(C_k) ] }

where d(C_i, C_j) is the dissimilarity function between two clusters C_i and C_j, defined as

    d(C_i, C_j) = min_{x ∈ C_i, y ∈ C_j} d(x, y)

and diam(C) is the diameter of a cluster, which may be considered a measure of the dispersion of the cluster. The diameter of a cluster C can be defined as

    diam(C) = max_{x, y ∈ C} d(x, y)

Large values of the index indicate the presence of compact and well-separated clusters.

• Davies-Bouldin index (DB) [5]: this measure attempts to maximize the inter-cluster distance between clusters C_i and C_j and, at the same time, to minimize the distance between the points in a cluster and the cluster centroid. The intra-cluster distance s_c(Q_k) of the cluster Q_k is

    s_c(Q_k) = Σ_i ||X_i − C_k|| / N_k

where N_k is the number of points belonging to the cluster Q_k and C_k = (1/N_k) Σ X_i is its centroid. The inter-cluster distance is defined as

    d_cc(Q_k, Q_l) = ||C_k − C_l||

Thus, the DB index is defined as

    DB(nc) = (1/nc) Σ_{k=1}^{nc} max_{l ≠ k} [ (s_c(Q_k) + s_c(Q_l)) / d_cc(Q_k, Q_l) ]

It is desirable for the clusters to have the minimum possible similarity to each other; therefore we seek clusterings that minimize DB.
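As a concrete reading of the Γ definition above, the following sketch computes Hubert's statistic for two partitions given as label vectors. The function name and the O(N²) pairwise construction are our own illustrative choices, not part of the original formulation:

```python
import numpy as np

def hubert_gamma(labels_x, labels_y):
    """Hubert's Gamma: fraction of the M = N(N-1)/2 point pairs that
    both partitions place in different clusters simultaneously."""
    a, b = np.asarray(labels_x), np.asarray(labels_y)
    n = len(a)
    x = a[:, None] != a[None, :]   # X(i, j): pair split in partition X
    y = b[:, None] != b[None, :]   # Y(i, j): pair split in partition Y
    iu = np.triu_indices(n, k=1)   # all pairs with i < j
    m = n * (n - 1) / 2            # M = N(N-1)/2
    return (x[iu] & y[iu]).sum() / m

# Two identical 2-cluster partitions (up to relabelling): 4 of the 6
# point pairs are split by both partitions, so Gamma = 4/6.
print(hubert_gamma([0, 0, 1, 1], [1, 1, 0, 0]))  # 0.6666666666666666
```

Note that, with this definition, Γ reaches 1 only when every pair is split in both partitions, which already hints at the scale differences between indexes addressed later by the normalization ∆_k(θ).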

• RMSSTD (root-mean-square standard deviation) [6]: this index is used to determine the number of clusters inherent to a data set. It measures the homogeneity of the resulting clusters; thus, the index value must be as low as possible:

    RMSSTD = [ Σ_{i=1,...,nc; j=1,...,d} Σ_{k=1}^{n_ij} (x_k − x̄_j)² / Σ_{i=1,...,nc; j=1,...,d} (n_ij − 1) ]^{1/2}

• RS (R-squared) [6]: this index measures the difference between the clusters. If its value is 0, there is no difference between the clusters; if its value is 1, there is a significant difference between them. RS is defined as:

    RS = [ Σ_{j=1,...,d} Σ_{k=1}^{n_j} (x_k − x̄_j)² − Σ_{i=1,...,nc; j=1,...,d} Σ_{k=1}^{n_ij} (x_k − x̄_j)² ] / Σ_{j=1,...,d} Σ_{k=1}^{n_j} (x_k − x̄_j)²

where nc is the number of clusters, d is the number of variables (data dimensions), n_j is the number of data values of dimension j, n_ij is the number of data values of dimension j that belong to cluster i, and x̄_j is the mean of the data values of dimension j.

IV. APPROACH

In this section we describe the proposed index for cluster validity and present the cases in which the index value reflects the worst and the best evaluation of a given clustering result.

To evaluate the quality of a clustering result produced by a clustering method, we proceed under controlled laboratory conditions. That is, we have K data populations, similar to the real data (same domain), and we know the K clusters and their members. We choose a sample of size N from each one of the data populations, whose elements are x_1i, x_2i, ..., x_Ni, where x_1i is the first element belonging to cluster i, i ∈ {1, 2, ..., K}. The samples are the clusters present in the data set. Thus, we randomly generate a data set with N × K elements, or rows, and then apply the clustering method to the generated data.

If we denote the real clusters that exist in the data as C_1, C_2, ..., C_K and the obtained clusters as C*_1, C*_2, ..., C*_K, it is possible to construct a contingency table as shown in Table I, where n_ij is the number of elements belonging to real cluster C_j that were assigned to cluster C*_i by the clustering method.

TABLE I
CONTINGENCY TABLE: EACH CELL SHOWS THE NUMBER OF ELEMENTS BELONGING TO A REAL CLUSTER C_j THAT WERE ASSIGNED TO CLUSTER C*_i BY THE CLUSTERING METHOD.

    Obtained        Original Clusters
    Clusters    C_1     C_2     ...   C_K     Total
    C*_1        n_11    n_12    ...   n_1K    N*_1
    C*_2        n_21    n_22    ...   n_2K    N*_2
    ...         ...     ...     ...   ...     ...
    C*_K        n_K1    n_K2    ...   n_KK    N*_K
    Total       N       N       ...   N       K × N

Thus, we can use the Karl Pearson statistic to measure the distribution of the elements in the contingency table, and so learn how well the resulting clustering scheme fits the real clusters present in the data set, or whether the elements were randomly distributed. The index is defined as:

    Q = Σ_{i=1}^{K} Σ_{j=1}^{K} (o_ij − e_ij)² / e_ij

where o_ij is the observed frequency in row i and column j of the contingency table and e_ij is the expected frequency for that cell:

    e_ij = (total of column j)(total of row i) / grand total

A. Worst case

When the distribution of the elements in the contingency table reflects a random assignment (Table II), the clustering result may be considered undesirable. In this case, the presented index takes the value Q = 0.

TABLE II
WORST CASE: THE DISTRIBUTION OF ELEMENTS IN THE CONTINGENCY TABLE REFLECTS A RANDOM ASSIGNMENT MADE BY THE CLUSTERING METHOD.

    Obtained        Original Clusters
    Clusters    C_1     C_2     ...   C_K     Total
    C*_1        N/K     N/K     ...   N/K     N
    C*_2        N/K     N/K     ...   N/K     N
    ...         ...     ...     ...   ...     ...
    C*_K        N/K     N/K     ...   N/K     N
    Total       N       N       ...   N       K × N
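The definition of Q above can be sketched directly in code: build the K × K contingency table from the two labelings and apply the Pearson statistic. The helper name is ours, and the sketch assumes every cluster is non-empty so that no expected frequency is zero:

```python
import numpy as np

def q_index(obtained, original, k):
    """Q index: Pearson's chi-square statistic on the contingency table
    of obtained clusters (rows) vs. original clusters (columns)."""
    n = np.zeros((k, k))
    for ci, cj in zip(obtained, original):
        n[ci, cj] += 1                        # cell n_ij of Table I
    row = n.sum(axis=1, keepdims=True)        # row totals
    col = n.sum(axis=0, keepdims=True)        # column totals
    e = row * col / n.sum()                   # expected frequencies e_ij
    return ((n - e) ** 2 / e).sum()

# Perfect recovery of K = 2 clusters of N = 3 elements each: the table
# is diagonal and the chi-square statistic reaches K(K-1)N = 6.
print(q_index([0, 0, 0, 1, 1, 1], [0, 0, 0, 1, 1, 1], 2))  # 6.0
# Random-looking assignment: a uniform table gives Q = 0.
print(q_index([0, 1, 0, 1], [0, 0, 1, 1], 2))              # 0.0
```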


B. Ideal case

We have the desired result when each one of the obtained clusters C*_i corresponds to one of the real clusters C_j inherent to the data set; that is, each row and each column of the contingency table has exactly one non-zero cell (Table III). In this case, every expected frequency is e_ij = N·N/(KN) = N/K, each of the K diagonal cells contributes (N − N/K)²/(N/K) = N(K − 1)²/K, and each of the K(K − 1) off-diagonal cells contributes N/K, so the presented index takes the value:

    Q = N(K − 1)² + N(K − 1) = K(K − 1) × N

TABLE III
IDEAL CASE: EACH ONE OF THE OBTAINED CLUSTERS C*_i CORRESPONDS TO ONE OF THE REAL CLUSTERS C_j INHERENT TO THE DATA SET, MEANING A PERFECT RESULT GIVEN BY THE CLUSTERING METHOD.

    Obtained        Original Clusters
    Clusters    C_1     C_2     ...   C_K     Total
    C*_1        N       0       ...   0       N
    C*_2        0       N       ...   0       N
    ...         ...     ...     ...   ...     ...
    C*_K        0       0       ...   N       N
    Total       N       N       ...   N       K × N

TABLE IV
RANKING OF VALIDITY INDEXES PER PARAMETER SET: THE QUALITY INDEXES ARE WRITTEN FROM LEFT TO RIGHT IN DESCENDING ORDER BY THE VALUE OF THE MEAN OF ∆_ki(θ). WHENEVER THERE IS NO SIGNIFICANT DIFFERENCE BETWEEN SOME INDEXES, THEY ARE GROUPED WITH PARENTHESES.

    µ_1, µ_2, µ_3                       | Σ                           | Quality Index Ranking
    (2, 2), (3, 3), (4, 4)              | [0.490 0.245; 0.245 0.490]  | Γ, (DB, Q)
    (2, 2), (3, 3), (4, 4)              | [0.250 0.125; 0.125 0.250]  | (Γ, Q), DB
    (2, 1.5), (3, 3), (4, 4.5)          | [1.330 0.665; 0.665 1.330]  | Γ, DB, Q
    (2, 2), (3.3, 3.3), (4.5, 4.5)      | [0.16 0.08; 0.08 0.16]      | DB, Q, Γ
    (2.5, 2.5), (4.5, 2.5), (2.5, 4.5)  | [0.16 0.08; 0.08 0.16]      | DB, (Γ, Q)
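The distributional conditions of Table IV can be instantiated numerically. The sketch below generates one artificial data set for the first parameter set and produces partitions for several values of k; the sample size per cluster, the random seed, and the minimal Lloyd-style k-means loop are our own illustrative choices, not the original experimental code:

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen for reproducibility (ours)

# First parameter set of Table IV: three bivariate normals with means
# (2,2), (3,3), (4,4) and a common covariance giving correlation 0.5.
means = [(2.0, 2.0), (3.0, 3.0), (4.0, 4.0)]
cov = np.array([[0.490, 0.245], [0.245, 0.490]])
X = np.vstack([rng.multivariate_normal(m, cov, size=50) for m in means])

def kmeans(data, k, iters=50):
    """Minimal k-means sketch: random init, fixed Lloyd iterations."""
    centers = data[rng.choice(len(data), k, replace=False)]
    for _ in range(iters):
        d2 = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)          # assign to nearest center
        for j in range(k):
            if (labels == j).any():
                centers[j] = data[labels == j].mean(axis=0)
    return labels

# As in the experiments, partitions are produced for several values of k;
# a validity index would then be evaluated on each resulting partition.
partitions = {k: kmeans(X, k) for k in (2, 3, 4)}
```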

Based on the previous observations, the index takes its values in the interval:

    0 ≤ Q ≤ K(K − 1) × N

Since we prefer the clustering schema that is as far as possible from a random distribution of elements in the contingency table, we prefer the clustering result for which the index value is as close as possible to K(K − 1) × N.

V. EVALUATION

The experiments were conducted on artificial data sets with three clusters whose members follow the distributions N_2[µ_1, Σ], N_2[µ_2, Σ], N_2[µ_3, Σ] respectively, with constant correlation ρ = 0.5. The set of parameters {µ_1, µ_2, µ_3, Σ} defines a distributional condition. The first two columns of Table IV show the five parameter sets considered. For each distributional condition, ten databases were artificially generated following its distributional parameters. The k-means clustering algorithm was applied several times to each artificial data set, setting the number of clusters to k = 2, 3, ..., 8. On the partitions obtained by the clustering algorithm, the indexes DB, Γ, and Q were calculated in order to measure the quality of the partitions. The Dunn, RMSSTD, and RS indexes were not included in these preliminary experiments. Given that the artificial data sets are generated with three clusters, it is assumed that the k-means algorithm should obtain the best partitioning when it is instructed to produce three clusters. Accordingly, the values of the quality indexes obtained for partitions generated with k = 3 are taken as the optimal values and associated with the best quality.

The three indexes evaluated in the experiments have different ranges, so they are not directly comparable. In order to make them comparable, a normalized variable was defined:

    ∆_k(θ) = (1/10) Σ_{i=1}^{10} |θ_ki − θ_3i| / θ_3i,   k = 2, 4, 5, 6, 7, 8;  θ ∈ {D, H, Q}

where θ_ki denotes the value of the index θ for the partition of k clusters obtained by the k-means algorithm on the i-th artificial data set. High values of ∆_k(θ) indicate the sensitivity of the index θ when the number of clusters moves away from the real number of clusters present in the data set. Thus, the index with the highest values of ∆_k(θ) is preferred. The hypothesis to be contrasted in the experiments is:

    H_0: ∆_k(D) = ∆_k(H) = ∆_k(Q), for all k

In order to perform the contrast, a multi-factor analysis of variance was conducted. The factors considered were: number of clusters (k), validity index (θ), and distributional condition. The results are shown in the fourth column of Table IV, labeled "Quality Index Ranking". The quality indexes are written from left to right in descending order by the value of the mean of ∆_ki(θ). Whenever there is no significant difference between some indexes, they are grouped with parentheses. For example, Γ, (DB, Q) means that ∆_ki(Γ) > ∆_ki(DB) > ∆_ki(Q), with no significant difference between ∆_ki(DB) and ∆_ki(Q). The index Γ achieves the highest mean under the first three conditions, and the index DB achieves the highest mean under the last two conditions.

VI. DISCUSSION

From an analysis of the covariance matrices corresponding to the conditions under which either Γ or DB achieves the highest mean, the following assertion can be made. Under the experimental conditions considered, Γ should be chosen whenever


the variance is higher than 0.16; otherwise, DB should be chosen.

When computing the three indexes Γ, DB, and Q over the clustering results, a significant difference in computational running time was noticeable. While the DB and Q indexes produced their results within seconds for a given data set, the Γ index result for the same data set could only be obtained after about three hours of running time.

VII. CONCLUSIONS

The present work was intended to understand and address the problem of constructing cluster validity indexes for a quantitative evaluation of clustering results. For this purpose a new index, Q, was proposed and compared with the Hubert (Γ) and Davies-Bouldin (DB) indexes, with the expectation that it would be more reliable. A measure ∆_k(θ) was developed to evaluate the sensitivity of validity indexes to the change of the number of clusters k as it moves away from the real number of clusters present in the data sets. After a preliminary experiment, the DB index proved more reliable whenever the variance in the data set is equal to 0.16, that is, whenever the real clusters present in the data are more compact; otherwise the Γ index is more reliable. Nevertheless, only five distributional conditions were used to evaluate the indexes, and under these experimental conditions it is not possible to draw complete inferences from the results. For a future and more complete analysis, other validity indexes should be included in the experiments, and they could be computed on data sets drawn from many more distributional conditions. Thus, inferences about the influence of the distributional parameters on the results could be performed. An analytic evaluation of the computational time complexity of the indexes could be an important element to take into account for future comparisons and will be included in future experiments.

REFERENCES

[1] S. Theodoridis and K. Koutroumbas, Pattern Recognition, Academic Press, 1999.
[2] Michael J. A. Berry and Gordon Linoff, Data Mining Techniques for Marketing, Sales and Customer Support, John Wiley & Sons, Inc., 1996.
[3] Maria Halkidi, Yannis Batistakis, and Michalis Vazirgiannis, "On clustering validation techniques," Journal of Intelligent Information Systems, vol. 17, pp. 107-145, December 2001.
[4] J. C. Dunn, "Well separated clusters and optimal fuzzy partitions," Journal of Cybernetics, vol. 4, pp. 95-104, 1974.
[5] D. L. Davies and D. W. Bouldin, "A cluster separation measure," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 1, no. 2, pp. 224-227, 1979.
[6] Subhash Sharma, Applied Multivariate Techniques, John Wiley & Sons, 1996.
