12th WSEAS International Conference on COMMUNICATIONS, Heraklion, Greece, July 23-25, 2008

NIVA: A Robust Cluster Validity

ERENDIRA RENDÓN, RENE GARCIA, ITZEL ABUNDEZ, CITLALIH GUTIERREZ, EDUARDO GASCA, FEDERICO DEL RAZO, ADRIAN GONZALEZ
División de Estudios de Postgrado e Investigación, Instituto Tecnológico de Toluca
Ex. Rancho la Virgen, Metepec, Edo. de México, MÉXICO

Abstract: Clustering aims at extracting hidden structures from datasets. Many validity indices have been proposed to evaluate clustering results; some of them work well when clusters have different densities and sizes, others when clusters have different shapes, but most consider only one or two of these characteristics at a time. In this paper, we present a cluster validity index that takes into account the density, size and shape of clusters simultaneously. The proposed index is experimentally compared with the PS, CS and S_Dbw indices using 12 synthetic datasets, and it outperforms them.

Key-Words: Cluster validity, clustering algorithm, connectivity and compactness.

1 Introduction

The goal of clustering is to determine the intrinsic grouping in a set of unlabeled data, where the objects in each group are indistinguishable under some criterion of similarity. Clustering is an unsupervised classification process that is fundamental to data mining and one of the most important tasks in data analysis. It finds application in several areas such as bioinformatics [14], web data analysis [13], text mining [17] and scientific data exploration [1]. Because clustering is unsupervised learning, no a priori information about the data set is available. However, to obtain good results, clustering algorithms depend on input parameters. For instance, the k-means [16] and CURE [15] algorithms require the number of clusters (k) to be created. In this sense, the question is: what is the optimal number of clusters? Research on cluster validity indices has drawn attention as a means to answer this question [7]. Many different cluster validity methods have been proposed [8][10] that do not require any a priori class information. Clustering validation is a technique to find the set of clusters that best fits the natural partitions (number of clusters) without any class information. In this paper, we present an analysis of the indices proposed by Chou (PS and CS indices) [8][10] and Halkidi (S_Dbw index) [7], and we offer a solution that addresses their drawbacks. For this purpose, we first present a novel validity index (NIVA) that uses the connectivity among points to capture the shape of a cluster; secondly, we present a comparative study with the CS, PS and S_Dbw validity indices. The rest of the paper is organized as follows: Section 2 presents a survey of related work; Section 3 offers a brief analysis of some validity indices; Section 4 details the proposed index; Section 5 provides the experimental results of our index and discusses some findings from these results; finally, we conclude by briefly presenting our contributions and further work.

2 Previous works

Almost every clustering algorithm depends on the characteristics of the dataset and on its input parameters. Incorrect input parameters may lead to clusters that deviate from those actually present in the dataset. In order to determine the input parameters that lead to clusters that best fit a given dataset, we need reliable guidelines for evaluating clusters; clustering validity indices have recently been employed for this purpose. In general, clustering validity indices are defined by combining compactness and separability (a small illustration of both measures is sketched at the end of this section):
1. Compactness: measures the closeness of the elements of a cluster. A common measure of compactness is variance.
2. Separability: indicates how distinct two clusters are. It computes the distance between two different clusters; the distance between representative objects of two clusters is a typical example. This measure has been widely used due to its computational efficiency and its effectiveness for hypersphere-shaped clusters.

There are three approaches to studying cluster validity [11]. The first is based on external criteria: the results of a clustering algorithm are evaluated against a pre-specified structure imposed on the dataset, i.e. external information that is not contained in the dataset itself. The second approach is based on internal criteria: the results of a clustering algorithm are evaluated using only information that involves the vectors of the dataset themselves. Internal criteria can roughly be subdivided into two groups: those that assess the fit between the data and the expected structure, and those that focus on the stability of the solution [12]. The third approach is based on relative criteria and consists of evaluating the results (clustering structure) by comparing them with other clustering schemes.

Many different cluster validity measures have been proposed in the past [2]. In general, validity indices can be grouped into three main categories: the first consists of validity measures that evaluate the properties of the crisp structures imposed on the data by the clustering algorithm [3]; the second consists of measures that use the membership degrees obtained by a fuzzy clustering algorithm [3]; the third consists of validity measures that take into account not only the membership degrees but also the data themselves. Among the indices proposed in the literature to measure the fitness of the partitions produced by a clustering algorithm [2], the Dunn index [2] measures the ratio between the smallest inter-cluster distance and the largest intra-cluster distance in a partitioning; several variations of Dunn have been proposed [4][5]. The DB index measures the average similarity between each cluster and the one that most resembles it [6]. The SD index [7] is based on the concepts of average scattering of the clusters and total separation among clusters. The S_Dbw index is very similar to the SD index; it measures the intra-cluster variance and the inter-cluster variance. The PS index [8] uses a non-metric distance based on the concept of point symmetry [9] and measures the total average symmetry with respect to the cluster centers. [10] proposes the CS index, which obtains good clustering results when densities and sizes differ, but its computational cost is high.
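As a concrete illustration of these two notions, a minimal NumPy sketch is given below; it uses the variance-based compactness and the centroid-distance separation described above, with a function name of our own choosing, and is only an example rather than any of the published indices.

```python
import numpy as np

def compactness_and_separation(X, labels):
    """Toy compactness/separation measures for a hard partition.

    Compactness: mean within-cluster variance (smaller means tighter clusters).
    Separation: minimum Euclidean distance between cluster centroids
    (larger means the clusters are farther apart).
    """
    clusters = np.unique(labels)
    centroids = np.array([X[labels == c].mean(axis=0) for c in clusters])

    # Mean total variance of the points of each cluster around its centroid.
    compactness = np.mean([X[labels == c].var(axis=0).sum() for c in clusters])

    # Smallest pairwise distance between centroids.
    dists = [np.linalg.norm(centroids[i] - centroids[j])
             for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
    separation = min(dists)
    return compactness, separation
```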

3 Analysis of indices

In this section, we offer an overview of the CS, PS and S_Dbw validity indices, since our index is motivated by their disadvantages.

3.1 S_Dbw validity index

M. Halkidi proposed the S_Dbw index [7]. This index is based on cluster compactness and separation, considering the density of the clusters. In other words, the S_Dbw index measures the intra-cluster variance and the inter-cluster density. The intra-cluster variance measures the mean scattering of the clusters and is described by Eq. 1; the inter-cluster density is defined by Eq. 2.

$Scatt = \frac{1}{n_c}\sum_{i=1}^{n_c}\frac{\|\sigma(v_i)\|}{\|\sigma(X)\|}$   (1)

where $\sigma(v_i)$ is the variance of cluster $c_i$ and $\sigma(X)$ is the variance of the data set.

$Dens\_bw = \frac{1}{n_c(n_c-1)}\sum_{i=1}^{n_c}\left\{\sum_{\substack{j=1 \\ j \neq i}}^{n_c}\frac{density(u_{ij})}{\max\{density(v_i),\, density(v_j)\}}\right\}$   (2)

where $u_{ij}$ is the middle point of the line segment defined by the cluster centers $v_i$ and $v_j$. The density function around a point $u_{ij}$ counts the number of points inside a hyper-sphere whose radius is equal to the average standard deviation of the clusters, defined as

$stdev = \frac{1}{n_c}\sum_{i=1}^{n_c}\|\sigma(v_i)\|$   (3)

The S_Dbw index is then defined as

$S\_Dbw = Scatt + Dens\_bw$

3.2 PS validity index

The PS index was proposed by Chien-Hsing Chou [8]; it identifies clusters of different shapes, even when several shapes coexist in the same clustering. To do this, it uses the distance proposed by Su [9]. The general concept of the "point symmetry distance" [9] is defined as follows:

$d_s(x_j, c) = \min_{\substack{k=1,\dots,N \\ k\neq j}}\left\{\frac{\|(x_j-c)+(x_k-c)\|}{\|x_j-c\|+\|x_k-c\|}\right\}$   (4)
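To make Eq. 4 concrete, the following is a minimal NumPy sketch of the point symmetry distance of a point x_j with respect to a candidate center c, together with a toy usage example; it is our own illustration of the formula, not code from [8] or [9].

```python
import numpy as np

def point_symmetry_distance(x_j, c, X, j):
    """Point symmetry distance of x_j with respect to center c (Eq. 4).

    Looks for the point x_k most symmetric to x_j relative to c; if a
    perfectly mirrored point exists, the numerator (and the distance) is zero.
    """
    best = np.inf
    for k, x_k in enumerate(X):
        if k == j:
            continue
        num = np.linalg.norm((x_j - c) + (x_k - c))
        den = np.linalg.norm(x_j - c) + np.linalg.norm(x_k - c)
        best = min(best, num / den)
    return best

# Tiny example: x = (1, 0) has a mirror image (-1, 0) about the origin.
X = np.array([[1.0, 0.0], [-1.0, 0.0], [2.0, 2.0]])
print(point_symmetry_distance(X[0], np.zeros(2), X, 0))  # ~0.0
```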

3.3 CS validity index

Chien-Hsing Chou also proposed the CS validity index [10]. This index evaluates clustering results well when densities and sizes differ. The CS index is defined as follows:

$CS(c) = \dfrac{\frac{1}{c}\sum_{i=1}^{c}\left\{\frac{1}{|A_i|}\sum_{x_j\in A_i}\max_{x_k\in A_i}\{d(x_j,x_k)\}\right\}}{\frac{1}{c}\sum_{i=1}^{c}\left\{\min_{j\in c,\, j\neq i}\{d(v_i,v_j)\}\right\}}$   (5)

where $d$ is a distance function, $\max_{x_k\in A_i}\{d(x_j,x_k)\}$ measures the within-cluster scatter and $\min_{j\neq i}\{d(v_i,v_j)\}$ measures the between-cluster separation. Thus, the CS index measures the ratio of the within-cluster scatter to the between-cluster separation.
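For illustration, a minimal NumPy sketch of Eq. 5 follows; the helper and its name are ours, and we assume Euclidean distance and clusters given as lists of row indices into the data matrix.

```python
import numpy as np

def cs_index(X, clusters):
    """CS index (Eq. 5) for a hard partition.

    `clusters` is a list of index arrays, one per cluster; smaller values
    indicate compact and well-separated clusters.
    """
    centroids = np.array([X[idx].mean(axis=0) for idx in clusters])
    c = len(clusters)

    # Numerator: for each cluster, average of each point's distance to its
    # farthest companion in the same cluster.
    scatter = 0.0
    for idx in clusters:
        P = X[idx]
        d = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=-1)
        scatter += d.max(axis=1).mean()

    # Denominator: for each centroid, distance to the nearest other centroid.
    sep = 0.0
    for i in range(c):
        others = [np.linalg.norm(centroids[i] - centroids[j])
                  for j in range(c) if j != i]
        sep += min(others)

    return (scatter / c) / (sep / c)
```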

4 The novel cluster validity index

In this section, we describe a novel validity index called NIVA.

4.1 NIVA validity index definition

We first need to introduce the basic principles. Consider a partition of the data set $C = \{c_i \mid i = 1,\dots,N\}$ and the center $v_i$ of each cluster ($i = 1,2,\dots,N$), where $N$ is the number of clusters of $C$.

The NIVA cluster validity index works in two stages:

• First stage: known as local evaluation, it carries out a sub-clustering of the objects belonging to each cluster $c_i$, obtaining $l_i$ groups, as shown in Fig. 1.

For i = 1 up to N
   Calculate the l_i groups of set c_i, using the OSB clustering algorithm.
   Calculate the average compactness of the l_i groups using Eq. 8.
   Calculate the average separation of the l_i groups of c_i using Eq. 9.
End

Fig. 1. Steps of the first NIVA stage.

• Second stage: consists of calculating the NIVA index of the partition $C$. The NIVA validation index is defined as follows:

$NIVA(C) = \dfrac{Compac(C)}{SepxG(C)}$   (6)

- $Compac(C)$: average of the product of the compactness of the groups ($Esp(c_i)$, Eq. 8) and the separability between them ($SepxS(c_i)$, Eq. 9):

$Compac(C) = \frac{1}{N}\sum_{i=1}^{N} Esp(c_i)\cdot SepxS(c_i)$   (7)

with

$Esp(c_i) = \frac{1}{l_i}\sum_{k=1}^{l_i}\left[\frac{1}{n_i^k}\sum_{j=1}^{n_i^k} d(x_j, x_{j+1})\right]$   (8)

and

$SepxS(c_i) = \frac{1}{l_i}\sum_{k=1}^{l_i}\max_{j}\{d(sv_k, sv_j)\}$   (9)

where:
$l_i$ = number of subclusters of $c_i$;
$n_i^k$ = number of data points in the $k$-th subcluster;
$sv_k, sv_j$ = centers of the $k$-th and $j$-th subclusters of $c_i$;
$x_{j+1}$ = the nearest neighbor of $x_j$.

- $SepxG(C)$: average separability of the groups of $C$, calculated using Eq. 10:

$SepxG(C) = \frac{1}{N}\sum_{i=1}^{N}\min_{\substack{j\in C \\ j\neq i}}\{d(v_i, v_j)\}$   (10)

The smallest value of $NIVA(C)$ indicates the optimal partition among the different partitions under evaluation. The clustering algorithm used to find the subgroups of each $c_i$ is called OSB.

4.2 Clustering algorithm OSB

The OSB clustering algorithm uses a connectivity criterion in order to detect connected components: an object $x_j$ belongs to a cluster $k$ if and only if there is an object $x_{j,h}$ (the nearest neighbor of $x_j$) such that the Euclidean distance between the two objects is not greater than a calculated threshold, called the similarity threshold $st$. To calculate the similarity threshold we use a heuristic, which consists of updating the average of the distances between $x_j$ and $x_{j,h}$ each time an object $x_{j,h}$ enters cluster $k$.
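Putting Eqs. 6-10 together, the sketch below shows one way the index could be computed once the sub-clustering of every cluster is available (e.g., produced by OSB); the function names, the use of a k-d tree for nearest neighbors and the data layout are our own assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.spatial import cKDTree

def esp(sub):
    """Eq. 8 for one cluster: average, over its subclusters, of the mean
    distance from each point to its nearest neighbor in the same subcluster."""
    vals = []
    for P in sub:                          # P: points of one subcluster
        if len(P) < 2:
            vals.append(0.0)
            continue
        d, _ = cKDTree(P).query(P, k=2)    # column 1 = nearest other point
        vals.append(d[:, 1].mean())
    return float(np.mean(vals))

def sepxs(sub):
    """Eq. 9: average, over subcluster centers, of the distance to the
    farthest other subcluster center."""
    centers = np.array([P.mean(axis=0) for P in sub])
    if len(centers) < 2:
        return 0.0
    d = np.linalg.norm(centers[:, None] - centers[None, :], axis=-1)
    return float(d.max(axis=1).mean())

def niva(clusters, subclusters):
    """Eq. 6. `clusters` is a list of point arrays (one per cluster c_i);
    `subclusters[i]` is a list of point arrays partitioning cluster i,
    e.g. the groups produced by the OSB sub-clustering step."""
    compac = np.mean([esp(s) * sepxs(s) for s in subclusters])   # Eq. 7
    centers = np.array([c.mean(axis=0) for c in clusters])
    d = np.linalg.norm(centers[:, None] - centers[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    sepxg = d.min(axis=1).mean()                                 # Eq. 10
    return float(compac / sepxg)
```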

5. Experimental Results

In this section, NIVA is experimentally tested using the K-means algorithm. We used 12 synthetic data sets (see Tables 2 and 3). These data sets were used by Maria Halkidi [7] and Chien-Hsing Chou [8][10]; we chose them because we also compare against their validity indices [7][8][10].

5.1 The best partition

To find the best partition, we ran the K-means algorithm with its input parameter (the number of clusters) ranging between 2 and 8 and, to verify that we had really found the best partition, we also applied our validation index to the labeled data sets.
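As an illustration of this selection procedure, the following sketch runs K-means for k = 2, ..., 8 and keeps the partition with the smallest index value; scikit-learn's KMeans is used for convenience, and validity_index is a placeholder for any index whose smaller values indicate better partitions (such as NIVA), rather than a specific published implementation.

```python
from sklearn.cluster import KMeans

def best_k(X, validity_index, k_range=range(2, 9)):
    """Run K-means for each k and keep the partition with the smallest
    validity index value (smaller = better for indices such as NIVA or CS)."""
    best = None
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        score = validity_index(X, labels)
        if best is None or score < best[0]:
            best = (score, k, labels)
    return best  # (index value, chosen k, cluster labels)
```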

Table 1. Results of the NIVA validity index: for each of the 12 data sets, the NIVA value of the K-means partition for every K between 2 and 8, and, in the last column, the NIVA value obtained on the labeled data set together with its number of clusters.


Table 2. Synthetic data sets (DataSet 1-8)


(a) DataSet 1

(b) DataSet 2

(c)DataSet 3

(d) DataSet 4

(e)DataSet 5

(f) DataSet 6

(g) DataSet 7

(h) DataSet 8


Table 3. Synthetic data sets (DataSet 9-12)

(a) DataSet 9

(b) DataSet 10

(c)DataSet 11

(d) DataSet 12

Table 1 shows the NIVA values for the clusterings found by K-means; NIVA failed in two cases (data sets 8 and 9). On the other hand, to verify our results, we also applied NIVA to the labeled data sets; the values obtained are reported in the last column of Table 1, which shows that, in all cases, NIVA identified the best partition. In other words, NIVA was able to find the best partition among the results of the K-means algorithm and the labeled data sets.





5.2 Comparison with other validity indices

We compared NIVA with well-known validity indices proposed in the literature [7][8][10]: CS, PS and S_Dbw. For comparison purposes, we used data sets very similar to those used for the PS, CS and S_Dbw indices. Table 4 summarizes the results of the CS, PS, S_Dbw and NIVA validation indices. For this study, we used the results of the K-means algorithm with its input parameter ranging between 2 and 8, as well as the labeled data sets. As in the previous experiment, we used the 12 data sets of Tables 2 and 3. The PS index made four mistakes, CS made two, S_Dbw made six, and NIVA found the correct number of clusters in all cases. It is important to note that our index failed on data sets 8 and 9 (see Table 1) when applied to the clustering results of K-means; however, when the labeled data sets are included, NIVA obtains the optimal K (number of clusters) in all cases.

The following tests were additionally run:
• For clusters with different geometrical shapes (NIVA vs. PS), the comparison with the PS index was carried out using data sets DataSet2 and DataSet3. Both indices obtained the correct number of clusters (5 and 3).
• For groups with different densities and sizes, as well as different separability (NIVA vs. CS), the comparison with the CS index was carried out using data sets DataSet4 and DataSet6; NIVA found the correct numbers of clusters, 3 and 4.
• For clusters with different compactness and separability (NIVA vs. S_Dbw), the comparison with the S_Dbw index was carried out using data sets DataSet7 and DataSet8, which have compact and well-separated clusters. NIVA found the correct number of clusters (2) for DataSet7. For DataSet8, NIVA did not find the correct number of clusters among the K-means results; however, when the labeled data set is included, it finds the correct number of clusters (7).

Table 4. Results of the comparisons: number of data sets (out of 12) for which each index identified the correct number of clusters. PS: 8, CS: 10, S_Dbw: 6, NIVA: 12.

6. Conclusions and further work

In this paper, we have defined a novel cluster validity index called NIVA to find the best partition among the results of clustering algorithms. The NIVA index handles groups with different densities, sizes and shapes. To do so, the compactness of the data set is measured using the connectivity among the data of each cluster, whereas the separation among clusters is measured by the minimum distance between cluster centers. We used 12 data sets to carry out the experiments. It is important to note that we used the same data sets that were used by Maria Halkidi [7] and Chien-Hsing Chou [8][10], so that we could compare against their indices. The results obtained by the NIVA index were encouraging, because it found the best partition in every case. The performance of NIVA was compared with three popular validation indices, and it always obtained better results. Finally, we expect to improve these results by carrying out more experiments and by applying NIVA to the results of other clustering algorithms, such as CURE.

References:
[1] Jain, A. K., Murty, M. N., Flynn, P. J. Data clustering: A review. ACM Computing Surveys, 31(3), 1999, pp. 264-323.
[2] Dunn, J. C. A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters. Journal of Cybernetics, 3, 1973, pp. 32-57.
[3] Bouguessa, M., Wang, S., Sun, H. An objective approach to cluster validation. Pattern Recognition Letters, 27(13), 2006, pp. 1419-1430.
[4] Pal, N. R., Biswas, J. Cluster validation using graph theoretic concepts. Pattern Recognition, 30(6), 1997, pp. 847-857.
[5] Bezdek, J. C., Pal, N. R. Some new indexes of cluster validity. IEEE Trans. on Systems, Man, and Cybernetics, Part B, 28(3), 1998, pp. 301-315.
[6] Davies, D. L., Bouldin, D. W. A cluster separation measure. IEEE Trans. on Pattern Analysis and Machine Intelligence, 1(2), 1979, pp. 224-227.
[7] Halkidi, M., Vazirgiannis, M. Quality scheme assessment in the clustering process. In Proc. PKDD (Principles and Practice of Knowledge Discovery in Databases), Lyon, France, Lecture Notes in Artificial Intelligence, vol. 1910, Springer-Verlag, 2000, pp. 265-279.
[8] Chou, C. H., Su, M. C., Lai, E. Symmetry as a new measure for cluster validity. In 2nd WSEAS Int. Conf. on Scientific Computation and Soft Computing, Crete, Greece, 2002, pp. 209-213.
[9] Su, M. C., Chou, C. H. A modified version of the K-means algorithm with a distance based on cluster symmetry. IEEE Trans. on Pattern Analysis and Machine Intelligence, 23(6), 2001, pp. 674-680.
[10] Chou, C. H., Su, M. C., Lai, E. A new validity measure for clusters with different densities. Pattern Analysis and Applications, 7, 2004, pp. 205-220.
[11] Theodoridis, S., Koutroumbas, K. Pattern Recognition. Academic Press, USA, 1999.
[12] Roth, V., Lange, T., Braun, M., Buhmann, J. A resampling approach to cluster validation. In Proceedings in Computational Statistics (COMPSTAT 2002), Physica-Verlag, 2002, pp. 123-128.
[13] Vakali, A., Pokorný, J., Dalamagas, T. An overview of web data clustering practices. Lecture Notes in Computer Science, vol. 3268, 2005, pp. 597-606.
[14] de Hoon, M. J. L., Imoto, S., Nolan, J., Miyano, S. Open source clustering software. Bioinformatics, 20(9), 2004, pp. 1453-1454.
[15] Guha, S., Rastogi, R., Shim, K. CURE: An efficient clustering algorithm for large databases. In Proceedings of the ACM SIGMOD Conference on Management of Data, Seattle, Washington, USA, June 1-4, 1998, pp. 73-83.
[16] MacQueen, J. B. Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, University of California Press, 1967, pp. 281-297.
[17] Drewes, B. Some industrial applications of text mining. In Knowledge Mining, vol. 185, Springer Berlin, 2005, pp. 233-238.