NbClust package for determining the number of clusters in a dataset

Malika Charrad 1,2,3, Nadia Ghazzali 1, Veronique Boiteau 1, Azam Niknafs 1

1 Laval University, Quebec
2 RIADI laboratory, ENSI, La Manouba University
3 ISIMED, Gabes University

ABSTRACT. A wide variety of indices have been proposed to find the optimal number of clusters in a partitioning of a data set during the clustering process. However, for most of the indices proposed in the literature, no program is available to test and compare them. The R package NbClust has been developed for that purpose. It provides 30 indices for determining the number of clusters in a data set, and it also offers the user the best clustering scheme from the different results. In addition, it provides a function to perform k-means and hierarchical clustering with different distance measures and aggregation methods. Any combination of validation indices and clustering methods can be requested in a single function call. This enables the user to simultaneously evaluate several clustering schemes while varying the number of clusters, and so helps determine the most appropriate number of clusters for the dataset of interest.

KEYWORDS: R package, cluster validity, number of clusters, clustering, indices, k-means, hierarchical clustering

1. Introduction and related work

A variety of measures aiming to validate the results of a clustering analysis have been defined and proposed in the literature. Indeed, [MIL 85] examined thirty indices on simulated data where the number of clusters is known beforehand. Thirteen of these indices are programmed in the cclust [DIM 12] and clusterSim [WAL 12] packages. In addition to the indices described in the Milligan and Cooper study, [DUN 74] introduced a validity index based on the distance between clusters and the diameter of the clusters, and Rousseeuw and Kaufman proposed the "silhouette statistic" ([ROU 87], [KAU 90]). More recently, [TIB 01] proposed the "gap statistic", [LEB 00] proposed a criterion based on the first and second derivatives, and Halkidi et al. (2000, 2001) proposed two indices: the SD index, based on the concepts of average scattering for clusters and total separation between clusters [HAL 00], and the SDbw index, based on the criteria of compactness and separation between clusters [HAL 01]. However, only nineteen of the indices mentioned above are implemented in the SAS cluster function [INC 12] and in R packages [TEA 11]: cclust [DIM 12], clusterSim [WAL 12], clv [NIE 12] and clValid [BRO 12].

In this paper, we present a novel R package, NbClust, which aims to gather all indices available in SAS or in R packages into a single package, and to include indices which are not implemented anywhere else, in order to provide an exhaustive list of validity indices for estimating the number of clusters in a dataset. In the NbClust package, validity indices can be applied to the outputs of two clustering algorithms, k-means and Hierarchical Agglomerative Clustering (HAC), by varying all combinations of number of clusters, distance measures and clustering methods. The distance measures available in the NbClust package are: Euclidean distance, Maximum distance, Manhattan distance, Canberra distance, Binary distance and Minkowski distance. Several clustering methods are also provided by the NbClust package, namely: Ward [WAR 63], Single [FLO 51], [SOK 58], Complete [SOR 48], Average [SOK 58], McQuitty [MCQ 66], Median [GOW 67] and Centroid [SOK 58].

One important benefit of NbClust is that the user can simultaneously select multiple indices and numbers of clusters in a single function call. Moreover, the package offers the user the best clustering scheme from the different results. The package is available from the Comprehensive R Archive Network at http://cran.r-project.org.

The remainder of the paper is organised as follows. Section 2 provides the list of validation measures available in the NbClust package. Section 3 uses a simulated dataset to illustrate the use of NbClust functions and objects. A brief conclusion follows in Section 4.
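To make these distance/linkage combinations concrete, the following minimal sketch reproduces one such pairing with base R functions only; the use of the iris data and the choice of Manhattan distance with average linkage are illustrative assumptions, not taken from the package examples.

x <- scale(iris[, 1:4])               # any numeric data matrix
d <- dist(x, method = "manhattan")    # one of the distance measures listed above
hc <- hclust(d, method = "average")   # one of the agglomeration methods listed above
clusters <- cutree(hc, k = 3)         # cut the dendrogram into 3 groups
table(clusters)                       # cluster sizes

NbClust essentially automates such steps, varying the number of clusters and scoring each resulting partition with the selected validity indices.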

2. Clustering validity indices

In most real-life clustering situations, the user faces the dilemma of selecting the number of clusters or partitions in the underlying data. For this reason, numerous indices for determining the number of clusters in a data set have been proposed. All these clustering validity indices combine information about intracluster compactness and intercluster isolation, as well as other factors such as geometric or statistical properties of the data, the number of data objects and the dissimilarity or similarity measure. Table 1 presents the indices implemented in NbClust.
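To illustrate how such indices combine compactness and separation, the sketch below computes one of them, the Calinski and Harabasz index [CAL 74], by hand for a k-means partition. The function name ch_index and the use of the iris data are illustrative choices only; NbClust computes this index internally.

ch_index <- function(x, cluster) {
  x <- as.matrix(x)
  n <- nrow(x)
  k <- length(unique(cluster))
  grand_mean <- colMeans(x)
  groups <- split(as.data.frame(x), cluster)
  # Within-cluster sum of squares: compactness of the clusters
  W <- sum(sapply(groups, function(g) sum(sweep(as.matrix(g), 2, colMeans(g))^2)))
  # Between-cluster sum of squares: separation of the cluster centroids
  B <- sum(sapply(groups, function(g) nrow(g) * sum((colMeans(g) - grand_mean)^2)))
  (B / (k - 1)) / (W / (n - k))
}

set.seed(1)
km <- kmeans(iris[, 1:4], centers = 3, nstart = 25)
ch_index(iris[, 1:4], km$cluster)   # larger values indicate a better partition

Many of the indices in Table 1 follow the same pattern: a within-cluster dispersion term, a between-cluster term, and a normalisation that accounts for the number of clusters.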

3. Finding the relevant number of clusters using NbClust

In this section, we show how the NbClust package works. We consider a simulated dataset composed of 4 distinct, non-overlapping clusters (Figure 1). The dataset consists of 300 points and the clusters are embedded in a two-dimensional Euclidean space.

Figure 1. Simulated dataset plot.
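The exact simulated data are not reproduced here; the sketch below merely generates a comparable dataset, 300 points drawn from 4 well-separated bivariate Gaussian clusters, so that the calls shown in this section can be tried. The cluster centres and the standard deviation are arbitrary choices.

set.seed(42)
centers <- matrix(c(0, 0,  6, 0,  0, 6,  6, 6), ncol = 2, byrow = TRUE)
data <- do.call(rbind, lapply(1:4, function(i)
  cbind(rnorm(75, mean = centers[i, 1], sd = 0.5),
        rnorm(75, mean = centers[i, 2], sd = 0.5))))
plot(data, xlab = "x", ylab = "y")   # should resemble Figure 1: four compact groups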

In R, a typical call for using NbClust is

R> library(NbClust)
R> NbClust(data, diss = "NULL", distance = "euclidean", min.nc = 2, max.nc = 8,
           method = "complete", index = "alllong", alphaBeale = 0.1)

The function documentation, with explicit instructions on the input arguments, is available via the command help(NbClust). Our goal is to cluster the rows of the data matrix based on the columns (variables) and to evaluate the ability of the available indices to identify the optimal number of clusters in the underlying data. The number of clusters varies from 2 to 8. The distance metric is set to "euclidean"; other available options are "maximum", "manhattan", "canberra", "binary" and "minkowski". The agglomeration method for hierarchical clustering is set to "ward"; it is also possible to select another method such as "complete", "single", "mcquitty", "average", "median" or "centroid".
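For instance, a call evaluating the same data with Manhattan distance and average linkage could look as follows; the argument names mirror the call shown above, and the result object res is our own naming.

R> res <- NbClust(data, diss = "NULL", distance = "manhattan", min.nc = 2,
                  max.nc = 8, method = "average", index = "all",
                  alphaBeale = 0.1)
R> res$All.index   # index values for each number of clusters
R> res$Best.nc     # number of clusters and index value proposed by each index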

Name of the index in NbClust    Name of the index in the literature    Reference
 1. "ch"                        Calinski and Harabasz                  [CAL 74]
 2. "duda"                      Je(2)/Je(1)                            [DUD 73]
 3. "pseudot2"                  Pseudo t2                              [DUD 73]
 4. "cindex"                    C-index                                [HUB 76]
 5. "gamma"                     Gamma                                  [BAK 75]
 6. "beale"                     Beale                                  [BEA 69]
 7. "ccc"                       Cubic Clustering Criterion (CCC)       [SAR 83]
 8. "ptbiserial"                Point-Biserial                         [KRA 82]
 9. "gplus"                     G(+)                                   [ROH 74]
10. "db"                        Davies and Bouldin                     [DAV 79]
11. "frey"                      Frey and Van Groenewoud                [FRE 72]
12. "hartigan"                  Hartigan                               [HAR 75]
13. "tau"                       Tau                                    [ROH 74]
14. "ratkowsky"                 c / k^(1/2)                            [RAT 78]
15. "scott"                     n log(|T|/|W|)                         [SCO 71]
16. "marriot"                   k^2 |W|                                [MAR 71]
17. "ball"                      Ball and Hall                          [BAL 65]
18. "trcovw"                    Trace Cov W                            [MIL 85]
19. "tracew"                    Trace W                                [EDW 65], [FRI 67]
20. "friedman"                  Trace W^-1 B                           [FRI 67]
21. "mcclain"                   McClain and Rao                        [MCC 75]
22. "rubin"                     |T|/|W|                                [FRI 67]
23. "kl"                        KL                                     [KRZ 88]
24. "silhouette"                Silhouette                             [ROU 87]
25. "gap"                       Gap                                    [TIB 01]
26. "dindex"                    D index                                [LEB 00]
27. "dunn"                      Dunn                                   [DUN 74]
28. "hubert"                    Modified statistic of Hubert           [HUB 85]
29. "sdindex"                   SD                                     [HAL 00]
30. "sdbw"                      SDbw                                   [HAL 01]

Table 1. Overview of the indices implemented in the NbClust package.

The user can request indices one by one by setting the "index" argument to the name of the index as listed in Table 1, for example index = "gap". In this case, as shown in the example below, the NbClust function displays the gap values of the partitions obtained with the number of clusters ranging from min.nc to max.nc (from 2 to 8 in this example), the critical values of the gap index for each partition, and the best number of clusters, which corresponds to the smallest number of clusters such that the critical value is positive (4 clusters in this example).

R> library(NbClust)
R> NbClust(data, diss = "NULL", distance = "euclidean", min.nc = 2, max.nc = 8,
           method = "ward", index = "gap", alphaBeale = 0.1)
[1] "All 300 observations were used."

All.index
 nc.Ward  index.Gap
       2  0.8890281
       3  1.6225459
       4  2.5829161
       5  2.4931972
       6  2.4361195
       7  2.3502468
       8  2.3008353

All.CriticalValues
 nc.CritValue  CritValueGap
            2   -0.65922165
            3   -0.89243989
            4    0.14245740
            5    0.13974792
            6    0.13106129
            7    0.10150185
            8    0.07891177

Best.nc
                  [,1]
Number clusters 4.0000
Value Index     2.5829
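The decision rule just described can be applied directly to the critical values printed above; the short sketch below hard-codes them purely for illustration.

crit_value <- c(-0.65922165, -0.89243989, 0.14245740, 0.13974792,
                0.13106129, 0.10150185, 0.07891177)
nc <- 2:8
nc[which(crit_value > 0)[1]]   # smallest number of clusters with a positive critical value: 4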

Clustering with the "index" argument set to "alllong" requires more time, as the computation of some measures, such as Gamma, Tau, Gap and Gplus, is very expensive, especially when the number of clusters and the number of objects in the data set grow large. The user can avoid running these four indices by setting the "index" argument to "all"; in this case, only 26 indices are computed. With the "alllong" option, the output of the NbClust function consists of all validation measures, the critical values for the Duda, Gap, PseudoT2 and Beale indices, the number of clusters corresponding to the optimal score for each measure, and the best number of clusters proposed by NbClust according to the majority rule.

R> NbClust(data, diss = "NULL", distance = "euclidean", min.nc = 2, max.nc = 8,
           method = "complete", index = "alllong", alphaBeale = 0.1)
[1] "*** : The Hubert index is a graphical method of determining the number of clusters. In the plot of the Hubert index, we seek a significant knee that corresponds to a significant increase of the value of the measure, i.e. the significant peak in the Hubert index second differences plot."
[1] "*** : The D index is a graphical method of determining the number of clusters. In the plot of the D index, we seek a significant knee (the significant peak in the Dindex second differences plot) that corresponds to a significant increase of the value of the measure."
[1] "All 300 observations were used."
[1] "*******************************************************************"
[1] "* Among all indices:"
[1] "* 6 proposed 2 as the best number of clusters"
[1] "* 4 proposed 3 as the best number of clusters"
[1] "* 17 proposed 4 as the best number of clusters"
[1] "* 1 proposed 7 as the best number of clusters"
[1] "                  ***** Conclusion *****"
[1] "* According to the majority rule, the best number of clusters is 4"
[1] "*******************************************************************"

Best.nc
                   Number clusters   Value Index
index.KL                  4             13.0963
index.CH                  4           4824.085
index.Hartigan            4            815.8783
index.CCC                 4             31.0381
index.Scott               4            555.0816
index.TraceW              3             38.3923
index.Friedman            4            133.8202
index.Rubin               4           -106.1282
index.Cindex              4              0.2755
index.DB                  2              0.3565
index.PseudoT2            4             34.6947
index.Beale               4              0.4626
index.Ratkowsky           2              0.6124
index.Ball                3             32.8865
index.PtBiserial          2              0.8007
index.McClain             2              0.1873
index.Gamma               4              0.9988
index.Gplus               4              5.1441
index.Tau                 3           9792.501
index.Dunn                2              0.4799
index.Duda                4              0.6808
index.Frey                2              1.4313
index.Dindex              0              0
index.SDbw                4              0.114
index.TrCovW              7              1.9932
index.Marriot             4            506.3085
index.Silhouette          4              0.7603
index.Gap                 4              2.6761
index.Hubert              0              0
index.SDindex             3              2.9681

Figure 2. Graphical methods for determining the best number of clusters in a simulated dataset.

The Dindex and the Hubert index are graphical methods; hence, their values are set to zero in the example above. In this case, the optimal number of clusters is identified by a significant knee in the plot of index values against the number of clusters. This knee corresponds to a significant increase or decrease of the index as the number of clusters varies from the minimum to the maximum. In the NbClust package, a significant peak in the plot of second-difference values indicates the relevant number of clusters. As shown in Figure 2, the Hubert index proposes 3 as the best number of clusters and the Dindex proposes 4.
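The second-differences heuristic can be sketched as follows; the vector of index values is invented for illustration and mimics an index that drops sharply up to 4 clusters, which is where the second differences peak.

index_values <- c(10.2, 7.8, 3.1, 2.9, 2.8, 2.7, 2.6)   # hypothetical index values for k = 2, ..., 8
k <- 2:8
second_diff <- diff(index_values, differences = 2)      # one value for each k = 3, ..., 7
k[2:6][which.max(second_diff)]                          # peak of the second differences: k = 4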

The results presented in the example above indicate that there is no unanimous choice for the optimal number of clusters. Indeed, 17 of the 30 indices propose 4 as the best number of clusters, 4 indices propose 3, 6 indices select 2, and only one index proposes 7. Consequently, the user is faced with the dilemma of choosing one among four candidate solutions (2, 3, 4 or 7 clusters). There are two ways to deal with this problem. The first is the majority rule, which is available in the NbClust package: the optimal number of clusters would be 4, as it is selected by 17 of the 30 indices, and this is indeed the correct number of clusters. The second option is to consider only the indices that performed best in simulation studies. For example, the five top performers in the Milligan and Cooper study are the CH index, the Duda index, the Cindex, Gamma and Beale. As shown in the example above, these five indices all propose 4 as the optimal number of clusters, which is the correct number for this dataset.
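Both strategies are easy to script from the object returned by NbClust. The sketch below assumes that the Best.nc component is laid out as in recent versions of the package, i.e. a matrix whose first row holds the number of clusters proposed by each index; this layout is an assumption and should be checked against the installed version.

res <- NbClust(data, diss = "NULL", distance = "euclidean", min.nc = 2,
               max.nc = 8, method = "complete", index = "all",
               alphaBeale = 0.1)
votes <- table(res$Best.nc[1, ])             # how many indices propose each number of clusters
votes                                        # graphical indices (Hubert, Dindex) appear as 0
as.integer(names(votes)[which.max(votes)])   # majority-rule choice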

4. Conclusion

In this paper, we have presented the validity indices implemented in the NbClust package. One major advantage of this new package is that it provides an exhaustive list of indices, most of which were not previously implemented in any R package. The current version of NbClust contains 30 indices. It enables the user to simultaneously vary the number of clusters, the clustering method and the indices in order to decide how best to group the observations in a dataset. Moreover, for each index, NbClust proposes the best number of clusters from the different results. The user can thus compare all indices and clustering methods. Lastly, implementing the validation measures within the R package NbClust has the additional advantage that it can interface with numerous clustering algorithms in existing R packages. Hence, the NbClust package is a valuable addition to the growing collection of cluster validation software available to researchers. As with many other software packages, NbClust is continually being augmented and improved. A future direction is to expand the functionality of NbClust to other clustering algorithms such as self-organizing maps.

Acknowledgements

The authors would like to thank the NSERC-Industrial Alliance Chair for Women in Science and Engineering in Quebec for supporting this research.

5. References

[BAK 75] BAKER F. B., HUBERT L. J., "Measuring the Power of Hierarchical Cluster Analysis", Journal of the American Statistical Association, vol. 70, num. 349, 1975, p. 31-38.
[BAL 65] BALL G. H., HALL D. J., "ISODATA: A Novel Method of Data Analysis and Pattern Classification", Menlo Park: Stanford Research Institute (NTIS No. AD 699616), 1965.
[BEA 69] BEALE E. M. L., Cluster Analysis, Scientific Control Systems, London, 1969.
[BRO 12] BROCK G., PIHUR V., DATTA S., "clValid: Validation of Clustering Results", 2012, R package version 0.6-4.
[CAL 74] CALINSKI T., HARABASZ J., "A Dendrite Method for Cluster Analysis", Communications in Statistics - Theory and Methods, vol. 3, num. 1, 1974, p. 1-27.
[DAV 79] DAVIES D. L., BOULDIN D. W., "A Cluster Separation Measure", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 1, num. 2, 1979, p. 224-227.
[DIM 12] DIMITRIADOU E., "cclust: Convex Clustering Methods and Clustering Indexes", 2012, R package version 0.6-16.
[DUD 73] DUDA R. O., HART P. E., Pattern Classification and Scene Analysis, John Wiley & Sons, New York, USA, 1973.
[DUN 74] DUNN J., "Well Separated Clusters and Optimal Fuzzy Partitions", Journal of Cybernetics, 1974, p. 95-104.
[EDW 65] EDWARDS A. W. F., CAVALLI-SFORZA L., "A Method for Cluster Analysis", Biometrics, vol. 21, num. 2, 1965, p. 362-375.
[FLO 51] FLOREK K., LUKASZEWICZ J., PERKAL J., ZUBRZYCKI S., "Sur la Liaison et la Division des Points d'un Ensemble Fini", Colloquium Mathematicae, 1951, p. 282-285.
[FRE 72] FREY T., VAN GROENEWOUD H., "A Cluster Analysis of the D-Squared Matrix of White Spruce Stands in Saskatchewan Based on the Maximum-Minimum Principle", Journal of Ecology, vol. 60, num. 3, 1972, p. 873-886.
[FRI 67] FRIEDMAN H. P., RUBIN J., "On Some Invariant Criteria for Grouping Data", Journal of the American Statistical Association, vol. 62, num. 320, 1967, p. 1159-1178.
[GOW 67] GOWER J., "A Comparison of Some Methods of Cluster Analysis", Biometrics, 1967, p. 623-637.
[HAL 00] HALKIDI M., VAZIRGIANNIS M., BATISTAKIS I., "Quality Scheme Assessment in the Clustering Process", PKDD 2000, 2000, p. 265-276.
[HAL 01] HALKIDI M., VAZIRGIANNIS M., "Clustering Validity Assessment: Finding the Optimal Partitioning of a Data Set", ICDM'01 Proceedings of the 2001 IEEE International Conference on Data Mining, 2001, p. 187-194.
[HAR 75] HARTIGAN J. A., Clustering Algorithms, John Wiley & Sons, New York, NY, USA, 1975.
[HUB 76] HUBERT L. J., LEVIN J. R., "A General Statistical Framework for Assessing Categorical Clustering in Free Recall", Psychological Bulletin, vol. 83, num. 6, 1976, p. 1072-1080.
[HUB 85] HUBERT L. J., ARABIE P., "Comparing Partitions", Journal of Classification, vol. 2, 1985, p. 193-218.
[INC 12] SAS INSTITUTE INC., "SAS/STAT(R) 12.1 User's Guide", Cary, NC, 2012.
[KAU 90] KAUFMAN L., ROUSSEEUW P., Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley & Sons, New York, NY, USA, 1990.
[KRA 82] KRAEMER H. C., "Biserial Correlation", John Wiley & Sons, 1982, p. 276-279.
[KRZ 88] KRZANOWSKI W. J., LAI Y. T., "A Criterion for Determining the Number of Groups in a Data Set Using Sum-of-Squares Clustering", Biometrics, vol. 44, num. 1, 1988, p. 23-34.
[LEB 00] LEBART L., MORINEAU A., PIRON M., Statistique Exploratoire Multidimensionnelle, Dunod, Paris, France, 2000.
[MAR 71] MARRIOT F. H. C., "Practical Problems in a Method of Cluster Analysis", Biometrics, vol. 27, num. 3, 1971, p. 501-514.
[MCC 75] MCCLAIN J. O., RAO V. R., "CLUSTISZ: A Program to Test for the Quality of Clustering of a Set of Objects", Journal of Marketing Research, vol. 12, num. 4, 1975, p. 456-460.
[MCQ 66] MCQUITTY L., "Similarity Analysis by Reciprocal Pairs for Discrete and Continuous Data", Educational and Psychological Measurement, vol. 26, 1966, p. 825-831.
[MIL 85] MILLIGAN G., COOPER M., "An Examination of Procedures for Determining the Number of Clusters in a Data Set", Psychometrika, vol. 50, num. 2, 1985, p. 159-179.
[NIE 12] NIEWEGLOWSKI L., "clv: Cluster Validation Techniques", 2012, R package version 0.3-2.
[RAT 78] RATKOWSKY D. A., LANCE G. N., "A Criterion for Determining the Number of Groups in a Classification", Australian Computer Journal, vol. 10, 1978, p. 115-117.
[ROH 74] ROHLF F. J., "Methods of Comparing Classifications", Annual Review of Ecology and Systematics, vol. 5, 1974, p. 101-113.
[ROU 87] ROUSSEEUW P., "Silhouettes: A Graphical Aid to the Interpretation and Validation of Cluster Analysis", Journal of Computational and Applied Mathematics, vol. 20, 1987, p. 53-65.
[SAR 83] SARLE W. S., "SAS Technical Report A-108: Cubic Clustering Criterion", SAS Institute Inc., Cary, NC, 1983.
[SCO 71] SCOTT A. J., SYMONS M. J., "Clustering Methods Based on Likelihood Ratio Criteria", Biometrics, vol. 27, num. 2, 1971, p. 387-397.
[SOK 58] SOKAL R., MICHENER C., "A Statistical Method for Evaluating Systematic Relationships", University of Kansas Science Bulletin, vol. 38, 1958, p. 1409-1438.
[SOR 48] SORENSEN T., "A Method of Establishing Groups of Equal Amplitude in Plant Sociology Based on Similarity of Species and its Application to Analyses of the Vegetation on Danish Commons", Biologiske Skrifter, vol. 5, 1948, p. 1-34.
[TEA 11] R DEVELOPMENT CORE TEAM, "R: A Language and Environment for Statistical Computing", Vienna, Austria, 2011.
[TIB 01] TIBSHIRANI R., WALTHER G., HASTIE T., "Estimating the Number of Clusters in a Data Set Via the Gap Statistic", Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 63, num. 2, 2001, p. 411-423.
[WAL 12] WALESIAK M., DUDEK A., "clusterSim: Searching for Optimal Clustering Procedure for a Data Set", 2012, R package version 0.41-8.
[WAR 63] WARD J., "Hierarchical Grouping to Optimize an Objective Function", Journal of the American Statistical Association, vol. 58, num. 301, 1963, p. 236-244.
