NbClust Package: finding the relevant number of clusters in a dataset

Malika Charrad, Nadia Ghazzali, Véronique Boiteau, Azam Niknafs
Laval University, Quebec, Canada

June 13th, 2012

useR! 2012, Nashville

Outline

1. Introduction
2. NbClust package
3. Examples
4. Conclusion

Introduction

Clustering is the task of assigning a set of objects to groups (clusters) so that objects in the same cluster are more similar to each other than to objects in other clusters. Most clustering algorithms depend on input parameters such as the number of clusters, the minimum number of objects in a cluster, or the diameter of a cluster.

⇒ Selecting different parameter values leads to different clusterings of the data.
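As a small illustration of this dependence on parameters, the sketch below cuts one hierarchical clustering at three different numbers of clusters. It uses only base R; the simulated data and the choice of complete linkage are assumptions made for the example, not taken from the talk.

```r
# Simulated data: three groups of 30 points in two dimensions (illustrative only).
set.seed(1)
x <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 4, sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 8, sd = 0.5), ncol = 2)
)

# One hierarchical clustering, cut at k = 2, 3 and 4 clusters.
hc <- hclust(dist(x), method = "complete")
partitions <- cutree(hc, k = 2:4)  # one column of cluster labels per value of k

# The cluster sizes change with k: same data, different partitions.
for (k in colnames(partitions)) {
  cat("k =", k, "-> cluster sizes:", table(partitions[, k]), "\n")
}
```

Each column of `partitions` is a valid clustering of the same 90 points, and nothing in the algorithm itself says which value of k to prefer.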


How many clusters are there in the dataset?


How to select the best number of clusters in a dataset?

If the parameters of the clustering algorithm are assigned improper values, the clustering method may produce a partitioning scheme that is not optimal ⇒ wrong decisions. The user therefore faces the dilemma of selecting the number of clusters in the dataset. The problem of deciding the number of clusters that best fits a dataset, together with the evaluation of the clustering results, is known as cluster validity.
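To make the idea of a validity index concrete, here is a minimal sketch that computes the Calinski-Harabasz (CH) index by hand for several candidate numbers of clusters and keeps the one with the largest value. The helper `ch_index`, the simulated data and the use of complete linkage are assumptions for illustration; this is not the NbClust implementation.

```r
# Four well-separated groups of 50 points each (illustrative data).
set.seed(42)
centers <- rbind(c(0, 0), c(0, 10), c(10, 0), c(10, 10))
x <- do.call(rbind, lapply(1:4, function(i) {
  cbind(rnorm(50, centers[i, 1], 0.5), rnorm(50, centers[i, 2], 0.5))
}))

# Hypothetical helper: CH = (B / (k - 1)) / (W / (n - k)), where
# W is the within-cluster and B the between-cluster sum of squares.
ch_index <- function(x, cl) {
  n <- nrow(x)
  k <- length(unique(cl))
  total_ss <- sum(scale(x, scale = FALSE)^2)
  within_ss <- sum(sapply(split(as.data.frame(x), cl), function(g) {
    sum(scale(g, scale = FALSE)^2)
  }))
  between_ss <- total_ss - within_ss
  (between_ss / (k - 1)) / (within_ss / (n - k))
}

# Evaluate the index for every candidate number of clusters.
hc <- hclust(dist(x), method = "complete")
ks <- 2:8
ch <- sapply(ks, function(k) ch_index(x, cutree(hc, k = k)))
best_k <- ks[which.max(ch)]

print(round(ch, 1))
cat("Number of clusters with the largest CH value:", best_k, "\n")
```

A single index is only one opinion; the point of comparing many indices is that they do not always agree.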


Related work

Milligan and Cooper (1985) examined 30 indices with simulated data. Other criteria were not examined in the Milligan and Cooper study, such as:

- Dunn index (Dunn, 1974)
- Silhouette statistic (Rousseeuw, 1987)
- Gap statistic (Tibshirani, 2001)
- Dindex (Lebart, 2000)
- SD index and SDbw index (Halkidi et al., 2000, 2001)
- Hubert statistic (Hubert and Arabie, 1985)


⇒ Among all existing indices, 19 are implemented in SAS and in the R packages cclust, clusterSim, clv and clValid.


NbClust package

1. NbClust package provides 30 indices to determine the number of clusters, including 11 other indices:

   - "duda": Duda and Hart (1973)
   - "beale": Beale (1969)
   - "gplus": Rohlf (1974), Milligan (1981)
   - "frey": Frey and Van Groenewoud (1972)
   - "tau": Rohlf (1974), Milligan (1981)
   - "mcclain": McClain and Rao (1975)
   - "gap": Tibshirani (2001)
   - "dindex": Lebart (2000)
   - "hubert": Hubert and Arabie (1985)
   - "sdindex": Halkidi et al. (2000)
   - "sdbw": Halkidi et al. (2001)

2. NbClust offers the user the best clustering scheme among different results.

NbClust function

NbClust(data, diss="NULL", distance="euclidean", min.nc=2, max.nc=15, method="ward", index="all", alphaBeale=0.1)

Arguments:

- data: matrix or data set.
- diss: dissimilarity matrix to be used. By default, diss="NULL", but if it is replaced by a dissimilarity matrix, distance should be "NULL".
- distance: the distance measure to be used to compute the dissimilarity matrix. This must be one of: "euclidean", "maximum", "manhattan", "canberra", "binary", "minkowski" or "NULL".
- min.nc: minimum number of clusters, between 2 and (number of objects - 1).
- max.nc: maximum number of clusters, between 2 and (number of objects - 1), greater than or equal to min.nc.
- method: the cluster analysis method to be used. Available methods are: "ward", "single", "complete", "average", "mcquitty", "median", "centroid", "kmeans".
- index: the index to be calculated. This should be one of: "kl", "ch", "hartigan", "ccc", "scott", "marriot", "trcovw", "tracew", "friedman", "rubin", "cindex", "db", "silhouette", "duda", "pseudot2", "beale", "ratkowsky", "ball", "ptbiserial", "gap", "frey", "mcclain", "gamma", "gplus", "tau", "dunn", "hubert", "sdindex", "dindex", "sdbw", or one of the following:
  - "alllong": all indices included.
  - "all": all indices except Gap, Gamma, Gplus and Tau.
- alphaBeale: significance value for Beale's index.
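A call with the arguments described above might look like the following sketch. It assumes the NbClust package is installed (the call is skipped otherwise) and uses simulated data invented for the example. The values "complete" and "ch" are taken from the lists above; later releases of the package may spell some argument values or defaults differently from the 2012 signature shown in the talk.

```r
# Illustrative data: four groups of 50 points in two dimensions.
set.seed(123)
centers <- rbind(c(0, 0), c(0, 6), c(6, 0), c(6, 6))
x <- do.call(rbind, lapply(1:4, function(i) {
  cbind(rnorm(50, centers[i, 1], 0.5), rnorm(50, centers[i, 2], 0.5))
}))

res <- NULL
if (requireNamespace("NbClust", quietly = TRUE)) {
  # Complete-linkage clustering, evaluated with the CH index for 2 to 8 clusters.
  res <- NbClust::NbClust(data = x, distance = "euclidean",
                          min.nc = 2, max.nc = 8,
                          method = "complete", index = "ch")
  print(res$Best.nc)  # number of clusters proposed by the index and its value
} else {
  message("NbClust is not installed; skipping the call.")
}
```

Requesting a single index keeps the run fast; index="all" computes most of the indices in one call at the cost of a longer run.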


Examples

Example 1: Simulated dataset with 2 variables and 4 clusters


NbClust output: Gap index


NbClust output: "alllong" option

[All.index] Values of the indices for each partition of the dataset obtained with a number of clusters between min.nc and max.nc.


NbClust output: Critical values

[All.CriticalValues] Critical values of some indices for each partition obtained with a number of clusters between min.nc and max.nc.


NbClust output: Best number of clusters

[Best.nc] Best number of clusters proposed by each index and the corresponding index value.
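The output components named on these slides can be read from the returned list, as in the sketch below. It assumes the NbClust package is installed and uses simulated data invented for the example. The "duda" index is chosen because it is one of the indices that come with a critical value; whether All.CriticalValues is present for a given index is an assumption here, so it is accessed in a way that returns NULL if the component is absent.

```r
# Illustrative data: two groups of 50 points in two dimensions.
set.seed(7)
x <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2)
)

res <- NULL
if (requireNamespace("NbClust", quietly = TRUE)) {
  res <- NbClust::NbClust(data = x, distance = "euclidean",
                          min.nc = 2, max.nc = 6,
                          method = "complete", index = "duda")
  print(res$All.index)                # index value for each number of clusters
  print(res[["All.CriticalValues"]])  # critical values, NULL if not returned
  print(res$Best.nc)                  # proposed number of clusters and index value
} else {
  message("NbClust is not installed; skipping the call.")
}
```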


NbClust output: Hubert index and Dindex


Example 2: Iris dataset (Fisher, 1936)

The Iris dataset is composed of 3 species: "Setosa", "Virginica" and "Versicolor".
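Because the species are known for this dataset, a partition into 3 clusters can be checked against them. The sketch below does this with base R only; the choice of complete linkage on Euclidean distances is an assumption for illustration and is not necessarily the configuration used in the talk.

```r
# The iris data ship with R: 150 flowers, 4 measurements, 3 species.
data(iris)
features <- iris[, 1:4]

# Hierarchical clustering on the four measurements, cut into 3 clusters.
hc <- hclust(dist(features, method = "euclidean"), method = "complete")
clusters <- cutree(hc, k = 3)

# Cross-tabulate cluster membership against the known species.
confusion <- table(cluster = clusters, species = iris$Species)
print(confusion)
```

Rows that are concentrated in a single column indicate clusters that match a species well; rows spread over two columns show where species overlap.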


Clustering of Iris dataset (1)


How to decide on the correct number of clusters?


1. Majority rule: the user can select the number of clusters proposed by the majority of indices, e.g. 4 in the first example and 3 in the second example.
2. The user can consider only the indices that performed best in simulation studies. The top 5 indices in the Milligan and Cooper study are: CH index, Duda index, Cindex, Gamma and Beale index.
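Both rules can be sketched in a few lines of base R. The proposals below are made-up values for illustration, not results from the talk; in an actual run they would presumably be read from the Best.nc output.

```r
# Hypothetical number of clusters proposed by each index (made-up values).
proposals <- c(kl = 4, ch = 4, hartigan = 3, cindex = 4, db = 4,
               silhouette = 4, duda = 2, beale = 4, gamma = 4)

# Rule 1: majority vote over all indices.
votes <- table(proposals)
majority_k <- as.integer(names(votes)[which.max(votes)])

# Rule 2: majority vote restricted to the top 5 indices of Milligan and Cooper.
top5 <- c("ch", "duda", "cindex", "gamma", "beale")
top5_votes <- table(proposals[names(proposals) %in% top5])
top5_k <- as.integer(names(top5_votes)[which.max(top5_votes)])

cat("Majority rule:", majority_k, "\n")
cat("Top-5 indices only:", top5_k, "\n")
```

Note that `which.max` returns the first maximum, so a tie is resolved in favour of the smaller number of clusters; a real analysis would want to report ties explicitly.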


Conclusion

NbClust package provides a large list of indices, many of which are not implemented anywhere else. The current version contains up to 30 indices. NbClust package lets the user simultaneously vary the number of clusters, the clustering method and the indices, to decide how best to group the observations in a dataset or to compare all indices or clustering methods. NbClust package is available at http://cran.r-project.org/web/packages/NbClust/index.html


Thank you!

