Implementation in software packages. â« Proposal of new ... implemented in two-step cluster analysis (SPSS) ... Variability of the whole data set: â â. â. â. = =.
Clustering with Mixed Type Variables and Determination of Cluster Numbers Hana Řezanková, Dušan Húsek Tomáš Löster University of Economics, Prague ICS, Academy of Sciences of the Czech Republic
COMPSTAT 2010
1
Outline Motivation Methods for clustering with mixed type variables Implementation in software packages Proposal of new criteria for cluster evaluation Application Conclusion
COMPSTAT 2010
2
Motivation Task: We are looking for groups of similar
objects (e.g. respondents), i.e. we will concentrate on the problem of object clustering The objects are characterized by both quantitative and qualitative (nominal) variables (e.g. respondent opinions, numbers of actions) The number of clusters is unknown in advance – i.e. we should cope with appropriate number of clusters determination (assignment) COMPSTAT 2010
3
Methods for clustering with mixed type variables Using a specialized dissimilarity measure
(Gower’s coefficient, cluster variability based) and application of agglomerative hierarchical cluster analysis (AHCA) Clustering objects separately with quantitative and qualitative variables and combining the results by cluster-based similarity partitioning algorithm (CSPA) Latent class models COMPSTAT 2010
4
Implementation in software packages Specialized dissimilarity measures
- are not implemented for AHCA Clustering objects with qualitative variables - is implemented only rarely (disagreement coef.) Cluster-based similarity partitioning algorithm - is not implemented but it could be realized LC Cluster models (Latent GOLD) Log-likelihood distance measure between clusters - implemented in two-step cluster analysis (SPSS) COMPSTAT 2010
5
Implementation in software packages Log-likelihood distance measure between clusters
- implemented in two-step cluster analysis (SPSS)
Dhh h , h (h h ) 1 2 2 g n g ln( sl s gl ) H gl l 1 l 1 2 Kl nglu nglu … entropy ln H gl ng u 1 n g m (1)
m( 2 )
COMPSTAT 2010
6
Implementation in software packages Log-likelihood distance measure between objects
- implemented in two-step cluster analysis (SPSS)
Dhh h , h (h h ) 1 2 2 g n g ln( sl s gl ) H gl l 1 l 1 2 m (1)
D ( x i , x j ) x ,x i
m( 2 )
j
COMPSTAT 2010
7
Evaluation criteria implemented in software packages BIC (Bayesian Information Criterion)
AIC (Akaike Information Criterion) - implemented in two-step cluster analysis (SPSS) k
I BIC 2 g wk ln( n ) g 1 k
(1) wk k 2m ( K l 1) l 1
I AIC 2 g 2 wk g 1
… minimum m( 2 )
only for initial estimation of number of clusters COMPSTAT 2010
8
Proposed evaluation criteria Within-cluster variability for k clusters:
1 2 2 ( k ) g ng ln( sl sgl ) H gl g 1 l 1 g 1 l 1 2 k
k
m(1)
m( 2 )
Variability of the whole data set: m(1)
m( 2 )
1 2 (1) n ln(2 sl ) H l l 1 2 l 1 COMPSTAT 2010
9
Proposed evaluation criteria Within-cluster variability for k clusters:
1 2 2 ( k ) g ng ln( sl sgl ) H gl g 1 l 1 g 1 l 1 2 k
difference
k
m(1)
m( 2 )
diff ( k ) ( k 1) ( k )
it should be maximal for the suitable number of clusters COMPSTAT 2010
10
Evaluation criteria modified for qualitative variables 1. Uncertainty index (R-square (RSQ) index)
VB VT VW (1) ( k ) I U (k ) (1) VT VT 2. Semipartial uncertainty index
(optimal number of clusters - minimum)
I SPU ( k ) I U ( k 1) I U ( k ) COMPSTAT 2010
11
Evaluation criteria modified for qualitative variables 3. Pseudo (Calinski and Habarasz) F index
– PSF (SAS), CHF ( SYSTAT) VB ( n k ) ( (1) ( k )) k 1 I CHFU ( k ) VW ( k 1) ( k ) nk
4. Pseudo T-squared statistic – PST2 (SAS) h , h (h h ) PTS (SYSTAT) I PTSU (k ) h h nh nh 2 COMPSTAT 2010
12
Evaluation criteria modified for qualitative variables
SYSTAT COMPSTAT 2010
13
Evaluation criteria modified for qualitative variables 5. Modified Davies and Bouldin (DB) index s D ,h s D ,h max h , h h D hh I DB ( k ) h 1 k k
h h max h , h h ( ) h 1 h , h h h I DBU ( k ) k k
COMPSTAT 2010
14
Evaluation criteria modified for qualitative variables 6. Dunn’s index Dhh I D ( k ) min min 1 h k 1 h k max diam g 1 g k Dhh
min
x i Ch , x j Ch
D(xi , x j )
diam g max D ( x i , x j ) x i ,x j C g
COMPSTAT 2010
15
Modified evaluation criteria Cluster variability based on the variance and
Gini’s coefficient of mutability 1 2 2 G g n g ln( sl s gl ) G gl l 1 l 1 2 m(1)
n glu Ggl 1 u 1 n g Kl
m( 2 )
2
Gini’s coefficient of mutability
k
I BGC 2 Gg wk ln( n ) g 1
COMPSTAT 2010
k
G ( k ) Gg g 1
16
Evaluation criteria modified for qualitative variables 1. Tau index (RSQ index)
VB VT VW G (1) G ( k ) I ( k ) VT VT G (1) 2. Semipartial tau index
(optimal number of clusters - minimum)
I SP ( k ) I ( k 1) I ( k ) COMPSTAT 2010
17
Application to a real data file Data from a questionnaire survey
(for the participants of the chemistry seminar) 7 qualitative and 1 quantitative (count) variables Two-step cluster analysis for clustering of respondents (experiments for the numbers of clusters from 2 to 4) LC Cluster model (experiments for the numbers of clusters from 2 to 6) – the quantitative variable was recoded to 5 categories COMPSTAT 2010
18
Application to a real data file Criteria based on the entropy (TSCA in SPSS) Measure Within-cluster variability Variability difference IU ISPU ICHFU IBIC
1 273.92
Number of clusters 2 3 241.17 206.39
4 186.51
-
32.75
34.78
19.88
0 0.12 0 590.85
0.12 0.13 6.52 568.41
0.25 0.07 7.69 541.88
0.32 7.19 545.15
COMPSTAT 2010
19
Application to a real data file Criteria based on the Gini’s coefficient (TSCA in SPSS) Measure Within-cluster variability Variability difference I ISP ICHF IBGC
1 185.41
Number of clusters 2 3 162.57 137.83
4 127.86
-
22.84
24.74
9.97
0 0.12 0 413.85
0.12 0.13 6.74 411.20
0.26 0.05 8.11 404.75
0.31 6.90 427.84
COMPSTAT 2010
20
Application to a real data file Comparison of BIC Method Two-step CA LC Cluster Model
1 590.85 1397.01
Number of clusters 2 3 568.41 541.88 1059.24 1019.18
COMPSTAT 2010
4 545.15 1036.90
21
Conclusion If the distance between objects, distance between
clusters, within-cluster variability and the total variability are defined for the case when objects are characterized by mixed-type variables, then the evaluation criteria for quantitative variables can be modified. One possibility is an application of log-likelihood distance measure based on the entropy Another possibility is to use the analogous measure with using of Gini’s coefficient COMPSTAT 2010
22
Thank you for your attention
COMPSTAT 2010
23