Clustering with Mixed Type Variables and Determination of Cluster ...

10 downloads 3071 Views 856KB Size Report
Implementation in software packages. ▫ Proposal of new ... implemented in two-step cluster analysis (SPSS) ... Variability of the whole data set: ∑ ∑. ∑. ∑. = =.
Clustering with Mixed Type Variables and Determination of Cluster Numbers Hana Řezanková, Dušan Húsek Tomáš Löster University of Economics, Prague ICS, Academy of Sciences of the Czech Republic

COMPSTAT 2010

1

Outline  Motivation  Methods for clustering with mixed type variables  Implementation in software packages  Proposal of new criteria for cluster evaluation  Application  Conclusion

COMPSTAT 2010

2

Motivation  Task: We are looking for groups of similar

objects (e.g. respondents), i.e. we will concentrate on the problem of object clustering  The objects are characterized by both quantitative and qualitative (nominal) variables (e.g. respondent opinions, numbers of actions)  The number of clusters is unknown in advance – i.e. we should cope with appropriate number of clusters determination (assignment) COMPSTAT 2010

3

Methods for clustering with mixed type variables  Using a specialized dissimilarity measure

(Gower’s coefficient, cluster variability based) and application of agglomerative hierarchical cluster analysis (AHCA)  Clustering objects separately with quantitative and qualitative variables and combining the results by cluster-based similarity partitioning algorithm (CSPA)  Latent class models COMPSTAT 2010

4

Implementation in software packages  Specialized dissimilarity measures

- are not implemented for AHCA  Clustering objects with qualitative variables - is implemented only rarely (disagreement coef.)  Cluster-based similarity partitioning algorithm - is not implemented but it could be realized  LC Cluster models (Latent GOLD)  Log-likelihood distance measure between clusters - implemented in two-step cluster analysis (SPSS) COMPSTAT 2010

5

Implementation in software packages  Log-likelihood distance measure between clusters

- implemented in two-step cluster analysis (SPSS)

Dhh   h , h  (h  h )   1 2 2  g  n g   ln( sl  s gl )   H gl  l 1   l 1 2 Kl nglu nglu … entropy ln H gl    ng u 1 n g m (1)

m( 2 )

COMPSTAT 2010

6

Implementation in software packages  Log-likelihood distance measure between objects

- implemented in two-step cluster analysis (SPSS)

Dhh   h , h  (h  h )   1 2 2  g  n g   ln( sl  s gl )   H gl  l 1   l 1 2 m (1)

D ( x i , x j )   x ,x i

m( 2 )

j

COMPSTAT 2010

7

Evaluation criteria implemented in software packages  BIC (Bayesian Information Criterion)

AIC (Akaike Information Criterion) - implemented in two-step cluster analysis (SPSS) k

I BIC  2  g  wk ln( n ) g 1 k

  (1) wk  k  2m   ( K l  1)  l 1  

I AIC  2  g  2 wk g 1

… minimum m( 2 )

only for initial estimation of number of clusters COMPSTAT 2010

8

Proposed evaluation criteria  Within-cluster variability for k clusters:

 1  2 2  ( k )    g   ng   ln( sl  sgl )   H gl  g 1 l 1 g 1  l 1 2  k

k

m(1)

m( 2 )

 Variability of the whole data set: m(1)

m( 2 )

1 2  (1)  n  ln(2 sl )   H l l 1 2 l 1 COMPSTAT 2010

9

Proposed evaluation criteria  Within-cluster variability for k clusters:

 1  2 2  ( k )    g   ng   ln( sl  sgl )   H gl  g 1 l 1 g 1  l 1 2  k

difference

k

m(1)

m( 2 )

diff ( k )   ( k  1)   ( k )

it should be maximal for the suitable number of clusters COMPSTAT 2010

10

Evaluation criteria modified for qualitative variables 1. Uncertainty index (R-square (RSQ) index)

VB VT  VW  (1)   ( k ) I U (k )     (1) VT VT 2. Semipartial uncertainty index

(optimal number of clusters - minimum)

I SPU ( k )  I U ( k  1)  I U ( k ) COMPSTAT 2010

11

Evaluation criteria modified for qualitative variables 3. Pseudo (Calinski and Habarasz) F index

– PSF (SAS), CHF ( SYSTAT) VB ( n  k )  ( (1)   ( k )) k 1  I CHFU ( k )   VW ( k  1)   ( k ) nk

4. Pseudo T-squared statistic – PST2 (SAS)  h , h  (h  h ) PTS (SYSTAT) I PTSU (k )   h   h nh  nh  2 COMPSTAT 2010

12

Evaluation criteria modified for qualitative variables

SYSTAT COMPSTAT 2010

13

Evaluation criteria modified for qualitative variables 5. Modified Davies and Bouldin (DB) index  s D ,h  s D ,h   max    h  , h  h D hh    I DB ( k )  h 1 k k

   h   h max    h , h  h    (   ) h 1  h , h h h   I DBU ( k )  k k

COMPSTAT 2010

14

Evaluation criteria modified for qualitative variables 6. Dunn’s index   Dhh   I D ( k )  min  min  1 h  k 1 h k max diam g   1 g  k  Dhh 

min

x i Ch , x j Ch 

D(xi , x j )

diam g  max D ( x i , x j ) x i ,x j C g

COMPSTAT 2010

15

Modified evaluation criteria  Cluster variability based on the variance and

Gini’s coefficient of mutability   1 2 2 G g  n g   ln( sl  s gl )   G gl  l 1   l 1 2 m(1)

 n glu   Ggl  1      u 1  n g  Kl

m( 2 )

2

Gini’s coefficient of mutability

k

I BGC  2 Gg  wk ln( n ) g 1

COMPSTAT 2010

k

G ( k )   Gg g 1

16

Evaluation criteria modified for qualitative variables 1. Tau index (RSQ index)

VB VT  VW G (1)  G ( k )   I ( k )  VT VT G (1) 2. Semipartial tau index

(optimal number of clusters - minimum)

I SP ( k )  I ( k  1)  I ( k ) COMPSTAT 2010

17

Application to a real data file  Data from a questionnaire survey

(for the participants of the chemistry seminar)  7 qualitative and 1 quantitative (count) variables  Two-step cluster analysis for clustering of respondents (experiments for the numbers of clusters from 2 to 4)  LC Cluster model (experiments for the numbers of clusters from 2 to 6) – the quantitative variable was recoded to 5 categories COMPSTAT 2010

18

Application to a real data file Criteria based on the entropy (TSCA in SPSS) Measure Within-cluster variability Variability difference IU ISPU ICHFU IBIC

1 273.92

Number of clusters 2 3 241.17 206.39

4 186.51

-

32.75

34.78

19.88

0 0.12 0 590.85

0.12 0.13 6.52 568.41

0.25 0.07 7.69 541.88

0.32 7.19 545.15

COMPSTAT 2010

19

Application to a real data file Criteria based on the Gini’s coefficient (TSCA in SPSS) Measure Within-cluster variability Variability difference I ISP ICHF IBGC

1 185.41

Number of clusters 2 3 162.57 137.83

4 127.86

-

22.84

24.74

9.97

0 0.12 0 413.85

0.12 0.13 6.74 411.20

0.26 0.05 8.11 404.75

0.31 6.90 427.84

COMPSTAT 2010

20

Application to a real data file Comparison of BIC Method Two-step CA LC Cluster Model

1 590.85 1397.01

Number of clusters 2 3 568.41 541.88 1059.24 1019.18

COMPSTAT 2010

4 545.15 1036.90

21

Conclusion  If the distance between objects, distance between

clusters, within-cluster variability and the total variability are defined for the case when objects are characterized by mixed-type variables, then the evaluation criteria for quantitative variables can be modified.  One possibility is an application of log-likelihood distance measure based on the entropy  Another possibility is to use the analogous measure with using of Gini’s coefficient COMPSTAT 2010

22

Thank you for your attention

COMPSTAT 2010

23

Suggest Documents