Clustering with Mixed Type Variables and Determination of Cluster ...

Clustering with Mixed Type Variables and Determination of Cluster Numbers Hana Řezanková, Dušan Húsek Tomáš Löster University of Economics, Prague ICS, Academy of Sciences of the Czech Republic

COMPSTAT 2010

1

Outline  Motivation  Methods for clustering with mixed type variables  Implementation in software packages  Proposal of new criteria for cluster evaluation  Application  Conclusion

COMPSTAT 2010

2

Motivation  Task: We are looking for groups of similar

objects (e.g. respondents), i.e. we will concentrate on the problem of object clustering  The objects are characterized by both quantitative and qualitative (nominal) variables (e.g. respondent opinions, numbers of actions)  The number of clusters is unknown in advance – i.e. we should cope with appropriate number of clusters determination (assignment) COMPSTAT 2010

3

Methods for clustering with mixed type variables  Using a specialized dissimilarity measure

(Gower’s coefficient, cluster variability based) and application of agglomerative hierarchical cluster analysis (AHCA)  Clustering objects separately with quantitative and qualitative variables and combining the results by cluster-based similarity partitioning algorithm (CSPA)  Latent class models COMPSTAT 2010

4

Implementation in software packages  Specialized dissimilarity measures

- are not implemented for AHCA  Clustering objects with qualitative variables - is implemented only rarely (disagreement coef.)  Cluster-based similarity partitioning algorithm - is not implemented but it could be realized  LC Cluster models (Latent GOLD)  Log-likelihood distance measure between clusters - implemented in two-step cluster analysis (SPSS) COMPSTAT 2010

5

Implementation in software packages  Log-likelihood distance measure between clusters

- implemented in two-step cluster analysis (SPSS)

Dhh   h , h  (h  h )   1 2 2  g  n g   ln( sl  s gl )   H gl  l 1   l 1 2 Kl nglu nglu … entropy ln H gl    ng u 1 n g m (1)

m( 2 )

COMPSTAT 2010

6

Implementation in software packages  Log-likelihood distance measure between objects

- implemented in two-step cluster analysis (SPSS)

Dhh   h , h  (h  h )   1 2 2  g  n g   ln( sl  s gl )   H gl  l 1   l 1 2 m (1)

D ( x i , x j )   x ,x i

m( 2 )

j

COMPSTAT 2010

7

Evaluation criteria implemented in software packages  BIC (Bayesian Information Criterion)

AIC (Akaike Information Criterion) - implemented in two-step cluster analysis (SPSS) k

I BIC  2  g  wk ln( n ) g 1 k

  (1) wk  k  2m   ( K l  1)  l 1  

I AIC  2  g  2 wk g 1

… minimum m( 2 )

only for initial estimation of number of clusters COMPSTAT 2010

8

Proposed evaluation criteria  Within-cluster variability for k clusters:

 1  2 2  ( k )    g   ng   ln( sl  sgl )   H gl  g 1 l 1 g 1  l 1 2  k

k

m(1)

m( 2 )

 Variability of the whole data set: m(1)

m( 2 )

1 2  (1)  n  ln(2 sl )   H l l 1 2 l 1 COMPSTAT 2010

9

Proposed evaluation criteria  Within-cluster variability for k clusters:

 1  2 2  ( k )    g   ng   ln( sl  sgl )   H gl  g 1 l 1 g 1  l 1 2  k

difference

k

m(1)

m( 2 )

diff ( k )   ( k  1)   ( k )

it should be maximal for the suitable number of clusters COMPSTAT 2010

10

Evaluation criteria modified for qualitative variables 1. Uncertainty index (R-square (RSQ) index)

VB VT  VW  (1)   ( k ) I U (k )     (1) VT VT 2. Semipartial uncertainty index

(optimal number of clusters - minimum)

I SPU ( k )  I U ( k  1)  I U ( k ) COMPSTAT 2010

11

Evaluation criteria modified for qualitative variables 3. Pseudo (Calinski and Habarasz) F index

– PSF (SAS), CHF ( SYSTAT) VB ( n  k )  ( (1)   ( k )) k 1  I CHFU ( k )   VW ( k  1)   ( k ) nk

4. Pseudo T-squared statistic – PST2 (SAS)  h , h  (h  h ) PTS (SYSTAT) I PTSU (k )   h   h nh  nh  2 COMPSTAT 2010

12

Evaluation criteria modified for qualitative variables

SYSTAT COMPSTAT 2010

13

Evaluation criteria modified for qualitative variables 5. Modified Davies and Bouldin (DB) index  s D ,h  s D ,h   max    h  , h  h D hh    I DB ( k )  h 1 k k

   h   h max    h , h  h    (   ) h 1  h , h h h   I DBU ( k )  k k

COMPSTAT 2010

14

Evaluation criteria modified for qualitative variables 6. Dunn’s index   Dhh   I D ( k )  min  min  1 h  k 1 h k max diam g   1 g  k  Dhh 

min

x i Ch , x j Ch 

D(xi , x j )

diam g  max D ( x i , x j ) x i ,x j C g

COMPSTAT 2010

15

Modified evaluation criteria  Cluster variability based on the variance and

Gini’s coefficient of mutability   1 2 2 G g  n g   ln( sl  s gl )   G gl  l 1   l 1 2 m(1)

 n glu   Ggl  1      u 1  n g  Kl

m( 2 )

2

Gini’s coefficient of mutability

k

I BGC  2 Gg  wk ln( n ) g 1

COMPSTAT 2010

k

G ( k )   Gg g 1

16

Evaluation criteria modified for qualitative variables 1. Tau index (RSQ index)

VB VT  VW G (1)  G ( k )   I ( k )  VT VT G (1) 2. Semipartial tau index

(optimal number of clusters - minimum)

I SP ( k )  I ( k  1)  I ( k ) COMPSTAT 2010

17

Application to a real data file  Data from a questionnaire survey

(for the participants of the chemistry seminar)  7 qualitative and 1 quantitative (count) variables  Two-step cluster analysis for clustering of respondents (experiments for the numbers of clusters from 2 to 4)  LC Cluster model (experiments for the numbers of clusters from 2 to 6) – the quantitative variable was recoded to 5 categories COMPSTAT 2010

18

Application to a real data file Criteria based on the entropy (TSCA in SPSS) Measure Within-cluster variability Variability difference IU ISPU ICHFU IBIC

1 273.92

Number of clusters 2 3 241.17 206.39

4 186.51

-

32.75

34.78

19.88

0 0.12 0 590.85

0.12 0.13 6.52 568.41

0.25 0.07 7.69 541.88

0.32 7.19 545.15

COMPSTAT 2010

19

Application to a real data file Criteria based on the Gini’s coefficient (TSCA in SPSS) Measure Within-cluster variability Variability difference I ISP ICHF IBGC

1 185.41

Number of clusters 2 3 162.57 137.83

4 127.86

-

22.84

24.74

9.97

0 0.12 0 413.85

0.12 0.13 6.74 411.20

0.26 0.05 8.11 404.75

0.31 6.90 427.84

COMPSTAT 2010

20

Application to a real data file Comparison of BIC Method Two-step CA LC Cluster Model

1 590.85 1397.01

Number of clusters 2 3 568.41 541.88 1059.24 1019.18

COMPSTAT 2010

4 545.15 1036.90

21

Conclusion  If the distance between objects, distance between

clusters, within-cluster variability and the total variability are defined for the case when objects are characterized by mixed-type variables, then the evaluation criteria for quantitative variables can be modified.  One possibility is an application of log-likelihood distance measure based on the entropy  Another possibility is to use the analogous measure with using of Gini’s coefficient COMPSTAT 2010

22

Thank you for your attention

COMPSTAT 2010

23

Clustering with Mixed Type Variables and Determination of Cluster ...

Clustering with Mixed Type Variables and Determination of Cluster ...

Suggest Documents

Clustering with Mixed Type Variables and Determination of Cluster ...

Clustering of samples and variables with mixed-type data - PLOS

Clustering with Same-Cluster Queries

DENOMINATORS OF CLUSTER VARIABLES

Denominators of cluster variables

Distance Metrics and Clustering Methods for Mixed-Type Data

Learning Bayesian Networks with Mixed Variables ... - CiteSeerX

Stable Cluster Variables

Comprehensive cluster analysis with Transitivity Clustering - Nature

expert networks with mixed continuous and categorical feature variables

HIMIC : A Hierarchical Mixed Type Data Clustering Algorithm - UCCS

Clustering of categorical variables around latent variables Cahiers du ...

Linear Fuzzy Clustering With Selection of Variables Using ... - CiteSeerX

Determination of hydrogen cluster velocities and comparison with ...

Voltammetric Determination of TBHQ Individually and Mixed with BHT ...

Document Clustering with Cluster Refinement and Model ... - CiteSeerX

Document Clustering with Cluster Refinement and Model ... - CiteSeerX

Malware Clustering and Network Signature Generation with Mixed ...

Conceptual Clustering with Numeric-and-Nominal Mixed Data | A New

Clustering of Variables Around Latent Components

Desired clustering of genetic regulatory networks with mixed delays

Robust Regression Analysis with LR-Type Fuzzy Input Variables and ...

Evaluating Clustering Algorithms: Cluster Quality and Feature ...

How to find an appropriate clustering for mixedtype variables with ...