Clustering units from frequency and nominal variables: definition of a global distance. Application to survey data with closed and open-ended questions

Mónica Bécue1 and Jérôme Pagès2

1 EIO. Universitat Politècnica de Catalunya, 08028 Barcelona - Spain (e-mail: [email protected])
2 Agrocampus Rennes, 65 rue de Saint-Brieuc, CS 84215, F-35042 Rennes cedex, France (e-mail: [email protected])

Abstract. Clustering units described by heterogeneous data, such as nominal and frequency variables, is a relevant challenge. This kind of clustering requires defining a global distance between the units that takes into account the specificity of each kind of data. An important application is clustering the respondents to a questionnaire that includes both closed and open-ended questions. The main arguments for using a global distance defined through a geometrical and multidimensional approach are presented and illustrated through an example.

Keywords: Clustering, Heterogeneous data, Multiple factor analysis, Global distance, Clusters description.

1 Introduction

In very different studies, the statistical units are described by both nominal and frequency variables. In ecology, it is common to describe different sites by counting the occurrences of every species as well as by recording several soil and climatic attributes. In economic studies, regions are often characterized by the counts of inhabitants in socioeconomic categories and by nominal attributes. A particular case arises in survey data, when a complex topic is tackled by using both closed and open-ended questions: the statistical analysis of the latter starts from counting the occurrences of the different words in every individual answer.

Each set of nominal variables provides a units × variables subtable. Each frequency variable provides a units × categories contingency table. So, the global table juxtaposes units × nominal variables subtables and contingency subtables, i.e. heterogeneous data. We want here to cluster the units, using both kinds of data but taking into account their specificity. The starting point consists in defining a global distance, based on a geometrical approach to the data, in such a way that the influence of the different groups is balanced.

Section 2 presents the notation. Section 3 addresses the definition of the global distance and Section 4 discusses the clustering step. The results obtained with an actual example are discussed in Section 5. Finally, Section 6 assesses the contribution of such a methodology to clustering heterogeneous data.

2 Notation

A set of $I$ statistical units is described by $J_c$ groups of frequency variables (leading to $J_c$ contingency tables with dimension $I \times K_j$, where $K_j$ is the number of categories of variable $j$) and $J_q$ groups of nominal variables (leading to $J_q$ individuals × indicator variables tables with dimension $I \times K_j$, where $K_j$ is here the total number of categories of all the nominal variables of set $j$). Together, these $J$ tables ($J = J_c + J_q$) make up a multiple table $I \times K$ ($K = \sum_{j \in J} K_j$). At the crossing of row $i$ and column $k$ (belonging to table $j$) we have:
• if $j$ is a contingency table: $f_{ikj}$, the relative frequency, in table $j$ ($j = 1, ..., J_c$), with which row $i$ ($i = 1, ..., I$) is associated with column $k$ ($k = 1, ..., K_j$), so that $\sum_{ijk} f_{ikj} = 1$;
• if $j$ is an indicator table: $x_{ik} = 1$ if $i$ belongs to category $k$ and 0 otherwise.

We denote by $f_{i.j} = \sum_{k=1}^{K_j} f_{ikj}$ and $f_{.kj} = \sum_{i} f_{ikj}$ the row and column margins of contingency table $j$ as a subtable of the global table, and by $f_{i..} = \sum_{kj} f_{ikj}$ the row margin of the table gathering all the $J_c$ contingency tables. The margins of the tables of indicator variables, which are constant, do not intervene in the calculations.

Remark: the row margin of the table gathering all the $J_c$ contingency tables, $f_{i..}$, $i = 1, ..., I$, will be used as row weights (and metric in the column space). When some tables have notably higher frequencies than others, the former can strongly dominate those weights. As an alternative, it is possible to first transform the data into proportions before the concatenation.
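As an illustration of the notation, the relative frequencies and margins can be computed from raw counts as in the following minimal sketch. The function name and the nested-list layout (`count_tables[j][i][k]`) are assumptions made for the example, not part of the original method description.

```python
def relative_frequencies_and_margins(count_tables):
    """Turn raw counts into the quantities f_ikj, f_i.j, f_.kj and f_i..

    count_tables[j][i][k] is the count for unit i and category k in
    contingency table j (hypothetical layout chosen for this sketch).
    All tables must have the same number of rows (units).
    """
    # Grand total over all units, categories and tables, so that sum f_ikj = 1
    grand_total = sum(n for t in count_tables for row in t for n in row)
    f = [[[n / grand_total for n in row] for row in t] for t in count_tables]
    # f_i.j : row margins of each table, indexed [j][i]
    f_i_dot_j = [[sum(row) for row in t] for t in f]
    # f_.kj : column margins of each table, indexed [j][k]
    f_dot_kj = [[sum(col) for col in zip(*t)] for t in f]
    # f_i.. : row margin over all contingency tables, indexed [i]
    f_i_dotdot = [sum(rows) for rows in zip(*f_i_dot_j)]
    return f, f_i_dot_j, f_dot_kj, f_i_dotdot
```

With two small tables of very different totals, the values of `f_i_dotdot` show how the larger table dominates the row weights, which is the situation described in the remark above.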

3 Definition of a global distance

The first (and fundamental) step for clustering consists in choosing the distance (or the similarity measure) between the units. The case we have to deal with, heterogeneous data with frequency and nominal variables, raises specific problems, close to those faced when mixed variables, quantitative and qualitative, are considered. In both cases, we have to choose:
• a distance between units within every group of columns (separate distances);
• a function aggregating these separate distances into a global distance in such a way that the different groups have a balanced influence.

3.1 Separate distances between units, as induced by every group of columns

For frequency and nominal variable groups, it is usual to consider the χ2 distance, i.e. the distance between profiles considered, respectively, in correspondence analysis (CA) and in multiple correspondence analysis (MCA).

3.2 Aggregation strategy

It is not a straightforward matter to define an aggregation strategy that gives a balanced influence to every group of variables, even when the different groups are made up of the same type of variable.

Geometrical approach. [Escofier and Pagès, 1998] propose a geometrical approach in the mixed case, with quantitative and qualitative variables. They consider the structures of the clouds of individuals as induced by every separate distance and propose to re-scale every subcloud so that all of them have the same greatest axial inertia. For that, a suitable principal axes method is performed on every separate table: principal component analysis (PCA) in the case of quantitative variables and multiple correspondence analysis (MCA) in the case of qualitative variables. The highest axial inertia $\lambda_1^j$ is thus measured, which allows for re-scaling the distances by dividing by $\lambda_1^j$ the separate squared distance corresponding to set $j$. Furthermore, the global distance, as a weighted sum of the re-scaled separate distances, is automatically calculated by performing multiple factor analysis (MFA) on the juxtaposed table. Besides, this approach transforms the initial mixed variables into quantitative variables only (the principal components), while the genuine distances, as defined from the qualitative variables, are preserved provided that all the principal axes are kept. Nevertheless, in some cases, it can be useful to keep only the first principal axes.

Global distance in the case of heterogeneous data. To handle frequency groups, we have already adapted MFA to contingency tables and proposed multiple factor analysis for contingency tables (MFACT: [Bécue and Pagès, 1999], [Bécue and Pagès, 2004]). With this aim, MFACT transforms the juxtaposed contingency table as in internal CA [Cazes and Moreau, 2000] and adopts the point of view of MFA for balancing the influence of the different tables in the global analysis.
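The balancing principle borrowed from MFA can be sketched for the simplest case, a centred quantitative group with row weights. This is only an illustration under stated assumptions: the function names are hypothetical, the first eigenvalue is obtained by plain power iteration rather than a full PCA, and groups of indicator or frequency columns would need their own separate analysis (MCA or CA) to obtain $\lambda_1^j$.

```python
import math


def first_eigenvalue(table, weights, n_iter=500):
    """Largest eigenvalue of X'DX for a centred table X (rows = units)
    with row weights D, computed by power iteration (illustrative only)."""
    p = len(table[0])
    # Weighted cross-product matrix X'DX
    cross = [[sum(w * row[a] * row[b] for w, row in zip(weights, table))
              for b in range(p)] for a in range(p)]
    # Starting vector; assumed not to be orthogonal to the first axis
    v = [1.0 / math.sqrt(p)] * p
    lam = 0.0
    for _ in range(n_iter):
        cv = [sum(cross[a][b] * v[b] for b in range(p)) for a in range(p)]
        lam = math.sqrt(sum(c * c for c in cv))
        if lam == 0.0:
            return 0.0
        v = [c / lam for c in cv]
    return lam


def rescale_group(table, weights):
    """Divide the coordinates by sqrt(lambda_1), so that squared distances
    are divided by lambda_1 and the greatest axial inertia becomes 1."""
    lam1 = first_eigenvalue(table, weights)
    scaled = [[x / math.sqrt(lam1) for x in row] for row in table]
    return scaled, lam1
```

After `rescale_group`, every group has a first eigenvalue equal to 1, so no single group can dominate the first axis of the global analysis on its own.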
Besides, combining MFACT with the usual MFA makes it possible to deal with contingency tables and indicator tables in the same analysis [Bécue and Pagès, 2001]. The initial global table, a multiple table juxtaposing row-wise the contingency tables and the indicator-variable tables, is transformed as shown in Figure 1.

Fig. 1. Multiple table issued from the original table by suited transformations

Then, a non-normalized weighted PCA is performed using:
• as row weights (and metric in the column space): $f_{i..}$, $i = 1, ..., I$, where $f_{i..}$ is the mean relative weight of row $i$ over all the contingency tables considered;
• as column weights (and metric in the individuals space): the initial weight of the column divided by $\lambda_1^j$, i.e. $(I_k/IJ)/\lambda_1^j$ for the columns of the indicator tables ($k = 1, ..., K_j$; $j \in J_q$) and $f_{.kj}/\lambda_1^j$ for the columns of the contingency tables ($k = 1, ..., K_j$; $j \in J_c$), where $I_k$ is the number of individuals belonging to category $k$ of an indicator table.

In such a way, MFACT automatically induces the squared distance between rows $i$ and $l$ given by (1):

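$$
d^2(i,l) = \sum_{j \in J_c} \frac{1}{\lambda_1^j} \sum_{k \in K_j} \frac{1}{f_{.kj}} \left[ \left( \frac{f_{ikj}}{f_{i..}} - \frac{f_{lkj}}{f_{l..}} \right) - \frac{f_{.kj}}{f_{..j}} \left( \frac{f_{i.j}}{f_{i..}} - \frac{f_{l.j}}{f_{l..}} \right) \right]^2 + \sum_{j \in J_q} \frac{1}{\lambda_1^j} \sum_{k \in K_j} \frac{I}{K_j I_k} \left[ x_{ik} - x_{lk} \right]^2 \qquad (1)
$$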

Contingency table $j$ contributes to the global distance through term $j$ of the first block of (1): the deviation between the row profiles $i$ and $l$ is relativized, for each column of table $j$, by the deviation between the row margins in this table $j$. Qualitative group $j$ contributes through term $j$ of the second block of (1). Every contribution to the distance is rescaled by $1/\lambda_1^j$, thus balancing the influence of the different groups of variables.
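To make the two blocks of (1) concrete, the following minimal sketch evaluates the squared distance between two units directly from the tables. Several points are assumptions of the example rather than statements of the original text: the function name and the nested-list layout are hypothetical, the first eigenvalues $\lambda_1^j$ are passed in as already-known inputs, $f_{..j}$ is taken to be the total of table $j$, the factor of the second block is used as written in (1), and all margins are assumed strictly positive.

```python
def global_sq_distance(i, l, cont_tables, ind_tables, lam_c, lam_q):
    """Squared global distance between units i and l, following (1).

    cont_tables: contingency tables of counts, cont_tables[j][unit][k]
                 (at least one table is required to define f_i..)
    ind_tables:  indicator tables of 0/1 values, ind_tables[j][unit][k]
    lam_c, lam_q: first eigenvalues of the separate analyses, one per
                  table, supplied by the caller (not computed here)
    """
    total = sum(n for t in cont_tables for row in t for n in row)
    n_units = len(cont_tables[0])
    # f_i.. : row margin over all contingency tables
    f_i = [sum(sum(t[r]) for t in cont_tables) / total for r in range(n_units)]

    d2 = 0.0
    # First block: contingency tables
    for t, lam in zip(cont_tables, lam_c):
        f = [[n / total for n in row] for row in t]      # f_ikj
        col = [sum(c) for c in zip(*f)]                  # f_.kj
        tab = sum(col)                                   # f_..j (assumed: table total)
        margin_dev = sum(f[i]) / f_i[i] - sum(f[l]) / f_i[l]
        for k, f_k in enumerate(col):
            profile_dev = f[i][k] / f_i[i] - f[l][k] / f_i[l]
            d2 += (profile_dev - f_k / tab * margin_dev) ** 2 / (f_k * lam)

    # Second block: indicator tables, with the factor I / (K_j I_k)
    for x, lam in zip(ind_tables, lam_q):
        n_cat = len(x[0])                                # K_j
        counts = [sum(c) for c in zip(*x)]               # I_k
        for k, i_k in enumerate(counts):
            d2 += n_units / (n_cat * i_k) * (x[i][k] - x[l][k]) ** 2 / lam
    return d2
```

Doubling one of the eigenvalues halves the contribution of the corresponding table, which is the balancing effect described above.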

4 Clustering step

Clustering method. For the clustering step, different methods can be used, although hierarchical clustering using the generalized Ward's criterion is a suitable method when operating from quantitative variables, especially when they are principal components.

Characterization of the clusters and validation of the partition. For every cluster, the significantly over- and under-represented categories of the nominal variables are selected by using a statistical test [Lebart et al., 1998]. A very similar reasoning allows for selecting the significantly frequent words in every cluster. The count $m_{iq}$ of word $i$ in cluster $q$ is compared to the counts that would be obtained over all the samples of $m_{.q}$ occurrences ($m_{.q}$: total length of cluster $q$) randomly extracted from the whole corpus without replacement [Lebart et al., 1997], [Bécue and Lebart, 2000].

Furthermore, for every cluster, the modal answers are identified. They are actual responses, given by respondents, that are considered as representative according to two different criteria. The first criterion is linked to the frequency of the characteristic words in the answer, while the second one is induced by the definition of a distance between the lexical profile of the response and the lexical profile of the cluster. It is usual to consider that the most representative answers are those selected by both criteria [Lebart et al., 1997].
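Drawing $m_{.q}$ occurrences from the corpus without replacement corresponds to a hypergeometric model, so the comparison of word counts can be sketched as an upper-tail probability. The function name and argument names below are hypothetical, and the sketch covers only over-representation; it is one possible reading of the test, not necessarily the exact procedure of the cited references. It requires Python 3.8 or later for `math.comb`.

```python
from math import comb


def over_representation_pvalue(m_iq, m_i, m_q, m):
    """P(X >= m_iq) when X counts the occurrences of word i in a sample of
    m_q occurrences drawn without replacement from a corpus of m
    occurrences, m_i of which are word i (hypergeometric model)."""
    upper = min(m_i, m_q)
    favourable = sum(comb(m_i, x) * comb(m - m_i, m_q - x)
                     for x in range(m_iq, upper + 1))
    return favourable / comb(m, m_q)
```

A small p-value suggests that the word is used in the cluster more often than random sampling from the corpus would explain; the same tail computed on the lower side would flag under-represented words.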

5 Example: practices and opinions of the children about reading

5.1 Data

The application is extracted from a large study carried out in the outskirts of Barcelona. 895 children in fifth grade (about 10 or 11 years old) answered a closed questionnaire concerning their attitude towards reading and had to complete the two following assertions:
1. Para mí leer es... (For me, to read means...);
2. Creo que leer es importante porque... (I believe that reading is important because...).

We only keep the 816 children who answered the active questions. The closed questions concerning the attitudes towards reading correspond to the first group (nominal variables), and the two open-ended questions make up groups 2 and 3, respectively. So, the columns of the first group (indicator variables) are the categories of the closed questions, and the columns of the second and third groups correspond to the words used in the corresponding open-ended question, whose frequency is counted for every child. We only keep the words used at least 8 times by the respondents as a whole. Additional information is also used, as supplementary, to illustrate the clusters.

Table 1. Closed active questions

5.2

[The contents of Table 1 and of the table describing the clusters (characteristic categories of the closed questions, characteristic words and modal answers) are not legible in this copy.]

References

[Bécue and Pagès, 2001] M. Bécue and J. Pagès. Analyse simultanée de questions ouvertes et de questions fermées. Méthodologie, exemple. Journal de la Société Française de Statistique, 42(4):91–104, 2001.
[Bécue and Pagès, 2004] M. Bécue and J. Pagès. A principal axes method for comparing contingency tables: AFMTC. Comput. Statist. Data Anal., 45(3):481–485, 2004.
[Cazes and Moreau, 2000] P. Cazes and J. Moreau. Analyse des correspondances d'un tableau de contingence dont les lignes et les colonnes sont munies d'une structure de graphe bistochastique. In J. Moreau, P.A. Doudin, and P. Cazes, editors, L'analyse des correspondances et les techniques connexes. Approches nouvelles pour l'analyse statistique des données, pages 87–103, 2000.
[Escofier and Pagès, 1998] B. Escofier and J. Pagès. Analyses Factorielles Simples et Multiples. Objectifs, méthodes et interprétation. Dunod, Paris, 3rd edition, 1998.
[Lebart et al., 1997] L. Lebart, A. Salem, and L. Berry. Exploring Textual Data. Kluwer, Dordrecht, 1997.
[Lebart et al., 1998] L. Lebart, A. Morineau, and M. Piron. Statistique exploratoire multidimensionnelle. Dunod, Paris, 3rd edition, 1998.
