
Clustering units from frequency and nominal variables: definition of a global distance. Application to survey data with closed and open-ended questions

Mónica Bécue (1) and Jérôme Pagès (2)

(1) EIO, Universitat Politècnica de Catalunya, 08028 Barcelona, Spain (e-mail: [email protected])
(2) Agrocampus Rennes, 65 rue de Saint-Brieuc, CS 84215, F-35042 Rennes cedex, France (e-mail: [email protected])

Abstract. Clustering units described by heterogeneous data, such as nominal and frequency variables, is a relevant challenge. This kind of clustering requires defining a global distance between the units that takes into account the specificity of each type of data. An important application is clustering the respondents to a questionnaire that includes both closed and open-ended questions. The main arguments for using a global distance defined through a geometrical and multidimensional approach are presented and illustrated through an example.

Keywords: Clustering, Heterogeneous data, Multiple factor analysis, Global distance, Clusters description.

1 Introduction

In many different kinds of studies, the statistical units are described by both nominal and frequency variables. In ecology, it is common to describe different sites by counting the occurrences of every species as well as by recording several soil and climatic attributes. In economic studies, regions are often characterized by the counts of inhabitants in socioeconomic categories and by nominal attributes. A particular case arises in survey data, when a complex topic is tackled by using both closed and open-ended questions: the statistical analysis of the latter starts from counting the occurrences of the different words in every individual answer.

Each set of nominal variables provides a units × variables subtable. Each frequency variable provides a units × categories contingency table. So the global table juxtaposes units × nominal variables subtables and contingency subtables, i.e. heterogeneous data. We want to cluster the units using both kinds of data while taking into account their specificity. The starting point consists in defining a global distance, based on a geometrical approach to the data, in such a way that the influence of the different groups is balanced.

Section 2 presents the notation. Section 3 addresses the definition of the global distance and Section 4 comments on the clustering step. The results obtained with a real example are discussed in Section 5. Finally, Section 6 assesses the contribution of such a methodology for clustering heterogeneous data.

2 Notation

A set of I statistical units is described by $J_c$ groups of frequency variables (leading to $J_c$ contingency tables of dimension $I \times K_j$, where $K_j$ is the number of categories of variable j) and $J_q$ groups of nominal variables (leading to $J_q$ individuals × indicator-variables tables of dimension $I \times K_j$, where $K_j$ is the total number of categories of all the nominal variables of set j). Together, these J tables ($J = J_c + J_q$) make up a multiple table of dimension $I \times K$, with $K = \sum_{j \in J} K_j$. At the crossing of row i and column k (belonging to table j) we have:
• if j is a contingency table: $f_{ikj}$, the relative frequency, in table j ($j = 1, ..., J_c$), with which row i ($i = 1, ..., I$) is associated with column k ($k = 1, ..., K_j$), so that $\sum_{ijk} f_{ikj} = 1$;
• if j is an indicator table: $x_{ik} = 1$ if i belongs to category k and 0 otherwise.

We denote by $f_{i.j} = \sum_{k=1}^{K_j} f_{ikj}$ and $f_{.kj} = \sum_{i} f_{ikj}$ the row and column margins of contingency table j as a subtable of the global table, and by $f_{i..} = \sum_{kj} f_{ikj}$ the row margin of the table gathering all the $J_c$ contingency tables. The margins of the tables of indicator variables, which are constant, do not enter the calculations.

Remark: the row margin of the table gathering all the $J_c$ contingency tables, $f_{i..}$, $i = 1, ..., I$, will be used as row weights (and metric in the column space). When some tables have notably higher frequencies than others, the former can strongly dominate those weights. As an alternative, it is possible to first transform the data into proportions before the concatenation.
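The margins defined above can be illustrated with a minimal Python sketch. The toy counts, the function names (`relative_frequencies`, `margins`) and the list-of-lists layout are assumptions made for illustration only; they are not part of the original paper.

```python
# Hypothetical toy data: 3 units described by two contingency tables of raw counts.
counts = [
    [[2, 1], [0, 3], [1, 1]],            # table j = 0, dimension I x K_0
    [[4, 0, 1], [1, 1, 0], [0, 2, 2]],   # table j = 1, dimension I x K_1
]

def relative_frequencies(tables):
    """Divide every count by the grand total so that all f_ikj sum to 1."""
    total = sum(v for t in tables for row in t for v in row)
    return [[[v / total for v in row] for row in t] for t in tables]

def margins(freq):
    """Return f_i.j, f_.kj, f_..j and f_i.. for a list of frequency tables."""
    n_units = len(freq[0])
    row_j = [[sum(row) for row in t] for t in freq]        # f_i.j -> row_j[j][i]
    col_j = [[sum(col) for col in zip(*t)] for t in freq]  # f_.kj -> col_j[j][k]
    tot_j = [sum(r) for r in row_j]                        # f_..j -> tot_j[j]
    # f_i.. : row margin over all contingency tables, used as row weights
    row_all = [sum(r[i] for r in row_j) for i in range(n_units)]
    return row_j, col_j, tot_j, row_all

freq = relative_frequencies(counts)
row_j, col_j, tot_j, row_all = margins(freq)
```

In this toy example the second table holds more occurrences than the first, so it weighs more in the row weights `row_all`; this is the situation described in the remark above.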

3 Definition of a global distance

The first (and fundamental) step for clustering consists in choosing the distance (or the similarity measure) between the units. The case we have to deal with, heterogeneous data with frequency and nominal variables, presents specific problems, close to those faced when mixed variables, quantitative and qualitative, are considered. In both cases, we have to choose:
• a distance between units within every group of columns (separate distances);
• an aggregation function that combines these separate distances into a global distance in such a way that the different groups have a balanced influence.

3.1 Separate distances between units, as induced by every group of columns

For frequency and nominal variable groups, it is usual to consider the χ2 distance, i.e. the distance between profiles considered in correspondence analysis (CA) and in multiple correspondence analysis (MCA), respectively.

3.2 Aggregation strategy

It is not a straightforward matter to define an aggregation strategy that gives a balanced influence to every group of variables, even when the different groups are made up of the same type of variable.

Geometrical approach. [Escofier and Pagès, 1998] propose a geometrical approach in the mixed case, with quantitative and qualitative variables. They consider the structures of the clouds of individuals as induced by every separate distance and propose to re-scale every subcloud so that all of them have the same greatest axial inertia. For that, a suitable principal axes method is performed on every separate table: principal component analysis (PCA) in the case of quantitative variables and multiple correspondence analysis (MCA) in the case of qualitative variables. The highest axial inertia $\lambda_1^j$ of each set j is thus measured, which allows for re-scaling by dividing the squared separate distance corresponding to set j by $\lambda_1^j$. Furthermore, the global distance, as a weighted sum of the re-scaled separate distances, is automatically calculated by performing multiple factor analysis (MFA) on the juxtaposed table. Besides, this approach transforms the initial mixed variables into purely quantitative variables (the principal components), while the genuine distances, as defined from the qualitative variables, are preserved provided that all the principal axes are kept. Nevertheless, in some cases, it can be useful to keep only the first principal axes.

Global distance in the case of heterogeneous data. To handle frequency groups, we have already adapted MFA to contingency tables and proposed multiple factor analysis for contingency tables (MFACT: [Bécue and Pagès, 1999], [Bécue and Pagès, 2004]). To this end, MFACT transforms the juxtaposed contingency tables as in internal CA [Cazes and Moreau, 2000] and adopts the point of view of MFA for balancing the influence of the different tables in the global analysis. Besides, combining MFACT with the usual MFA makes it possible to deal with contingency tables and indicator tables in the same analysis [Bécue and Pagès, 2001]. The initial global table, a multiple table juxtaposing the contingency tables and the indicator-variable tables side by side, is transformed as shown in Figure 1.

Fig. 1. Multiple table obtained from the original table by suitable transformations
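The re-scaling by the highest axial inertia described in the geometrical approach can be sketched as follows. This is only an illustration under simplifying assumptions: each group is treated as a small quantitative table with uniform row weights (as in a non-normalized PCA), the first eigenvalue is approximated by power iteration, and the names `first_axial_inertia`, `rescale`, `group_a` and `group_b` are hypothetical. It does not reproduce the CA or MCA metrics used by MFACT.

```python
import math

def first_axial_inertia(rows, n_iter=500):
    """Largest eigenvalue of the covariance matrix of `rows` (power iteration)."""
    n, p = len(rows), len(rows[0])
    means = [sum(col) / n for col in zip(*rows)]
    x = [[v - m for v, m in zip(row, means)] for row in rows]
    cov = [[sum(r[a] * r[b] for r in x) / n for b in range(p)] for a in range(p)]
    v = [float(a + 1) for a in range(p)]  # arbitrary non-zero start vector
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    # Rayleigh quotient v'Cv for the unit vector v
    return sum(v[a] * cov[a][b] * v[b] for a in range(p) for b in range(p))

def rescale(rows):
    """Divide every value by sqrt(lambda_1), i.e. squared distances by lambda_1."""
    scale = math.sqrt(first_axial_inertia(rows))
    return [[v / scale for v in row] for row in rows]

# Two hypothetical groups of columns describing the same 4 units.
group_a = [[10, 0], [0, 10], [5, 5], [20, 1]]
group_b = [[0.1, 0.2, 0.0], [0.0, 0.1, 0.3], [0.2, 0.2, 0.1], [0.1, 0.0, 0.0]]

lam_a = first_axial_inertia(group_a)
lam_b = first_axial_inertia(group_b)
balanced_a = rescale(group_a)
balanced_b = rescale(group_b)
```

Before re-scaling, `group_a` would dominate any distance computed on the juxtaposed table because its first axial inertia is far larger. After re-scaling, both groups have a first axial inertia equal to 1, so neither can determine the first axis of the global analysis on its own.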

Then, a non-normalized weighted PCA is performed using:
• as row weights (and metric in the column space): $f_{i..}$, $i = 1, ..., I$, where $f_{i..}$ is the mean relative weight of row i over all the contingency tables considered;
• as column weights (and metric in the space of individuals): the initial weight of the column divided by $\lambda_1^j$, that is, $(I_k/(IJ))/\lambda_1^j$ for the columns $k = 1, ..., K_j$ of the indicator tables ($j \in J_q$) and $f_{.kj}/\lambda_1^j$ for the columns $k = 1, ..., K_j$ of the contingency tables ($j \in J_c$). $I_k$ (k being an indicator variable) is the number of individuals belonging to category k.

In this way, MFACT automatically induces the squared distance between rows i and l given by (1):

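$$
d^2(i,l) = \sum_{j \in J_c} \frac{1}{\lambda_1^j} \sum_{k \in K_j} \frac{1}{f_{.kj}} \left[ \left( \frac{f_{ikj}}{f_{i..}} - \frac{f_{lkj}}{f_{l..}} \right) - \frac{f_{.kj}}{f_{..j}} \left( \frac{f_{i.j}}{f_{i..}} - \frac{f_{l.j}}{f_{l..}} \right) \right]^2 + \sum_{j \in J_q} \frac{1}{\lambda_1^j} \sum_{k \in K_j} \frac{I}{K_j I_k} \left[ x_{ik} - x_{lk} \right]^2 \qquad (1)
$$

where $f_{..j} = \sum_{ik} f_{ikj}$ denotes the total of contingency table j.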

Contingency table j contributes to the global distance through term j of the first block of (1): the deviation between the row profiles i and l is corrected, for each column of table j, by the deviation between the row margins in this table j. Qualitative group j contributes through term j of the second block of (1). Every contribution to the distance is rescaled by $1/\lambda_1^j$, thus balancing the influence of the different groups of variables.
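As a reading aid, the two blocks of (1) can be evaluated directly. The sketch below is a minimal illustration, not the authors' implementation: the function name `global_sq_distance`, the list-of-lists layout and the toy values are assumptions, and the first eigenvalues $\lambda_1^j$ are taken as given inputs rather than computed from separate analyses.

```python
def global_sq_distance(i, l, freq, indic, lam_c, lam_q):
    """Squared global distance between units i and l, following equation (1).

    freq  : list of contingency tables of relative frequencies f_ikj (I x K_j each);
            at least one is required, since the row weights f_i.. come from them
    indic : list of indicator tables x_ik (I x K_j each, entries 0 or 1)
    lam_c, lam_q : first eigenvalues lambda_1^j of each table, assumed given
    """
    n_units = len(freq[0])
    # f_i.. : row margin over all contingency tables
    row_all = [sum(sum(t[u]) for t in freq) for u in range(n_units)]
    d2 = 0.0
    for t, lam in zip(freq, lam_c):
        col = [sum(c) for c in zip(*t)]  # f_.kj
        tot = sum(col)                   # f_..j
        # deviation between the row margins of units i and l in table j
        margin_dev = sum(t[i]) / row_all[i] - sum(t[l]) / row_all[l]
        for k, f_k in enumerate(col):
            profile_dev = t[i][k] / row_all[i] - t[l][k] / row_all[l]
            d2 += (profile_dev - f_k / tot * margin_dev) ** 2 / (f_k * lam)
    for x, lam in zip(indic, lam_q):
        n_cat = len(x[0])                 # K_j
        size = [sum(c) for c in zip(*x)]  # I_k
        for k in range(n_cat):
            d2 += n_units / (n_cat * size[k]) * (x[i][k] - x[l][k]) ** 2 / lam
    return d2

# Hypothetical toy case: 2 units, one contingency table, one indicator table.
freq = [[[0.25, 0.25], [0.5, 0.0]]]
indic = [[[1, 0], [0, 1]]]
d2_01 = global_sq_distance(0, 1, freq, indic, lam_c=[2.0], lam_q=[1.0])
```

With a single contingency table, the row margins in the table coincide with the overall row margins, so the correction term vanishes and the first block reduces to the χ2 distance between row profiles divided by $\lambda_1^j$.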


4 Clustering step

Clustering method. For the clustering step, different methods can be used. Hierarchical clustering using the generalized Ward's criterion is well suited when operating from quantitative variables, especially when they are principal components.

Characterization of the clusters and validation of the partition. For every cluster, the significantly over- and under-represented categories of the nominal variables are selected by using a statistical test [Lebart et al., 1998]. A very similar reasoning allows for selecting the significantly frequent words in every cluster. The count $m_{iq}$ of word i in cluster q is compared to the counts that would be obtained over all the samples of $m_{.q}$ occurrences ($m_{.q}$: total length of cluster q) randomly extracted from the whole corpus without replacement [Lebart et al., 1997], [Bécue and Lebart, 2000].

Furthermore, for every cluster, the modal answers are identified. They are actual responses, given by respondents, that are considered as representative according to two different criteria. The first criterion is linked to the frequency of the characteristic words in the answer, while the second one is induced by the definition of a distance between the lexical profile of the response and the lexical profile of the cluster. It is usual to consider that the most representative answers are those selected by both criteria [Lebart et al., 1997].
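The comparison of the count of a word in a cluster with random samples drawn without replacement can be read as a hypergeometric tail probability. The sketch below follows that reading; it is an assumption about the test, not a transcription of the procedure in the cited books, and the function names and toy numbers are hypothetical.

```python
from math import comb

def upper_tail(m_iq, m_i, m_q, m_total):
    """P(X >= m_iq) when m_q occurrences are drawn without replacement from a
    corpus of m_total occurrences that contains m_i occurrences of the word."""
    hi = min(m_i, m_q)
    ways = sum(comb(m_i, x) * comb(m_total - m_i, m_q - x)
               for x in range(m_iq, hi + 1))
    return ways / comb(m_total, m_q)

def lower_tail(m_iq, m_i, m_q, m_total):
    """P(X <= m_iq) under the same sampling scheme (under-represented words)."""
    hi = min(m_iq, m_i, m_q)
    ways = sum(comb(m_i, x) * comb(m_total - m_i, m_q - x)
               for x in range(0, hi + 1))
    return ways / comb(m_total, m_q)

# Hypothetical toy corpus: 10 occurrences in total, 3 of them of the word,
# and a cluster of total length 4 that contains all 3 occurrences.
p_over = upper_tail(3, 3, 4, 10)
p_under = lower_tail(0, 3, 4, 10)
```

A small `upper_tail` value marks a word as characteristic of the cluster, while a small `lower_tail` value marks it as unusually rare there. The threshold to apply is left to the analyst.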

5 Example: practices and opinions of the children about reading

5.1 Data

The application is extracted from a large study carried out in the outskirts of Barcelona. 895 fifth-grade children (about 10 or 11 years old) answered a closed questionnaire concerning their attitude towards reading and had to complete the two following assertions:
1. Para mí leer es... (For me, to read means...);
2. Creo que leer es importante porque... (I believe that reading is important because...).

We only keep the 816 children who answered the active questions. The closed questions concerning attitudes towards reading form the first group (nominal variables) and the two open-ended questions make up groups 2 and 3, respectively. So the columns of the first group (indicator variables) are the categories of the closed questions, and the columns of the second and third groups correspond to the words used in the corresponding open-ended question, whose frequency is counted for every child. We only keep the words used at least 8 times by the respondents as a whole. Additional information is also used, as supplementary, to illustrate the clusters.

Table 1. Closed active questions

5.2

[Cluster description tables: only fragments are legible. The tables list the characteristic categories of the closed questions (for example "I read a lot", "I read easily", "I read silently", "I read with some difficulty", "I prefer reading aloud", and the global and language qualifications), the characteristic words of each open-ended question, and the modal answers with their mean length (for example "it is a thing that I enjoy a lot", "you learn a lot of things", "I learn spelling and it stimulates my imagination"). The associated counts and percentages are not legible.]

References

[Bécue and Pagès, 2001] M. Bécue and J. Pagès. Analyse simultanée de questions ouvertes et de questions fermées. Méthodologie, exemple. Journal de la Société Française de Statistique, 42(4):91–104, 2001.
[Bécue and Pagès, 2004] M. Bécue and J. Pagès. A principal axes method for comparing contingency tables: Afmtc. Comput. Statist. Data Anal., 45(3):481–485, 2004.
[Cazes and Moreau, 2000] P. Cazes and J. Moreau. Analyse des correspondances d'un tableau de contingence dont les lignes et les colonnes sont munies d'une structure de graphe bistochastique. In J. Moreau, P.A. Doudin, and P. Cazes, editors, L'analyse des correspondances et les techniques connexes. Approches nouvelles pour l'analyse statistique des données, pages 87–103, 2000.
[Escofier and Pagès, 1998] B. Escofier and J. Pagès. Analyses Factorielles Simples et Multiples. Objectifs, méthodes et interprétation. Dunod, Paris, 3rd edition, 1998.
[Lebart et al., 1997] L. Lebart, A. Salem, and L. Berry. Exploring Textual Data. Kluwer, Dordrecht, 1997.
[Lebart et al., 1998] L. Lebart, A. Morineau, and M. Piron. Statistique exploratoire multidimensionnelle. Dunod, Paris, 3rd edition, 1998.
