COMMUNICATIONS IN STATISTICS
Simulation and Computation
Vol. 32, No. 4, pp. 1131-1150, 2003
Clustering of Variables Around Latent Components

E. Vigneau* and E. M. Qannari

Laboratoire de Sensométrie et de Chimiométrie, ENITIAA/INRA, Nantes, France
ABSTRACT

Clustering of variables around latent components is investigated as a means to organize multivariate data into meaningful structures. The coverage includes (i) the case where it is desirable to lump together correlated variables no matter whether the correlation coefficient is positive or negative; (ii) the case where negative correlation shows high disagreement among variables; (iii) an extension of the clustering techniques which makes it possible to explain the clustering of variables taking account of external data. The strategy basically consists in performing a hierarchical cluster analysis, followed by a partitioning algorithm. Both algorithms aim at maximizing the same criterion, which reflects the extent to which the variables in each cluster are related to the latent variable associated with that cluster. Illustrations are outlined using real data sets from sensory studies.
*Correspondence: E. Vigneau, Laboratoire de Sensométrie et de Chimiométrie, ENITIAA/INRA, rue de la Géraudière, BP 82 225, 44 322 Nantes Cedex 03, France; Fax: +33-2-51-78-54-38; E-mail: [email protected].

DOI: 10.1081/SAC-120023882
Copyright © 2003 by Marcel Dekker, Inc.
Key Words: Clustering of variables; Principal component analysis; Sensory data; Preference data.
1. INTRODUCTION
Principal Components Analysis (PCA) is an appealing statistical tool for data inspection and dimensionality reduction. Rotated principal components fulfill the need for practitioners to have more interpretable linear combinations of the original variables. Cluster analysis of variables is an alternative technique, as it makes it possible to organize the data into meaningful structures. Moreover, once the variables are arranged into homogeneous clusters, further investigations may be undertaken. For instance, it is possible to associate with each cluster of variables a synthetic component. These synthetic components may be easier to interpret than the rotated principal components as they exclusively refer to different groups of variables pertaining to various facets of the problem under investigation. Another advantage which may be gained from the clustering of variables relates to the selection of a subset of variables. Procedures for discarding or selecting variables based on a statistical criterion have been proposed by Jolliffe (1972), McCabe (1984), Krzanowski (1987), Al-Kandari and Jolliffe (2001) and Guo et al. (2002), among many others. Clustering of variables has the advantage over these methods of allowing the practitioner to actually choose one variable from each cluster taking account, in addition to statistical considerations, of such aspects as cost, ease of measurement, practicability and interpretability.

In articles by Qannari et al. (1997, 1998), Abdallah and Saporta (1998) and Soffritti (1999), procedures for clustering of variables based on the definition of similarity (or dissimilarity) measures between variables have been discussed. The approach discussed herein consists in clustering variables around latent components. More precisely, the aim is to determine simultaneously K clusters of variables and K latent components such that the variables in each cluster are strongly related to the corresponding latent component. A solution to this problem is given by an iterative partitioning algorithm which involves, as a first step, the choice of K initial clusters. In order to help the practitioner in choosing an appropriate number of clusters and the initial partition, a hierarchical clustering procedure is proposed. This hierarchical approach has the same rationale as the partitioning algorithm, as both techniques aim at maximizing the same criterion.
In practice, two cases have to be considered:

1. The first case applies to situations where the aim is to lump together correlated variables regardless of the sign of the correlation coefficients. Such a situation is typically encountered in sensory studies when reduction of a list of attributes is sought (Gains et al., 1988).
2. The second case applies when a negative correlation coefficient between two variables shows disagreement between them. Panel segmentation in consumer studies illustrates this case, as discussed in Sec. 5.2.

In both cases, the following aims are of interest:

(a) To achieve clustering of variables around latent components, thus exhibiting their redundancy.
(b) To cluster the variables around latent components that are expressed as linear combinations of external variables, thus showing how the various clusters may be interpreted in terms of these external data. For instance, it may be of interest to cluster a set of sensory attributes while investigating how the various clusters relate to chemical measurements. Performed on preference data, the approach which consists in clustering variables (consumers) while taking account of external data such as sensory profiles provides an alternative to the so-called External Preference Mapping (PrefMap) (see for instance Greenhoff and MacFie (1994)).

It should be pointed out that the procedure Varclus in the SAS package SAS/STAT (1990) addresses both issues 1 and 2 raised above without, however, considering the situation where external data are available. The next two sections are dedicated to the presentation of the methods of clustering of variables. Extensions of these approaches taking account of external data are discussed in Sec. 4. Finally, the techniques are illustrated using real data sets from sensory studies and hedonic scoring experiments. Throughout this article, matrices are represented with bold capital letters, e.g., X. A vector is written with a lower case bold letter, e.g., x.
2. CLUSTERING OF VARIABLES WHEN POSITIVE AND NEGATIVE CORRELATIONS IMPLY AGREEMENT

Consider the case where the aim is to lump together correlated variables regardless of the sign of the correlation coefficients. For this purpose, we seek to determine K (supposed to be fixed) clusters of variables and K latent components by maximizing a criterion which expresses the extent to which the variables in each cluster are collinear with the latent component associated with this cluster.

Consider a set of p variables x_1, x_2, ..., x_p measured on n individuals. These variables are assumed to be centered, but not necessarily standardized. Let us denote by G_1, G_2, ..., G_K the K clusters of variables and by c_1, c_2, ..., c_K the K latent components (i.e., synthetic variables) associated respectively with the K clusters. We seek to maximize the quantity:

T = n \sum_{k=1}^{K} \sum_{j=1}^{p} \delta_{kj} \, \mathrm{Cov}^2(x_j, c_k)  under the constraint  c_k' c_k = 1    (2.1)

where \delta_{kj} = 1 if the jth variable belongs to cluster G_k and \delta_{kj} = 0 otherwise, and \mathrm{Cov}(x_j, c_k) stands for the covariance between x_j and c_k. T can also be written as:

T = \frac{1}{n} \sum_{k=1}^{K} c_k' X_k X_k' c_k    (2.2)

where X_k is the matrix whose columns are formed with the variables belonging to group G_k.

A solution to this problem is given by an iterative algorithm in the course of which the variables are allowed to move in and out of the groups at the different stages of the algorithm, achieving at each stage an increase of criterion T. This partitioning algorithm runs as follows:

Step 1. Start with K groups of variables, which may be obtained by random allocation of the variables into K groups or, preferably, from the hierarchical clustering method discussed below.

Step 2. In cluster G_k, the latent component c_k is defined as the first standardized eigenvector of X_k X_k'. In other words, c_k is the first standardized principal component of X_k.
Step 3. New clusters of variables are formed by assigning a variable to a group if its squared covariance with the latent component of this group is higher than with any other latent component.

In further steps, the process starting from Step 2 is continued iteratively until stability is achieved.

We propose to complement this partitioning algorithm with a hierarchical clustering method. In practice, both methods should be performed in order to gain the benefit of each. The hierarchical clustering strategy is based on the same criterion T discussed above (2.1), which highlights the complementarity of both approaches. It is an agglomerative technique which proceeds sequentially from the stage in which each variable is considered to form a cluster by itself to the stage where there is a single cluster containing all variables. At each stage, the number of clusters is reduced by one by aggregating two groups.

Let us consider criterion T. From Eq. (2.2), it can easily be shown that T can be written as follows:

T = \sum_{k=1}^{K} \lambda_1^{(k)}    (2.3)
where \lambda_1^{(k)} denotes the largest eigenvalue of the matrix \frac{1}{n} X_k X_k' or, equivalently, of \frac{1}{n} X_k' X_k, which is the covariance matrix of the variables in class G_k. This form of criterion T suggests the following hierarchical procedure. At the first stage, each variable forms a cluster by itself. T is then equal to:

T_0 = \sum_{j=1}^{p} \mathrm{var}(x_j)    (2.4)
At stage i, the merging of two clusters of variables A and B results in a variation of criterion T given by:

\Delta = T_{i-1} - T_i = \lambda_1^{(A)} + \lambda_1^{(B)} - \lambda_1^{(A \cup B)}    (2.5)

where \lambda_1^{(A)}, \lambda_1^{(B)} and \lambda_1^{(A \cup B)} are the largest eigenvalues associated with the covariance matrices of the variables in clusters A, B and A \cup B, respectively. We can prove (see Appendix) that:

\lambda_1^{(A \cup B)} \le \lambda_1^{(A)} + \lambda_1^{(B)}    (2.6)
which implies that the merging of two clusters at each step results in a decrease in criterion T. Therefore, the strategy of aggregation consists in merging those clusters A and B that result in the smallest decrease in criterion T.
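To make the algorithm concrete, the following is a minimal sketch (our illustration, not the authors' code) of the Sec. 2 partitioning algorithm, assuming numpy, an n x p matrix X whose columns are the centered variables, and an initial partition labels0 (e.g., obtained from the hierarchical stage); the names first_pc, merge_cost and cavalc1_partition are ours.

```python
import numpy as np

def first_pc(Xk):
    # Step 2: first standardized principal component of the block Xk,
    # i.e., the unit-norm leading eigenvector of Xk Xk' (leading left
    # singular vector of Xk).
    u, s, vt = np.linalg.svd(Xk, full_matrices=False)
    return u[:, 0]

def merge_cost(XA, XB, n):
    # Decrease in T caused by merging clusters A and B, Eq. (2.5):
    # lambda1(A) + lambda1(B) - lambda1(A u B), where lambda1 is the
    # largest eigenvalue of the covariance matrix (1/n) X'X.
    lam1 = lambda M: np.linalg.svd(M, compute_uv=False)[0] ** 2 / n
    return lam1(XA) + lam1(XB) - lam1(np.hstack([XA, XB]))

def cavalc1_partition(X, labels0, max_iter=100):
    n, p = X.shape
    labels = np.asarray(labels0).copy()
    K = labels.max() + 1
    for _ in range(max_iter):
        # Step 2: one latent component per cluster (this sketch assumes
        # no cluster becomes empty along the way).
        comps = np.column_stack([first_pc(X[:, labels == k]) for k in range(K)])
        # Step 3: reassign each variable to the cluster whose latent
        # component gives the largest squared covariance with it.
        cov2 = (X.T @ comps / n) ** 2            # p x K squared covariances
        new_labels = cov2.argmax(axis=1)
        if np.array_equal(new_labels, labels):   # stability reached
            break
        labels = new_labels
    T = n * cov2[np.arange(p), labels].sum()     # criterion (2.1)
    return labels, comps, T
```

The hierarchical stage described above would, at each level, merge the pair of clusters with the smallest merge_cost, and the resulting partition can serve as labels0.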
3. CLUSTERING OF VARIABLES WHEN NEGATIVE CORRELATIONS SHOW DISAGREEMENT

There are situations in which the practitioner wishes to take account of the strength of the correlation and also of the sign of the correlation between variables. For instance, suppose that p consumers are asked to rate their acceptability of n products. A negative covariance between the scores of two consumers emphasizes their different views of the products. The clustering procedure discussed herein is based on the following principle: find K groups of variables G_1, G_2, ..., G_K (with K supposed to be fixed) and K latent components c_1, c_2, ..., c_K such that the quantity S is maximized:

S = \sqrt{n} \sum_{k=1}^{K} \sum_{j=1}^{p} \delta_{kj} \, \mathrm{Cov}(x_j, c_k)  under the constraint  c_k' c_k = 1    (3.1)

with \delta_{kj} = 1 if the jth variable belongs to cluster G_k and \delta_{kj} = 0 otherwise.
A solution to this problem is given by a partitioning algorithm which follows the same pattern as in Sec. 2, with the following adaptations:

Step 2. In cluster G_k (k = 1, 2, ..., K), component c_k is set to:

c_k = \frac{\bar{x}_k}{\sqrt{\bar{x}_k' \bar{x}_k}}    (3.2)

where \bar{x}_k is the variable which represents the centroid of cluster G_k, defined by:

\bar{x}_k = \frac{\sum_{j=1}^{p_k} x_{kj}}{p_k}    (3.3)

with x_{kj} denoting the jth variable in group G_k and p_k the total number of variables in this group.

Step 3. New clusters are formed by moving each variable to a new group if its covariance with the standardized centroid of this group is higher than with any other standardized centroid.
As in Sec. 2, we suggest complementing the partitioning algorithm with a hierarchical clustering method. The strategy of the hierarchical algorithm stems from the following form of criterion S:

S = \sum_{k=1}^{K} p_k \, \sigma(\bar{x}_k)    (3.4)

where \sigma(\bar{x}_k) is the standard deviation of \bar{x}_k. At stage i, consider two clusters of variables A and B and denote by \bar{x}_A and \bar{x}_B their respective centroids. If A and B are merged, this will result in a variation of criterion S as measured by:

\Delta = S_{i-1} - S_i = p_A \sigma(\bar{x}_A) + p_B \sigma(\bar{x}_B) - (p_A + p_B) \, \sigma(\bar{x}_{A \cup B})    (3.5)

where p_A and p_B are respectively the number of variables in group A and group B. It is easy to verify (see Appendix) that:

(p_A + p_B) \, \sigma(\bar{x}_{A \cup B}) \le p_A \sigma(\bar{x}_A) + p_B \sigma(\bar{x}_B)    (3.6)

Therefore the merging of two clusters leads to a decrease in criterion S. The strategy of aggregation consists in merging at each stage those clusters A and B that result in the smallest decrease in S.
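A brief sketch may again help; the one below (our names, assuming numpy and a centered n x p matrix X) replaces the first principal component of Sec. 2 by the standardized centroid of Eqs. (3.2)-(3.3) and uses signed covariances, so that negatively correlated variables are driven into different clusters.

```python
import numpy as np

def standardized_centroid(Xk):
    # Eqs. (3.2)-(3.3): mean of the variables in the cluster, scaled so
    # that c'c = 1.
    xbar = Xk.mean(axis=1)
    return xbar / np.sqrt(xbar @ xbar)

def cavalc2_partition(X, labels0, max_iter=100):
    n, p = X.shape
    labels = np.asarray(labels0).copy()
    K = labels.max() + 1
    for _ in range(max_iter):
        comps = np.column_stack([standardized_centroid(X[:, labels == k])
                                 for k in range(K)])
        cov = X.T @ comps / n              # signed covariances (Step 3)
        new_labels = cov.argmax(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    S = np.sqrt(n) * cov[np.arange(p), labels].sum()   # criterion (3.1)
    return labels, comps, S
```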
4. EXTENSION: CLUSTERING OF VARIABLES WITH RESPECT TO EXTERNAL DATA

The extension in this section deals with the problem of clustering a set of variables while at the same time exploring how this clustering may be explained using external variables. In addition to the variables x_1, x_2, ..., x_p, we consider a data set Z formed with q external variables, z_1, z_2, ..., z_q, which refer to the same n individuals.

Firstly, consider the case where it is assumed that both positive and negative correlations imply agreement. We seek to maximize criterion \tilde{T} given by:

\tilde{T} = n \sum_{k=1}^{K} \sum_{j=1}^{p} \delta_{kj} \, \mathrm{Cov}^2(x_j, c_k)  under the constraints  c_k = Z a_k  and  a_k' a_k = 1    (4.1)
Under these constraints, we have:

\tilde{T} = \frac{1}{n} \sum_{k=1}^{K} a_k' Z' X_k X_k' Z a_k    (4.2)

X_k being the matrix whose columns are formed with the variables belonging to group G_k. The maximization of \tilde{T} under the considered constraints leads to a partitioning algorithm similar to the algorithm that was used for maximizing T, except that in this case the latent component in group G_k is given by c_k = Z a_k, with a_k being the first standardized eigenvector of \frac{1}{n} Z' X_k X_k' Z, associated with the largest eigenvalue \mu_1^{(k)}.
The hierarchical clustering approach may also be used in order to give a starting point for the partitioning algorithm and a hint about how many clusters should be considered. As previously, the strategy stems from the following form of \tilde{T}:

\tilde{T} = \sum_{k=1}^{K} \mu_1^{(k)}    (4.3)
This suggests an agglomerative procedure which consists in merging, at each step, those two clusters that result in the smallest decrease in \tilde{T}. The fact that the merging of two clusters results in a decrease of criterion \tilde{T} can be proven in a very similar way as for criterion T (see Appendix).

If we consider the case where negative correlations imply disagreement, the objective becomes to maximize the criterion:

\tilde{S} = \sqrt{n} \sum_{k=1}^{K} \sum_{j=1}^{p} \delta_{kj} \, \mathrm{Cov}(x_j, c_k)  under the constraints  c_k = Z a_k  and  a_k' a_k = 1
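As a minimal sketch of the Step 2 computation under these external constraints, assuming numpy and centered matrices X_k (n x p_k) and Z (n x q), the latent component might be obtained as follows (external_component is our name, not the paper's):

```python
import numpy as np

def external_component(Xk, Z):
    # Latent component of cluster G_k constrained to the space of the
    # external variables: c_k = Z a_k, where a_k is the first standardized
    # eigenvector of (1/n) Z' X_k X_k' Z.
    n = Z.shape[0]
    M = (Z.T @ Xk) @ (Xk.T @ Z) / n        # q x q symmetric matrix of (4.2)
    eigvals, eigvecs = np.linalg.eigh(M)   # eigenvalues in ascending order
    a_k = eigvecs[:, -1]                   # unit norm, so a_k' a_k = 1
    return Z @ a_k, eigvals[-1]            # c_k and mu_1^(k) of Eq. (4.3)
```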
Figure 6. Latent components associated with the two groups (values of the two latent components, C1 and C2, plotted over products P1 to P8).
exhibited two opposite latent components which are, as a matter of fact, almost collinear with this principal component. However, it should be stressed that the advantage of the clustering approach around latent components over PrefMap is threefold:

1. It directly provides segments of consumers.
2. In each segment, it gives a (single) model which relates preference to sensory data, whereas PrefMap derives a model per consumer.
3. It captures the most relevant information for explaining preference data even if this information is not contained in the first principal components of the sensory data.
6. CONCLUSION

The examples presented in Sec. 5 show the potential of cluster analysis of variables around latent components (CAVALC, for short). We also stressed how this approach complements existing methods.
The strategy CAVALC-1 is based on the magnitudes and not the signs of the covariances between variables. It may be used as a complement or an alternative to PCA, especially when simplification of principal components is desired. The main difference between this approach and other simplification schemes of principal components (Jeffers, 1967; Vines, 2000) is that individual variables cannot appear in more than one latent component. CAVALC-1 may also be used to select a subset of variables by choosing one variable from each cluster. When external data are available, CAVALC-1 offers an alternative to the PLS approach (Garthwaite, 1994) as it involves the computation of latent components, linear combinations of the external data, that explain the variables under study. For this purpose, CAVALC-1 splits these variables into several groups according to how they relate to the external data. Further research is indicated in order to investigate more deeply the respective advantages of CAVALC-1 and PLS.

CAVALC-2, which considers that two negatively correlated variables present high disagreement, may be used in preference studies. By segmenting the panel of consumers, CAVALC-2 exhibits one dominant direction of preference in each segment. When external data are available, CAVALC-2 provides an alternative to External Preference Mapping (PrefMap) and leads to a small number of models linking preference data to external data. Moreover, it directly captures the information in these external data which is relevant for explaining the preferences of consumers.

Research will also be undertaken in order to set up a hypothesis testing framework for the choice of the number of clusters. Monte-Carlo simulation may be useful in this context.
APPENDIX

Proof of inequality (2.6):

\lambda_1^{(A \cup B)} = \max_{\|c\|=1} \frac{1}{n} c' X_{A \cup B} X_{A \cup B}' c
= \max_{\|c\|=1} \left\{ \frac{1}{n} c' X_A X_A' c + \frac{1}{n} c' X_B X_B' c \right\}
\le \max_{\|c\|=1} \left\{ \frac{1}{n} c' X_A X_A' c \right\} + \max_{\|c\|=1} \left\{ \frac{1}{n} c' X_B X_B' c \right\}
= \lambda_1^{(A)} + \lambda_1^{(B)}

with X_A denoting the matrix whose columns are formed with the variables belonging to group A, and \max_{\|c\|=1} standing for the maximum over the standardized components c.

Proof of inequality (3.6):

(p_A + p_B) \, \sigma(\bar{x}_{A \cup B}) = (p_A + p_B) \frac{1}{\sqrt{n}} \|\bar{x}_{A \cup B}\|
= \frac{1}{\sqrt{n}} \|p_A \bar{x}_A + p_B \bar{x}_B\|
\le p_A \frac{1}{\sqrt{n}} \|\bar{x}_A\| + p_B \frac{1}{\sqrt{n}} \|\bar{x}_B\|
= p_A \sigma(\bar{x}_A) + p_B \sigma(\bar{x}_B)

where \|\cdot\| denotes the usual Euclidean norm.
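Both inequalities can also be checked numerically; the following throwaway snippet (ours, assuming numpy, not part of the paper) verifies (2.6) and (3.6) on random centered data.

```python
import numpy as np

rng = np.random.default_rng(0)
n, pA, pB = 50, 4, 3
XA = rng.normal(size=(n, pA)); XA -= XA.mean(axis=0)   # centered variables
XB = rng.normal(size=(n, pB)); XB -= XB.mean(axis=0)

# lambda1: largest eigenvalue of the covariance matrix (1/n) X'X.
lam1 = lambda M: np.linalg.svd(M, compute_uv=False)[0] ** 2 / n
assert lam1(np.hstack([XA, XB])) <= lam1(XA) + lam1(XB)           # (2.6)

# sigma(xbar) = ||xbar|| / sqrt(n) for a centered variable xbar.
sigma = lambda v: np.linalg.norm(v) / np.sqrt(n)
xA, xB = XA.mean(axis=1), XB.mean(axis=1)
xAB = np.hstack([XA, XB]).mean(axis=1)
assert (pA + pB) * sigma(xAB) <= pA * sigma(xA) + pB * sigma(xB)  # (3.6)
```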
REFERENCES

Abdallah, H., Saporta, G. (1998). Classification d'un ensemble de variables qualitatives. Revue de Statistique Appliquée XLVI(4):5-26.
Al-Kandari, N. M., Jolliffe, I. T. (2001). Variable selection and interpretation of covariance principal components. Communications in Statistics: Simulation and Computation 30:339-354.
Cadima, J., Jolliffe, I. T. (1995). Loadings and correlations in the interpretation of principal components. Journal of Applied Statistics 22:203-214.
ESN (1996). A European Sensory and Consumer Study: A Case Study on Coffee. Published by the European Sensory Network.
Gains, N., Krzanowski, W. J., Thomson, D. M. H. (1988). A comparison of variable reduction techniques in an attitudinal investigation of meat products. Journal of Sensory Studies 3:37-48.
Garthwaite, P. H. (1994). An interpretation of partial least squares. Journal of the American Statistical Association 89:122-127.
Greenhoff, K., MacFie, H. J. H. (1994). Preference mapping in practice. In: MacFie, H. J. H., Thomson, D. M. H., eds. Measurement of Food Preferences. London: Blackie Academic & Professional, pp. 137-166.
Guo, Q., Wu, W., Massart, D. L., Boucon, C., de Jong, S. (2002). Feature selection in principal component analysis of analytical data. Chemometrics and Intelligent Laboratory Systems 61:123-132.
Jeffers, J. N. R. (1967). Two case studies in the application of principal component analysis. Applied Statistics 16:225-236.
Jolliffe, I. T. (1972). Discarding variables in a principal component analysis. I: Artificial data. Applied Statistics 21:160-173.
Krzanowski, W. J. (1987). Selection of variables to preserve multivariate data structure, using principal components. Applied Statistics 36:22-33.
McCabe, G. P. (1984). Principal variables. Technometrics 26:137-144.
Qannari, E. M., Vigneau, E., Courcoux, P. (1997). Clustering of variables, application in consumer and sensory studies. Food Quality and Preference 8:423-428.
Qannari, E. M., Vigneau, E., Courcoux, P. (1998). Une nouvelle distance entre variables; application en classification. Revue de Statistique Appliquée XLVI(2):21-32.
SAS/STAT (1990). The VARCLUS procedure. User's Guide, Version 6, Vol. 2. Cary, North Carolina: SAS Institute Inc., pp. 1641-1659.
Soffritti, G. (1999). Hierarchical clustering of variables: a comparison among strategies of analysis. Communications in Statistics: Simulation and Computation 28:977-999.
Vines, S. K. (2000). Simple principal components. Applied Statistics 49:441-451.