Communications in Statistics - Simulation and Computation
ISSN: 0361-0918 (Print), 1532-4141 (Online). Journal homepage: http://www.tandfonline.com/loi/lssp20
To cite this article: Noriah M. Al-Kandari & Ian T. Jolliffe (1997) Variable selection and interpretation in canonical correlation analysis, Communications in Statistics - Simulation and Computation, 26:3, 873-900, DOI: 10.1080/03610919708813416
Published online: 27 Jun 2007.
COMMUN. STATIST.-SIMULA., 26(3), 873-900 (1997)
VARIABLE SELECTION AND INTERPRETATION IN CANONICAL CORRELATION ANALYSIS
Noriah M. Al-Kandari and Ian T. Jolliffe
Department of Mathematical Sciences, University of Aberdeen, Aberdeen AB24 3QY, UK
Key Words: loadings; multiple regression; redundancy; reification; weights.
ABSTRACT
The canonical variates in canonical correlation analysis are often interpreted by looking at the weights or loadings of the variables in each canonical variate and effectively ignoring those variables whose weights or loadings are small. It is shown that such a procedure can be misleading. The related problem of selecting a subset of the original variables which preserves the information in the most important canonical variates is also examined. Because of different possible definitions of 'the information in canonical variates', any such subset selection needs very careful consideration.
Copyright © 1997 by Marcel Dekker, Inc.
1. INTRODUCTION
In canonical correlation analysis (CCA) we are concerned with investigating the relationships between two sets of random variables, X1, X2, ..., Xp on the one hand, and Y1, Y2, ..., Yq on the other. Further details of CCA, together with a motivating example, are given in Section 2.

The results of CCA comprise a sequence of pairs of maximally correlated new variables (U1, V1), (U2, V2), ..., (Us, Vs), where s ≤ min(p, q); each U is a linear combination of X1, X2, ..., Xp, and each V is a linear combination of Y1, Y2, ..., Yq. To interpret, or reify, these pairs of variates it is common to look at the coefficients defining the new variates and essentially ignore those variables whose coefficients are negligible. This strategy was examined in the context of principal component analysis (PCA) by Cadima and Jolliffe (1995), and it was shown that it can be misleading. In Section 3 we show that similar problems exist in CCA, with an additional complication: the coefficients in the canonical variates can be defined in two distinct ways, as weights or as loadings (see Section 2), and the two choices can have different implications for interpretation. In Cadima and Jolliffe (1995) an alternative strategy for interpretation, based on multiple regression, was introduced for PCA. A similar strategy is explained for CCA in Section 4.
The forms of interpretation discussed in Sections 3 and 4 are effectively methods of variable selection. In Section 5 the topic of variable selection in CCA is expanded, and various criteria for selection are discussed. Throughout Sections 3-5, the motivating example introduced in Section 2 and an additional example introduced in Section 4 are used for illustration. Section 6 contains discussion and some concluding remarks.
2. CANONICAL CORRELATION ANALYSIS
2.1 FORMULATION OF CANONICAL CORRELATION ANALYSIS
Canonical correlation analysis (CCA) is a multivariate statistical technique developed by Hotelling (1936) and designed to explore the relationships between two sets of variables, say X = (X1, X2, ..., Xp)' and Y = (Y1, Y2, ..., Yq)'. CCA can be derived in two ways: either based on Lagrange multipliers and eigenanalysis (Hotelling 1935, 1936; Anderson 1958) or based on direct orthogonal decomposition of X and Y (Lancaster 1958, 1966; Horst 1961). The latter derivation is used in this paper since it has significant computational advantages, and is simpler and more intuitive than the former derivation. CCA has been explained in many multivariate analysis books (see Gittins (1985) and Rencher (1995)), but in order to understand the contents of the sections which follow, we need to define and describe the properties of various quantities which arise in CCA.

Suppose that A is the matrix whose columns are the eigenvectors of

Σ_XX^{-1} Σ_XY Σ_YY^{-1} Σ_YX,    (2.1)

standardized to have unit length, and B is similarly constructed from the eigenvectors of the transpose of the matrix (2.1). In (2.1), Σ_XX and Σ_YY are the within-set variance-covariance matrices for X and Y respectively, and Σ_XY is the between-sets covariance matrix. The matrices of canonical weights (CWs), A_X and B_Y, for the first and second sets of variables, X and Y respectively, are obtained from A and B by rescaling their columns so that each canonical variate has unit variance:

A_X = A D_X    (2.2)

and

B_Y = B D_Y,    (2.3)

where D_X and D_Y are diagonal matrices chosen so that a_k' Σ_XX a_k = 1 and b_k' Σ_YY b_k = 1 for each column a_k of A_X and b_k of B_Y.
The magnitudes of the CWs indicate the importance of each variable from one set in obtaining a maximum correlation with the other set. They are analogous to the beta weights in multiple regression analysis because they show the contribution of each variable with respect to the total variance of the canonical variate by which they are defined. The CWs tend to be highly unstable if multicollinearity is present between the variables of either or both sets.

The canonical correlations are invariant to separate linear transformations of X and Y, so for convenience we work with X^s and Y^s, which are standardized versions of X and Y with zero means and unit variances. In this case the covariance matrices Σ_XX, Σ_YY and Σ_XY become matrices of correlations, and the kth pair of canonical variates (CVs), U_k and V_k, are

U_k = a_k' X^s    (2.4)

and

V_k = b_k' Y^s,    (2.5)

where a_k and b_k are the kth columns of A_X and B_Y respectively. The CVs are constructed as linear combinations of X and Y which successively have maximum correlation and are uncorrelated with previous CVs. The correlation between U_k and V_k is called the canonical correlation coefficient (CCC) and is given by r_k, the square root of the kth eigenvalue of matrix (2.1). Its square is a measure of the variance shared by the kth pair of CVs.
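These quantities can be sketched numerically. The following Python fragment is an illustrative sketch only (synthetic data standing in for real scores; it is not the authors' MINITAB macro): it forms the matrix referred to as (2.1) from standardized data, takes the canonical correlations as square roots of its eigenvalues, and checks that the first CCC agrees with the sample correlation between U_1 and V_1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: 150 cases, p = 5 and q = 5 variables sharing
# some latent structure so that the canonical correlations are non-trivial.
n, p, q = 150, 5, 5
Z = rng.standard_normal((n, 3))
X = Z @ rng.standard_normal((3, p)) + rng.standard_normal((n, p))
Y = Z @ rng.standard_normal((3, q)) + rng.standard_normal((n, q))

# Standardize to zero mean and unit variance, as in Section 2.1.
Xs = (X - X.mean(0)) / X.std(0, ddof=1)
Ys = (Y - Y.mean(0)) / Y.std(0, ddof=1)

Sxx = np.cov(Xs, rowvar=False)
Syy = np.cov(Ys, rowvar=False)
Sxy = np.cov(np.hstack([Xs, Ys]), rowvar=False)[:p, p:]

# The matrix (2.1); its eigenvalues are the squared canonical correlations.
M = np.linalg.solve(Sxx, Sxy) @ np.linalg.solve(Syy, Sxy.T)
eigvals, eigvecs = np.linalg.eig(M)
order = np.argsort(eigvals.real)[::-1]
r = np.sqrt(eigvals.real[order])          # canonical correlation coefficients

# First canonical weight vectors, scaled so U_1 and V_1 have unit variance.
a1 = eigvecs.real[:, order[0]]
a1 /= np.sqrt(a1 @ Sxx @ a1)
b1 = np.linalg.solve(Syy, Sxy.T @ a1)
b1 /= np.sqrt(b1 @ Syy @ b1)

U1, V1 = Xs @ a1, Ys @ b1
r_check = np.corrcoef(U1, V1)[0, 1]
print(round(r[0], 4), round(abs(r_check), 4))   # the two should agree
```

The check works because the canonical weight vectors are exactly the eigenvectors of (2.1), rescaled to give unit-variance variates.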
The canonical loadings, or structure correlations, are used to study the linear association between the CVs and the original variables and aid in the interpretation of the CVs. These correlations (loadings) can be divided into two types: intraset correlation coefficients and interset correlation coefficients. In this paper we consider only intraset correlations, and for brevity we refer to these simply as loadings. The intraset correlations between the observed variables of X and their associated CV, U_k, are defined by

r_{X,U_k} = Σ_XX a_k,

whereas the intraset correlations for the second set Y and its corresponding CV, V_k, are

r_{Y,V_k} = Σ_YY b_k.

Let s_ik be the ith element of r_{X,U_k}, and g_jk be the jth element of r_{Y,V_k}. The total proportion of variance of a set which is extracted by a CV defined on the same set can be defined as the mean of the squares of the intraset correlation coefficients (loadings). Therefore, the total proportion of variance extracted from X by the CV U_k is

V_{U_k} = (1/p) Σ_{i=1}^{p} s_ik²,

and likewise the total proportion of variance extracted from Y by the CV V_k is

V_{V_k} = (1/q) Σ_{j=1}^{q} g_jk².
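As a small check on these definitions (a hypothetical sketch: a1 here is an arbitrary unit-variance weight vector, not a fitted canonical weight), the intraset loadings of standardized variables on U_1 reduce to Σ_XX a_1, and the variance extracted is the mean of their squares.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 150, 5
Xs = rng.standard_normal((n, p))
Xs = (Xs - Xs.mean(0)) / Xs.std(0, ddof=1)   # standardized variables

Sxx = np.cov(Xs, rowvar=False)
a1 = np.ones(p)                               # arbitrary illustrative weights
a1 /= np.sqrt(a1 @ Sxx @ a1)                  # scale so U_1 has unit variance
U1 = Xs @ a1

# Intraset correlations (loadings): for standardized X these equal Sxx a_1.
loadings = Sxx @ a1
loadings_direct = np.array([np.corrcoef(Xs[:, i], U1)[0, 1] for i in range(p)])

# Total proportion of variance of X extracted by U_1: mean squared loading.
V_U1 = np.mean(loadings ** 2)
print(np.allclose(loadings, loadings_direct), round(V_U1, 3))
```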
Stewart and Love (1968) proposed an index called the canonical redundancy coefficient, which measures the proportion of the total variance of one set of variables that is predictable from the CVs of the other set. The redundancy coefficient can be calculated as the product of the proportion of variance extracted from a set of variables by a CV for that set and the square of the corresponding CCC. Hence the redundancy coefficient for X relative to the CV V_k is given by

R_{X|V_k} = V_{U_k} r_k².

Similarly, for the set Y, the redundancy coefficient is

R_{Y|U_k} = V_{V_k} r_k².

The sum of the redundancy coefficients for each set over the s retained CVs yields the total redundancy coefficient for each set. The total redundancy for X given the CVs V_1, V_2, ..., V_s is

TR_{X|V} = Σ_{k=1}^{s} V_{U_k} r_k²,

and correspondingly the total redundancy for Y given the CVs U_1, U_2, ..., U_s is

TR_{Y|U} = Σ_{k=1}^{s} V_{V_k} r_k².
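The redundancy calculus can be expressed in a few lines; the following sketch uses invented loadings and canonical correlations purely for illustration.

```python
import numpy as np

def redundancy(loadings, r):
    """Redundancy coefficient: (mean squared intraset loading) * r**2."""
    return float(np.mean(np.asarray(loadings) ** 2) * r ** 2)

# Invented loadings of five variables on the first two CVs of their own set,
# and invented canonical correlations r_1, r_2 (not real data).
loadings = [np.array([0.55, 0.60, 0.80, 0.58, 0.52]),
            np.array([0.30, 0.45, 0.10, 0.40, 0.25])]
r = [0.68, 0.29]

# Total redundancy: sum the per-CV redundancies over the s retained CVs.
TR = sum(redundancy(l, rk) for l, rk in zip(loadings, r))
print(round(redundancy(loadings[0], r[0]), 3), round(TR, 3))
```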
TABLE I. Canonical weights and canonical correlation coefficients for the first two pairs of CVs in the WPPSI data [table body not recovered]
The total redundancy provides an overall measure of the variance in one set accounted for by the variables of the other set, unlike the CCC, which measures the strength of the relationship between particular linear combinations of each set. The total redundancy is an asymmetric measure of explained variance: the total variance of X which is predictable from the CVs of Y (TR_{X|V}) is not equal to the total variance of Y which is predictable from the CVs of X (TR_{Y|U}).
2.2. CCA OF THE WPPSI DATA SET
The data used in the present section are taken from Yule et al. (1967) and consist of scores for 150 children in ten subtests of the Wechsler Pre-School and Primary Scale of Intelligence (WPPSI). Full details of the ten subtests are given by Yule et al. (1967).
Canonical correlation analysis is used to reveal the relationship between the first 5 (verbal) tests, X, and the last 5 (performance) tests, Y. A macro 'cca' was written using commands in MINITAB to provide all the previously defined results of CCA. A summary of the results of CCA for the first two pairs of CVs is given in Tables I and II.
TABLE II. Intraset correlations (loadings) and total redundancies for the first two pairs of CVs in the WPPSI data [table body, with entries for u1 and u2 (set X) and v1 and v2 (set Y), not recovered]
Table I shows that the first pair of CVs (u1, v1) can be written as linear functions of the variables of X and Y respectively; the table also gives coefficients for the second pair of CVs (u2, v2). The squares of the CCCs express the variance shared by a pair of CVs. Therefore, 46% of the variance in the linear combination of the verbal tests given by u1 = a_1' X^s is attributable to variation in the linear combination of the performance tests defined by v1 = b_1' Y^s. Similarly, 8.5% of the variation in u2 is accounted for by the variation in v2. The loadings are given in Table II and are helpful in establishing the nature of the CVs defined in each set of variables. Considering the second set Y, from the magnitudes of r_{Y,V_1} it appears that the correlations between the performance test variables and v1 are all reasonably strong, while only y3 (g_32 = -0.564) and y4 (g_42 = -0.556) show a strong correlation with v2.
The redundancy coefficients can be calculated from Tables I and II by multiplying the within-set variance (V_{U_k} or V_{V_k}) extracted by a CV by the squared canonical correlation (r_k²) of the variate, for example for u1 and v1. Total redundancy expresses the total variance of one set which is predictable from all the CVs of the other set. We see from Table II, by summing the contributions from the first two CVs, that 28% of the variance of the verbal tests is predictable from the first two CVs of the performance tests, whereas 25% of the variance of the performance tests is predictable from the first two CVs of the verbal tests. Almost all of this variation is in the first pair, with very little contributed by the second pair. Table II confirms that total redundancy is asymmetric between sets (0.276 ≠ 0.246).
3. TRUNCATED CANONICAL VARIATES
This section is similar to some of the discussion in Cadima and Jolliffe (1995), who discussed loadings and correlations in the interpretation of principal components (PCs). The usual way of interpreting PCs is by (consciously or unconsciously) removing from the linear function defining the PC those variables which have small weights, giving a truncated PC. Cadima and Jolliffe (1995) note that this procedure of truncating a PC is often not safe, as the correlations (or loadings) between the variables and PCs are not taken into account. The correlations between the individual variables and any PC are appropriate indicators when judging the sufficiency of single-variable approximations, but are insufficient to assess the efficacy of approximations based on a choice of k > 1 variables. Moreover, Cadima and Jolliffe (1995) note that the weights are not necessarily a reliable way of determining whether a subset of the p original variables will actually provide an acceptable truncated PC, especially for covariance-matrix PCA. Nor is it possible to determine how close a PC is to its truncated approximation merely by looking at the correlations (or loadings) between single variables and PCs.
In this section we investigate to what extent similar problems exist in CCA. For a given CV, when small canonical weights are ignored for the purpose of interpretation, they are replaced by exactly zero weights. The resulting truncated CV, u^T (or v^T), is then defined as a linear combination of the p* (or q*) retained variables. Considering the first set X in the example of Section 2.2, the size of the weights of the variables in Table I would lead us to regard the first two CVs as dominated by {x3} and {x2, x4, x5} respectively. Therefore, the two truncated approximations, named u_1^T and u_2^T, are defined as

u_1^T = -0.537 x3    (3.1)

and u_2^T, the linear combination of x2, x4 and x5 with their weights from Table I.    (3.2)

The correlation between u_1^T and u_1 is 0.883. This correlation will increase if we add a single additional variable in (3.1). If the criterion for adding a new variable is the second highest canonical weight on u_1, then x4 is our choice, and the new truncated CV is u_1^T = -0.537 x3 - 0.321 x4, with correlation increased to 0.965. Alternatively, if the criterion for adding a new variable is the strongest correlation (highest loading) with u_1, then x2 is our
choice, and the new truncated CV has correlation 0.948 with u_1. The correlation between u_2 and u_2^T, defined by (3.2), is 0.933.
If, instead, we truncate the CVs according to the size of the loadings, rather than the weights, of the variables, we obtain different results. Retaining variables whose loadings have absolute values ≥ 0.50, we find (see Table II) that u_1 will require all the variables in X to be retained, whereas if we increase the threshold for retaining variables, all the variables in X will be discarded in obtaining a truncated approximation for u_2. Hence different thresholds are needed for different CVs. A high correlation (0.994) between u_1 and its truncated approximation can be obtained if it is defined as a linear function of {x2, x3, x4}. Thus, based on the criterion of retaining those variables which have high loadings with their associated CV, the two truncated approximations for u_1 and u_2 are

u_1^L = -0.229 x2 - 0.537 x3 - 0.321 x4    (3.3)

u_2^L = -1.004 x5.    (3.4)

In (3.4), the second CV is dominated by x5, which has by far the highest loading with u_2, and in this case the correlation between u_2^L and u_2 is 0.557. This correlation will be improved if we add variables to (3.4). Adding the variables which have the highest loadings, {x1, x4}, gives a correlation of 0.713. This is worse than adding the variables x2 and x3, which have the lowest loadings but give a correlation of 0.886, and much worse than adding those which have the highest canonical weights; this selects x2 and x4, giving a correlation of 0.933. The subset {x2, x3, x5} could be retained simultaneously to yield good correlations between the two CVs u_1 and u_2 and their associated truncated approximations u_1^L and u_2^L. The correlation between u_1 and u_1^L in this case is equal to 0.959, and that
between u_2 and u_2^L is 0.886. Further discussion of variable selection will be given in Section 5.

Similar features to those in the first data set X were also observed in the second set Y. In addition, the canonical weights and loadings for the second set Y, which are summarized in Tables I and II, suggest the following points: (a) for different CVs, very distinct weights may correspond to similar loadings of a given variable, e.g. y3 in v1 and v2; (b) a given CV may be equally correlated with high- and low-weight variables, e.g. y5 and y2 with v2; (c) variables with similar weights for a given CV can be differently correlated with that CV, e.g. y1 and y4 in v2; (d) similar weights may correspond to very different importances of the respective variables in approximating different CVs, e.g. y5 in v1 and y4 in v2.

Moreover, similar loadings may lead to very different correlations between a given CV and its two truncated approximations. For example, in (3.4) we found that u_2 is dominated by x5. Although x2 and x3 have approximately the same correlation with u_2, there is a substantial difference in the correlations between u_2 and its two truncated approximations defined as linear combinations of {x2, x5} and {x3, x5} respectively. The correlation between u_2 and the former approximation is 0.885, while it is 0.550 with the latter. This appears to be largely due to x2 in Table I having a much higher weight in u_2 than x3. Hence, the weights of the variables in a given CV could be regarded as a good criterion for truncation. Conversely, variables with similar weights in a given CV often yield similar correlations between the CV and its two truncated approximations based on these variables. For instance, considering the first CV of Y, y1 and y4 have very similar canonical weights in v1, and the corresponding two single-variable truncated approximations of v1 have approximately the same correlations (0.940 and 0.920 respectively) with v1.
There are exceptions, especially for truncations using only one variable, for example y3 and y4 in v2. However, based on the previous discussion, and on experience with other data sets (see Al-Kandari (1994)), it is often found that truncated approximations defined by those variables with high canonical weights are closer to the original CVs than truncated approximations defined by variables with high canonical loadings. We defer consideration of total redundancy coefficients until Section 5.
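The truncation exercise of this section can be sketched in code. This is a hypothetical example (synthetic data and an invented weight vector, not the WPPSI coefficients): small weights are set to exactly zero and the correlation between the CV and its truncated version is computed.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 150, 5
Xs = rng.standard_normal((n, p))
Xs = (Xs - Xs.mean(0)) / Xs.std(0, ddof=1)   # standardized variables

# Invented weight vector in which x3 dominates, loosely echoing u_1.
a = np.array([-0.08, -0.23, -0.54, -0.32, -0.05])
u = Xs @ a

def truncated(weights, thresh):
    """Zero out weights below the threshold and form the truncated CV."""
    w = np.where(np.abs(weights) >= thresh, weights, 0.0)
    return Xs @ w

# Lower thresholds keep more variables, so the truncated CV tracks u better.
for thresh in (0.4, 0.3, 0.2):
    u_t = truncated(a, thresh)
    print(thresh, round(np.corrcoef(u, u_t)[0, 1], 3))
```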
4. REGRESSED CANONICAL VARIATES
So far, we have only considered truncated approximations for CVs based on p* (or q*) selected variables. In this section we will see that this strategy does not, in general, give the best approximate CV using this subset of variables, and an alternative way of approximating the CV will be considered. For a given subset of p* (or q*) variables, the best (least squares) linear approximation to a CV is given by the multiple linear regression equation of the CV on the p* (or q*) variables. This regressed CV maximizes the multiple correlation coefficient of the p* (or q*) variables with the CV, and may be substantially better than a truncated CV in terms of this correlation. In this section we investigate differences in approximating and interpreting a CV using the truncated CV and the regressed CV. The truncated and corresponding regressed approximations for the CVs of the previous data set are reported in Table III. The selected subsets of X and Y in Table III summarize the results of Section 3 obtained from the canonical weights and loadings.
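The contrast between the two approximations can be sketched as follows (a hedged example on synthetic correlated data; the weight vector and retained subset are invented). The regressed CV is the least-squares projection of the CV onto the retained columns, so its correlation with the CV can never be lower than that of the truncated CV.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 150, 5
# Correlated variables: equicorrelation 0.5 via a Cholesky factor.
L = np.linalg.cholesky(0.5 * np.eye(p) + 0.5 * np.ones((p, p)))
Xs = rng.standard_normal((n, p)) @ L.T
Xs = (Xs - Xs.mean(0)) / Xs.std(0, ddof=1)

a = np.array([-0.08, -0.23, -0.54, -0.32, -0.05])  # invented CV weights
u = Xs @ a

keep = [1, 2, 3]                         # a chosen subset of p* = 3 variables

# Truncated CV: the original weights applied to the kept variables only.
u_trunc = Xs[:, keep] @ a[keep]

# Regressed CV: least-squares regression of u on the kept variables, which
# maximizes the correlation achievable by any linear function of them.
beta, *_ = np.linalg.lstsq(Xs[:, keep], u, rcond=None)
u_reg = Xs[:, keep] @ beta

c_trunc = np.corrcoef(u, u_trunc)[0, 1]
c_reg = np.corrcoef(u, u_reg)[0, 1]
print(round(c_trunc, 3), round(c_reg, 3))   # c_reg >= c_trunc always
```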
TABLE III. The coefficients of the variables and the correlations in the two approximations for the first two pairs of CVs in the WPPSI data [table body not recovered]
It is clear that for all the given subsets of X and Y, except {x2, x4, x5}, there is no real difference in approximating and interpreting the CVs using the two different methods of approximation. Even for the subset {x2, x4, x5} in approximating u_2, the differences are not substantial, but this is not always the case. Consider a second (unpublished) data set, due to Dr. M. E. Jarvik, which consists of 8 psychological and physical variables, X, and 4 smoking variables, Y, measured on 110 individuals. Full details of the analysis of this data set are given by Al-Kandari (1994). Table IV shows the canonical weights and loadings of the variables in the first two pairs of CVs. The results in Table IV have been used to construct Table V, which gives the truncated and regressed approximations for the first two pairs of CVs
TABLE IV. The canonical weights (A_X, B_Y) and loadings (r_{X,U}, r_{Y,V}) for the variables of the two sets X and Y in their associated CVs for the Jarvik smoking data [table body not recovered]
defined on subsets X* and Y* of X and Y respectively, which are chosen based on the canonical weights and loadings. Table V demonstrates that the correlations between the CVs and their associated approximations are high for those defined as linear functions of the variables selected by the canonical weights criterion. It is clear that the interpretations of the CVs using the truncated and regressed approximations are very similar when the canonical weights criterion is used. On the other hand, when the loadings criterion is used to construct the two approximations, there are larger differences in interpreting the CVs. For example, the truncated CV for v_1 has roughly equal coefficients for y3 and y4, whereas y2 dominates the regressed CV.
TABLE V. The coefficients of the variables and the correlations in the two approximations for the first two pairs of CVs in the Jarvik smoking data [table body not recovered]
Further examination of this data set reveals three other interesting features, which are highlighted in Tables VIa-VIc. Firstly, there is sometimes a clear difference in the weights of the variables in the two approximations, and hence potentially different interpretations, even when the correlations between the CVs and the two approximations are very similar; see Table VIa.
TABLE VIa. The coefficients of the variables and the correlations in the two approximations [table body not recovered]

TABLE VIb. The coefficients of the variables and the correlations in the two approximations for the CVs u1, v1 and v2 in the Jarvik smoking data [table body not recovered]

TABLE VIc. The coefficients of the variables and the correlations in the two approximations [table body not recovered]
Secondly, there may be a large difference in the correlations between the CV and its two approximations, even when the weights of the variables in the two approximations are relatively close; see Table VIb. Finally, both the correlation and the importance of the variables in interpreting a CV can vary substantially between the two methods of approximation; see Table VIc. As an additional example, consider the following truncated and regressed approximations, (4.1) and (4.2), of v_1, which are defined by y3 and y4.
The variables in (4.1) and (4.2) exchange their weights, and there is a clear difference between the two approximations in their correlations with v_1 (0.222 and 0.794 for the truncated and regressed approximations respectively). This implies that y4 is, in some sense, more important than y3 when regressing v_1 on these two variables.
5. VARIABLE SELECTION
In this section three different methods of variable selection are discussed and illustrated by the Jarvik smoking data set. The first method of variable selection is by canonical weights. As we saw in Section 3, if the researcher is interested in determining the best subsets of X and Y to be used to interpret the first two pairs (u_1, v_1) and (u_2, v_2) individually, then the subsets {x1, x2, x4} and {y2, y3, y4} are recommended for the first pair (u_1, v_1), whereas for the second pair (u_2, v_2) the subsets {x1, x3, x6, x7, x8} and {y1, y2, y4} are selected.
In this method the selected subsets of X and Y are chosen to interpret the two pairs of CVs separately. Combining the two subsets of X, and those of Y, in order to interpret both pairs of CVs simultaneously is not always successful. Often, especially when we increase the number of CVs of interest, we will end up retaining most of the variables, since the variables which have negligible weights in the first CVs will dominate the later CVs. Hence, two alternative selection methods based on the total redundancy concept may be preferred. The regressed approximations in the previous two sections help us determine the subsets X* and Y* of X and Y respectively to be used to interpret the first two pairs of CVs, by retaining only those subsets of X and Y which yield high correlations between each CV and its regressed approximation. Some of these recommended subsets of X and Y, together with those obtained from the canonical weights and loadings for the Jarvik smoking data set, are reported in Table VII. The key criterion for choosing which subsets of X and Y give a good simultaneous interpretation of their two associated CVs is the total redundancy of one set explained by the other set. The total redundancy of X accounted for by
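The subset-screening idea can be sketched in code. This is an illustrative reimplementation under assumptions (synthetic data merely shaped like the Jarvik set, n = 110, p = 8, q = 4; the function name is invented, and the eigen-approach follows the standard formulation of Section 2 rather than the authors' MINITAB macro): for each 3-variable subset Y* of Y, the total redundancy of X explained by the first two CVs of Y* is computed and the subsets are ranked.

```python
import numpy as np
from itertools import combinations

def total_redundancy_X(Xs, Ys, s=2):
    """Total redundancy of X explained by the first s CVs of the other set,
    built from the quantities defined in Section 2 (illustrative sketch)."""
    p = Xs.shape[1]
    Sxx = np.cov(Xs, rowvar=False)
    Syy = np.cov(Ys, rowvar=False)
    Sxy = np.cov(np.hstack([Xs, Ys]), rowvar=False)[:p, p:]
    M = np.linalg.solve(Sxx, Sxy) @ np.linalg.solve(Syy, Sxy.T)
    vals, vecs = np.linalg.eig(M)
    order = np.argsort(vals.real)[::-1][:min(s, Ys.shape[1])]
    tr = 0.0
    for k in order:
        a = vecs.real[:, k]
        a = a / np.sqrt(a @ Sxx @ a)          # unit-variance canonical variate
        loadings = Sxx @ a                     # intraset correlations of X
        tr += np.mean(loadings ** 2) * vals.real[k]   # V_{U_k} * r_k^2
    return tr

rng = np.random.default_rng(4)
n, p, q = 110, 8, 4
Z = rng.standard_normal((n, 2))
X = Z @ rng.standard_normal((2, p)) + rng.standard_normal((n, p))
Y = Z @ rng.standard_normal((2, q)) + rng.standard_normal((n, q))
Xs = (X - X.mean(0)) / X.std(0, ddof=1)
Ys = (Y - Y.mean(0)) / Y.std(0, ddof=1)

# Rank 3-variable subsets Y* of Y by how much of X they explain
# (cf. the comparisons in Table VIII).
ranking = sorted(
    ((total_redundancy_X(Xs, Ys[:, list(sub)]), sub)
     for sub in combinations(range(q), 3)),
    reverse=True)
for tr, sub in ranking:
    print(sub, round(tr, 3))
```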
TABLE VIII. The total redundancies of X explained by the CVs defined by different subsets Y* of Y

Y*              # Var.   TR_{X|V*}
{y2, y3, y4}      3       0.109
{y1, y3, y4}      3       0.088
{y1, y2, y3}      3       0.087
{y2, y3}          2       0.081
{y1, y2, y4}      3       0.067
{y2, y4}          2       0.060
the CVs of Y, and vice versa, could be considered as two different criteria for variable selection, since they may retain different sets of variables. Consider first the complete set X for Jarvik's data as the first set (p = 8) and several selected subsets Y* of Y, from Table VII, as the second set (q* = 1, 2, 3). Table VIII shows the total redundancy of X explained by the CVs of Y*. In Table VIII the subset {y1} is not included, since the total redundancy of X accounted for by y1 is much smaller than those presented. The total redundancy of X explained by the first two CVs of Y, denoted by TR_{X|V}, is equal to 0.114, which is small because v_1 and v_2 are not designed to maximize the variation of X accounted for. The best subset Y* is the subset of Y for which the total redundancy of X explained by the first two CVs, v_1* and v_2*, of Y*, TR_{X|V*}, is closest to the reference value 0.114; from Table VIII this is {y2, y3, y4}, followed by {y1, y3, y4}. Given the previously selected subset Y* of Y, we attempt to determine which subset of X could be recommended to represent X. The criterion used in specifying the best subset of variables of X to retain is the proportion of the total variance of X which is explained by the first two CVs, v_1* and v_2*, of Y* using the variables in X*. Since the variables have unit variance, this proportion is defined as follows:
TABLE IX. The total redundancies of X explained by the CVs defined by different subsets Y* of Y using subsets X* of X [table body not recovered]
TR^p_{X*|V*} = (p*/p) TR_{X*|V*},

where TR_{X*|V*} is the total redundancy of X* which is accounted for by v_1* and v_2*, and p and p* are the numbers of variables in the two sets X and X* respectively. These values will be compared with the total variance of X which is accounted for by v_1 and v_2, which are defined on all the variables in X. Thus, using the first two subsets of Table VIII as one set, Y*, and the subsets on the left-hand side of Table VII as the other set, X*, the values of TR^p_{X*|V*} are obtained. The variables in X* will be retained if the resultant value of TR^p_{X*|V*} in Table IX is relatively close to TR_{X|V*}. For example, consider the subset {x1, x2, x3, x4, x6, x7, x8} as the first set and the subset {y2, y3, y4} as the second set. The total redundancy of {x1, x2, x3, x4, x6, x7, x8} explained by v_1* and v_2* of {y2, y3, y4} is equal to 0.103. Thus the proportion of the total variance of X that is accounted for by v_1* and v_2*, written as linear combinations of {y2, y3, y4}, using the subset {x1, x2, x3, x4, x6, x7, x8} is 0.090, obtained by multiplying 0.103 (TR_{X*|V*}) by the ratio 7/8.
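The arithmetic of this worked example, as a one-line check:

```python
# Worked example: total redundancy of the 7-variable subset X* is 0.103;
# scaling by p*/p = 7/8 gives the proportion of the total variance of X.
TR_star = 0.103
p, p_star = 8, 7
TRp = TR_star * p_star / p
print(round(TRp, 3))   # 0.09
```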
TABLE X. The total redundancies of Y explained by the CVs defined by different subsets X* of X [table body not recovered]
In Table IX, the value of TR^p_{X*|V*} using {x1, x2, x3, x4, x6, x7, x8} is the best, followed by the subsets {x1, x3, x4, x5, x7} and {x1, x2, x3, x4, x5, x7}, for both of the chosen subsets of Y. If we compare the values of TR^p_{X*|V*} for the first two subsets in Table IX, we find that they are quite close. Thus it is better to select {x1, x3, x4, x5, x7} as the best subset of X because it consists of fewer variables (5 as compared to 7). Also, the values of TR^p_{X*|V*} using the subset {y1, y3, y4} are smaller than those for the subset {y2, y3, y4}. A likely reason for this is that {y2, y3, y4} is better at explaining variation in X than {y1, y3, y4}. There is no need to go through the other possible Y* in Table VIII because the values of TR^p_{X*|V*} become smaller and smaller; see Al-Kandari (1994).

The previous process can be repeated, reversing the roles of X and Y. The total redundancies of Y explained by the first two CVs, u_1* and u_2*, of X* are shown in Table X. Some of the selected subsets of X summarized in Table VII are not included in Table X, since their TR_{Y|U*} values are small compared with those summarized in Table X. The subset {x1, x2, x3, x4, x6, x7, x8} provides the closest value to TR_{Y|U} (0.101), followed by {x1, x2, x4, x6, x7, x8}. These two subsets consist of seven and six variables respectively, and their values of TR_{Y|U*} are 0.095 and 0.094
TABLE XI. The total redundancies of Y given by the CVs of X* using different subsets Y* of Y [table body not recovered]
respectively. However, it is almost as large (0.093 and 0.094) for the subsets {x1, x2, x4, x7, x8} and {x2, x4, x5, x7}. This indicates that we will not lose much information about the second set Y if the last subset, consisting of only four variables, is used. Table XI was constructed using each of the subsets {x1, x2, x3, x4, x6, x7, x8}, {x1, x2, x4, x6, x7, x8}, {x1, x2, x4, x7, x8} and {x2, x4, x5, x7} as the new first set, X*, and the summarized subsets of Y in Table VII as the second set, Y*. In this table the notation X*_i (where i = 7, 6, 5, 4) represents the subset just noted containing i variables. The value of TR^q_{Y*|U*} in Table XI can be obtained in a similar way to TR^p_{X*|V*}:

TR^q_{Y*|U*} = (q*/q) TR_{Y*|U*},

where TR_{Y*|U*} is the total redundancy in Y* accounted for by the first two CVs, u_1* and u_2*, of X*, and q and q* are the numbers of variables in the two sets Y and Y* respectively. We now determine which variables in Y to keep by calculating TR^q_{Y*|U*} and comparing it with TR_{Y|U*}, which is equal to 0.095, 0.094, 0.093 and 0.094 respectively, according to the specific X* which is considered as the first set. It is clear from Table XI that there is little difference in the values of TR^q_{Y*|U*} in the second and fifth columns, which encourages us to take the subset {x2, x4, x5, x7} as the first set.
Thus we could conclude that the researcher may select different subsets of X and Y depending on which criterion is considered to be most important. We have seen in our example that when using canonical weights for selection, the subsets {x1, x2, x4}, {y2, y3, y4} and {x1, x3, x6, x7, x8}, {y1, y2, y4} are the best subsets for approximating the first two pairs of CVs respectively. When considering the total redundancy of X explained by the first two CVs defined on subset Y* of Y using subset X* of X, the subsets {x1, x3, x4, x5, x7} and {y2, y3, y4} are chosen; see Table IX. Finally, the subsets {x2, x4, x5, x7} and {y1, y2, y4} are best if the total redundancy of Y explained by the first two CVs defined on subset X* of X using subset Y* of Y is considered; see Table XI.
6. DISCUSSION AND CONCLUSION
When attempting to interpret the results of a CCA using subsets of variables there are two considerations. First, we need to select subsets X* and Y*, and having done so we then need to decide how to use the subsets to interpret the corresponding CVs. The usual and most obvious approach to the second step is to use a truncated version of the CV based on the full set of variables, but there are other arguably better alternatives. As shown in Sections 3 and 4, two
methods of approximation, truncation and regression, can give significant differences in interpretation of the CVs. It was found in Sections 3 and 4 that the truncated approximations defined by the variables with high canonical weights provide a more valid interpretation of the first two CVs than those defined by the variables with high canonical loadings. This result is supported by Rencher (1995), who noted that the canonical weights indicate the contribution of the variables in the presence of each other, and that if some variables are deleted and others added, the coefficients will change. Thus the weights provide a pertinent multivariate approach to interpreting the contribution of the variables acting in combination. Conversely, Rencher (1988) argued that the canonical loadings should not be used to interpret CVs: the intraset correlations between an individual variable and the canonical variates are similarly redundant, because they merely show how the variable, on its own, relates to the other set of variables, so all information about how the variables in one set contribute jointly to canonical correlation with the other set is lost.

In Section 5, it was shown that there are at least three criteria which could be of interest when selecting X* and Y*. Firstly, the researcher might like to preserve only those variables in the two sets which have high canonical weights in the first two pairs of CVs. The two other possible quantities of interest are the overall measures of variance in one specific variable set accounted for by the variables of the other set. Which set of variables is chosen as predictors and which as predictands depends on which set of variables is important, or on the aim of studying the relationships between the two sets.
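The contrast between the truncated and regressed approximations of a canonical variate (Sections 3 and 4) can be sketched as follows. This is a hypothetical illustration with made-up weights, not the paper's analysis: truncation keeps the original weights of the retained variables and drops the rest, while the regression approximation refits the full canonical variate on the retained variables by least squares, so by construction its fit can never be worse than truncation's.

```python
import numpy as np

# Hypothetical standardized data and canonical weights for one CV.
rng = np.random.default_rng(1)
n, p = 150, 6
X = rng.standard_normal((n, p))
Xs = (X - X.mean(0)) / X.std(0, ddof=1)
a = np.array([0.9, 0.05, -0.8, 0.1, 0.6, -0.02])  # assumed CCA weights
u = Xs @ a                                         # the full canonical variate

# Retain only the variables with "large" weights (threshold is arbitrary).
keep = np.flatnonzero(np.abs(a) > 0.3)

# (i) Truncated approximation: original weights, discarded variables zeroed.
u_trunc = Xs[:, keep] @ a[keep]

# (ii) Regressed approximation: least-squares fit of u on the kept variables.
b, *_ = np.linalg.lstsq(Xs[:, keep], u, rcond=None)
u_reg = Xs[:, keep] @ b

def r2(u, v):
    """Squared correlation between a CV and its approximation."""
    return np.corrcoef(u, v)[0, 1] ** 2

print(r2(u, u_trunc), r2(u, u_reg))  # regression is never worse
```

Comparing the two squared correlations for each CV is one way of checking, as suggested below, whether an interpretation based on a truncated CV is trustworthy.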
Rencher (1992) argued that the total redundancy does not really quantify the redundancy between the two sets X and Y, since it is based on the intraset correlations, which are useless in gauging the importance of a given variable in the context of the others. He therefore claimed that total redundancy
is not a useful measure of association between two sets of variables. On the contrary, it was shown in this paper that total redundancy plays a role in variable selection and in interpretation of the CVs.

The criterion by which the subsets X* and Y* are selected depends on the area of interest to the researcher, and is affected by factors such as data type and structure, and which group of variables is more important to the researcher. For instance, if there is multicollinearity between the variables in a given data set, then the researcher should definitely avoid using canonical weights as the criterion because, as noted in Section 2, they tend to be unstable in the presence of multicollinearity. An extensive simulation study would be needed to understand when particular problems may arise. What we have seen here, and with other data sets, is that there is plenty of scope for problems, so any attempt at interpretation must be treated very cautiously, with the regression approximation examined routinely as a check.

Work is in progress on similar problems in PCA, using simulation studies to investigate, for a given structure of a data set, which criteria yield good subsets of variables to retain in order to aid the individual or simultaneous interpretation of the first two PCs. Similar studies could eventually be done for CCA. An obvious question is whether the choice of criterion makes much difference, or whether similar subsets are selected in each case. The analysis of the Jarvik smoking data set (Section 5) shows that the selected subsets of X and
Y may have little overlap. Conversely, it was found in the WPPSI data set that the selected subsets of X and Y obtained from the total redundancy of X accounted for by the variables of Y are similar to those obtained from the total redundancy of Y explained by the variables of X, but in a different order of importance; see Al-Kandari (1994).
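The instability of canonical weights under multicollinearity, noted above, is easy to demonstrate by simulation. The sketch below is a hedged stand-in rather than the paper's study: with a single response variable the first canonical variate is proportional to the least-squares predictor, so ordinary regression weights serve as a proxy for canonical weights; the data-generating setup is entirely hypothetical.

```python
import numpy as np

def fitted_weights(seed):
    """Regression weights for one simulated data set with two nearly
    collinear predictors (x1, x2) and one independent predictor (x3)."""
    rng = np.random.default_rng(seed)
    n = 100
    z = rng.standard_normal(n)
    x1 = z + 0.05 * rng.standard_normal(n)   # x1 and x2 are almost copies of z
    x2 = z + 0.05 * rng.standard_normal(n)
    x3 = rng.standard_normal(n)
    X = np.column_stack([x1, x2, x3])
    y = z + x3 + rng.standard_normal(n)
    b, *_ = np.linalg.lstsq(X - X.mean(0), y - y.mean(), rcond=None)
    return b

# Refit on 50 replicate data sets and examine the spread of each weight.
W = np.array([fitted_weights(s) for s in range(50)])
spread = W.std(axis=0)
print(spread)  # the weights on the collinear pair vary far more than on x3
```

The individual weights on the collinear pair swing wildly from sample to sample even though their sum, and the fitted variate itself, are stable, which is exactly why weight-based selection is risky under multicollinearity.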
In general, it can be concluded that the canonical variates can be adequately approximated by fewer than p + q variables using the truncated and
regressed approximations. The selection of the variables from the two sets X and
Y in order to interpret the canonical variates may be different depending on the criterion of greatest interest.
ACKNOWLEDGMENTS
We are grateful to the University of Kuwait for their financial support of Noriah Al-Kandari during the period in which this research was done, to Trevor Ringrose for supplying Jarvik's data set and to Jorge Cadima for some helpful comments.
BIBLIOGRAPHY

Al-Kandari, N. (1994). Variable Selection and Interpretation in Canonical Correlation Analysis. Unpublished M.Sc. dissertation, University of Aberdeen.

Anderson, T. W. (1958). An Introduction to Multivariate Statistical Analysis. John Wiley and Sons, Inc., New York, London and Sydney.

Cadima, J. F. C. L., & Jolliffe, I. T. (1995). "Loadings and Correlations in the Interpretation of Principal Components", J. Appl. Statist., 22(2), 203-214.

Gittins, R. (1985). Canonical Analysis: A Review with Applications in Ecology. Springer-Verlag, Berlin, Heidelberg, New York.

Horst, P. (1961). "Relations Among m Sets of Measures", Psychometrika, 26(2), 129-149.

Hotelling, H. (1935). "The most predictable criterion", J. Educ. Psychol., 26, 139-142.

Hotelling, H. (1936). "Relations between two sets of variates", Biometrika, 28, 321-377.

Lancaster, H. O. (1958). "The structure of bivariate distributions", Ann. Math. Statist., 29, 719-736.
Lancaster, H. O. (1966). "Kolmogorov's remark on the Hotelling canonical correlations", Biometrika, 53, 585-588.

Rencher, A. C. (1988). "On the Use of Correlations to Interpret Canonical Functions", Biometrika, 75, 363-365.

Rencher, A. C. (1992). "Interpretation of Canonical Discriminant Functions, Canonical Variates, and Principal Components", Amer. Statistician, 46, 217-225.

Rencher, A. C. (1995). Methods of Multivariate Analysis. John Wiley & Sons, Inc., New York.

Stewart, D., & Love, W. (1968). "A General Canonical Correlation Index", Psychol. Bull., 70(3), 160-163.
Yule, W., Berger, M., Butler, S., Newham, V., & Tizard, J. (1967). "The WPPSI: An Empirical Evaluation with a British Sample", Brit. J. Educ. Psychol., 39, 1-13.
Received October, 1996; Revised December, 1996.