Computational Statistics & Data Analysis 20 (1995) 643-656
Refined approximations to permutation tests for multivariate inference
Frédérique Kazi-Aoual a,*, Simon Hitier b, Robert Sabatier c, Jean-Dominique Lebreton b
a Unité de Biométrie, INRA/ENSA.M/UM II, 9 Place Viala, 34060 Montpellier cedex 1, France
b Centre d'Ecologie Fonctionnelle et Evolutive, CNRS, BP 5051, 34033 Montpellier cedex 1, France
c Laboratoire de Physique Moléculaire et Structurale, Faculté de Pharmacie, Université Montpellier I, 34060 Montpellier cedex 1, France
Received 1 April 1994; Revised 1 September 1994
Abstract
Various authors have proposed approximations to permutation tests of independence between two data tables. We develop approximations based on explicit expressions of the first three moments of three different test statistics under the permutation distribution. The rejection level is then determined by using a Pearson type III distribution matching the values of the first three moments. We present three examples in which the relative merits of the test statistics are examined and the results of the approximation procedure are compared with explicit permutation tests.
Keywords: Permutation tests; Exact permutation tests; Monte-Carlo methods; Multivariate inference; Randomization tests
* Corresponding author. 0167-9473/95/$9.50 © 1995 Elsevier Science B.V. All rights reserved. SSDI 0167-9473(94)00064-6
1. Introduction
Multivariate inference procedures, such as the one-way multivariate analysis of variance (MANOVA1; e.g., Krishnaiah, 1980), have been developed under the assumption of Gaussian distributions (e.g., Giri, 1977). This framework is frequently restrictive in biological applications. To avoid it, many authors have used explicit permutation tests, made possible by the availability of fast computers (see reviews by Edington, 1987; Manly, 1991). In this approach, to test for the independence of a table Y and a table X, one gets the observed distribution of a test statistic
over a random subset of the permutations of the rows of Y. The null hypothesis is that the multivariate distribution of the variables of Y is permutable over the statistical units. The probability of rejection is the proportion of permutations that lead to a statistic value greater than the observed one. Its exact value is obtained only when all permutations are enumerated ("exact permutation tests"), which is impractical even for a small number of statistical units ($10! \approx 3.63 \times 10^{6}$). In practice, the probability of rejection is estimated from a pseudo-random subsample of permutations. An advantage of this approach, called "explicit permutation tests" for short, is that it can be implemented using statistics associated with descriptive multivariate methods (e.g., Ter Braak, 1987a). Its main shortcoming is its cost in computer time, especially if one takes care to sample enough permutations to estimate the probability of rejection precisely.
In a neglected paper, Mardia (1971), following Box and Watson (1962), gives analytic expressions for the first two moments of the trace statistic used in canonical correlation analysis under the permutation distribution. Mardia then proposes modifying the approximate beta distribution of this statistic under the Gaussian hypothesis to match these first two moments. A similar approach was also used by Mielke (1978), Mielke et al. (1976) and Raz and Fein (1992). One may speak of "approximate permutation tests".
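As an illustration of the explicit permutation test described above, the following Python sketch estimates the probability of rejection from a pseudo-random subsample of row permutations of Y. The function names `cross_trace` and `permutation_p_value`, the default of 999 permutations, and the convention of counting the observed arrangement in both numerator and denominator are assumptions made for this example, not choices stated in the text.

```python
import numpy as np

def cross_trace(Y, X):
    """Hypothetical example statistic: tr(S12 S21) with columns centered."""
    n = Y.shape[0]
    Yc = Y - Y.mean(axis=0)
    Xc = X - X.mean(axis=0)
    S12 = Yc.T @ Xc / n
    # tr(S12 S21) equals the sum of squared entries of S12
    return float(np.sum(S12 ** 2))

def permutation_p_value(Y, X, statistic, n_perm=999, seed=0):
    """Estimate the rejection probability by permuting the rows of Y."""
    rng = np.random.default_rng(seed)
    n = Y.shape[0]
    observed = statistic(Y, X)
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(n)          # pseudo-random permutation of the units
        if statistic(Y[perm], X) >= observed:
            count += 1
    # Assumed convention: the observed arrangement counts as one permutation
    return (count + 1) / (n_perm + 1)

if __name__ == "__main__":
    rng = np.random.default_rng(123)
    X = rng.normal(size=(30, 2))
    Y = X @ np.array([[1.0, 0.5], [0.0, 2.0]]) + rng.normal(size=(30, 2))
    print(permutation_p_value(Y, X, cross_trace))
```

The cost grows linearly with the number of sampled permutations, which is the shortcoming that motivates moment-based approximations.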
The purpose of this paper is to refine existing approximations to permutation tests of independence in two directions: (1) by proposing two alternative statistics to test for the independence of Y and X, especially suitable when a low number of statistical units makes the traditional statistics based on canonical analysis inefficient (Kazi-Aoual et al., 1992); (2) by using explicit expressions of the first three moments of these statistics under the permutation distribution (Kazi-Aoual, 1993), in order to approximate more precisely the permutation distributions of these test statistics. We compare our results with previous approaches to approximate permutation tests and with explicit permutation tests, on the basis of three examples of biological data.
2. Statistical measures of relationship between two tables
We consider two data tables Y and X with, respectively, p and q columns, over the same n statistical units. Y and X are considered as centered by columns. The natural geometric reduction of Y and/or X would then be by principal component analysis (PCA), usually after standardization. We use the following notation:
$$S_{11} = \frac{1}{n} Y'Y, \qquad S_{12} = \frac{1}{n} Y'X, \qquad S_{21} = S_{12}' = \frac{1}{n} X'Y, \qquad S_{22} = \frac{1}{n} X'X$$
(one may equivalently use $1/(n-1)$); $P_X = X(X'X)^{-}X'$ is the projection operator onto $M(X)$, the range of X; $\mathrm{tr}(M)$ is the trace of a matrix M. For the sake of generality, we use generalized inverses throughout.
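The notation above can be sketched in Python as follows. The helper names `center`, `cov_blocks` and `projector` are hypothetical, and the Moore-Penrose pseudo-inverse from NumPy is used as one particular generalized inverse (an assumption of this sketch; the projector itself does not depend on which generalized inverse is chosen).

```python
import numpy as np

def center(M):
    """Center the columns of a data table."""
    return M - M.mean(axis=0, keepdims=True)

def cov_blocks(Y, X):
    """Return S11, S12, S21, S22 with the 1/n convention."""
    n = Y.shape[0]
    Yc, Xc = center(Y), center(X)
    S11 = Yc.T @ Yc / n
    S12 = Yc.T @ Xc / n
    S21 = S12.T
    S22 = Xc.T @ Xc / n
    return S11, S12, S21, S22

def projector(X):
    """P_X = X (X'X)^- X', with the pseudo-inverse as generalized inverse."""
    Xc = center(X)
    # rcond is set explicitly so that numerically zero singular values are dropped
    return Xc @ np.linalg.pinv(Xc.T @ Xc, rcond=1e-10) @ Xc.T

if __name__ == "__main__":
    # Indicator columns of a 3-level categorical variable: X'X is singular
    X = np.kron(np.eye(3), np.ones((3, 1)))
    P = projector(X)
    print(np.allclose(P @ P, P), np.trace(P))
```

A table of indicator variables is a case where $X'X$ is singular after centering, so an ordinary inverse would fail while the generalized inverse still yields a valid projector.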
There are at least three natural ways of linking Y and X, which lead to three different trace statistics.
2.1. Canonical correlation analysis (CCA)
CCA reduces to diagonalizing the matrix $S_{11}^{-} S_{12} S_{22}^{-} S_{21}$ (e.g., Gittins, 1979). Among various statistics for testing the independence of Y and X, one may use $S_{CCA} = \mathrm{tr}(S_{11}^{-} S_{12} S_{22}^{-} S_{21})$ (Pillai, 1955). If X is made up of a single categorical variable, $\mathrm{tr}(S_{11}^{-} S_{12} S_{22}^{-} S_{21})$ is the trace statistic of MANOVA1. Classical multivariate inference uses the distribution of $S_{CCA}$ under the hypothesis that Y and X have independent Gaussian distributions, or, conditional on X, under a linear model $Y = X\beta + \varepsilon$, with $\varepsilon$ Gaussian. The distribution of $S_{CCA}$ is known only approximately or asymptotically, except for small values of p and q (Pillai and Jayachandran, 1970). Its skewness coefficient $\gamma$ is asymptotically 0.
2.2. Principal component analysis with respect to instrumental variables (PCAIV)
PCA with respect to instrumental variables (Rao, 1964; Sabatier et al., 1989), or redundancy analysis (Wollenberg, 1977), consists of PCA($P_X(Y)$), i.e. of diagonalizing $\frac{1}{n} P_X(Y)' P_X(Y) = S_{12} S_{22}^{-} S_{21}$. One may use as a test statistic $\mathrm{tr}(S_{12} S_{22}^{-} S_{21})$ or, equivalently, $S_{PCAIV} = \mathrm{tr}(S_{12} S_{22}^{-} S_{21}) / \mathrm{tr}(S_{11})$ (Stewart and Love, 1968).
2.3. Inter-battery analysis (IBA)
Inter-battery analysis consists of PCA($(1/n) X'Y$) (Tucker, 1958; Chessel and Mercier, 1993), i.e. of diagonalizing $(1/n^{2})\, Y'XX'Y = S_{12} S_{21}$. One may use as a test statistic $\mathrm{tr}(S_{12} S_{21})$ or, equivalently, $S_{IBA} = \mathrm{tr}(S_{12} S_{21}) / (\mathrm{tr}(S_{11})\, \mathrm{tr}(S_{22}))$ or, still, the RV coefficient $RV = \mathrm{tr}(S_{12} S_{21}) / \sqrt{\mathrm{tr}(S_{11}^{2})\, \mathrm{tr}(S_{22}^{2})}$ (Escoufier, 1973).
When $p = q = 1$, the statistics above reduce to $r^{2}$, the squared correlation coefficient. Further measures of the relationship between Y and X are reviewed by Lazraq (1988). He also provided the exact distributions of $S_{PCAIV}$ and RV under the Gaussian setting. Although exact, these distributions are quite involved and difficult to use in practice.
Mardia (1971) obtained the expectation $E_p(S_{CCA})$ and the variance $V_p(S_{CCA})$ under the permutation distribution. He corrected the approximate beta distribution of $S_{CCA}/\min(p, q)$ under the Gaussian setting by multiplying its two parameters by the same constant. The expectation is left unchanged while the variance is closer to $V_p$ than in the original beta distribution. The resulting distribution compares favorably with samples of the permutation distribution. To our knowledge, no approximate permutation test has been proposed for the other two statistics.
Many papers dealing with approximate permutation tests have appeared during the past ten years. For instance, Mielke (1978, 1979), Mielke et al. (1976) and
Biondini et al. (1988) proposed approximate permutation tests with a correction for skewness for multi-response permutation procedures (MRPP). One should note that, in the context of permutations, there is no difference between considering X as random and reasoning conditional on X.
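The trace statistics of this section can be computed with a short Python sketch such as the one below. The function name `trace_statistics` and the returned dictionary keys are hypothetical, and the Moore-Penrose pseudo-inverse stands in for the generalized inverses (an assumption of this example).

```python
import numpy as np

def trace_statistics(Y, X):
    """Return S_CCA, S_PCAIV, S_IBA and RV for two tables on the same n units."""
    n = Y.shape[0]
    Yc = Y - Y.mean(axis=0)
    Xc = X - X.mean(axis=0)
    S11 = Yc.T @ Yc / n
    S12 = Yc.T @ Xc / n
    S21 = S12.T
    S22 = Xc.T @ Xc / n
    # Pseudo-inverses used as generalized inverses (assumption of this sketch)
    S11_g = np.linalg.pinv(S11, rcond=1e-10)
    S22_g = np.linalg.pinv(S22, rcond=1e-10)
    cross = np.trace(S12 @ S21)
    return {
        "CCA": float(np.trace(S11_g @ S12 @ S22_g @ S21)),
        "PCAIV": float(np.trace(S12 @ S22_g @ S21) / np.trace(S11)),
        "IBA": float(cross / (np.trace(S11) * np.trace(S22))),
        "RV": float(cross / np.sqrt(np.trace(S11 @ S11) * np.trace(S22 @ S22))),
    }

if __name__ == "__main__":
    rng = np.random.default_rng(7)
    x = rng.normal(size=(20, 1))
    y = 0.8 * x + rng.normal(size=(20, 1))
    r2 = np.corrcoef(x[:, 0], y[:, 0])[0, 1] ** 2
    # With p = q = 1, every entry should coincide with the squared correlation
    print(trace_statistics(y, x), r2)
```

Running the example with single-column tables gives a quick check that the statistics agree with $r^{2}$ when $p = q = 1$.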
3. Explicit expressions of the first three moments
The moments of $S_{CCA}$, $S_{PCAIV}$ and $S_{IBA}$ (or RV, which is an equivalent statistic) can be included in a general formulation.
3.1. General results
Let A be an n by n symmetric matrix with $A\mathbf{1} = 0$ ($\mathbf{1}$ denotes the n by 1 vector of ones), and Y an n by p matrix, centered by column ($Y'\mathbf{1} = 0$), with $W = YY'$. The test statistic is $T = \mathrm{tr}(AW)$. We study the distribution of T under the hypothesis that the distribution of Y is permutable over the statistical units: the $n!$ values of T obtained by permutation of the rows of Y are assumed to be equally probable.
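For very small n, the permutation distribution of $T = \mathrm{tr}(AW)$ defined above can be enumerated by brute force, as in the following Python sketch. The choice of A as the projector onto a centered table X is only an illustration (it is symmetric and satisfies $A\mathbf{1} = 0$), and the closed form $\mathrm{tr}(A)\,\mathrm{tr}(W)/(n-1)$ used for comparison is a standard identity under these two conditions, quoted here as an assumption rather than taken from the text above.

```python
import itertools
import numpy as np

def exact_permutation_values(A, Y):
    """Return the n! values of T = tr(A Y_s Y_s') over all row permutations s."""
    n = Y.shape[0]
    values = []
    for perm in itertools.permutations(range(n)):
        Yp = Y[list(perm)]                 # rows of Y reordered by the permutation
        values.append(np.trace(A @ Yp @ Yp.T))
    return np.array(values)

rng = np.random.default_rng(1)
n, p, q = 6, 2, 2                           # 6! = 720 permutations, cheap to enumerate
Y = rng.normal(size=(n, p))
Y -= Y.mean(axis=0)                         # Y'1 = 0
X = rng.normal(size=(n, q))
X -= X.mean(axis=0)
# Illustrative A: projector onto the range of centered X, so A is symmetric and A1 = 0
A = X @ np.linalg.pinv(X.T @ X) @ X.T

T_values = exact_permutation_values(A, Y)
exact_mean = T_values.mean()                # each permutation has probability 1/n!
closed_form = np.trace(A) * np.trace(Y @ Y.T) / (n - 1)
print(exact_mean, closed_form)
```

Enumeration becomes infeasible beyond roughly ten units, which is why analytic moments are of interest.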
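Notation: $Y' = (y_1, y_2, \ldots, y_n)$; $S_n$ is the set of permutations of the indices $(1, 2, \ldots, n)$; $W = YY' = (y_i' y_j)$. The expectation of $T = \mathrm{tr}(AW)$ is
$$E_p(\mathrm{tr}\,AW) = \frac{1}{n!} \sum_{\sigma \in S_n} \sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij}\, y_{\sigma(i)}' y_{\sigma(j)} = \frac{1}{n!} \left[ \sum_{i=1}^{n} a_{ii} \sum_{\sigma \in S_n} y_{\sigma(i)}' y_{\sigma(i)} + \sum_{i=1}^{n} \sum_{\substack{j=1 \\ j \neq i}}^{n} a_{ij} \sum_{\sigma \in S_n} y_{\sigma(i)}' y_{\sigma(j)} \right].$$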
As
E
Y',