Psychological Methods 2010, Vol. 15, No. 3, 268 –280
© 2010 American Psychological Association 1082-989X/10/$12.00 DOI: 10.1037/a0017737
Testing Multiple Outcomes in Repeated Measures Designs

Lisa M. Lix and Tolulope Sajobi
University of Saskatchewan

This study investigates procedures for controlling the familywise error rate (FWR) when testing hypotheses about multiple, correlated outcome variables in repeated measures (RM) designs. A content analysis of RM research articles published in 4 psychology journals revealed that 3 quarters of studies tested hypotheses about 2 or more outcome variables. Several procedures originally proposed for testing multiple outcomes in 2-group designs are extended to 2-group RM designs. The investigated procedures include 2 modified Bonferroni procedures that adjust the level of significance, $\alpha$, for the effective number of outcomes and a permutation step-down (PSD) procedure. The FWR, any-variable power, and all-variable power are investigated in a Monte Carlo study. One modified Bonferroni procedure frequently resulted in inflated FWRs, whereas the PSD procedure controlled the FWR. The PSD procedure could be substantially more powerful than the conventional Bonferroni procedure, which does not account for dependencies among the outcome variables. However, the difference in power between the PSD procedure, which does account for these dependencies, and Hochberg's step-up procedure, which does not, was negligible. A numeric example illustrates implementation of these multiple-testing procedures.

Keywords: familywise error rate, multiple testing, doubly multivariate data, correlation, resampling
Many studies in psychology compare two or more groups on multiple outcome variables. For example, the literature about the use of psychotherapy for the treatment of mental health disorders in youths, which was recently reviewed by Weisz, Doss, and Hawley (2005), abounds with studies in which participants exposed to child-, parent-, family-, or teacher-focused interventions and participants randomized to a placebo condition, standard case management protocol, or alternate treatment were compared on multiple measures of patient symptoms, behaviors, function, and/or psychosocial responses (Keefe, Dunsmore, & Burnett, 1992; Weisz et al., 2005). J. M. Holland, Neimeyer, Currier, and Berman (2007) noted, in their review of outcome studies about personal construct therapy, that tests may be conducted on eight or more variables in a single study.

Consider a study in which one or more groups of study participants provide data on two or more outcome variables at each of two or more measurement occasions. Such a design is known as a multivariate repeated measures (RM) design or doubly multivariate design. One example is provided by Weisel and Tur-Kaspa (2002). In their study of the effects of labeling and personal contact on teachers' attitudes toward low-achieving students, teachers were randomly assigned to conditions in which they did or did not have contact with low-achieving students. Teachers' attitudes toward each of two stimuli were then assessed. The first stimuli condition was a student in a special education classroom, and the second stimuli condition was a low-achieving student in a regular classroom. Teachers provided data about their cognitive, emotional, and behavioral attitudes toward the stimuli. The researchers tested the Condition × Stimuli interaction, as well as the condition and stimuli main effects for each outcome variable, performing the tests at the $\alpha_m = .05$ ($m = 1, \ldots, L$; $L$ is the number of tests) level of significance.

In studies where researchers conduct tests on multiple outcome variables, the familywise error rate (FWR), the probability of making at least one Type I error, can increase dramatically if each test is conducted at $\alpha_m = .05$. Assuming the tests are independent, FWR $= 1 - (1 - \alpha_m)^L$. For example, when five tests are performed, FWR $= 0.23$. One well-known solution to control the FWR to $\alpha$ is the Bonferroni procedure (Dunn, 1961), in which each test is conducted at the $\alpha/L$ level of significance. However, this multiple-testing procedure may result in low power to detect treatment effects, particularly when the tests are not independent (Sankoh, Huque, & Dubey, 1997), as is often the case when multiple outcome variables are investigated. Multiple-testing procedures that account for dependencies (i.e., correlations) among the tests are more powerful than procedures that do not account for the dependencies (Chen, Sakoda, Hsing, & Rosenberg, 2006; Troendle, 1995). However, these multiple-testing procedures have been developed only for multivariate designs in which data are collected once and not for doubly multivariate designs.

A doubly multivariate design contains two sources of correlation: (a) within-variable correlation, which arises because the RMs for each outcome variable are correlated, and (b) between-variable correlation, which arises because the responses on the outcome variables at each measurement occasion are correlated. Both sources of correlation should be taken into
Lisa M. Lix, School of Public Health, University of Saskatchewan; Tolulope Sajobi, Department of Mathematics and Statistics, University of Saskatchewan, Saskatoon, Saskatchewan, Canada. This research was supported by an operating grant from the Manitoba Health Research Council and a New Investigator Award from the Canadian Institutes of Health Research to Lisa M. Lix. Correspondence concerning this article should be addressed to Lisa M. Lix, School of Public Health, University of Saskatchewan, 107 Wiggins Road, Saskatoon, Saskatchewan, Canada S7N 5E5. E-mail: lisa.lix@usask.ca
MULTIPLE-TESTING PROCEDURES FOR MULTIPLE OUTCOMES
account to obtain a test that is powerful to detect treatment effects across measurement occasions on multiple outcome variables. This research investigates procedures for controlling the FWR when testing hypotheses about multiple, correlated outcome variables in RM designs. Such procedures have been extensively investigated in the clinical trials literature (Zhang, Quan, Ng, & Stepanavage, 1997), where concerns about false positive results in the evaluation of treatment efficacy are significant. Wilkinson and the Task Force on Statistical Inference (1999) observed that “in many areas of psychology, we cannot do research on important problems without encountering multiplicity. We often encounter many variables and many relationships” (p. 599). Their guidelines state,

“Multiple outcomes require special handling. There are many ways to conduct reasonable inference when faced with multiplicity (e.g., Bonferroni correction of p values, multivariate test statistics, empirical Bayes methods). It is your responsibility to define and justify the methods used.” (p. 599)
As the next section reveals, a variety of justifications exist for choosing to adopt or not to adopt a multiple-testing procedure for multiple outcomes. The decision is largely a function of one's research objectives, which should be clearly defined at the outset of the study. In this article, we first describe a content analysis that investigates the frequency of doubly multivariate designs in published psychology research, the frequency of adoption of multiple-testing procedures in the analysis of the outcome variables from these designs, and reasons underlying their adoption and nonadoption. Then, we describe and investigate several procedures for controlling the FWR when testing multiple outcomes in two-group doubly multivariate designs. The procedures are first described for the case of data collected at a single measurement occasion and are then extended to the doubly multivariate case. A Monte Carlo study is conducted to evaluate the FWR and statistical power of these multiple-testing procedures. A numeric example is provided, and the use of existing software to implement the investigated procedures is described. The “Discussion and Conclusions” section describes some considerations when selecting a multiple-testing procedure for correlated outcomes and alternative approaches for the analysis of doubly multivariate data.
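The inflation of the familywise error rate noted in the introduction, FWR $= 1 - (1 - \alpha_m)^L$ for $L$ independent tests, can be checked numerically. The following is a minimal sketch in Python; the function name is hypothetical and is not part of the article.

```python
def familywise_error_rate(alpha_m: float, n_tests: int) -> float:
    """FWR for n_tests independent tests, each conducted at level alpha_m."""
    return 1.0 - (1.0 - alpha_m) ** n_tests


# Five independent tests, each conducted at .05 (the example in the text).
unadjusted = familywise_error_rate(0.05, 5)

# The same five tests, each conducted at the Bonferroni level alpha / L.
bonferroni = familywise_error_rate(0.05 / 5, 5)
```

Under independence, the unadjusted value rounds to the 0.23 reported in the text, whereas the Bonferroni level keeps the FWR just under .05.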
Content Analysis of Multiple Testing in RM Designs

We reviewed empirical articles published in a select number of American Psychological Association journals: Developmental Psychology, Journal of Consulting and Clinical Psychology, Health Psychology, and Journal of Applied Psychology. The journals represent different psychology subject areas. The review encompassed all research articles published in regular issues of each journal in 2008. Meta-analyses, archival research, reviews, methodological articles (e.g., Monte Carlo studies), and articles published in special supplements were excluded. When an article reported results for more than one study, only the first study that used an RM design was included in the content analysis.

A total of 404 articles met the study inclusion criteria (Table 1). Of this number, 182 (45%) used an RM design. The majority of these (65%) used a mixed RM design, which contains both grouping (i.e., between-subjects) and RM factors. Overall, only one quarter of the RM studies tested hypotheses on a single outcome variable. In many of these single-outcome studies, the investigated variable was an objective measure of function or performance, such as heart rate, cortisol level, or number of errors in completing a task. However, self-report outcome variables were also investigated in several studies; one example is psychological well-being.

The remaining articles used a doubly multivariate design, that is, an RM design in which data on two or more outcome variables were collected at each measurement occasion. One third of the articles used an RM design with either two or three outcome variables. Another 10% included RM designs with nine or more outcome variables; the maximum number of variables investigated in a single study was 21. In some of the studies, the outcome variables represented instrument subscales, such as the physical and mental health components of the Short Form 36 (Ware, Snow, & Kosinski, 1993).
In other studies, the outcome variables represented measures used to evaluate different facets of an intervention, such as severity of pain or personal well-being. Of the RM studies that investigated multiple outcomes, only 11 (6%) specifically addressed the issue of FWR control when testing multiple outcomes. Two articles reported use of a multivariate test procedure, such as a procedure based on the multivariate mixed model (Boik, 1991), to simultaneously test hypotheses for the set of outcome variables. In both articles, a statistically significant
Table 1
Results of Content Analysis About Multiple Testing in Repeated Measures Designs for Four Journals Published by the American Psychological Association

                                       DP          JCCP          HP           JAP         Total
                                   (n = 123)     (n = 98)     (n = 93)     (n = 90)     (N = 404)
Article type                         N    %       N    %       N    %       N    %       N    %
Had an RM design                    66   54      53   54      37   40      26   29     182   45
Tested a single outcome variable    23   35      11   21       8   22       4   15      46   25
Tested 2–3 outcome variables        21   32      11   21      12   32      16   62      60   33
Tested 4–8 outcome variables        20   30      19   36      14   38       4   15      57   31
Tested 9+ outcome variables          2    3      12   23       3    8       2    8      19   10

Note. DP = Developmental Psychology; JCCP = Journal of Consulting and Clinical Psychology; HP = Health Psychology; JAP = Journal of Applied Psychology; RM = repeated measures.
LIX AND SAJOBI
multivariate test was followed with separate tests for each outcome variable. In five articles, the FWR was controlled in multiple testing by adopting the Bonferroni procedure. For example, Whittal, Robichaud, Thordarson, and McLean (2008) used this procedure to control the FWR in their study, which involved testing one primary outcome variable and five secondary outcome variables. Each of the secondary outcome variables was tested at the $\alpha_m = .01$ level of significance “to achieve a balance between protection for Type I and Type II error” (p. 1006). In the remaining four studies, the authors acknowledged that testing multiple outcomes might result in an inflated FWR, but they provided a justification for not adopting a multiple-testing procedure. For example, McCain et al. (2008) conducted a study in which nine outcome variables were investigated. Six were designated as primary outcomes and the remaining three were designated as secondary outcomes:

“A single primary response variable represented each of the six primary hypotheses. Because each of these hypotheses represented distinct domains within the PNI [psychoneuroimmunology] framework, no adjustment for multiplicity was made for the overall tests of the six primary variables. . . . However, statistical validity decisions allowed these secondary variables to be analyzed only when the primary variable had an overall significance of $p \le .05$. Given this representation of distinct conceptual domains, no further adjustment for multiplicity was made.” (p. 436)
Procedures for Testing Multiple Outcomes in Two-Group Designs

For the case of a single measurement occasion, define the null hypothesis, $H_{0m}$: $\mu_{1m} - \mu_{2m} = 0$, and the alternative hypothesis, $H_{Am}$: $\mu_{1m} - \mu_{2m} \neq 0$, where $\mu_{1m}$ is the population mean for the treatment group on the $m$th outcome variable and $\mu_{2m}$ is the population mean for the control group on the $m$th outcome. Let $p_m$ denote the p value associated with the statistic (i.e., $t_m$) used to test the null hypothesis. Assuming the data satisfy the assumptions of normality and homogeneity of group variances, Student's t statistic provides a valid test of $H_{0m}$, that is,

\[ t_m = \frac{\left| \bar{Y}_{1m} - \bar{Y}_{2m} \right|}{\sqrt{s_m^2 \left( \dfrac{1}{n_1} + \dfrac{1}{n_2} \right)}}, \tag{1} \]

where $\bar{Y}_{jm}$ is the mean for the $j$th group on the $m$th outcome,

\[ s_m^2 = \frac{(n_1 - 1)s_{1m}^2 + (n_2 - 1)s_{2m}^2}{N - 2}. \tag{2} \]

In Equation 2, $s_{jm}^2$ is the variance for the $j$th group on the $m$th outcome, and $n_j$ is the sample size for the $j$th group ($\sum_{j=1}^{2} n_j = N$).
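The statistic in Equations 1 and 2 can be computed for all outcomes at once. The sketch below assumes NumPy and two arrays with one row per participant and one column per outcome; the function name and the small data set are hypothetical illustrations, not values from the article.

```python
import numpy as np


def pooled_abs_t(y1, y2):
    """Absolute pooled-variance t statistic (Equations 1 and 2), one per column."""
    y1 = np.asarray(y1, dtype=float)
    y2 = np.asarray(y2, dtype=float)
    n1, n2 = y1.shape[0], y2.shape[0]
    # Equation 2: pooled variance for each outcome, with N - 2 = n1 + n2 - 2.
    s2 = ((n1 - 1) * y1.var(axis=0, ddof=1)
          + (n2 - 1) * y2.var(axis=0, ddof=1)) / (n1 + n2 - 2)
    # Equation 1: absolute mean difference over its standard error.
    return np.abs(y1.mean(axis=0) - y2.mean(axis=0)) / np.sqrt(s2 * (1.0 / n1 + 1.0 / n2))


# Hypothetical data: 3 participants per group, 2 outcome variables.
treatment = np.array([[1.0, 5.0], [2.0, 5.5], [3.0, 6.0]])
control = np.array([[2.0, 5.0], [4.0, 5.5], [6.0, 6.0]])
t = pooled_abs_t(treatment, control)
```

The second outcome has identical group means, so its statistic is zero; the first outcome gives the nonzero value that follows from the two formulas.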
Multiple-Testing Procedures When the $H_{0m}$s Are Independent

Although many multiple-testing procedures have been proposed to control the FWR to $\alpha$ when the tests are independent, we focus on the Bonferroni procedure and its modifications because these procedures are straightforward to implement and are popular choices in many disciplines (Ramsey, 1982; Sankoh et al., 1997). The Bonferroni procedure compares $p_m$ to $\alpha_m = \alpha/L$ or, in other words, compares the adjusted p value, $\tilde{p}_m = Lp_m$, to $\alpha$. If $\tilde{p}_m > \alpha$, then $H_{0m}$ is retained; otherwise, it is rejected.

More powerful alternatives to the single-step Bonferroni procedure include stepwise procedures proposed by B. S. Holland and Copenhaver (1987), Hochberg (1988), and Rom (1990). We consider the Hochberg procedure in this article. It begins by rank ordering the p values so that $p_{(1)} \le p_{(2)} \le \ldots \le p_{(L)}$, where $p_{(m)}$ denotes the $m$th rank-ordered p value. The corresponding null hypotheses are denoted $H_{0(1)}, H_{0(2)}, \ldots, H_{0(L)}$. Testing begins with the largest p value, $p_{(L)}$. If $p_{(L)} \le \alpha$, all $L$ hypotheses are rejected; if not, $H_{0(L)}$ is retained and testing moves to $H_{0(L-1)}$. If $p_{(L-1)} \le \alpha/2$, $H_{0(L-1)}$ is rejected, as are all remaining hypotheses; if not, $H_{0(L-1)}$ is retained. This continues, if all previous hypotheses have been retained, until $p_{(1)}$ is compared to $\alpha/L$. This procedure can also be described with adjusted p values; at the $m$th stage of testing, define $\tilde{p}_{(m)} = (L - m + 1)p_{(m)}$.

Multiple-Testing Procedures When the $H_{0m}$s Are Not Independent

Three categories of multiple-testing procedures have been proposed to account for dependencies among a set of outcome variables: (a) procedures that modify the divisor of the Bonferroni procedure, (b) procedures that assume a joint normal distribution of the test statistics, and (c) procedures that use an empirical method to approximate the joint distribution of the test statistics. In this article, we focus on procedures in the first and third categories, which, unlike procedures in the second category, make no assumptions about the distribution of the data and do not assume prior knowledge about the correlations among the outcome variables.

One approach, proposed by Armitage and Parmar (1986), is to modify the Bonferroni procedure by using $L'_m$ instead of $L$, where $L'_m$ is the effective number of outcomes ($1 \le L'_m \le L$). Dubey (1994) recommended $L'_m = L^{w_m}$, where $w_m = 1 - \bar{r}_{m.}$,

\[ \bar{r}_{m.} = \frac{1}{L - 1} \left[ \sum_{p=1}^{L} |r_{mp}| - 1 \right], \tag{3} \]

and $r_{mp}$ is the $(m, p)$th element of $\mathbf{R}$ ($m, p = 1, \ldots, L$), the pooled within-group sample correlation matrix for the $L$ outcomes. If all outcomes are completely correlated, then $\bar{r}_{m.} = 1$ and $L'_m = 1$. If all outcomes are completely uncorrelated, then $\bar{r}_{m.} = 0$ and $L'_m = L$. If the correlations among the outcomes are relatively constant, Dubey suggested adopting $L'_m = L^{w}$, where $w = 1 - \bar{r}_{..}$, and

\[ \bar{r}_{..} = \frac{\sum_{m=1}^{L} \bar{r}_{m.}}{L}. \tag{4} \]
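The Bonferroni adjustment, Hochberg's step-up rule, and Dubey's effective number of outcomes described above can be sketched in a few lines. The code below is a minimal illustration assuming NumPy; the function names and example p values are hypothetical. Capping the Bonferroni-adjusted p value at 1 is a common convention that the text does not state.

```python
import numpy as np


def bonferroni_adjust(p):
    """Single-step Bonferroni adjusted p values, L * p_m (capped at 1 by assumption)."""
    p = np.asarray(p, dtype=float)
    return np.minimum(1.0, p.size * p)


def hochberg_reject(p, alpha=0.05):
    """Hochberg step-up decisions: True where the null hypothesis is rejected."""
    p = np.asarray(p, dtype=float)
    L = p.size
    order = np.argsort(p)                  # positions of p_(1) <= ... <= p_(L)
    reject = np.zeros(L, dtype=bool)
    for m in range(L, 0, -1):              # start with the largest p value
        if p[order[m - 1]] <= alpha / (L - m + 1):
            reject[order[:m]] = True       # reject H_0(m) and all with smaller p
            break
    return reject


def dubey_effective_L(R):
    """Effective number of outcomes L'_m = L ** (1 - rbar_m), with rbar_m from Equation 3."""
    R = np.asarray(R, dtype=float)
    L = R.shape[0]
    rbar_m = (np.abs(R).sum(axis=1) - 1.0) / (L - 1)
    return L ** (1.0 - rbar_m)


p_values = np.array([0.010, 0.030, 0.040])   # hypothetical p values
adjusted = bonferroni_adjust(p_values)
decisions = hochberg_reject(p_values)

R = np.full((3, 3), 0.5)                     # hypothetical correlation matrix
np.fill_diagonal(R, 1.0)
L_eff = dubey_effective_L(R)                 # per-test level would be alpha / L_eff
```

With these hypothetical p values, Bonferroni rejects only the smallest, whereas Hochberg's procedure rejects all three because the largest p value is already below $\alpha = .05$.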
Westfall and Young (1993) proposed a permutation step-down (PSD) procedure based on the maximum test statistic (i.e., maxT)
to account for the dependencies among the outcome variables (see also Blair, Troendle, & Beck, 1996; Troendle, 1995). To apply this PSD procedure, begin by identifying a permutation scheme that generates the sampling distribution of the test statistic for each outcome variable under the null hypothesis. In the case of the two-group multivariate design, under the null hypothesis the observation vectors for study participants (i.e., $Y_{ij} = [Y_{ij1} \ldots Y_{ijL}]$; $i = 1, \ldots, n_j$; $j = 1, 2$) are assumed to be exchangeable between groups; that is, the observation vectors are as likely to be drawn from the treatment population as from the control population. Compute the test statistics for each of the $L$ outcomes and order them so that $t_{(1)} \le t_{(2)} \le \ldots \le t_{(L)}$; the corresponding null hypotheses are denoted $H_{0(1)}, H_{0(2)}, \ldots, H_{0(L)}$. The data are permuted, or shuffled (i.e., resampled without replacement), among the treatment and control groups $B$ times. For each permuted data set, test statistics for the $L$ outcomes (i.e., $t^{*}$s) are computed. The hypothesis associated with the maximum statistic, that is, $H_{0(L)}$, is tested first. The p value associated with $t_{(L)}$ is

\[ \tilde{p}_{(1)} = \frac{1}{B} \sum_{b=1}^{B} I\left[ \max_{1 \le h \le L} t^{*}_{hb} \ge t_{(L)} \right], \tag{5} \]

where the maximum extends over all $h$ ($h = 1, \ldots, L$), and $I[A]$ is the indicator function of the event $A$. In brief, the maximum test statistic in the original data set is compared to the maximum statistic from each permuted data set, taking into account the order of the test statistics in the original data set. The p value is the proportion of $B$ permuted data sets for which the maximum statistic is at least as large as $t_{(L)}$. If $\tilde{p}_{(L)} \le \alpha$, then reject $H_{0(L)}$ and proceed to test $H_{0(L-1)}$ by comparing $\tilde{p}_{(L-1)}$ to $\alpha$. However, if $\tilde{p}_{(L)} > \alpha$, retain $H_{0(L)}$ and all remaining hypotheses. In general, the adjusted p value for the $m$th order statistic in the original data set is

\[ \tilde{p}_{(m)} = \frac{1}{B} \sum_{b=1}^{B} I\left[ \max_{h} t^{*}_{hb} \ge t_{(L-m+1)} \right], \tag{6} \]

where the maximum extends over all $h$ corresponding to $t_{(1)}, \ldots, t_{(L-m+1)}$ ($m = 1, \ldots, L$). If $\tilde{p}_{(m)} \le \alpha$, then reject $H_{0(m)}$ and proceed to test $H_{0(m-1)}$. However, if $\tilde{p}_{(m)} > \alpha$, retain $H_{0(m)}$ and all remaining hypotheses.

A permutation test produces an exact result only if $B$ is equal to the total number of possible permutations of the data used to generate the sampling distribution. In practice, this can be computationally prohibitive. A random sample of permutations gives an approximate solution but produces a numeric result that is similar to the exact solution (Blair et al., 1996). Although $B = 500$ permutation samples have been shown to produce acceptable results, the precision of the estimate increases as $B$ increases (Ge, Dudoit, & Speed, 2003).

Troendle (1996) proposed a permutation step-up (PSU) procedure (see also Blair et al., 1996) that is more powerful than Westfall and Young's (1993) PSD procedure. However, Troendle found that the differences in power between the PSU and PSD procedure were small (i.e., less than 2 percentage points).
Although a PSD procedure (based on the minimum p value) is available in SAS software (SAS Institute, 2004b), the PSU procedure is not (Westfall, Tobias, Rom, Wolfinger, & Hochberg, 1999). Therefore, the PSD procedure is implemented here.
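The maxT step-down logic of Equations 5 and 6 can be sketched for a single measurement occasion as follows. This is an illustrative Python sketch assuming NumPy, not the SAS implementation used in the article; the function names, the random seed, and the simulated data are hypothetical. Enforcing monotone adjusted p values is assumed here as an equivalent way of expressing the rule that testing stops at the first retained hypothesis.

```python
import numpy as np


def abs_t(y, g):
    """Absolute pooled-variance t statistic per outcome; g is a boolean group mask."""
    y1, y2 = y[g], y[~g]
    n1, n2 = len(y1), len(y2)
    s2 = ((n1 - 1) * y1.var(axis=0, ddof=1)
          + (n2 - 1) * y2.var(axis=0, ddof=1)) / (n1 + n2 - 2)
    return np.abs(y1.mean(axis=0) - y2.mean(axis=0)) / np.sqrt(s2 * (1 / n1 + 1 / n2))


def maxt_step_down(y, g, B=500, seed=0):
    """Permutation step-down adjusted p values based on the maximum statistic."""
    rng = np.random.default_rng(seed)
    t_obs = abs_t(y, g)
    L = t_obs.size
    order = np.argsort(t_obs)[::-1]            # largest observed statistic first
    counts = np.zeros(L)
    for _ in range(B):
        # Shuffle group labels; each participant's outcome vector stays intact.
        t_perm = abs_t(y, rng.permutation(g))[order]
        # At step k, take the maximum over the hypotheses not yet tested.
        succ_max = np.maximum.accumulate(t_perm[::-1])[::-1]
        counts += succ_max >= t_obs[order]
    p_sorted = np.maximum.accumulate(counts / B)   # assumed monotonicity constraint
    p_adj = np.empty(L)
    p_adj[order] = p_sorted
    return p_adj


# Hypothetical data: 20 participants per group, 3 outcomes, effect on outcome 0 only.
rng = np.random.default_rng(1)
y = rng.normal(size=(40, 3))
y[:20, 0] += 3.0
g = np.arange(40) < 20
p_adj = maxt_step_down(y, g, B=200, seed=2)
```

Permuting whole rows preserves the correlations among the outcomes within each permuted data set, which is what allows the procedure to account for the dependencies.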
Procedures for Testing Multiple Outcomes in Two-Group Doubly Multivariate Designs

Procedures for testing multiple outcomes in doubly multivariate designs are described for a design with $J = 2$ groups, $K$ RMs, and $L$ outcome measures. Let $Y_{ij} = [Y_{ij11}\; Y_{ij12} \ldots Y_{ij1L} \ldots Y_{ijKL}]$ denote the row vector of observations for the $i$th subject in the $j$th group. For each outcome variable, there are three null hypotheses that may be of interest to the researcher: Group × Occasions interaction effect, occasions main effect, and group main effect. The interaction hypothesis is generally tested first, and if it is significant, the researcher does not proceed to test main effects. In this article, the set of $L$ interaction hypotheses is defined as a single family for multiple testing. The procedures discussed in this section are described with reference to this family of hypotheses, although all of the procedures can also be applied to other families of hypotheses, such as the set of $L$ hypotheses for the RM main effect or the set of $L$ hypotheses that define a priori contrasts among the groups and/or measurement occasions. Although this article focuses on controlling the FWR for $L$ hypotheses, an alternate approach is to control the FWR for all possible hypotheses about the RMs or groups across the $L$ outcome variables. The latter approach would result in a conservative procedure that may have limited appeal to researchers.

The choice of a test statistic for the outcomes (i.e., $t_m$) depends on the assumptions one makes about the data. Under the multivariate mixed model, the extension of the mixed model to multiple outcome variables, the data are assumed to follow a normal distribution and satisfy the assumption of multisample sphericity (Boik, 1988, 1991; Thomas, 1983).
For multivariate multisample sphericity to be a tenable assumption, the differences between pairs of RMs must exhibit a common variance, and this common variance must be homogeneous (i.e., equal) across groups and across all outcome variables. Given that multivariate multisample sphericity is a stringent assumption, RM multivariate analysis of variance (MANOVA) is one recommended alternative (Boik, 1991; Everitt, 1995; Looney & Stanley, 1989; McCall & Appelbaum, 1973). This model makes no assumptions about the common covariance structure but does assume homogeneity of group covariances and multivariate normality. Robust alternatives to RM MANOVA, which do not assume covariance homogeneity and/or multivariate normality, have also been proposed (Keselman, Wilcox, & Lix, 2003; Lix & Hinds, 2004). The RM MANOVA model is adopted in this article. It is implemented for each outcome variable with Hotelling's (1931) $T^2$ statistic when $J = 2$ and with a multivariate criterion, such as the Pillai–Bartlett trace (Bartlett, 1939; Pillai, 1955), when $J > 2$.

To extend the Armitage–Parmar–Dubey (APD) procedure to doubly multivariate data, recognize that $\Omega$, the common covariance matrix for the repeated measurements and outcome variables, has a block structure,
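\[
\Omega =
\begin{bmatrix}
\begin{bmatrix}
\sigma_{1111} & \cdots & \sigma_{1L11} \\
\vdots & \ddots & \vdots \\
\sigma_{L111} & \cdots & \sigma_{LL11}
\end{bmatrix}
& \cdots &
\begin{bmatrix}
\sigma_{111K} & \cdots & \sigma_{1L1K} \\
\vdots & \ddots & \vdots \\
\sigma_{L11K} & \cdots & \sigma_{LL1K}
\end{bmatrix}
\\
\vdots & \ddots & \vdots \\
\begin{bmatrix}
\sigma_{11K1} & \cdots & \sigma_{1LK1} \\
\vdots & \ddots & \vdots \\
\sigma_{L1K1} & \cdots & \sigma_{LLK1}
\end{bmatrix}
& \cdots &
\begin{bmatrix}
\sigma_{11KK} & \cdots & \sigma_{1LKK} \\
\vdots & \ddots & \vdots \\
\sigma_{L1KK} & \cdots & \sigma_{LLKK}
\end{bmatrix}
\end{bmatrix},
\tag{7}
\]

where $\sigma_{mpqs}$ is the covariance for the $m$th and $p$th outcome variables ($m, p = 1, \ldots, L$) for the $q$th and $s$th measurement occasions ($q, s = 1, \ldots, K$). The upper left block is the $L \times L$ covariance matrix for the outcomes at the first measurement occasion, the next diagonal block represents the covariance matrix for the outcomes at the second measurement occasion, and so on. To estimate the effective number of outcomes, begin by calculating the average correlation for the $m$th and $p$th outcomes,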
\[ \bar{r}_{mp..} = \frac{1}{K(K - 1)} \left[ \sum_{q=1}^{K} \sum_{s=1}^{K} |r_{mpqs}| - K \right], \tag{8} \]

where $r_{mpqs}$ represents the correlation coefficient for the $m$th and $p$th outcome variables at the $q$th and $s$th measurement occasions. Then, applying the concepts of Equation 4,

\[ \bar{r}_{m...} = \frac{1}{L - 1} \left[ \sum_{p=1}^{L} |\bar{r}_{mp..}| - 1 \right], \tag{9} \]

is used to estimate $L'_m$.

Another approach to estimating $L'_m$ assumes that $\Omega$ has the simple structure, $\Omega = \Sigma_K \otimes \Sigma_L$ (Galecki, 1994), where $\Sigma_K$ is the covariance matrix for the repeated measurements, $\Sigma_L$ is the covariance matrix for the outcome variables, and $\otimes$ is the symbol for the Kronecker product. The advantage of imposing this factorial or Kronecker product structure on the data is that it reduces the number of parameters to estimate, which results in greater precision of the estimates when $N/KL$ is small. If a general structure is assumed for $\Omega$, then a total of $[KL(KL + 1)]/2$ parameters must be estimated, whereas if a Kronecker product structure is assumed, there are only $K(K + 1)/2 + L(L + 1)/2$ parameters to estimate. A method for estimating $\Sigma_K$ and $\Sigma_L$ is described in the Appendix.

To apply the PSD procedure to doubly multivariate data, one begins by identifying a permutation scheme that generates the sampling distribution of the test statistic under the null hypothesis. The permutation scheme is a function of the model effect that is being investigated. Under the null hypothesis of no Group × Occasions effect, the data are assumed to be exchangeable between groups and measurement occasions. Accordingly, for the $b$th ($b = 1, \ldots, B$) permuted data set, the vectors $Y_{ijk} = [Y_{ijk1}, \ldots, Y_{ijkL}]$ are shuffled between groups as well as between measurement occasions. If one were interested in testing the null hypothesis of no occasion main effect, however, the data are assumed to be exchangeable between measurement occasions but not between groups under the null hypothesis. For the PSD procedure, calculation of the adjusted p values associated with the test statistics for each of the $L$ outcome variables proceeds in the same way as when there is only a single
measurement occasion. That is, for each permuted data set, test statistics for the L outcomes are computed, and each of the ordered test statistics from the original data set is compared to the relevant maximum test statistic from the permuted data set.
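The parameter counts given above for the general and Kronecker product covariance structures, and the block layout that the Kronecker product produces, can be illustrated with a short sketch. It assumes NumPy; the function names and the particular covariance values are hypothetical and chosen only for illustration.

```python
import numpy as np


def n_params_general(K, L):
    """Number of distinct parameters in an unstructured KL x KL covariance matrix."""
    return (K * L) * (K * L + 1) // 2


def n_params_kronecker(K, L):
    """Number of parameters when the covariance is Sigma_K (x) Sigma_L."""
    return K * (K + 1) // 2 + L * (L + 1) // 2


K, L = 4, 3

# Hypothetical covariance matrices for the occasions (K x K) and outcomes (L x L).
sigma_K = 10.0 * (0.5 * np.ones((K, K)) + 0.5 * np.eye(K))
sigma_L = 0.3 * np.ones((L, L)) + 0.7 * np.eye(L)

# np.kron places an L x L block sigma_K[q, s] * sigma_L at occasion pair (q, s),
# which matches a layout with one outcome block per pair of occasions.
omega = np.kron(sigma_K, sigma_L)

general = n_params_general(K, L)
kronecker = n_params_kronecker(K, L)
```

For $K = 4$ and $L = 3$, the general structure requires 78 parameters and the Kronecker product structure requires 16, which shows how much the structured model reduces the estimation burden.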
Monte Carlo Study

A Monte Carlo study was undertaken to compare the FWR and power of procedures for testing multiple outcome variables in doubly multivariate designs. The procedures investigated were (a) Bonferroni, (b) Hochberg's (1988) step-up, (c) APD, and (d) PSD. For the APD procedure, both methods of computing the effective number of outcomes were considered: (a) the method that assumes a general structure for $\Omega$ (i.e., APD-G), and (b) the method that assumes $\Omega$ has a parsimonious Kronecker product structure (i.e., APD-K). All procedures were investigated for testing the family of $L$ Group × Occasions effects in a doubly multivariate design with $J = 2$ and $K = 4$.

The simulation conditions were (a) number of outcome variables, (b) total sample size, (c) correlation among the outcomes, and (d) configuration of the population mean vector for the outcome variables. The constants were correlation among the repeated measurements ($\rho_K = .50$), variance of the repeated measurements ($\sigma_K^2 = 10$), variance of the outcome variables ($\sigma_L^2 = 1$), and magnitude/pattern of the interaction effect for the nonnull case.

Three values of $L$ were considered: three, six, and nine (Blair et al., 1996; Troendle, 1996); previous simulation studies have investigated between 3 and 20 outcome variables although most have investigated $L \le 10$. The content analysis described earlier showed that about 10% of the investigated studies tested nine or more outcome variables. Total sample sizes of $N = 72$ and 108 were investigated. The ratio $N/L$ ranged from 8 to 36 in this study, whereas in previous research, ratios ranging from 3 to 15 were investigated (Blair et al., 1996). Only balanced group sizes were considered (Ramsey, 1982), so that $n_1 = n_2 = 36$ for the smaller value of $N$ and $n_1 = n_2 = 54$ for the larger value of $N$.

A vector of standard normal deviates, $C_{ij}$, was transformed to a vector of multivariate observations through

\[ Y_{ij} = \boldsymbol{\mu}_j + Q C_{ij}^{T}, \tag{10} \]

where $\boldsymbol{\mu}_j$ is the row vector of population means, $Q$ is an upper triangular matrix of dimension $KL$ satisfying the equality $Q^{T}Q = \Omega$ and is obtained by a Cholesky decomposition of $\Omega$. These deviates were generated with the RANNOR function in SAS (SAS Institute, 2004a). $\Omega$ had the form $V \boldsymbol{\rho} V^{T}$, where $V$ is a diagonal matrix with $\boldsymbol{\sigma} = \boldsymbol{\sigma}_K \otimes \boldsymbol{\sigma}_L$ on the diagonal, where $\boldsymbol{\sigma}_K$ is the vector of standard deviations for the measurement occasions and $\boldsymbol{\sigma}_L$ is the vector of standard deviations for the outcome variables. As well, $\boldsymbol{\rho} = \boldsymbol{\rho}_K \otimes \boldsymbol{\rho}_L$, where $\boldsymbol{\rho}_K$ and $\boldsymbol{\rho}_L$ are, respectively, correlation matrices for the measurement occasions and outcome variables.

Two types of group correlation matrices were considered. For the first type, $\boldsymbol{\rho}_{1L} = \boldsymbol{\rho}_{2L} = \boldsymbol{\rho}_L$, where $\boldsymbol{\rho}_{jL}$ is the correlation matrix for the $j$th group. For this type of correlation matrix, four conditions were considered: $\boldsymbol{\rho}_L = (\rho_L) = 0.00$, $\boldsymbol{\rho}_L = (\rho_L) = 0.30$, $\boldsymbol{\rho}_L = (\rho_L) = 0.70$, and $\boldsymbol{\rho}_L$ = mixed. For the last condition, $\boldsymbol{\rho}_L$ had a simplex structure in which the diagonal above and below the main diagonal
total sample size. For $L = 3$, average error rates of the Bonferroni procedure and the Hochberg (1988) procedure never exceeded the upper bound of Bradley's (1978) stringent criterion, although minimum values for both procedures were below the lower bound of this criterion. The APD procedures based on the general covariance structure (i.e., APD-G) and the Kronecker product structure (i.e., APD-K) resulted in average error rates above the upper bound of Bradley's stringent criterion for $L = 3$ although minimum values did not fall below the lower bound of this criterion. Average, minimum, and maximum values for the PSD procedure were contained within the bounds of the stringent criterion for this smallest value of $L$.

For $L > 3$, the Bonferroni procedure and the Hochberg (1988) procedure resulted in average and minimum error rates that were below the lower bound of Bradley's (1978) stringent criterion. The APD-G procedure resulted in an average value that slightly exceeded the upper bound of this criterion for $L = 6$ but not $L = 9$, although the maximum value of this procedure was above the upper bound of the stringent criterion for both values of $L$. The APD-K procedure resulted in average values that were above this upper bound for the two largest values of $L$. Unlike error rates for the APD-G procedure, which tended to decrease as $L$ increased, those of the APD-K procedure increased along with $L$. The PSD procedure always controlled the FWR within the bounds of Bradley's stringent criterion.

Error rates were similar for the two investigated conditions of total sample size (Table 3). Regardless of the value of $N$, mean Type I error rates were below the lower bound of Bradley's (1978) stringent criterion for the Bonferroni procedure, near or below the lower bound for Hochberg's procedure, and well controlled for the PSD procedure.
With respect to Bradley's (1978) liberal criterion, the Bonferroni, Hochberg (1988), and PSD procedures always resulted in error rates that were contained within the bounds of this criterion. For the APD-G procedure, only 8.3% of values exceeded the upper bound of Bradley's liberal criterion, and all of these values occurred for the smallest value of $L$. However, for the APD-K procedure, two thirds (63.9%) of the FWRs exceeded the upper bound of Bradley's liberal criterion.

Further investigation of the data revealed that for most procedures, error rates were similar for $\boldsymbol{\rho}_{1L} = \boldsymbol{\rho}_{2L}$ and $\boldsymbol{\rho}_{1L} \neq \boldsymbol{\rho}_{2L}$. For example, for the PSD procedure, the mean error rate was 4.95 for the former condition and 5.01 for the latter. For Hochberg's (1988) procedure, the corresponding values were 4.43 and 4.52. For the APD-G procedure, the difference between the two conditions was slightly larger (average FWRs of 6.37 and 5.85). For conditions where $\boldsymbol{\rho}_{1L} = \boldsymbol{\rho}_{2L}$, error rates for the Bonferroni procedure and the Hochberg (1988) procedure decreased as $\rho_L$ increased. For $\rho_L = 0.30$, the
had ρL = 0.70, and succeeding diagonals had ρL = 0.30 (Ramsey, 1982). The second type of group correlation matrix had ρ1L ≠ ρ2L. Two conditions were considered: (a) ρ1L = (ρL) = 0.00 and ρ2L = (ρL) = 0.70, and (b) ρ1L = (ρL) = 0.30 and ρ2L = (ρL) = 0.70. The population mean vectors had the configuration μj = μjK ⊗ μjL, where μjK is the jth mean vector for the measurement occasions and μjL is the jth mean vector for the outcome variables. For the repeated measurements, μ1K = μ2K = [0 0 0 0] for the null case, and μ1K = [0 0 1 1] and μ2K = [1 1 0 0] for the nonnull case. Table 2 describes the six configurations of μ1L that were investigated. In all cases, μ2L was the null vector. As shown in this table, nonnull cases were investigated in which the treatment effect was concentrated on a single outcome variable, as well as conditions in which a treatment effect was present on all outcome variables and was either of the same magnitude or varied in magnitude across the outcomes (Troendle, 1996).
Empirical FWRs were obtained when the mean vectors for the measurement occasions and outcome variables were null. For nonnull vectors, two measures of statistical power were calculated: (a) any-variable power, the power of a procedure to reject at least one nonnull hypothesis in the family of L hypotheses, and (b) all-variable power, the power of a procedure to reject all nonnull hypotheses in the family of L hypotheses. As Ramsey (1982) noted, any-variable power is often of greatest interest in exploratory studies, where “a researcher wants to maximize the chances of finding some true difference (possibly to justify future research)” (p. 140), whereas all-variable power is often of greatest interest in confirmatory studies.
We conducted a total of 5,000 simulations for each combination of conditions. For the PSD procedure, B = 500 permutation samples were used to estimate the empirical distribution of the test statistics under the null hypothesis (Troendle, 1996).
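The two power measures defined above can be illustrated with a minimal sketch. It assumes that the simulation output is available as a Boolean matrix of rejection decisions (one row per simulated data set, one column per outcome); the function name `power_rates` and the toy data are hypothetical and are not taken from the authors' SAS/IML program.

```python
import numpy as np

def power_rates(rejections, nonnull):
    """Return (any-variable power, all-variable power).

    rejections: (n_sims, L) Boolean array, True where a hypothesis was rejected.
    nonnull:    (L,) Boolean array, True where the hypothesis is truly nonnull.
    """
    rejections = np.asarray(rejections, dtype=bool)
    nonnull = np.asarray(nonnull, dtype=bool)
    if not nonnull.any():
        raise ValueError("at least one hypothesis must be nonnull")
    hits = rejections[:, nonnull]                # decisions on nonnull outcomes only
    any_power = float(hits.any(axis=1).mean())   # at least one nonnull rejected
    all_power = float(hits.all(axis=1).mean())   # every nonnull rejected
    return any_power, all_power

# Toy illustration: 4 simulated data sets, L = 3, outcomes 1 and 2 nonnull.
rej = np.array([[True, False, False],
                [True, True, False],
                [False, False, True],
                [False, False, False]])
nonnull = np.array([True, True, False])
any_p, all_p = power_rates(rej, nonnull)
print(any_p, all_p)
```

Rejections of true null hypotheses (the third column here) do not count toward either measure, which is why the third simulated data set contributes to neither rate.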
The FWRs were evaluated with both Bradley’s (1978) stringent and liberal criteria. According to the stringent criterion, FWRs outside the bounds of (0.045, 0.055) are considered to be nonrobust for α = .05, whereas for the liberal criterion, FWRs outside the bounds of (0.025, 0.075) are considered to be nonrobust. For evaluating statistical power, differences in power less than 5% are considered negligible, whereas those greater than or equal to 10% are considered substantial (Lix, Deering, Fouladi, & Manivong, 2009). The simulation program was written with SAS/IML software, Version 9.1 (SAS Institute, 2004a).
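As a small illustration of these evaluation rules, the sketch below encodes the two sets of bounds quoted above for α = .05 and the 5% and 10% power thresholds. The function names are hypothetical, and the label "intermediate" for differences between 5% and 10% is our own placeholder; the text does not name that range.

```python
# Bounds for alpha = .05 as quoted in the text (proportions, not percentages).
BRADLEY_BOUNDS = {
    "stringent": (0.045, 0.055),
    "liberal": (0.025, 0.075),
}

def bradley_robust(fwr, criterion="stringent"):
    """True if an empirical FWR lies within Bradley's (1978) bounds."""
    lower, upper = BRADLEY_BOUNDS[criterion]
    return lower <= fwr <= upper

def classify_power_difference(diff_percent):
    """Classify an absolute power difference given in percentage points."""
    d = abs(diff_percent)
    if d < 5.0:
        return "negligible"
    if d >= 10.0:
        return "substantial"
    return "intermediate"  # placeholder label, not used in the article

# Mean FWRs taken from Table 3, converted from percentages.
print(bradley_robust(0.0506, "stringent"))   # PSD, L = 3
print(bradley_robust(0.0679, "stringent"))   # APD-G, L = 3
print(bradley_robust(0.0679, "liberal"))
print(classify_power_difference(11.95))
```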
Results
Type I error rates. Mean, minimum, and maximum FWRs are summarized in Table 3 by number of outcome variables and total sample size.

Table 2
Configurations of μ1L for Nonnull Conditions in the Simulation Study

| Configuration | L = 3 | L = 6 | L = 9 |
| --- | --- | --- | --- |
| I | (0, 0, 0) | (0, 0, 0, 0, 0, 0) | (0, 0, 0, 0, 0, 0, 0, 0, 0) |
| II | (1.5, 0, 0) | (1.5, 0, 0, 0, 0, 0) | (1.5, 0, 0, 0, 0, 0, 0, 0, 0) |
| III | (1.5, 1.5, 0) | (1.5, 1.5, 1.5, 0, 0, 0) | (1.5, 1.5, 1.5, 1.5, 1.5, 0, 0, 0, 0) |
| IV | (2.0, 1.5, 0) | (2.0, 1.5, 1.5, 0, 0, 0) | (2.0, 1.5, 1.5, 1.5, 1.5, 0, 0, 0, 0) |
| V | (1.5, 1.5, 1.5) | (1.5, 1.5, 1.5, 1.5, 1.5, 1.5) | (1.5, 1.5, 1.5, 1.5, 1.5, 1.5, 1.5, 1.5, 1.5) |
| VI | (2.0, 1.75, 1.5) | (2.0, 2.0, 1.75, 1.75, 1.5, 1.5) | (2.0, 2.0, 2.0, 1.75, 1.75, 1.75, 1.5, 1.5, 1.5) |

Note. μ2L was equal to the null vector for all conditions.
Table 3
Familywise Error Rates (%) of Multiple-Testing Procedures for the Group × Occasions Effect, by Number of Outcome Variables (L) and Total Sample Size (N)

| Condition | Summary statistic | BON | HOCH | APD-G | APD-K | PSD |
| --- | --- | --- | --- | --- | --- | --- |
| L = 3 | M (SD) | 4.69 (0.41) | 4.76 (0.39) | 6.79 (0.90) | 7.05 (1.09) | 5.06 (0.26) |
| L = 3 | Min | 4.10 | 4.14 | 5.10 | 4.98 | 4.50 |
| L = 3 | Max | 5.22 | 5.26 | 8.12 | 8.60 | 5.38 |
| L = 6 | M (SD) | 4.38 (0.50) | 4.39 (0.50) | 5.90 (0.63) | 8.54 (2.09) | 4.94 (0.25) |
| L = 6 | Min | 3.38 | 3.42 | 4.76 | 4.86 | 4.54 |
| L = 6 | Max | 4.88 | 4.90 | 6.74 | 11.28 | 5.42 |
| L = 9 | M (SD) | 4.22 (0.53) | 4.23 (0.53) | 5.34 (0.42) | 9.42 (2.51) | 4.92 (0.29) |
| L = 9 | Min | 3.52 | 3.52 | 4.72 | 4.70 | 4.50 |
| L = 9 | Max | 4.98 | 4.98 | 5.98 | 12.86 | 5.38 |
| N = 72 | M (SD) | 4.48 (0.54) | 4.51 (0.55) | 6.05 (0.93) | 8.34 (2.14) | 4.98 (0.31) |
| N = 72 | Min | 3.50 | 3.50 | 4.76 | 4.86 | 4.50 |
| N = 72 | Max | 5.22 | 5.26 | 7.74 | 12.86 | 5.38 |
| N = 108 | M (SD) | 4.38 (0.49) | 4.41 (0.49) | 6.00 (0.89) | 8.34 (2.26) | 4.97 (0.22) |
| N = 108 | Min | 3.38 | 3.42 | 4.72 | 4.70 | 4.58 |
| N = 108 | Max | 5.04 | 5.04 | 8.12 | 12.36 | 5.42 |

Note. Bold mean, minimum, and maximum values are above the upper bound, and italicized average, minimum, and maximum values are below the lower bound of Bradley’s (1978) stringent criterion of (4.50, 5.50). BON = Bonferroni; HOCH = Hochberg; APD-G = Armitage–Parmar–Dubey based on general structure for Ω; APD-K = Armitage–Parmar–Dubey based on Kronecker product structure for Ω; PSD = permutation step-down.
average FWRs for these two procedures were 4.85% and 4.87%, respectively, whereas for ρL = 0.70, the average values were 3.75% and 3.81%, respectively. For the PSD procedure, the average FWR for ρL = 0.30 was 5.10%, and for ρL = 0.70 it was 5.07%. For the APD-G procedure, the average error rates were 6.30% and 6.35%, respectively, for these two correlation conditions.
Power rates. Table 4 summarizes the percentages of any-variable and all-variable power for all procedures with the exception of the APD-K procedure because it could not effectively control the FWR across all conditions. In this table, the data have been averaged across values of N, patterns of correlation, and mean configurations. For any-variable power when L = 3, the differences among the procedures were negligible. The difference in average power between Hochberg’s (1988) procedure and the Bonferroni procedure was 0.50%; this difference was 4.18% for the APD-G procedure, and for the PSD procedure it was 0.69%. For L > 3, some of these differences were slightly larger but still negligible in magnitude. The difference between the PSD and Bonferroni procedures was 1.11% for L = 6 and 1.40% for L = 9. The differences between the APD-G procedure and Hochberg’s procedure were 3.00% and 2.43%, respectively, for these values of L.
With respect to rates of all-variable power, the difference in average power between the Hochberg (1988) and Bonferroni procedures was 10.36% for L = 3, whereas the difference was 7.33% for the APD-G and Bonferroni procedures and 9.65% for the PSD and Bonferroni procedures. For L > 3, these differences were larger for all but the APD-G procedure. For example, for L = 9, the difference in average power between Hochberg’s (1988) procedure and the Bonferroni procedure was 11.95%, whereas for the PSD procedure, it was 11.21%. The difference in average all-variable power between Hochberg’s procedure and the PSD procedure was not large. For all three values of L, Hochberg’s procedure was slightly more powerful, but the differences were negligible in magnitude. For example, for L = 9 the difference in average all-variable power was 0.74% between the two procedures.
All-variable power rates for the five nonnull configurations of the mean vectors are summarized in Table 5. The data have been averaged across values of N, L, and correlation pattern. The difference in average power between the Hochberg (1988) and Bonferroni procedures ranged from 0.15% for Configuration II to 25.35% for Configuration VI, whereas the difference between the PSD and Bonferroni procedures ranged from 1.52% to 22.47%. For the APD-G and Bonferroni procedures, the differences ranged from 4.09% to 5.87%. The differences in average all-variable power between the PSD procedure and Hochberg’s procedure were small and ranged from 1.37% for Configuration II to −3.30% for Configuration V. Average all-variable power for the APD-G procedure was greater than average power for the PSD procedure only for Configuration II (difference = 2.57%), although this difference is considered negligible. Otherwise, the differences could be substantial; for Configuration V, average power for the PSD procedure was 32.87%, and for the APD-G procedure it was only 20.18%.
Average rates of any-variable power for these five nonnull mean configurations (not shown) revealed that although the Bonferroni procedure was the least powerful, the differences among the procedures were never substantial. For example, for Mean Configurations II and V, average power values for Hochberg’s (1988)
Table 4
Percentages of Any-Variable and All-Variable Power of Multiple-Testing Procedures for the Group × Occasions Effect by Number of Outcome Variables (L)

| Condition | Summary statistic | BON | HOCH | APD-G | PSD |
| --- | --- | --- | --- | --- | --- |
| Any-variable; L = 3 | M (SD) | 79.93 (15.97) | 80.43 (15.94) | 84.11 (14.37) | 80.62 (15.58) |
| Any-variable; L = 3 | Min | 45.60 | 45.90 | 45.76 | 45.78 |
| Any-variable; L = 3 | Max | 99.72 | 97.60 | 99.78 | 99.74 |
| Any-variable; L = 6 | M (SD) | 78.79 (19.60) | 78.94 (19.59) | 81.94 (18.37) | 79.89 (19.03) |
| Any-variable; L = 6 | Min | 34.56 | 34.66 | 36.48 | 35.12 |
| Any-variable; L = 6 | Max | 99.96 | 99.96 | 99.96 | 99.94 |
| Any-variable; L = 9 | M (SD) | 79.11 (21.47) | 79.18 (21.47) | 81.61 (20.56) | 80.51 (20.81) |
| Any-variable; L = 9 | Min | 30.18 | 30.22 | 31.00 | 30.84 |
| Any-variable; L = 9 | Max | 100.00 | 100.00 | 100.00 | 100.00 |
| All-variable; L = 3 | M (SD) | 45.60 (17.09) | 55.96 (15.37) | 52.93 (17.29) | 55.25 (15.71) |
| All-variable; L = 3 | Min | 95.80 | 25.62 | 11.48 | 22.54 |
| All-variable; L = 3 | Max | 71.26 | 79.64 | 78.00 | 78.82 |
| All-variable; L = 6 | M (SD) | 25.85 (17.25) | 38.22 (17.48) | 30.29 (18.47) | 37.57 (17.67) |
| All-variable; L = 6 | Min | 0.24 | 6.52 | 0.30 | 3.96 |
| All-variable; L = 6 | Max | 60.66 | 72.84 | 66.90 | 70.66 |
| All-variable; L = 9 | M (SD) | 16.44 (16.36) | 28.39 (18.08) | 19.07 (17.44) | 27.65 (17.97) |
| All-variable; L = 9 | Min | 0.00 | 0.72 | 0.00 | 0.56 |
| All-variable; L = 9 | Max | 55.72 | 67.40 | 60.24 | 64.44 |

Note. BON = Bonferroni; HOCH = Hochberg; APD-G = Armitage–Parmar–Dubey based on general structure for Ω; PSD = permutation step-down.
procedure were 49.70% and 86.71%, respectively. For the PSD procedure, these values were 51.54% and 87.32%.
Further analysis revealed that average rates of any-variable and all-variable power for all of the investigated procedures were similar regardless of whether ρ1L = ρ2L or ρ1L ≠ ρ2L. For example, for the Bonferroni procedure, average any-variable power rates were 79.42% and 79.00%, respectively, for these two correlation conditions, whereas for all-variable power, the average rates were 29.10% and 29.83%, respectively. For the PSD procedure, the average any-variable power rate for the former case was 80.61%, whereas for the latter, it was 79.81%. For conditions where ρ1L = ρ2L, the power of the Bonferroni procedure decreased more than for other procedures as ρL increased; for ρL = 0.30, average any-variable power was 81.00%, whereas for ρL = 0.70, it was 74.10%. For the PSD procedure, the corresponding values were 81.30% and 77.40%, respectively.
Numeric Example and Software
Data for this numeric example are from the Manitoba Inflammatory Bowel Disease Cohort Study, a longitudinal cohort study of patients diagnosed with Crohn’s disease or ulcerative colitis. An objective of the cohort study is to investigate patients’ psychosocial responses over the disease course. The cohort has been described in detail elsewhere (Lix et al., 2008). Data were analyzed for all cohort participants (n = 356) who provided responses at three measurement occasions: 0 months (i.e., baseline), 12 months, and 24 months following study entry. The example tests differences between participants with self-reported active and inactive disease on multiple measures of quality of life. Disease activity was assessed on the basis of patient reports of symptom persistence at study baseline and is a dichotomous variable. The outcome variables are subscales of the Inflammatory Bowel Disease Questionnaire (Guyatt et al., 1989): bowel symptoms, emotional function, social function, and systemic symptoms. Higher scores are indicative of better health outcomes. The analysis is limited to study participants with complete data; a total of 55 participants had missing data on one or more outcome variables.
The results for tests of the Group × Occasions interaction with the Bonferroni, Hochberg, APD-G, APD-K, and PSD procedures are reported in Table 6. Although the APD-K procedure is not recommended because it cannot control the FWR, the results for this procedure are provided for comparative purposes. If each test is evaluated at αm = .05, all are declared statistically significant. If the Bonferroni procedure is adopted to control the FWR to α = .05, the test for the systemic symptoms outcome variable is judged to be nonsignificant. For the APD-G procedure, the estimated average correlations (Equation 9) range from .42 to .44; accordingly, the effective number of outcomes ranges from 2.18 to 2.24. For the APD-K procedure, the estimated average correlations range from .60 to .66; accordingly, the effective number of outcomes ranges from 1.61 to 1.74. Both procedures result in rejection of all interaction hypotheses. Similarly, the stepwise procedures result in rejection of all four interaction hypotheses. The SAS syntax to implement each of these test procedures is provided in the online supplementary documentation. This syntax
Table 5
Percentages of All-Variable Power of Multiple-Testing Procedures for the Group × Occasions Effect by Nonnull Mean Configuration

| Configuration | Summary statistic | BON | HOCH | APD-G | PSD |
| --- | --- | --- | --- | --- | --- |
| II | M (SD) | 49.55 (13.74) | 49.70 (13.78) | 53.64 (14.28) | 51.07 (13.67) |
| II | Min | 30.18 | 30.22 | 31.00 | 30.84 |
| II | Max | 72.16 | 71.56 | 77.08 | 72.16 |
| III | M (SD) | 24.05 (17.21) | 29.72 (18.79) | 28.63 (19.80) | 30.31 (18.70) |
| III | Min | 0.22 | 0.72 | 0.24 | 0.56 |
| III | Max | 57.74 | 64.88 | 68.78 | 65.16 |
| IV | M (SD) | 31.31 (21.33) | 37.26 (22.20) | 36.08 (23.31) | 37.99 (22.08) |
| IV | Min | 0.44 | 1.32 | 0.58 | 1.38 |
| IV | Max | 68.94 | 74.84 | 78.00 | 75.50 |
| V | M (SD) | 15.50 (14.74) | 36.17 (18.72) | 20.18 (18.27) | 32.87 (18.83) |
| V | Min | 0.00 | 1.62 | 0.00 | 0.68 |
| V | Max | 50.24 | 68.64 | 63.28 | 66.82 |
| VI | M (SD) | 26.08 (19.66) | 51.43 (19.70) | 31.95 (22.51) | 48.55 (20.60) |
| VI | Min | 0.08 | 8.26 | 0.16 | 4.88 |
| VI | Max | 63.70 | 79.64 | 75.20 | 78.82 |

Note. BON = Bonferroni; HOCH = Hochberg; APD-G = Armitage–Parmar–Dubey procedure based on general structure for Ω; PSD = permutation step-down.
includes the SAS/IML code used to implement the APD-G, APD-K, and PSD procedures and the SAS/STAT code to implement the MULTTEST procedure for the Bonferroni procedure and the Hochberg (1988) procedure.
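For readers without access to the SAS syntax, the following minimal sketch illustrates two of the adjustments using the raw p values reported in Table 6. It is not the authors' code. The Bonferroni adjustment multiplies each p value by L. For the APD procedures, Equation 9 is not reproduced in this section, so the sketch assumes the commonly cited Dubey and Armitage–Parmar form (see Sankoh, Huque, & Dubey, 1997), in which the effective number of outcomes is L raised to the power (1 − r̄), with r̄ the average correlation; this form approximately reproduces the effective numbers quoted above, but it should be treated as an assumption. Function names are hypothetical.

```python
def bonferroni_adjust(p_values):
    """Bonferroni-adjusted p values: L * p, capped at 1."""
    L = len(p_values)
    return [min(1.0, L * p) for p in p_values]

def effective_number(L, mean_corr):
    """Assumed APD form: effective number of outcomes = L ** (1 - mean_corr)."""
    return L ** (1.0 - mean_corr)

def apd_adjust(p, n_effective):
    """Assumed APD-adjusted p value for a given effective number of outcomes."""
    return 1.0 - (1.0 - p) ** n_effective

# Raw p values from Table 6 (bowel, emotional, social, systemic).
raw_p = [0.0011, 0.0003, 0.0071, 0.0151]
bon = [round(p, 4) for p in bonferroni_adjust(raw_p)]
print(bon)

# Average correlations of .42 and .60 are the rounded values quoted in the text,
# so the results only approximate the tabled effective numbers (2.24 and 1.74).
print(effective_number(4, 0.42), effective_number(4, 0.60))

# Systemic symptoms under APD-G, using the tabled effective number of 2.23;
# the result is close to the tabled adjusted value of 0.0333.
print(apd_adjust(0.0151, 2.23))
```

Because the inputs are rounded, small discrepancies from Table 6 in the last reported digit are to be expected.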
Discussion and Conclusions
Our content analysis of recently published articles from four psychology journals found that testing of multiple outcomes is common in RM designs. There are several reasons why researchers might investigate multiple outcome variables. There may be limited research knowledge about which outcome or outcomes will be responsive to treatment or little consensus about which outcome is clinically relevant. A treatment might be intended to have a multifaceted effect, necessitating a research question that focuses on more than one outcome. By investigating multiple outcome variables in a single study, it may be possible to gain a better overall understanding of treatment efficacy in experimental research.
The content analysis results revealed that some researchers do not adjust for multiple testing because their objective or objectives are focused on the individual hypotheses. Other researchers choose to adopt a multiple-testing procedure because of their concern for an increase in the FWR. However, the content analysis also revealed that the majority of researchers do not address the issue of multiple testing in their studies. We recommend that researchers carefully consider their research objectives and ensure that their approach to multiple testing is consistent with these objectives.
In the Monte Carlo study, the empirical properties of several procedures for testing multiple outcomes in two-group RM designs were investigated. The Bonferroni procedure and stepwise alternatives to the Bonferroni procedure were considered because they are easy to implement and well-known by many researchers. The
results of this study confirm previous research that demonstrates the Bonferroni procedure can result in substantially reduced power to detect all nonnull differences.
Three procedures that account for the dependencies in the data were investigated. Two procedures, based on the work of Armitage and Parmar (1986) and Dubey (1994), modify the divisor of the Bonferroni procedure. These procedures do not consider the correlations among the outcome variables directly. Instead, they incorporate information about the average correlation among the outcomes when setting the level of significance for each variable. Neither of the modified Bonferroni procedures could control the FWR to α for all of the investigated conditions. In particular, the procedure that assumes a simplified Kronecker product structure could result in inflated error rates, particularly as the number of outcome variables increased. The procedure that assumes a general covariance structure also occasionally resulted in inflated error rates for small numbers of outcome variables but resulted in error rates that tended to converge to α as L increased in value. At the same time, the modified Bonferroni procedure that assumes a general covariance structure was only slightly more powerful than the conventional Bonferroni procedure and was frequently less powerful than the permutation procedure.
The PSD procedure described by Westfall and Young (1993), which is based on the maximum test statistic and which uses resampling to account for the dependencies among the outcome variables, always maintained the FWR at α. It could be substantially more powerful than the conventional Bonferroni procedure, particularly as the number of outcome variables increased or as the correlations among the outcome variables increased in absolute magnitude.
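The general logic of a maximum-statistic permutation step-down adjustment can be sketched as follows. This is a simplified illustration only: it assumes a two-group comparison on L outcomes measured once, uses absolute pooled-variance t statistics in place of the Group × Occasions F tests studied here, and estimates adjusted p values as simple permutation proportions. The function names and the simulated data are hypothetical.

```python
import numpy as np

def abs_t_stats(X, g):
    """Absolute pooled-variance two-sample t statistic for each column of X."""
    a, b = X[g == 0], X[g == 1]
    na, nb = len(a), len(b)
    pooled = ((na - 1) * a.var(axis=0, ddof=1) +
              (nb - 1) * b.var(axis=0, ddof=1)) / (na + nb - 2)
    return np.abs(a.mean(axis=0) - b.mean(axis=0)) / np.sqrt(pooled * (1 / na + 1 / nb))

def maxt_step_down(X, g, n_perm=500, seed=0):
    """Step-down adjusted p values based on successive maxima of permuted statistics."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    g = np.asarray(g)
    t_obs = abs_t_stats(X, g)
    order = np.argsort(-t_obs)               # most to least significant outcome
    counts = np.zeros(len(t_obs))
    for _ in range(n_perm):
        # Permuting group labels keeps each participant's outcomes together,
        # so the correlations among outcomes are preserved.
        t_perm = abs_t_stats(X, rng.permutation(g))[order]
        # u[j] = max of the permuted statistics for outcomes ranked j or lower.
        u = np.maximum.accumulate(t_perm[::-1])[::-1]
        counts += u >= t_obs[order]
    p_sorted = np.maximum.accumulate(counts / n_perm)  # enforce monotonicity
    p_adj = np.empty_like(p_sorted)
    p_adj[order] = p_sorted                   # back to the original outcome order
    return p_adj

# Simulated example: 30 participants per group, 3 correlated outcomes,
# with a large group difference on the first outcome only.
rng = np.random.default_rng(1)
shared = rng.normal(size=(60, 1))
X = 0.6 * shared + 0.8 * rng.normal(size=(60, 3))
g = np.repeat([0, 1], 30)
X[g == 1, 0] += 3.0
p_adj = maxt_step_down(X, g, n_perm=500, seed=2)
print(p_adj)
```

Because whole rows are permuted, the reference distribution of the maximum statistic reflects the dependence among outcomes, which is the property the text attributes to the PSD procedure.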
However, although this procedure always had greater any-variable power than Hochberg’s (1988) step-up procedure and frequently resulted in greater all-variable power, the magnitude of
Table 6
Results of a Numeric Example for Multiplicity-Adjusted Tests of the Group × Occasions Effect for Four Outcome Variables; α = .05

| Outcome variable | BON | HOCH | APD-G | APD-K | PSD |
| --- | --- | --- | --- | --- | --- |
| Bowel symptoms | F(2, 298) = 6.93; p1 = 0.0011; p̃1 = 0.0044 | F(2, 298) = 6.93; p(3) = 0.0011; p̃(3) = 0.0044 | Effective no. of outcomes = 2.24; p1 = 0.0011; p̃1 = 0.0026 | Effective no. of outcomes = 1.67; p1 = 0.0011; p̃1 = 0.0019 | p̃(2) = 0.0020 |
| Emotional health | F(2, 298) = 8.30; p2 = 0.0003; p̃2 = 0.0012 | F(2, 298) = 8.30; p(4) = 0.0003; p̃(4) = 0.0012 | Effective no. of outcomes = 2.18; p2 = 0.0003; p̃2 = 0.0007 | Effective no. of outcomes = 1.61; p2 = 0.0003; p̃2 = 0.0005 | p̃(1) < 0.0001 |
| Social functioning | F(2, 298) = 5.03; p3 = 0.0071; p̃3 = 0.0284 | F(2, 298) = 5.03; p(2) = 0.0071; p̃(2) = 0.0142 | Effective no. of outcomes = 2.24; p3 = 0.0071; p̃3 = 0.0159 | Effective no. of outcomes = 1.69; p3 = 0.0071; p̃3 = 0.0120 | p̃(3) = 0.0120 |
| Systemic symptoms | F(2, 298) = 4.25; p4 = 0.0151; p̃4 = 0.0604 | F(2, 298) = 4.25; p(1) = 0.0151; p̃(1) = 0.0151 | Effective no. of outcomes = 2.23; p4 = 0.0151; p̃4 = 0.0333 | Effective no. of outcomes = 1.74; p4 = 0.0151; p̃4 = 0.0263 | p̃(4) = 0.0150 |

Note. BON = Bonferroni; HOCH = Hochberg; APD-G = Armitage–Parmar–Dubey procedure based on general structure for Ω; APD-K = Armitage–Parmar–Dubey based on Kronecker product structure for Ω; PSD = permutation step-down.
the difference in power between the two stepwise procedures was small. Thus, for the modest number of measurement occasions and outcome variables considered in this study, there was no substantial power advantage to be gained by adopting the computationally intensive step-down procedure. This is consistent with the results of Blair et al. (1996), who found that for the two-group design with L = 4 outcome variables and correlations among the outcome variables ranging from .20 to .50, the differences in average per-variable power between the PSD procedure and Rom’s (1990) sequential Bonferroni procedure were generally less than 5%. However, for correlations of .80 and L = 20 outcomes, the differences in power were more substantial and frequently exceeded 10%.
The limitations of this study should be noted. The content analysis did not investigate all published psychology articles; doubly multivariate designs may be less common in some psychology research areas than in others. In the simulation study, several parameters were held constant, including the size of the correlation among the repeated measurements, which may not be representative of real data-analytic conditions. However, this does not affect the conclusions of this study because none of the investigated procedures assume a specific form for the covariance matrix of the repeated measurements. The simulation study focused only on conditions in which the outcome variables were positively correlated; in some data-analytic situations, outcome variables may be positively or negatively correlated, although positive correlations are more common (Blair et al., 1996). As well, all outcome variables had a constant variance. This is a reasonable assumption when the researcher is investigating multiple outcomes that represent subscales of a single instrument but may not be reasonable when the outcomes represent different instruments.
Given the significant computational resources required for implementation of the PSD procedure, only 500 permutations were conducted for each set of simulation conditions. This affects the overall precision of the FWR and power estimates but is unlikely to change our conclusions about the relative performance of the different procedures. Only conditions of multivariate normality were considered. Permutation procedures rest on the assumption of exchangeability and are, therefore, sensitive to departures from a normal distribution if the populations have different distributions. In such circumstances, a resampling-based multiple-testing procedure based on the bootstrap method is preferable to one based on the permutation method (Huang, Xu, Calian, & Hsu, 2006; Pollard & van der Laan, 2005).
One criticism of adjusting for multiple testing in studies that investigate several outcome variables is that, although the probability of making at least one Type I error is controlled to α, the probability of making a Type II error increases. Type II errors may be no less important than Type I errors when assessing differences between groups (Feise, 2002). A second criticism is that statistical significance of the tests, and accordingly decisions about treatment efficacy, are determined by the number of outcome variables or hypotheses included in the family. Decisions about the number of outcome variables to investigate may vary across studies. For example, one researcher may choose to limit the size of the family by using previous research to guide the selection of outcome variables. Another researcher might adopt an exploratory approach and investigate a larger set of outcomes. Different conclusions may result from these two approaches to selecting the number of
outcome variables. Furthermore, the family considered in this study was limited to tests of interaction effect hypotheses. However, the family could be defined as a set of contrasts among the RMs or groups on the L outcome variables.
Other approaches to multiple testing that could be extended to doubly multivariate designs were not investigated in this research. As noted in the content analysis results, one procedure begins with a multivariate test, such as Hotelling’s (1931) T², conducted at α and proceeds with follow-up tests on individual outcomes, each conducted at α, only if the omnibus hypothesis is rejected (Bird & Hadzi-Pavlovic, 1983; Bray & Maxwell, 1982; Ramsey, 1982). This procedure will control the FWR to α if the omnibus hypothesis is true. However, if the omnibus hypothesis is rejected, the second stage of this protected procedure does not guard against a Type I error for any part or parts of the null hypothesis that are true. Moreover, previous research has shown that this approach may lack power to detect treatment effects, particularly if the size of the effect is similar across all outcome variables. Step-down analysis, as described by Roy and Bargman (1958; see also Stevens, 1973), is another alternative. This procedure assumes that the outcome variables can be ranked in descending order of importance. In a step-down analysis, tests of group differences are conducted, each at α, with an analysis of covariance model, in which higher ranked measures serve as covariates for tests on lower ranked measures.
This article investigated only procedures that assign equal priority to each of the outcome variables in the analysis. In some studies, it may be possible to distinguish primary outcome variables, which are the main focus of a study, from secondary outcome variables, which are of lesser importance. This distinction between primary and secondary outcomes was made in a small number of articles that were investigated in the content analysis.
Fisher (1991) suggested assigning a weight to each outcome variable (i.e., $q_m$; $\sum_{m=1}^{L} q_m = 1$) and adopting a Bonferroni procedure in which the divisor is equal to $q_m L$ to control the FWR (see also Benjamini & Hochberg, 1997).
Finally, one criticism of multiple-testing procedures that control the FWR to α is that they can be conservative. Benjamini and Hochberg (1995) proposed another measure of error, the false discovery rate, which is the expected proportion of false rejections among all rejections. Procedures that control the false discovery rate to α also control the FWR in a weak sense. Benjamini and Hochberg proposed a multiple-testing procedure for independent tests that controls the false discovery rate; they demonstrated that it was more powerful than the Bonferroni procedure and the Hochberg (1988) procedure, particularly as the number of tests increased. Benjamini and Yekutieli (2001) showed that this procedure can also control the false discovery rate when the tests are positively correlated.
In summary, we do not recommend that the Bonferroni procedure be adopted to control the FWR to α when testing multiple outcomes in RM designs. Hochberg’s (1988) procedure can be recommended even though it does not account for the correlations among the outcomes because of its good power properties. The PSD procedure can also be recommended because it maintains control of the FWR and is substantially more powerful than the Bonferroni procedure. However, further research is warranted to assess these recommendations and compare the investigated procedures to other multiple-testing procedures under conditions involving different numbers of outcome variables and correlation structures and in the presence of nonnormality and/or covariance heterogeneity.
References Armitage, P., & Parmar, M. (1986). Some approaches to the problem of multiplicity in clinical trials. In Proceedings of the XIIIth International Biometric Conference (pp. 1–15). Seattle, WA: Biometric Society. Bartlett, M. S. (1939). A note on tests of significance in multivariate analysis. Proceedings of the Cambridge Philosophical Society, 35, 180 –185. Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, 57, 289 –300. Benjamini, Y., & Hochberg, Y. (1997). Multiple hypothesis testing with weights. Scandinavian Journal of Statistics, 24, 407– 418. Benjamini, Y., & Yekutieli, D. (2001). The control of the false discovery rate in multiple testing under dependency. Annals of Statistics, 29, 1165–1188. Bird, K. D., & Hadzi-Pavlovic, D. (1983). Simultaneous test procedures and the choice of a test statistic in MANOVA. Psychological Bulletin, 93, 167–178. Blair, R. C., Troendle, J. F., & Beck, R. W. (1996). Control of familywise errors in multiple endpoint assessments via stepwise permutation tests. Statistics in Medicine, 15, 1107–1121. Boik, R. J. (1988). The mixed model for multivariate repeated measures: Validity conditions and an approximate test. Psychometrika, 53, 469 – 486. Boik, R. J. (1991). Scheffe’s mixed model for multivariate repeated measures: A relative efficiency evaluation. Communication in Statistics, 20, 1233–1255. Bradley, J. V. (1978). Robustness? British Journal of Mathematical and Statistical Psychology, 31, 144 –152. Bray, J. H., & Maxwell, S. H. (1982). Analyzing and interpreting significant MANOVAs. Review of Educational Research, 52, 340 –367. Chen, B. E., Sakoda, L. C., Hsing, A. W., & Rosenberg, P. S. (2006). Resampling-based multiple testing procedures for genetic case-control association studies. Genetic Epidemiology, 30, 495–507. Dubey, S. D. (1994). 
Adjustment of p-values for multiplicities of intercorrelating symptoms. In R. C. Buncher & J. Y. Tsay (Eds.), Statistics in the pharmaceutical industry (2nd ed., pp. 513–527). New York: Marcel Dekker. Dunn, O. J. (1961). Multiple comparisons among means. Journal of the American Statistical Association, 56, 52– 64. Dutilleul, P. (1999). The MLE algorithm for the matrix normal distribution. Journal of Statistical Computation and Simulation, 64, 105–123. Dutilleul, P., & Pinel-Alloul, B. (1996). A doubly multivariate model for statistical analysis of spatio-temporal environmental data. Environmetrics, 7, 551–565. Everitt, B. S. (1995). The analysis of repeated measures: A practical review with examples. Statistician, 44, 113–135. Feise, R. J. (2002). Do multiple outcome measures require p-value adjustment? BMC Medical Research Methodology, 2, 1– 4. Fisher, L. D. (1991). A review of methods for handling multiple endpoints in clinical trials. In Proceedings of the ASA Biopharmaceutical Section (pp. 43– 46). Galecki, A. T. (1994). General class of covariance structures for two or more repeated factors in longitudinal data analysis. Communication in Statistics—Theory and Methods, 23, 3105–3119. Ge, Y., Dudoit, S., & Speed, T. P. (2003). Resampling-based multiple testing for microarray data analysis. Test, 12, 1–77. Guyatt, G., Mitchell, A., Irvine, E. J., Singer, J., Williams, N., Goodacre, R., & Tompkins, C. (1989). A new measure of health status for clinical trials in inflammatory bowel disease. Gastroenterology, 96, 804 – 810. Hochberg, Y. (1988). A sharper Bonferroni procedure for multiple tests of significance. Biometrika, 75, 800 – 802. Holland, B. S., & Copenhaver, M. D. (1987). An improved sequentially rejective Bonferroni test procedure. Biometrics, 43, 417– 423. Holland, J. M., Neimeyer, R. A., Currier, J. M., & Berman, J. S. (2007). The efficacy of personal construct therapy: A comprehensive review. Journal of Clinical Psychology, 63, 93–107.
Hotelling, H. (1931). The generalization of Student’s ratio. Annals of Mathematical Statistics, 2, 360–378.
Huang, Y., Xu, H., Calian, V., & Hsu, J. C. (2006). To permute or not to permute. Bioinformatics, 18, 2244–2248.
Keefe, F. J., Dunsmore, J., & Burnett, R. (1992). Behavioral and cognitive–behavioral approaches to chronic pain: Recent advances and future directions. Journal of Consulting and Clinical Psychology, 60, 528–536.
Keselman, H. J., Wilcox, R. R., & Lix, L. M. (2003). A generally robust approach to hypothesis testing in independent and correlated groups designs. Psychophysiology, 40, 586–596.
Lix, L. M., Deering, K., Fouladi, R. T., & Manivong, P. (2009). Comparing treatment and control groups on multiple outcomes: Robust procedures for testing a directional alternative hypothesis. Educational and Psychological Measurement, 69, 198–215.
Lix, L. M., Graff, L. A., Walker, J. R., Clara, I., Rawsthorne, P., Rogala, L., et al. (2008). Longitudinal study of quality of life and psychological functioning for active, fluctuating, and inactive disease patterns in inflammatory bowel disease. Inflammatory Bowel Disease, 14, 1575–1584.
Lix, L. M., & Hinds, A. (2004). Multivariate contrasts for repeated measures designs under assumption violations. Journal of Modern Applied Statistical Methods, 3, 333–344.
Looney, S. W., & Stanley, W. G. (1989). Exploratory repeated measures analysis for two or more groups. American Statistician, 43, 220–225.
McCain, N. L., Gray, D. P., Elswich, R. K., Jr., Robins, J. W., Tuck, I., Walter, J. M., et al. (2008). A randomized clinical trial of alternative stress management interventions in persons with HIV infection. Journal of Consulting and Clinical Psychology, 76, 431–441.
McCall, R. B., & Appelbaum, M. I. (1973). Bias in the analysis of repeated measures designs: Some alternative approaches. Child Development, 44, 401–415.
Pillai, K. C. S. (1955). Some new test criteria in multivariate analysis. Annals of Mathematical Statistics, 26, 117–121.
Pollard, K. S., & van der Laan, M. (2005). Resampling-based multiple testing: Asymptotic control of Type I error and applications to gene expression data. Journal of Statistical Planning and Inference, 125, 85–100.
Ramsey, P. H. (1982). Empirical power of procedures for comparing two groups on p variables. Journal of Educational Statistics, 7, 139–156.
Rom, D. (1990). A sequentially rejective test procedure based on modified Bonferroni inequality. Biometrika, 77, 663–665.
Roy, S. N., & Bargman, R. E. (1958). Tests of multiple independence and the associated confidence bounds. Annals of Mathematical Statistics, 29, 491–503.
Sankoh, A. J., Huque, M. F., & Dubey, S. D. (1997). Some comments on frequently used multiple endpoint adjustment methods in clinical trials. Statistics in Medicine, 16, 2529–2542.
SAS Institute. (2004a). SAS/IML user’s guide, Version 9. Cary, NC: SAS Institute.
SAS Institute. (2004b). SAS/STAT user’s guide, Version 9. Cary, NC: SAS Institute.
Stevens, J. P. (1973). Step-down analysis and simultaneous confidence intervals in MANOVA. Multivariate Behavioral Research, 8, 391–402.
Thomas, D. R. (1983). Univariate repeated measures techniques applied to multivariate data. Psychometrika, 48, 451–464.
Troendle, J. F. (1995). A stepwise resampling method of multiple hypothesis testing. Journal of the American Statistical Association, 90, 370–378.
Troendle, J. F. (1996). A permutational step-up method of testing multiple outcomes. Biometrics, 52, 846–859.
Wansbeek, T., & Verhees, J. (1990). The algebra of multimode factor analysis. Linear Algebra and Its Applications, 127, 631–639.
Ware, J. E., Jr., Snow, K. K., & Kosinski, M. (1993). SF-36 health survey: Manual and interpretation guide. Boston, MA: Health Institute, New England Medical Center.
Weisel, A., & Tur-Kaspa, H. (2002). Effects of labels and personal contact on teachers’ attitudes toward students with special needs. Exceptionality, 10, 1–10.
Weisz, J. R., Doss, A. J., & Hawley, K. M. (2005). Youth psychotherapy outcome research: A review and critique of the evidence base. Annual Review of Psychology, 56, 337–363.
Werner, K., Jansson, M., & Stoica, P. (2008). On estimation of covariance matrices with Kronecker product structure. IEEE Transactions on Signal Processing, 56, 478–491.
Westfall, P. H., Tobias, R. D., Rom, D., Wolfinger, R. D., & Hochberg, Y. (1999). Multiple comparisons and multiple tests using SAS system. Cary, NC: SAS Institute.
Westfall, P. H., & Young, S. S. (1993). Resampling-based multiple testing: Examples and methods for p-value adjustment. New York: Wiley.
Whittal, M. L., Robichaud, M., Thordarson, D. S., & McLean, P. D. (2008). Group and individual treatment of obsessive–compulsive disorder using cognitive therapy and exposure plus response prevention: A 2-year follow-up of two randomized trials. Journal of Consulting and Clinical Psychology, 76, 1003–1014.
Wilkinson, L., & the Task Force on Statistical Inference. (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54, 594–604.
Zhang, J., Quan, H., Ng, J., & Stepanavage, M. E. (1997). Some statistical methods for multiple endpoints in clinical trials. Controlled Clinical Trials, 18, 204–221.
Appendix

Method for Estimating $\Sigma_K$ and $\Sigma_L$ for the Armitage–Parmar–Dubey Procedure

Estimates of $\Sigma_K$ and $\Sigma_L$ can be obtained with least squares and maximum likelihood methods (Wansbeek & Verhees, 1990), with the latter being more efficient than the former (Werner, Jansson, & Stoica, 2008). Dutilleul (1999) proposed an algorithm (see also Dutilleul & Pinel-Alloul, 1996) that finds maximum likelihood estimates with the following system of equations,

$$\hat{\Sigma}_L = \frac{1}{KN} \sum_{j=1}^{2} \sum_{i=1}^{n_j} (W_{ij} - \bar{W})\, \hat{\Sigma}_K^{-1} (W_{ij} - \bar{W})^{T}, \tag{A1}$$

and

$$\hat{\Sigma}_K = \frac{1}{LN} \sum_{j=1}^{2} \sum_{i=1}^{n_j} (W_{ij} - \bar{W})^{T}\, \hat{\Sigma}_L^{-1} (W_{ij} - \bar{W}), \tag{A2}$$

where $W_{ij}$ is the $L \times K$ matrix obtained by reshaping $Y_{ij}$, and $\bar{W}$ is the matrix of means obtained by averaging across all such observation matrices. Then $R_L$, the correlation matrix for the $L$ outcomes, can be estimated from $\hat{\Sigma}_L$.
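Because (A1) and (A2) each depend on the other's result, they are typically solved by alternating between them until the estimates stop changing. The following Python/NumPy sketch illustrates one way this could be done; it is not the authors' implementation. The function names `flip_flop_mle` and `cov_to_corr`, the identity starting value, the stopping rule, and the simulated data are all assumptions made for illustration. Following the description above, deviations are taken from the mean matrix averaged over all observation matrices, so the double sum over groups and participants reduces to a single sum over the N stacked matrices.

```python
import numpy as np


def flip_flop_mle(W, max_iter=200, tol=1e-10):
    """Alternate between (A1) and (A2) until both estimates stabilize.

    W has shape (N, L, K): one L x K matrix per participant, groups stacked.
    Returns (sigma_L, sigma_K). The pair is identified only up to a scale
    factor, which does not affect the correlation matrix R_L.
    """
    N, L, K = W.shape
    D = W - W.mean(axis=0)      # deviations W_ij - W_bar
    sigma_L = np.eye(L)
    sigma_K = np.eye(K)         # assumed starting value
    for _ in range(max_iter):
        inv_K = np.linalg.inv(sigma_K)
        # (A1): sum over participants of D inv(Sigma_K) D^T, divided by K*N
        new_L = np.einsum("nlk,kj,nmj->lm", D, inv_K, D) / (K * N)
        inv_L = np.linalg.inv(new_L)
        # (A2): sum over participants of D^T inv(Sigma_L) D, divided by L*N
        new_K = np.einsum("nlk,lm,nmj->kj", D, inv_L, D) / (L * N)
        change = max(np.abs(new_L - sigma_L).max(),
                     np.abs(new_K - sigma_K).max())
        sigma_L, sigma_K = new_L, new_K
        if change < tol:
            break
    return sigma_L, sigma_K


def cov_to_corr(sigma):
    """Rescale a covariance matrix to a correlation matrix."""
    d = np.sqrt(np.diag(sigma))
    return sigma / np.outer(d, d)


# Illustrative data only: L = 3 outcomes, K = 4 occasions, N = 500 participants.
rng = np.random.default_rng(0)
L, K, N = 3, 4, 500
true_L = np.full((L, L), 0.5) + 0.5 * np.eye(L)                  # outcome correlations of .5
true_K = 0.6 ** np.abs(np.subtract.outer(np.arange(K), np.arange(K)))
A = np.linalg.cholesky(true_L)
B = np.linalg.cholesky(true_K)
W = A @ rng.standard_normal((N, L, K)) @ B.T                     # matrix-normal draws

sigma_L, sigma_K = flip_flop_mle(W)
R_L = cov_to_corr(sigma_L)
print(np.round(R_L, 2))
```

Only $R_L$ is needed downstream, so the arbitrary scaling between the two covariance estimates is harmless here. In practice N must be large enough relative to L and K for both matrix inverses to exist.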
Received September 26, 2008
Revision received August 4, 2009
Accepted August 18, 2009