Multiple Comparisons

Multiplicity considerations arise in experimental research when it is desired to make inferences about several aspects of a problem simultaneously, while controlling some aspect of the frequency properties of the statistical procedure (see Simultaneous Inference). For example, observational units may generate multivariate responses and it may be of interest to examine covariate effects on each response (see Multiplicity in Clinical Trials). Alternatively, interest may lie in carrying out multiple analyses on the basis of subgroups of patients defined a priori. Repeated significance tests, often carried out to ensure early detection of effective treatments in clinical trials, also raise multiplicity issues (see Data and Safety Monitoring). In all of these cases, several tests of significance (see Hypothesis Testing) are typically carried out. If carried out naively, the probability of making one or more false positive conclusions is typically higher than expected. This is the so-called multiplicity problem of statistical inference. Here we focus on issues pertaining to the classical multiple comparisons problem arising in the comparison of several populations with respect to a single response variable. For other examples of multiplicity, see Simultaneous Inference.

Consider an experiment with the objective of comparing the means of I populations. Suppose a sample of ni subjects is available for study from the ith population. Let n· = n1 + · · · + nI, and let Yij denote the response variable for the jth subject in the ith sample. We further suppose the responses are generated according to the linear model Yij = µi + Eij, where µi is the mean response for the ith population, Eij ~ N(0, σ²) are independently distributed, and σ² is a common variance parameter reflecting the extent of the sampling variability, j = 1, . . . , ni, i = 1, . . . , I. Here and throughout, we make the distinction between random variables and their realized values by using upper and lower case letters, respectively. Thus, if yij denotes a realization of Yij, then ȳi = Σj yij/ni denotes the mean of the ith sample, and s² the pooled estimate of the common variance (see Analysis of Variance).

Comparisons of the population means µ1, . . . , µI may be based on tests of hypotheses or interval estimation. We consider issues pertaining to hypothesis tests, and point out that directly analogous issues
arise in the context of interval estimation; specific comments on this follow. In this article we emphasize applications arising in clinical trials and related biopharmaceutical experiments in which the I samples consist of individuals randomized to one of I groups undergoing different treatment regimens. In what follows we use related terminology and will typically make reference to treatment comparisons.

There is a variety of contexts in which one might be interested in making inferences about differences in the population means, with the specific features of the problem leading to particular analysis strategies. The structure of the problem described above is that of a one-way analysis of variance (ANOVA), suggesting that if the null hypothesis were the equality of all I means, a corresponding statistic following the F distribution on (I − 1, n· − I) degrees of freedom would be a natural choice. In many contexts, however, such an approach provides inadequate insight, since rejection of the null hypothesis does not furnish information regarding the nature of the treatment differences. In general, multiple comparison procedures are directed at facilitating more detailed analyses to gain such insight, while adjusting for the multiplicity by controlling certain frequency properties of the testing procedure.

Note, however, that there are many contexts in which multiple treatment comparisons can be made without the need to adopt procedures that adjust for multiplicity. Cox [7] has pointed out that probabilities regarding the simultaneous correctness of many statements may not always be of direct relevance. Dunnett [13] points out that in situations where multiple experimental treatments are used in the same study to maximize efficiency in the use of resources, rather than for the purpose of making joint inferences, it is reasonable to make treatment-to-reference-group (control) comparisons as if the data had been collected from different studies. Examples of such scenarios include Finney [19] and Redman & Dunnett [42]. This approach is consistent with the views put forth by Cook & Farewell [6], who argue that in more general contexts in which multiplicities arise, multiplicity adjustments are often adopted unnecessarily. Cook & Farewell suggest that provided a limited number of well-defined questions are posed at the design stage, and these questions relate to different features or different treatments under study, then the case can be made for avoiding multiple
testing procedures. As the qualified nature of these statements suggests, it remains difficult to characterize clearly the situations in which multiplicity adjustments are required and those in which they are not, and so it is natural that some debate in this area will continue. In what follows we discuss and contrast strategies that are generally appropriate if there is genuine concern about the need for making adjustments for multiplicity.
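As a concrete illustration of the setting described above, the sketch below simulates data from the one-way layout Yij = µi + Eij and computes the overall F test; the group means, sample sizes, and σ are arbitrary illustrative choices, and NumPy and SciPy are assumed to be available.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Illustrative (assumed) design: I = 4 groups, equal sample sizes, common sigma
mu = [10.0, 10.0, 10.0, 11.5]     # hypothetical population means mu_i
n_i = [15, 15, 15, 15]            # sample sizes n_i
sigma = 2.0                       # common standard deviation

# Simulate Y_ij = mu_i + E_ij with E_ij ~ N(0, sigma^2)
groups = [rng.normal(m, sigma, size=n) for m, n in zip(mu, n_i)]

# Overall one-way ANOVA F test of H0: mu_1 = ... = mu_I
f_stat, p_value = stats.f_oneway(*groups)
print(f"F = {f_stat:.3f}, p = {p_value:.4f}")

# Pooled variance estimate s^2 on n. - I degrees of freedom
ybar = [g.mean() for g in groups]
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
s2 = ss_within / (sum(n_i) - len(groups))
print("group means:", np.round(ybar, 2), " pooled s^2 =", round(s2, 3))
```

Rejection of the overall null hypothesis here indicates only that the means are not all equal; the multiple comparison procedures discussed below are aimed at determining which means differ.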
Overview and Terminology

Background

For the well-known case in which I = 2, suppose the null and alternative hypotheses are H0: µ1 = µ2 and Ha: µ1 ≠ µ2, respectively. Hypothesis testing procedures are typically formulated by specification of a discrepancy measure, a many-to-one function of the random response variables which is a stochastic measure of the “distance” between the observed responses and what would be anticipated under the null hypothesis. A discrepancy measure must have a known distribution for specified values of the parameters of interest. Upon collecting the data, one may compute a realized value for the discrepancy measure, which is typically referred to as a test statistic. The P value of the test is the probability, under the null hypothesis, of observing a realized value of the discrepancy measure as extreme as or more extreme than that observed. Thus, small P values indicate that the data are inconsistent with the null hypothesis. The significance level, denoted by α, is a specified threshold value such that if the observed P value is ≤ α, H0 is rejected in favor of Ha (see Level of a Test). Corresponding to a given threshold significance level is a critical value, c, such that test statistics larger than or equal to c lead to rejection of the null hypothesis.

A type I error is said to be committed if the null hypothesis is rejected when it is in fact true (see Hypothesis Testing). The type I error rate, a decision-theoretic notion introduced by Neyman & Pearson [40], corresponds to the rate with which this error would be made in a hypothetically infinite population of repetitions of the trial. A test procedure is said to control the type I error rate at α if the probability of a type I error is less than or equal to α. The type I error rate may be interpreted probabilistically and hence one may write Pr(reject H0 | H0
is true) ≤ α (the equality will typically hold when the discrepancy measure has a continuous distribution and the null hypothesis specifies a single point in the parameter space). For the case in which I > 2, it might be tempting to carry out multiple hypothesis tests in an effort to learn more about the nature of any potential treatment differences. The principal difficulty with this approach is that multiple hypothesis tests at a common significance level α will result in a probability of committing one or more type I errors that may be substantially larger than α. Structure and rigor are added to the multiple testing procedures to ensure that the type I error rate properties are known, or at least controlled.
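To make the inflation described above concrete, if K independent tests of true null hypotheses are each carried out at level α, the probability of at least one false rejection is 1 − (1 − α)^K, which grows rapidly with K. A minimal arithmetic sketch:

```python
alpha = 0.05
for K in (1, 2, 5, 10, 20, 50):
    fwe = 1 - (1 - alpha) ** K   # P(at least one type I error), independent tests
    print(f"K = {K:2d}: probability of one or more false rejections = {fwe:.3f}")
# K = 1: 0.050, K = 10: about 0.401, K = 50: about 0.923
```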
Formulation and Terminology of Multiple Comparison Procedures

As a first step in formalizing multiple comparison procedures, Hochberg & Tamhane [26] define a family as a “collection of inferences for which it is meaningful to take into account some combined measure of error”. Tamhane [50] further states that there should be a “contextual relatedness” for inferences grouped into a common family; tests pertaining to a given family are directed at investigating the aspect of the treatments that it addresses. Note that there may be more than one family of hypotheses in a given experiment, with each family addressing a different research question. Furthermore, since these questions may be interrelated, the families might not be disjoint (i.e. they may share one or more component hypotheses). For the purposes of this discussion it is sufficient to consider a single family, and we do so for the remainder of this article. For concreteness we consider a family as consisting of a collection of null and alternative hypotheses {(Hk0, Hka), k = 1, 2, . . . , K}, where K denotes the total number of hypotheses in the family.

The familywise error rate (FWE) is defined as the probability of making one or more false positive conclusions over all hypothesis tests in a particular family. Control of the FWE is appropriate if one cannot tolerate any type I error in the family no matter how many of the K null hypotheses are true. A procedure is said to have strong control of the FWE if the probability of making at least one type I error over all hypothesis tests of the family is at most α, regardless of how many component null hypotheses may be true; weak control of the FWE at α is achieved
if this type I error rate is guaranteed to be at most α only when all null hypotheses are true. Typically multiple comparison procedures control the FWE at α, thus satisfying Pr(at least one null hypothesis is falsely rejected) ≤ α. Procedures of this sort clearly also control the type I error rates of any subset of the family, including the component tests, while guaranteeing that the FWE does not exceed a specified level.

Another term often used is the per-comparison error rate (PCE). Here we restrict consideration to true null hypotheses in the family, and define the PCE as the expected number of false positive conclusions divided by the number of true null hypotheses. This error rate therefore corresponds to the usual type I error rate for individual hypotheses that are tested without any adjustment for multiplicity.

Suppose that m (which is unknown) hypotheses are true and K − m are false. Denote by T the number of true hypotheses that are rejected (false positives) and by F the number of false hypotheses that are rejected (true positives). T and F are random variables. Benjamini & Hochberg [3] defined the false discovery rate (FDR) as the expected value of T/(T + F). By comparison, the PCE is the expected value of T/m and FWE = Pr(T > 0). They proposed that a multiple testing procedure control FDR ≤ α, instead of FWE ≤ α. They showed that the FDR is equivalent to the FWE when m = K (all hypotheses are true); thus, it provides weak control of the FWE. When several hypotheses are false, it is less conservative than controlling the FWE and may provide a useful concept for situations where strict control of the FWE is not needed.

Multiple comparison procedures may be classified as single-step or stepwise procedures. In a single-step procedure, multiple tests are carried out using the same critical value for each component test. Procedures that involve carrying out multiple tests in sequence, using critical values which may be unequal, are called stepwise multiple testing procedures. For such stepwise procedures it is convenient to arrange the test statistics in ascending order according to their significance levels, and to arrange the component hypotheses conformably. Single-step procedures are attractive in some respects since they are simpler to apply and they have direct connections to simultaneous confidence intervals. They generally have lower power than stepwise procedures, however, and
so are not desirable for the purposes of hypothesis testing. Stepwise procedures for multiple comparisons may be further classified as step-down or step-up procedures. In step-down procedures, formal tests are carried out in a stepwise fashion starting with the most extreme outcome (i.e. the most significant test statistic). Testing proceeds to the hypothesis corresponding to the next most extreme outcome only upon rejection of the current hypothesis. If the current null hypothesis is not rejected, then all subsequent null hypotheses (i.e. those corresponding to the test statistics with the less extreme outcomes) are not rejected. Thus, if the first test statistic does not exceed its corresponding critical value, then the testing terminates with failure to reject any null hypothesis. Critical values are derived to ensure control of the FWE. In step-up testing procedures, the testing begins with the statistic corresponding to the least significant outcome. Testing continues, with statistics corresponding to progressively more extreme outcomes being examined, until a null hypothesis is rejected. At this point, all null hypotheses corresponding to the more extreme test statistics are also rejected. As in the step-down procedures, appropriate critical values are determined to ensure control of the FWE. Examples of single-step, step-down, and step-up multiple testing procedures are provided in subsequent sections.

In many cases an overall null hypothesis may be satisfied if and only if several less restrictive hypotheses are satisfied. Let H0 = H10 ∩ H20 ∩ · · · ∩ HK0 be the overall null hypothesis that all Hk0 are true, and Ha = H1a ∪ H2a ∪ · · · ∪ HKa the corresponding alternative hypothesis that at least one Hk0 is false. Testing H0 against Ha is known as a union–intersection problem, due to Roy [44]. Denoting the test statistic for Hk by tk for k = 1, . . . , K, the test statistic for H0 is max(t1, . . . , tK). The critical value c is determined so that the type I error of the test is α, which requires that c be chosen to satisfy the following K-variate probability requirement:

Pr(T1 < c, . . . , TK < c | H0) = 1 − α.

The solution, c = cK say, will be larger than the α-point of the univariate statistic, the difference representing an adjustment for the multiplicity. For the intersection–union problem, where H0 = H10 ∪ H20 ∪ · · · ∪ HK0 and Ha = H1a ∩ H2a ∩ · · · ∩ HKa, the test statistic is min(t1, . . . , tK). To determine the value of the critical
constant in this case, Berger [4] demonstrated that the α-point of the univariate statistic is the correct value to use, so that, in effect, no multiplicity adjustment is needed.

If {Hk0, k = 1, . . . , K} denotes a family of null hypotheses, then the closure of this family is formed by considering all intersections HS = ∩k∈S Hk0, where S ⊆ {1, 2, . . . , K}. A closed testing procedure operates by rejecting any HS if and only if every HR with R ⊇ S is rejected by an α-level test. Marcus et al. [37] show that this strategy controls the FWE.
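The closure principle can be illustrated with a small numerical sketch. Here the local α-level test of each intersection hypothesis is taken, as an arbitrary choice, to be a Bonferroni test, and the three P values are hypothetical; an elementary hypothesis Hk0 is rejected only if every intersection containing it is rejected.

```python
from itertools import combinations

alpha = 0.05
p = {1: 0.004, 2: 0.030, 3: 0.120}   # hypothetical per-hypothesis P values
K = sorted(p)

def reject_intersection(S):
    # Bonferroni local test of H_S: reject if min_{k in S} p_k <= alpha / |S|
    return min(p[k] for k in S) <= alpha / len(S)

for k in K:
    # H_k0 is rejected iff every intersection hypothesis containing k is rejected
    supersets = [S for r in range(1, len(K) + 1)
                 for S in combinations(K, r) if k in S]
    rejected = all(reject_intersection(S) for S in supersets)
    print(f"H{k}0: {'reject' if rejected else 'do not reject'}")
```

With Bonferroni local tests, the closed procedure reproduces Holm's step-down procedure discussed later in this article.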
Historical Remarks

Suppose the null hypothesis consists of common means (H0: µ1 = µ2 = · · · = µI) and the alternative is that at least one mean is different. Upon application of standard ANOVA methods and rejection of the null hypothesis, it is natural to want to examine the nature of the apparent treatment differences. Fisher [20] was among the first to propose a formal multiple comparison procedure with a view to investigating potential treatment differences following a standard one-way analysis of variance. Fisher's protected least significant difference procedure operates as follows. If the F test from the one-way analysis of variance is carried out with a type I error rate α, and if it fails to lead to rejection of the null hypothesis of common means, the procedure terminates. If H0: µ1 = µ2 = · · · = µI is rejected, then all pairwise tests are carried out with a PCE of α for each. This procedure can be shown to have only weak control of the FWE. An alternative, suggested by Fisher, is to proceed directly to the K specific treatment comparisons of interest and carry out these tests with a PCE of α/K. This is referred to as a Bonferroni adjustment to the per-comparison error rates; it maintains strong control of the FWE at α. It is well known to be conservative, particularly for highly correlated test statistics [41]. The Bonferroni procedure is an example of a single-step multiple test procedure since each test is carried out with the same critical value, regardless of the outcomes of any of the other tests.

Scheffé [46] developed an approach for simultaneous inference which follows naturally from the one-way analysis of variance. In particular, if one considers all contrasts of the form ℓ1µ1 + · · · + ℓIµI, where ℓ1 + · · · + ℓI = 0, Scheffé's multiple comparison procedure generates a set of tests (confidence intervals)
that have a FWE of at most α (a simultaneous coverage probability of at least 1 − α). Note that with the appropriate choice of coefficients these contrasts may correspond to pairwise comparisons. Finally, we note that if the F test leads to rejection of the null hypothesis of common means, then there exists a vector of coefficients ℓ = (ℓ1, . . . , ℓI) such that a test of H0: ℓ1µ1 + · · · + ℓIµI = 0 is rejected using Scheffé's procedure.

Several alternative strategies for multiple testing were proposed on the basis of Studentized range statistics. If ni = n, i = 1, . . . , I, then the Studentized range distribution is the probability distribution of the range (max{Ȳi} − min{Ȳi}) of I independent sample means, each from a standard normal distribution, divided by the pooled estimate of the standard error of a sample mean (i.e. the square root of a chi-square distributed random variable scaled by a factor [I(n − 1)]^{-1}). This Studentized range distribution is indexed by [I, I(n − 1)], and we let q^α_{I,I(n−1)} denote the corresponding upper 100α% point. The percentage points of this Studentized range distribution are provided in Harter [22] and many texts, and may be used to carry out tests or construct simultaneous confidence intervals for differences in means. Tukey [52] indicated that one could carry out the I(I − 1)/2 pairwise tests by comparing z_{ii′} = n^{1/2}(ȳi − ȳi′)/s to the critical value q^α_{I,I(n−1)}; that is, if |z_{ii′}| > q^α_{I,I(n−1)}, then the means µi and µi′ can be inferred to be different with the FWE controlled at α, i ≠ i′ = 1, . . . , I.

So-called multiple range tests have also been developed by Newman [39], Keuls [34], and Duncan [9, 10]; the former two authors independently proposed the same testing procedure, which is often referred to as the Newman–Keuls test. Paraphrasing Miller [38], multiple range tests tend to declare two means in a set of I means significantly different provided the range of each and every subset containing the two means is significant according to an αg level Studentized range test, where g is the number of means in the subset at hand. The Newman–Keuls test and Duncan's test differ in the way in which αg is determined. For the Newman–Keuls test, αg = α for g = 2, 3, . . . , I, whereas αg = 1 − (1 − α)^{g−1} in Duncan's test. Miller [38] provides a good illustration of the application of these two multiple range tests and argues that, while Duncan's test leads to larger αg for larger group sizes (and hence greater power for detecting treatment effects), this is at the
expense of increasing the rate of false positive conclusions arising from multiple tests. Miller therefore favors the Newman–Keuls approach over Duncan's. However, it is important to note that, in their original form, neither Duncan's nor the Newman–Keuls procedure controls the FWE; various modifications of these procedures have been proposed but are beyond the scope of this review (see [50] for details).

Generalizations to facilitate applications to the unbalanced one-way layout have been proposed by several authors [10, 11, 35, 52]. Dunnett [12] conducted a detailed simulation study designed to investigate the empirical type I error rates of various procedures for this context. He found that the preferred method for pairwise comparisons is the Tukey–Kramer procedure, which compares the statistic (ȳi − ȳi′)/[s(1/ni + 1/ni′)^{1/2}] with q^α_{I,ν}/√2, where ν denotes the error degrees of freedom. This was found to be slightly conservative, but less so than other procedures developed for this same context [21, 24, 49]. A proof of the conservativeness of the Tukey–Kramer method for the one-way model was obtained by Hayter [23]. Generalizations based on the Studentized augmented range distribution may be utilized if it is desired to test not only for the equality of all means, but also whether all means share a specific value. Any pair of means for which the simultaneous confidence interval does not contain the null value of zero, say, may be declared to be significantly different; with a simultaneous coverage probability of 1 − α, the FWE of this procedure is controlled at α.
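A sketch of the Tukey–Kramer pairwise comparisons for a hypothetical unbalanced one-way layout follows; it assumes a SciPy version that provides scipy.stats.studentized_range for the critical point q^α_{I,ν}. Each pair is declared different when |ȳi − ȳi′| exceeds q^α_{I,ν} · s[(1/ni + 1/ni′)/2]^{1/2}.

```python
import numpy as np
from scipy.stats import studentized_range
from itertools import combinations

rng = np.random.default_rng(2)
alpha = 0.05

# Hypothetical unbalanced one-way layout with I = 3 groups
groups = [rng.normal(10.0, 2.0, 12),
          rng.normal(10.5, 2.0, 15),
          rng.normal(12.0, 2.0, 10)]
I = len(groups)
n = [len(g) for g in groups]
ybar = [g.mean() for g in groups]

# Pooled variance and error degrees of freedom
nu = sum(n) - I
s2 = sum(((g - g.mean()) ** 2).sum() for g in groups) / nu

# Upper alpha point of the Studentized range with parameters (I, nu)
q_crit = studentized_range.ppf(1 - alpha, I, nu)

for i, j in combinations(range(I), 2):
    se = np.sqrt(s2 * 0.5 * (1 / n[i] + 1 / n[j]))   # Tukey-Kramer standard error
    diff = abs(ybar[i] - ybar[j])
    print(f"groups {i+1} vs {j+1}: |diff| = {diff:.2f}, "
          f"critical difference = {q_crit * se:.2f}, "
          f"{'different' if diff > q_crit * se else 'not significant'}")
```

Recent SciPy releases also provide scipy.stats.tukey_hsd, which packages essentially this computation.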
More Recent Developments in Multiple Comparisons
Single-Step Procedures Using P Values

The Bonferroni procedure, which rejects any Hk0 whose P value is ≤ α/K, is perhaps the most widely known single-step testing procedure. It is attractive in its generality, but improvements have been made to generate procedures that are less conservative. Šidák's inequality leads to the less conservative per-test significance level 1 − (1 − α)^{1/K} instead of α/K [47]. The gain in power from this approach may be quite minimal, however, for cases when K ≤ 10. Simes [48] presents a multiple testing procedure applicable when H0 = H10 ∩ · · · ∩ HK0 and Ha = H1a ∪ · · · ∪ HKa. Let P(1) ≥ P(2) ≥ · · · ≥ P(K) be the P values arising from the K component tests, ordered from largest to smallest. Then H0 is rejected if P(k) ≤ (K − k + 1)α/K for some k. This test has higher power than the Bonferroni procedure for testing H0. The FWE is shown to be controlled at α by arguments pertaining to order statistics of independent uniform (0, 1) random variables and is valid under the assumption that the component discrepancy measures are independent [48]. Correlations among the discrepancy measures can lead to serious inflation of the FWE [28]. This is an example of a union–intersection test procedure.

A number of more computationally intensive approaches have also been proposed. Brown & Fears [5] focused on binomial data and unadjusted P values arising from unconditional or conditional analyses. A permutation distribution (with fixed marginal frequencies) was then used to compute the adjusted P value corresponding to the probability (based on the permutation distribution) of realizing a P value smaller than that observed (see Randomization Tests). Westfall [53] proposed instead that one resample with replacement from the observed data set and determine whether the minimum P value of the new data set is, or is not, less than or equal to that observed in the sample. The frequency with which it is less is the adjusted P value for the comparison of interest. An advantage of this approach, suggested by Westfall & Young [54], is that it effectively addresses the correlation of the test statistics, particularly for multivariate outcomes, or when the comparisons of interest are all against a single control arm.

Stepwise Procedures Using P Values
Holm [27] presents a step-down multiple testing procedure, which he refers to as a sequentially rejective Bonferroni test, based on the ordered P values (see Multiple Endpoints, P Level Procedures). Again, let P(1) ≥ P(2) ≥ · · · ≥ P(K) be the ordered P values, and denote by H0^(k) the null hypothesis corresponding to P(k), 1 ≤ k ≤ K. The procedure compares the ordered P values with the sequence α, α/2, . . . , α/K, starting with P(K) (the smallest), then P(K−1), and so on, continuing as long as P(k) ≤ α/k, in which case H0^(k) is rejected and the next ordered P value is examined; the first time P(k) > α/k, testing stops and the remaining hypotheses are accepted (not rejected).
Holm [27] points out that since the thresholds used to assess the strength of evidence against the null hypotheses are larger for all but the minimum P value, P(K), this approach has a higher probability of rejecting false null hypotheses than the standard Bonferroni procedure. This approach is attractive in that, as with the standard Bonferroni procedure, it is widely applicable.

Step-up procedures have also been proposed as improvements to the standard Bonferroni approach. Under Hochberg's [25] step-up procedure, all K P values are ordered as before, and testing begins by comparing the largest P value, P(1), with α, then P(2) with α/2, and so on, continuing as long as the corresponding hypothesis is not rejected. The first time a rejection occurs, testing stops and all remaining hypotheses are rejected as well. In other words, one does not reject until P(k) ≤ α/k, at which point H0^(k), . . . , H0^(K) are all rejected. Note that this procedure employs the same critical values as Holm's step-down procedure, and any hypothesis rejected by Holm's procedure is also rejected by Hochberg's procedure; hence the latter is at least as powerful.

In Hommel's [29, 30] step-up procedure, one searches for the largest m (1 ≤ m ≤ K) such that

P(k) > (m − k + 1)α/m,   for k = 1, . . . , m.

If such an m exists, then any hypothesis with a P value ≤ α/m is rejected. If such an m does not exist, then all hypotheses are rejected. Hommel's procedure coincides with Hochberg's for its first two steps, but after that it may reject hypotheses in addition to those rejected by Hochberg's procedure. Both Hochberg's and Hommel's procedures were developed by applying the closure principle to the procedure of Simes [48]. Hommel's procedure has slightly greater power, but is more complicated to apply. Another procedure that has slightly greater power, but is also more complicated to apply, was given by Rom [43].
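The Holm and Hochberg adjustments are easily implemented directly from a vector of unadjusted P values, as in the sketch below; the P values are hypothetical, and libraries such as statsmodels (multipletests) provide the same methods.

```python
import numpy as np

def holm(pvals, alpha=0.05):
    """Holm step-down: returns a boolean rejection decision for each hypothesis."""
    p = np.asarray(pvals)
    order = np.argsort(p)                 # ascending: smallest P value first
    reject = np.zeros(len(p), dtype=bool)
    for step, idx in enumerate(order):
        if p[idx] <= alpha / (len(p) - step):
            reject[idx] = True            # reject and move to the next P value
        else:
            break                         # stop; remaining hypotheses not rejected
    return reject

def hochberg(pvals, alpha=0.05):
    """Hochberg step-up: same critical values, examined from the largest P value down."""
    p = np.asarray(pvals)
    order = np.argsort(p)[::-1]           # descending: largest P value first
    reject = np.zeros(len(p), dtype=bool)
    for step, idx in enumerate(order):
        if p[idx] <= alpha / (step + 1):
            # first rejection: reject this and all hypotheses with smaller P values
            reject[order[step:]] = True
            break
    return reject

p = [0.011, 0.020, 0.035, 0.050]          # hypothetical unadjusted P values
print("Holm    :", holm(p))
print("Hochberg:", hochberg(p))
```

For these P values Holm rejects only the hypothesis with the smallest P value, whereas Hochberg rejects all four, illustrating that the step-up procedure is at least as powerful.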
Stepwise Procedures Using Normal Theory

In multitreatment trials, the individual hypotheses Hk, k = 1, . . . , K, are usually formulated in terms of parameters θk which are contrasts in the population means. Estimates θ̂k of these are determined from the data. Under the linear model assumptions stated at
the beginning of this article, it follows that the θ̂k are normally distributed with E(θ̂k) = θk, var(θ̂k) = τk²σ², and corr(θ̂i, θ̂j) = ρij, where the τk² and ρij are known constants which depend upon the design (e.g. on the sample sizes of the treatment groups). To test the hypothesis Hk that θk has a specified value, 0 say, the test statistic tk = θ̂k/(τk²s²)^{1/2} is used, where s² is an estimate of σ². Under the normality and homogeneous variance assumptions, the tk are Student t statistics and the joint distribution of the corresponding random variables is multivariate t. This provides the underlying distribution theory for testing the hypotheses.

In stepwise testing the test statistics are ordered according to their P values as previously. Denote them by t(1), t(2), . . . , t(K), where t(1) is the least significant and t(K) the most significant test statistic. These are compared in sequence with a set of critical constants c1 < · · · < cK, determined so that the FWE is ≤ α. In step-down testing, we start with t(K), then go to t(K−1), and so on. We continue to the next test in the sequence whenever we find t(k) ≥ ck, rejecting the corresponding hypothesis, and we stop the first time t(k) < ck, accepting (not rejecting) any remaining hypotheses. In step-up testing we start with t(1), then go to t(2), and so on. We continue to the next test in the sequence whenever we find t(k) < ck, accepting (not rejecting) the corresponding hypothesis, and we stop the first time t(k) ≥ ck, rejecting any remaining hypotheses.

We illustrate the determination of the critical constants for the above step-down and step-up testing procedures for the case where one of the I treatment groups, say the Ith group, is to be compared with each of the other groups. Then the contrasts of interest are θk = µk − µI, for k = 1, . . . , I − 1. Suppose, for simplicity, that each group has the same sample size, n, except for the Ith group, which has sample size n0. Then let Yij (yij) denote the random (realized) response for the jth subject and ȳi the sample mean in group i, i = 1, . . . , I; s² denotes the pooled estimator of the common variance based on ν = (I − 1)(n − 1) + n0 − 1 degrees of freedom (df). Denote the random variables corresponding to the ordered t statistics t(1), . . . , t(K) by T1, . . . , TK, respectively. Then the critical constants for the step-down case with two-sided alternative hypotheses are
obtained by solving the equations

Pr(−ck < T1 < ck, . . . , −ck < Tk < ck) = 1 − α,   k = 1, . . . , K = I − 1,

where T1, . . . , Tk have a k-variate central t distribution with ν degrees of freedom and common correlation ρ = n/(n + n0) under the null hypothesis H0 = H0^(1) ∩ · · · ∩ H0^(k). Note that since the critical value cK is the same as the one derived for the single-step procedure, the step-down method may be thought of as a natural extension of the single-step procedure. Tables of the critical values are readily available for the case in which the sample sizes for all but the reference group are the same (ni = n, i = 1, . . . , I − 1), and hence the discrepancy measures are equally correlated (see [2, 26]). In the case of unequal group sizes, good approximations to the critical values can be obtained by using the average correlation and interpolating from published tables. Alternatively, with a known correlation matrix, the critical values may be found by direct multivariate, or recursive, numerical integration [14].

For the step-up case, the values of the constants c1, c2, . . . , ck are determined by solving

Pr(−c1 < T(1) < c1, . . . , −ck < T(k) < ck) = 1 − α,   k = 1, 2, . . . , K,

where T(1), . . . , T(k) are the ordered values of the random variables T1, T2, . . . , Tk associated with the first k t statistics in order of significance. Note that the solutions here must be obtained recursively, starting with k = 1, then k = 2, and so on, since in order to solve for any ck it is necessary to know the values of c1, . . . , ck−1. Also, for both step-down and step-up testing, the value of c1 is the α point of univariate Student's t. For k > 1, the constant ck for step-up testing is slightly larger than the corresponding ck for step-down testing. The first step of the step-up testing procedure corresponds to the Laska & Meisner [36] MIN test, which tests the intersection–union problem H0 = H0^(1) ∪ · · · ∪ H0^(K) vs. Ha = Ha^(1) ∩ · · · ∩ Ha^(K); hence, the step-up procedure may be thought of as a natural extension of this test. Limited tables of the step-up constants are given in Dunnett & Tamhane [15, 16] for the case of equal correlations; the case of unequal correlations is considered in Dunnett & Tamhane [17]. The step-down testing procedure is the normal theory analog
of Holm’s P value procedure, and the step-up testing procedure is the normal theory analog of Hochberg’s P value procedure. The advantage of the normal theory procedures is that they utilize the correlation structure of the parameter estimates and have higher power when the normality and homogeneous variance assumptions hold. However, the P value procedures do not depend on such assumptions and can be used when they do not hold.
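The critical constants for the comparisons-with-a-control layout can also be approximated by straightforward Monte Carlo simulation rather than tables or numerical integration. The sketch below generates equicorrelated t statistics with ρ = n/(n + n0) and ν error degrees of freedom, and takes ck as the 1 − α quantile of max(|T1|, . . . , |Tk|); the sample sizes, α, and simulation size are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
alpha, K = 0.05, 3            # K = I - 1 treatment-vs-control comparisons
n, n0 = 20, 20                # per-group and control sample sizes (hypothetical)
nu = K * (n - 1) + n0 - 1     # error degrees of freedom
rho = n / (n + n0)            # common correlation of the contrast estimates

# Simulate equicorrelated multivariate t: T_k = (sqrt(rho)*Z0 + sqrt(1-rho)*X_k) / S
B = 200_000
Z0 = rng.standard_normal(B)[:, None]
X = rng.standard_normal((B, K))
S = np.sqrt(rng.chisquare(nu, size=B) / nu)[:, None]
T = (np.sqrt(rho) * Z0 + np.sqrt(1 - rho) * X) / S

# c_k = (1 - alpha) quantile of max(|T_1|, ..., |T_k|): two-sided step-down constants
for k in range(1, K + 1):
    c_k = np.quantile(np.abs(T[:, :k]).max(axis=1), 1 - alpha)
    print(f"c_{k} is approximately {c_k:.3f}")
# c_1 is close to the two-sided alpha point of Student's t with nu df,
# and c_K matches the single-step (Dunnett-type) critical value.
```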
Comparisons with the Best Treatment

Now consider the case in which there is no specific reference group of interest and let µ(1) ≤ · · · ≤ µ(I) denote the I ordered population means, where we assume that larger values of µi correspond to preferred treatments. Since the means themselves are unknown, so too is the appropriate ordering given above. Nevertheless, one can conduct inference on the quantities µ(I) − µi, the difference between the mean response of the unknown “best” treatment and that of the ith treatment group. Hsu [31] derives a method of constructing simultaneous one-sided upper confidence intervals for the µ(I) − µi, i = 1, . . . , I, and Hsu [32] extends these methods to two-sided intervals. For simplicity we focus on the case of common interest, namely where σ is unknown and upper bounds on µ(I) − µi are of main interest. Hsu [31] shows that if U1, . . . , UI are independent and identically distributed standard normal random variables, S is an independent (χ²ν/ν)^{1/2} random variable, ni = n, i = 1, . . . , I, ν = I(n − 1), and we let d^α_{I,ν} be the constant such that

Pr(UI > Ui − d^α_{I,ν}S, i = 1, . . . , I − 1) = 1 − α,

then a set of 100(1 − α)% simultaneous confidence intervals for the µ(I) − µi is given by [0, Di], i = 1, . . . , I, where

Di = max{ max_{j≠i} ȳj − ȳi + d^α_{I,ν} s/√n, 0 }.

The constant d^α_{I,ν} is the solution to

∫_0^∞ ∫_{−∞}^∞ Φ^{I−1}(u + d^α_{I,ν}s) dΦ(u) dΨν(s) = 1 − α,

where Φ(·) and Ψν(·) are the distribution functions of a standard normal random variable and a (χ²ν/ν)^{1/2} random variable, respectively. Tables for the constant d are identical, apart from a factor of
√2, to those for the constants used in the normal theory step-down method described previously. For further information, see Hsu [33].
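A Monte Carlo sketch of the constant d^α_{I,ν} and the resulting upper bounds Di, following the balanced-case formulas above, is given below; the simulated data and all design choices are hypothetical, and the simulation size governs the accuracy of d.

```python
import numpy as np

rng = np.random.default_rng(4)
alpha, I, n = 0.05, 4, 15
nu = I * (n - 1)

# Monte Carlo approximation of d: the (1 - alpha) quantile of max_{i<I} (U_i - U_I)/S
B = 200_000
U = rng.standard_normal((B, I))
S = np.sqrt(rng.chisquare(nu, size=B) / nu)
d = np.quantile((U[:, :-1] - U[:, [-1]]).max(axis=1) / S, 1 - alpha)

# Hypothetical balanced data from I groups
groups = [rng.normal(m, 2.0, n) for m in (10.0, 10.5, 11.0, 12.0)]
ybar = np.array([g.mean() for g in groups])
s = np.sqrt(sum(((g - g.mean()) ** 2).sum() for g in groups) / nu)

# Upper bounds D_i on mu_(I) - mu_i: how far each treatment may fall short of the best
for i in range(I):
    others = np.delete(ybar, i)
    D_i = max(others.max() - ybar[i] + d * s / np.sqrt(n), 0.0)
    print(f"treatment {i+1}: D_{i+1} = {D_i:.2f}")
```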
Medical and Biometric Applications

Comparisons Between Several Treatments and a Control

The problem is to compare K test treatments with a control treatment, which may be either a placebo or a standard treatment. Denote the unknown mean responses by µ1, . . . , µK+1, where µk, k = 1, . . . , K, denotes the mean for the kth test treatment and µK+1 denotes the control mean. We formulate a multiple hypotheses testing problem in which we test H0k: µk = µK+1 vs. Hak: µk ≠ µK+1, for k = 1, . . . , K. Rejection of H0k in favor of Hak leads us to conclude there is a difference between the kth treatment and the control.

The purpose of the trial may be to select the best candidate and to test the hypothesis pertaining to that particular candidate. Since there are K possible choices, we stipulate that the FWE be ≤ α to ensure that the probability of declaring a false positive result is at most α. Since only one treatment is to be chosen, we may use a single-step procedure. If we reject this hypothesis, and wish to perform additional hypothesis tests to determine whether other candidates also differ significantly from the control, then we use the step-down procedure as described earlier. (Note that since we are only interested in finding a test treatment better than the control, we may prefer to formulate the hypotheses testing problem with one-sided alternatives instead of two-sided alternatives (see Alternative Hypothesis) in order to increase power. However, this decision must be made at the design stage.)

If the problem is to find all test treatments that can be shown to differ from the control, rather than selecting only one, then the problem may require a different formulation. Suppose the experiment is one of a series of similar experiments in which potential new treatments are compared with a control, and any test treatment showing promise is selected for further study. This is called screening (see Animal Screening Systems). In this case, interest would be in controlling the PCE rather than the FWE. Here the decision on each treatment does not depend on the decisions made with respect to the other treatments, which are included in the same experiment for reasons of experimental efficiency.

Comparisons Between a New Treatment and Several Standards
The test treatment may be a potential new treatment being compared with K standard treatments to determine whether it meets requirements for approval by the regulatory authority. The same hypotheses as before may be formulated, with µK+1 now denoting the mean for the test treatment and µk the mean for the kth standard. Rejection of H0k means that we conclude that the test treatment is different from the kth standard. To control the risk of any false claim, i.e. a claim that the test treatment differs from a particular standard when it does not, we would adopt a procedure that controls the FWE. Since the experimenter would like to find as many differences from the K standards as possible, one of the stepwise procedures, either step-down or step-up, should be used.
Superiority/Equivalence of a New Treatment Compared with K Standards

In the family of hypotheses used in the previous example, the alternative hypotheses Hak were two-sided: rejection of H0k meant that we concluded there is a difference between the test treatment and the kth standard, but it could be either better or worse depending on the direction of the difference. Such a formulation is equivalent to testing a pair of hypotheses with one-sided alternatives, namely H0k vs. H1ak: µK+1 > µk (test treatment is better than the kth standard), and H0k vs. H2ak: µK+1 < µk (test treatment is worse than the kth standard). Here we describe an alternative formulation, proposed by Dunnett & Tamhane [18]. We replace the hypothesis in each pair for determining whether the test treatment is worse with one which tests whether the test treatment is equivalent to the kth standard, namely H0k: µK+1 = µk − δ vs. Hak: µK+1 > µk − δ (see Bioequivalence; Equivalence Trials). Here, δ > 0 is a prespecified value representing a difference that is clinically unimportant. Rejection of H0k leads
us to conclude that µK+1 cannot be less than µk by more than an amount δ and hence, by definition, it is equivalent to the kth standard. Dunnett & Tamhane [18] show how the stepwise multiple testing procedures described above can be adapted to this multiple testing problem. By using this superiority/equivalence testing formulation, we increase the power over the two-sided formulation, but of course this is addressing a slightly different problem. By testing simultaneously for superiority and equivalence, we also obtain more information than we would from a formulation that tests only for superiority using one-sided alternative hypotheses.
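Each component test in this formulation is an ordinary one-sided two-sample t test applied to a δ-shifted difference. The sketch below shows a single such comparison of the test treatment with the kth standard; it is not the full Dunnett–Tamhane stepwise procedure, which combines the K statistics using multivariate t critical constants, and δ, the data, and α are hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
alpha, delta = 0.05, 1.0                     # delta: clinically unimportant difference

y_test = rng.normal(10.3, 2.0, 30)           # test treatment (group K+1), hypothetical
y_std  = rng.normal(10.0, 2.0, 30)           # kth standard treatment, hypothetical

n1, n2 = len(y_test), len(y_std)
nu = n1 + n2 - 2
s2 = ((n1 - 1) * y_test.var(ddof=1) + (n2 - 1) * y_std.var(ddof=1)) / nu

# Test H0k: mu_{K+1} = mu_k - delta against Hak: mu_{K+1} > mu_k - delta
t_k = (y_test.mean() - y_std.mean() + delta) / np.sqrt(s2 * (1 / n1 + 1 / n2))
p_k = stats.t.sf(t_k, nu)                    # one-sided P value
print(f"t = {t_k:.3f}, one-sided p = {p_k:.4f}; "
      f"{'equivalent to the standard (within delta)' if p_k <= alpha else 'cannot reject H0k'}")
```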
Comparisons with Both Active and Placebo Controls

D'Agostino & Heeren [8] described a trial where comparisons between a new treatment, T, two known active treatments, A1 and A2, and a placebo, P, are of interest. One proposal made was that the pairwise differences between treatment groups be tested with the experimentwise error rate controlled, which means that all comparisons are handled as a single family. Dunnett & Tamhane [16] pointed out that there were actually three families of comparisons, each answering a different question: (i) A1 and A2 vs. P, to test the sensitivity of the experiment, defined as its ability to identify that the two known active treatments are efficacious; (ii) T vs. P, to show that the new treatment is better than the placebo; and (iii) T vs. A1 and A2, to determine whether the new treatment can be shown to be superior to either of the known active treatments. Each family should be tested with the FWE controlled at ≤ α. Failure to show the sensitivity of the experiment, or failure to find the new treatment better than placebo, would invalidate the comparisons made in (iii). In this case, controlling the FWE at α for each of the three families individually serves to control the overall error rate for the combined families at α as well, so there is no need to apply an experimentwise multiplicity adjustment. This is an example of what is known as a priori ordered families of hypotheses [1].

A single hypothesis test serves to cover (ii), namely

H01: µT − µP ≤ 0 vs. Ha1: µT − µP > 0.

The following pair of hypothesis tests covers the comparisons in (i):

H02: µA1 − µP ≤ 0 vs. Ha2: µA1 − µP > 0,
H03: µA2 − µP ≤ 0 vs. Ha3: µA2 − µP > 0,

while the following pair of hypothesis tests covers the comparisons in (iii):

H04: µT − µA1 ≤ 0 vs. Ha4: µT − µA1 > 0,
H05: µT − µA2 ≤ 0 vs. Ha5: µT − µA2 > 0.

To test H01, we use an ordinary Student t test at level α, since there is only one hypothesis in the family. To test H02 and H03, since we require both to be rejected to establish sensitivity, we use the MIN test of Laska & Meisner [36] or its extension, the step-up test described earlier in the article. To test H04 and H05, we use the step-down test if we expect at most one of the two hypotheses to be rejected, or the step-up test if we expect both to be rejected.
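A sketch of this a priori ordered testing flow, working directly from unadjusted one-sided P values, is given below. The sensitivity family uses the MIN test (both actives must beat placebo at level α), the single T-versus-placebo test is carried out at level α, and the comparisons with the actives are interpreted only if both earlier families succeed. Holm's procedure is used for family (iii) here purely as a simple P value stand-in for the normal theory step-down test; all P values are hypothetical.

```python
alpha = 0.05

# Hypothetical unadjusted one-sided P values for the five tests
p_T_vs_P  = 0.004                  # family (ii): new treatment vs placebo
p_A1_vs_P = 0.010                  # family (i): sensitivity, active 1 vs placebo
p_A2_vs_P = 0.021                  # family (i): sensitivity, active 2 vs placebo
p_T_vs_A1 = 0.060                  # family (iii): new treatment vs active 1
p_T_vs_A2 = 0.018                  # family (iii): new treatment vs active 2

# Family (i): MIN test -- sensitivity established only if BOTH tests reject at level alpha
sensitivity = max(p_A1_vs_P, p_A2_vs_P) <= alpha

# Family (ii): single test at level alpha
t_better_than_placebo = p_T_vs_P <= alpha

print("sensitivity established:", sensitivity)
print("T superior to placebo  :", t_better_than_placebo)

# Family (iii): interpreted only if (i) and (ii) succeed; Holm used as a simple stand-in
if sensitivity and t_better_than_placebo:
    pairs = sorted([("T vs A1", p_T_vs_A1), ("T vs A2", p_T_vs_A2)], key=lambda x: x[1])
    thresholds = [alpha / 2, alpha]          # Holm thresholds for two hypotheses
    for (label, p), thr in zip(pairs, thresholds):
        if p <= thr:
            print(f"{label}: reject (p = {p})")
        else:
            print(f"{label}: do not reject (p = {p}); stop")
            break
else:
    print("comparisons with the active controls are not interpreted")
```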
Comparisons in a Dose Finding Experiment

In a dose finding experiment several dose levels of a compound, along with a zero dose control, are studied with respect to a specified response, usually some measure of efficacy or toxicity. The goal is to determine the lowest dose that produces a response that exceeds the control response (or, more generally, exceeds it by more than a specified amount, δ), denoted as the minimum effective dose (MED) [45] (see Minimum Therapeutically Effective Dose). Say there are K dose levels (usually, K = 3 or 4), and denote the mean responses for the control and the K dose levels by µ0, µ1, . . . , µK. Then we define MED = min{k: µk > µ0}. Ruberg [45] (see also Tamhane et al. [51]) formulates the problem of identifying the MED as the following multiple hypotheses testing problem:

H0k: µ0 = µ1 = · · · = µk vs. Hak: µ0 = µ1 = · · · = µk−1 < µk,

for k = 1, . . . , K. The estimated MED, or minimum detectable dose (MDD), is the lowest index k for which H0k is rejected. Strong control of the FWE is needed in testing this family of hypotheses in order to
control the probability of obtaining an estimate that is less than the true MED. A class of test statistics that may be used is based on contrasts in the observed means. Various stepwise tests are given in Tamhane et al. [51] and compared in a simulation study with respect to their FWE and power under various forms of the dose–response relationship.
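As a simple illustration of FWE control in this setting, the sketch below uses a fixed-sequence strategy: one-sided t tests against the control are carried out from the highest dose downward, each at the full level α, stopping at the first non-rejection, and the estimated MED is taken as the lowest dose still rejected. This is not the contrast-based approach of Ruberg [45] or the stepwise tests of Tamhane et al. [51]; it is one elementary alternative, and the simulated dose–response values are purely illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
alpha, n = 0.05, 20

# Hypothetical true means: control mu_0 and K = 4 increasing dose levels
true_means = [10.0, 10.0, 10.2, 11.0, 11.5]
data = [rng.normal(m, 1.5, n) for m in true_means]
control, doses = data[0], data[1:]

est_med = None
# Fixed sequence: test the highest dose first, then step down, stopping at the first
# non-rejection; each test is a one-sided two-sample t test at the full level alpha.
for k in range(len(doses), 0, -1):
    t_stat, p_two = stats.ttest_ind(doses[k - 1], control)
    p_one = p_two / 2 if t_stat > 0 else 1 - p_two / 2   # one-sided P value
    if p_one <= alpha:
        est_med = k               # dose k still shows a response above the control
    else:
        break                     # stop at the first non-rejection

print("estimated MED:", est_med if est_med is not None else "no effective dose identified")
```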
General Remarks

The literature on multiple comparisons is voluminous and it is not possible to cover adequately all aspects and developments in an encyclopedia entry such as this. Two particular topics which are related, and which warrant further mention, are selection and order restrictions. Selection problems have the general objective of identifying a favorable treatment or treatments from a collection of treatments. As might be expected, there are close links with multiple comparison problems and these connections are widely recognized (see [32], for example). Order restrictions in the hypotheses arise when there is added structure to the problem, such as in dose-ranging studies when it is “known” that larger doses will be associated with nondecreasing means. Methods for testing order restricted hypotheses involve introducing constraints in the likelihood functions and so have links with isotonic regression (see Isotonic Inference).

It should be noted that most of the discussion thus far has assumed that two-sided tests are of interest. The methods described all apply to the case of one-sided tests after minimal modifications. In addition, we have emphasized hypothesis testing throughout. Inferences based on interval estimates are often equivalently possible, with the focus then on the simultaneous coverage probability of collections of intervals rather than on the FWE.
References
[1] Bauer, P. (1991). Multiple testing in clinical trials, Statistics in Medicine 10, 871–890.
[2] Bechhofer, R.E. & Dunnett, C.W. (1988). Tables of percentage points of multivariate t distributions, Selected Tables in Mathematical Statistics 11. American Mathematical Society, Providence, pp. 1–371.
[3] Benjamini, Y. & Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing, Journal of the Royal Statistical Society, Series B 57, 289–300.
[4] Berger, R.L. (1982). Multiparameter hypothesis testing and acceptance sampling, Technometrics 24, 295–300.
[5] Brown, C.C. & Fears, R.R. (1981). Exact significance levels for multiple binomial testing with application to carcinogenicity screens, Biometrics 37, 763–774.
[6] Cook, R.J. & Farewell, V.T. (1996). Multiplicity considerations in the design and analysis of clinical trials, Journal of the Royal Statistical Society, Series A 159, 93–110.
[7] Cox, D.R. (1965). A remark on multiple comparison methods, Technometrics 7, 223–224.
[8] D'Agostino, R.B. & Heeren, T.C. (1991). Multiple comparisons in over-the-counter drug clinical trials with both positive and placebo controls (with comments), Statistics in Medicine 10, 1–31.
[9] Duncan, D.B. (1951). A significance test for differences between ranked treatments in an analysis of variance, Virginia Journal of Science 2, 171–189.
[10] Duncan, D.B. (1955). Multiple range and multiple F tests, Biometrics 11, 1–42.
[11] Duncan, D.B. (1957). Multiple range tests for correlated and heteroscedastic means, Biometrics 13, 164–176.
[12] Dunnett, C.W. (1980). Pairwise multiple comparisons in the homogeneous variance, unequal sample size case, Journal of the American Statistical Association 75, 789–795.
[13] Dunnett, C.W. (1997). Comparisons with a control, in Encyclopedia of Statistical Sciences, Update Vol. 1, S. Kotz, B.C. Read & D.L. Banks, eds. Wiley, New York.
[14] Dunnett, C.W. & Tamhane, A.C. (1991). Step-down multiple tests for comparing treatments with a control in unbalanced one-way layouts, Statistics in Medicine 10, 939–947.
[15] Dunnett, C.W. & Tamhane, A.C. (1992). A step-up multiple test procedure, Journal of the American Statistical Association 87, 162–170.
[16] Dunnett, C.W. & Tamhane, A.C. (1992). Comparisons between a new drug and active and placebo controls in an efficacy clinical trial, Statistics in Medicine 11, 1057–1063.
[17] Dunnett, C.W. & Tamhane, A.C. (1995). Step-up multiple testing of parameters with unequally correlated estimates, Biometrics 51, 217–227.
[18] Dunnett, C.W. & Tamhane, A.C. (1997). Multiple testing to establish superiority/equivalence of a new treatment compared with k standard treatments, Statistics in Medicine 16, 2489–2506.
[19] Finney, D.J. (1978). Multiple assays, in Statistical Methods in Biological Assay, 3rd Ed. Griffin, London, Chapter 11.
[20] Fisher, R.A. (1935). The Design of Experiments. Oliver & Boyd, Edinburgh.
[21] Genizi, A. & Hochberg, Y. (1978). On improved extensions of the T-method of multiple comparisons for unbalanced designs, Journal of the American Statistical Association 73, 879–884.
[22] Harter, H.L. (1960). Tables of range and Studentized range, Annals of Mathematical Statistics 31, 1122–1147.
[23] Hayter, A.J. (1984). A proof of the conjecture that the Tukey–Kramer multiple comparisons procedure is conservative, Annals of Statistics 12, 61–75.
[24] Hochberg, Y. (1974). Some generalizations of the T-method in simultaneous inference, Journal of Multivariate Analysis 4, 224–234.
[25] Hochberg, Y. (1988). A sharper Bonferroni procedure for multiple tests of significance, Biometrika 75, 800–802.
[26] Hochberg, Y. & Tamhane, A.C. (1987). Multiple Comparison Procedures. Wiley, New York.
[27] Holm, S. (1979). A simple sequentially rejective multiple test procedure, Scandinavian Journal of Statistics 6, 65–70.
[28] Hommel, G. (1986). Multiple test procedures for arbitrary dependence structures, Metrika 33, 321–336.
[29] Hommel, G. (1988). A stagewise rejective multiple test procedure based on a modified Bonferroni test, Biometrika 75, 383–386.
[30] Hommel, G. (1989). A comparison of two modified Bonferroni procedures, Biometrika 76, 624–625.
[31] Hsu, J.C. (1981). Simultaneous confidence intervals for all distances from the “best”, Annals of Statistics 9, 1026–1034.
[32] Hsu, J.C. (1984). Ranking and selection and multiple comparisons with the best, in Design of Experiments: Ranking and Selection (Essays in Honor of Robert E. Bechhofer), T.J. Santner & A.C. Tamhane, eds. Marcel Dekker, New York.
[33] Hsu, J.C. (1996). Multiple Comparisons. Chapman & Hall, London.
[34] Keuls, M. (1952). The use of the “Studentized range” in connection with an analysis of variance, Euphytica 1, 112–122.
[35] Kramer, C.Y. (1956). Extension of multiple range tests to group means with unequal number of replications, Biometrics 12, 307–310.
[36] Laska, E.M. & Meisner, M.J. (1989). Testing whether an identified treatment is best, Biometrics 45, 1139–1151.
[37] Marcus, R., Peritz, E. & Gabriel, K.R. (1976). On closed testing procedures with special reference to ordered analyses of variance, Biometrika 63, 655–660.
[38] Miller, R.G., Jr (1981). Simultaneous Statistical Inference, 2nd Ed. Springer-Verlag, New York.
[39] Newman, D. (1939). The distribution of the range in samples from a normal population, expressed in terms of an independent estimate of standard deviation, Biometrika 31, 20–30.
[40] Neyman, J. & Pearson, E.S. (1928). On the use and interpretation of certain test criteria for purposes of statistical inference, Biometrika 20, 175–240.
[41] Pocock, S.J., Geller, N.L. & Tsiatis, A.A. (1987). The analysis of multiple endpoints in clinical trials, Biometrics 43, 487–498.
[42] Redman, C.E. & Dunnett, C.W. (1994). Screening compounds for clinically active drugs, in Statistics in the Pharmaceutical Industry, 2nd Ed. Marcel Dekker, New York, Chapter 24.
[43] Rom, D.M. (1990). A sequentially rejective test procedure based on a modified Bonferroni inequality, Biometrika 77, 663–665.
[44] Roy, S.N. (1953). On a heuristic method of test construction and its use in multivariate analysis, Annals of Mathematical Statistics 24, 220–238.
[45] Ruberg, S.J. (1989). Contrasts for identifying the minimum effective dose, Journal of the American Statistical Association 84, 816–822.
[46] Scheffé, H. (1953). A method for judging all contrasts in the analysis of variance, Biometrika 40, 87–104.
[47] Šidák, Z. (1967). Rectangular confidence regions for the means of multivariate normal distributions, Journal of the American Statistical Association 62, 626–633.
[48] Simes, R.J. (1986). An improved Bonferroni procedure for multiple tests of significance, Biometrika 73, 751–754.
[49] Spjøtvoll, E. & Stoline, M.R. (1973). An extension of the T-method of multiple comparison to include the cases with unequal sample sizes, Journal of the American Statistical Association 68, 975–978.
[50] Tamhane, A.C. (1996). Multiple comparisons, in Handbook of Statistics, Vol. 13, S. Ghosh & C.R. Rao, eds, pp. 587–630.
[51] Tamhane, A.C., Hochberg, Y. & Dunnett, C.W. (1996). Multiple test procedures for dose finding, Biometrics 52, 21–37.
[52] Tukey, J.W. (1953). The Problem of Multiple Comparisons, Mimeographed Notes, Princeton University. Reprinted in The Collected Works of John W. Tukey, Vol. VIII – Multiple Comparisons: 1948–1983, H.I. Braun, ed. Chapman & Hall, New York, 1994.
[53] Westfall, P. (1985). Simultaneous small-sample multivariate Bernoulli confidence intervals, Biometrics 41, 1001–1013.
[54] Westfall, P.J. & Young, S.S. (1989). p value adjustments for multiple tests in multivariate binomial models, Journal of the American Statistical Association 84, 780–786.
RICHARD J. COOK & C.W. DUNNETT