Drug Information Journal, Vol. 32, pp. 1347S–1362S, 1998 Printed in the USA. All rights reserved.

0092-8615/98 Copyright © 1998 Drug Information Association Inc.

MULTIPLE TESTINGS: MULTIPLE COMPARISONS AND MULTIPLE ENDPOINTS*

GEORGE Y. H. CHI, PHD
Director, Division of Biometrics I, Food and Drug Administration, Rockville, Maryland

This paper gives a brief overview of some issues and some general concepts related to multiple testing problems as they are encountered in clinical drug trials. Some recent methods for handling multiple comparisons are discussed. Emphasis is given to the p-value-based procedures and the general closed testing procedure of Marcus, Peritz, and Gabriel (1). Examples are given to illustrate these procedures and to highlight some important issues. Regarding multiple endpoints, the discussion focuses on the need to have a better definition of what primary, coprimary, and secondary endpoints should mean, and suggests that a clinical decision rule defined in terms of primary, coprimary, and relevant secondary endpoints is a clinically relevant way of reducing the dimensionality problem of multiple endpoints. These clinical decision rules, however, often lack a statistical support structure. Further research in this area is needed.

Key Words: Multiple comparisons; Closed testing procedure; Multiple endpoints; Primary, coprimary, secondary endpoints; Global test; Clinical decision rule; Type I error; Strong control; Power; Bootstrap; Clinical trials

INTRODUCTION

MULTIPLE TESTINGS CAN arise in a variety of situations in clinical trials: for example, with multiple hypotheses, with interim analyses conducted in a group sequential trial design, with multiple comparisons in a trial with multiple treatments or multiple doses and a control, in a trial with multiple endpoints, in subgroup analyses, or in various combinations thereof. The issues involved in multiple testings are numerous, depending upon the specific situation one is in.

Presented at the DIA First International Symposium “Drug Clinical Development and Application of Statistics,” October 8–9, 1997, Beijing, China. Reprint address: George Y. H. Chi, PhD, Director, Division of Biometrics I, FDA, HFD-710, WOCII, RM. 5050, 5600 Fishers Lane, Rockville, MD 20857. *The views expressed in this paper are those of the author and not necessarily those of the Food and Drug Administration.

The focus of this paper is on multiple testing issues as they arise in multiple comparisons and multiple endpoints, and its purpose is to provide a brief overview of some of the more recent multiple testing methodologies in these two areas.

MULTIPLE COMPARISONS

In clinical drug trials, it is often of interest to design a parallel placebo-controlled study with multiple treatments, or multiple doses, of a test drug. The primary objective of such a trial is to demonstrate that one or more of the treatments, or doses of the drug, work; a secondary objective is to compare the different treatments, or to characterize the dose-response relationship of the drug. For such a study, how should one most efficiently




evaluate the effect of the treatments or the effect of the drug? How should the primary hypothesis or hypotheses be defined? And how should the test or tests be carried out?

For purposes of discussion, consider N treatments or doses of a drug and a placebo. Let µi, i = 0, 1, 2, . . . , N, represent the mean changes from baseline for placebo and for the N treatments or doses of the drug, respectively, and let Hoi: ∆µi = 0, i = 1, 2, . . . , N, denote the individual null hypotheses of no difference between the ith treatment, or the ith dose, of the drug and placebo, where ∆µi = µi − µ0. Suppose that one has suitable test statistics for testing each of these null hypotheses, and let poi, i = 1, 2, . . . , N, denote the p-values associated with the corresponding tests. Now, can one reject a null hypothesis Hoi if its p-value poi < α? Not necessarily, because testing each individual null hypothesis at the same α level does not control, at the α level, the overall type I error, that is, the probability of rejecting at least one individual null hypothesis when the global null hypothesis is true. This probability is inflated as a result of the multiple comparisons.

The inflation of the overall type I error can be seen as follows. Suppose there are N individual and independent comparisons, with the ith comparison carried out at a nominal significance level of αi, and let Ho123 . . . N: ∆µ1 = ∆µ2 = . . . = ∆µN = 0 represent the global null hypothesis. Then the probability of making at least one type I error is

P(reject Ho123 . . . N | Ho123 . . . N) = P(at least one Hoi is rejected | Ho123 . . . N)
  = 1 − P(no Hoi is rejected | Ho123 . . . N)
  = 1 − Π P(Hoi is not rejected | Ho123 . . . N), assuming the comparisons are independent,
  = 1 − Π (1 − αi).     (1)

If the individual comparisons are all made at the same significance level αi = α, then it follows from Equation (1) that the probability of making at least one type I error is

1 − (1 − α)^N > α.     (2)

Table 1 illustrates the inflation of the overall type I error as a result of multiple comparisons with α = 0.05.
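The entries of Table 1 below follow directly from inequality (2). As a quick numerical check, here is a minimal Python sketch (the code is illustrative only and is not part of the original article):

    # Familywise probability of at least one type I error among N
    # independent comparisons, each tested at the nominal level alpha;
    # see inequality (2).
    alpha = 0.05
    for n in (1, 2, 3, 5, 10):
        familywise_error = 1 - (1 - alpha) ** n
        print(f"N = {n:2d}: P(at least one type I error) = {familywise_error:.3f}")
    # Output: 0.050, 0.098, 0.143, 0.226, 0.401 -- the values in Table 1.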

TABLE 1
The Inflation of the Overall Type I Error

    Number of Comparisons, N     Probability of Making at Least One Type I Error, 1 − (1 − α)^N
     1                           0.050
     2                           0.098
     3                           0.143
     5                           0.226
    10                           0.401

From Table 1, equation (1), and inequality (2), it is clear that in order for the overall type I error to be controlled at a given α level, it is necessary to test the individual null hypotheses Hoi at significance levels αi < α, i = 1, 2, . . . , N, chosen in such a way that the probability of making at least one type I error does not exceed α. Various procedures have been proposed to accomplish this. The most commonly used are those that can be described as p-value-based procedures; they include the simple Bonferroni procedure, the Holm (2) sequentially rejective procedure, the Hommel (3) procedure, and the Hochberg (4) procedure. These procedures lose power as the number of comparisons increases, but they control the overall type I error at the desired α level, and they have enjoyed great popularity due to their relative simplicity.

Some Common P-value-Based Multiple Comparison Procedures

The procedures discussed below compare the p-values poi, i = 1, 2, . . . , N, derived from the tests of the individual null hypotheses Hoi, against adjusted significance levels αi < α chosen so that the overall type I error is controlled at the α level. Let po(i), i = 1, 2, . . . , N, denote the nonincreasingly ordered p-values.


The Bonferroni procedure is a simple, general, well-known, but conservative multiple comparison procedure that is applicable to most situations. It simply tests the individual null hypotheses at a reduced significance level of αi = α/N, where N is the total number of individual hypotheses. For example, if N = 4, then one would test each individual null hypothesis at the level αi = 0.05/4 = 0.0125, for i = 1, 2, 3, 4. Due to its relatively conservative nature, this procedure is not generally recommended. The adjusted p-values based on the Bonferroni procedure are given by pa(i) = N po(i), i = 1, 2, . . . , N.

Example 1. If po(1) = 0.07 > po(2) = 0.03 > po(3) = 0.01, then only Ho(3) will be rejected, since only po(3) < α/3 = 0.0167. The same decision can also be seen from the perspective of the adjusted p-values: under the Bonferroni procedure, pa(1) = 3 × 0.07 = 0.21, pa(2) = 3 × 0.03 = 0.09, and pa(3) = 3 × 0.01 = 0.03, and at the α = 0.05 nominal significance level only pa(3) = 0.03 < 0.05.

The Holm (2) sequentially rejective procedure is an improvement over the Bonferroni procedure. It requires that one order the individual p-values poi in a nonincreasing manner, po(1) > po(2) > . . . > po(N), and denote the corresponding individual null hypotheses by Ho(1), Ho(2), . . . , Ho(N). The Holm procedure starts with the smallest p-value, po(N). If po(N) > α/N, then stop and accept Ho(i), i = 1, 2, . . . , N. Otherwise, if po(N) < α/N, then reject Ho(N) and continue to compare po(N−1). If po(N−1) > α/(N − 1), then stop and accept Ho(i), i = 1, 2, . . . , N − 1. If po(N−1) < α/(N − 1), then reject Ho(N−1), and continue. The adjusted p-values are derived as follows. Let m = min {i | po(i) < α/i, i = 1, 2, . . . , N}. If m does not exist, then pa(i) = N po(i), i = 1, 2, . . . , N, and these are the same adjusted values as given by the Bonferroni procedure.


Otherwise, the adjusted p-values are given by

pa(1) = (m − 1) po(1), pa(2) = (m − 1) po(2), . . . , pa(m−1) = (m − 1) po(m−1), pa(m) = m po(m), pa(m+1) = (m + 1) po(m+1), . . . , pa(N) = N po(N).

Example 2. If po(1) = 0.07 > po(2) = 0.02 > po(3) = 0.01, then under the Bonferroni procedure only the null hypothesis Ho(3) will be rejected, since po(3) = 0.01 < 0.0167 = α/3; Ho(2) and Ho(1) will not be rejected, since po(2) = 0.02 > α/3 = 0.0167 and po(1) = 0.07 > α/3. This can also be readily seen from the adjusted p-values under the Bonferroni procedure: pa(1) = 3 × po(1) = 0.21 > pa(2) = 3 × po(2) = 0.06 > pa(3) = 3 × po(3) = 0.03. Under the Holm procedure, since po(3) = 0.01 < α/3 = 0.0167, one can reject Ho(3) and continue to test Ho(2). Since po(2) = 0.02 < α/2 = 0.025, one can reject Ho(2) and continue to test Ho(1). One cannot reject Ho(1), since po(1) = 0.07 > α = 0.05. This can also be seen from the adjusted p-values under the Holm procedure: since m = 2, the adjusted p-values are pa(1) = 1 × po(1) = 0.07, pa(2) = 2 × po(2) = 0.04, and pa(3) = 3 × po(3) = 0.03. This example shows that the Holm procedure is preferable because it can reject more hypotheses than the Bonferroni procedure.
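As an illustrative sketch (not code from the paper), the Bonferroni and Holm decisions of Example 2 can be reproduced with a few lines of Python implementing the step-down rule described above:

    # Bonferroni and Holm (step-down) decisions for the p-values of
    # Example 2, with alpha = 0.05.
    alpha = 0.05
    p_values = [0.07, 0.02, 0.01]      # po(1) > po(2) > po(3)
    n = len(p_values)

    # Bonferroni: every hypothesis is tested at alpha / N.
    bonferroni_reject = [p < alpha / n for p in p_values]

    # Holm: start from the smallest p-value; test the k-th smallest at
    # alpha / (N - k + 1) and stop at the first non-rejection.
    holm_reject = [False] * n
    ordered = sorted(enumerate(p_values), key=lambda pair: pair[1])
    for k, (index, p) in enumerate(ordered, start=1):
        if p < alpha / (n - k + 1):
            holm_reject[index] = True
        else:
            break

    print("Bonferroni rejects:", bonferroni_reject)  # [False, False, True]
    print("Holm rejects:      ", holm_reject)        # [False, True, True]

The Holm procedure rejects Ho(2) in addition to Ho(3), in agreement with the example.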



The Hochberg (4) procedure further improves upon the Holm procedure and is more powerful. Ordering the p-values and their associated hypotheses in the same way, the Hochberg procedure starts by testing Ho(1), the hypothesis corresponding to the largest p-value, po(1). If po(1) < α, then reject all Ho(i), i = 1, 2, . . . , N, and stop. If po(1) > α, then accept Ho(1) and continue to test Ho(2) at α/2. If po(2) < α/2, then reject all Ho(i), i = 2, 3, . . . , N, and stop. If po(2) > α/2, then accept Ho(2) and continue to test Ho(3) at α/3, and so on. The adjusted p-values are derived as follows. Let m = min {i | po(i) < α/i, i = 1, 2, . . . , N}. If m does not exist, then the procedure fails to reject any of the individual null hypotheses Ho(i) at the α level, and the adjusted p-values are the same as the unadjusted p-values. Otherwise, the adjusted p-values are given by

pa(1) = 1 po(1), pa(2) = 2 po(2), . . . , pa(m) = m po(m), pa(m+1) = m po(m+1), pa(m+2) = m po(m+2), . . . , pa(N) = m po(N).

Example 3. Let po(1) = 0.04 > po(2) = 0.03 > po(3) = 0.02. The Holm procedure fails to reject any of the three hypotheses: since po(3) = 0.02 > α/3 = 0.0167, the step-down procedure stops at the first step. In terms of the Holm adjusted p-values, pa(3) = 3 × 0.02 = 0.06 and pa(2) = 2 × 0.03 = 0.06; although 1 × po(1) = 0.04 < 0.05, the step-down procedure never reaches Ho(1), so it is not rejected. On the other hand, the Hochberg procedure rejects all Ho(i), i = 1, 2, 3, since po(1) = 0.04 < α = 0.05. Its adjusted p-values are given by (since m = 1) pa(1) = 1 × po(1) = 0.04, pa(2) = 1 × po(2) = 0.03, and pa(3) = 1 × po(3) = 0.02.

The conservativeness of the Holm procedure relative to the Hochberg procedure can also be seen from the following simple example with N = 2. If po(1) = 0.04 and po(2) = 0.03, then the Holm procedure rejects nothing, since po(2) = 0.03 > α/2 = 0.025 and the procedure stops; whereas the Hochberg procedure rejects both Ho(1) and Ho(2), since po(1) = 0.04 < α = 0.05.

The Hommel (3) procedure proceeds as follows. Start with the first step, I = 1, below. If po(1) > α, then proceed to the next step, I = 2; otherwise, stop and reject all of the null hypotheses. At step I = 2, if both po(1) > α and po(2) > α/2, then proceed to the next step, I = 3; otherwise, stop and reject all individual null hypotheses Hoi with poi < α, i = 2, 3, . . . , N. At step I = 3, if po(1) > α, po(2) > 2α/3, and po(3) > α/3, then proceed to the next step, I = 4; otherwise, stop and reject all individual null hypotheses Hoi with poi < α/2, i = 3, 4, . . . , N. In general, at step I = i, if all of the inequalities shown for that step hold, then go to the next step, I = i + 1; otherwise, stop and reject all individual null hypotheses Hoj with poj < α/(i − 1), j = i, i + 1, . . . , N:

I = 1: po(1) > α
I = 2: po(1) > α, po(2) > α/2
I = 3: po(1) > α, po(2) > 2α/3, po(3) > α/3
. . .
I = i: po(1) > α, po(2) > (i − 1)α/i, po(3) > (i − 2)α/i, . . . , po(j) > (i − j + 1)α/i, . . . , po(i) > α/i
. . .
I = N: po(1) > α, po(2) > (N − 1)α/N, po(3) > (N − 2)α/N, . . . , po(j) > (N − j + 1)α/N, . . . , po(N) > α/N

The adjusted p-values are derived as follows. Let M = max {i | po(1) > α, po(2) > (i − 1)α/i, po(3) > (i − 2)α/i, . . . , po(i) > α/i; i = 1, 2, . . . , N}. If M does not exist, then all hypotheses are rejected at the α level, and the adjusted p-values are the same as the unadjusted p-values. Otherwise, the adjusted p-values are given by

pa(1) = 1 po(1), pa(2) = 2 po(2), . . . , pa(M) = M po(M), pa(M+1) = M po(M+1), pa(M+2) = M po(M+2), . . . , pa(N) = M po(N).

Example 4. Let po(1) = 0.06 > po(2) = 0.03 > po(3) = 0.02. The Hochberg procedure cannot reject any hypothesis, since po(1) = 0.06 > 0.05, po(2) = 0.03 > 0.025, and po(3) = 0.02 > 0.0167. The adjusted p-values are pa(1) = 1 × po(1) = 0.06, pa(2) = 2 × po(2) = 0.06, and pa(3) = 3 × po(3) = 0.06. Under the Hommel procedure, however, at step I = 3, po(1) = 0.06 > 0.05 = α, po(2) = 0.03 < 2α/3 = 0.0333, and po(3) = 0.02 > 0.0167 = α/3, so M = 2,


and one rejects all individual null hypotheses Ho(i) with po(i) < α/M = α/2 = 0.025; that is, the procedure rejects Ho(3). The adjusted p-values are pa(1) = 1 po(1) = 0.06, pa(2) = 2 po(2) = 2 × 0.03 = 0.06, and pa(3) = 2 po(3) = 2 × 0.02 = 0.04.

Example 5. This example illustrates the different outcomes that result from applying the four multiple comparison procedures. Consider seven treatments or doses and a placebo, and let the p-values corresponding to the tests of the individual null hypotheses be ordered as follows: po(1) = 0.08 > po(2) = 0.030 > po(3) = 0.018 > po(4) = 0.012 > po(5) = 0.0113 > po(6) = 0.0075 > po(7) = 0.005.

The Bonferroni procedure would use the adjusted significance level αi = α/7, i = 1, 2, . . . , 7; with α = 0.05, αi = 0.0071. Thus, one can only reject Ho(7), since po(7) = 0.005 < 0.0071, but po(6) = 0.0075 > 0.0071. One can also look at this in terms of the adjusted p-values. Relative to α = 0.05, these p-values need to be adjusted upwards. The adjusted p-values are:

pa(1) = 0.08 × 7 = 0.56, pa(2) = 0.030 × 7 = 0.210, pa(3) = 0.018 × 7 = 0.126, pa(4) = 0.012 × 7 = 0.084, pa(5) = 0.0113 × 7 = 0.079, pa(6) = 0.0075 × 7 = 0.0525, pa(7) = 0.005 × 7 = 0.035.

Thus, in terms of the adjusted p-values, one can see that at the α = 0.05 significance level only Ho(7) can be rejected, since pa(7) = 0.035 is the only adjusted p-value less than α = 0.05.

For the same example, the Holm procedure will reject Ho(7), since po(7) = 0.005 < α/7 = 0.05/7 = 0.0071. It will also reject Ho(6), since po(6) = 0.0075 < α/6 = 0.05/6 = 0.0083. However, po(5) = 0.0113 > α/5 = 0.05/5 = 0.01, so one cannot reject Ho(5), Ho(4), Ho(3), Ho(2), and Ho(1). This can also be seen from the point of view of adjusted p-values. The adjusted p-values are:

pa(1) = 0.08 × 5 = 0.40, pa(2) = 0.030 × 5 = 0.15, pa(3) = 0.018 × 5 = 0.090, pa(4) = 0.012 × 5 = 0.060, pa(5) = 0.0113 × 5 = 0.0565, pa(6) = 0.0075 × 6 = 0.045, pa(7) = 0.005 × 7 = 0.035.

Thus, one can reject both Ho(7) and Ho(6), since both pa(7) and pa(6) are less than 0.05.

The Hochberg procedure will accept Ho(1), since po(1) = 0.08 > α = 0.05. It will also accept Ho(2), since po(2) = 0.030 > α/2 = 0.025, and Ho(3), since po(3) = 0.018 > α/3 = 0.0167. But it will reject Ho(4), since po(4) = 0.012 < α/4 = 0.0125; hence, it will stop and reject all of the remaining hypotheses, Ho(5), Ho(6), and Ho(7), as well. The adjusted p-values under the Hochberg procedure are:

pa(1) = 0.08 × 1 = 0.08, pa(2) = 0.030 × 2 = 0.06, pa(3) = 0.018 × 3 = 0.054, pa(4) = 0.012 × 4 = 0.048, pa(5) = 0.0113 × 4 = 0.0452, pa(6) = 0.0075 × 4 = 0.030, pa(7) = 0.005 × 4 = 0.020.

The Hommel procedure will accept Ho(1) and Ho(2), and reject all Ho(i) such that po(i) < α/2 = 0.025; that is, it will reject Ho(3), Ho(4), Ho(5), Ho(6), and Ho(7). This is because:

i = 1: po(1) = 0.08 > α = 0.05
i = 2: po(1) = 0.08 > α = 0.05, po(2) = 0.030 > α/2 = 0.025
i = 3: po(1) = 0.08 > α = 0.05, po(2) = 0.030 < 2α/3 = 0.0333, po(3) = 0.018 > α/3 = 0.0167
. . .



i = 7: po(1) = 0.08 > α = 0.05, po(2) = 0.030 < 6α/7 = 0.0429, po(3) = 0.018 < 5α/7 = 0.0357, po(4) = 0.012 < 4α/7 = 0.0286, po(5) = 0.0113 < 3α/7 = 0.0214, po(6) = 0.0075 < 2α/7 = 0.0143, po(7) = 0.005 < α/7 = 0.0071.

Thus, M = max {i | po(j) > (i − j + 1)α/i, j = 1, 2, . . . , i; i = 1, 2, . . . , N} = 2. Hence, by Hommel's procedure, one rejects all Ho(i) such that po(i) < α/2 = 0.025, that is, Ho(i) for i = 3, 4, . . . , 7. This can also be seen from the adjusted p-values: pa(1) = 0.08 × 1 = 0.08, pa(2) = 0.030 × 2 = 0.06, pa(3) = 0.018 × 2 = 0.036, pa(4) = 0.012 × 2 = 0.024, pa(5) = 0.0113 × 2 = 0.0226, pa(6) = 0.0075 × 2 = 0.015, pa(7) = 0.005 × 2 = 0.010.

In this example, the Bonferroni procedure can only reject Ho(7). The Holm procedure can reject an additional hypothesis, Ho(6). The Hochberg procedure can reject two more hypotheses, Ho(4) and Ho(5). The Hommel procedure can reject an additional hypothesis, Ho(3). This and the preceding examples show that the Bonferroni procedure is the most conservative and the Hommel procedure the least conservative, in the following order: Bonferroni < Holm < Hochberg < Hommel.

These p-value-based procedures all have a common feature, namely, they are all inferences based on the individual p-values po(i), i = 1, 2, . . . , N, associated with the individual null hypotheses Ho(i), i = 1, 2, . . . , N. These procedures adjust the individual αi in such a manner that the overall type I error is controlled at the desired α level. If these p-value-based procedures reject any one of the individual null hypotheses at the αi level specified by their procedures, then the global null hypothesis,

Ho123 . . . N: ∆µ1 = ∆µ2 = ∆µ3 = . . . = ∆µN = 0,

where ∆µi is the difference between the ith treatment, or the ith dose, of the test drug and placebo, will be rejected for sure. On the other hand, a test that rejects the global null hypothesis Ho123 . . . N at a significance level of α permits one to conclude that overall the drug is superior to placebo. Thus, as far as the primary objective of the study is concerned, the answer has been found, namely, that at least one treatment, or one dose, of the drug works. The rejection of this global null hypothesis does not, however, shed any light as to which specific treatment(s) or dose(s), if any, actually work. Certainly, if the test fails to reject the global null hypothesis, then one can conclude that the study fails to detect a difference between the various treatments and placebo, or between the different doses of the drug and placebo. Now, if a test rejects the global null hypothesis, it is usually of interest to know which one or more of these treatments or doses of the drug is actually superior to placebo. Can one simply use the p-values poi obtained from the individual comparisons of Hoi: ∆µi = 0, i = 1, 2, . . . , N, compare them at the same nominal significance level, say α, and then conclude whether each dose is or is not superior to placebo? The answer is no, not in general. Suppose, for example, that the true state of nature is ∆µ1 = ∆µ2 = . . . = ∆µN−1 = 0 and ∆µN ≠ 0. Then protecting the overall type I error corresponding to the global null hypothesis will not protect against the type I error corresponding to the following partial null hypothesis:

Ho123 . . . N − 1: ∆µ1 = ∆µ2 = ∆µ3 = . . . = ∆µN−1 = 0.
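To see numerically why this matters, consider a small simulation sketch of the partial null configuration just described (the simulation is illustrative only and is not from the paper; the normal-data model, sample sizes, and number of replicates are assumptions made for the sketch):

    # Three of four doses are truly ineffective (the partial null holds),
    # while the fourth has a real effect, so the global null would be
    # rejected in essentially every trial; the effective dose is not
    # simulated because it does not affect the probability below.
    # Testing each ineffective dose against placebo at the unadjusted
    # level alpha still produces a false-positive dose far more often
    # than alpha.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    alpha, n_per_arm, n_sims = 0.05, 50, 10_000
    trials_with_false_positive = 0

    for _ in range(n_sims):
        placebo = rng.normal(0.0, 1.0, n_per_arm)
        null_doses = [rng.normal(0.0, 1.0, n_per_arm) for _ in range(3)]
        p_null = [stats.ttest_ind(dose, placebo).pvalue for dose in null_doses]
        if min(p_null) < alpha:
            trials_with_false_positive += 1

    # Roughly 0.12-0.14, well above the nominal 0.05 (for fully
    # independent comparisons it would be 1 - 0.95**3, about 0.14).
    print("P(falsely reject at least one null dose):",
          trials_with_false_positive / n_sims)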


Therefore, having rejected the global null hypothesis, one still cannot simply test and reject the individual null hypotheses at the α level without inflating the type I error, as just explained. In other words, the fact that an individual p-value po(i) is less than α does not by itself imply that one can reject the corresponding individual null hypothesis Ho(i), even when the global null hypothesis has been rejected. Peritz proposed the closure principle (5), and Marcus, Peritz, and Gabriel (1) proposed a general closed testing procedure that is designed specifically to control this kind of type I error and that can determine whether an individual null hypothesis Ho(i) can be rejected at a given significance level α.

The Closed Testing Procedure (Marcus, Peritz, and Gabriel [1])

For simplicity, consider four doses of a drug along with a placebo, and the following four levels of hypotheses:

Level 0: Ho1234: ∆µ1 = ∆µ2 = ∆µ3 = ∆µ4 = 0
Level 1: Ho123: ∆µ1 = ∆µ2 = ∆µ3 = 0; Ho124: ∆µ1 = ∆µ2 = ∆µ4 = 0; Ho134: ∆µ1 = ∆µ3 = ∆µ4 = 0; Ho234: ∆µ2 = ∆µ3 = ∆µ4 = 0
Level 2: Ho12: ∆µ1 = ∆µ2 = 0; Ho13: ∆µ1 = ∆µ3 = 0; Ho14: ∆µ1 = ∆µ4 = 0; Ho23: ∆µ2 = ∆µ3 = 0; Ho24: ∆µ2 = ∆µ4 = 0; Ho34: ∆µ3 = ∆µ4 = 0
Level 3: Ho1: ∆µ1 = 0; Ho2: ∆µ2 = 0; Ho3: ∆µ3 = 0; Ho4: ∆µ4 = 0.

The set of all null hypotheses appearing in the four levels is called a closed family of


hypotheses. The adjective "closed" refers to the fact that every intersection of the individual null hypotheses in Level 3 is represented by one of the hypotheses in an earlier level. For example, the intersection of Ho1: ∆µ1 = 0 and Ho3: ∆µ3 = 0 is the null hypothesis Ho13: ∆µ1 = ∆µ3 = 0 in Level 2, and the intersection of all four individual null hypotheses Hoi: ∆µi = 0, i = 1, 2, 3, 4, is the global null hypothesis Ho1234 in Level 0. A hypothesis at a higher level is said to be implied by a hypothesis at a lower level if the truth of the lower level hypothesis implies the truth of the higher level hypothesis. For instance, Ho1 is implied by Ho12, Ho13, Ho14, Ho123, Ho124, Ho134, and Ho1234, while Ho12 is implied by Ho123, Ho124, and Ho1234.

The closed testing procedure is usually described in a step-down manner, starting with the first level, Level 0, by testing the global null hypothesis Ho1234. It proceeds as follows:

• Step 0: Test the Level 0 global null hypothesis Ho1234 at the α level. If the p-value po1234 > α, then stop, accept the global null hypothesis, and declare that the drug does not work. On the other hand, if po1234 < α, then proceed to Step 1.
• Step 1: Test each of the partial null hypotheses (or marginal null hypotheses) in Level 1 at the same α level. Reject a partial null hypothesis if its p-value is less than α. If no partial null hypothesis in Level 1 is rejected, then the procedure stops; one then concludes that the drug works, but one cannot conclude which dose or doses of the drug are effective. Otherwise, go to Step 2.
• Step 2: Test a partial null hypothesis in Level 2 at the same α level only if all of its implying hypotheses in the preceding level, Level 1, have been rejected. Reject this partial null hypothesis if its p-value is less than α. If no partial null hypothesis is rejected at Level 2, then the procedure stops. Otherwise, proceed to Step 3.
• Step 3: Test an individual null hypothesis in Level 3 at the same α level only if all of its implying partial null hypotheses in the preceding level, Level 2, have been rejected.


Reject this individual null hypothesis at the α level if its p-value is less than α. The procedure stops if no individual null hypothesis is rejected at Level 3, in which case one cannot identify from this closed testing procedure any dose or doses of the drug that are superior to the placebo. Otherwise, the doses of the drug that correspond to the rejected individual null hypotheses are the ones that are superior to the placebo.

The closed testing procedure guarantees the protection of the overall type I error at the given α level. In fact, the procedure protects the overall type I error strongly, in the sense that the probability of making a type I error in any of the partial null hypotheses in the subfamilies is also maintained at the same α level. In other words, the probability of falsely rejecting an individual null hypothesis Hoi at the α level is controlled at the α level under any configuration of the true states, as represented by the global null hypothesis or the partial null hypotheses at the various levels of the closed family of hypotheses. It can be shown that all of the p-value-based procedures described earlier are closed testing procedures in the sense of Marcus, Peritz, and Gabriel (1) and hence control the type I error strongly. The classical Scheffé method and the Tukey method in the analysis of variance setting also control the overall type I error strongly.

Example 1. Suppose in this example one has three doses, and the resulting p-values obtained from testing the various null hypotheses at the different levels are as shown below:

Level 0: po123 < α
Level 1: po12 > α, po13 > α, po23 > α
Level 2: po1 < α, po2 > α, po3 > α.

According to the closed testing procedure, since none of the hypotheses at Level 1 can be rejected, the procedure stops here. Since all of the individual null hypotheses


in Level 2 must be implied by one of the null hypotheses in Level 1, they cannot be rejected by the closed testing procedure, even if some of them have p-values less than α.

Example 2. Suppose the p-values obtained from testing the various null hypotheses at the different levels are as shown below:

Level 0: po1234 < α
Level 1: po123 < α, po124 < α, po134 < α, po234 > α
Level 2: po12 < α, po13 < α, po14 < α, po23 < α, po24 > α, po34 > α
Level 3: po1 < α, po2 < α, po3 < α, po4 > α.

According to the closed testing procedure, a hypothesis should be tested only if all of its implying hypotheses at the lower levels have been rejected; hypotheses implied by hypotheses that failed to be rejected at a lower level should not be tested. Their p-values are displayed above, however, for the purpose of the discussion in the next paragraph. Thus, only Ho1 is rejected, since all of its implying hypotheses Ho12, Ho13, Ho14, Ho123, Ho124, Ho134, and Ho1234 were rejected. Ho2 and Ho3 cannot be rejected, because one of their implying hypotheses, Ho24 and Ho34, respectively, cannot be rejected; therefore, even though po2 < α and po3 < α, Ho2 and Ho3 cannot be rejected. Finally, Ho4 cannot be rejected, since po4 > α and, moreover, two of its implying hypotheses, Ho34 and Ho234, failed to be rejected (po34 > α and po234 > α, respectively).

Remark 1. It should be pointed out that, given a closed testing procedure, there is no assurance that the testing procedure will necessarily proceed to the individual null hypotheses level (see, eg, Example 1). Thus, with a closed testing procedure one may still end up with an indeterminate situation similar to that of merely rejecting the global null hypothesis, without being able to identify the specific treatments or doses that are superior to placebo.
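The bookkeeping of the closed testing procedure is easy to automate. The following Python sketch applies the closure rule to the four-dose family of Example 2; the numerical p-values are hypothetical, chosen only to match the "< α" and "> α" pattern given in the example, and the code is an illustration rather than part of the original paper:

    # Hypothetical p-values for every hypothesis in the closed family of
    # Example 2 (chosen only to reproduce its "< alpha" / "> alpha" pattern).
    alpha = 0.05
    p = {
        (1, 2, 3, 4): 0.001,
        (1, 2, 3): 0.004, (1, 2, 4): 0.003, (1, 3, 4): 0.002, (2, 3, 4): 0.080,
        (1, 2): 0.010, (1, 3): 0.012, (1, 4): 0.008,
        (2, 3): 0.020, (2, 4): 0.090, (3, 4): 0.150,
        (1,): 0.010, (2,): 0.020, (3,): 0.030, (4,): 0.200,
    }

    # Closure rule: reject the individual hypothesis Hoi (with strong
    # control of the type I error) if and only if every hypothesis in the
    # family whose index set contains i has a p-value below alpha.
    for i in (1, 2, 3, 4):
        implying = [s for s in p if i in s]
        rejected = all(p[s] < alpha for s in implying)
        print(f"Ho{i}:", "rejected" if rejected else "not rejected")
    # Only Ho1 is rejected, in agreement with Example 2.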


Remark 2. It should further be noted that even if the closed testing procedure reaches the individual null hypotheses level, there is no assurance that one or more of the individual null hypotheses will be rejected. These individual hypotheses can be rejected only if their individual p-values are also less than α. A procedure with the property that whenever the global null hypothesis is rejected, at least one of the individual null hypotheses is also rejected is said to be consonant. Not every closed testing procedure is consonant. The previously discussed p-value-based procedures are consonant.

Remark 3. Example 2 also shows that even if the global null hypothesis is rejected and some individual null hypotheses have p-values poi < α, it does not mean that these individual null hypotheses can automatically be rejected, unless all of their implying hypotheses have been rejected at the α level. Thus, despite the fact that po2 < α and po3 < α, Ho2 and Ho3 cannot be rejected, since po24 > α, po34 > α, and po234 > α.

Remark 4. From Example 2 and Remark 3, it also becomes clear that it is critically important to declare prospectively in the protocol which treatments or doses will be used in defining the primary hypothesis or hypotheses of interest to be tested. Depending on which treatments or doses are included in or excluded from the hypotheses, the conclusion can be quite different, as illustrated by the following two examples.

Example 3. If one had excluded the treatment or dose corresponding to Ho4, then one would have the following p-values corresponding to the closed family of hypotheses Ho123, Ho12, Ho13, Ho23, Ho1, Ho2, and Ho3:

Level 0: po123 < α
Level 1: po12 < α, po13 < α, po23 < α
Level 2: po1 < α, po2 < α, po3 < α.


Thus, based on the closed testing procedure, one can reject all three individual null hypotheses, Ho1, Ho2, and Ho3, since the individual p-values are all less than α.

Example 4. If, instead, one had excluded the treatment or dose corresponding to Ho1, then one would have the following p-values corresponding to the closed family of hypotheses Ho234, Ho23, Ho24, Ho34, Ho2, Ho3, and Ho4 (carried over from Example 2):

Level 0: po234 > α
Level 1: po23 < α, po24 > α, po34 > α
Level 2: po2 < α, po3 < α, po4 > α.

Based on the closed testing procedure, one cannot reject any of the individual null hypotheses. Ho2 cannot be rejected even though po2 < α, because both of its Level 1 implying hypotheses, Ho23 and Ho24, failed to be rejected (Ho23 is never tested, since the Level 0 hypothesis Ho234 is not rejected). Ho3 cannot be rejected despite the fact that po3 < α, because one of its implying hypotheses in Level 1, Ho34, failed to be rejected. For the same reason, Ho4 fails to be rejected: the procedure does not even test Ho4, since one of its implying hypotheses, Ho34, has already failed to be rejected.

Compared to Example 2, Examples 3 and 4 demonstrate that by excluding one or more treatments or doses from the original four, one may reach a very different outcome than by including them. Example 3 shows that one is able to reject two additional hypotheses, Ho2 and Ho3, whereas Example 4 shows that one is unable to reject any individual null hypothesis. The point is that it is very important to prespecify the primary objective and the individual null hypotheses that are of primary interest and are to be tested. If they are not prespecified, then it may be necessary to construct the closure of the set of all


individual null hypotheses and then apply the closed testing procedure. An individual null hypothesis Hoi can then be rejected if its p-value poi < α and if all of its implying partial null hypotheses at all lower levels, including the global null, have been rejected at the α level.

Remark 5. It is obvious that if one is only interested in two of the four treatments, or in the two highest doses, then one can exclude the other two treatments, or the two lower doses, from consideration in this procedure. This effectively simplifies the situation and makes it more likely that one will succeed in identifying an effective treatment or dose of the drug, but the principle is the same. To carry the situation to the extreme, if one is only interested in the highest dose, then one can simply test the individual null hypothesis Ho4 at the α level, and there will be no multiple comparison adjustment to be made. Such considerations, however, should be proposed in the protocol and not adopted in a post hoc manner. It is very important to define clearly the primary objective of the trial and to describe clearly the proposed procedure for addressing this primary objective.

Remark 6. In a multiple dose study, if one expects a nondecreasing dose-response relationship, then one may wish to test the global null hypothesis

Ho123 . . . N: ∆µ1 = ∆µ2 = ∆µ3 = . . . = ∆µN = 0

against the ordered alternative

Ha123 . . . N: ∆µ1 ≤ ∆µ2 ≤ ∆µ3 ≤ . . . ≤ ∆µN, with at least one strict inequality.

Various global tests, both parametric and nonparametric, have been proposed for this situation, for example, by Bartholomew (5,6) and Williams (7,8). Marcus, Peritz, and Gabriel (1) applied the global test of Bartholomew in a closed testing procedure to permit individual tests between successive doses. The Williams test, which allows the identification of a minimally effective dose, is a closed test. Tamhane, Hochberg, and Dunnett


(9) proposed a general framework for a stepwise testing procedure that uses contrasts among the doses to identify the so-called minimally detectable dose (MDD). The use of a stepwise closed testing procedure that allows for testing of specific contrasts of interest has also been discussed by Bauer (10). Recently, Rom, Costello, and Connell (11) proposed a closed testing procedure for testing a monotonic dose-response relationship, as well as a closed testing procedure for cases where a reversal of the dose-effect relationship at higher doses is expected. Their method allows for comparisons between successive doses as well as comparisons with the control.

MULTIPLE ENDPOINTS

In clinical drug trials, the effects of a drug are frequently reflected in a number of clinical endpoints. It is common to find that one or more of these clinical endpoints are designated as the primary endpoint(s), while the remaining endpoints are designated as secondary or tertiary endpoints. The sample size for the trial is determined based on one of these primary endpoints, and the primary hypothesis is stated in terms of this primary endpoint, which will be tested upon the completion of the trial. Frequently, it is not entirely clear what distinguishes among these primary endpoints, or between the primary endpoints and the secondary endpoints, other than their designations. For instance, it is common to find that, when the statistical test fails to reject the primary hypothesis, the sponsor turns after the fact to testing the secondary hypotheses and calculating their associated p-values. In many instances, some of these p-values are very small, and the sponsor wishes to assert the superiority of the drug based on these "apparently" significant p-values (see, for example, the SOLVD prevention trial in Chi [12]). It is clear that by testing the primary hypothesis at the nominal significance level of α = 0.05, one has expended all of the α available. Testing of additional hypotheses will inflate the type


I error for sure, unless one has prospectively defined a way of doing this so that the overall type I error is maintained at the desired α level.

Three important issues are intertwined here, in addition to the well-known multiple testing problem related to the inflation of the overall type I error. First, the concepts of primary, coprimary, and secondary endpoints are not very well defined and are frequently muddled. Second, there is often a serious disconnect between these clinical endpoints and the actual clinical decision rule for determining whether the drug has demonstrated its effectiveness relative to the control. Third, the statistical testing procedure often does not reflect the predefined clinical decision rule. The main purpose of this section is to discuss these issues. The first issue is addressed by defining in a hierarchical sense what primary, coprimary, and secondary endpoints should mean. The second issue is then clarified by discussing the interrelationship between the clinical decision rule and the set of relevant clinical endpoints under consideration. The third issue is illustrated by discussing statistical testing procedures for some clinical decision rules.

What are the Important Issues?

In a clinical drug trial, what is the primary objective? The primary objective is usually to demonstrate that the drug can be indicated for the specific patient population under study; in other words, it is to show that the drug provides clinical benefit to that population. For a specific patient population, what constitutes clinical benefit should be determined by the medical community, or by a consensus of the medical experts. Sometimes what constitutes clinical benefit in a specific disease can be very controversial. The point, however, is that when a sponsor is planning a clinical drug trial, one must be very clear about what constitutes clinical benefit in the specific situation at hand, and about how one can define it


in as unambiguous a manner as possible. Usually, such clinical benefit is defined in terms of one or more relevant clinical endpoints, and how that is done is very important. It is not enough to just list a set of primary endpoints and another set of secondary endpoints. One must also define the clinical decision rule along with all relevant decision paths that can lead to a positive outcome regarding clinical benefit. In a way, this by itself may not eliminate the multiple endpoints problem, because one is replacing multiple endpoints with multiple clinical decision paths; but in many cases it will either eliminate the problem or reduce its complexity.

Once all relevant clinical decision paths have been identified up front, one can then design the proper statistical testing procedure to reflect the complete clinical decision rule. The statistical procedure must take into account the proper null hypotheses to be tested, the appropriate test statistics, and the proper allocation of α among all the decision paths, in such a manner that the overall type I error is protected at the desired α level. Thus, the idea is to reduce or eliminate the multiple endpoints problem by defining the clinical decision rule along with all relevant decision paths that can lead to a positive outcome regarding clinical benefit, and by properly allocating the overall α among these clinical decision paths.

Inherent in the idea of the clinical decision rule with its decision paths is the important concept of dimension reduction. In most situations there may be several primary and secondary endpoints, but only one clinical decision rule, with one or more clinical decision paths, will be used to determine whether the drug provides clinical benefit to the specific patient population under study. These clinical decision paths should be defined in terms of a set of relevant primary, coprimary, and secondary endpoints. Consequently, instead of testing the various endpoints separately, one simply defines an appropriate statistical testing procedure that reflects the clinical decision rule along with its clinical decision paths.
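As a small, purely hypothetical illustration of what "allocating α among the decision paths" can mean, suppose a decision rule has two paths, a win on an endpoint T1 or a win on an endpoint T2, and the overall α = 0.05 is split between them in a simple Bonferroni fashion (the endpoint names and the 0.04/0.01 split are invented for this sketch):

    # Hypothetical two-path clinical decision rule with a Bonferroni-style
    # split of the overall alpha (0.04 + 0.01 = 0.05), so the overall type
    # I error cannot exceed 0.05.
    ALPHA_T1 = 0.04   # alpha allocated to the decision path through T1
    ALPHA_T2 = 0.01   # alpha allocated to the decision path through T2

    def drug_shows_benefit(p_t1: float, p_t2: float) -> bool:
        """Return True if either prespecified decision path succeeds."""
        return p_t1 < ALPHA_T1 or p_t2 < ALPHA_T2

    print(drug_shows_benefit(p_t1=0.03, p_t2=0.20))   # True: wins through T1
    print(drug_shows_benefit(p_t1=0.06, p_t2=0.02))   # False: neither path wins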


The concept of dimension reduction is not new; the idea of a combined or composite endpoint is a case in point. Dimension reduction is a mathematical or statistical concept, however, whereas the clinical decision rule is not. The clinical decision rule reduces dimension in a clinically relevant and meaningful way. Clinical decision rules with their relevant decision paths are normally the way clinicians evaluate clinical benefit; consequently, they should be the most natural and appropriate way of assessing the effectiveness of the drug. In many instances, however, they lack the statistical support structure, or the statistical testing procedures applied do not reflect the clinical decision rules. In the following discussion, some important definitions are introduced, and the idea of clinical decision rules is then illustrated by a statistical testing procedure that uses the bootstrap method to capture the inherent correlation structure among the relevant endpoints involved.

Various Types of Clinical Endpoints

As indicated earlier, in clinical drug trials the concepts of primary, coprimary, and secondary endpoints are not very well defined. Several types of endpoints are therefore defined first. Although these definitions reflect the regulatory perspective, they conform to sound scientific principles and reflect actual clinical trial practice.

Definition 1. A clinical endpoint is a clinical variable that either directly or indirectly reflects the condition of an underlying disease.

Definition 2. A clinical endpoint is a primary endpoint if it satisfies the following conditions:

• It provides a measure of clinical benefit realized in the patient that is acceptable to the clinician as a meaningful measure of the drug effect for the disease under treatment, and


• It is an endpoint such that a positive finding in this endpoint alone is sufficient to result in the claim, regardless of what the other endpoints show.

Comment: Primary endpoints are rare; the relationship of the endpoint to the disease and the treatment, and the quality and objectivity of the endpoint, must be well established. A good example of a primary endpoint is mortality, say, in a congestive heart failure trial.

Definition 3. A clinical endpoint is a coprimary endpoint if it satisfies the following conditions:

• It provides a measure of clinical benefit realized in the patient that is acceptable to the clinician as a meaningful measure of the drug effect for the disease under treatment, and
• It is an endpoint such that a positive finding in this endpoint alone is sufficient to result in the claim, provided any primary endpoint(s), or other equally important coprimary endpoints, are not showing a negative effect.

Definition 4. A clinical endpoint is a secondary endpoint if it satisfies the following conditions:

• It provides a measure of clinical benefit realized in the patient that is acceptable to the clinician as a meaningful measure of the drug effect for the disease under treatment, and
• It is an endpoint such that a positive finding in this endpoint alone is not sufficient to result in the claim.

Comment: A secondary endpoint together with other secondary endpoints, however, may provide sufficient evidence to result in a claim, and it may provide supportive evidence for the primary or coprimary endpoints. In the presence of a positive finding in a primary or coprimary endpoint, however, whether positive findings in the secondary endpoints can be used in a labeling


claim needs further discussion; if they can, the level at which they should be tested is not clear. Assuming that one accepts these definitions of primary, coprimary, and secondary endpoints, the question then is: given a set of such endpoints, how does one handle the multiple endpoints testing problem?

Clinical Decision Rules and Statistical Testing Procedures

An important principle is that one first needs to identify a subset of clinically important and relevant primary, coprimary, and secondary clinical endpoints. One then needs to define the clinical decision rule, with all the relevant decision paths that can result in a positive outcome for the drug regarding its clinical benefit. This decision rule and its decision paths are defined in terms of the endpoints in this subset. The statistical testing procedure, which includes specifying the appropriate hypotheses, the test statistics for the various decision paths, the overall type I error, the proper allocation of α among the various decision paths, and the power, should reflect this clinical decision rule. Consider some commonly encountered clinical decision rules. To facilitate the discussion, consider a clinical drug trial with three clinical endpoints, T1, T2, and T3.

Example 1. The clinicians agree that T1 is the primary endpoint, T2 and T3 are secondary endpoints, and the clinical decision rule is that the drug can claim to have demonstrated its effect only if it shows a clinical benefit on the primary endpoint T1, irrespective of the outcomes on T2 and T3. In this clinical decision rule, there is only one decision path. The International Conference on Harmonization's E-9 Document on Statistical Principles for Clinical Trials (25) recommends that, if possible, a single primary endpoint be selected. In this example the clinical decision rule is simple, and it has eliminated the three-endpoint problem by reducing the assessment of a claim of clinical benefit to


just one endpoint, T1. The statistical testing procedure is fairly straightforward.

Example 2. In this example, the clinicians agree that T1 and T2 are equally important coprimary endpoints, T3 is a secondary endpoint, and the clinical decision rule is that the drug can claim to have demonstrated its effect if it shows a clinical benefit on the coprimary endpoint T1 or on the coprimary endpoint T2, irrespective of the outcome on T3. In this decision rule there are two decision paths, one through T1 and the other through T2. In this example, neither of the coprimary endpoints should be classified as secondary, because the drug can demonstrate its effect based on either endpoint alone. As discussed earlier, problems have frequently arisen when a sponsor tried to assert a drug effect based on a secondary endpoint that was not prespecified as a coprimary endpoint capable of leading to a positive outcome for the drug regarding clinical benefit. Stating up front that these are coprimary endpoints, and hence will be part of the possible decision paths, will eliminate the problem of inflation of the overall type I error, provided one preallocates α to allow for testing based on these coprimary endpoints.

Since in this decision rule either T1 or T2 can lead to a claim of positive benefit, there are two decision paths. Therefore, this clinical decision rule does not fully reduce the multiple endpoints problem. One must still apply an appropriate statistical testing procedure that reflects this clinical decision rule and protects against inflation of the overall type I error by appropriately allocating the α between the two decision paths. This example illustrates the typical multiple endpoints problem in which one has several primary and coprimary endpoints and the clinical decision rule has almost as many decision paths, one corresponding to each endpoint.

There are several approaches for handling multiple endpoints problems of this type. One approach is to consider all of these endpoints simultaneously by applying a global test. These global


tests (for example, Hotelling's T2; O'Brien's [13] OLS and GLS tests; Tang, Geller, and Pocock's [14] modified GLS test; and Tang, Gnecco, and Geller's [15] ALR test) provide a single overall test of the global null hypothesis of no treatment effect on any of the endpoints. These global methods, however, do not identify which specific endpoints are significant. Just as in multiple comparisons, it is possible for a global test to reject the global null hypothesis without any individual null hypothesis being rejected; the global tests therefore suffer from the problem of not being endpoint specific. The application of some of these global tests within the framework of the closed testing procedure proposed by Marcus, Peritz, and Gabriel (1), however, may help resolve the problem of nonspecificity in some cases. For a detailed discussion along this line, see the recent papers by Huque and Sankoh (16), Sankoh, Huque, and Dubey (17), Lehmacher, Wassmer, and Reitmer (18), Dunnett and Tamhane (19), and Bauer (10).

On the other hand, in this example the p-value-based procedures discussed in the preceding section within the context of multiple comparisons can also be applied. These p-value-based procedures, however, assume independence between the endpoints. Since in most situations the endpoints are moderately correlated, these p-value-based procedures tend to be on the conservative side.


Example 3. The clinicians agree that T1 is primary, T2 is coprimary, and T3 is secondary, and the clinical decision rule is that the drug will have demonstrated its effect if either T1 is significant or, upon failing that, T2 is significant, irrespective of the outcome on T3. In this decision rule there are two decision paths, one through T1 and the other through T2 conditional upon failing on T1. A statistical testing procedure reflecting this clinical decision rule can be proposed as follows (20). Test the primary endpoint T1 at some α1 < α. If T1 fails to achieve statistical significance at α1, then use the bootstrap resampling method to test the coprimary endpoint T2 at a conditional critical value cα2 determined from the empirical bootstrap resampling distribution at a level α2, where α1 + α2 = α. This approach has at least two desirable features. First, the primary endpoint is tested in the usual way and does not depend on the unknown correlation between T1 and T2. Second, one does not need to estimate the unknown correlation; the bootstrap resampling distribution itself captures the correlation inherent in the empirical distribution. Preliminary research shows that there is much power to be gained with this conditional testing procedure, and that there is interesting design information concerning the optimal allocation of α1 and α2. The method may be applicable even when the distributions are nonnormal.
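The following Python sketch shows one plausible way the conditional bootstrap test of Example 3 might be organized. It is only a schematic of the resampling idea, not the procedure of reference (20): the data-generating step, the test statistics, the α1/α2 split, and the construction of the conditional critical value are all assumptions made for the sketch.

    # Schematic sketch of a conditional bootstrap test for two correlated
    # endpoints T1 (primary) and T2 (coprimary); NOT the procedure of (20).
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    alpha1, alpha2 = 0.04, 0.01            # alpha1 + alpha2 = alpha = 0.05
    n, n_boot = 200, 5000

    # Paired (treatment minus control) differences on T1 and T2; simulated
    # here only so that the sketch runs end to end.
    diffs = rng.multivariate_normal([0.15, 0.25], [[1.0, 0.6], [0.6, 1.0]], size=n)

    def z_stats(d):
        return d.mean(axis=0) / (d.std(axis=0, ddof=1) / np.sqrt(len(d)))

    z1, z2 = z_stats(diffs)
    z1_crit = stats.norm.ppf(1 - alpha1)

    if z1 > z1_crit:
        print("Benefit demonstrated on T1 at level alpha1")
    else:
        # Bootstrap the centered data to approximate the joint null
        # distribution of (z1, z2) while preserving their correlation.
        centered = diffs - diffs.mean(axis=0)
        boot = np.array([z_stats(centered[rng.integers(0, n, n)])
                         for _ in range(n_boot)])
        # Conditional critical value: among resamples in which T1 fails at
        # alpha1, choose c so that P(T1 fails and z2 > c) is about alpha2,
        # keeping the overall type I error near alpha1 + alpha2.
        failed_t1 = boot[boot[:, 0] <= z1_crit]
        c_alpha2 = np.quantile(failed_t1[:, 1],
                               1 - alpha2 / (len(failed_t1) / n_boot))
        print("Benefit demonstrated on T2" if z2 > c_alpha2 else "No benefit shown")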


Example 4. The clinicians agree that T1 is primary but T2 and T3 are secondary, and the clinical decision rule is that the drug will have demonstrated its effect if either T1 is significant or T2 and T3 are both significant. In order for the overall type I error to be controlled at the desired α level, T1, T2, and T3 should be individually tested at some significance levels αi < α, i = 1, 2, 3. The proper statistical testing procedure for this clinical decision rule has yet to be worked out and is currently under study. For a related discussion, see Capizzi and Zhang (21).

Example 5. The clinicians agree that T1 is primary but T2 and T3 are secondary, and the clinical decision rule is that the drug will have demonstrated its effect if T1 is significant and T2 and T3 both show no negative effect. The statistical testing procedure reflecting this clinical decision rule has not yet been worked out. The method proposed by Follman (22) may be relevant here.

Example 6. The clinicians agree that T1, T2, and T3 are all secondary, and the clinical decision rule is that the drug will have demonstrated its effect if any two of the three secondary endpoints are significant and there is no negative effect in the third. Ruger's (23) generalization of the Bonferroni procedure is tailored to this kind of decision rule. It suffers, however, from the conservatism of the Bonferroni procedure when the correlations among the endpoints are moderate. It seems that improvement of this procedure by introducing the Min test of Laska and Meisner (24), together with the application of a closed testing procedure, may be possible.

Example 7. In case T1, T2, and T3 are clinical events of interest, one sometimes sees the following kind of clinical decision rule (a composite endpoint): time to the first observed event of interest. A statistical testing procedure that reflects this clinical decision rule needs to consider the problem of competing risks.

The bottom line is that one should clearly define the primary, coprimary, and secondary endpoints; the clinical decision rule, with all relevant decision paths that can lead to a positive outcome in terms of clinical benefit; and the proper statistical testing procedure reflecting the clinical decision rule along with its decision paths. Most of the published literature on multiple endpoints has focused on problems of the type exemplified by Example 2. The other examples of clinical decision rules, however, pose some interesting and challenging statistical problems that await resolution.

SUMMARY

It is clear from this discussion of multiple comparisons and multiple endpoints that it is very important to define clearly the primary objective of a clinical drug trial. Once the primary objective has been defined, one can then determine the appropriate decision rule on which to base an assessment of the effectiveness of the treatments or of specific doses of a drug.

For multiple comparisons, one needs to decide, relative to the primary objective, which comparisons are critical, and then define the corresponding null hypotheses to be tested by an appropriate multiple comparison procedure. The closed testing procedure of Marcus, Peritz, and Gabriel is recommended, since it strongly controls the overall type I error. The well-known p-value-based multiple comparison procedures such as the


Bonferroni, Holm, Hochberg, and Hommel are all special cases of the general closed testing procedure. They are generally on the conservative side, since they assume independence. Among these four multiple comparison procedures, the Bonferroni procedure is the most conservative, followed by the Holm, Hochberg, and Hommel procedures in that order. There has also been some recent work applying the idea of a closed testing procedure to dose-response studies.

For multiple endpoints, one needs to decide, relative to the primary objective, what the most appropriate and relevant clinical decision rule is for assessing the effectiveness of the drug. This clinical decision rule should be defined in terms of the most important and relevant clinical endpoints, and it may reduce the dimensionality of the multiple endpoints problem. Once the clinical decision rule has been agreed upon, one must define the appropriate null hypotheses and the associated statistical testing procedure that together reflect this clinical decision rule. The testing procedure should include the allocation of α, the appropriate test statistics for each decision path, and the power or sample size considerations. A clinical decision rule not only helps reduce the dimensionality of the multiple endpoints problem, it also reflects the actual clinical perspective regarding the assessment of the efficacy of a treatment or drug for a specific disease. All too often, however, clinical decision rules lack the necessary statistical support structure; several of the clinical decision rules discussed above are currently missing such support, and more research is needed. For situations where clinical decision rules fail to reduce the dimensionality problem related to multiple endpoints, some recent work has taken the direction of applying an appropriate global test along with a closed testing procedure to help identify the individual significant endpoints.

REFERENCES

1. Marcus R, Peritz E, Gabriel KR. On closed testing procedures with special reference to ordered analysis of variance. Biometrika. 1976;63:655–660.
2. Holm S. A simple sequentially rejective multiple test procedure. Scandinavian J Stat. 1979;6:65–70.

3. Hommel G. A stagewise rejective multiple test procedure based on a modified Bonferroni test. Biometrika. 1988;75:383–386.
4. Hochberg Y. A sharper Bonferroni procedure for multiple tests of significance. Biometrika. 1988;75:800–802.
5. Bartholomew DJ. A test of homogeneity for ordered alternatives. Biometrika. 1959;46:36–48.
6. Bartholomew DJ. A test of homogeneity for ordered alternatives II. Biometrika. 1959;46:328–335.
7. Williams DA. A test for differences between treatment means when several dose levels are compared with a zero dose control. Biometrics. 1971;27:103–117.
8. Williams DA. The comparison of several dose levels with a zero dose control. Biometrics. 1972;28:519–531.
9. Tamhane AC, Hochberg Y, Dunnett CW. Multiple test procedures for dose finding. Biometrics. 1996;52:21–37.
10. Bauer P. Multiple testings in clinical trials. Stat Med. 1991;10:871–890.
11. Rom DR, Costello RJ, Connell LT. On closed test procedures for dose response analysis. Stat Med. 1994;13:1583–1596.
12. Chi GYH. A regulatory perspective for handling multiple endpoints. Presented at the International Workshop on Stroke Nosometrics & Design of Stroke Clinical Trials, October 16–17, 1995, Dusseldorf, Germany.
13. O'Brien PC. Procedures for comparing samples with multiple endpoints. Biometrics. 1984;40:1079–1087.
14. Tang D-I, Geller NL, Pocock SJ. On the design and analysis of randomized clinical trials with multiple endpoints. Biometrics. 1993;49:23–30.
15. Tang D-I, Gnecco C, Geller NL. An approximate likelihood ratio test for a normal mean vector with non-negative components with application to clinical trials. Biometrika. 1989;76:577–583.
16. Huque MF, Sankoh AJ. A reviewer's perspective on multiple endpoint issues in clinical trials. J Biopharmaceutical Stat. 1997;7:545–564.
17. Sankoh AJ, Huque MF, Dubey SD. Some comments on frequently used multiple endpoint adjustment methods in clinical trials. Stat Med. 1997;16:2529–2542.
18. Lehmacher W, Wassmer G, Reitmer P. Procedures for two sample comparisons with multiple endpoints controlling the experimentwise error rate. Biometrics. 1991;47:511–521.
19. Dunnett CW, Tamhane AC. Step-up multiple testing of parameters with unequally correlated estimates. Biometrics. 1995;51:217–227.
20. Jin K, Chi GYH. Application of bootstrap in handling multiple endpoints. Presented at the 1997 Joint Statistical Meetings, Anaheim, California.
21. Capizzi T, Zhang J. Testing the hypothesis that matters. Invited presentation at the ENAR meeting, Cleveland, Ohio, 1994.
22. Follman D. Multivariate tests for multiple endpoints in clinical trials. Stat Med. 1995;14:1163–1175.
23. Ruger B. Das maximale Signifikanzniveau des Tests "Lehne Ho ab, wenn k unter n gegebenen Tests zur Ablehnung führen." Metrika. 1978;25:171–178.
24. Laska EM, Meisner MJ. Testing whether an identified treatment is best. Biometrics. 1989;45:1139–1152.
25. International Conference on Harmonization of Technical Requirements for Registration of Pharmaceuticals for Human Use. ICH Tripartite Guideline, E9 Document: Statistical Principles for Clinical Trials. February 5, 1998 (version).
26. Senn S. Statistical Issues in Drug Development. New York: John Wiley; 1997.
