
Journal of Educational and Behavioral Statistics Spring 1999, Vol. 24, No. 1, pp. 42-69

Controlling Error in Multiple Comparisons, with Examples from State-to-State Differences in Educational Achievement

Valerie S. L. Williams
National Institute of Statistical Sciences

Lyle V. Jones
The University of North Carolina at Chapel Hill

John W. Tukey
Princeton University

Keywords: educational assessment, hypothesis testing, multiple comparisons, National Assessment of Educational Progress (NAEP)

Three alternative procedures to adjust significance levels for multiplicity are the traditional Bonferroni technique, a sequential Bonferroni technique developed by Hochberg (1988), and a sequential approach for controlling the false discovery rate proposed by Benjamini and Hochberg (1995). These procedures are illustrated and compared using examples from the National Assessment of Educational Progress (NAEP). A prominent advantage of the Benjamini and Hochberg (B-H) procedure, as demonstrated in these examples, is the greater invariance of statistical significance for given comparisons over alternative family sizes. Simulation studies show that all three procedures maintain a false discovery rate bounded above, often grossly, by α (or α/2). For both uncorrelated and pairwise families of comparisons, the B-H technique is shown to have greater power than the Hochberg or Bonferroni procedures, and its power remains relatively stable as the number of comparisons becomes large, giving it an increasing advantage when many comparisons are involved. We recommend that results from NAEP State Assessments be reported using the B-H technique rather than the Bonferroni procedure.

This project was supported by the National Institute of Statistical Sciences through grants from the National Science Foundation (No. DMS-9208758 and RED-9350005). The authors are grateful to Susan Ahmed, Robert Burton, and others at the National Center for Education Statistics for their help in framing the issues, and to Jerome Sacks, Juliet P. Shaffer, and David Thissen for constructive suggestions. Special thanks are directed to Christopher Wiesen who provided critical support in developing software for the graphic displays and simulation studies.

Two questions often asked about each of a set of observed comparisons are: (a) should we be confident about the direction or the sign of the corresponding underlying population comparison, and (b) for what interval of values should we be confident that it contains the value for the population comparison? Most often, each comparison will be a simple difference between two separately estimated quantities. The present report focuses on (a) above, particularly in cases where the number of comparisons is large. It expands on concepts introduced by Tukey (1991, 1993) and by Benjamini and Hochberg (1995).

Assume that statistical procedures are required to control a Type I error rate at a conventional value (α = .05, perhaps). For a single comparison, α/2 provides a bound related to the probability of deciding with confidence that a population comparison goes in one direction when that population comparison actually goes in the opposite direction (has the opposite sign). This formulation assumes, as experience has taught us, that no population comparison is exactly zero (to many decimal places). Nonetheless, the conventional emphasis on the "null" hypothesis is not surprising. The importance of the null hypothesis is not that it is null, but rather that (a) as a limit, it is the least favorable case, and (b) situations with small non-zero values of the population comparison ("perinull" situations) behave much as if they were at that limit.

The probability of erroneous confidence, defined more generally than above, differs by a factor of two depending on whether the population comparison is zero or near-zero. This is so because confidence in either of the two directions is erroneous if a population comparison is exactly zero, but only one direction is erroneous when the population comparison is not precisely zero. In addition, so long as the true difference is close to zero, values beyond the selected critical value are very nearly as likely to be of one sign as the other. The probability, α, is the maximum probability of the traditional Type I error.
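The factor of two described above can be checked numerically. The following is a minimal sketch, not part of the original report, assuming a test statistic that is normally distributed with unit variance around the true standardized difference; the function names and the parameter `delta` are hypothetical and chosen only for illustration.

```python
from statistics import NormalDist


def wrong_direction_probability(delta, alpha=0.05):
    """Probability of confidently asserting the wrong sign.

    Assumes z ~ N(delta, 1) with delta != 0 and a two-sided test at level
    alpha. Only the tail on the side opposite to delta is an error.
    """
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # By symmetry, work with |delta|: the wrong tail is z <= -z_crit.
    return NormalDist(mu=abs(delta)).cdf(-z_crit)


def erroneous_confidence_at_zero(alpha=0.05):
    """Probability of asserting either direction when the difference is exactly zero.

    Both tails count as errors in this limiting case.
    """
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return 2 * NormalDist().cdf(-z_crit)


if __name__ == "__main__":
    for delta in (1e-6, 0.1, 0.5, 1.0):
        print(delta, wrong_direction_probability(delta))
    print("exactly zero:", erroneous_confidence_at_zero())
```

Under these assumptions the wrong-direction probability approaches α/2 as `delta` shrinks toward zero and falls below it as `delta` grows, while the exactly-zero case gives α.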
Frequently, and rather misleadingly, α is considered to be the probability of deciding to be confident about the direction of an observed comparison when the population difference is exactly zero. Instead, we recommend thinking of α, in the simplest case, as "a bound on twice the probability of being erroneously confident about the direction of the population comparison."

Multiplicity arises in situations where more than one comparison is evaluated. Unless some correction is incorporated, the overall (simultaneous) Type I error rate (the probability that the decision for any one or more comparisons will be in error) will exceed, often very substantially, the nominal α (which still would apply to any single comparison whenever that comparison can be appropriately assessed alone). With multiplicity, it is appropriate, and usually essential, to adjust for the increased probability of simultaneous Type I error, that is, the probability of finding at least one erroneous confident direction. Shaffer (1994) reviews the range of multiple comparison adjustments that have been proposed to control one kind or another of overall Type I error rate.

The Bonferroni adjustment is a simple and trustworthy statistical procedure for assuring simultaneously that the probability of any Type I error is no greater than α. However, power is severely restricted when the simultaneous error rate is made no greater than α by the use of the Bonferroni adjustment. Power becomes extremely low when the number of comparisons is very large, that is, for very large family sizes. Moreover, conclusions from the Bonferroni procedure are highly sensitive to differences in family size. Family size is always the number of contrasts under consideration. However, there may be legitimate ambiguities about family size for a particular set of data. A desirable feature for an otherwise satisfactory multiple comparison procedure is that it provide decisions about significance that are reasonably invariant over alternative choices of family size.

Two sequentially-rejective techniques described in Benjamini and Hochberg (1995) provide greater statistical power than the Bonferroni correction while still attempting to control the rate of erroneous declarations of confidence. The Hochberg procedure (Hochberg, 1988) controls the familywise error rate at α, which is then a bound on the probability of making any (one or more) Type I errors in a given family of comparisons; this bound is very nearly sharp when all population comparisons are zero. In contrast, the Benjamini and Hochberg technique (B-H) attempts to control the fraction of false discoveries, roughly, the average fraction of erroneous assertions among all confident directions asserted; therefore, α/2 provides an approximate bound for a given family of comparisons on the expected value of the ratio of (a) the number of erroneous declarations of confident differences to (b) the larger of the total number of declarations of confidence and 1. In other words, the B-H approach is designed to maintain at α/2 or below the probability that a confident direction will be asserted when, in fact, the population difference is in the opposite direction. When tests are independent, the B-H approach has been proven to control not only the false discovery rate, but also, when all population differences are zero, the familywise error rate.

Let $p_{\text{crit}}$ be the tail area (usually for each of two tails) of the null sampling distribution of the test statistic for any single comparison being judged by a multiplicity-respecting procedure.
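Two quantities from the preceding discussion, the inflation of the simultaneous error rate with family size and the false discovery fraction that B-H targets, can be illustrated with a short sketch. It is not taken from the report: it assumes independent comparisons whose population values are all exactly zero (the least favorable case), and the function names are hypothetical.

```python
def familywise_error_rate(m, per_comparison_alpha):
    """P(at least one Type I error) among m independent comparisons.

    Assumes every population comparison is exactly zero, so each test errs
    with probability per_comparison_alpha, independently of the others.
    """
    return 1 - (1 - per_comparison_alpha) ** m


def false_discovery_fraction(n_erroneous, n_declared):
    """Erroneous declarations divided by max(total declarations, 1).

    The max(..., 1) keeps the ratio defined (and equal to 0) when no
    confident direction is declared at all.
    """
    return n_erroneous / max(n_declared, 1)


if __name__ == "__main__":
    alpha = 0.05
    for m in (1, 10, 50):
        unadjusted = familywise_error_rate(m, alpha)
        bonferroni = familywise_error_rate(m, alpha / m)
        print(m, round(unadjusted, 4), round(bonferroni, 4))
    print(false_discovery_fraction(1, 8), false_discovery_fraction(0, 0))
```

With no adjustment the simultaneous rate grows quickly with `m`; dividing α by `m`, as Bonferroni does, holds it just under α for every family size in this idealized setting.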
Each procedure will stipulate a probability of error or average fraction of error that is bounded by α when judging confidence of direction. The value of $p_{\text{crit}}$ will depend on the sort of confidence to be attained. Let $m$ be the number of comparisons and $i = 1, \ldots, m$ be the rank of the p-value associated with the t-statistic for the comparison concerned when ordered from smallest to largest, so that the observed p-values ($p_{(i)}$ for the $i$th comparison) are weakly increasing from $i = 1$ to $i = m$. Four distinct approaches are defined as follows:

Bonferroni: the critical value of the statistic is such that $p_{\text{crit}} = p_{\text{BON}} = \alpha/(2m)$ in each tail of the distribution of that statistic.

Hochberg (1988): be confident of the observed direction of the $i$th comparison when, beginning with $i = m$ and continuing toward $i = 1$, $p_{(i)} \le p_{\text{crit}} = p_{\text{HOC}(i)} = \alpha/[2(m - i + 1)]$; then stop and declare a confident direction for all comparisons for which $j \le i$. Thus, $p_{\text{HOC}(i)} = m\,p_{\text{BON}}/(m - i + 1)$.

Benjamini and Hochberg (1995): be confident of the observed direction of the comparison when, beginning with the $m$th comparison, $p_{(i)} \le p_{\text{crit}} = p_{\text{B-H}(i)} = i\alpha/(2m)$; then stop and declare a confident direction for all comparisons for which $j \le i$.

Group                   1992       1990       1992       1990       1992       1990
…                       …          214(1.2)   267(1.1)>  263(1.6)   301(1.1)>  297(1.4)
Female                  217(1.0)>  212(1.1)   268(1.0)>  262(1.3)   297(1.0)>  292(1.3)
White                   227(0.9)>  220(1.1)   277(1.0)>  270(1.4)   305(0.9)>  300(1.2)
Black                   192(1.3)   189(1.8)   237(1.4)   238(2.7)   275(1.7)>  268(1.9)
Hispanic                201(1.4)   198(2.0)   246(1.2)   244(2.8)   283(1.8)>  276(2.8)
Asian/Pacific Islander  231(2.4)   228(3.5)   288(5.5)   279(4.8)!  315(3.5)   311(5.2)
American Indian         209(3.2)   208(3.9)   254(2.8)   246(9.4)   281(9.0)   288(10.2)r
Advantaged Urban        237(2.1)   231(3.0)   288(3.6)   280(3.2)   316(2.6)   306(6.2)
Disadvantaged Urban     193(2.8)   195(3.0)   238(2.6)<  249(3.8)!  279(2.4)   276(6.0)
Extreme Rural           216(3.6)   214(4.9)   267(4.6)   257(4.4)   293(1.9)   293(3.3)
Other                   219(0.9)>  213(1.1)   268(1.1)>  262(1.7)   300(0.9)>  295(1.3)
Northeast               223(2.0)>  215(2.9)   269(2.7)   270(2.8)   302(1.5)   300(2.3)
Southeast               210(1.6)>  205(2.1)   260(1.4)   255(2.5)   291(1.4)>  284(2.2)
Central                 223(1.9)>  216(1.7)   274(1.9)>  266(2.3)   303(1.8)   297(2.6)
West                    218(1.5)   216(2.4)   268(2.0)>  261(2.6)   298(1.7)   294(2.6)

> The value for 1992 was significantly higher than the value for 1990 at about the 95 percent confidence level.
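The Bonferroni, Hochberg, and B-H critical values defined earlier can be compared on a small family drawn from the table. The sketch below is illustrative only and makes several assumptions that the source does not confirm: that the values in parentheses are standard errors, that the 1992 and 1990 estimates are independent, and that a normal approximation is adequate. It uses the first 1992/1990 column pair for the four region rows as printed, so it is not expected to reproduce the table's own significance markers. All function and variable names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def tail_p(est_1992, se_1992, est_1990, se_1990):
    """One-tail area beyond |z| for the 1992 - 1990 difference."""
    z = (est_1992 - est_1990) / sqrt(se_1992 ** 2 + se_1990 ** 2)
    return NormalDist().cdf(-abs(z))


def bonferroni(p, alpha=0.05):
    """Single cutoff alpha / (2m) for every comparison."""
    m = len(p)
    return [p_k <= alpha / (2 * m) for p_k in p]


def step_up(p, p_crit):
    """Scan ranks i = m, m-1, ..., 1 and stop at the first p_(i) <= p_crit(i, m).

    Every comparison with rank j <= i is then declared confident.
    """
    m = len(p)
    order = sorted(range(m), key=lambda k: p[k])  # order[0] has the smallest p
    confident = [False] * m
    for i in range(m, 0, -1):
        if p[order[i - 1]] <= p_crit(i, m):
            for k in order[:i]:
                confident[k] = True
            break
    return confident


def hochberg(p, alpha=0.05):
    return step_up(p, lambda i, m: alpha / (2 * (m - i + 1)))


def benjamini_hochberg(p, alpha=0.05):
    return step_up(p, lambda i, m: i * alpha / (2 * m))


# (1992 estimate, 1992 SE, 1990 estimate, 1990 SE), copied from the table rows.
regions = {
    "Northeast": (223, 2.0, 215, 2.9),
    "Southeast": (210, 1.6, 205, 2.1),
    "Central": (223, 1.9, 216, 1.7),
    "West": (218, 1.5, 216, 2.4),
}
p_values = [tail_p(*row) for row in regions.values()]

if __name__ == "__main__":
    for label, procedure in (
        ("Bonferroni", bonferroni),
        ("Hochberg", hochberg),
        ("B-H", benjamini_hochberg),
    ):
        flagged = [r for r, c in zip(regions, procedure(p_values)) if c]
        print(label, flagged)
```

Under these assumptions, Bonferroni and Hochberg declare a confident direction only for Central, while B-H also declares one for Northeast, a small instance of the greater power the report attributes to B-H.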
