A Retrievable Recipe for Inverse t
Donald P. Gaver; Karen Kafadar
The American Statistician, Vol. 38, No. 4 (Nov., 1984), pp. 308-311.
Stable URL: http://links.jstor.org/sici?sici=0003-1305%28198411%2938%3A4%3C308%3AARRFIT%3E2.0.CO%3B2-R



A Retrievable Recipe for Inverse t DONALD P. GAVER and KAREN KAFADAR*

Although the percentage points of the Student-t distribution have been widely tabulated, a simple approximation is given and derived in this article. The approximation can be rederived easily, since it is based on the percentage points of the Gaussian distribution, and can thus be used for applications requiring non-integer degrees of freedom (e.g., Welch's two-sample t test) and for arbitrary significance levels (e.g., for Bonferroni multiple comparison procedures). Comparisons between this approximation and others suggested in the literature indicate three-digit accuracy for even small degrees of freedom and tail areas.

KEY WORDS: Student's t; Percentage points; Inverse t.

1. INTRODUCTION

Various critical values (synonymous with percentage points or evaluations of the inverse distribution function) of the classical Student-t distribution are frequently useful in applied statistics. Selected such values are, of course, widely tabulated: see Fisher and Yates (1963) and Pearson and Hartley (1976); the latter are also reproduced with extensions by E.T. Federighi (in Abramowitz and Stegun 1968). In certain circumstances, however, it is convenient to be able to compute t percentage points directly, accurately, and simply, without extensive tables except, perhaps, for a normal (Gaussian) table (but see Sec. 5). A simply derived, or retrievable, computational procedure for doing so is presented in this article. It can be carried out quickly on a handheld calculator and has been programmed, for example, on the TI-59, HP-41C, and HP-15C calculators as well as the TRS-80 microcomputer. It seems that the accuracy of the numerical values obtained, especially at the usually required levels (e.g., 95%), but also at much more extreme

*Donald P. Gaver is Professor, Department of Operations Research, Naval Postgraduate School, Monterey, CA 93943. Karen Kafadar is Statistician, Member of the Technical Staff, Hewlett-Packard, Stanford Park Division, Palo Alto, CA 94304. The research of D.P. Gaver was partially supported by the Office of Naval Research, Probability and Statistics Program. Most of this work was completed while K. Kafadar was at the National Bureau of Standards, Statistical Engineering Division. The authors wish to thank John Orav for helpful comments.

ones, coupled with the ease of their computation, should provide a tempting argument for their wide use.

Several similar approximations have appeared in various journals over the last two decades. Among the most successful of these is that derived by Peizer and Pratt (1968; PP):

t̂_n(α)_PP = {n exp[z²(α)(n - 5/6)/(n - 2/3 + .1/n)²] - n}^{1/2},  (1.1)

where α is the right single-tail probability, so 0 < α ≤ .5. Approximations based on asymptotic expansions appeared earlier (Wallace 1958, 1959) and were successful for moderate degrees of freedom and not-too-extreme tail areas. Other approaches have involved rational functions in the degrees of freedom (Gardiner and Bombay 1965; Kramer 1966) or the logistic distribution (Mudholkar and Chaubey 1975). A formula derived in Koehler (1983) is based on a novel data-analytic approach to the t tables pioneered by Hoaglin (1977); let t̂_n(α)_K represent Koehler's values. Further accurate approximations are reviewed by Bailey (1980).

Suggested approximations are often either simple but not terribly accurate, or extremely complicated and involving many coefficients. The present approach offers both simplicity and a high degree of accuracy, yielding two-digit or better accuracy for moderate degrees of freedom across a broad range of tail areas. We call it a retrievable recipe because the simple basic idea allows it to be rederived quickly when needed.
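For a quick numerical check, (1.1) needs only an upper-tail normal quantile. The sketch below follows our reading of the garbled constants in this copy (5/6, 2/3, and .1/n), and the function name t_pp is ours:

```python
import math
from statistics import NormalDist

def t_pp(alpha: float, n: float) -> float:
    """Peizer-Pratt (1.1): t = {n exp[z^2(a)(n - 5/6)/(n - 2/3 + .1/n)^2] - n}^(1/2),
    with z(a) the upper single-tail normal quantile and 0 < a <= .5."""
    z = NormalDist().inv_cdf(1.0 - alpha)        # z(alpha), upper tail
    expo = z * z * (n - 5.0 / 6.0) / (n - 2.0 / 3.0 + 0.1 / n) ** 2
    return math.sqrt(n * math.exp(expo) - n)

print(round(t_pp(0.05, 10), 3))   # near the tabulated t_10(.05) = 1.812
```

For n = 10 and α = .05 this reproduces the tabulated value 1.812 to about three decimals, consistent with the accuracy claims discussed below.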

2. DERIVATION

Examination of an extensive table of Student's t, or some mathematical analysis, shows that for α < .5 there is a monotonically increasing transformation that stretches a normal quantile z(α) into a Student-t quantile t_n(α). Let t_n(α) = φ_n(z(α)), with φ_n(·) representing the transform. We search for a simple approximation to φ_n(·); call it φ̂_n(·). By definition,

C(n) ∫_{t_n(α)}^∞ (1 + t²/n)^{-(n+1)/2} dt = α = (2π)^{-1/2} ∫_{z(α)}^∞ e^{-x²/2} dx,  (2.1)

where C(n) is the normalizing constant. Equivalently,

C(n) ∫_{z(α)}^∞ (1 + φ_n²(x)/n)^{-(n+1)/2} φ_n′(x) dx = (2π)^{-1/2} ∫_{z(α)}^∞ e^{-x²/2} dx.  (2.2)

Differentiation of both sides with respect to z now leads to

C(n) (1 + φ_n²(z)/n)^{-(n+1)/2} φ_n′(z) = (2π)^{-1/2} e^{-z²/2}.  (2.3)

Our approximation has its origin in the fact that t_n(α) approaches z(α) and dφ_n(z)/dz → 1 as n becomes large for fixed α. Consequently, simply allow the approximation φ̂_n(z) to satisfy

C(n) (1 + φ̂_n²(z)/n)^{-(n+1)/2} = (2π)^{-1/2} e^{-z²/2}  (2.4)

for every n. Solving (2.4) for φ̂_n(z) leads to an expression of the following general form:

φ̂_n²(z) = H(n)[K(n) e^{z²G(n)} - 1].  (2.5)

But for α = .5, z(α) = t_n(α) = 0, so K(n) = 1 for all n. To determine H(n) and G(n), consider matching expectations of random variables. On the left side of (2.5), E(t_n²) = var[t_n] = n/(n - 2); the right side requires the evaluation of E[exp(Z²G(n))], where Z is a unit normal random variable. Notice that this evaluation may be recovered easily from the moment generating function of the χ₁² distribution: E[exp(Z²G)] = (1 - 2G)^{-1/2}. Thus, taking H(n) = n, second moment matching gives

n[(1 - 2G(n))^{-1/2} - 1] = n/(n - 2), so that G(n) = (n - 3/2)/(n - 1)².  (2.6)

Our suggested first approximation (GK) is, then,

t̂_n(α)_GK(1) = {n[exp(z²(α)(n - 3/2)/(n - 1)²) - 1]}^{1/2}  (2.7)

for α < .50. Notice that this expression strongly resembles the Peizer-Pratt (1968) approximation but has a somewhat different exponent. Numerical examples (displayed later) also suggest that it is of acceptable accuracy, usually being somewhat superior to that of Peizer and Pratt. A distinctive feature of the above approximation (termed GK(1) for short) is its intuitively appealing and easily recollected derivation: it is retrievable. Note that this expression is convenient for simulating t values, as in Ury (1980). Iteration of the expression (i.e., replacing z by t_n on the right side of (2.7)) yields samples from even longer-tailed distributions; such may be useful in robustness studies.

3. IMPROVING THE ACCURACY OF THE APPROXIMATION

Before comparing numerically the accuracy of t̂_n(α)_PP with GK(1), we consider a method for improving the accuracy as follows. Let us assume that the true value of Student's t can be written as in (2.7) but with a slightly different tail area, that is, with a* a function of α:

t_n(α) = {n[exp(z²(a*)(n - 3/2)/(n - 1)²) - 1]}^{1/2}.  (3.1)

On rewriting (3.1), we see that

a*(n) = 1 - Φ({[ln(1 + t_n²(α)/n)][(n - 1)²/(n - 3/2)]}^{1/2}),  (3.2)

where Φ denotes the standard Gaussian cumulative distribution function. Figure 1 shows that ln(a*(n) - α) is roughly linear in ln(n) for several values of α. The least squares estimates for the slope and intercept for a few values of α are shown in Table 1. A typical value for the slope is taken to be -1.86; the intercept behaves like -3 + .62 ln α. Thus

a*(n) ≈ α + exp(-3 + .62 ln α) n^{-1.86} = α + e^{-3} α^{.62} n^{-1.86}.  (3.3)

So our improved percentage point should be

t̂_n(α)_GK(2) = {n[exp(z²(a*)(n - 3/2)/(n - 1)²) - 1]}^{1/2}.  (3.4)

Note that the adjustment to α in (3.3) decreases rapidly as n increases. Of course the above correction is empirical and probably can be improved further. Unfortunately, it is not easily retrieved in a manner analogous to the derivation of t̂_n(α)_GK(1).

4. COMPARING THE APPROXIMATIONS

Figure 2 compares the accuracy of the three approximations, (1.1), (2.7), and (3.4), and Koehler's (1983)

Figure 1. ln(a*(n) - α) versus ln(Degrees of Freedom). Alpha values are as follows: -, .05; - -, .01; - -, .001; ---, .0001; ----, .00001; and ...., .000001.

© The American Statistician, November 1984, Vol. 38, No. 4

Table 1. Linear Fits of ln(a*(n) - α) Versus ln(n), for α = .05, .02, .01, .005, .001, .0005, .00005, .00001, .000005, .000001. [Slopes are near -1.86 and intercepts range from -3.447 to -11.978, behaving like -3 + .62 ln α; the individual entries are not legible in this copy.]

formula as a function of x = -10 log(tail area), for n = 6, 10, 20, and 30, by plotting the relative error [= (t_n(α) - t̂_n(α))/t_n(α)]. Notice that in all of the graphs, the simple approximation given by GK(1) (2.7) is slightly better than that suggested by Peizer and Pratt (1968). Considerable improvement is attained by using the adjusted value of α given by GK(2) in (3.4). A few values of each approximation are tabulated in Table 2 and compared with the true percentage points. Notice that although GK(2) is initially worse than GK(1) for low degrees of freedom, it results in an extra digit of accuracy for moderate n and extremely small α. In fact, GK(2) yields two to three decimals of accuracy for n ≥ 10 over the entire range of α considered, .05-.000001. Koehler's formula is better for small n (= 4) and moderate α (≥ .025) and is about the same as GK(1) and GK(2) when n is very large (= 60). The choice of approximation at n = 60, however,

Figure 2. Comparison of Approximations: -, Koehler; - -, GK(1); - - -, GK(2); ...., Peizer-Pratt. x = -10 log(single tail area). (A) 6 df; (B) 10 df; (C) 20 df; (D) 30 df.

is possibly academic, as many users would be satisfied with Gaussian percentage points for such large degrees of freedom. In brief, GK(2) obtains an extra digit of accuracy for extreme tail areas and moderate degrees of freedom. Notice that the correction factor is essentially 0 for large n, so there is no advantage of GK(2) over GK(1) for n greater than, say, 30. All approximations requiring z(α) used formula 26.2.23 from Abramowitz and Stegun (1968) in the table and figures of comparisons. It may be noted that the approximation (2.7), GK(1), may be inverted to determine approximate probability values (so-called "p values"). A table of the Gaussian distribution or an approximation to the Gaussian percentage points is required.
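The recipe of (2.7) and (3.3)-(3.4), together with the p-value inversion just mentioned, can be sketched in a few lines. This is a sketch under our reconstruction of the garbled copy: the exponent (n - 3/2)/(n - 1)² and the empirical constants e^{-3}, .62, and -1.86 are as recovered above, and the function names are ours:

```python
import math
from statistics import NormalDist

_nd = NormalDist()

def t_gk1(alpha: float, n: float) -> float:
    """GK(1), eq. (2.7): t = {n[exp(z^2(a)(n - 3/2)/(n - 1)^2) - 1]}^(1/2)."""
    z = _nd.inv_cdf(1.0 - alpha)                 # upper-tail normal quantile
    return math.sqrt(n * (math.exp(z * z * (n - 1.5) / (n - 1.0) ** 2) - 1.0))

def t_gk2(alpha: float, n: float) -> float:
    """GK(2), eq. (3.4): GK(1) evaluated at the adjusted tail area a* of (3.3)."""
    a_star = alpha + math.exp(-3.0) * alpha ** 0.62 * n ** -1.86
    return t_gk1(a_star, n)

def p_from_t(t: float, n: float) -> float:
    """Invert (2.7): approximate upper-tail p value for an observed t."""
    z = math.sqrt(math.log(1.0 + t * t / n) * (n - 1.0) ** 2 / (n - 1.5))
    return 1.0 - _nd.cdf(z)
```

For n = 10 and α = .05, t_gk1 gives about 1.812 against the tabulated 1.812; p_from_t simply runs (2.7) backward, so p_from_t(t_gk1(a, n), n) recovers a.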

5. TOWARD A SIMPLE STAND-ALONE APPROXIMATION

It is tempting to calculate our t value approximations, which depend on tabulated normal values, with the aid of approximate normal values that can be computed easily from scratch. The result is a stand-alone t value approximation, accurate to nearly two digits over a surprisingly large range. Here is a suggested way of proceeding. With λ = .14, Tukey's λ distribution (see Tukey 1970, as referred to in McNeil 1977, p. 88) provides

z_T(α) ≈ 4.91[(1 - α)^{.14} - α^{.14}];  (5.1)

it yields inverse normal values to three-digit

10 log ( s ~ n q l etoll area

- -,

GK(1);

C The American Srarisricinn. .Yovemher. 1984. Vol. 38, 'Yo. 4

- - -, GK(2); . . . ., Peizer-Pratt.

(A) 6 df; (8)10 df; (C) 20 df; and

accuracy down to α = .01. To extend fairly satisfactorily to α = 10^{-6}, proceed as follows. Put α = 10^{-u} and write

10^{-u} = (2π)^{-1/2} ∫_{z(u)}^∞ e^{-t²/2} dt.  (5.2)

Now differentiate, and examine the result as u becomes large (cf. Feller 1957, p. 193):

(ln 10)10^{-u} = (2π)^{-1/2} e^{-z²(u)/2} dz(u)/du ≈ 10^{-u} z(u) dz(u)/du,  (5.3)

so that, approximately,

z(u) dz(u)/du = ln 10.  (5.4)

Integration gives

z(u) ≈ [z²(u₀) + 2(ln 10)(u - u₀)]^{1/2}  (5.5)

(for "large" u, here 2 < u ≤ 6). Take u₀ = -log(.01) = 2 and z(u₀) = z_T(.01) ≈ 2.33, and replace 2 ln 10 by 4.32 to achieve slightly better results. Then use these numbers to find, for α < .01,

z(u) ≈ (4.32u - 3.284)^{1/2}.  (5.6)

In summary, use the following prescription for the normal values: z_T(α) of (5.1) for .01 ≤ α ≤ .5, and

z(α) ≈ (-4.32 log α - 3.284)^{1/2},  α < 10^{-2},  (5.7)

with close to two-digit accuracy throughout the stated range. Refinement or improvement is possible, but at the apparent price of a more elaborate representation. Table 2 includes t values computed using the normal approximation (5.7). These are labeled t̂_n(α)_GK(3).

Table 2. Comparing Approximations

Single tail area, α:     .05     .025    .01     .005    .001    .0001
-10 log(tail area):       13      16      20      23      30      40

[Entries for n = 4, 10, 20, and 30 are not legible in this copy.]

n = 60:
  True    1.671   2.000   2.390   2.660   3.232   3.962
  K       1.668   1.996   2.383   2.650   3.218   3.953
  PP      1.671   2.001   2.391*  2.661*  3.232*  3.963*
  GK(1)   1.668   1.996   2.383   2.650   3.218   3.953
  GK(2)   1.668   1.996   2.383   2.650   3.218   3.953
  GK(3)   1.680   2.013   2.401   2.665   3.255   3.989

*Closest approximation to true value.

[Received April 1983. Revised May 1984.]
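Combining the prescription (5.7) with (2.7) gives a fully stand-alone t quantile, needing no tables at all. This sketch assumes the constants as reconstructed from this copy (4.91 and .14 in (5.1); 4.32 and 3.284 in (5.6)), and the names z_standalone and t_gk3 are ours:

```python
import math

def z_standalone(alpha: float) -> float:
    """Section 5 prescription (5.7) for the upper-tail normal quantile:
    the Tukey-lambda form (5.1) for alpha >= .01, the extension (5.6) below it."""
    if alpha >= 0.01:
        return 4.91 * ((1.0 - alpha) ** 0.14 - alpha ** 0.14)
    return math.sqrt(-4.32 * math.log10(alpha) - 3.284)

def t_gk3(alpha: float, n: float) -> float:
    """GK(3): eq. (2.7) evaluated with the stand-alone normal values."""
    z = z_standalone(alpha)
    return math.sqrt(n * (math.exp(z * z * (n - 1.5) / (n - 1.0) ** 2) - 1.0))
```

As a check, z_standalone(.01) gives about 2.326 against the true 2.326, and z_standalone(.001) gives about 3.11 against the true 3.09, consistent with the near-two-digit accuracy claimed above.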

REFERENCES

ABRAMOWITZ, M., and STEGUN, I. (1968), Handbook of Mathematical Functions, No. 55, Applied Mathematics Series, Washington, D.C.: U.S. Government Printing Office.
BAILEY, B.J.R. (1980), "Accurate Normalizing Transformations of a Student's t Variable," Applied Statistics, 29, 304-306.
FELLER, W. (1957), An Introduction to Probability Theory and Its Applications (Vol. 1), New York: John Wiley.
FISHER, R.A., and YATES, F. (1963), Statistical Tables, London: Oliver & Boyd.
GARDINER, DONALD A., and BOMBAY, BARBARA FLORES (1965), "An Approximation to Student's t," Technometrics, 7, 71-72.
HOAGLIN, D.C. (1977), "Direct Approximations for Chi-Squared Percentage Points," Journal of the American Statistical Association, 72, 508-515.
KOEHLER, K.J. (1983), "A Simple Approximation for the Percentiles of the t Distribution," Technometrics, 25, 103-106.
KRAMER, CLYDE Y. (1966), "Approximation to the Cumulative t Distribution," Technometrics, 8, 358-359.
McNEIL, DONALD R. (1977), Interactive Data Analysis, New York: John Wiley.
MUDHOLKAR, GOVIND S., and CHAUBEY, YOGENDRA P. (1975), "Use of the Logistic Distribution for Approximating Probabilities and Percentiles of Student's Distribution," Journal of Statistical Research, 9, 1-9.
PEARSON, E.S., and HARTLEY, H.O. (1976), Biometrika Tables for Statisticians, London: Biometrika Trust.
PEIZER, DAVID B., and PRATT, JOHN W. (1968), "A Normal Approximation for Binomial, F, Beta, and Other Common, Related Tail Probabilities, I," Journal of the American Statistical Association, 63, 1416-1456.
TUKEY, J.W. (1970), Exploratory Data Analysis (Vol. 3; limited preliminary ed.), Reading, Mass.: Addison-Wesley.
URY, HANS K. (1980), "Calculator Quirks," RSS News and Notes, 7 (4), 83.
WALLACE, DAVID L. (1958), "Asymptotic Approximations to Distributions," Annals of Mathematical Statistics, 29, 635-654.
WALLACE, DAVID L. (1959), "Bounds on Normal Approximations to Student's t and the Chi-Square Distributions," Annals of Mathematical Statistics, 30, 1121-1130.
