Joint Statistical Meetings - Statistical Computing Section
NUMERICAL COMPUTATION of HIGH DIMENSIONAL INTEGRALS for MULTIPLE COMPARISON PROBLEMS
Alan Genz, Washington State University
Frank Bretz, University of Hannover
Math Department, WSU, Pullman, WA 99164-3113, [email protected]
Key Words: Multivariate-t, Multiple Comparison
Abstract: We consider classes of high dimensional integrals that are needed for the computation of critical values for multiple comparison problems. The numerical integration problems involve computation of multivariate t-distribution values with integration over regions determined by sets of linear inequalities. We discuss techniques for reduction of dimensionality based on the analysis and simplification of these sets of inequalities. We also consider efficient numerical integration algorithms for the final integrals. The use of the algorithms will be illustrated with examples from multiple comparison problem applications.

1. Introduction

In many statistical applications, the data model is a general linear model in the form $Y = X\beta + \epsilon$, where the problem data are given by the $N \times 1$ data vector $Y$, $X$ is an $N \times p$ design matrix, $\beta$ is a $p \times 1$ unknown parameter vector, and $\epsilon$ is a normally distributed $N \times 1$ error vector with unknown variance $\sigma^2$. For multiple comparison problems (see Hochberg and Tamhane, 1987; Hsu, 1992 and 1996; Hsu and Nelson, 1998; and Westfall, Tobias, Rom and Wolfinger, 1999), an $m \times p$ comparison (or contrast) matrix $C$ is also given, and the resulting $m \times m$ covariance matrix is $\Sigma = C(X^tX)^{-1}C^t$. The basic numerical problem is to determine confidence intervals (CI's) for $x_i = \sum_{j=1}^{p} c_{i,j}\beta_j$, $i = 1, \ldots, m$. The distribution for $x$ is an m-variate Student's t, with covariance matrix $\Sigma$ and degrees of freedom $\nu = N - \mathrm{rank}(X)$. The desired confidence intervals are given by:

$$x_i \pm t_{1-\alpha}\,\hat{\sigma}\,\bigl(c_i (X^tX)^{-1} c_i^t\bigr)^{1/2}.$$

High dimensional integration problems arise with the "all-pairs" comparison cases (Stoline, 1981), where $m = p(p-1)/2$.

The multivariate t (MVT) distribution $T_\nu(a, b, \Sigma)$ (see Tong, 1990) is defined by

$$T_\nu(a, b, \Sigma) = \frac{K_{m,\nu}}{\sqrt{|\Sigma|}} \int_{a_1}^{b_1}\int_{a_2}^{b_2}\cdots\int_{a_m}^{b_m} \Bigl(1 + \frac{x^t\Sigma^{-1}x}{\nu}\Bigr)^{-\frac{\nu+m}{2}} dx \;\equiv\; \frac{2^{1-\nu/2}}{\Gamma(\nu/2)} \int_0^\infty s^{\nu-1} e^{-s^2/2}\, \Phi\Bigl(\frac{sa}{\sqrt{\nu}}, \frac{sb}{\sqrt{\nu}}, \Sigma\Bigr) ds,$$

where

$$K_{m,\nu} = \frac{\Gamma\bigl(\frac{\nu+m}{2}\bigr)}{\Gamma\bigl(\frac{\nu}{2}\bigr)(\nu\pi)^{m/2}},$$

$x = (x_1, x_2, \ldots, x_m)^t$, $-\infty \le a_i < b_i \le \infty$ for all $i$, and $\Sigma$ is a positive semi-definite symmetric $m \times m$ covariance matrix. The second form given for the MVT distribution (Cornish, 1954) uses the multivariate normal distribution function

$$\Phi(a, b, \Sigma) = \frac{1}{\sqrt{|\Sigma|(2\pi)^m}} \int_{a_1}^{b_1}\int_{a_2}^{b_2}\cdots\int_{a_m}^{b_m} e^{-\frac{1}{2}x^t\Sigma^{-1}x}\, dx.$$
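As a quick sanity check of the second form, the following Python sketch evaluates it for the univariate case m = 1, Σ = 1 with Simpson's rule and compares the result with closed-form Student's t probabilities for ν = 1 and ν = 2. The function names, the truncation point of the s-integral, and the choice of Simpson's rule are assumptions made for illustration only; this is not the integration algorithm discussed in this paper, and the sketch assumes finite limits a and b.

```python
import math

def norm_cdf(x):
    """Standard normal distribution function via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def mvt_prob_m1(a, b, nu, n=4000):
    """T_nu(a, b, 1) from the second form, for m = 1 and finite a < b.

    Integrates 2^(1-nu/2)/Gamma(nu/2) * s^(nu-1) * exp(-s^2/2)
    * [Phi(s*b/sqrt(nu)) - Phi(s*a/sqrt(nu))] over s by Simpson's rule.
    """
    upper = math.sqrt(nu) + 12.0      # assumed cutoff; the chi weight is negligible beyond it
    log_c = (1.0 - nu / 2.0) * math.log(2.0) - math.lgamma(nu / 2.0)
    root_nu = math.sqrt(nu)

    def f(s):
        if s == 0.0:
            return 0.0                # integrand vanishes at s = 0 for finite a, b
        weight = math.exp(log_c + (nu - 1.0) * math.log(s) - 0.5 * s * s)
        return weight * (norm_cdf(s * b / root_nu) - norm_cdf(s * a / root_nu))

    h = upper / n                     # n must be even for Simpson's rule
    total = f(0.0) + f(upper)
    for k in range(1, n):
        total += (4.0 if k % 2 else 2.0) * f(k * h)
    return total * h / 3.0

def t_cdf_nu1(x):
    """Closed-form Student's t distribution function for nu = 1 (Cauchy)."""
    return 0.5 + math.atan(x) / math.pi

def t_cdf_nu2(x):
    """Closed-form Student's t distribution function for nu = 2."""
    return 0.5 + x / (2.0 * math.sqrt(2.0 + x * x))

print(mvt_prob_m1(-1.0, 1.0, 1), t_cdf_nu1(1.0) - t_cdf_nu1(-1.0))
print(mvt_prob_m1(-1.0, 2.0, 2), t_cdf_nu2(2.0) - t_cdf_nu2(-1.0))
```

The two numbers on each printed line should agree to several digits, which illustrates that the mixture over s of scaled normal probabilities reproduces the t probability.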
The present authors have developed efficient and robust numerical methods for MVT probability computations for problems where $m$ is in the range 1 to 20 (Genz and Bretz, 1999–2002), but these methods require significantly more computational resources as $m$ increases. The multiple comparison confidence interval problem, given a desired confidence level $1 - \alpha$, is to determine the critical value $t_{1-\alpha}$ satisfying $P(t_{1-\alpha}) = 1 - \alpha$, where $P(t) = T_\nu(-\boldsymbol{\infty}, \mathbf{t}, \Sigma)$ for one-sided CI's, or $P(t) = T_\nu(-\mathbf{t}, \mathbf{t}, \Sigma)$ for two-sided CI's, with $\mathbf{t} = t\,(d_1, \ldots, d_m)^t$ (for $d_i > 0$; usually with all $d_i = 1$) and $\boldsymbol{\infty} = (\infty, \ldots, \infty)^t$. The computation of $t_{1-\alpha}$ requires the use of a numerical root-finding method to find a zero of the function $h(t) = P(t) - (1 - \alpha)$, so repeated evaluation of $h(t)$ is necessary for accurate determination of $t_{1-\alpha}$. It is usually possible to use second order modified Bonferroni bounds (Hsu, 1996) to find a small initial interval $[t_a, t_b]$ with $t_{1-\alpha} \in [t_a, t_b]$. The second order bounds use only bivariate t probabilities (Dunnett and Sobel, 1954). The midpoint of the bounding interval provides a good starting value for the Newton-Raphson iteration, which can be used to compute an accurate approximation to $t_{1-\alpha}$. More details for this procedure can be found in an earlier paper by the present authors (Genz and Bretz, 2000). This procedure can often reduce the number of $P(t)$ evaluations needed for the determination of $t_{1-\alpha}$, but some evaluations of the high-dimensional integrals for $P(t)$ are still required.
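The root-finding step can be sketched in Python as follows. To keep the sketch self-contained, P(t) is computed only for the illustrative case Σ = I with all d_i = 1, where the inner normal probability in the second form of the MVT distribution factors into a product. The bracketing interval is simply supplied by the caller (the second order bounds are not implemented), and the derivative is approximated by a central difference. All function names, step sizes, and the clamping to the bracket are assumptions of this sketch, not details taken from the procedure described above.

```python
import math

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def p_two_sided(t, m, nu, n=2000):
    """P(t) = T_nu(-t, t, I_m) for the identity covariance matrix (illustrative case).

    Given s, the m normal components are independent, so the inner
    probability is [2*Phi(s*t/sqrt(nu)) - 1]^m; the s-integral uses Simpson's rule.
    """
    upper = math.sqrt(nu) + 12.0      # assumed cutoff for the s-integral
    log_c = (1.0 - nu / 2.0) * math.log(2.0) - math.lgamma(nu / 2.0)
    root_nu = math.sqrt(nu)

    def f(s):
        if s == 0.0:
            return 0.0                # the inner probability is zero at s = 0
        weight = math.exp(log_c + (nu - 1.0) * math.log(s) - 0.5 * s * s)
        return weight * (2.0 * norm_cdf(s * t / root_nu) - 1.0) ** m

    h = upper / n
    total = f(0.0) + f(upper)
    for k in range(1, n):
        total += (4.0 if k % 2 else 2.0) * f(k * h)
    return total * h / 3.0

def critical_value(p_func, alpha, t_lo, t_hi, tol=1e-3, max_iter=20):
    """Newton iteration for h(t) = P(t) - (1 - alpha), started at the bracket midpoint."""
    t = 0.5 * (t_lo + t_hi)
    dt = 1e-4                          # central-difference step (assumed)
    for _ in range(max_iter):
        h_val = p_func(t) - (1.0 - alpha)
        slope = (p_func(t + dt) - p_func(t - dt)) / (2.0 * dt)
        t_new = min(max(t - h_val / slope, t_lo), t_hi)   # stay inside the bracket
        if abs(t_new - t) < tol:
            return t_new
        t = t_new
    return t

# Hypothetical example: m = 3 uncorrelated two-sided comparisons, nu = 20, alpha = 0.1.
t_crit = critical_value(lambda t: p_two_sided(t, 3, 20), 0.1, 1.5, 3.5)
print(t_crit, p_two_sided(t_crit, 3, 20))
```

Each Newton step here costs three evaluations of P(t), which is the reason a tight starting bracket matters when every evaluation is itself a high dimensional integral.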
2. Dimension Reduction
The numerical integration method developed by the present authors (Genz and Bretz, 2002) for MVT probability computations uses the multivariate normal definition of the MVT distribution and begins with the Cholesky decomposition change of variables $x = Ly$, where $L$ is lower triangular with $LL^t = \Sigma$, so that $x^t\Sigma^{-1}x = y^ty$. The result of the application of this change of variables is

$$T_\nu(a, b, \Sigma) = \frac{2^{1-\nu/2}}{\Gamma(\nu/2)} \int_0^\infty s^{\nu-1} e^{-s^2/2} \int_{a_1(s)}^{b_1(s)} \frac{e^{-y_1^2/2}}{\sqrt{2\pi}} \int_{a_2(s,y_1)}^{b_2(s,y_1)} \frac{e^{-y_2^2/2}}{\sqrt{2\pi}} \cdots \int_{a_m(s,y_1,\ldots,y_{m-1})}^{b_m(s,y_1,\ldots,y_{m-1})} \frac{e^{-y_m^2/2}}{\sqrt{2\pi}}\, dy\, ds,$$

with

$$b_i(s, y_1, \ldots, y_{i-1}) = \Bigl(\frac{s}{\sqrt{\nu}}\, b_i - \sum_{j=1}^{i-1} l_{i,j}\, y_j\Bigr)\Big/ l_{i,i},$$

and

$$a_i(s, y_1, \ldots, y_{i-1}) = \Bigl(\frac{s}{\sqrt{\nu}}\, a_i - \sum_{j=1}^{i-1} l_{i,j}\, y_j\Bigr)\Big/ l_{i,i}.$$
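The change of variables and the resulting nested limits can be illustrated with a short Python sketch. The helper names `cholesky` and `limits`, the 2 × 2 covariance matrix, and the numerical values of a, b, s and ν are hypothetical choices made for illustration. The sketch picks each y_i between its limits and confirms that x = Ly then lies in the scaled box.

```python
import math

def cholesky(sigma):
    """Lower triangular L with L L^t = sigma (sigma symmetric positive definite)."""
    m = len(sigma)
    L = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i + 1):
            acc = sigma[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(acc) if i == j else acc / L[j][j]
    return L

def limits(L, a, b, s, nu, y_prev):
    """Limits (a_i, b_i) for y_i, given s and the earlier variables y_1, ..., y_{i-1}."""
    i = len(y_prev)                          # zero-based index of the current variable
    scale = s / math.sqrt(nu)
    shift = sum(L[i][j] * y_prev[j] for j in range(i))
    return ((scale * a[i] - shift) / L[i][i],
            (scale * b[i] - shift) / L[i][i])

# Hypothetical 2 x 2 example with s / sqrt(nu) = 1.
sigma = [[1.0, 0.5], [0.5, 1.0]]
a, b = [-1.0, -2.0], [2.0, 1.0]
nu, s = 4.0, 2.0
L = cholesky(sigma)

y = []
for _ in range(len(sigma)):
    lo, hi = limits(L, a, b, s, nu, y)
    y.append(0.5 * (lo + hi))                # any point between the limits would do

x = [sum(L[i][j] * y[j] for j in range(len(y))) for i in range(len(y))]
print(L, y, x)
```

Because the limits for y_i depend only on s and the earlier variables, the integrals can be evaluated (or sampled) one variable at a time, which is what makes the lower triangular form useful.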
Additional transformations (see Genz and Bretz, 2002) reduce this multidimensional integral to an integral over the m-dimensional unit hypercube, so that an efficient quasi-Monte Carlo integration method can be used to compute the desired probability.

The multiple comparison problem $m \times m$ covariance matrix has the form $\Sigma = C(X^tX)^{-1}C^t$ and the original inner integration region has the form

$$\frac{s}{\sqrt{\nu}}\, a \le x \le \frac{s}{\sqrt{\nu}}\, b,$$

but this problem is equivalent to a problem with covariance matrix $\hat{\Sigma} = (X^tX)^{-1}$ and integration region defined by

$$\frac{s}{\sqrt{\nu}}\, a \le Cw \le \frac{s}{\sqrt{\nu}}\, b.$$

The equivalence of these two problem formulations can be understood by applying the transformation $x = Cw$ to the second problem. The inner integration dimension for the second problem formulation is $p$, with $m$ constraints for integration. After the use of the transformation $w = Ly$, where $\hat{\Sigma} = LL^t$ is the Cholesky factorization for $\hat{\Sigma}$, the inner integrand becomes $e^{-y^ty/2}$ and the inner integration region is defined by the $m$ constraints

$$\frac{s}{\sqrt{\nu}}\, a \le CLy \le \frac{s}{\sqrt{\nu}}\, b.$$

Unfortunately, these integration region inequalities are not in the lower triangular form that is needed for the use of the additional transformations that allow a transformation to a unit hypercube. However, a $p \times p$ orthogonal matrix $Q$ can be constructed (as a product of orthogonal reflectors, using standard linear algebra techniques) so that $CLQ^t = \bar{L}$, with $\bar{L}$ lower triangular. If we let $v = Qy$, then $v^tv = y^ty$, and the inner integrand becomes $e^{-v^tv/2}$ with integration region determined by

$$\frac{s}{\sqrt{\nu}}\, a \le \bar{L}v \le \frac{s}{\sqrt{\nu}}\, b.$$

This integration region is now in a form that allows application of the additional transformations to a unit hypercube.

The transformations that we have just described can also be used for singular problems (Genz and Kwong, 1999), where $\mathrm{rank}(\hat{\Sigma}) = r < p$. For these problems, and for problems where $\mathrm{rank}(C) < p$, the Cholesky decomposition can be constructed with $\bar{l}_{i,j} = 0$ for all $j > r$. The resulting lower triangular Cholesky factor can be written in the form

$$\bar{L} = [\,L \;\; 0\,] = \begin{bmatrix}
* & 0 & \cdots & \cdots & 0 & 0 & \cdots & 0 \\
\vdots & \vdots & & & \vdots & \vdots & & \vdots \\
* & 0 & \cdots & \cdots & 0 & 0 & \cdots & 0 \\
? & * & 0 & \cdots & 0 & 0 & \cdots & 0 \\
\vdots & \vdots & \vdots & & \vdots & \vdots & & \vdots \\
? & * & 0 & \cdots & 0 & 0 & \cdots & 0 \\
\vdots & \vdots & & & \vdots & \vdots & & \vdots \\
? & ? & \cdots & ? & * & 0 & \cdots & 0 \\
\vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots \\
? & ? & \cdots & ? & * & 0 & \cdots & 0
\end{bmatrix}
\begin{matrix} 1 \\ \vdots \\ m_1 \\ 1 \\ \vdots \\ m_2 \\ \vdots \\ 1 \\ \vdots \\ m_r \end{matrix}$$

where $*$ indicates a nonzero entry, ? indicates an entry that could be zero or nonzero, $\sum_{i=1}^{r} m_i = m$, and $L$ has columns $1, \ldots, r$ of $\bar{L}$. The final constraint set is defined by

$$\frac{s}{\sqrt{\nu}}\, a \le Lv \le \frac{s}{\sqrt{\nu}}\, b,$$
with $v = (v_1, \ldots, v_r)^t$. There is no constraint on the variables $v_{r+1}, \ldots, v_p$, so the integrals for these variables are all equal to one, and the final integration problem has $r$ variables. The multiple integral for $T_\nu(a, b, \Sigma)$ can now be written in the form

$$T_\nu(a, b, \Sigma) = \frac{2^{1-\nu/2}}{\Gamma(\nu/2)} \int_0^\infty s^{\nu-1} e^{-s^2/2} \int_{a_1(s)}^{b_1(s)} \frac{e^{-v_1^2/2}}{\sqrt{2\pi}} \int_{a_2(s,v_1)}^{b_2(s,v_1)} \frac{e^{-v_2^2/2}}{\sqrt{2\pi}} \cdots \int_{a_r(s,v_1,\ldots,v_{r-1})}^{b_r(s,v_1,\ldots,v_{r-1})} \frac{e^{-v_r^2/2}}{\sqrt{2\pi}}\, dv\, ds,$$

with

$$b_i(s, v_1, \ldots, v_{i-1}) = \min_{m_{i-1}+1 \le k \le m_i} \Bigl(\frac{s}{\sqrt{\nu}}\, b_k - \sum_{j=1}^{i-1} l_{k,j}\, v_j\Bigr)\Big/ l_{k,i},$$

and

$$a_i(s, v_1, \ldots, v_{i-1}) = \max_{m_{i-1}+1 \le k \le m_i} \Bigl(\frac{s}{\sqrt{\nu}}\, a_k - \sum_{j=1}^{i-1} l_{k,j}\, v_j\Bigr)\Big/ l_{k,i},$$

where the minimum and maximum are taken over the rows $k$ of $L$ that belong to the $i$-th block. Now that the problem is in this form, the same additional transformations that are used for the nonsingular case can be used to reduce this multidimensional integral to an integral over the r-dimensional unit hypercube, so that an efficient quasi-Monte Carlo integration method can be applied.
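The reduction can be sketched in Python for a small all-pairs problem (p = 4, with assumed sample sizes 12, 14, 16, 18). The construction described above builds Q from orthogonal reflectors; this sketch instead applies classical Gram-Schmidt to the rows of CL, which gives a factor of the same staircase form for a small, well-conditioned example. The function names, the tolerance, the grouping of rows by their last nonzero column, and the swap of limits for a negative pivot are all assumptions made for illustration.

```python
import math

def staircase_factor(A, tol=1e-10):
    """Classical Gram-Schmidt on the rows of A (a stand-in for the reflector approach).

    Returns an m x r matrix L_red in lower staircase form with
    L_red L_red^t = A A^t, where r is the numerical rank of A.
    """
    basis, rows = [], []
    for a_row in A:
        coeffs = [sum(x * q for x, q in zip(a_row, q_vec)) for q_vec in basis]
        resid = list(a_row)
        for c, q_vec in zip(coeffs, basis):
            resid = [x - c * q for x, q in zip(resid, q_vec)]
        norm = math.sqrt(sum(x * x for x in resid))
        if norm > tol:                        # this row adds a new direction
            basis.append([x / norm for x in resid])
            coeffs.append(norm)
        rows.append(coeffs)
    r = len(basis)
    return [c + [0.0] * (r - len(c)) for c in rows]

def block_limits(L, a, b, s, nu, v_prev, tol=1e-10):
    """Limits (a_i, b_i) for v_i: max/min over the rows whose last nonzero entry is in column i."""
    i = len(v_prev)
    scale = s / math.sqrt(nu)
    lo, hi = -math.inf, math.inf
    for row, a_k, b_k in zip(L, a, b):
        if abs(row[i]) <= tol or any(abs(x) > tol for x in row[i + 1:]):
            continue                          # row k is not in block i
        shift = sum(row[j] * v_prev[j] for j in range(i))
        lower = (scale * a_k - shift) / row[i]
        upper = (scale * b_k - shift) / row[i]
        if row[i] < 0.0:                      # dividing by a negative pivot flips the inequality
            lower, upper = upper, lower
        lo, hi = max(lo, lower), min(hi, upper)
    return lo, hi

# Hypothetical example: all-pairs contrasts for p = 4 groups.
sizes = [12, 14, 16, 18]
p = len(sizes)
C = [[-1, 1, 0, 0], [-1, 0, 1, 0], [-1, 0, 0, 1],
     [0, -1, 1, 0], [0, -1, 0, 1], [0, 0, -1, 1]]
# (X^t X)^{-1} = diag(1/n_i), so its Cholesky factor is diag(1/sqrt(n_i)).
A = [[c / math.sqrt(n) for c, n in zip(row, sizes)] for row in C]
L_red = staircase_factor(A)
r = len(L_red[0])                             # r = p - 1 integration variables remain

nu = sum(sizes) - p
s = math.sqrt(nu)                             # chosen so that s / sqrt(nu) = 1
a, b = [-1.0] * len(C), [2.0] * len(C)
v = []
for _ in range(r):
    lo, hi = block_limits(L_red, a, b, s, nu, v)
    v.append(0.5 * (lo + hi))                 # midpoint of the admissible interval
x = [sum(l * vj for l, vj in zip(row, v)) for row in L_red]
print(r, x)
```

In this example the six constraints are expressed in three variables, and every component of x = L_red v lies between -1 and 2, so the sequential limits reproduce the original constraint set.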
3. Examples
• Consider an all-pairs comparison problem with p = 10 and sample sizes 12, 14, . . . , 30. In this case ν = 200, $(X^tX)^{-1}$ is a diagonal matrix with entries 1/12, 1/14, . . . , 1/30, and the comparison matrix C has 45 rows in the form

$$C = \begin{bmatrix}
-1 & 1 & 0 & \cdots & \cdots & \cdots & 0 \\
-1 & 0 & 1 & 0 & \cdots & \cdots & 0 \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
-1 & 0 & \cdots & \cdots & \cdots & 0 & 1 \\
0 & -1 & 1 & \cdots & \cdots & \cdots & 0 \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
0 & -1 & 0 & \cdots & \cdots & 0 & 1 \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
0 & \cdots & \cdots & \cdots & 0 & -1 & 1
\end{bmatrix}.$$

If the transformations that we have previously described are applied, then the final Cholesky factor L takes the form

$$L = \begin{bmatrix}
* & 0 & \cdots & \cdots & 0 \\
? & * & 0 & \cdots & 0 \\
\vdots & \vdots & \ddots & \ddots & \vdots \\
? & ? & \cdots & ? & * \\
\vdots & \vdots & & \vdots & \vdots \\
? & ? & \cdots & ? & ?
\end{bmatrix};$$

here $*$ indicates a nonzero entry and ? an entry that could be zero or nonzero. L has 9 columns. If α = 0.1, second order bounds that use only bivariate T values can be used to determine a $t_{1-\alpha}$ bounding interval $[t_a, t_b]$ = [2.8016, 3.0637]. The Newton-Raphson iteration with error tolerance τ = 0.001 converges in two iterations to $t_{1-\alpha}$ ≈ 2.938. A total of 104880 integrand evaluations were used to compute the required P(t) values. If α = 0.05, second order bounds can be used to determine $[t_a, t_b]$ = [3.0819, 3.2814]. The Newton-Raphson iteration with error tolerance τ = 0.001 converges in two iterations to $t_{1-\alpha}$ ≈ 3.192 using 126656 integrand evaluations.

• Consider an all-pairs comparison problem with p = 16 and sample sizes 12, 14, . . . , 42. In this case ν = 416, $(X^tX)^{-1}$ is a diagonal matrix with entries 1/12, 1/14, . . . , 1/42, and the comparison matrix C has 120 rows in the form

$$C = \begin{bmatrix}
-1 & 1 & 0 & \cdots & \cdots & \cdots & 0 \\
-1 & 0 & 1 & 0 & \cdots & \cdots & 0 \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
-1 & 0 & \cdots & \cdots & \cdots & 0 & 1 \\
0 & -1 & 1 & \cdots & \cdots & \cdots & 0 \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
0 & -1 & 0 & \cdots & \cdots & 0 & 1 \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
0 & \cdots & \cdots & \cdots & 0 & -1 & 1
\end{bmatrix}.$$

The final Cholesky factor L takes the form

$$L = \begin{bmatrix}
* & 0 & \cdots & \cdots & 0 \\
? & * & 0 & \cdots & 0 \\
\vdots & \vdots & \ddots & \ddots & \vdots \\
? & ? & \cdots & ? & * \\
\vdots & \vdots & & \vdots & \vdots \\
? & ? & \cdots & ? & ?
\end{bmatrix};$$
L has 15 columns. If α = 0.1, second order bounds can be used to determine a t1−α bounding interval [ta , tb ] = [3.0194, 3.3366]. The Newton-Raphson iteration with error tolerance τ = 0.001 converges with two iterations to t1−α ≈ 3.197 using 661056 integrand evaluations. If α = 0.05, second order bounds can be used to determine a t1−α bounding interval [ta , tb ] = [3.3008, 3.5345]. The Newton-Raphson iteration with error tolerance τ = 0.001 converges with two iterations to t1−α ≈ 3.432 using 1495232 integrand evaluations.
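The problem sizes quoted in these examples can be checked with a few lines of Python. The sketch assumes a one-way layout in which rank(X) equals the number of groups p, and sample sizes that start at 12 and increase in steps of 2; the function names are hypothetical.

```python
def all_pairs_contrasts(p):
    """Rows e_j - e_i for all i < j, in the row order used for the all-pairs matrix C."""
    rows = []
    for i in range(p - 1):
        for j in range(i + 1, p):
            row = [0] * p
            row[i], row[j] = -1, 1
            rows.append(row)
    return rows

def degrees_of_freedom(sample_sizes):
    """nu = N - rank(X), assuming rank(X) equals the number of groups."""
    return sum(sample_sizes) - len(sample_sizes)

for p in (10, 16):
    sizes = [12 + 2 * k for k in range(p)]    # assumed: 12, 14, ... in steps of 2
    C = all_pairs_contrasts(p)
    print(p, len(C), degrees_of_freedom(sizes))
```

For p = 10 this gives m = 45 and ν = 200, and for p = 16 it gives m = 120 and ν = 416. Since every row of C sums to zero, rank(C) is at most p − 1, which is consistent with the 9 and 15 columns reported for the final Cholesky factors.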
4. Conclusions
Numerical evaluation of high dimensional multivariate t distribution integrals is needed for the computation of critical values for many types of multiple comparison problems. Fortunately, the structure of the sets of inequalities that define the integration regions for multiple comparison problems allows a transformation of the problem variables. The result of this transformation is a significant reduction in the number of integration variables for the final numerical integration problems, which allows the desired critical values to be accurately computed using quasi-Monte Carlo integration methods that were developed for multivariate t distribution computations.
5. References
• Cornish, E.A. (1954), ‘The Multivariate t-Distribution Associated with a Set of Normal Sample Deviates’, Australian Journal of Physics 7, pp. 531–542.
• Dunnett, C.W. and Sobel, M. (1954), ‘A Bivariate Generalization of Student’s t-Distribution, with Tables for Certain Special Cases’, Biometrika 41, pp. 153–169.
• Genz, A. and Bretz, F. (1999), ‘Numerical Computation of the Multivariate t Probabilities with Application to Power Calculation of Multiple Contrasts’, J. Stat. Comput. Simul. 63, pp. 361–378.
• Genz, A. and Bretz, F. (2000), ‘Numerical Computation of Critical Values for Multiple Comparison Problems’, in Proceedings of the Statistical Computing Section, American Statistical Association, Alexandria, VA, 2000, pp. 84–87.
• Genz, A. and Bretz, F. (2002), ‘Methods for the Computation of Multivariate t Probabilities’, J. Comp. Graph. Stat. 11, in press.
• Genz, A. and Kwong, K.S. (1999), ‘Numerical Evaluation of Singular Multivariate Normal Distributions’, J. Stat. Comp. Simul. 68, pp. 1–21.
• Hochberg, Y. and Tamhane, A.C. (1987), Multiple Comparison Procedures, John Wiley and Sons, New York.
• Hsu, Jason C. (1996), Multiple Comparisons, Chapman and Hall, London.
• Hsu, Jason C. (1992), ‘Simultaneous Inference in the General Linear Model’, J. Comput. Graph. Stat. 1, pp. 151–168.
• Hsu, Jason C. and Nelson, Barry (1998), ‘Multiple Comparisons in the General Linear Model’, J. Comput. Graph. Stat. 7, pp. 23–41.
• Stoline, M.R. (1981), ‘The Status of Multiple Comparisons: Simultaneous Estimation of All Pairwise Comparisons in One-Way ANOVA Design’, The American Statistician 35, pp. 134–141.
• Tong, Y.L. (1990), The Multivariate Normal Distribution, Springer-Verlag, New York.
• Westfall, P.H., Tobias, R.D., Rom, D., and Wolfinger, R.D. (1999), Multiple Comparisons and Multiple Tests using the SAS System, SAS Institute Inc, Cary, NC.