Spherical-Radial Integration Rules for Bayesian Computation

Alan Genz†
Department of Mathematics
Washington State University
Pullman, WA 99164-3313
[email protected]
John Monahan
Department of Statistics
North Carolina State University
Raleigh, NC 27695-8203
[email protected]
Abstract

The common numerical problem in Bayesian analysis is the numerical integration of the posterior. In high dimensions, this problem becomes too formidable for fixed quadrature methods, and Monte Carlo integration is the usual approach. Through the use of modal standardization and a spherical-radial transformation, we reparameterize in terms of a radius r and a point z on the surface of the sphere in d dimensions. We propose two types of methods for spherical-radial integration. A completely randomized method uses randomly placed abscissas for the radial integration and for the sphere surface. A mixed method uses fixed quadrature (Simpson's rule) on the radius and randomized spherical integration. The mixed methods show superior accuracy in comparisons, require little or no assumptions, and provide diagnostics to detect difficult problems. Moreover, if the posterior is close to the multivariate normal, the mixed methods can give remarkable accuracy.

Key Words: Monte Carlo, randomized numerical integration rule, importance sampling.
1 Introduction

The expression of posterior inference from Bayesian methodology often requires numerical computation that is customized for each application. A key step for many applications is the numerical integration of the unnormalized posterior density. If the number of parameters is small, this problem can be handled in a number of ways. However, when the number of parameters grows beyond 3 or 4, the choices become more limited. To reduce the programming effort, we seek omnibus methods which can be successful in solving most problems with little customization. We assume little in the distributional structure that can be exploited: no conjugate family, no hierarchical structure, no conditional forms that would direct the use of Gibbs sampling or other Markov Chain Monte Carlo methods. Some of the earliest numerical work in Bayesian analysis (e.g., Zellner and Geisel 1970) used a combination of conjugate families and distributional structure to reduce the numerical integration to one or at most two dimensions. As computations became cheaper, product integration rules were proposed: product Simpson or Gauss-Hermite (Naylor and Smith 1982).

Accepted for publication in the Journal of the American Statistical Association.
† Partially supported by a grant from the National Science Foundation.
Product integration rules suffer from the curse of dimensionality: the effort to reach a required accuracy increases exponentially with dimension. As a consequence, product rules are practically limited to three or four dimensions at present. An additional drawback is the difficulty in constructing error estimates. Monte Carlo importance sampling has been the standard omnibus method in recent years for Bayesian problems (Kloek and van Dijk 1978, Geweke 1989). We will employ importance sampling as the benchmark method in comparisons with the methods proposed here.

We will discuss the details of importance sampling in Section 3, following a brief discussion of the mathematical background and standardization methods in Section 2. The key to the methods proposed here is the spherical-radial transformation discussed in Section 4, along with randomized or mixed quadrature methods. Next the randomized spherical-radial methods are discussed in Section 5, followed by mixed rules in Section 6. Comparisons of these new methods with importance sampling methods compose Section 7.

The driving motivation for this work is to find methods that are able to handle most problems, and satisfy the omnibus requirement, but that also work very well when the problem is relatively easy. For Bayesian problems, this situation occurs when the posterior is well approximated by a multivariate normal distribution. A secondary motivation, leading to the same conclusions, was to exploit the best features of two methods of integration: fixed quadrature and Monte Carlo. Fixed quadrature rules, most commonly midpoint, trapezoid, Simpson, or Gauss rules, exploit smoothness features of the integrand. With a reasonable level of smoothness, fixed quadrature can attain rates of convergence of O(n^{-2}) or O(n^{-4}) for n evaluations in one dimension, while Monte Carlo crawls at O(n^{-1/2}). For d dimensions, the exponents for fixed quadrature are divided by d, say O(n^{-2/d}), as the curse of dimensionality takes hold, so that the Monte Carlo crawl appears rapid. In addition to being undaunted by dimension, Monte Carlo has the advantage of easy-to-construct error estimates, while fixed quadrature does not. Randomization of fixed quadrature to obtain error estimates is not new (Cranley and Patterson 1976). Returning to the Bayesian problem, we seek to combine randomization and fixed quadrature so that if the posterior is nearly multivariate normal, the methods will be very accurate. Indeed, many of the methods proposed will be exact or nearly so if the posterior is multivariate normal. The two departures from normality we have been most concerned about are thick tails and skewness. Through the use of the spherical-radial transformation (Section 4), we separate these two: thick tails are manifest in the radius, and skewness in the lack of spherical symmetry.
2 Standardization and the Normal Approximation
We wish to numerically compute the integral of several functions g(t) with respect to the unnormalized posterior distribution p(t) in d dimensions:

\[ I(g, p) = \int \cdots \int_{R^d} g(t)\, p(t)\, dt. \]
Typically, the functions g of interest are 1 (normalization constant), t (for posterior means), tt^T (for covariances), or indicator functions (for posterior probabilities). Asymptotics suggest that for large samples, the posterior density p(t) has a useful normal approximation:

\[ \log(p(t)) \approx \log(p(t^*)) - (t - t^*)^T C\, (t - t^*)/2 \quad (1) \]
where t^* is the posterior mode and C = -\nabla^2 \log(p), evaluated at t^*, is the Hessian matrix at the mode. Assumed throughout this paper is the existence of a posterior mode surrounded by most of the mass of the posterior. The suggestion of the normal approximation has two major consequences: the use of standardized variables, and a restatement of one of the motivations for this work.

The expression (1) suggests the following reparameterization. Let C = BB^T be the Cholesky factorization of the Hessian matrix C. Then reparameterizing by t = t^* + B^{-1}x leads to

\[ I(g, p) = \int \cdots \int_{R^d} g(t)\, p(t)\, dt = \int \cdots \int_{R^d} g^*(x)\, p^*(x)\, dx \]

where g^*(x) = g(t^* + B^{-1}x) and p^*(x) = |B|^{-1} p(t^* + B^{-1}x). Revisiting (1) now gives the normal approximation as

\[ \log(p^*(x)) \approx \log(p^*(0)) - x^T x/2. \quad (2) \]

As the sample size increases, the posterior often becomes more expensive to evaluate, and one can view an improvement in the normal approximation as compensating for this increased cost. Continuing this line of thought brings out the motivation for the methods proposed here as omnibus methods. In addition to being able to handle almost any problem, if the normal approximation is very good, these methods should be able to give very accurate results with little effort. The methods should not rely on the normality assumption, but be able to exploit it if available. Some researchers (Naylor and Smith 1982; Evans 1991; Evans and Swartz 1992, 1996) have proposed using estimated posterior means and covariances to adaptively refine the reparameterization. We avoid this. Because of the possibility of thick tails and the common use of improper priors, these posterior moments may not exist, and if they do, their estimates may not be accurate and the adaptation highly variable. While the adaptation attempts to adjust for thick tails, we prefer to tackle that problem directly.
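To make the standardization concrete, here is a minimal numerical sketch (not from the paper), assuming a user-supplied unnormalized log posterior; the names `log_post` and `standardize` are ours, and the Hessian is approximated by central differences rather than derived analytically:

```python
import numpy as np
from scipy.optimize import minimize

def standardize(log_post, t0, eps=1e-5):
    """Return the mode t*, the Cholesky factor B with C = B B^T for
    C = -Hessian(log p) at t*, and the map x -> t = t* + B^{-1} x."""
    d = len(t0)
    t_star = minimize(lambda t: -log_post(t), t0).x   # posterior mode
    I = np.eye(d)
    C = np.empty((d, d))                              # central-difference Hessian of -log p
    for i in range(d):
        for j in range(d):
            C[i, j] = -(log_post(t_star + eps * (I[i] + I[j]))
                        - log_post(t_star + eps * (I[i] - I[j]))
                        - log_post(t_star - eps * (I[i] - I[j]))
                        + log_post(t_star - eps * (I[i] + I[j]))) / (4 * eps ** 2)
    B = np.linalg.cholesky(C)                         # C = B B^T
    return t_star, B, lambda x: t_star + np.linalg.solve(B, x)
```

For a posterior that is exactly of the form (1), the standardized log posterior is then the spherical quadratic in (2).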
3 Monte Carlo Importance Sampling

Monte Carlo integration has strong appeal for numerical integration problems in high dimension because it avoids the curse of dimensionality and has the usual statistical O(n^{-1/2}) convergence rate for nearly all problems. But generation of random variables from the posterior p^*(x) is rarely possible, so X_i, for i = 1,...,N, are often generated iid from some other more convenient density q(x) on R^d. In order to compute sample means, the observations are weighted using weights W_i = p^*(X_i)/q(X_i), leading to estimates of I(g, p) from the weighted sample means N^{-1} \sum_{i=1}^{N} W_i g^*(X_i). The posterior p(t) is rarely normalized, so the estimates of posterior expectations usually take the ratio form \sum_{i=1}^{N} W_i g^*(X_i) / \sum_{i=1}^{N} W_i. The efficiency of importance sampling, measured in terms of the variance of the estimate for the normalization constant, is highest when the sampling distribution q(x) is near p^*(x). Given the theoretical support for the normal approximation, the first suggestion for q(x) is the normal approximation given by (2). The drawback to this choice is that the normal approximation often deteriorates for large values of x, and few posterior distributions have the tight tail of the normal distribution. As a result, the weights often become very large for large values of x. Indeed, because the Central Limit Theorem is the basis for most of the analysis of importance sampling results (Geweke 1989, but see also Hesterberg 1991), bounded weights are certainly preferred, although weights with a finite variance are sufficient. However, the tail behavior of the weights is rarely checked. Concern for the tail behavior of the importance sampling distribution has led many (Kloek and van Dijk 1978) to use some form of Student's t distribution with small degrees of freedom. Evans and Swartz (1996) recommend a modified multivariate t with 5 df after standardization, with the resulting

\[ q(x) \propto (1 + x^T x/3)^{-(5+d)/2}. \quad (3) \]
We will use this method of importance sampling, along with the multivariate normal approximation, as benchmarks for the methods proposed in Sections 5 and 6. Monahan (1993) investigated two approaches for testing the tail behavior of importance sampling weights and found the best tests were those based on Hill's (1975) estimator

\[ \hat{a} = k^{-1} \sum_{j=1}^{k} \big[ \log(W_{(N-j+1)}) - \log(W_{(N-k)}) \big] \]

for the tail rate parameter a in the expression 1 - F(w) = w^{-1/a} L(w). Haeusler and Teugels (1985) give the asymptotic distribution of the estimate as (\hat{a} - a) \sim N(0, a^2/k). With the boundary for an infinite mean at a = 1, and an infinite variance at a = 1/2, the tests were set up with these boundaries as the null hypothesis, leading to the tests at level α:

Reject the hypothesis of an infinite mean if \hat{a} < 1 - z_α/\sqrt{k};
Reject the hypothesis of an infinite variance if \hat{a} < (1 - z_α/\sqrt{k})/2;

where z_α is the upper α critical point of the normal distribution. We will apply these tests to the weights for the appropriate methods in Section 7, taking k = 4N^{1/3}. With the common use of improper priors and the difficulty of analyzing the functional behavior of many likelihood functions, one could unknowingly have an improper posterior, so that testing for an infinite mean of the weights could be useful.
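These tail tests are simple to implement. The following sketch (our code, not the paper's) applies Hill's estimator and the two rejection rules above to an array `w` of importance sampling weights:

```python
import numpy as np
from scipy.stats import norm

def tail_tests(w, alpha=0.05):
    N = len(w)
    k = int(4 * N ** (1 / 3))                  # k = 4 N^{1/3}, as in the paper
    ws = np.sort(w)                            # order statistics W_(1) <= ... <= W_(N)
    # Hill's estimator: mean of log W_(N-j+1) - log W_(N-k) over the k largest weights
    a_hat = np.mean(np.log(ws[N - k:]) - np.log(ws[N - k - 1]))
    z = norm.ppf(1 - alpha)                    # upper alpha critical point
    return {"a_hat": a_hat,
            "reject_infinite_mean": a_hat < 1 - z / np.sqrt(k),
            "reject_infinite_variance": a_hat < (1 - z / np.sqrt(k)) / 2}
```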
4 Transformation and Integration Tools

4.1 Spherical-Radial Transformation
In both methods proposed here, the key step is another change of variable, from the standardized vector x in R^d to a radius r and direction vector z, by x = rz, where z is a point on U_d, the surface of the unit sphere in d dimensions. This changes the integral accordingly to

\[ I(g, p) = \int \cdots \int_{R^d} g(t)\, p(t)\, dt = \int \cdots \int_{R^d} g^*(x)\, p^*(x)\, dx = \int_0^\infty \Big[ \int_{U_d} g^*(rz)\, p^*(rz)\, dz \Big] r^{d-1}\, dr. \]

The reader is reminded to note this factor r^{d-1} in the subsequent calculations. The choice of the order of integration is also important: sphere within radius. Denote the inner spherical integral by

\[ GP(r) = \int_{U_d} g^*(rz)\, p^*(rz)\, dz \]

and the outer radial integral by

\[ \int_0^\infty GP(r)\, r^{d-1}\, dr. \]
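As a quick numerical check of this factorization (our illustration, not the paper's), take the spherically symmetric integrand h(x) = e^{-|x|^2/2}: the spherical integral reduces to the surface content |U_d| = 2π^{d/2}/Γ(d/2), and the product of the two one-dimensional factors recovers the known value (2π)^{d/2}:

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import gamma

for d in (2, 5, 10):
    surface = 2 * np.pi ** (d / 2) / gamma(d / 2)     # |U_d|
    radial, _ = quad(lambda r: r ** (d - 1) * np.exp(-r ** 2 / 2), 0, np.inf)
    print(d, surface * radial, (2 * np.pi) ** (d / 2))  # the two columns agree
```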
4.2 Randomized Integration Methods
Here we consider one-dimensional randomized rules for the radial integral, although the same principles apply to the methods employed for integrating over the surface of the sphere. Fixed quadrature methods, such as the trapezoid or Simpson's rule, usually give statistically biased results, while Monte Carlo methods are unbiased. We employ randomized integration methods to gain both the unbiasedness of Monte Carlo methods and the improved convergence of fixed quadrature. For example, consider the randomly shifted Riemann sum for integrating over (0, 1):

\[ \hat{I}(h) = n^{-1} \sum_{i=1}^{n} h(U + (i-1)/n), \]

where U ~ uniform(0, 1/n). This estimate is unbiased since E\{\hat{I}(h)\} = \int_0^1 h(u)\,du, and gains improved convergence through a smaller variance, usually O(n^{-2}). This method may be recognized as systematic sampling (Cochran 1978, p. 160). We will employ randomized methods to gain an estimate of an integral that is exact for certain types of functions and unbiased for all functions. Siegel and O'Brien (1985) constructed a randomized integration rule on (-1, +1) that is unbiased for all functions and exact for cubic functions. If R has density 3r^2 on (0, 1), then the rule T(R) = w_0(R)h(0) + w_1(R)h(-R) + w_1(R)h(R), with w_0 = 2 - 2/(3R^2) and w_1 = 1/(3R^2), is unbiased for all functions, that is, E\{T(R)\} = \int_{-1}^{1} h(r)\,dr, and exact, that is, zero variance, for any function h that is a polynomial of degree less than four. This randomized integration rule is similar in form and principle to antithetic variates (Hammersley and Handscomb 1964). For integrating on (-∞, ∞), Genz and Monahan (1996a) generalized this approach for integrating functions weighted with φ(r)|r|^{d-1}, where φ(r) is the standard normal density. The first-order rule merely takes a simple symmetrized form: generate R ~ χ_d and form T_1(R) = [h(R) + h(-R)]/2, so that
\[ E\{T_1(R)\} = \frac{2^{-d/2}}{\Gamma(d/2)} \int_{-\infty}^{\infty} h(r)\,|r|^{d-1} e^{-r^2/2}\,dr. \]
The third-order rule has R ~ χ_{d+2}, and weights w_0 = 1 - d/R^2 and w_1 = d/(2R^2), so that T_3(R) takes the Siegel and O'Brien form but is exact for integrating cubic functions with respect to |r|^{d-1}e^{-r^2/2}, and unbiased for all functions h:

\[ E\{T_3(R)\} = \frac{2^{-d/2}}{\Gamma(d/2)} \int_{-\infty}^{\infty} h(r)\,|r|^{d-1} e^{-r^2/2}\,dr. \]

The fifth-order method requires two values R_1 and R_2 to be generated from a complicated joint distribution; simplified, the algorithm is to generate U ~ χ_{2d+7} and V ~ Beta(d+2, 3/2), and form R_1 = U sin(arcsin(V)/2) and R_2 = U cos(arcsin(V)/2), producing an integration rule of the form

\[ T_5(R) = w_0 h(0) + w_1 h(-R_1) + w_1 h(R_1) + w_2 h(-R_2) + w_2 h(R_2). \]

This is constructed to be unbiased for integrating all functions with respect to |r|^{d-1}e^{-r^2/2}, and exact for quintic functions. Going to higher order has questionable value, because the integration rule becomes much more sensitive, and generating from the (R_1, R_2, R_3) joint distribution becomes daunting.
This approach can be generalized for kernel functions f(r) other than the normal, say for Student's t or the logistic. For the third-order method, the weights take the form w_1 = v_d/(2R^2) and w_0 = 1 - v_d/R^2, where

\[ v_d = \int_0^\infty r^{d+1} f(r)\,dr \Big/ \int_0^\infty r^{d-1} f(r)\,dr, \]

and the kernel functions are f(r) = (1 + r^2/ν)^{-(d+ν)/2} for Student's t_ν, and f(r) = e^{-r}/(1 + e^{-r})^2 for the logistic distribution. As above, T_3(R) will be exact for any function that takes the form of a cubic polynomial times f(r), and will be unbiased for all integrable functions on the real line.
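A sketch of the third-order radial rule with the normal kernel follows (our code; the χ_{d+2} radius is generated as the square root of a chi-square variate). For h(r) = r^2, each draw returns the exact value d, illustrating the zero-variance property for low-degree polynomials:

```python
import numpy as np

rng = np.random.default_rng(0)

def t3_normal(h, d, rng):
    # R has density proportional to r^{d+1} e^{-r^2/2}, i.e. R ~ chi_{d+2}
    R = np.sqrt(rng.chisquare(d + 2))
    w0, w1 = 1 - d / R**2, d / (2 * R**2)
    return w0 * h(0.0) + w1 * (h(-R) + h(R))

# h(r) = r^2 has degree < 4, so every draw returns the exact normalized
# integral, which works out to d (here 4), up to floating-point rounding.
d = 4
print([t3_normal(lambda r: r * r, d, rng) for _ in range(3)])
```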
4.3 Spherical Integration
We consider three basic methods for integrating over the surface of the unit sphere U_d, to which we assign the names antipodal, simplex, and extended simplex. The antipodal method puts 2 equally weighted points at ±1 on each of the d axes: e_k and -e_k for k = 1,...,d. Note that there are 2d points in d dimensions, and that this method will be exact for cubic functions. The simplex rule has equally weighted points at the vertices of the regular simplex in d dimensions. Note that there are d+1 points in d dimensions, and that this method is exact for quadratic functions. Mysovskikh (1980, p. 232) gives the formulas for the d+1 points. Here we use {s_j} to denote this set of regular unit d-simplex vertices. The d+1 vertices are defined by s_{i,j} = 0 for 0 < j < i < d+1,

\[ s_{i,i} = \left( \frac{(d+1)(d-i+1)}{d(d-i+2)} \right)^{1/2} \quad \text{for } i = 1, 2, \ldots, d, \]

and

\[ s_{i,j} = -\left( \frac{d+1}{(d-i+1)(d-i+2)d} \right)^{1/2} \quad \text{for } 0 < i < j \le d+1. \]

The extended simplex integration method begins with the same d+1 points of the simplex method and adds their negatives to make 2d+2 vertex points. Then the d(d+1)/2 midpoints of the segments joining all pairs of the original vertex points are projected to the surface of U_d. These projected points and their negatives are combined into a set of points that are weighted differently than the vertex points (see Mysovskikh 1980). If |U_d| is used to denote the surface content of the unit d-sphere, the weights for the vertex points and midpoints are

\[ |U_d| \frac{(7-d)d}{2(d+1)^2(d+2)} \quad \text{and} \quad |U_d| \frac{2(d-1)^2}{d(d+1)^2(d+2)}, \]

respectively. In the general case, there are (d+1)(d+2) points, but only 6 equally weighted points for d = 2 and 14 (not 20) points for d = 3, because of duplication. The extended simplex method is exact for all polynomials of degree less than six. Note that the projected midpoints can also be viewed as abscissas for an equally weighted integration rule using d(d+1) points.
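The vertex formulas are easy to tabulate. The following sketch (ours) builds the d+1 vertices as columns of a matrix and checks that they have unit length and pairwise inner products -1/d, as a regular simplex inscribed in U_d must:

```python
import numpy as np

def simplex_vertices(d):
    # Column j is vertex s_j; formulas from Mysovskikh (1980) as quoted above.
    S = np.zeros((d, d + 1))
    for i in range(1, d + 1):
        S[i - 1, i - 1] = np.sqrt((d + 1) * (d - i + 1) / (d * (d - i + 2)))
        for j in range(i + 1, d + 2):
            S[i - 1, j - 1] = -np.sqrt((d + 1) / ((d - i + 1) * (d - i + 2) * d))
    return S

S = simplex_vertices(4)
print(np.linalg.norm(S, axis=0))  # all 1: the vertices lie on U_d
print(S.T @ S)                    # off-diagonal entries all -1/d
```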
4.4 Randomized Spherical Methods
Let us denote the abscissas as v_k and weights as u_k for any of the three methods given above in Section 4.3 for integrating the function h on U_d:

\[ \sum_{k=1}^{m} u_k\, h(v_k) \approx \int_{U_d} h(z)\,dz. \]
Then a randomized method for the sphere can be constructed by generating a random orthogonal matrix Q to form

\[ \sum_{k=1}^{m} u_k\, h(Q v_k) \approx \int_{U_d} h(z)\,dz. \]

Stewart (1980) gives an algorithm for generating from the uniform distribution (invariant Haar measure) over orthogonal matrices. Essentially, the approach is to generate a d × d matrix X of standard normal variables and form the QR factorization X = QR. Then Q has the right distribution. Givens transformations can be employed in the factorization in such a way as to reduce the time and space required to determine Q. An analysis of the effectiveness of these randomized spherical rules can be brought into focus by considering an alternative method of integrating over U_d: generating random vectors z_j uniformly on the sphere, that is, straight Monte Carlo. The comparison of these methods can then be seen as a comparison of simple random sampling with the z_j and systematic sampling with the Q_j v_k. If the spherical integration rule is exact for a function h, a low-degree polynomial, then the variance of \sum_k u_k h(Q_j v_k) will be zero. If h is not a low-degree polynomial but has sufficient smoothness, then the variance should decrease as m, the number of abscissas, increases, as is the case for systematic sampling. If the function is sufficiently noisy, so that the variables h(Q_j v_k), k = 1,...,m, are uncorrelated, then systematic sampling will only do as well as simple random sampling. We write Q_j v_k to imply that we will replicate by using rotations Q_j to permit estimation of the variance of \sum_k u_k h(Q v_k).
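A sketch of the randomized antipodal rule follows (our code; the sign correction that makes the QR-based Q exactly Haar distributed is a standard refinement and our addition). For the quadratic h(z) = z_1^2, every random rotation returns the exact value |U_d|/d, illustrating the zero-variance case:

```python
import numpy as np
from scipy.special import gamma

rng = np.random.default_rng(1)

def random_orthogonal(d, rng):
    # Stewart's approach: QR of a Gaussian matrix; scaling the columns by
    # sign(R_ii) makes the factorization unique, so Q is exactly Haar.
    Q, R = np.linalg.qr(rng.standard_normal((d, d)))
    return Q * np.sign(np.diag(R))

d = 3
surf = 2 * np.pi ** (d / 2) / gamma(d / 2)        # |U_d|
V = np.hstack([np.eye(d), -np.eye(d)])            # antipodal abscissas +-e_k
u = np.full(2 * d, surf / (2 * d))                # equal weights summing to |U_d|
Q = random_orthogonal(d, rng)
est = sum(w * (Q @ v)[0] ** 2 for w, v in zip(u, V.T))  # integrate h(z) = z_1^2
print(est, surf / d)   # equal for every Q: the rule is exact for quadratics
```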
5 Randomized Spherical-Radial Methods

The aim for these randomized spherical-radial methods is to be unbiased for all functions, and exact if the posterior happens to be a low-degree polynomial times a standard kernel function, perhaps multivariate normal. These methods are combinations of the randomized quadrature methods of Section 4.2 and the randomized spherical methods of Section 4.4. If we begin with integrating a function h(x) in R^d, then the spherical-radial transformation (Section 4.1), combined with a random orthogonal matrix Q, produces the randomized integration rule

\[ \hat{H}(r) = \sum_{k=1}^{m} u_k\, h(r Q v_k), \quad \text{so that} \quad E\{\hat{H}(r)\} = \int_{U_d} h(rz)\,dz. \]
If the spherical rule using (u_k, v_k) is exact for cubic polynomials, then Var\{\hat{H}(r)\} = 0 whenever h is a cubic polynomial times f(|x|), with r = |x|. Using the third-order rules from Section 4.2, now generate R with density proportional to r^{d+1} f(r) and form T_3(R) = w_0 \hat{H}(0)/f(0) + w_1 \hat{H}(R)/f(R). Then the integration rule T_3(R) is exact for cubic polynomials times f(|x|), and unbiased for all h:

\[
\begin{aligned}
E\{T_3(R)\} &= E\{w_0 \hat{H}(0)/f(0) + w_1 \hat{H}(R)/f(R)\} \\
&= \hat{H}(0)\, E\{1 - v_d R^{-2}\}/f(0)
 + v_d \int_0^\infty r^{-2} \Big[ \int_{U_d} h(rz)\,dz \Big/ f(r) \Big] r^{d+1} f(r)\,dr \Big/ \int_0^\infty r^{d+1} f(r)\,dr \\
&= 0 + \int_0^\infty r^{d-1} \int_{U_d} h(rz)\,dz\,dr \Big/ \int_0^\infty r^{d-1} f(r)\,dr \\
&= \int \cdots \int_{R^d} h(t)\,dt \Big/ \int_0^\infty r^{d-1} f(r)\,dr.
\end{aligned}
\]
To compute the integrals I(g, p) for several functions g simultaneously using this approach, we take h(RQv_k) = g^*(RQv_k)\, p^*(RQv_k). In addition to the normal version with f(r) = e^{-r^2/2}, we also considered multivariate t forms f(r) = (1 + r^2/ν)^{-(d+ν)/2} with ν = 5 and ν = 9, as well as the logistic version f(r) = e^{-r}/(1 + e^{-r})^2, all using third-order rules for both radial and spherical (antipodal) integration. A fifth-order rule, using the extended simplex method for the sphere and the fifth-order radii R_1 and R_2 for the normal f(r) = e^{-r^2/2}, was also considered in this study. The generation of radii for the normal kernel is easy, with the χ_{d+2} for the third order and a combination of χ and Beta variates for the fifth order. While generation for the multivariate t kernel requires F-like variates, the logistic requires a special algorithm, obtained here following the ratio-of-uniforms method. If the posterior p^* is multivariate normal, then the third-order integration rule T_3(R) with the normal kernel f(r) = e^{-r^2/2} will give exact results for posterior means and variances, since they are at most quadratic functions. By exact we mean that the variance is zero. If the posterior is spherically symmetric about the mode, then the spherical integration will be exact (zero variance) and the only error results from the mismatch of the kernels. These results hold similarly for T_3(R) constructed for other kernels.
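Putting the radial and spherical pieces together gives the following sketch of one replicate of the third-order rule with normal kernel (our reconstruction; the normalizing constant c1 below is our reading of the derivation above, and should be treated as an assumption to verify). With a multivariate normal integrand, every replicate returns exactly 1:

```python
import numpy as np
from scipy.special import gamma

rng = np.random.default_rng(2)

def rsr3_normal(h, d, rng):
    surf = 2 * np.pi ** (d / 2) / gamma(d / 2)         # |U_d|
    c1 = 2 ** (d / 2 - 1) * gamma(d / 2)               # int_0^inf r^{d-1} e^{-r^2/2} dr
    Q, Rf = np.linalg.qr(rng.standard_normal((d, d)))  # random rotation (Stewart)
    Q = Q * np.sign(np.diag(Rf))
    # antipodal spherical estimate H_hat(r) = sum_k u_k h(r Q v_k)
    H = lambda r: surf / (2 * d) * sum(h(r * v) + h(-r * v) for v in Q.T)
    R = np.sqrt(rng.chisquare(d + 2))                  # density prop. to r^{d+1} e^{-r^2/2}
    w0, w1 = 1 - d / R ** 2, d / R ** 2                # v_d = d; +-R pairs already in H
    return c1 * (w0 * H(0.0) + w1 * H(R) / np.exp(-R ** 2 / 2))

d = 3
h = lambda x: (2 * np.pi) ** (-d / 2) * np.exp(-x @ x / 2)   # normal "posterior"
print([round(rsr3_normal(h, d, rng), 10) for _ in range(3)])  # all 1.0: zero variance
```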
6 Mixed Spherical-Radial Methods

The spherical-radial transformation directly motivates the mixed methods by expressing the integral in terms of the radius r and a point z on the surface of the sphere. The high dimension (d - 1) of the spherical integral suggests Monte Carlo in some form, and we employ the randomized spherical rules from Section 4.4 for the inner integral. The radial integral is just one dimensional, which suggests using fixed quadrature. Since thick tails will be manifest only in the radial part, with GP(r) trailing off slowly to zero, the radial integral should be done using a method that can handle a variety of tail behaviors. Gauss rules are inappropriate here because the choice of rule depends on the tail behavior, and they are cumbersome to implement and difficult to compute stably when the number of abscissas is large. We therefore focus on the simpler midpoint, trapezoid, and Simpson's rules, which we consider using in compound form.

These compound rules have two types of bias: bias from the size of subintervals, and bias from the choice of an upper bound on the range. The first type of bias is reduced by decreasing the interval size and increasing the number of evaluations. If the upper bound on the range is large enough, these compound rules have higher than expected order of convergence for the larger d values, because the r^{d-1} factor for r near 0 and the rapid decrease in GP(r) for large r usually combine to make the integrand and several higher-order derivatives zero at both r = 0 and r = ∞ (see Davis and Rabinowitz, 1984, pp. 134-140). Working through the variance formulas given below, doubling the number of spherical replications (q) has almost the same effect on the variance as halving the interval size. This suggests reducing the interval size until the subinterval width bias is far below the standard error of the method, .01 to .0001 for the number of evaluations considered here. For interval lengths of .4 (.2 spacing), Simpson's rule can reduce the relative bias below .00003 for integrating r^{d-1}e^{-r^2/2}. Since the midpoint and trapezoid rules have larger bias than Simpson's rule for lower dimensions, we have limited our study to Simpson's rule.

For selecting an upper bound on the range, notice that the factor r^{d-1} dictates that the tail depends on dimension. For the normal case, going as far as r = 3 might appear to be far enough, but for a modest dimension of d = 5, r = 3 will be just a single standard deviation from the mode. For an exponential tail, the same dimension would suggest going out to r = 12 to extend past 3 standard deviations. Based on our experience, and perhaps also fears, we conservatively chose a cutoff on the range at r = 16, which was large enough to effectively eliminate the cutoff bias for the test problems that we describe later in this paper. A more careful implementation of our methods, designed to be robust, reliable and efficient, would need to dynamically estimate subinterval
size and cutoff biases, and balance these with the errors due to the variance in the spherical surface integration. The mixed spherical-radial methods can be written mathematically as

\[ I_M(g, p) = \sum_i w_i\, r_i^{d-1}\, \frac{1}{q} \sum_{j=1}^{q} \Big[ \sum_{k=1}^{m} u_k\, g^*(r_i Q_{i,j} v_k)\, p^*(r_i Q_{i,j} v_k) \Big] \quad (4) \]
where the w_i denote the Simpson weights for the radial quadrature using the r_i, the u_k are the weights for the spherical quadrature as in Section 4.4, and q random rotation matrices Q_{i,j} are employed as part of the spherical integration. While the other methods outlined in this paper lead to iid random variables, constructing a variance estimate for the mixed rules is more complicated. Denote the expression in square brackets in (4) as Y_{ij}, an estimate for GP(r_i). Then the estimate of the variance of I_M(g, p), a weighted sum of means as I_M(g, p) = \sum_i w_i r_i^{d-1} \bar{Y}_{i\cdot}, can be found from S_i^2 = \sum_{j=1}^{q} (\bar{Y}_{i\cdot} - Y_{ij})^2/(q-1) as

\[ \widehat{\mathrm{Var}}(I_M) = \sum_i w_i^2\, r_i^{2(d-1)}\, S_i^2 / q. \quad (5) \]
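A sketch of the mixed rule (4) with its variance estimate (5) follows (our code): compound Simpson weights on [0, 16], q random rotations of the antipodal rule at each radius, with the cutoff chosen per the discussion above; the constants mirror the reconstructions earlier and are assumptions to verify.

```python
import numpy as np
from scipy.special import gamma

rng = np.random.default_rng(4)

def mixed_sr(h, d, rng, r_max=16.0, n_sub=80, q=4):
    r = np.linspace(0.0, r_max, 2 * n_sub + 1)          # Simpson abscissas r_i
    w = np.ones_like(r)                                 # pattern 1,4,2,...,2,4,1
    w[1:-1:2], w[2:-1:2] = 4, 2
    w *= (r[1] - r[0]) / 3                              # compound Simpson weights w_i
    surf = 2 * np.pi ** (d / 2) / gamma(d / 2)          # |U_d|
    est = var = 0.0
    for wi, ri in zip(w, r):
        Y = np.empty(q)
        for j in range(q):                              # q random rotations Q_ij
            Q, Rf = np.linalg.qr(rng.standard_normal((d, d)))
            Q = Q * np.sign(np.diag(Rf))
            Y[j] = surf / (2 * d) * sum(h(ri * v) + h(-ri * v) for v in Q.T)
        est += wi * ri ** (d - 1) * Y.mean()            # formula (4)
        var += wi ** 2 * ri ** (2 * (d - 1)) * Y.var(ddof=1) / q   # formula (5)
    return est, np.sqrt(var)

d = 3
h = lambda x: (2 * np.pi) ** (-d / 2) * np.exp(-x @ x / 2)
print(mixed_sr(h, d, rng))   # about (1.0, ~0): a normal posterior is easy
```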
The analysis of the sources of variation leads to insightful diagnostics that are a great strength of this approach. Take g ≡ 1 to focus on the integration constant. Then if the posterior is normal, \bar{Y}_{i\cdot}/p^*(0) (normalized with respect to the mode) should behave as e^{-r_i^2/2}, and a plot of both versus r_i will show how close to normal the posterior is. A plot of S_i versus r_i, or of standard errors for \bar{Y}_{i\cdot}/p^*(0), will show how the accuracy varies with the radius. Typically, S_i should be small for small values of r near the origin (mode), where the posterior should be very smooth (S_0 = 0 for r_0 = 0), and should also tail off to zero for values of r in the tail. If the posterior is spherically symmetric then S_i will be zero, so small values of S_i suggest symmetry, and large values of S_i indicate departures from symmetry or smoothness. Two other plots give a better view of any departure from normality. A plot of log(\bar{Y}_{i\cdot}) versus r_i can be compared to -r^2/2, which should accentuate any departure from a normal tail. But the best view is the effect on the estimate I_M, that is, weighted by r^{d-1}: overlaid plots of r_i^{d-1}\bar{Y}_{i\cdot} and r_i^{d-1}S_i, with r_i^{d-1}e^{-r_i^2/2} for reference, give the best view.

These plots illustrate the troublesome behavior exhibited by the posterior from Example 1, the Heart Transplant example, further described in the Appendix. The plot of the "posterior" \bar{Y}_{i\cdot}/p^*(0) versus r_i with e^{-r_i^2/2} for reference (Figure 1) shows that the posterior appears to have the normal shape, at least for small values of the radius r. The variance looks small, with the standard errors of \bar{Y}_{i\cdot}/p^*(0) slammed against the horizontal axis; however, plotted by itself (Figure 2), we see the small variance near the origin and in the tails. Further examination of the posterior, plotting the "log posterior" log(\bar{Y}_{i\cdot}/p^*(0)) versus r (Figure 3), does not follow at all the curve -r^2/2. Rather, the tail behavior of the posterior looks more exponential. In fact, the line 1.2 - r fits very well, supporting an exponential-like tail. The effect of dimension on the posterior and the accuracy of the computations is best shown in Figure 4. With the unnormalized χ_3 density r^2 e^{-r^2/2} for reference, we see that the apparently small departures from normality barely visible in Figure 1 clearly have a serious effect. Moreover, the plot of the standard errors of these values no longer hugs the axis, and the values for radii from 4 to 8 are not very accurate at all.
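A minimal sketch of these diagnostic plots (our code, assuming arrays r, Ybar, and S of radii, GP(r_i) estimates, and their standard errors from a run of the mixed rule, along with the mode value p0 = p^*(0)):

```python
import numpy as np
import matplotlib.pyplot as plt

def diagnostics(r, Ybar, S, d, p0):
    ref = np.exp(-r ** 2 / 2)                    # normal reference curve
    fig, ax = plt.subplots(1, 3, figsize=(12, 3))
    ax[0].plot(r, Ybar / p0, label="posterior")
    ax[0].plot(r, ref, "--", label="normal reference")
    ax[1].plot(r, np.log(Ybar / p0), label="log posterior")
    ax[1].plot(r, -r ** 2 / 2, "--", label="-r^2/2")
    ax[2].plot(r, r ** (d - 1) * Ybar / p0, label="scaled posterior")
    ax[2].plot(r, r ** (d - 1) * S / p0, label="scaled s.e.")
    ax[2].plot(r, r ** (d - 1) * ref, "--", label="scaled reference")
    for a in ax:
        a.legend()
    plt.show()
```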
The general form of (4) should suggest possible tradeoffs to achieve the best accuracy for a fixed effort, that is, a budget of the number of evaluations of the posterior. From the outside, more radial points will reduce bias, but this is not the main concern, since the bias can be nearly eliminated with as few as 80 points in the interval [0, 16]. Increasing the number of radial points has some effect on the variance by reducing the interval length and hence the Simpson weights w_i. The other two factors are more important. The variance of \bar{Y}_{i\cdot} is O(q^{-1}), and so increasing the number of rotations Q_{i,j} will reduce the variance of the estimate. The choice of spherical integration rule has a dramatic effect, but also limits other options. For the problems examined here, the higher-order spherical rules have substantially smaller variances. Even when corrected for the number of evaluations m, we see that mS_i^2 may differ by orders of magnitude across the different methods. Usually the extended simplex method has considerably smaller variance, with the antipodal method as its closest competitor. For small values of the radius, all work well enough, since the variance is so small; for large radii, the advantage of the extended simplex deteriorates to where all the methods appear roughly equal, and equal to simple random sampling. The apparent explanation is that for large radii, the points are essentially so distant that they are independent, and systematic sampling loses all advantages. The drawback with the extended simplex is that m = (d+1)(d+2) evaluations of the posterior are required for each Y_{ij}. For the antipodal method, we have m = 2d, compared to merely d+1 for the simplex method. For a fixed budget of the number of evaluations, and requiring at least q = 2 rotations to estimate the variance with 1 df, changing the spherical method forces other changes, and these can be expensive in high dimensions.
7 Comparisons

To compare the performance of these random and mixed spherical-radial methods, good scientific practice dictates that the test problems should match the population of interest, Bayesian applications. However, complicating any comparison is the fact that the true value of the integral will not be known for any interesting problems. Restricting the study to small-dimensional problems where the integral could be found through other methods would be self-defeating. As a result, we have approached the comparisons in two steps. We began with one simulation experiment using distributions with known integrals (densities), with the primary goal of determining whether the self-assessments of accuracy are appropriate. After finding this to be the case, we can then compare the performance of these methods on a sample of problems from the literature. We have used seven methods for these main comparisons. They are described as follows, each with a mnemonic used in the tables:

MCIST5: Monte Carlo Importance Sampling using a Student t_5 kernel, following (3);
RSR3N: Randomized Spherical-Radial method, using T_3(R) from Section 5 and the antipodal spherical integration rule, exact for a cubic polynomial times a normal kernel;
RSR3T9: Randomized Spherical-Radial method, as RSR3N, but using a Student t_9 kernel;
MSRA: Mixed Spherical-Radial with Simpson's rule on the radius, and the antipodal rule on the sphere;
MSRVQ: Mixed Spherical-Radial with Simpson and antipodal as MSRA, but varying the number of rotations q with the radius r;
MSRES: Mixed Spherical-Radial with Simpson on the radius, extended simplex on the sphere;
MSRMP: Mixed Spherical-Radial with Simpson on the radius, projected midpoints on the sphere.
These seven represent the best performers of 14 considered in preliminary tests not shown here. Among the poor performers worth noting are Monte Carlo importance sampling with the normal kernel (MCISN), a randomized method like RSR3N and RSR3T9 but with the logistic distribution as kernel, another t version (RSR3T5), and mixed methods using the simplex spherical integration. In response to the superior performance of MSRES in the preliminary round, we included MSRMP to determine whether higher order (fifth) or systematic sampling with more abscissas accounted for that success. The performance of MSRMP suggests that the greater part of the improvement arises from the number of abscissas m giving smaller O(m^{-2}) variance, instead of O(q^{-1}) with more rotations.

For the simulation experiment, we use several families of test problems with known results:

1. Six Dirichlet distributions, with dimensions 3, 3, 3, 7, 7, 7.
2. Six transformed Dirichlet distributions, the same dimensions transformed by e^t/(1 + e^t).
3. Product logistic distributions with d = 3, 6, 9, 12.
4. Product extreme-value distributions with d = 3, 6, 9, 12.

Table 1 gives the parameters for six densities from the Dirichlet family, whose support is limited to the simplex \sum_j t_j \le 1 and whose densities take the form

\[ \prod_{j=1}^{d} t_j^{\alpha_j - 1} \Big( 1 - \sum_{j=1}^{d} t_j \Big)^{\alpha_{d+1} - 1} \Gamma\Big( \sum_{j=1}^{d+1} \alpha_j \Big) \Big/ \prod_{j=1}^{d+1} \Gamma(\alpha_j). \]

Table 1: Dirichlet Distribution Parameters

    d   Parameters
    3   21 22 23 24
    3    3  4  5  6
    3    5 10 15 20
    7   21 22 23 24 25 26 27 28
    7    3  4  5  6  7  8  9 10
    7    3  4  5 10 16 21 22 23
Table 1: Dirichlet Distribution Parameters d Parameters 3 21 22 23 24 3 3456 3 5 10 15 20 7 21 22 23 24 25 26 27 28 7 3 4 5 6 7 8 9 10 7 3 4 5 10 16 21 22 23 The problems in Table 1 present a variety of diculties. First, the support boundaries come into play for the second and fth problems which have large variances. The third and sixth problems have varying marginal behavior across P the variables. The transformed Dirichlet densities use the change in variables tj = exj =(1+ dk=1 exk ), for j = 1; :::; d, with the resulting density over all Rd Pd P +1 j ) ?( dj =1 e j=1 j xPj Qd+1 : d+1 P (1 + dj=1 exj ) j=1 j j =1 ?(j ) This transformation should bring the Dirichlet density closer to the multivariate normal. For Q the product distributions, the joint density for the product logistic is dj=1 e?tj =(1 + e?tj )2, ?t Q and for the product extreme-value distribution the density is dj=1 e?tj ?e j . The simulation experiment is designed to address three main issues. First, we want to assess the level of bias in the methods. The randomized methods are supposed to be unbiased, and the 11
bias of the mixed methods should be small. Second, and most important, we need to determine whether the error estimates from each method are honest assessments of the accuracy of the method. Third, we want to know which methods work best. In this experiment, we apply each method n = 100 times to each problem within the four classes of distributions, twenty problems in all. All methods have the same budget of 40,000 evaluations of the density. For each problem, denote the estimate of the normalization constant (whose true value is 1 for these known densities) by Z_{mi} and the estimate of its standard error by s_{mi}. For importance sampling or the randomized methods, we obtain s_{mi} from replication; for the mixed methods, we use (5). To assess the level of bias in the methods, we compute for each method across the 20 test problems the t-test statistics for testing whether the bias is zero, that is,
\[ t = \sqrt{n}\,(\bar{Z}_{m\cdot} - 1)\big/ s_m, \qquad s_m^2 = \sum_{i} (Z_{mi} - \bar{Z}_{m\cdot})^2/(n-1). \]
These are plotted in Figure 5, with ±1.96 marked for reference. As a group, the mixed methods show no serious problem with bias. The randomized method RSR3N has some serious difficulty with some of the test problems.

Second, we need to determine whether the error estimates from each method are honest assessments of the accuracy of the method. This issue will be addressed in two ways. We will look at the coverage of 95% confidence intervals constructed using Z_{mi} and s_{mi}. Instead of counting the number of times Z_{mi} - 1.96 s_{mi} < 1 < Z_{mi} + 1.96 s_{mi}, we give in Table 2 the count of misses for each distribution, so that small numbers are good. For reference, the mean and standard deviation are 30 and 5.3 for a binomial with 600 trials and p = .05; for 400 trials the numbers change to 20 and 4.4.

Table 2: Coverage of 95% Confidence Intervals (count of misses)

    Method   Dirichlet   Transf. Dirichlet   Logistic   Extreme Value
    MCIST5   47 / 600    35 / 600            23 / 400    13 / 400
    RSR3N    44 / 600    36 / 600            69 / 400   154 / 400
    RSR3T9   28 / 600    23 / 600            17 / 400    25 / 400
    MSRA     39 / 600    34 / 600            24 / 400    27 / 400
    MSRVQ    35 / 600    36 / 600            33 / 400    31 / 400
    MSRES    37 / 600    23 / 600            23 / 400    21 / 400
    MSRMP    30 / 600    26 / 600            26 / 400    27 / 400

Clearly, RSR3N is having great difficulty with the product distributions, while the other randomized rule, RSR3T9, is quite competitive. Since good coverage could be gained by overestimation, we also compare each method's self-assessment of accuracy, measured by the root mean estimated variance RMEV = \big(\sum_{i=1}^{n} s_{mi}^2/n\big)^{1/2}, with a measurement of the true accuracy, \big(\sum_{i=1}^{n} (Z_{mi} - \bar{Z}_{m\cdot})^2/n\big)^{1/2}. Across the twenty problems, the ratios of these quantities should cluster about 1 if the assessments are honest. In Figure 6, aside from the failures of RSR3N, we do not see any strong tendency for over- or underestimation of accuracy. This important result allows us to rely on the standard errors given by each method.
Table 3: Mean log(RMSE) for each method by distribution

    Method      Dirichlet   Transformed Dirichlet   Product Logistic   Product Extreme Value
    MCIST5      -5.33 B     -5.70 E                 -5.17 D            -3.79 E
    RSR3N       -5.90 A     -6.46 D                 -3.11 E            -1.93 F
    RSR3T9      -5.31 B     -5.60 E                 -5.34 C            -4.08 D
    MSRA        -5.92 A     -7.39 BC                -5.34 C            -4.24 CD
    MSRVQ       -5.97 A     -7.81 A                 -5.45 C            -4.34 BC
    MSRES       -6.00 A     -7.53 B                 -6.68 A            -4.66 A
    MSRMP       -5.92 A     -7.34 C                 -5.92 B            -4.45 B
    Method R^2   .055        .386                    .703               .479
    Total R^2    .971        .993                    .996               .993

Third, we want to determine which method is best. We will analyze the logarithm of the root mean square error, RMSE = \big(\sum_{i=1}^{n} (Z_{mi} - 1)^2/n\big)^{1/2}, separately for each distribution. We split each sample of 100 in half to form two replicates. We analyzed the results as a three-way ANOVA with method as one factor, problem as the second factor, and replication as the third, including all two-factor interactions and using the three-factor interaction for error. At the bottom of Table 3 are the R^2's for the full model, as well as the contribution of method as a factor. For all four distributions, the main effects for method and problem accounted for most of the variation. Also given in Table 3 are the means for each method across the problems (6 or 4); note that smaller is better. Next to each mean are groupings arising from multiple comparisons at level 0.05 using least significant differences, with A used to label the best, and so on. For the Dirichlet problems, the different methods account for little variation in log(RMSE), and most of the methods appear to be equally competitive. However, since log(2) ≈ .7, the difference between MCIST5 and MSRES translates to twice the RMSE, and requires four times the number of evaluations for equal accuracy. For the other three distributions, however, the methods perform quite differently, and the mixed methods are clearly superior.

For the final comparison, we have compiled ten examples from the literature, all in at least 3 dimensions, which are described in the Appendix. Only one of these (Example 1) has been used by more than one set of authors as a test problem. Four have been used by other authors as test problems. The largest problem (Example 10) challenged one of the authors for a real-time method for fast computation of forecast probabilities. Table 4 shows the performance of the seven methods on the 10 examples. Since we have seen that most methods give honest appraisals of their accuracy, here we give only the standard errors, s_{mi}. The mixed methods consistently outperform the others, with perhaps MSRES being the top performer. The improvement of these methods over importance sampling (MCIST5) is most dramatic in Examples 2 and 7, where the ratio of standard errors between MCIST5 and MSRES is as much as 40. A more typical value for the ratio is 4, which translates into 16 times as many Monte Carlo replications to achieve the same accuracy. Also given in Table 4 are the results from the tests for infinite mean and variance described in Section 3. These tests are set up to reject these troublesome hypotheses, so that rejection should give some level of confidence in the results. However, even for MCIST5, the hypothesis of infinite variance of the importance sampling weights is only rejected in half of these ten examples.
Table 4a: Standard Errors from Examples 1-5

    Method   Example 1   Example 2   Example 3   Example 4   Example 5
    MCIST5   .0196 iv    .0059 ok    .0175 iv    .0836 iv    .2074 iv
    RSR3N    .0200 iv    .0037 ok    .0083 iv    .0765 iv    .0820 im
    RSR3T9   .0094 iv    .0064 ok    .0083 iv    .0265 iv    .0631 iv
    MSRA     .0069       .0031       .0076       .0132       .0299
    MSRVQ    .0093       .0031       .0149       .0380       .0472
    MSRES    .0039       .0012       .0080       .0109       .0532
    MSRMP    .0095       .0010       .0110       .0106       .0271
    d        3           7           3           5           8

Tail test codes: im - infinite mean not rejected; iv - infinite variance not rejected; ok - infinite variance rejected.
Table 4b: Standard Errors from Examples 6-10

    Method   Example 6   Example 7   Example 8   Example 9   Example 10
    MCIST5   .0120 ok    .0040 ok    .0406 iv    .0105 ok    .0128 ok
    RSR3N    .0146 iv    .0017 ok    .0942 iv    .0278 iv    .0177 iv
    RSR3T9   .0058 ok    .0068 ok    .0198 iv    .0072 iv    .0082 ok
    MSRA     .0054       .0006       .0500       .0063       .0066
    MSRVQ    .0037       .0007       .0220       .0047       .0073
    MSRES    .0032       .0001       .0264       .0039       .0065
    MSRMP    .0026       .0001       .0247       .0037       .0050
    d        7           7           9           10          11

Tail test codes: im - infinite mean not rejected; iv - infinite variance not rejected; ok - infinite variance rejected.
Two comparisons from Table 4 are worth noting. Examples 3, 4, and 5 arise from the same eight-parameter problem (Example 5). Integrating out the covariance parameters analytically reduces the dimension to 5, with a drop in the standard errors of one half or more (Example 4). Integrating out two more parameters analytically drops the dimension to 3 (Example 3) and gives some improvements in the poorer performers. Clearly, the advantage of analytical integration is not lost in these improved integration methods. Examples 6 and 7 differ only by sample size, with Example 6 using only 1/5 of the data of Example 7. With more data, the posterior is nearly multivariate normal, and the methods here, especially the mixed methods, take advantage of normality. The comparison can also be viewed in Figures 7 and 8, where the scaled posterior and standard error plots (similar to Figure 4) show how closeness to normality is exhibited in the spherical-radial parameterization.
8 Concluding Remarks

The results in the previous section indicate that the mixed spherical-radial methods give superior accuracy for the same computational effort. But this improved performance is only part of the story. Plaguing both normal Monte Carlo importance sampling and Markov Chain Monte Carlo is the difficulty of verifying whether the necessary conditions are met. With the mixed spherical-radial methods, the error estimates are simple and sound, and the diagnostic plots can clearly show any problems. The error estimates have been shown to be reliable. The only requirement is that a mode is found near most of the probability mass. Additionally, the closer the posterior is to the multivariate normal, the better these methods perform. Clearly, the mixed spherical-radial methods exemplify the omnibus method for integrating posterior distributions that we seek.

To explain the superior performance of the mixed integration methods, we look at the differences in integrating the spherical and radial parts. The spherical integration has the high dimension and calls for Monte Carlo integration, for which simply choosing points randomly on the surface of the sphere is the simplest approach. We find that the randomized spherical rules often substantially outperform simple random sampling. Using a general method such as Simpson's rule for the radial part allows for any reasonable sort of tail behavior, while remaining competitive with other approaches when the tail of the posterior is normal. The reader should also notice that the mixed spherical-radial methods can exploit certain features of the distribution, and their best implementation can adapt to various circumstances. The only level of adaptation attempted here was varying the number of rotations q with the radius r. After some exploration, one can adjust the choice of spherical integration rule, the interval size of the radial integration, and perhaps the radial integration rule itself, since only Simpson's rule has been considered in this study. A fully adaptive implementation of the mixed spherical-radial method is currently being studied by the authors (Genz and Monahan, 1996b).
REFERENCES

Cochran, W. G. (1978), Sampling Techniques (3rd ed.), Wiley, New York.
Cranley, R., and Patterson, T. N. L. (1976), "Randomization of Number Theoretic Methods for Multiple Integration," SIAM Journal on Numerical Analysis, 13, 904-914.
Davis, P. J., and Rabinowitz, P. (1984), Methods of Numerical Integration, Academic Press, New York.
Deak, I. (1990), Random Number Generation and Simulation, Akademiai Kiado, Budapest.
Dellaportas, P., and Wright, D. (1992), "A Numerical Integration Strategy in Bayesian Analysis," in Bayesian Statistics 4, J. M. Bernardo, et al., eds., Oxford University Press, 601-606.
Evans, M. (1991), "Adaptive Importance Sampling and Chaining," in Statistical Numerical Integration, N. Flournoy and R. K. Tsutakawa, eds., Contemporary Mathematics 115, American Mathematical Society, Providence, Rhode Island, 137-143.
Evans, M., Gilula, Z., and Guttman, I. (1989), "Latent Class Analysis of Two-way Contingency Tables by Bayesian Methods," Biometrika, 76, 557-563.
Evans, M., and Swartz, T. (1992), "Some Integration Strategies for Problems in Statistical Inference," Computing Science and Statistics, 24, 310-317.
Evans, M., and Swartz, T. (1996), "Methods for Approximating Integrals in Statistics with Special Emphasis on Bayesian Integration Problems," Statistical Science, 10, 254-272.
Genz, A. C., and Monahan, J. F. (1996a), "Stochastic Integration Rules for Infinite Regions," to appear in SIAM Journal on Scientific Computing.
Genz, A. C., and Monahan, J. F. (1996b), "A Mixed Algorithm for Multiple Integrals over Infinite Regions," in preparation.
Geweke, J. (1988), "Antithetic Acceleration of Monte Carlo Integration in Bayesian Inference," Journal of Econometrics, 38, 73-89.
Geweke, J. (1989), "Bayesian Inference in Econometric Models Using Monte Carlo Integration," Econometrica, 57, 1317-1340.
Haeusler, E., and Teugels, J. L. (1985), "On the Asymptotic Normality of Hill's Estimator for the Exponent of Regular Variation," Annals of Statistics, 13, 743-756.
Hammersley, J. M., and Handscomb, D. C. (1964), Monte Carlo Methods, Methuen, London.
Hesterberg, T. (1991), "Importance Sampling for Bayesian Estimation," in Computing and Graphics in Statistics, A. Buja and P. Tukey, eds., Springer-Verlag, New York, 63-75.
Hesterberg, T. (1995), "Weighted Average Importance Sampling and Defensive Mixture Distributions," Technometrics, 37, 185-194.
Hill, B. M. (1975), "A Simple General Approach to Inference About the Tail of a Distribution," Annals of Statistics, 3, 1163-1174.
Hosmer, D. W., Jr., and Lemeshow, S. (1989), Applied Logistic Regression, Wiley, New York.
Johnston, J. (1963), Econometric Methods (1st ed.), McGraw-Hill, New York.
Kloek, T., and van Dijk, H. K. (1978), "Bayesian Estimates of Equation System Parameters: An Application of Integration by Monte Carlo," Econometrica, 46, 1-19.
Krall, J., Uthoff, V., and Harley, J. (1975), "A Step Up Procedure for Selecting Variables Associated with Survival," Biometrics, 31, 49-57.
Lawless, J. F. (1982), Statistical Models and Methods for Lifetime Data, Wiley, New York.
Monahan, J. F. (1993), "Testing the Behavior of Importance Sampling Weights," Computing Science and Statistics, 24, 112-117.
Monahan, J. F., Schrab, K. J., and Anderson, C. E. (1993), "Statistical Methods for Forecasting Tornado Intensity," in Statistical Sciences and Data Analysis, K. Matushita, et al., eds., VSP, Utrecht, The Netherlands, 13-24.
Mysovskikh, I. P. (1980), "The Approximation of Multiple Integrals Using Interpolatory Cubature Formulae," in Quantitative Approximation, R. A. DeVore and K. Scherer, eds., Academic Press, New York, 217-243.
Mysovskikh, I. P. (1981), Interpolatory Cubature Formulas (Russian), Izd. 'Nauka', Moscow-Leningrad, 312.
Naylor, J. C., and Smith, A. F. M. (1982), "Applications of a Method for the Efficient Computation of Posterior Distributions," Applied Statistics, 31, 214-225.
Siegel, A. F., and O'Brien, F. (1985), "Unbiased Monte Carlo Integration Methods with Exactness for Low Order Polynomials," SIAM Journal on Scientific and Statistical Computing, 6, 169-181.
Smith, A. F. M., Skene, A. M., Shaw, J. E. H., Naylor, J. C., and Dransfield, M. (1985), "The Implementation of the Bayesian Paradigm," Communications in Statistics, A14, 1079-1102.
Stewart, G. W. (1980), "The Efficient Generation of Random Orthogonal Matrices with an Application to Condition Estimation," SIAM Journal on Numerical Analysis, 17, 403-409.
Tierney, L., and Kadane, J. (1986), "Accurate Approximations for Posterior Moments and Marginal Densities," Journal of the American Statistical Association, 81, 82-86.
Tierney, L., Kass, R. E., and Kadane, J. (1989), "Fully Exponential Laplace Approximations to Expectations and Variances of Nonpositive Functions," Journal of the American Statistical Association, 84, 710-716.
Turnbull, B. W., Brown, B. W., Jr., and Hu, M. (1974), "Survivorship Analysis of Heart Transplant Data," Journal of the American Statistical Association, 69, 74-80.
van Dijk, H. K., and Kloek, T. (1980), "Further Experience in Bayesian Analysis Using Monte Carlo Integration," Journal of Econometrics, 14, 307-328.
van Dijk, H. K., and Kloek, T. (1984), "Experiments with Some Alternatives for Simple Importance Sampling in Monte Carlo Integration," in Bayesian Statistics 2, J. M. Bernardo, et al., eds., North Holland, Amsterdam, 1-20.
Zellner, A., and Geisel, M. S. (1970), "Analysis of Distributed Lag Models with Application to Consumption Function Estimation," Econometrica, 38, 865-887.
Zellner, A. (1971), An Introduction to Bayesian Inference in Econometrics, Wiley, New York.
Appendix. Ten Examples as Test Problems

Example 1 Stanford Heart Transplant Data (d = 3)
Turnbull et al. (1974) propose a model which has been analyzed as a Bayesian problem by Naylor and Smith (1982), Tierney and Kadane (1986), and Tierney, Kass and Kadane (1989), among others. This apparently unimposing problem suffers from surprisingly heavy tail behavior in the posterior, which can cause serious difficulties in importance sampling. The small dimension suggests product integration to obtain the "correct" answer, and this pursuit motivated one of the authors toward the methods proposed here.

Example 2 Lawless Proportional Hazards Model (d = 7)

This proportional hazards problem given by Lawless (1982) was used as an example by Dellaportas and Wright (1992). Here the sample size is relatively small (n = 65) but the dimension larger (d = 7), with five explanatory variables. We have used the data as given by Lawless, which differ from the original data of Krall et al. (1975).

Examples 3, 4, and 5 Simultaneous Equations Model (d = 3, 5, 8)

Kloek and van Dijk (1978) and van Dijk and Kloek (1980, 1984) use a simple example given by Johnston (1963, p. 268) of a macroeconometric model with two endogenous variables, one exogenous variable, and one zero restriction. This 10-observation problem has five linear parameters and three for the covariance matrix. Example 5 consists of this problem in its full 8-parameter form. The three covariance parameters can be integrated out analytically to form the five-parameter Example 4. Two more parameters can be integrated out analytically to arrive at the three-parameter form (Example 3) used by Kloek and van Dijk. Here we can see the progression of dimension and the value of any analytic integrations to reduce dimension.

Examples 6 and 7 Birthweight Data (d = 7)

Hosmer and Lemeshow (1989, p. 91) describe a study of the incidence of low birth weight in infants which serves as an example of a high dimension logistic regression. We have chosen a subset of 7 explanatory variables: mother's age, weight, smoking, premature labors, hypertension, and uterine irritability (and intercept). In order to view
the effect of sample size, we have taken only 1/5 of the 189 observations for Example 6; Example 7 uses all of the data.

Example 8 Contingency Table Model (d = 9)

Evans and Swartz (1996) give this problem as their second example. Evans, Gilula and Guttman (1989) model a 3x3 cross-classification by Wing (1962) with p_{ij} = θ α_i(1) β_j(1) + (1 - θ) α_i(2) β_j(2), where both the α_i's and the β_j's sum to 1. We have reparameterized θ using a logit transform, and the other eight parameters in pairs with similar transformations, to extend the parameter space to R^9.

Example 9 ANOVA with Student t Errors (d = 10)

Evans and Swartz (1996) give this simulated example as their first example. The model essentially is a 9-level one-way layout with iid Student t (3 df) errors and the scale parameter, log transformed, as the tenth parameter. We note that this apparent 10-dimensional problem can be done as 9 one-dimensional integrals nested within the one-dimensional integral for the scale parameter to obtain "correct" results.

Example 10 Tornado Model (d = 11)

Monahan, Schrab and Anderson (1993) use an ordinal regression model for forecasting tornado intensity using five explanatory variables. The seven ordinal levels account for the remaining six parameters, for which an ordering restriction serves as an additional complication. In spite of this, the sample size is large enough (157) to expect a well-behaved problem.
Figure 1: Posterior and Standard Error as a Function of Radius
Figure 2: Standard Error as a Function of Radius
Figure 3: Log Posterior as a Function of Radius
Figure 4: Posterior and Standard Error Scaled by r^{d-1}
Figure 5: Testing for Unbiasedness
Figure 6: Accuracy of Variance Estimates
Figure 7: Scaled Posterior and Standard Error for Example 6
Figure 8: Scaled Posterior and Standard Error for Example 7