Parameter Expansion for Data Augmentation

Jun S. Liu and Ying Nian Wu
Abstract

Viewing the observed data of a statistical model as incomplete and augmenting its missing parts are useful for clarifying concepts and central to the invention of two well-known statistical algorithms: expectation-maximization (EM) and data augmentation. Recently, Liu, Rubin, and Wu (1998) demonstrated that expanding the parameter space along with augmenting the missing data is useful for accelerating iterative computation in an EM algorithm. The main purpose of this article is to rigorously define a parameter expanded data augmentation (PX-DA) algorithm and to study its theoretical properties. The PX-DA is a special way of using auxiliary variables to accelerate Gibbs sampling algorithms and is closely related to reparameterization techniques. Theoretical results concerning the convergence rate of the PX-DA algorithm and the choice of prior for the expansion parameter are obtained. In order to understand the role of the expansion parameter, we establish a new theory for iterative conditional sampling under the transformation group formulation, which generalizes the standard Gibbs sampler. Using the new theory, we show that the PX-DA algorithm with a Haar measure prior (often improper) for the expansion parameter is always proper and is optimal among a class of such algorithms including reparameterization.

KEYWORDS: Auxiliary variable; EM algorithm; Gibbs sampler; Group of transformations; Haar measure; Locally compact group; Markov chain Monte Carlo; Maximal correlation; Over-parameterization; Rate of convergence; Reparameterization.

Jun S. Liu is Assistant Professor, Department of Statistics, Stanford University. Ying Nian Wu is Assistant Professor, Department of Statistics, University of Michigan. The authorship is based on alphabetical order. Liu's research was partially supported by NSF grants DMS-9501570 and DMS-9803649, and the Terman fellowship. Wu thanks the Department of Statistics of Stanford University for partial support during the summer of 1997. Part of the manuscript was prepared while the first author was visiting the Department of Statistics of UCLA. The authors are grateful to Persi Diaconis, Andrew Gelman, Hui-Nien Hung, Chuanhai Liu, Xiao-Li Meng, Jeff Wu, two anonymous referees, and the Associate Editor for their helpful discussions and important suggestions.
1 INTRODUCTION

The incomplete-data formulation is fundamental in Rubin's (1978) causal inference model and is also key to both the expectation-maximization (EM) algorithm (Dempster, Laird, and Rubin 1977) and the data augmentation (DA) algorithm (Tanner and Wong 1987). In this formulation, the set of observed data is augmented to a set of "completed data" which typically follows a simpler model. Computation of the maximum likelihood estimates or the posteriors is accomplished by iterating between imputing the missing data and fitting the complete-data model.

Recently, Liu, Rubin, and Wu (1998) noticed that along with imputing the missing data, which simplifies computation at the expense of iterations, expanding the parameter space for the complete-data model can be a useful technique to accelerate the convergence of the EM algorithm. More precisely, they observed that, with the imputed missing data, some extra parameters can often be introduced without distorting the original observed-data model. A more precise definition appears in Section 3. In this case, the EM algorithm can be implemented for the expanded model (the PX-EM algorithm) and it converges with a monotone increase in observed-data likelihood. Liu et al. (1998) show both empirically and theoretically that the PX-EM algorithm can often yield significant acceleration over the ordinary EM algorithm. As a precursor to the PX-EM algorithm, Meng and van Dyk (1997) propose efficient augmentation which, instead of finding an over-parameterization as with the PX-EM algorithm, identifies an optimal reparameterization among a class of reparameterizations indexed by a working parameter.

Because the DA algorithm, or the more general Gibbs sampling algorithms (Gelfand and Smith 1990), can be viewed as stochastic generalizations of the EM-type algorithms, it has been widely speculated that over-parameterization (or auxiliary variable) and reparameterization (or efficient augmentation) methods should be similarly useful for designing more efficient stochastic algorithms. In fact, some efforts along this line have been made (see Meng and van Dyk 1997 and the discussions therein). Although success stories of using auxiliary variables in Markov chain Monte Carlo (MCMC) abound (e.g., Higdon 1998), theory indicates that extreme care must be taken in order to design a useful method (Geyer 1992; Liu 1994).

Following the parameter expansion idea, in this article we define a parameter expanded data augmentation (PX-DA) algorithm. Then we explore the relationship between the PX-DA algorithm and over-parameterization, reparameterization, and groups of transformations. In proving optimality properties of the PX-DA algorithm, we establish a new theory of iterative conditional sampling. In particular, it is shown that a Gibbs-like step can be generalized as
movement along an orbit of a transformation group acting on the space of interest. An explicit formula for drawing an element from the group, conditional on the current state, is given so that the target distribution of interest is invariant under such moves.

To illustrate the main benefits and important issues, Section 2 works through an intentionally simple example which, nevertheless, conveys most of the relevant ideas. Section 3 defines a general PX-DA algorithm and shows that reparameterization can often be viewed as a special PX-DA algorithm. This connection helps us to prove an optimality result in Section 5. Section 4 introduces the data-transformation concept in an MCMC sampler and shows that any PX-DA algorithm in which the expansion parameter indexes a data-transformation mechanism and has an independent prior converges no slower than the ordinary DA algorithm. Section 5 establishes a new theory that generalizes the regular Gibbs sampler, studies the use of Haar priors in the PX-DA algorithm, and investigates why such priors should be used. Connections between our results and fiducial distributions, maximal ancillarity, non-informative priors, etc., are noted. Section 6 gives a numerical example to illustrate the methodology, and Section 7 concludes with a discussion.
2 A SIMPLE EXAMPLE

To illustrate basic ideas of the PX-DA algorithm, let us consider the complete-data model
y | θ, z ~ N(θ + z, 1),   z | θ ~ N(0, D),   (1)
where θ is an unknown parameter and D is a known constant. In this model, y is the observed data and z is missing. Hence, the observed-data model is y | θ ~ N(θ, 1 + D). This example can be viewed as a special case of the random effects model. The observed data y and missing data z are multidimensional random variables in general, but are kept as 1-dimensional in this simple example. With a flat prior on θ, the posterior distribution of θ is N(y, 1 + D). Let us treat this example as a missing data problem in which one uses the DA algorithm to iterate between the following two conditional draws,

z | θ, y ~ N((y − θ)/(1 + D^{-1}), 1/(1 + D^{-1}))   and   θ | y, z ~ N(y − z, 1),   (2)

for simulating the posterior distribution of θ given y. Liu, Wong and Kong (1994, 1995) derive that the convergence rate (the second largest eigenvalue of the induced Markov chain transition operator) of any DA scheme is equal to the square of the maximal correlation between z and θ, which, in this case (i.e., with normality), happens to be the absolute value of the lag-1 autocorrelation of the draws of θ at stationarity. This rate (the larger its value, the slower the chain converges) can be computed as

r_0 = 1 − E{var(θ | y, z)}/var(θ | y) = 1/(1 + D^{-1}).
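As a quick check of this value (a worked step added here, not part of the original text): in model (1), var(θ | y, z) = 1 and var(θ | y) = 1 + D, so

$$ r_0 \;=\; 1 - \frac{E\{\mathrm{var}(\theta \mid y, z)\}}{\mathrm{var}(\theta \mid y)} \;=\; 1 - \frac{1}{1+D} \;=\; \frac{1}{1+D^{-1}}. $$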
The rate r_0 also corresponds to the Bayesian fraction of missing information in Rubin (1987). It is seen from the ordinary DA iterations (2) that forcing z to have mean zero causes relatively high "association" between θ and z and slows down the algorithm. We can over-parameterize (1) to get
y | θ, α, w ~ N(θ − α + w, 1),   w | θ, α ~ N(α, D).   (3)
This expanded complete-data model preserves the observed-data model y | θ, α ~ N(θ, 1 + D), but has an expansion parameter α identifiable only from the complete data (y, w). We call such a parameterization "over-centering". When α is fixed at 0, the expanded model (3) reduces to the original model (1). To implement a DA algorithm for the expanded model, we need to assign a prior for α. For now we use a proper prior, α ~ N(0, B), which is independent of the prior of θ. Then we have the following parameter expanded DA (PX-DA) algorithm:

1. Draw (w, α) conditional on (θ, y); i.e., draw α | θ, y ~ N(0, B) and draw
   w | α, θ, y ~ N((y − θ)/(1 + D^{-1}) + α, 1/(1 + D^{-1})).
2. Draw (θ, α) conditional on (y, w); i.e., draw
   α | y, w ~ N(Bw/(B + D), 1/(B^{-1} + D^{-1})),
   and draw θ | y, w, α ~ N(y − w + α, 1).

Because α is sampled in both of the foregoing steps, if we only look at θ and w, the procedure is mathematically equivalent to a collapsed sampler (Liu 1994) with α integrated out from the joint posterior distribution. The only reason for carrying α in the procedure is to simplify computation. The rate of convergence of this PX-DA algorithm is

r_px = 1 − E{var(θ | y, w)}/var(θ | y) = {D − (B^{-1} + D^{-1})^{-1}}/(1 + D) ≤ 1/(1 + D^{-1}).
Thus, the new algorithm is never slower than the ordinary DA algorithm for model (1). As B → ∞, the rate r_px goes to 0, meaning that the algorithm converges in one iteration. It is easy to check that as B → ∞, the algorithm is still well defined by its limiting one-iteration transition despite the fact that the prior for α is improper. Computationally, this limiting transition can be realized by expressing the two steps of the PX-DA as a random mapping depending on B and then letting B go to infinity. More precisely, if one starts at θ_0, then Step 1 corresponds to setting α = √B Z_1 and

w = (y − θ_0)/(1 + D^{-1}) + √B Z_1 + Z_2/√(1 + D^{-1}),

and the new θ drawn at Step 2 is

θ_1 = y − {D/(B + D)} {(y − θ_0)/(1 + D^{-1}) + √B Z_1 + Z_2/√(1 + D^{-1})} + Z_3 √(1 + 1/(B^{-1} + D^{-1})),

where Z_1, Z_2, Z_3 are independent standard Gaussian random variables. Hence, as B → ∞ the limiting transition is θ_1 = y + √(1 + D) Z_3 ~ N(y, 1 + D). In Section 4.3 we show that this limiting argument is generally applicable.
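The following is a minimal simulation sketch of this example (ours, not from the original; it assumes NumPy and arbitrary values D = 4 and y = 1). It runs the ordinary DA chain (2) and the limiting PX-DA transition, and compares the empirical lag-1 autocorrelations with the theoretical rates 1/(1 + D^{-1}) and 0.

import numpy as np

rng = np.random.default_rng(0)
D, y, n_iter = 4.0, 1.0, 20000

def da_chain(theta0):
    # Ordinary DA for model (1): alternate z | theta, y and theta | y, z as in (2).
    theta, draws = theta0, np.empty(n_iter)
    for i in range(n_iter):
        z = rng.normal((y - theta) / (1 + 1 / D), np.sqrt(1 / (1 + 1 / D)))
        theta = rng.normal(y - z, 1.0)
        draws[i] = theta
    return draws

def pxda_chain():
    # Limiting PX-DA (B -> infinity): theta is redrawn from N(y, 1 + D) every sweep.
    return rng.normal(y, np.sqrt(1 + D), size=n_iter)

def lag1_acf(x):
    x = x - x.mean()
    return float(x[:-1] @ x[1:] / (x @ x))

print("DA    lag-1 acf:", lag1_acf(da_chain(0.0)), " theory:", 1 / (1 + 1 / D))
print("PX-DA lag-1 acf:", lag1_acf(pxda_chain()), " theory: 0")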
3 THE PX-DA ALGORITHM

3.1 A Formal Definition of the PX-DA Algorithm

For the rest of the article, we let y be the observed data, z be the missing data in the original model, and θ be the parameter of interest. We use the generic notation f to denote the probability densities related to the original model in a standard DA implementation. For example, f(y, z | θ) represents the complete-data model, f(y | θ) is the observed-data model, f(θ) denotes the prior, and f(y, z) = ∫ f(y, z | θ) f(θ) dθ is the marginal distribution of the complete data. We will use p for distributions related to the expanded model in the PX-DA framework.
The DA algorithm (Tanner and Wong 1987):

1. Draw z ~ f(z | θ, y) ∝ f(y, z | θ).
2. Draw θ ~ f(θ | z, y) ∝ f(y, z | θ) f(θ).

Suppose we can find a hidden identifiable parameter α in the complete-data model f(y, z | θ). Then we can expand this model to a larger model p(y, w | θ, α) that preserves the observed-data model f(y | θ). Mathematically, this means that the probability distribution p(y, w | θ, α) satisfies

∫ p(y, w | θ, α) dw = f(y | θ).
We call α an expansion parameter throughout the article. For notational clarity, we here use w instead of z to denote the missing data under the expanded model. In order to implement the DA algorithm for the expanded model, we need to give a joint prior distribution p(θ, α). It is straightforward to prove that
Proposition 3.1 The posterior distribution of θ is the same for both models if and only if the marginal prior distribution for θ from p(θ, α) agrees with f(θ), i.e., p(θ | y) = f(θ | y) if and only if ∫ p(θ, α) dα = f(θ).
Therefore, we only need to specify the conditional prior distribution p(α | θ) while maintaining the marginal prior for θ at f(θ). It is clear that given θ and y, the posterior distribution of α, p(α | y, θ), remains p(α | θ) because α is not identifiable from y. For the time being, we assume that p(α | θ) is a proper probability distribution. The PX-DA algorithm can then be defined as iterating the following steps:
The PX-DA algorithm:

1. Draw (α, w) jointly from their conditional distribution given (y, θ). This can be achieved by first drawing α from p(α | θ), and then drawing w according to
   w | y, θ, α ~ p(y, w | θ, α).
2. Draw (θ, α) jointly according to
   θ, α | y, w ~ p(y, w | θ, α) p(α | θ) f(θ).

Mathematically, this algorithm iteratively draws w conditional on θ and y, and then draws θ conditional on w and y, with α marginalized out in both steps. So the convergence of this scheme to its target distribution follows directly from standard Markov chain theory (e.g., Liu et al. 1995). In Section 4, we identify two conditions under which the PX-DA algorithm can be shown superior to the DA algorithm. In Section 5, we show that assigning a Haar invariant prior to α is optimal under fairly reasonable conditions.
3.2 Degenerate Priors and Reparameterization

Because p(y, w | θ, α) is an extension of f(y, z | θ), it is often the case that one can find a constant α_null such that f(y, z | θ) = p(y, z | θ, α_null); i.e., the expansion parameter is hidden at α_null in the original model. Therefore, if we let p(α | θ) = δ_{α_null}, then the corresponding PX-DA algorithm reduces to the original DA algorithm for f(y, z | θ). For the simple example in Section 2, we have α_null = 0. If p(α | θ) = δ_{A(θ)}, i.e., α = A(θ), for some function A, then the corresponding PX-DA algorithm becomes the DA algorithm for p(y, w | θ, A(θ)), which can be interestingly viewed as a reparameterization of f(y, z | θ). Consider the example in Section 2. By letting p(α | θ) = δ_θ, we obtain the following re-centering parameterization (Gelfand, Sahu, and Carlin 1995):
y | θ, w ~ N(w, 1),   w | θ ~ N(θ, D).   (4)
The DA algorithm for model (4) has a rate of convergence r_1 = 1/(1 + D). Thus, when D > 1, this algorithm converges faster than under the original parameterization, whereas when D < 1 the resulting algorithm is slower. Now let us consider the problem of searching for the best degenerate prior p(α | θ) = δ_{A(θ)} so as to induce an optimal reparameterization. For simplicity, we let D = 1 and consider only the class of scalar functions {A(θ) = Aθ : A ∈ R^1}. Then for a fixed A, the corresponding reparameterization (which we call partial centering) is

y | θ, w ~ N((1 − A)θ + w, 1),   w | θ ~ N(Aθ, 1).   (5)
Model (5) leads to a DA algorithm with a rate of convergence r(A) = 1 − 1/{2(A^2 + (1 − A)^2)}, which achieves its minimum 0 at A = .5. Therefore, the prior p(α | θ) = δ_{θ/2} leads to a PX-DA algorithm that converges in one iteration. Finding the optimal parameterization in (5) among all those indexed by A is essentially the idea of efficient augmentation of Meng and van Dyk (1997), which has a broader statistical meaning as augmenting the least amount of missing information. It is interesting to realize that efficient augmentation is technically a special PX-DA algorithm where one wants to find a degenerate prior of the form p(α | θ) = δ_{A(θ)} to make the dependence between the missing data and θ as small as possible. Since the set of all A(θ) is too large to optimize over, it has to be parameterized (e.g., A(θ) = Aθ) with the aid of intuition. An alternative strategy, which we will explore in this article, is to go to the other extreme, i.e., to make θ and α independent a priori and let the imputed data decide α at each iteration. Note that the non-informative prior p(α | θ) ∝ 1 in the simple example also leads to a PX-DA algorithm converging in one iteration. Section 5 gives a theoretical reason for why it is generally favorable to use an invariant prior for α. In the next section, we show that together with a data transformation mechanism, the independent prior of α always leads to a PX-DA algorithm with improved rate of convergence.
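As a quick verification of the optimal value of A (a worked step added here, not in the original):

$$ \frac{d}{dA}\left\{A^2 + (1-A)^2\right\} = 4A - 2 = 0 \iff A = \tfrac12, \qquad r\!\left(\tfrac12\right) = 1 - \frac{1}{2\left(\tfrac14 + \tfrac14\right)} = 0. $$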
4 DATA TRANSFORMATION AND CONVERGENCE PROPERTY OF PX-DA

4.1 Parameter Expansion by Data Transformation

It is often the case that the expansion parameter corresponds to a transformation of the missing data z. Thus, adjusting α can be intuitively understood as "analyzing" the missing data. Condition (i) as follows is fundamental in understanding the role of the expansion parameter. With Conditions (i) and (ii), we can prove that the PX-DA algorithm outperforms the DA algorithm.
Condition (i) The parameter α indexes a "data transformation" mechanism with z = t_α(w). That is, for any fixed α, the function t_α induces a one-to-one and differentiable mapping (or, a C^1-diffeomorphism) between z and w, and w plays the role of missing data in the new scheme. In other words, for any fixed value of α, the original model f(y, z | θ) determines the expanded model p(y, w | θ, α) in the following way:

p(y, w | θ, α) = f(y, t_α(w) | θ) |J_α(w)|,

where J_α(w) = det{∂t_α(w)/∂w} is the Jacobian term evaluated at w.
Condition (ii) Parameters θ and α are independent a priori, i.e., p(α | θ) = p_0(α).

Under the two conditions, the PX-DA algorithm becomes
Scheme [1]:

1. Draw z ~ f(z | θ, y), α ~ p_0(α), and compute w = t_α^{-1}(z).
2. Draw (θ, α) jointly according to
   θ, α | y, w ~ f(y, t_α(w) | θ) |J_α(w)| p_0(α) f(θ).
In many Bayesian computation problems, the marginal distribution of the complete data, f(y, z) = ∫ f(y, z | θ) f(θ) dθ, can be obtained in closed form. Then Step 2 of Scheme [1] can sometimes be accomplished by first drawing α from [α | y, w] ∝ f(y, t_α(w)) |J_α(w)| p_0(α) (its marginal posterior distribution), and then drawing θ conditional on α. In this case, Scheme [1] can be rewritten as a minor modification of the DA algorithm:
Scheme [1.1]:

1. Draw z ~ f(z | θ, y), in the same way as the DA algorithm.
2. Draw α_0 ~ p_0(α), and compute w = t_{α_0}^{-1}(z). Draw α_1 ~ [α | y, w] ∝ f(y, t_α(w)) |J_α(w)| p_0(α). Compute z' = t_{α_1}(t_{α_0}^{-1}(z)).
3. Draw θ ~ f(θ | y, z'), in the same way as the DA algorithm.

Compared with the DA algorithm, Scheme [1.1] conducts an adjustment of the missing data, from z to z', through the use of a set of transformations, before the next θ is drawn. The convergence of Scheme [1.1] follows from the convergence of the general PX-DA algorithm. Moreover, as a result of the following theorem, Step 2 of Scheme [1.1] leaves f(z | y) invariant.
Theorem 4.1 Suppose (i) the random variable z ~ π(z); (ii) {t_α : α ∈ A} is a set of transformations on z; and (iii) a probability measure p_0(α) can be defined on A. Let α_0 be a random draw from the prior p_0(α) and let w = t_{α_0}^{-1}(z). If

α_1 ~ π(α | w) ∝ π(t_α(w)) |J_α(w)| p_0(α),

then z' = t_{α_1}(w) follows the distribution π.

Proof: Consider the joint distribution of α and w,

p(α, w) = p_0(α) π(t_α(w)) |J_α(w)|,   (6)

under which the marginal distribution of α is p_0(α) and t_α(w) ~ π. Therefore, if α_0 ~ p_0(α) and z ~ π, then w = t_{α_0}^{-1}(z) must follow the marginal distribution of w under (6). Because the new α_1 is drawn from the conditional distribution of α given w, under (6), we easily see that the joint distribution of (α_1, w) has to be the same as (6). Hence, z' = t_{α_1}(w) follows the distribution π. □
If we let π(z) = f(z | y) in Theorem 4.1, it is clear that Step 2 of Scheme [1.1] leaves f(z | y) invariant. Generally speaking, Theorem 4.1 provides us a recipe for conducting a move, which leaves π invariant, along the trace S = {t_α(z) : α ∈ A} of a set of transformations.
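The following is a small numerical illustration of Theorem 4.1 (added here, not from the original), using the translation group t_α(w) = w + α, a standard Gaussian target π = N(0, 1), and a prior p_0 = N(0, B); under these assumptions the conditional draw of α_1 given w is Gaussian in closed form, and the output z' should again follow π.

import numpy as np

rng = np.random.default_rng(1)
B, n = 4.0, 200_000

z = rng.normal(size=n)                      # z ~ pi = N(0, 1), the target
alpha0 = rng.normal(0.0, np.sqrt(B), n)     # alpha_0 ~ p_0 = N(0, B)
w = z - alpha0                              # w = t_{alpha_0}^{-1}(z), since t_alpha(w) = w + alpha

# pi(alpha | w) is proportional to pi(w + alpha) p_0(alpha), i.e., N(-w B/(B+1), B/(B+1)) here.
alpha1 = rng.normal(-w * B / (B + 1.0), np.sqrt(B / (B + 1.0)))

z_new = alpha1 + w                          # z' = t_{alpha_1}(w)
print(z_new.mean(), z_new.var())            # close to 0 and 1: pi is preserved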
4.2 Rate of Convergence

Theorem 4.2 Suppose Conditions (i) and (ii) hold. Then Scheme [1.1], or equivalently Scheme [1], converges no slower than the DA algorithm defined in subsection 3.1.

Proof: The movement in one PX-DA iteration can be represented by a simple directed graph (or, conditional independence graph)

θ_old → z → z' → θ_new,

whereas the DA iteration can be represented as

θ_old → z → θ_new.

The conditional independence between z' and θ_old is a consequence of Step 2 of Scheme [1.1]. To prove the theorem, we only need to show that the complete-data variance of h(θ) given z in the PX-DA algorithm is no smaller than that in the DA algorithm for any square integrable function h (this is because the convergence rate of a DA algorithm is completely characterized by the infimum of this variance, with the infimum over all mean zero and variance one functions; see Liu et al. 1994). The intuition behind this is that the extra variation caused by adjusting z to z' makes θ move more freely under the PX-DA algorithm. In the following, subscript "DA" indicates the probability transition under the DA algorithm, subscript "PX-DA" indicates that under the PX-DA algorithm, and subscript f indicates the distribution (or joint distribution) induced by the original model f(y, z, θ). Because Scheme [1.1] and the DA algorithm produce the same joint distribution for (θ, z) (upon convergence), we have for all h(θ), square integrable functions with respect to f(θ | y), that

var_{PX-DA}{h(θ)} = var_{DA}{h(θ)} = var_f{h(θ)}.

Furthermore, the conditional variance of θ under the PX-DA algorithm satisfies

E_{PX-DA}[var_{PX-DA}{h(θ) | z}]
  = E_f[ E_{PX-DA}[var_f{h(θ) | z'} | z] + var_{PX-DA}[E_f{h(θ) | z'} | z] ]
  = E_f[var_f{h(θ) | z'}] + E_f[var_{PX-DA}[E_f{h(θ) | z'} | z]]
  ≥ E_f[var_f{h(θ) | z}].
The second equality holds because the relationship between θ and z' in the PX-DA algorithm is induced by the sampling of θ from f(θ | y, z'). On the other hand, Liu et al. (1994) show that for the DA algorithm

E_{DA}[var_{DA}{h(θ) | z}] = E_f[var_f{h(θ) | z}].

Thus, the desired result follows from Theorem 3.2 of Liu et al. (1994). □

Remark 1. It is clear from the proof that the comparison remains valid if Step 2 of Scheme [1.1] is generalized to any adjustment of z to z' that is independent of θ_old and leaves the marginal distribution of z invariant.

Remark 2. Both Conditions (i) and (ii) are important in the derivation of Theorem 4.2 and can be satisfied by all the examples in Liu et al. (1998). For Condition (ii), we have seen in Section 3.2 that the dependent (degenerate) prior p(α | θ) = δ_{A(θ)} can lead to a PX-DA algorithm with a slower rate of convergence than the original DA algorithm. For Condition (i), consider the following parameter expansion scheme:

[y | θ, w] = N(θ + w, 1 + α),   w ~ N(0, D − α),

where α is between [−1, D]. This scheme does not admit a data transformation mechanism. It is obvious that if and only if α = D, the corresponding DA algorithm converges in one iteration. Therefore, unless p_0(α) is a point mass on D, the PX-DA algorithm is always slower than the ordinary DA algorithm with α fixed at D.

Remark 3. A variation of Scheme [1], which we call the conditional PX-DA algorithm, is to let α_0 be the same as the α_1 drawn in the previous iteration instead of a new draw from its prior, i.e., this scheme iterates between sampling w given (θ, α, y) and sampling (θ, α) given (w, y). Although the conditional PX-DA algorithm shares the same lag-1 autocorrelation for θ with the PX-DA algorithm, it is less efficient than the PX-DA algorithm because of the comparison theorem of Liu et al. (1994). Furthermore, the conditional PX-DA algorithm can sometimes be inferior to the ordinary DA algorithm. Take the example in Section 2. If Step 1 of drawing α from N(0, B) were skipped, the convergence rate of the scheme would have been
r_2 = 1 − E(var(w | y, θ, α))/var(w | y) = 1 − (1 + D^{-1})^{-1}/(B + D),

which is always greater than (1 + D^{-1})^{-1} provided that B > 0. Hence, the conditional PX-DA algorithm is slower than the ordinary DA algorithm in this case.
4.3 A Limiting Procedure for Using Improper Priors

The use of an independent prior for α can be intuitively understood as letting the imputed data, z, decide which transformation (i.e., α) to use during the iterative computation. This suggests that one may want to use a very diffuse prior for α, which often corresponds to an improper prior. In this case, Scheme [1] cannot be implemented because sampling from p_0(α) (Step 2) is no longer feasible. If the improper prior is the limit of a sequence of proper priors, we can use the following result to realize a PX-DA algorithm. To fix notation, let p_B(α) be a proper prior for α with hyper-parameter B, and suppose p_B(α) converges to an improper prior p_∞(α) as B → B_∞. Let K_B(z' | z) be the transition induced by Step 2 of Scheme [1.1]. If a limiting version of Step 2 exists as B → B_∞, then a limiting PX-DA algorithm exists and it should still have a better performance than the DA algorithm.
Theorem 4.3 Suppose K_B(z' | z) → K_∞(z' | z) as B → B_∞ for almost all z, where K_∞(z' | z) is a proper probability transition function. Then a limiting PX-DA algorithm can be implemented as: 1) draw z ~ f(z | y, θ); 2) draw z' ~ K_∞(z' | z); and 3) draw θ ~ f(θ | y, z'). This limiting PX-DA still converges to the target distribution.
Proof: This is a consequence of the following lemma. □

Lemma 4.1 Suppose K_B(x, y) is a sequence of probability transition functions, all having π(x) as their invariant distribution. If K_∞(x, y) = lim_{B→B_∞} K_B(x, y) is a proper transition function, then π is an invariant distribution of K_∞.

Proof: Because ∫ π(x) K_B(x, y) dx = π(y), a.s., by Fatou's lemma we have

∫ π(x) K_∞(x, y) dx = ∫ lim_{B→B_∞} π(x) K_B(x, y) dx ≤ lim_{B→B_∞} ∫ π(x) K_B(x, y) dx = π(y),   ∀ y.

Since ∫∫ π(x) K_∞(x, y) dx dy = ∫ π(y) dy = 1, we have that ∫ π(x) K_∞(x, y) dx = π(y), a.s. □
Deriving the computer code for K_∞(z' | z) often only involves simple algebra.
5 GROUP STRUCTURE AND HAAR PRIOR FOR THE EXPANSION PARAMETER

5.1 Locally Compact Group and the Haar Measure

More specific results can be obtained if the set of transformations {t_α : α ∈ A} is endowed with a finer structure, i.e., if it forms a locally compact group. We will show in the next few subsections that if the prior of α corresponds to a Haar measure, whether it is a proper probability distribution or not, there is always a well-defined PX-DA algorithm and it is optimal in terms of the convergence rate.

More formally, a set A is called a group with respect to an operation "·" if (i) ∀ α, α' ∈ A, α·α' ∈ A; (ii) there is an identity element e ∈ A so that e·α = α·e = α, ∀ α ∈ A; and (iii) ∀ α ∈ A, we can find a unique α^{-1} ∈ A so that α·α^{-1} = α^{-1}·α = e. We assume that the set of transformations {t_α : α ∈ A} has the same group structure as A, that is, t_e(z) = z, and ∀ α, α' ∈ A, t_α(t_{α'}(z)) = t_{α·α'}(z). We call A a locally compact group or topological group if topologically A is a locally compact space and the operations (α, α') → α·α' and α → α^{-1} are continuous (Rao, 1987). If the operations are analytic (i.e., in C^∞), A is called a Lie group. Group A is automatically a locally compact group if it is finite. For any measurable subset B ⊂ A and element α_0 ∈ A, the notation Bα_0 defines a subset of A resulting from "acting" on every element of B by α_0. A right Haar measure H(dα) on A is defined as a measure that is invariant under the group acting on the right, i.e., it satisfies

H(B) = ∫_B H(dα) = ∫_{Bα_0} H(dα) = H(Bα_0),   ∀ α_0 ∈ A and ∀ measurable subset B ⊂ A.
One can similarly define a left Haar measure. Under mild conditions, the right (or left) Haar measure is unique up to a positive multiplicative constant (Rao 1987). Saying that A is unimodular means that its right Haar measure is also a left Haar measure. When A is compact (e.g., a finite group) or abelian (i.e., α·α' = α'·α, e.g., the translation and scale groups), one can show that its right Haar measure is unimodular [Proposition 4 of Rao (1987), page 498]. Otherwise, the right and left Haar measures may differ by a modular function. If A is a compact group, then its unimodular Haar measure is the uniform probability measure. If A is the translation group, e.g., t_α(z) = z + α as in the over-centering parameterization, the unimodular Haar measure for α is simply the Lebesgue measure. If A is a scale group (i.e., for scalar α, t_α(z) = αz) as in the probit regression example in Section 6, the Haar measure for α is proportional to |α|^{-1} dα. If A is the group of non-singular k × k matrices, then the unimodular Haar measure is |α|^{-k} dα. In the following, we shall assume that a density H(α) exists for the unimodular Haar measure with respect to the Lebesgue measure, i.e., H(dα) = H(α) dα.
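As a quick check of the scale-group case (a worked step added here, not in the original), the measure |α|^{-1} dα on α > 0 is indeed invariant under right multiplication: for any measurable B ⊂ (0, ∞) and α_0 > 0, substituting α = βα_0 gives

$$ H(B\alpha_0) = \int_{B\alpha_0} \alpha^{-1}\, d\alpha = \int_{B} (\beta\alpha_0)^{-1}\, \alpha_0\, d\beta = \int_{B} \beta^{-1}\, d\beta = H(B). $$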
5.2 The PX-DA Algorithm With the Haar Prior

We will show that if the prior distribution of α is a unimodular Haar measure, then the following PX-DA scheme is always proper. Its optimality will be shown in Section 5.4.
Scheme [2]:

1. Draw z ~ f(z | θ, y), the same as in the DA algorithm.
2. Draw (θ, α) jointly according to
   θ, α | y, z ~ f(y, t_α(z) | θ) |J_α(z)| H(dα) f(θ);
   i.e., draw from the posterior distribution of the parameters in the expanded model.

The only difference between Scheme [2] and the DA algorithm is that Step 2 of Scheme [2] draws from the posterior of (θ, α) in the expanded model, with a Haar prior on α. Compared with Scheme [1.1], Scheme [2] essentially fixes α_0 at α_null = e instead of drawing it from p_0(dα). Paralleling Scheme [1.1], we rewrite Scheme [2] to conform with the ordinary DA algorithm:
Scheme [2.1]:

1. Draw z ~ f(z | θ, y).
2. Draw α ~ p(α | y, z) ∝ f(y, t_α(z)) |J_α(z)| H(dα). Compute z' = t_α(z).
3. Draw θ ~ f(θ | y, z').

As in Scheme [1.1], Step 2 of Scheme [2.1] defines a transition from z to z'. We shall show that this step not only leaves the distribution f(z | y) invariant, but also produces some optimality properties. Let Z be the space of z. For a given z, the set {t_α(z) : α ∈ A} ⊂ Z is called an orbit. Formally, we say that z and z' lie on the same orbit if and only if there exists a unique α ∈ A such that z' = t_α(z). Different orbits do not intersect and the space Z can be partitioned into the union of all the orbits, each of which has the same structure as A. To illustrate, consider the following examples. Example A: Let z = (z_1, z_2) ∈ R^2, and A = {α ∈ R^1 : t_α(z) = (z_1 + α, z_2 + α)}. Figure 1(a) shows a set of orbits for this translation group. Example B: Let z = (z_1, z_2) ∈ R^2, and A = {α > 0 : t_α(z) = (αz_1, αz_2)}. Figure 1(b) shows a set of orbits for this scale group.
Figure 1: The set of orbits in R^2 for (a) the translation group, and (b) the scale group.

Suppose this set of orbits can be represented by a smooth cross-section Q (Wijsman 1966), which is defined as a subset of Z that intersects with (almost) every orbit exactly once. In the foregoing Example A, a cross-section is Q = {(z_1, z_2) : z_1 = 0}. In Example B, a cross-section is Q = {(z_1, z_2) : z_1^2 + z_2^2 = 1}. When A is finite or a compact Lie group, the existence of a smooth Q has been established (see Wijsman 1966 for references). When A is not compact, Palais (1961) provides results on the existence of Q. With the cross-section Q, any z ∈ Z can be described by its orbit r ∈ Q and its position γ ∈ A on the orbit. Step 2 of Scheme [2.1] effectively generates a new position conditional on the orbit. To prove its properness, we need the following:
Condition (iii) The transformation group A is locally compact and has a unimodular Haar measure H(α)dα. There exists a smooth cross-section Q ⊂ Z, and the mapping Z: Z(γ, r) = t_γ(r) for A × Q → Z is one-to-one and continuously differentiable, i.e., a diffeomorphism.
In all the specific cases considered in this article, such as the translation and scale groups, the existence of a smooth Q and the diffeomorphism Z can be easily verified. Under the mapping Z, we have for all z ∈ Z a new parameterization: i.e., there exist (γ, r) such that z = Z(γ, r), where r indicates the orbit z lies on, and γ its position on r. A simple property of this diffeomorphism is that ∀ α ∈ A and z = Z(γ, r),

t_α(z) = t_α(Z(γ, r)) = Z(αγ, r).   (7)
Based on this formulation, we can describe Scheme [2.1] by the following diagram:
θ_old → (γ, r) → (γ', r) → θ_new,   (8)

where γ' = αγ.
Lemma 5.1 Let

I_r(γ) = ∂Z(γ, r)/H(dγ) = {∂Z(γ, r)/∂γ}/H(γ),

and K_γ(r) = ∂Z(γ, r)/∂r. Under Condition (iii), the diffeomorphism Z satisfies

[I_r(γ), K_γ(r)] = J_γ(Z(e, r)) [I_r(e), K_e(r)].

Proof: Because of the left invariance of H(dγ), we have I_r(αγ) = ∂Z(αγ, r)/H(dγ). By differentiating both sides of (7) with respect to γ we have

J_α(Z(γ, r)) I_r(γ) = I_r(αγ).

Differentiating both sides of (7) with respect to r gives rise to

J_α(Z(γ, r)) K_γ(r) = K_{αγ}(r).

The result then follows by letting γ = e. □
Theorem 5.1 Let π be an arbitrary probability measure on Z, and suppose Condition (iii) holds. For z ∈ Z, we let

g(z) = ∫ π(t_α(z)) |J_α(z)| H(dα).   (9)

Then g(z) < ∞. If we write z = Z(γ, r) and let α be drawn from

π(α | z) = π(t_α(z)) |J_α(z)| H(dα)/g(z),   (10)

then αγ follows the distribution π_r, the conditional distribution of the position given orbit r induced by π, and αγ is independent of γ conditional on r. If z ~ π, then z' = t_α(z) follows π.

Proof:
From Lemma 5.1, the joint distribution of (γ, r) induced by π is

π(γ, r) = π(Z(γ, r)) |J_γ(Z(e, r))| ‖I_r(e), K_e(r)‖ H(dγ) dr,

where ‖I_r(e), K_e(r)‖ is the absolute value of the determinant of [I_r(e), K_e(r)]. The marginal distribution of r is then ‖I_r(e), K_e(r)‖ h(r), where

h(r) = ∫ π(Z(γ, r)) |J_γ(Z(e, r))| H(dγ) < ∞.

The conditional distribution for the position is then

π_r(γ) = π(Z(γ, r)) |J_γ(Z(e, r))| H(dγ)/h(r).

Because J_{αγ}(z) = J_α(t_γ(z)) J_γ(z) for any α, γ ∈ A, and H(dα) is a Haar measure, we have

g(z) = ∫ π(Z(αγ, r)) {|J_{αγ}(Z(e, r))|/|J_γ(Z(e, r))|} H(dα) = h(r)/|J_γ(Z(e, r))| < ∞,

and

π(t_α(z)) |J_α(z)| H(dα)/g(z) = π(t_{αγ}(Z(e, r))) |J_{αγ}(Z(e, r))| H(d(αγ))/h(r).

Hence, αγ follows the conditional distribution π_r and is independent of γ. As in the regular Gibbs sampler, t_α(z) = Z(αγ, r) must follow π provided that z ~ π. □
Corollary 5.1 Step 2 of Scheme [2.1] (i.e., the complete-data posterior of α) is always proper, and it leaves f(z | y) invariant. Thus, Scheme [2.1], or equivalently Scheme [2], converges to the target distribution of interest.

Proof: Let π(z) = f(z | y) in Theorem 5.1. □
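The following is a small numerical sketch of the scale-group version of Step 2 of Scheme [2.1] (ours, not from the original), with a standard Gaussian target π on R^n; under t_α(z) = αz, J_α(z) = α^n and H(dα) = α^{-1} dα, formula (10) reduces to α^2 ‖z‖^2 ~ χ^2_n, so the move can be sampled exactly and the invariance of π can be checked by simulation.

import numpy as np

rng = np.random.default_rng(2)
n, n_rep = 5, 100_000

z = rng.normal(size=(n_rep, n))             # z ~ pi = N(0, I_n), the target
norm_z = np.linalg.norm(z, axis=1)

# Formula (10) for the scale group: pi(alpha | z) is proportional to
# alpha^{n-1} exp(-alpha^2 ||z||^2 / 2), i.e., alpha^2 ||z||^2 ~ chi^2_n.
alpha = np.sqrt(rng.chisquare(n, size=n_rep)) / norm_z
z_new = alpha[:, None] * z                  # z' = t_alpha(z)

# If Corollary 5.1 holds, z' is again N(0, I_n); check first two moments of one coordinate.
print(z_new[:, 0].mean(), z_new[:, 0].var())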
5.3 Some Remarks

Remark 1. A key assumption in proving Theorem 5.1 is the existence of a smooth cross-section and the diffeomorphism Z. In practice, we need not know what they are and often do not need to check on their existence: if the finiteness of (9) can be established by other means, we can prove the invariance without using the change-of-variable technique employed here. In particular, we can establish that for any measurable function h, E_π h(t_α(z)) = E_π h(z). A direct examination also reveals that (10) is invariant along the orbit of z; i.e., if z_1 = t_{γ_1}(z) for some γ_1 ∈ A and α_1 ~ π(α | z_1), then t_{α_1}(z_1) has the same distribution as t_α(z), with α ~ π(α | z). The only important formula from a practitioner's point of view is (10). Recently, Liu and Sabatti (1999) extend the invariance result of Theorem 5.1 to the case when H(dα) is only a left Haar measure.

Remark 2. Theorem 5.1 shows that the use of the Haar measure enables us to generate a γ' = αγ conditionally independent of γ given the orbit information. Mathematically, this means that the position variable is integrated out of the sampler. Hence, Scheme [2.1] actually induces a collapsed diagram compared with diagram (8):

θ_old → r → θ_new.   (11)
Remark 3. The invariance property explored in this theorem has a close relationship with work by Fraser (1961), Bondar (1972, 1976), and Wijsman (1966) on structural distributions, fiducial inference, and maximal invariance. In fact, Bondar (1972) proves a very similar theorem under the condition that J_α(z) is independent of z. Their results do not require that the Haar measure be unimodular. As mentioned in Remark 1, the invariance result of Theorem 5.1 also holds when H(dα) is not unimodular.

Remark 4. Consider the parametric family P = {p_α(z) = π(t_α(z)) |J_α(z)| : α ∈ A}, where π is a known base distribution. Of interest is the inference on α with an observation z. A natural pivotal quantity is t_α(z), whose distribution is π. When the space of z has the identical group structure as A, the fiducial distribution of α derived by Fraser (1961) is equivalent to (10). With the same setting, the maximal ancillary statistic is the orbit of z. Thus, the inference of α conditional on the maximal ancillary (Pitman estimator) is equivalent to (10).

Remark 5. With the foregoing parametric family setting, the non-informative nature of H(dα) can be seen in light of Theorem 5.1: if z is an observation from the base distribution π and α is drawn from the posterior (10), then the fact that t_α(z) follows π can be interpreted as saying that using H(dα) as a prior does not "disturb" the base distribution a posteriori. No other prior can achieve this.
5.4 Optimality of the Haar Measure in PX-DA

With Conditions (i) and (iii) (e.g., z = Z(γ, r)), we can rewrite the original model as

f(z, θ | y) dz dθ = g(γ, r, θ) dr dθ H(dγ),   (12)

where g denotes the density after transformation. The expanded model can be rewritten as

p(z, θ, α | y) dz dθ H(dα) = g(αγ, r, θ) p(α | θ) H(dα) H(dγ) dθ dr,   (13)

where p(α | θ) is the conditional prior density of α with respect to the Haar measure H(dα). As a consequence, the PX-DA algorithm induces the iteration between the conditional draws [α, γ, r | θ] and [θ, α | γ, r] under the joint distribution (13), which is mathematically equivalent to iterating between the conditional draws [γ, r | θ] and [θ | γ, r] (with α integrated out) in (13). In comparison, the DA algorithm induces the same iteration based on (12). Note that (12) and (13) have the same marginal distribution for (r, θ) conditional on y. The difference between the two algorithms is only in the conditional distribution of (γ, r) given θ. If we let p(α | θ) ∝ 1, Theorem 5.1 shows that the new position γ' = αγ is independent of γ given r. Thus, the PX-DA algorithm with a Haar measure prior on α effectively induces the iteration between drawing [r | θ] and drawing [θ | r], with both α and γ integrated out from (13). In this sense, parameter expansion is doing exactly efficient augmentation, where only the orbit of the missing data is augmented. Based on the collapsing theorem of Liu (1994), we have:
Theorem 5.2 Under Conditions (i) and (iii), Scheme [2] is as good as or better than any PX-DA algorithm with a proper prior p(α | θ) H(dα) (with respect to the Lebesgue measure) on the expansion parameter. Here H(dα) is the Haar measure.

Remark: This result holds for all nonnegative and nondegenerate functions p(α | θ) that, when combined with H(dα), give rise to a proper prior with respect to the Lebesgue measure. But the theorem does not directly apply if p(α | θ) H(dα) is improper. In the situation when p(α | θ) H(dα) can be expressed as the limit of a sequence of proper priors, the approach of Section 4.3 can be employed to show that the Haar measure is still optimal. It remains unclear, however, what a "cleaner" PX-DA algorithm would be in this case. The following result is a special case of the theorem, but we provide a different proof.
Corollary 5.2 Scheme [2], or equivalently Scheme [2.1], converges no slower than Scheme [1], or equivalently Scheme [1.1].
Proof: For any integrable function h(θ), we let s(z') = E_f{h(θ) | z'}. Then according to the proof of Theorem 4.2, we only need to compare E_f[var{s(z') | z}] under the two schemes. Let K_1(z' | z) be the transition induced by Step 2 of Scheme [1.1], and let K_2(z' | z) be that for Scheme [2.1], which is determined by (10). From Theorem 4.1, K_1 is invariant with respect to π. Since K_1 samples along the orbit of z, it must also leave the conditional distribution π_r invariant, i.e., ∫ K_1(z'' | z') K_2(z' | z) dz' = K_2(z'' | z). Consequently,

E_f[var_{K_2}{s(z') | z}] = E_f[var_{K_2}{s(z'') | z}]
  = E_f[ E_{K_2}[var_{K_1}{s(z'') | z'} | z] + var_{K_2}[E_{K_1}{s(z'') | z'} | z] ]
  ≥ E_f[var_{K_1}{s(z'') | z'}] = E_f[var_{K_1}{s(z') | z}]. □

Because reparameterization often corresponds to using a degenerate prior p(α | θ) = δ_{A(θ)} in the PX-DA, we have the following result:
Corollary 5.3 Under Conditions (i) and (iii), Scheme [2] is as good as or better than any reparameterization scheme that can be formulated as a PX-DA algorithm with a degenerate prior.

Although using the Haar invariant prior for α in the PX-DA is optimal under Conditions (i) and (iii), the resulting scheme may be difficult to implement. Moreover, sometimes A may not induce a data-transformation mechanism or possess a group structure. The use of a proper prior, p(α | θ), and the limiting argument in Section 4.3 is still valuable on these occasions.
6 A NUMERICAL EXAMPLE: PROBIT REGRESSION

Let y = (y_1, ..., y_n) be a set of iid binary observations from the probit model
Y_i | θ ~ Bernoulli{Φ(X_i'θ)},

where the X_i (p × 1) are the covariates, θ is the unknown regression coefficient, and Φ is the standard Gaussian cumulative distribution function. Of interest is the posterior distribution of θ, under, say, a flat prior. A popular way to ease computation is to introduce a "complete-data" model in which a set of latent variables, z_1, ..., z_n, is augmented so that

[z_i | θ] = N(X_i'θ, 1), and y_i = sgn(z_i),

where sgn(z) = 1 if z > 0, and sgn(z) = 0 otherwise. The standard DA algorithm iterates the following steps:

1. Draw from [z_i | y_i, θ]. That is, z_i ~ N(X_i'θ, 1) subject to z_i ≥ 0 if y_i = 1, and z_i ~ N(X_i'θ, 1) subject to z_i < 0 if y_i = 0.
2. Draw [θ | z_i] = N(θ̂, V), where θ̂ = (Σ_i X_i X_i')^{-1} Σ_i X_i z_i and V = (Σ_i X_i X_i')^{-1}. Both θ̂ and V can be computed using the SWEEP operator (e.g., Little and Rubin, 1987).

The complete-data model can be expanded by introducing an expansion parameter α for the residual variance, which is originally fixed at 1:

[w_i | θ, α] = N(αX_i'θ, α^2), and y_i = sgn(w_i).
It is clear that this model can be derived by using the group of scale transformations z = t_α(w) = (w_1/α, ..., w_n/α). The corresponding Haar measure is H(dα) = α^{-1} dα. Scheme [2] has the same first step as in the standard DA, but with slightly different later steps:

2. Draw α̂^2 ~ RSS/χ^2_n, where RSS = Σ_i (z_i − X_i'θ̂)^2, which is a by-product of the SWEEP operator, provided that we have computed Σ_i z_i^2.
3. Draw θ ~ N(θ̂/α̂, V).

If we put an inverse Gamma prior distribution on α^2, that is, α^2 ~ b_0/Gamma(a_0) where a_0 and b_0 are two positive constants, then Scheme [1] is identical to Scheme [2] except that the second step is changed to:

Draw α_0^2 ~ b_0/Gamma(a_0). Then draw α_1^2 ~ (b_0 + α_0^2 RSS/2)/Gamma(a_0 + n/2). Compute α̂^2 = α_1^2/α_0^2 = (Gamma(a_0) + RSS/2)/Gamma(a_0 + n/2).
As a_0 → 0, Scheme [1] converges to Scheme [2].

Figure 2: Autocorrelation functions for DA (solid lines) and PX-DA (dashed lines) with various values of β_1 (panels: β_1 = 1, 2, 4, 8).
We took n = 100 and X_i = (1, x_i)', with the x_i generated from N(0, 1). The y_i were generated from Bernoulli(Φ(β_0 + β_1 x_i)) with the true values β_0 = 0 and β_1 = 1, 2, 4, 8. We implemented both the DA algorithm and Scheme [2]. Figure 2 shows the autocorrelation functions of the draws of β_1 for both algorithms under different true values of β_1. It is clear that as the real value of β_1 gets larger, the improvement of the PX-DA algorithm over the DA algorithm becomes more significant. Or to put it another way, the PX-DA algorithm is not slowed down very much by the increased value of β_1, whereas the DA algorithm is. This phenomenon can be understood as follows. For the DA algorithm, var(θ | y, z) = (Σ_i X_i X_i')^{-1}; whereas for the PX-DA algorithm,

var(θ | y, z) = (Σ_i X_i X_i')^{-1} + E[θ̂θ̂' var(χ_n)/RSS] ≈ (Σ_i X_i X_i')^{-1} + θθ'/(2n),

which increases with θ. What we observed in Figure 2 can be understood from the fact that the sample autocorrelation is determined by 1 − E{var(θ | y, z) | y}/var(θ | y) (Liu et al. 1994). The comparison in the rates of convergence should reflect the comparison in real computing time since the implementation of the PX-DA algorithm needs only negligible computing overhead in comparison with the DA algorithm.
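Below is a runnable sketch of the DA and PX-DA (Scheme [2]) samplers for this probit example (added here for illustration; it assumes NumPy and SciPy, uses explicit matrix solves in place of the SWEEP operator, and picks arbitrary true values for β).

import numpy as np
from scipy.stats import norm, truncnorm

rng = np.random.default_rng(3)
n, p = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([0.0, 2.0])                      # hypothetical true coefficients
y = rng.binomial(1, norm.cdf(X @ beta_true))

XtX_inv = np.linalg.inv(X.T @ X)                      # V = (sum_i X_i X_i')^{-1}
chol_V = np.linalg.cholesky(XtX_inv)

def sampler(n_iter, px):
    # Probit sampler: ordinary DA if px is False, PX-DA (Scheme [2]) if px is True.
    theta = np.zeros(p)
    draws = np.empty((n_iter, p))
    for it in range(n_iter):
        m = X @ theta
        # Step 1: z_i ~ N(X_i' theta, 1) truncated to z_i >= 0 if y_i = 1, z_i < 0 if y_i = 0.
        lo = np.where(y == 1, -m, -np.inf)
        hi = np.where(y == 1, np.inf, -m)
        z = m + truncnorm.rvs(lo, hi, size=n, random_state=rng)
        theta_hat = XtX_inv @ (X.T @ z)
        if px:
            # Step 2 of Scheme [2]: alpha^2 ~ RSS/chi^2_n, then theta ~ N(theta_hat/alpha, V).
            rss = np.sum((z - X @ theta_hat) ** 2)
            alpha = np.sqrt(rss / rng.chisquare(n))
            theta_hat = theta_hat / alpha
        theta = theta_hat + chol_V @ rng.normal(size=p)
        draws[it] = theta
    return draws

for px in (False, True):
    b1 = sampler(5000, px)[:, 1]
    b1 = b1 - b1.mean()
    print("PX-DA" if px else "DA   ", "lag-1 acf of beta_1:", b1[1:] @ b1[:-1] / (b1 @ b1))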
7 DISCUSSION

Similarities and differences between this article and Meng and van Dyk (1999a) should be noted. On one hand, both articles target over-parameterization and reparameterization methods for speeding up the Gibbs sampler, with ideas originating from recent developments in the EM algorithm. Both intend to identify useful rules to guide the use of such methods and to study their theoretical properties. On the other hand, this article concentrates more on the PX-DA algorithm under general settings and on the identification of general conditions necessary for the PX-DA algorithm to be optimal among a class of similar ones, whereas Meng and van Dyk (1999a) focus more on interesting cases and statistical insights derived from them. Moreover, this article presents a different viewpoint on some issues concerning the relationship between reparameterization and over-parameterization, the use of the PX-DA versus the conditional PX-DA (called Scheme 2 in Meng and van Dyk (1999a)), and the justification for using an improper prior on the expansion parameter. Liu (1998) recently presented a covariance adjustment method where the "adjustment mapping" plays the role of the expanded parameter in our setting. The expansion parameter in the PX-DA algorithm plays a different role than the working parameter in Meng and van Dyk's (1997) efficient data augmentation because the expansion parameter is inferred along with the PX-DA iteration instead of being fixed at an optimal value (see also Green 1997). It is conceivable, however, that a working parameter can be introduced to index a transformation group to help find a good parameter expansion scheme (van Dyk 1998).

A subtle point relates the PX-DA algorithm to auxiliary variable techniques. In our case, of interest is the posterior distribution of θ, i.e., f(θ | y), whereas the posterior distribution of the augmented missing data can be arbitrarily distorted. The introduction of α actually makes the joint (posterior) distribution of (θ, w) differ from that of (θ, z). This special feature distinguishes the PX-DA algorithm from the usual auxiliary variable techniques in which the introduction of a new parameter does not change the previous joint distribution.

The implication of Theorems 4.1 and 5.1 goes beyond the PX-DA algorithm described in this article. In particular, both theorems can be viewed as a generalization of the Gibbs sampler, because every Gibbs sampling move can be viewed as a transformation acting on the current state. Some conditional sampling rules for choosing one among a set of possible transformations are given by the theorems, which are especially useful when one envisions certain favorable directions to move along in the space. Liu and Sabatti (1998, 1999) present general ways of using the theorems, together with Metropolis-Hastings steps, to accelerate an MCMC algorithm.

Although we have only shown one real example, we would like to assure the reader that the PX-DA algorithm can be implemented for all the models in Liu et al. (1998) in parallel to their PX-EM implementations. In particular, the optimality result for using the Haar prior applies to the multivariate t-model (Meng and van Dyk 1999a), probit regression models, and random effects models. An interesting situation arises from a random effects model example of Meng and van Dyk (1999b) where they want to draw from an affine transformation group, say, A = {(α_1, α_2) : (α_1, α_2)∘z = α_1 + α_2 z}. In this case, the group element (α_1, α_2) can be sampled either jointly, where the left Haar measure |α_2|^{-2} dα_1 dα_2 is desirable (Liu and Sabatti 1999), or "sequentially" (e.g., the translation component is sampled first and then the scale component), where one only needs the unimodular Haar measure |α_2|^{-1} dα_1 dα_2 for the scale component. When the implementation of Step 2 of the PX-DA algorithm is difficult, a Metropolis-Hastings step can be added (some caveats are discussed in Liu and Sabatti 1999). Alternatively, one can reduce computational complexity by restricting α to a finite subgroup or subset of A in which an exhaustive search can be conducted.

In summary, we have formally defined the PX-DA algorithm when a proper prior for the expansion parameter is used, and its generalized versions under limiting improper priors or Haar invariant priors. We have further identified some conditions under which the PX-DA algorithm can be guaranteed to outperform the ordinary DA algorithm. We have shown that using a Haar invariant prior for the expansion parameter is optimal among a class of such algorithms. In particular, we find Condition (i) for the correspondence between the expansion parameter and a data transformation mechanism constructive. The reader can read Liu et al. (1998) and Meng and van Dyk (1999) for other practical advice on the search for a good parameter expansion scheme.
REFERENCES

Bondar, J.V. (1972), "Structural Distributions Without Exact Transitivity," Annals of Mathematical Statistics, 43, 326-333.

--- (1976), "Borel Cross-Sections and Maximal Invariants," Annals of Statistics, 4, 866-877.

Dempster, A.P., Laird, N.M., and Rubin, D.B. (1977), "Maximum Likelihood Estimation From Incomplete Data Via the EM Algorithm (with Discussion)," Journal of the Royal Statistical Society, Ser. B, 39, 1-38.

Fraser, D.A.S. (1961), "The Fiducial Method and Invariance," Biometrika, 48, 261-280.

Gelfand, A.E., Sahu, S.K., and Carlin, B.P. (1995), "Efficient Parameterizations for Normal Linear Mixed Models," Biometrika, 82, 479-488.

Gelfand, A.E., and Smith, A.F.M. (1990), "Sampling-Based Approaches to Calculating Marginal Densities," Journal of the American Statistical Association, 85, 398-409.

Geyer, C.J. (1992), "Practical Markov Chain Monte Carlo (with Discussion)," Statistical Science, 7, 473-483.

Green, P.J. (1997), "Discussion on the Paper by Meng and van Dyk," Journal of the Royal Statistical Society, Ser. B, 59, 554-555.

Higdon, D.M. (1998), "Auxiliary Variable Methods for Markov Chain Monte Carlo With Applications," Journal of the American Statistical Association, 93, 585-595.

Liu, C.H. (1998), "Covariance Adjustment for Markov Chain Monte Carlo - A General Framework," Technical Report, Bell Laboratories, Lucent Technologies.

Liu, C.H., Rubin, D.B., and Wu, Y.N. (1998), "Parameter Expansion to Accelerate EM - the PX-EM Algorithm," Biometrika, 85, 755-770.

Liu, J.S. (1994), "The Collapsed Gibbs Sampler With Applications to a Gene Regulation Problem," Journal of the American Statistical Association, 89, 958-966.

Liu, J.S., and Sabatti, C. (1998), "Simulated Sintering: Markov Chain Monte Carlo With Spaces of Varying Dimensions (with Discussion)," in Bayesian Statistics 6, eds. J.M. Bernardo, J.O. Berger, A.P. Dawid, and A.F.M. Smith, New York: Oxford University Press, in press.

--- (1999), "Generalized Gibbs Sampler and Multigrid Monte Carlo for Bayesian Computation," Technical Report, Department of Statistics, Stanford University.

Liu, J.S., Wong, W.H., and Kong, A. (1994), "Covariance Structure of the Gibbs Sampler with Applications to Comparisons of Estimators and Augmentation Schemes," Biometrika, 81, 27-40.

--- (1995), "Covariance Structure and the Convergence Rate of the Gibbs Sampler with Various Scans," Journal of the Royal Statistical Society, Ser. B, 57, 157-169.

Meng, X.L., and van Dyk, D. (1997), "The EM Algorithm - an Old Folk Song Sung to the Fast Tune (with Discussion)," Journal of the Royal Statistical Society, Ser. B, 59, 511-567.

--- (1999a), "Seeking Efficient Data Augmentation Schemes via Conditional and Marginal Augmentation," Biometrika, in press.

--- (1999b), "The Art of Data Augmentation," Technical Report, Department of Statistics, University of Chicago.

Palais, R.S. (1961), "On the Existence of Slices for Actions of Non-compact Lie Groups," Annals of Mathematics, 73, 295-323.

Rao, M.M. (1987), Measure Theory and Integration, New York: John Wiley & Sons.

Rubin, D.B. (1978), "Bayesian Inference for Causal Effects: The Role of Randomization," Annals of Statistics, 6, 34-58.

Rubin, D.B. (1987), Multiple Imputation for Nonresponse in Surveys, New York: John Wiley & Sons.

Tanner, M.A., and Wong, W.H. (1987), "The Calculation of Posterior Distributions by Data Augmentation (with Discussion)," Journal of the American Statistical Association, 82, 805-811.

van Dyk, D. (1998), "Discussion on Liu and Sabatti's Paper," in Bayesian Statistics 6, eds. J.M. Bernardo, J.O. Berger, A.P. Dawid, and A.F.M. Smith, New York: Oxford University Press, in press.

Wijsman, R.A. (1966), "Existence of Local Cross-Sections in Linear Cartan G-spaces Under the Action of Noncompact Groups," Proceedings of the American Mathematical Society, 17, 295-301.