Adaptive Importance Sampling on Discrete Markov Chains

Craig Kollman, Keith Baggerly, Dennis Cox and Rick Picard

November 25, 1996
ABSTRACT: In modeling particle transport through a medium, the path of a particle behaves as a transient Markov chain. We are interested in characteristics of the particle's movement, conditional on its starting state, which take the form of a "score" accumulated with each transition. Importance sampling is an essential variance reduction technique in this setting, and we provide an adaptive (iteratively updated) importance sampling algorithm that is proven to converge exponentially to the zero-variance solution. Examples illustrating this phenomenon are provided.

Running Title: Adaptive Importance Sampling

AMS 1991 subject classifications: 65C05.

KEY WORDS: Adaptive procedures, exponential convergence, Monte Carlo, particle transport, zero variance solution.

Craig Kollman is a Biostatistician at the National Marrow Donor Program, Minneapolis, MN; Keith Baggerly and Dennis Cox are Assistant Professor and Professor, respectively, in the Statistics Department of Rice University, Houston, TX; and Rick Picard is a Technical Staff Member of the Statistics Group of Los Alamos National Laboratory, Los Alamos, NM. Support for this research was provided by Los Alamos National Laboratory as part of the Laboratory Directed Research and Development Program's ongoing efforts to improve Monte Carlo methods: for the first author while a graduate student at Stanford, for the second while a staff member at LANL, for the third while a visitor at Los Alamos, and for the fourth throughout.
1 Introduction and Motivation

The motivation for this work involves modeling the behavior of particles (primarily neutrons and photons) as they move within a medium. Related applications include such wide-ranging areas as well logging for oil exploration, heat generation calculations for nuclear reactors, radiation dosage calculations for exposures to medical X-rays, development of radiation detection devices, radiation shielding, criticality safety for storage of nuclear materials, and design of nuclear weapons. Because it is vastly more expensive to perform the related physical experiments and use complex instrumentation to obtain necessary measurements than to pursue computer simulation for these applications, it is common to use simulation software such as the Monte Carlo radiation transport code MCNP (Briesmeister 1993).

To motivate our work, we give a brief and necessarily superficial description of the particle physics applications; further information can be found in the texts of Carter and Cashwell (1975), Kalos and Whitlock (1986), and Lux and Koblinger (1991). Once a particle is emitted from a source, its subsequent behavior is inherently stochastic. Particles move in random directions, collide with atoms in their paths, interact in various ways upon collision, and produce assorted chain reactions. The movement of a particle through a (possibly complex) medium can be sequentially simulated through a series of a) path lengths, i.e., the distance until the next collision, b) types of interactions upon collision with atoms in the material, e.g., the particle may be absorbed or may continue on with possibly altered energy and direction, c) energy loss, if any, and d) scattering angles, determining direction until the next collision.

In terms of a random walk characterizing a particle's movement, the state space is typically seven-dimensional: three dimensions to denote the particle's current location, two dimensions to denote the particle's current direction, one dimension for energy, and one dimension for time. In nearly all problems of practical interest, the particle's behavior is Markovian, in that the probability of transition from one state to the next depends only on the current state. Natural state spaces are bounded, and the Markov chain can be treated as transient in that either the particle is absorbed during a collision or it leaves the region
of interest and is presumed never to return (we ignore backscattering).

From a particle's simulated history, a "score" is obtained. In practice, the score denotes some physical quantity of interest, and the objective of the simulation is to estimate the expected score. Some simple examples related to the above applications include estimating: a) the proportion of particles emitted from a particular source that arrive at a specified destination, b) the energy released within a specified location (i.e., the summed differences between the before-and-after energies of all collisions within the location), c) the proportion of particles entering a specified location which are in a certain energy range, and d) the proportion of particles emitted from a particular source that are prevented ("shielded") from passing through a specified volume.

Obtaining results from so-called "analog" simulations that use nature's transition probabilities can be extremely time consuming. For many problems involving complex material geometries, simulating a single particle history is nontrivial, and only a small fraction of those histories may contribute nonzero scores. Such situations are ideally suited for importance sampling (often called "biasing the random walk"). Several other variance reduction techniques have been proposed in the literature and successfully implemented in current transport codes; see Hammersley and Handscomb (1964), Lux and Koblinger (1991) and Briesmeister (1993) for details. These methods, while vastly more efficient than analog Monte Carlo, still only achieve a "constant" speedup in that the variance of such procedures still decreases as $O(n^{-1})$, where $n$ is the number of histories simulated. This rate is built into the procedures through the assumption of independent samples: the outcome of one realization does not affect the outcome of another.

To improve on this rate, then, we must use intelligently chosen dependent samples. This "intelligent choice" corresponds to allowing the process to learn and adapt at various stages, a notion which has been termed sequential or adaptive Monte Carlo. Early work in the field (on a two-stage arrangement only) includes that of Marshall (1956), which was soon followed by the more general and extensive work of Halton (1962). More recently, work by Booth (1985, 1986, 1988, 1989) on guiding various Monte Carlo methods towards zero-variance solutions highlighted adaptive methods once again. Booth found cases where, using adaptive Monte Carlo methods, empirical convergence to the true solution was not merely a power of $n$, but
rather $O(e^{-\lambda n})$ for some positive constant $\lambda$. We refer to this as "exponential convergence". Kollman (1993) later proved that such convergence was possible.

The purpose of this work is to prove that, under certain conditions (which extend those in Kollman (1993)), adaptive importance sampling for discrete Markov chains with scoring will achieve convergence that is at least exponential. We deal with the case where the target vector $\mu$ can be given as a linear model $X\beta$, and treat completely arbitrary $\mu$ as a special case. In Section 2, we set out the basic mathematical model. We also outline a generalization of importance sampling to Markov chains, establish the existence of a zero-variance chain, and introduce an adaptive importance sampling algorithm. In Section 3, we prove that this algorithm achieves exponential convergence. In Section 4, we examine various examples showing the empirical behavior of the algorithm in simulations. Finally, in Section 5, we discuss practical constraints and directions for future research.
2 Description of the Method.

Consider a Markov chain $\langle X_n \rangle_{n=0}^{\infty}$ with state space $\{1, \ldots, d, \Delta\}$, where $\Delta$ is the cemetery state. Denote the transition matrix for the nonabsorbing states $\{1, \ldots, d\}$ by $P$. We assume that eventual death is certain, so

$$\lim_{n \to \infty} P^n = 0. \qquad (1)$$

When the particle changes from state $i$ to state $j$, a "score" $s_{ij} \ge 0$ is incurred. We assume $s_{\Delta\Delta} = 0$, so no score is accumulated after death. Denoting the death time by $\tau$, the total score of a particle is

$$Y_P = \sum_{n=1}^{\tau} s_{X_{n-1}, X_n}.$$

We are interested in estimating

$$\mu_i \equiv E[\, Y_P \mid X_0 = i \,],$$

the expected total score for a particle starting in state $i$. Such quantities are of great interest in practical applications, as indicated in Section 1. We will assume that $\mu = (\mu_1, \ldots, \mu_d)$ is given by a linear model

$$\mu = X\beta, \qquad (2)$$

where $X$ is a known $d \times p$ matrix and $\beta$ is an unknown $p$-vector. This of course includes the case where $\mu$ ranges over all of $\mathbb{R}^d$: take $\beta = \mu$ and $X$ the $d \times d$ identity matrix. This case, which is simultaneously the least presumptive and the most computationally intensive, was addressed by Kollman (1993).
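To make the setup concrete, the following is a minimal sketch (our own illustration, not code from the paper) of an "analog" simulation that draws one history under the true kernel and returns its total score $Y_P$. The representation of $P$ as a sub-stochastic NumPy matrix, with the cemetery mass taken as each row's deficit, is an assumption of this sketch, as are all names.

```python
import numpy as np

def simulate_analog_score(P, s, s_death, i0, rng):
    """One history under the true kernel P; returns the total score Y_P.

    P       : (d, d) sub-stochastic matrix over the live states; each row's
              deficit 1 - P[i].sum() is the probability of absorption
    s       : (d, d) transition scores s_ij between live states
    s_death : (d,)   scores for the absorbing transition from each state
    i0      : starting state (0-based)
    """
    d = P.shape[0]
    i, total = i0, 0.0
    while True:
        probs = np.append(P[i], 1.0 - P[i].sum())  # last slot = cemetery
        j = rng.choice(d + 1, p=probs)
        if j == d:                                 # absorbed: score and stop
            return total + s_death[i]
        total += s[i, j]
        i = j

# mu_i is then estimated by a plain average over independent histories, e.g.
# rng = np.random.default_rng(0)
# mu_1 = np.mean([simulate_analog_score(P, s, s_death, 0, rng)
#                 for _ in range(10_000)])
```

Averaging such scores converges at the $O(n^{-1})$ variance rate discussed in Section 1; the remainder of the paper concerns beating that rate.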
2.1 Importance Sampling on Markov Chains

Instead of simulating the particle under the true transition probabilities $\{p_{ij}\}$, we can simulate with other probabilities $\{q_{ij}\}$. To keep the expected scores unbiased, each score is weighted by the likelihood ratio between $P$ and $Q$. For all times $n$ up to death, this ratio is given by

$$L_n = \prod_{i=1}^{n} \frac{p_{X_{i-1}, X_i}}{q_{X_{i-1}, X_i}}.$$

For this to be well-defined, we require that $Q$ dominates $P$, denoted $Q \gg P$, which simply means that if $p_{ij} > 0$ then $q_{ij} > 0$. Note that $Q \gg P$ guarantees that the chain is transient under $Q$ (i.e., $Q^n \to 0$) because of the finite state space. When sampling from $Q$, the total (weighted) score is

$$Y_Q = \sum_{n=1}^{\tau} s_{X_{n-1}, X_n} L_n.$$
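In code, the only change from the analog sketch above is that transitions are drawn from $Q$ while a running likelihood ratio $L_n$ multiplies each score. This is again our own hedged illustration, reusing the array conventions (and the `numpy` import) assumed in the earlier snippet.

```python
def simulate_weighted_score(P, Q, s, s_death, i0, rng):
    """One history under Q; returns Y_Q = sum_n s_{X_{n-1},X_n} L_n."""
    d = P.shape[0]
    i, L, total = i0, 1.0, 0.0
    while True:
        p_row = np.append(P[i], 1.0 - P[i].sum())
        q_row = np.append(Q[i], 1.0 - Q[i].sum())
        j = rng.choice(d + 1, p=q_row)
        L *= p_row[j] / q_row[j]       # L_n: ratio up to and including this step
        if j == d:                     # the death transition is weighted too
            return total + s_death[i] * L
        total += s[i, j] * L
        i = j
```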
It is easily shown that the expectation of $Y_Q$ given $X_0 = i$ is $\mu_i$. Now, let $v_i = \mathrm{Var}(Y_Q \mid X_0 = i)$. By the Markov property,

$$E(Y_Q \mid X_0 = i, X_1 = j) = \frac{p_{ij}}{q_{ij}} (s_{ij} + \mu_j) \quad \text{and} \quad \mathrm{Var}(Y_Q \mid X_1 = j) = \left( \frac{p_{ij}}{q_{ij}} \right)^2 v_j.$$

The variance decomposition $\mathrm{Var}(Y_Q) = \mathrm{Var}[E(Y_Q \mid X_0, X_1)] + E[\mathrm{Var}(Y_Q \mid X_0, X_1)]$ yields

$$v_i = \left[ \frac{p_{i\Delta}^2 s_{i\Delta}^2}{q_{i\Delta}} + \sum_{j=1}^{d} q_{ij} \left( \frac{p_{ij}}{q_{ij}} (s_{ij} + \mu_j) \right)^2 - \mu_i^2 \right] + \left[ \sum_{j=1}^{d} q_{ij} \left( \frac{p_{ij}}{q_{ij}} \right)^2 v_j \right].$$

(Note that $\mathrm{Var}(Y_Q \mid X_0 = i, X_1 = \Delta) = 0$.) This condenses to

$$v_i = \left( \frac{p_{i\Delta}^2 s_{i\Delta}^2}{q_{i\Delta}} + \sum_j \frac{p_{ij}^2 (s_{ij} + \mu_j)^2}{q_{ij}} - \mu_i^2 \right) + \sum_j \frac{p_{ij}^2}{q_{ij}} v_j$$

(here we take $0/0 = 0$). If we let

$$f_i = \frac{p_{i\Delta}^2 s_{i\Delta}^2}{q_{i\Delta}} + \sum_j \frac{p_{ij}^2 (s_{ij} + \mu_j)^2}{q_{ij}} - \mu_i^2$$

and define $R$ to be the matrix whose $(i,j)$th element is given by

$$r_{ij} = \begin{cases} p_{ij}^2 / q_{ij} & \text{if } q_{ij} > 0, \\ 0 & \text{if } q_{ij} = 0, \end{cases}$$

then in matrix form the variance equation becomes

$$v = f + Rv. \qquad (3)$$

Note that when

$$q_{ij}(\mu) = \frac{p_{ij} (s_{ij} + \mu_j)}{p_{i\Delta} s_{i\Delta} + \sum_{k=1}^{d} p_{ik} (s_{ik} + \mu_k)}$$

and

$$q_{i\Delta}(\mu) = \frac{p_{i\Delta} s_{i\Delta}}{p_{i\Delta} s_{i\Delta} + \sum_{k=1}^{d} p_{ik} (s_{ik} + \mu_k)}$$

we get

$$f_i = \left( \frac{p_{i\Delta}^2 s_{i\Delta}^2}{p_{i\Delta} s_{i\Delta}} + \sum_{j=1}^{d} \frac{p_{ij}^2 (s_{ij} + \mu_j)^2}{p_{ij} (s_{ij} + \mu_j)} \right) \left( p_{i\Delta} s_{i\Delta} + \sum_{k=1}^{d} p_{ik} (s_{ik} + \mu_k) \right) - \mu_i^2,$$

which reduces to

$$\left( p_{i\Delta} s_{i\Delta} + \sum_{j=1}^{d} p_{ij} (s_{ij} + \mu_j) \right) \left( p_{i\Delta} s_{i\Delta} + \sum_{k=1}^{d} p_{ik} (s_{ik} + \mu_k) \right) - \mu_i^2 = 0,$$

since $\mu_i = p_{i\Delta} s_{i\Delta} + \sum_k p_{ik} (s_{ik} + \mu_k)$ by conditioning on the first transition.
This choice of $Q$ gives $f = 0$ and thus presumably $v = 0$, and we have a zero variance importance sampling scheme. In fact, one can show by induction on the value of $\tau$ that this $Q$ gives $Y_Q = \mu_i$ whenever $X_0 = i$. As is typically the case for zero variance solutions, this choice of $Q$ depends on the unknown vector $\mu$. However, it suggests an adaptive procedure wherein one estimates $\mu$, substitutes the estimate into $q_{ij}(\mu)$, simulates with the new importance sampling chain, and iterates.
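The zero-variance construction translates directly into code: each live transition is reweighted in proportion to $p_{ij}(s_{ij} + \mu_j)$, its expected score-to-go, and the death transition in proportion to $p_{i\Delta} s_{i\Delta}$. Below is a hedged sketch in the conventions of the earlier snippets; in practice $\mu$ is unknown and an estimate is substituted, which is exactly the adaptive procedure of Section 2.2.

```python
def zero_variance_kernel(P, s, s_death, mu):
    """Return the live part of Q(mu); each row's deficit is the death probability.

    q_ij is proportional to p_ij * (s_ij + mu_j), and the cemetery transition
    to p_{i,Delta} * s_{i,Delta}; the common normalizer equals mu_i when mu
    is exact, by the first-transition identity above.
    """
    p_death = 1.0 - P.sum(axis=1)
    W = P * (s + mu[None, :])                 # unnormalized live weights
    norm = W.sum(axis=1) + p_death * s_death  # row normalizer
    return W / norm[:, None]
```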
2.2 The Adaptive Algorithm

For $\tilde{\mu} \in \mathbb{R}^d$ with $\tilde{\mu}_j > 0$ for $1 \le j \le d$, define

$$q_{ij}(\tilde{\mu}) = \frac{p_{ij} (s_{ij} + \tilde{\mu}_j)}{p_{i\Delta} s_{i\Delta} + \sum_{k=1}^{d} p_{ik} (s_{ik} + \tilde{\mu}_k)},$$

where of course $\tilde{\mu}_\Delta = 0$ and $q_{i\Delta} = 1 - \sum_{j=1}^{d} q_{ij}$. This gives a transition matrix $Q(\tilde{\mu})$ with $Q(\tilde{\mu}) \gg P$. Similarly, define $f(\tilde{\mu})$ to be the $f$ above for $Q(\tilde{\mu})$; i.e., the $i$th component of $f(\tilde{\mu})$ is given by

$$f_i(\tilde{\mu}) = \left( \frac{p_{i\Delta}^2 s_{i\Delta}^2}{p_{i\Delta} s_{i\Delta}} + \sum_{j=1}^{d} \frac{p_{ij}^2 (s_{ij} + \tilde{\mu}_j)^2}{p_{ij} (s_{ij} + \tilde{\mu}_j)} \right) \left( p_{i\Delta} s_{i\Delta} + \sum_{k=1}^{d} p_{ik} (s_{ik} + \tilde{\mu}_k) \right) - \tilde{\mu}_i^2.$$
In a similar way, define $v(\tilde{\mu})$ and $R(\tilde{\mu})$. The idea of the algorithm is to simulate under $Q(\hat{\mu})$ where the vector $\hat{\mu}$ is an estimate of $\mu$ from previous simulations. We must be careful to ensure that $Q(\hat{\mu}) \gg P$. Note that if there exists a $\delta > 0$ such that $s_{i\Delta} \ge \delta$ for all $i$, then $\mu_i \ge \delta$ for all $i$: the particle is assured of scoring at least $\delta$ since it will eventually die. Since we know that $\mu_i \ge \delta$, we can take the maximum of $\delta$ and our estimate without increasing the error. This ensures that $\hat{\mu}_i \ge \delta$ and hence $Q(\hat{\mu}) \gg P$. If such a value for $\delta$ is unknown, we can easily alter the problem by adding a $\delta > 0$ to each $s_{i\Delta}$. Since every particle dies exactly once, this just adds $\delta$ to each $\mu_i$. We can subtract $\delta$ from each $\hat{\mu}_i$ at the end of the iterations to return to the original problem. Thus, without loss of generality, there exists a known $\delta > 0$ such that $\mu_i \ge \delta$ for all $i$.

The outcomes of runs started in various states are used to estimate $\mu$ by estimating $\beta$. More specifically, let $D$ denote a design with (say) $m$ runs, i.e. $D = (i_1, \ldots, i_m)$ where each $i_j$ is in the state space ($i_j \ne \Delta$). Given an importance sampling chain $Q$, we will obtain the data $y = (Y_{1Q}, \ldots, Y_{mQ})$, where $Y_{jQ}$ is the total weighted score resulting from a simulation of the chain with initial value $X_0 = i_j$ in the design $D$. Of course, the components of $y$ are mutually independent (conditional on the current estimate of $\mu$). Based on these data, we will use the (ordinary) least squares estimator of $\beta$ given by

$$\hat{\beta}_D = (X_D^T X_D)^{-1} X_D^T y,$$

where $X_D$ is the $m \times p$ matrix whose $j$th row is the $i_j$th row of $X$. The corresponding estimator of $\mu$ is $\hat{\mu}_D = X \hat{\beta}_D$.

We now describe the design that will be used in the algorithm. Let $D_1 = (i_1, \ldots, i_\ell)$ be a given design such that $X_{D_1}$ has full rank $p$, and let $D_r$ be the design obtained by replicating $D_1$ $r$ times. A lower bound on $r$ is given in the proof of the main theorem. The algorithm is defined as follows:
I. Obtain an initial estimate $\hat{\mu}^{(0)}$, e.g., one based on previous experience with similar problems, or perhaps based on a simulation using the analog transition probabilities $p_{ij}$. (As with many iterative procedures, convergence can be greatly accelerated through use of a good starting value.)

II. Given that after $m$ iterations the algorithm has produced an estimate $\hat{\mu}^{(m)}$, iterate the following steps to convergence:

  1. Using design $D_r$, run independent replications of the chain with $Q(\hat{\mu}^{(m)})$ as the transition matrix.

  2. For a given replication in 1, let $\tau$ be the death time and

  $$Y = \sum_{n=1}^{\tau} s_{X_{n-1}, X_n} L_n,$$

  where the $\{X_n\}$ and $\{L_n\}$ are obtained from the given simulation run.

  3. Using the linear model $\mu = X\beta$, use the $(i, Y)$ pairs ($i$ is the starting value) to find an estimate $\hat{\beta}^{(m+1)}$ by ordinary least squares.

  4. Define $\hat{\mu}^{(m+1)}$ by

  $$\hat{\mu}_i^{(m+1)} = \max([X \hat{\beta}^{(m+1)}]_i, \delta), \quad i = 1, \ldots, d.$$

Note that this last step of the algorithm ensures that $\hat{\mu}_i^{(m+1)} \ge \delta$, $1 \le i \le d$. Moreover, the computational burden can be reduced for very large problems (where only a fraction of the states are visited in each iteration) by computing $q_{ij}(\hat{\mu}^{(m)})$ only when needed, i.e., for visited states only.

We mention two modifications of the algorithm. The first involves utilizing what we call "path information." When a state $i$ is visited, we may of course think of it as an initial state and begin accumulating a total weighted score. The computations needed to do this are basically the same as those associated with the actual (first) initial state of the history of that path, plus a little bookkeeping. However, there will now be dependence between the total weighted scores from any paths that share a common history. This was considered in the case where $X$ is the identity in Kollman (1993), when only the first visit to $i$ was allowed in the estimation (so that for a given initial state, all histories are independent), and it was shown that in this setting it improves convergence. We have found in simulations with nonidentity $X$ matrices that the use of path information can lead to nonconvergence, so there are still some issues to resolve before we can recommend it in general.

The other modification is the use of previous estimates, which we refer to as "incorporating previous information." Thus, we could consider modifying step 4 of the algorithm
to

$$\hat{\mu}_i^{(m+1)} = \alpha \max\{[X \hat{\beta}^{(m+1)}]_i, \delta\} + (1 - \alpha) \hat{\mu}_i^{(m)}, \quad i = 1, \ldots, d, \qquad (4)$$

where $\alpha \in (0, 1]$ can be chosen. We shall show in the Proposition in the next section that one obtains exponential convergence with this algorithm as well.
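Putting the pieces together, here is a hedged sketch of the full adaptive loop (Steps I and II, with the damped update (4); setting `alpha=1.0` recovers the original Step 4). It assumes the helper functions sketched above; the default design, replication count, starting value and `delta` are our own illustrative choices.

```python
def adaptive_is(P, s, s_death, X, design, r=6, n_iter=20, alpha=1.0,
                delta=0.5, rng=None):
    """Adaptive importance sampling: re-estimate mu = X beta, re-bias, repeat."""
    rng = rng or np.random.default_rng(0)
    d = P.shape[0]
    mu_hat = np.full(d, delta)                      # Step I: crude positive start
    for _ in range(n_iter):                         # Step II
        Q = zero_variance_kernel(P, s, s_death, mu_hat)
        rows, ys = [], []
        for i0 in design:                           # Step 1: r replications of D_1
            for _ in range(r):
                ys.append(simulate_weighted_score(P, Q, s, s_death, i0, rng))
                rows.append(X[i0])                  # Step 2: collect (i, Y) pairs
        beta_hat, *_ = np.linalg.lstsq(np.asarray(rows), np.asarray(ys),
                                       rcond=None)  # Step 3: ordinary least squares
        # Step 4, modified form (4): damp toward the previous estimate
        mu_hat = alpha * np.maximum(X @ beta_hat, delta) + (1 - alpha) * mu_hat
    return mu_hat
```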
3 Proof of Exponential Convergence.

We have made three assumptions about the problem:

Assumption 1. $\lim_{n \to \infty} P^n = 0$.

Assumption 2. There exists a $\delta > 0$ such that $\mu_i \ge \delta$ for every $i$.

Assumption 3. The mean total score $\mu$ is given by the linear model in (2), where $X$ has full column rank $p$.

Let $D_1$ be as given in the previous section. Note that there exists a constant $M$ such that

$$E[\| X \hat{\beta}_{D_1} - \mu \|^2] \le M \max_i \mathrm{Var}(Y_Q \mid X_0 = i), \qquad (5)$$

for all importance sampling transition matrices $Q$, where $\hat{\beta}_{D_1}$ is the least squares estimator of $\beta$ based on data with design $D_1$. For the $r$-fold replicated design $D_r$, the $Y_Q$ in the previous display can be replaced by the average of the $r$ independent replicates of $Y_Q$ at the same design point, and we obtain

$$E[\| X \hat{\beta}_{D_r} - \mu \|^2] \le \frac{M}{r} \max_i \mathrm{Var}(Y_Q \mid X_0 = i). \qquad (6)$$

It should be understood when we write $\hat{\mu}^{(m)}$ below that this estimate is derived from $Y_{Q(\hat{\mu}^{(m-1)})}$ values utilizing the design $D_r$, i.e., it is $\hat{\mu}_{D_r}$ for the current iteration.
Theorem. Under the assumptions stated above, given a design $D_1$ there exist deterministic constants $\gamma > 0$ and $R$ such that if the adaptive algorithm is run with design $D_r$ where $r \ge R$, then with probability one $e^{\gamma m} \| \hat{\mu}^{(m)} - \mu \| \to 0$ as $m \to \infty$.

Before presenting the proof we need to derive some preliminary results.

Lemma 1. There exist a matrix $A$ and an $\epsilon > 0$ such that

$$\mathbf{1}^T v(\tilde{\mu}) \le (\tilde{\mu} - \mu)^T A (\tilde{\mu} - \mu)$$

whenever $\| \tilde{\mu} - \mu \| < \epsilon$.
Proof: We establish that $v(\tilde{\mu})$ is twice continuously differentiable in a neighborhood of $\mu$. Since $v(\mu) = 0$ is a global minimum of $v$, the result then follows by a simple application of Taylor's formula. When all the entries of $\tilde{\mu}$ are positive, $Q(\tilde{\mu}) \gg P$ and both $r_{ij}(\tilde{\mu})$ and $f_i(\tilde{\mu})$ are infinitely differentiable in $\tilde{\mu}$. Thus, in view of (3) we need only show that $(I - R(\tilde{\mu}))^{-1}$ exists and is infinitely differentiable in a neighborhood of $\mu$ to complete the proof.

By Assumption 2 we know that $\mu \ge \delta \mathbf{1}$, which implies that $Q(\mu) \gg P$. If $p_{ij} > 0$ then $q_{ij}(\mu) > 0$ and

$$r_{ij}(\mu) = \frac{p_{ij}^2}{q_{ij}(\mu)} = \frac{p_{ij}^2 \left[ p_{i\Delta} s_{i\Delta} + \sum_{l=1}^{d} p_{il} (s_{il} + \mu_l) \right]}{p_{ij} (s_{ij} + \mu_j)} = \frac{p_{ij} \mu_i}{s_{ij} + \mu_j} \le \frac{p_{ij} \mu_i}{\mu_j}.$$

Let $r_{ij}^{(n)}(\mu)$ and $p_{ij}^{(n)}$ denote the $(i,j)$th elements of $R^n(\mu)$ and $P^n$, respectively. An induction argument shows that

$$r_{ij}^{(n)}(\mu) \le \frac{p_{ij}^{(n)} \mu_i}{\mu_j}; \qquad (7)$$

the case $n = 1$ is the display above, and (7) then holds for all $n$ by induction. Since the state space is finite, Assumption 1 implies that

$$\sum_{n=0}^{\infty} P^n < \infty,$$

which implies that

$$\sum_{n=0}^{\infty} R^n(\mu) < \infty.$$

Thus,

$$(I - R(\mu))^{-1} = \sum_{n=0}^{\infty} R^n(\mu).$$

One can easily show the requisite differentiability of $(I - R(\tilde{\mu}))^{-1}$ from this. $\Box$
Lemma 2. There exist a constant $c \in (0,1)$, an $R_1 > 0$, an $\epsilon > 0$ and a $\rho > 0$ such that if the learning algorithm is run with $\| \hat{\mu}^{(0)} - \mu \| < \epsilon$ and a design consisting of $r \ge R_1$ replications of $D_1$, then

$$\Pr\left\{ \| \hat{\mu}^{(m)} - \mu \| \le c^m \| \hat{\mu}^{(0)} - \mu \|,\ \forall m \;\middle|\; \hat{\mu}^{(0)} \right\} I[\| \hat{\mu}^{(0)} - \mu \| < \epsilon] \ge \rho\, I[\| \hat{\mu}^{(0)} - \mu \| < \epsilon].$$
Proof: By Assumption 2 and steps 3 and 4 of the algorithm, we have

$$E\left[ |\hat{\mu}_i^{(m+1)} - \mu_i|^2 \,\middle|\, \hat{\mu}^{(m)} \right] \le E\left[ |x_i^T \hat{\beta}^{(m+1)} - \mu_i|^2 \,\middle|\, \hat{\mu}^{(m)} \right] \le \frac{M}{r} v_i(\hat{\mu}^{(m)}),$$

by (6), where $M$ is given in (5). Let $\epsilon$ and $A$ be as in Lemma 1. Then

$$E\left[ \| \hat{\mu}^{(m+1)} - \mu \|^2 I[\| \hat{\mu}^{(m)} - \mu \| < \epsilon] \,\middle|\, \hat{\mu}^{(m)} \right] = \sum_{i=1}^{d} E\left[ |\hat{\mu}_i^{(m+1)} - \mu_i|^2 \,\middle|\, \hat{\mu}^{(m)} \right] I[\| \hat{\mu}^{(m)} - \mu \| < \epsilon] \le b \| \hat{\mu}^{(m)} - \mu \|^2, \qquad (8)$$

where

$$b \equiv d M r^{-1} \sup_{\|\xi\| = 1} \xi^T A \xi.$$

Choose $R_1 > 0$ so that $r \ge R_1$ implies $b < 1$. Let $T = \inf\{ m \ge 0 : \| \hat{\mu}^{(m)} - \mu \| \ge \epsilon \}$ and define $\eta^{(m)}$ as

$$\eta^{(m)} = \begin{cases} \hat{\mu}^{(m)} - \mu, & m \le T, \\ 0, & m > T. \end{cases}$$

Note that if $\hat{\mu}^{(m)} - \mu = 0$ for some $m$ then $\hat{\mu}^{(m+k)} - \mu = 0$ for all $k > 0$ by the zero variance property. One can then check that

$$\eta^{(m+1)} = (\hat{\mu}^{(m+1)} - \mu)\, I[0 < \| \eta^{(m)} \| < \epsilon].$$

Thus

$$E\left[ \| \eta^{(m+1)} \|^2 \,\middle|\, \langle \hat{\mu}^{(n)} \rangle_{n=0}^{m} \right] = E\left[ \| \hat{\mu}^{(m+1)} - \mu \|^2 I[0 < \| \eta^{(m)} \| < \epsilon] \,\middle|\, \langle \hat{\mu}^{(n)} \rangle_{n=0}^{m} \right]$$
$$= E\left[ \| \hat{\mu}^{(m+1)} - \mu \|^2 I[\| \hat{\mu}^{(m)} - \mu \| < \epsilon] \,\middle|\, \langle \hat{\mu}^{(n)} \rangle_{n=0}^{m} \right] I[0 < \| \eta^{(m)} \| < \epsilon] \qquad (9)$$
$$\le b \| \hat{\mu}^{(m)} - \mu \|^2 I[0 < \| \eta^{(m)} \| < \epsilon] \qquad (10)$$
$$= b \| \eta^{(m)} \|^2. \qquad (11)$$

In the above, (9) follows because $\{0 < \| \eta^{(m)} \| < \epsilon\} \subseteq \{\| \hat{\mu}^{(m)} - \mu \| < \epsilon\}$, (10) by (8), and (11) by definition of $\eta^{(m)}$. By induction, this yields

$$E\left[ \| \eta^{(m)} \|^2 \,\middle|\, \langle \hat{\mu}^{(n)} \rangle_{n=0}^{m-1} \right] \le b^m E\left[ \| \eta^{(0)} \|^2 \,\middle|\, \hat{\mu}^{(0)} \right], \qquad (12)$$

for all $m \ge 1$. Now, choose a value $c$ such that $b < c^2 < 1$, and define events

$$F_m = \begin{cases} \{ \| \eta^{(0)} \| < \epsilon \} & \text{if } m = 0, \\ \{ \| \eta^{(m)} \| \le c^m \| \eta^{(0)} \| \} & \text{if } m > 0. \end{cases}$$

Using Markov's inequality and (12), for $m \ge 1$,

$$\Pr\left( F_m^c \cap \bigcap_{j=0}^{m-1} F_j \,\middle|\, \hat{\mu}^{(0)} \right) = \Pr\left( \left\{ \| \eta^{(m)} \|^2 > c^{2m} \| \eta^{(0)} \|^2 \right\} \cap \bigcap_{j=0}^{m-1} F_j \,\middle|\, \hat{\mu}^{(0)} \right)$$
$$= E\left( \Pr\left[ \| \eta^{(m)} \|^2 > c^{2m} \| \eta^{(0)} \|^2 \,\middle|\, \langle \hat{\mu}^{(j)} \rangle_{j=0}^{m-1} \right] I\left[ \bigcap_{j=0}^{m-1} F_j \right] \,\middle|\, \hat{\mu}^{(0)} \right)$$
$$\le E\left( \frac{ E\left[ \| \eta^{(m)} \|^2 \,\middle|\, \langle \hat{\mu}^{(j)} \rangle_{j=0}^{m-1} \right] }{ c^{2m} \| \eta^{(0)} \|^2 } \, I\left[ \bigcap_{j=0}^{m-1} F_j \right] \,\middle|\, \hat{\mu}^{(0)} \right)$$
$$\le E\left( \frac{ b^m \| \eta^{(0)} \|^2 }{ c^{2m} \| \eta^{(0)} \|^2 } \, I\left[ \bigcap_{j=0}^{m-1} F_j \right] \,\middle|\, \hat{\mu}^{(0)} \right) = \left( \frac{b}{c^2} \right)^m \Pr\left( \bigcap_{j=0}^{m-1} F_j \,\middle|\, \hat{\mu}^{(0)} \right),$$

and hence also

$$\Pr\left( \bigcap_{j=0}^{m} F_j \,\middle|\, \hat{\mu}^{(0)} \right) \ge \left[ 1 - \left( \frac{b}{c^2} \right)^m \right] \Pr\left( \bigcap_{j=0}^{m-1} F_j \,\middle|\, \hat{\mu}^{(0)} \right).$$

So

$$\Pr\left( \bigcap_{j=0}^{\infty} F_j \,\middle|\, \hat{\mu}^{(0)} \right) = \lim_{m \to \infty} \Pr\left( \bigcap_{j=0}^{m} F_j \,\middle|\, \hat{\mu}^{(0)} \right) \ge \rho\, I[F_0],$$

where $\rho > 0$ is given by

$$\rho = \prod_{m=1}^{\infty} \left[ 1 - \left( \frac{b}{c^2} \right)^m \right].$$

Note that $\rho > 0$ since $\sum_{m=1}^{\infty} (b/c^2)^m < \infty$ (see Theorem 12-52 of Apostol (1957)). Now

$$\bigcap_{m=0}^{\infty} F_m \subseteq \{ T = \infty \} = \{ \eta^{(m)} = \hat{\mu}^{(m)} - \mu \text{ for all } m \}.$$

That is, the $\eta$ and $\hat{\mu}$ processes never decouple on the event $F_0$. So,

$$\Pr\left\{ \| \hat{\mu}^{(m)} - \mu \| \le c^m \| \hat{\mu}^{(0)} - \mu \| \text{ for all } m \,\middle|\, \hat{\mu}^{(0)} \right\} I[\| \hat{\mu}^{(0)} - \mu \| < \epsilon] \ge \Pr\left( \bigcap_{m=0}^{\infty} F_m \,\middle|\, \hat{\mu}^{(0)} \right) I[\| \hat{\mu}^{(0)} - \mu \| < \epsilon] \ge \rho\, I[\| \hat{\mu}^{(0)} - \mu \| < \epsilon]. \qquad (13)\ \Box$$

We must start our initial estimate $\hat{\mu}^{(0)}$ close enough to $\mu$ for the probability bound of Lemma 2 to hold. However, even if we start with $\| \hat{\mu}^{(0)} - \mu \| \ge \epsilon$, we can wait to see if $\| \hat{\mu}^{(m^*)} - \mu \| < \epsilon$ for some $m^*$. If this happens, the strong Markov property tells us that

$$\| \hat{\mu}^{(m)} - \mu \| \le c^{m - m^*} \| \hat{\mu}^{(m^*)} - \mu \| \quad \text{for all } m \ge m^*$$

holds with probability at least $\rho$. That is, every time the process $\{ \hat{\mu}^{(m)} \}_{m=0}^{\infty}$ enters the $\epsilon$-neighborhood of $\mu$, there is probability at least $\rho$ that exponential convergence occurs. If we can show that this must happen infinitely often, then exponential convergence would be certain. We are now ready to complete the proof.
Proof of Theorem: Let $n_{D_1}$ be the number of rows in the design matrix $X_{D_1}$ and let $\bar{Y}^{(m)} = (\bar{Y}_1^{(m)}, \ldots, \bar{Y}_{n_{D_1}}^{(m)})$ denote the $n_{D_1}$-dimensional vector of the averages of the $r$ values of $Y_{Q(\hat{\mu}^{(m)})}$ at each design point $i_j$ in $D_1$ at the $m$th iteration (i.e., $\bar{Y}_j^{(m)}$ is the average of the $r$ replicates started at $i_j$). The average total path scores $\bar{Y}_j^{(m)}$ are nonnegative and have finite mean $\mu_{i_j}$, so by Markov's inequality

$$\Pr\left\{ \bar{Y}_j^{(m+1)} > k \mu_{i_j} \,\middle|\, \hat{\mu}^{(m)} \right\} \le \frac{1}{k},$$

for $j = 1, \ldots, n_{D_1}$. Conditional on $\hat{\mu}^{(m)}$, the $r n_{D_1}$ chains of iteration $m+1$ are independent, and hence

$$\Pr\left\{ \bar{Y}_j^{(m+1)} \le k \mu_{i_j} \text{ for all } j \,\middle|\, \hat{\mu}^{(m)} \right\} \ge \left( 1 - \frac{1}{k} \right)^{n_{D_1}}.$$

Choose $k$ such that

$$\left( 1 - \frac{1}{k} \right)^{n_{D_1}} > \frac{1}{2}.$$

Denote by $\mu_{D_1} = (\mu_{i_1}, \ldots, \mu_{i_{n_{D_1}}})$ the vector of component values of $\mu$ corresponding to the states in $D_1$. Let $[0, k\mu_{D_1}]$ be the rectangle set in $\mathbb{R}^{n_{D_1}}$ where $0 \le y_j \le k \mu_{i_j}$, $1 \le j \le n_{D_1}$, and put

$$H = \sup_{y \in [0, k\mu_{D_1}]} \max_{1 \le i \le d} x_i^T (X_{D_1}^T X_{D_1})^{-1} X_{D_1}^T y.$$

Then

$$\Pr\left\{ \hat{\mu}^{(m+1)} \le H \mathbf{1} \,\middle|\, \hat{\mu}^{(m)} \right\} \ge \frac{1}{2}. \qquad (14)$$

Let $\mathcal{U}$ denote the set of vectors in $\mathbb{R}^d$ having all positive components. For a vector $\tilde{\mu}$ in $\mathcal{U}$, let $L_i(\cdot \mid \tilde{\mu})$ denote the probability law (distribution) of $Y_{Q(\tilde{\mu})}$ when simulating under $Q(\tilde{\mu})$ and starting at $X_0 = i$. That is, for sets $A \subseteq \mathbb{R}$,

$$L_i(A \mid \tilde{\mu}) = \Pr\left\{ Y_{Q(\hat{\mu}^{(m)})} \in A \,\middle|\, X_0 = i,\ \hat{\mu}^{(m)} = \tilde{\mu} \right\}.$$

Note that the right hand side of the above equation does not depend on $m$. The transition probabilities $q_{ij}(\tilde{\mu})$ are continuous functions of $\tilde{\mu}$. For $\tilde{\mu} \in \mathcal{U}$, $Q(\tilde{\mu}) \gg P$, so the likelihood ratios $L_n$ and total scores $Y_{Q(\tilde{\mu})}$ are continuous functions of $\tilde{\mu}$. It follows that the map $\tilde{\mu} \mapsto L_i(\cdot \mid \tilde{\mu})$ is continuous in the topology of weak convergence; that is,

if $\tilde{\mu}_n \to \tilde{\mu}$, then $L_i(\cdot \mid \tilde{\mu}_n) \Rightarrow L_i(\cdot \mid \tilde{\mu})$.

Let $E^{\tilde{\mu}}(\cdot)$ denote expectation under $L_i(\cdot \mid \tilde{\mu})$. Fix $\lambda > 0$. Suppose that we have a sequence of vectors $\langle \tilde{\mu}_n \rangle$ in $\mathcal{U}$ with $\tilde{\mu}_n \to \tilde{\mu}$. Then the random variables $Y_{Q(\tilde{\mu}_n)} I[Y_{Q(\tilde{\mu}_n)} \le \lambda]$ under $Q(\tilde{\mu}_n)$ converge in distribution to $Y_{Q(\tilde{\mu})} I[Y_{Q(\tilde{\mu})} \le \lambda]$. By bounded convergence,

$$E^{\tilde{\mu}_n}\left( Y_{Q(\tilde{\mu}_n)} I[Y_{Q(\tilde{\mu}_n)} \le \lambda] \right) \to E^{\tilde{\mu}}\left( Y_{Q(\tilde{\mu})} I[Y_{Q(\tilde{\mu})} \le \lambda] \right).$$

That is, $E^{\tilde{\mu}}(Y_{Q(\tilde{\mu})} I[Y_{Q(\tilde{\mu})} \le \lambda])$ is a continuous function of $\tilde{\mu}$ for each fixed $\lambda$. For any $\tilde{\mu} \in \mathcal{U}$,

$$E^{\tilde{\mu}}(Y_{Q(\tilde{\mu})}) = \mu_i, \qquad (15)$$

which is a constant and hence a continuous function of $\tilde{\mu}$. So, on $\mathcal{U}$,

$$E^{\tilde{\mu}}\left( Y_{Q(\tilde{\mu})} I[Y_{Q(\tilde{\mu})} > \lambda] \right) = E^{\tilde{\mu}}(Y_{Q(\tilde{\mu})}) - E^{\tilde{\mu}}\left( Y_{Q(\tilde{\mu})} I[Y_{Q(\tilde{\mu})} \le \lambda] \right)$$

is continuous in $\tilde{\mu}$. Since $Y_{Q(\tilde{\mu})}$ has finite mean,

$$\lim_{\lambda \to \infty} E^{\tilde{\mu}}\left( Y_{Q(\tilde{\mu})} I[Y_{Q(\tilde{\mu})} > \lambda] \right) = 0.$$

If we restrict $\tilde{\mu}$ to $\{ \tilde{\mu} : \delta \mathbf{1} \le \tilde{\mu} \le H \mathbf{1} \}$, then we can think of $E^{\tilde{\mu}}(Y_{Q(\tilde{\mu})} I[Y_{Q(\tilde{\mu})} > \lambda])$ as a family of continuous functions of $\tilde{\mu}$ on a compact set, indexed by $\lambda$. These functions tend monotonically to zero pointwise as $\lambda \to \infty$, so the convergence is uniform by Dini's Theorem. That is,

$$\lim_{\lambda \to \infty} \sup_{\delta \mathbf{1} \le \tilde{\mu} \le H \mathbf{1}} E^{\tilde{\mu}}\left( Y_{Q(\tilde{\mu})} I[Y_{Q(\tilde{\mu})} > \lambda] \right) = 0, \qquad (16)$$

so that the family of probability measures

$$\{ L_i(\cdot \mid \tilde{\mu}) : \delta \mathbf{1} \le \tilde{\mu} \le H \mathbf{1} \}$$

is uniformly integrable. By Parzen (1954), this implies that the weak law of large numbers holds uniformly over $\{ \tilde{\mu} : \delta \mathbf{1} \le \tilde{\mu} \le H \mathbf{1} \}$. Let $\epsilon$ be as in Lemma 2. Then

$$\lim_{r \to \infty} \sup_{\delta \mathbf{1} \le \tilde{\mu} \le H \mathbf{1}} \Pr\left\{ |x_i^T \hat{\beta}^{(m+1)} - \mu_i| > \frac{\epsilon}{\sqrt{d}} \,\middle|\, \hat{\mu}^{(m)} = \tilde{\mu} \right\} = 0. \qquad (17)$$

Note that this is where the assumption that $\mu_i \ge \delta$ for all $i$ is critical. Equation (15) does not necessarily hold unless all components of $\tilde{\mu}$ are positive. We must have a compact set in order for equation (16) to hold, and the only obvious way to obtain this is to bound $\tilde{\mu}$ away from 0. By (17), we can choose $R_2$ large enough so that $r \ge R_2$ implies

$$\sup_{\delta \mathbf{1} \le \tilde{\mu} \le H \mathbf{1}} \Pr\left\{ |x_i^T \hat{\beta}^{(m+1)} - \mu_i| > \frac{\epsilon}{\sqrt{d}} \,\middle|\, \hat{\mu}^{(m)} = \tilde{\mu} \right\} < \frac{1}{2d}. \qquad (18)$$

If the algorithm is run with $r \ge R_2$, then a union bound yields

$$\Pr\left\{ \| \hat{\mu}^{(m+1)} - \mu \| < \epsilon \,\middle|\, \hat{\mu}^{(m)} \right\} \ge \frac{1}{2} \qquad (19)$$

on the event $\{ \delta \mathbf{1} \le \hat{\mu}^{(m)} \le H \mathbf{1} \}$. Let $R_1$, $c$, $\epsilon$ and $\rho$ be as in Lemma 2. Let $R = \max(R_1, R_2)$. Suppose the algorithm is run with $r \ge R$. Now the sequence $\langle \hat{\mu}^{(m)} \rangle_{m=0}^{\infty}$ is Markov, so (14) and (19) imply that
$$\Pr\left\{ \| \hat{\mu}^{(m+2)} - \mu \| < \epsilon \,\middle|\, \hat{\mu}^{(m)} \right\} \ge \Pr\left\{ \| \hat{\mu}^{(m+2)} - \mu \| < \epsilon \,\middle|\, \delta \mathbf{1} \le \hat{\mu}^{(m+1)} \le H \mathbf{1} \right\} \Pr\left\{ \delta \mathbf{1} \le \hat{\mu}^{(m+1)} \le H \mathbf{1} \,\middle|\, \hat{\mu}^{(m)} \right\} \ge \frac{1}{4}.$$

Thus, independently of the value $\hat{\mu}^{(m)}$, if the algorithm is run with $r \ge R$ we have at least $1/4$ probability of being within an $\epsilon$-ball of $\mu$ in two steps. Letting

$$\pi_l = \Pr\left\{ \| \hat{\mu}^{(l+2)} - \mu \| < \epsilon \,\middle|\, \hat{\mu}^{(l)} \right\},$$

then $\sum_l \pi_l$ diverges, and hence by the conditional Borel-Cantelli lemma (Corollary 5.29 of Breiman (1968)),

$$\Pr\left\{ \| \hat{\mu}^{(m)} - \mu \| < \epsilon \text{ infinitely often} \right\} = 1. \qquad (20)$$

We have shown that if we start within an $\epsilon$-ball of $\mu$ we have a positive probability of exponential convergence occurring, and we have now shown that if we leave this $\epsilon$-ball, we will return to it with probability 1. We now tie the two together to complete the proof. Define two sequences of stopping times $\{U_n\}$ and $\{W_n\}$ inductively as follows:
$$W_0 = 0; \qquad U_{n+1} = \inf\{ m > W_n : \| \hat{\mu}^{(m)} - \mu \| < \epsilon \}, \quad n \ge 0;$$

$$W_n = \inf\{ m > U_n : \| \hat{\mu}^{(m)} - \mu \| > c^{m - U_n} \| \hat{\mu}^{(U_n)} - \mu \| \}, \quad n \ge 1.$$

The $W_n$ mark the times at which exponential convergence fails after each $U_n$, and the $U_n$ mark the times thereafter that $\{ \hat{\mu}^{(m)} \}$ enters the $\epsilon$-neighborhood of $\mu$. By Lemma 2 and the strong Markov property,

$$\Pr\{ W_n = \infty \mid U_n < \infty \} \ge \rho. \qquad (21)$$

By (20),

$$\Pr\{ U_n < \infty \mid W_{n-1} < \infty \} = 1. \qquad (22)$$

Let

$$G_n = \{ W_{n-1} < \infty \text{ and } W_n = \infty \}.$$

The $G_n$ are obviously disjoint. Note that

$$\bigcap_{l=1}^{n-1} G_l^c = \bigcap_{l=1}^{n-1} \{ W_l < \infty \}.$$

By the Markov property and (21) and (22),

$$\Pr\left\{ G_n \,\middle|\, \bigcap_{l=1}^{n-1} G_l^c \right\} = \Pr\{ G_n \mid W_{n-1} < \infty \} \ge \Pr\{ U_n < \infty \mid W_{n-1} < \infty \} \Pr\{ W_n = \infty \mid U_n < \infty \} \ge 1 \cdot \rho = \rho.$$

It follows by the conditional Borel-Cantelli lemma that

$$\Pr\left\{ \bigcup_{n=1}^{\infty} G_n \right\} = 1.$$

Now,

$$G_n = \{ W_{n-1} < \infty \text{ and } U_n = \infty \} \cup \{ U_n < \infty \text{ and } W_n = \infty \}.$$

By equation (22), the event on the left has probability zero, so

$$\Pr\left\{ \bigcup_{n=1}^{\infty} \{ U_n < \infty \text{ and } W_n = \infty \} \right\} = 1. \qquad (23)$$

Recall from Lemma 2 that $c < 1$ and hence $-\log_e(c) > 0$. Choose $0 < \gamma < -\log_e(c)$ and note that

$$\{ e^{\gamma m} \| \hat{\mu}^{(m)} - \mu \| \to 0 \} \supseteq \{ U_n < \infty \text{ and } W_n = \infty \} \quad \text{for all } n.$$

Thus, by equation (23),

$$\Pr\{ e^{\gamma m} \| \hat{\mu}^{(m)} - \mu \| \to 0 \} \ge \Pr\left\{ \bigcup_{n=1}^{\infty} \{ U_n < \infty \text{ and } W_n = \infty \} \right\} = 1,$$

and the Theorem is proved. $\Box$

We close this section with a proof that exponential convergence is preserved under the modified algorithm which uses previous information (i.e., (4) replaces Step 4). We also give some discussion afterwards indicating that the exponential rate can probably be improved by a good choice of the parameter $\alpha$ in (4).
Proposition. Under the same assumptions as the Theorem, the same conclusions hold for the modified algorithm with (4) in place of Step 4.
Proof: The steps in the proof of the Theorem are equally valid for the modified algorithm, except for (8) and (20). To deal with (8), note that

$$E\left[ \| \hat{\mu}^{(m+1)} - \mu \|^2 I[\| \hat{\mu}^{(m)} - \mu \| < \epsilon] \,\middle|\, \hat{\mu}^{(m)} \right] \le \left\{ \alpha^2 b \| \hat{\mu}^{(m)} - \mu \|^2 + (1 - \alpha)^2 \| \hat{\mu}^{(m)} - \mu \|^2 \right\} I[\| \hat{\mu}^{(m)} - \mu \| < \epsilon]$$
$$= \left[ \alpha^2 b + (1 - \alpha)^2 \right] \| \hat{\mu}^{(m)} - \mu \|^2 I[\| \hat{\mu}^{(m)} - \mu \| < \epsilon].$$

Then (8) holds but with $b$ replaced by

$$b^* = \alpha^2 b + (1 - \alpha)^2. \qquad (24)$$

Since $0 < \alpha \le 1$, it follows that $0 < b^* < 1$.

To establish a version of (20), first let

$$E_m = \left\{ |x_i^T \hat{\beta}^{(m)} - \mu_i| \le \frac{\epsilon}{2\sqrt{d}},\ 1 \le i \le d \right\}.$$

Now (18) continues to hold for $r$ sufficiently large, so we may take $r$ large enough that, by subadditivity,

$$\Pr\left[ E_{m+1} \,\middle|\, \hat{\mu}^{(m)} = \tilde{\mu} \right] \ge \frac{1}{2} \quad \text{for all } \delta \mathbf{1} \le \tilde{\mu} \le H \mathbf{1}. \qquad (25)$$

Now for any $N > 0$,

$$\hat{\mu}_i^{(m+N)} = (1 - \alpha)^N \hat{\mu}_i^{(m)} + \sum_{n=1}^{N} \alpha (1 - \alpha)^{N-n} \max\{ [X \hat{\beta}^{(m+n)}]_i, \delta \}.$$

As $0 < \alpha \le 1$, we can choose $N$ sufficiently large that $(1 - \alpha)^N < \epsilon / (4H\sqrt{d})$, and then if $\hat{\mu}^{(m)} \le H \mathbf{1}$, we have $0 \le (1 - \alpha)^N \hat{\mu}_i^{(m)} < \epsilon / (4\sqrt{d})$. We also assume $(1 - \alpha)^N \mu_i < \epsilon / (4\sqrt{d})$. If in addition

$$\left| \max\{ [X \hat{\beta}^{(m+n)}]_i, \delta \} - \mu_i \right| < \frac{\epsilon}{2\sqrt{d}} \quad \text{for } n = 1, 2, \ldots, N,$$

then

$$\left| \hat{\mu}_i^{(m+N)} - \mu_i \right| \le (1 - \alpha)^N | \hat{\mu}_i^{(m)} - \mu_i | + \sum_{n=1}^{N} \alpha (1 - \alpha)^{N-n} \left| \max\{ [X \hat{\beta}^{(m+n)}]_i, \delta \} - \mu_i \right|$$
$$\le \frac{\epsilon}{2\sqrt{d}} + \left[ 1 - (1 - \alpha)^N \right] \frac{\epsilon}{2\sqrt{d}} \le \frac{\epsilon}{\sqrt{d}},$$

and we conclude that

$$\left\{ \| \hat{\mu}^{(m+N)} - \mu \| < \epsilon \right\} \supseteq \left\{ \hat{\mu}^{(m)} \le H \mathbf{1} \right\} \cap \bigcap_{n=1}^{N} E_{m+n}.$$

Note that the newly defined $\langle \hat{\mu}^{(m)} : m = 1, 2, \ldots \rangle$ is still a Markov chain. By the Markov property, (14), and (25),

$$\Pr\left\{ \| \hat{\mu}^{(m+N+1)} - \mu \| < \epsilon \,\middle|\, \hat{\mu}^{(m)} \right\} \ge 2^{-(N+1)}.$$

The analogue of (20) now follows by the conditional Borel-Cantelli lemma, which completes the proof of the Proposition.

If one assumes that an exact exponential rate is achieved for the original algorithm (rather than an upper bound, as the foregoing theorem asserts), then it is possible to obtain an optimal $\alpha$ in the modified algorithm which uses previous information. By an "exact exponential rate," we mean there exists a $b < 1$ such that

$$E\left[ \| \hat{\mu}^{(m+1)} - \mu \|^2 \,\middle|\, \hat{\mu}^{(m)} \right] \approx b \| \hat{\mu}^{(m)} - \mu \|^2.$$

Our simulation results of the next section suggest this is in fact the case. Then, as in (24), the modified algorithm satisfies the same approximation but with $b$ replaced by $b^*$. The optimal $b^*$ is given by

$$\alpha = 1/(b + 1) \implies b^* = b/(b + 1).$$

For $b$ close to 1, this can reduce the exponential factor by about $1/2$.
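As a quick numerical check of this (ours, not in the original): minimizing $b^* = \alpha^2 b + (1-\alpha)^2$ over $\alpha$ gives $\alpha = 1/(b+1)$, and taking $b = 0.9$ yields $\alpha = 1/1.9 \approx 0.53$ and $b^* = 0.9/1.9 \approx 0.47$, so the per-iteration mean-square contraction factor is indeed roughly halved.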
4 Examples.

We now illustrate the performance of the algorithm in a few simple examples.

Example 1. Consider the problem of counting the number of steps before absorption in a very simple system. Let the $d \times d$ matrix $P$ be defined by $p_{ij} = 1/(d+1)$ for all $i, j$. The scores are $s_{ij} = 1$ for all $i, j$ (we include $s_{i\Delta} = 1$ for simplicity). By inspection, the number of steps to extinction starting from any initial state has a geometric distribution with parameter $1/(d+1)$, so the expected score is $d + 1$. In this case, we fit the model $\mu_i = $ constant. The results for a $19 \times 19$ matrix with 6 replications of the design vector of initial states $(1, 7, 13, 19)$ are shown in Figure 1. Convergence is measured in two ways: first, by the log sum of squared deviations about the fitted values at the design points, and second, by the log sum of squared deviations about the true values at the design points, similar to the norm in the main theorem. The wholly internal variance estimate is needed in the case where the true solution is unknown; note that the two estimates track each other quite well over the entire sequence of iterations. The estimates in Figure 1 are based on data from the $m$th iteration only. Incorporating information from previous iterations was tried, but the results proved at best indistinguishable from the results shown here. The roughly linear (in log scale) decrease towards zero variance is common in our experience, where the proportional decrease in variation with each iteration is directly related to the slope in this plot. This linearity is the source of the term "exponential convergence". (A code sketch reproducing this example appears at the end of this section.)
In applications to complex problems, there is no attempt to fully achieve the zero variance solution (as displayed in Figure 1). Simulation codes contain many approximations: for example, physical constants such as cross sections in particle physics problems are determined experimentally, and are known to contain error. Another source of approximation in certain cases is the discretization of continuous state spaces. Simulation output simply propagates such errors. Beyond a certain point in the iterations, errors from such approximations dominate the statistical error in the simulation, and it no longer makes sense to continue computing. Consequently, iterating until zero variance is observed ("zero" to machine accuracy as in Figure 1) is not realistic.

Example 2. In this case we remain with the problem of counting the number of steps before absorption, but we change the transition matrix $P$ so that $p_{ij} = 0.5^{\,j-i+1}$ for $j \ge i$, and $p_{ij} = 0$ otherwise. This can be viewed as the situation of a particle moving on a string divided into intervals, on which the particle moves forward a geometric number of intervals at each step. Due to the memoryless property of the geometric distribution, the expected number of steps is a linear function of the initial state. In the case of $d = 19$, $\mu_i = 21 - i$. The results of fitting $\mu_i = \beta_0 + \beta_1 i$ with 6 replications of the same design used in Example 1 are shown in Figure 2. Convergence results using both the true and estimated variance are again shown.

Example 3. Finally, we consider the classic gambler's ruin problem. A gambler with $i$ dollars is betting repeatedly against the house, winning or losing 1 dollar on each bet. The gambler has a probability $p$ of winning each bet. Play continues until either the gambler is reduced to zero dollars and is "ruined" or the gambler has acquired a total of $z$ dollars, at which point he has "broken the bank" (the bank's initial capital is $z - i$ dollars). We are interested in the probability of the gambler's eventual ruin as a function of his initial fortune. This probability can be shown to be

$$\mathrm{ruin}(i) = \frac{(q/p)^z - (q/p)^i}{(q/p)^z - 1}$$

when $q = 1 - p \ne p$, and $\mathrm{ruin}(i) = 1 - i/z$ when $q = p = 1/2$. We model this by fitting the constant $c$ in $\mu_i = c[(q/p)^z - (q/p)^i]$. In this situation, $p_{ij} = p$ if $j - i = 1$, $p_{ij} = q$ if $i - j = 1$, and $p_{ij} = 0$ otherwise. The scores are all zero except for $s_{1\Delta} = 1$. The case of the gambler breaking the bank is handled by modeling the transition to $z$ dollars as a transition to the absorbing state, receiving zero score. Results for $p = 0.6$, $z = 20$ using the same design as in the previous two examples are shown in Figure 3. Convergence results using both the true and estimated variance are again shown.

Note the linearity of convergence on a logarithmic scale in all three cases. Convergence to machine accuracy is attained in under 20 iterations in these examples. Reducing the number of design points or replications slightly slows the rate of exponential convergence.
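For concreteness, Example 1 can be run end-to-end with the sketches from Section 2. The snippet below is our own illustration (0-based state indexing; not the authors' code), wiring up the uniform kernel and the constant-mean model:

```python
d = 19
P = np.full((d, d), 1.0 / (d + 1))   # uniform kernel; each row leaves 1/(d+1)
                                     # of its mass for the cemetery
s = np.ones((d, d))                  # every transition scores 1,
s_death = np.ones(d)                 # including the final one
X = np.ones((d, 1))                  # model: mu_i = constant
design = [0, 6, 12, 18]              # states 1, 7, 13, 19 in the paper's indexing

mu_hat = adaptive_is(P, s, s_death, X, design, r=6, n_iter=20)
print(mu_hat)                        # entries should approach d + 1 = 20
```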
5 Discussion

There are some limitations in the results here which we hope to remove in future work. One obvious such limitation is the assumption of a finite state space. The algorithm can be extended to general state spaces without difficulty, but the proof of exponential convergence as given here needs to be reworked. One major difficulty is establishing transience of the importance sampling chains in a neighborhood of the zero variance chain. Simulations suggest that exponential convergence still holds.

The true $\mu$ may not be of the form $X\beta$; indeed, the linear model is best viewed as an approximation of reality. Often, general qualities of the solution are known from physical principles (principles such as continuity, i.e., $\mu_i$ values from states which are "close" should be close), but those principles don't explicitly translate into a linear model having parameter vector $\beta$. Since simulation results embody still other approximations, as noted in the discussion at the end of Example 1, the absence of a formally correct model is not serious. When the form of $\mu$ is unknown, it can often be closely approximated using, for example, a high-degree polynomial. Although choosing $\hat{\mu}$ to give the "best" such polynomial fails to produce the zero variance importance sampling, it is common in such cases to observe exponential convergence to the variance associated with that best polynomial, after which point the convergence to zero is at the $n^{-1}$ rate. When the model is initially unknown, we can fit it by sequentially extending our class of models (with polynomials of increasing degree, for example). Internal variance estimates using this procedure alternate between decreasing exponentially and "plateauing", as the best approximation is approached and achieved. Interestingly, the exponential constants tend to become smaller and the fit becomes more unstable as the complexity of the model increases, introducing a penalty for overfitting. Diagnostics for detecting when to expand (or diminish) the dimensionality of the model are under development.

The potential of the adaptive method is great. The variance reduction of adaptive importance sampling escapes the $n^{-1}$ rate (or any power of $n$) by intelligently exploiting dependent data (i.e., learning) rather than focusing on well-chosen independent data values. Adaptive importance sampling has been shown here to achieve at least exponential variance reduction rates. Finally, the algorithm used to attain these rates is conceptually straightforward and elegant, and extensions to other problems follow directly.
References

Apostol, T. M. (1957), Mathematical Analysis, Reading, Addison-Wesley.

Booth, T. E. (1985), "Exponential convergence for Monte Carlo particle transport?", Transactions of the American Nuclear Society, 50, 267-268.

Booth, T. E. (1986), "A Monte Carlo learning/biasing experiment with intelligent random numbers", Nuclear Science and Engineering, 92, 465-481.

Booth, T. E. (1988), "The intelligent random number technique in MCNP", Nuclear Science and Engineering, 100, 248-254.

Booth, T. E. (1989), "Zero-variance solutions for linear Monte Carlo", Nuclear Science and Engineering, 102, 332-340.

Breiman, L. (1968), Probability, Reading, Addison-Wesley.

Briesmeister, J. F., ed. (1993), "MCNP - A General Monte Carlo N-Particle Transport Code, Version 4A", Los Alamos National Laboratory Report LA-12625-MS.

Carter, L. L. and Cashwell, E. D. (1975), Particle-Transport Simulation with the Monte Carlo Method, Technical Information Center, Energy Research and Development Administration, Springfield, VA.

Halton, J. H. (1962), "Sequential Monte Carlo", Proceedings of the Cambridge Philosophical Society, 58, 57-78.

Hammersley, J. M. and Handscomb, D. C. (1964), Monte Carlo Methods, London, Chapman and Hall.

Kalos, M. H. and Whitlock, P. A. (1986), Monte Carlo Methods, New York, John Wiley.

Kollman, C. (1993), "Rare Event Simulation in Radiation Transport", unpublished Ph.D. thesis, Department of Statistics, Stanford University.

Lux, I. and Koblinger, L. (1991), Monte Carlo Particle Transport Methods: Neutron and Photon Calculations, Boca Raton, CRC Press.

Marshall, A. W. (1956), "The use of multi-stage sampling schemes in Monte Carlo computation", in University of Florida Symposium on Monte Carlo Methods, 123-140, New York, John Wiley.

Parzen, E. (1954), "On uniform convergence of families of sequences of random variables", University of California Publications in Statistics, 2, 23-54.
[Figure 1: Convergence Results for Example 1. Panel title: "Uniform Dispersion, 19 states". Log sum of squares at design points (internal variance and true variance) versus iteration number.]

[Figure 2: Convergence Results for Example 2. Panel title: "Geometric Steps to Right, p = 0.5, Region Length = 20". Log sum of squares at design points (internal variance and true variance) versus iteration number.]

[Figure 3: Convergence Results for Example 3. Panel title: "Gamblers Ruin, z = 20, p = 0.6, 24 design points per iteration". Log sum of squares at design points versus iteration number.]