Biometrika (1997), 84, 1, pp. 1-18. Printed in Great Britain
Computational complexity of Markov chain Monte Carlo methods for finite Markov random fields

BY ARNOLDO FRIGESSI AND FABIO MARTINELLI
Dipartimento di Matematica, Universita di Roma Tre, via Segre 2, 00146 Roma, Italy
e-mail: [email protected] [email protected]

AND JULIAN STANDER
Department of Mathematics and Statistics, University of Plymouth, Drake Circus, Plymouth, PL4 8AA, U.K.
e-mail: [email protected]

SUMMARY
This paper studies the computational complexity of Markov chain Monte Carlo algorithms with finite-valued Markov random fields on a finite regular lattice as target distributions. We state conditions under which the complexity for approximate convergence is polynomial in n, the number of variables. Approximate convergence takes time O(n log n) as n → ∞ if the target field satisfies certain spatial mixing conditions. Otherwise, if the field has a potential with finite interaction range independent of n, the complexity is exponential in n^γ, with γ < 1, which is still more favourable than enumerating all the states. When the interaction range grows with n, the algorithms can converge exponentially slowly in n. Analogous results are provided for an expectation approximated by an average along the chain.

Some key words: Gibbs sampler; Metropolis algorithm; NP-completeness; Range of interaction; Rate of convergence; Spectral gap; Stopping rule.
1. INTRODUCTION
Today, Markov chain Monte Carlo methods have applications in almost every branch of statistics, especially Bayesian inference. Some references are Besag & Green (1993), Besag et al. (1995) and Geyer (1992). Sampling and computing expectations via these algorithms are approximate, because the chain is stopped after a finite number t of transitions. If we fix a measure of the error committed, we may consider the number of iterations t* needed to make this error small. We are interested in the behaviour of t* as a function of n, the number of variables. This is the computational complexity for approximate convergence. The aim of this paper is to characterise the computational complexity of some Markov chain Monte Carlo algorithms for approximate sampling from, and computing expectations with respect to, a certain class of regular finite Markov random fields.

Our interest arises from Bayesian image analysis, where the methods are used for tasks such as restoration, classification and object recognition. We consider an image comprising a finite number n of random variables, positioned on the sites or pixels of a finite lattice Λ. The joint prior or posterior distribution μ is assumed to be a Markov random field
on Λ. We assume that the support Ω_Λ of μ is finite, so that both sampling from μ and computing the expectation of some function f with respect to μ can be performed exactly by full enumeration in finite time. This time is exponential in n because there are exponentially many configurations in Ω_Λ. For this reason approximations or iterative methods are needed. However, if the computational complexity of these were of the same exponential order as full enumeration, no benefit would be gained.

The results in this paper concern reversible chains such that at each iteration one of the n sites is chosen uniformly and its corresponding variable is updated. We discuss some additional assumptions in § 2. Among the methods to which our results apply are the single site updating Gibbs sampler, the Metropolis algorithm and other Hastings dynamics (Hastings, 1970). Sections 3 and 4 deal with approximate sampling: the main theorems are closely related to results in the statistical mechanics literature. We prove that under certain spatial mixing conditions the computational complexity is polynomial in n. Often t* = O(n log n) as n → ∞. Note that n log n is approximately the expected time to update each site at least once when sites are chosen at random, and is the complexity if the n variables are independent. We conjecture that approximate Markov chain Monte Carlo sampling has polynomial complexity whenever the target field does not undergo phase transition. When a target field defined on Λ ⊂ Z² does not satisfy spatial mixing conditions, but has finite interaction range, approximate sampling has computational complexity not larger than O{exp(cn^{1/2})}, for c independent of n. This bound is tight in the case of the two-dimensional Ising model. Hence approximate convergence requires a time exponential in n^{1/2}, and so the corresponding algorithm is still better than full enumeration.
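The n log n heuristic mentioned above is the coupon-collector bound: with uniform single-site selection, the expected number of steps before every site has been chosen at least once is nH_n, which is n log n to leading order. A quick simulation illustrates this; the sketch and all names below are ours, not the paper's.

```python
import random

def time_to_update_all(n, rng):
    """Number of uniform single-site selections until each of the n sites
    has been chosen at least once (the coupon-collector time)."""
    seen, t = set(), 0
    while len(seen) < n:
        seen.add(rng.randrange(n))
        t += 1
    return t

rng = random.Random(0)
n = 50
mean_time = sum(time_to_update_all(n, rng) for _ in range(500)) / 500.0
harmonic = sum(1.0 / k for k in range(1, n + 1))
# The theoretical mean is n * H_n = n log n + O(n); for n = 50 this is about 225.
```

The empirical mean over many runs should sit close to nH_n, which for independent variables is also the order of the sampling complexity.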
In § 4 we present an example in which the interaction range grows with n and for which approximate convergence requires a time exponential in n itself. Our estimates of t* are expressed as orders of magnitude in n, and we cannot provide constants.

Our concern is the study of the rate of weak convergence of a finite state Markov chain. There are two quite different ways to treat this question. In this paper we estimate the convergence rate a priori, before the chain is actually run. This point of view is taken by Frieze, Kannan & Polson (1994), Frigessi & den Hollander (1993), Jerrum & Sinclair (1993), Meyn & Tweedie (1993) and Rosenthal (1995), among many others. These authors propose bounds on the sampling errors; see § 3 for further discussion. A second way is on-line: while the chain is running one tries to detect when the distribution of the chain is close enough to the required stationary distribution. Recent rigorous contributions to this are Asmussen, Glynn & Thorisson (1992), Mykland, Tierney & Yu (1995), Propp & Wilson (1996), Robert (1995) and Roberts & Tweedie (1996).

In § 5 we consider the complexity of approximating expectations. Here the error can be defined in two different ways, giving rise to the notions of average case and worst case complexity. The latter is more important, but more complex to analyse. In both cases we obtain the same results as for sampling: if the target Markov random field satisfies certain spatial mixing conditions, then approximating expectations has polynomial complexity; if these conditions fail, then the complexity can still be better than full enumeration. All proofs are collected together in the Appendix.

Finally, we mention that other simulation methods that have been proposed to speed up convergence cannot be analysed by our methods.
Important examples include auxiliary variable methods and multiple site updating Markov chain Monte Carlo methods (Besag & Green, 1993), and simulated tempering (Marinari & Parisi, 1992). These algorithms seem in practice to reduce the complexity in the presence of phase transition.
2. SET-UP AND ASSUMPTIONS
Let Λ be a finite subset of Z^d with n sites. Assume that the random variable at each site takes values in a finite set S of cardinality |S|, and let Ω_Λ = S^Λ denote the set of all possible configurations; note that |Ω_Λ| is exponential in n. Let x = (x_i)_{i∈Λ} be an arbitrary element of Ω_Λ. In order to define the Markov random field measure on Ω_Λ we need to fix a boundary condition τ ∈ Ω_{Λ^c}. It is convenient to define a Markov random field through its Gibbs description. Let {U_D : D a finite subset of Z^d} be a potential, that is, a family of functions U_D : S^{Z^d} → R such that U_D(x) depends only on x_i, with i ∈ D, and which is translation invariant; that is, U_{D+k} = U_D, for all k ∈ Z^d and all sets D except singletons. Furthermore, let the potential have finite interaction range r independent of n, so that U_D = 0 if diam(D) > r, for, say, the Euclidean metric on the lattice. For some purposes we assume that |U_D| can be bounded above by a finite number independent of n for all sets D except singletons. Since U_D is defined on a finite set, this is the case in practice. Consider the energy with boundary condition τ:

    H_Λ^τ(x) = Σ_{D: D∩Λ ≠ ∅} U_D(x^τ),

where x_i^τ = x_i if i ∈ Λ, and x_i^τ = τ_i if i ∈ Λ^c. A Markov random field has probability distribution on Ω_Λ given by
    μ_Λ^τ(x) = exp{−βH_Λ^τ(x)} / Z_Λ^τ,

where β > 0 and Z_Λ^τ is the normalising constant. We sometimes drop subscripts and write μ for μ_Λ^τ. An important example of μ_Λ^τ is the Ising model. This has S = {−1, 1} and

    H_Λ(x) = −Σ_{|i−j|=1} x_i x_j − h Σ_{i∈Λ} x_i,   (1)

where the first sum is over all pairs of pixels i, j at distance one. The range r of interaction is one. The parameter β is called the inverse temperature. In the case of data y_i ∈ {−1, 1} obtained by applying Bernoulli noise with parameter π ∈ (0, ½), a data-modulated external field is present in the posterior, whose energy exponent is

    β Σ_{|i−j|=1} x_i x_j + h Σ_{i∈Λ} y_i x_i,   (2)
where h = ½ log{(1 − π)/π}.

Markov chain Monte Carlo algorithms are based on an irreducible aperiodic discrete time Markov chain (X(0) = x₀, X(1), …, X(t), …) with state space Ω_Λ and transition matrix P_Λ^τ(x, y), whose stationary distribution is μ_Λ^τ. In this paper we consider chains whose transition matrix satisfies:
(i) P_Λ^τ(x, y) = 0 if x and y differ at more than one pixel;
(ii) P_Λ^τ(x, y) is μ_Λ^τ-reversible: μ_Λ^τ(x)P_Λ^τ(x, y) = μ_Λ^τ(y)P_Λ^τ(y, x) for all x, y ∈ Ω_Λ.
The Gibbs sampler fits the above scheme. Assume that at time t the Markov chain is in state x. In order to obtain the value of X(t + 1), a pixel i ∈ Λ is selected uniformly at random for updating. Define x^{i,a}, for a ∈ S, to be the configuration that agrees with x at all pixels except pixel i, where it takes the value a. Then the transition kernel of the Gibbs
sampler is

    P_Λ^{τ,Gib}(x, x^{i,a}) = n^{−1} μ_Λ^τ(x^{i,a}) / Σ_{b∈S} μ_Λ^τ(x^{i,b}).

It is not essential to choose the site to be updated at random. Deterministic sweep strategies are also feasible. For technical reasons we concentrate on the former, but we expect polynomial complexity to hold under the same conditions in other cases, possibly with different orders. A further assumption is needed on the transition matrix P_Λ^τ for our results to hold: that there exist constants g₁ and g₂, independent of n, such that

    g₁ ≤ min_{x,y∈Ω_Λ} P_Λ^τ(x, y) / P_Λ^{τ,Gib}(x, y),   (3)

    max_{x,y∈Ω_Λ} P_Λ^τ(x, y) / P_Λ^{τ,Gib}(x, y) ≤ g₂,   (4)

where the ratios are taken over pairs x ≠ y with P_Λ^{τ,Gib}(x, y) > 0. The expectation E_{μ_Λ^τ}{f(X)} is approximated by the sample average along the chain,

    f_t = t^{−1} Σ_{s=1}^{t} f{X(s)}.   (5)

In order to assess the rate of convergence towards μ_Λ^τ we consider the sampling error after t steps,

    error(t) = sup_{x₀∈Ω_Λ} d{pr_Λ^τ(X(t) = · | X(0) = x₀), μ_Λ^τ(·)},   (6)

where d denotes the total variation distance. Given a tolerance ε > 0, the number of iterations needed for approximate convergence is

    t* = min{t > 0 : error(s) ≤ ε, for all s ≥ t}.
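The single-site Gibbs update described in this section can be sketched in code for the Ising model (1). This is only a sketch under free boundary conditions and h = 0; the function and variable names are ours, not the paper's.

```python
import math
import random

def gibbs_step(x, beta, rng):
    """One transition of the single-site Gibbs sampler for the Ising model (1)
    with h = 0 and free boundary: pick a pixel uniformly at random, then draw
    its new spin from the conditional distribution given its neighbours."""
    rows, cols = len(x), len(x[0])
    i, j = rng.randrange(rows), rng.randrange(cols)
    s = 0  # sum of the spins at distance one from pixel (i, j)
    for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        ni, nj = i + di, j + dj
        if 0 <= ni < rows and 0 <= nj < cols:
            s += x[ni][nj]
    # mu(x_ij = +1 | rest) = exp(beta*s) / {exp(beta*s) + exp(-beta*s)}
    p_plus = 1.0 / (1.0 + math.exp(-2.0 * beta * s))
    x[i][j] = 1 if rng.random() < p_plus else -1
    return x
```

At β = 0 the update is a fair coin flip and the field mixes immediately; for large β the conditional distribution concentrates on the majority spin of the neighbours, which is the mechanism behind the slow mixing discussed in § 4.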
In §§ 3 and 4 we will discuss the behaviour of t* as a function of n, for large n. In § 5, in order to assess the rate of convergence of the sample average f_t, we consider its mean squared error after t steps,

    E_{x₀}[{f_t − E_{μ_Λ^τ}{f(X)}}²],   (7)

where the outer expectation is computed with respect to the measure on the trajectory space starting from X(0) = x₀. There are two different ways to treat the dependency on the initial state x₀, and they give rise to average and worst case complexities, as will be explained in § 5.
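The stopping rule defining t* can be made concrete on a chain small enough to enumerate: power the transition matrix, track the total variation distance to the stationary distribution, and return the first time it falls below ε. Because total variation distance to stationarity is non-increasing in t, the first hitting time equals t*. The sketch below, with a two-state chain of our own choosing, is ours and not the paper's.

```python
def tv_distance(p, q):
    """Total variation distance d(p, q) = max_B |p(B) - q(B)| = 0.5 * sum_i |p_i - q_i|."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def t_star(P, pi, x0, eps):
    """First t with d(law of X(t) given X(0) = x0, pi) <= eps.  Since the
    total variation distance to stationarity never increases with t, this
    equals t* = min{t : error(s) <= eps for all s >= t}."""
    m = len(pi)
    dist = [0.0] * m
    dist[x0] = 1.0
    t = 0
    while tv_distance(dist, pi) > eps:
        dist = [sum(dist[i] * P[i][j] for i in range(m)) for j in range(m)]
        t += 1
    return t

# A reversible two-state chain with stationary distribution (2/3, 1/3).
P = [[0.9, 0.1], [0.2, 0.8]]
pi = [2.0 / 3.0, 1.0 / 3.0]
```

For this chain the error decays like λ₂^t with λ₂ = 0.7, so t* grows only logarithmically in 1/ε; the point of the paper is how such times scale with n for fields on a lattice, where full enumeration of the state space is impossible.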
3. SAMPLING IN POLYNOMIAL TIME
3.1. Spatial mixing ensures polynomial complexity

We look for conditions on μ_Λ^τ and on the sampling chain so that

    sup_{x₀∈Ω_Λ} d{pr_Λ^τ(X(t) = · | X(0) = x₀), μ_Λ^τ(·)} ≤ Cn exp(−ξt/n)   (8)

holds for ξ > 0 and C > 0 independent of n. Assuming that t ≥ n log(Cn)/ξ, so that the right side of (8) is less than or equal to 1, we have that t* ≤ O(n log n) as n → ∞; that is, t* is polynomial in n. The conditions we need on the Markov chain are (3) and (4), together with the requirement that μ_Λ^τ cannot be too sensitive to the boundary condition τ. We make this precise through Condition SZ below, which measures the degree of dependency of μ_Λ^τ on τ. Following Stroock & Zegarlinski (1992), for any Λ₀ and any Λ′ ⊆ Λ₀ define

    α_{ij} = sup_τ ‖μ_{Λ′}^τ(x_i ∈ ·) − μ_{Λ′}^{τ^j}(x_i ∈ ·)‖,

where τ^j is equal to τ except at the site j. Condition SZ holds if there exists δ < 1, independent of n, such that

    Σ_{i∈Λ₀} Σ_{j∈∂_r Λ₀} α_{ij} ≤ δ|Λ₀|.

THEOREM 1. Let P_Λ^τ be a single-site μ_Λ^τ-reversible updating dynamic such that (3) and (4) hold, with eigenvalues λ₁ = 1 > λ₂ ≥ … ≥ λ_{|Ω_Λ|}. If there exists a region Λ₀ ⊂ Z^d such that Condition SZ holds, then we have the following.
(i) If there exists a constant λ* ∈ (0, 1) independent of n such that −λ* < λ_k for all k, then there exist C > 0 and ξ > 0 not depending on n such that (8) holds for every Λ ⊂ Z^d. Accordingly, t* ≤ O(n log n).
(ii) Otherwise, (8) holds for all t ≥ n² log(Cn)/ξ′, where ξ′ > 0 is a constant. Accordingly, t* ≤ O(n² log n).
By t* ≤ O(a_n) we mean that the order with which t* → ∞ is at most a_n, as n → ∞. Hence Condition SZ together with the assumptions on the transition matrix implies polynomial complexity of approximate sampling. It is difficult to interpret and check Condition SZ. One can think of the number α_{ij} as a measure of the effect on μ_Λ^τ at site i of a change of the boundary condition τ at site j. Then Condition SZ with δ < 1 can be
interpreted roughly as saying that the overall effect of changing the boundary condition at sites j ∈ ∂_r Λ₀ has to be strictly smaller than the number of sites in Λ₀.

There are also other mixing conditions that imply (8). In particular Condition MO, proposed in Martinelli & Olivieri (1994), holds more often than Condition SZ, and is easier to check in our context. Condition MO also measures the dependency of the target distribution on boundary conditions. We refer the reader to the Quaderno of the Istituto per le Applicazioni del Calcolo report number 32/1993 by A. Frigessi, F. Martinelli and J. Stander, 'Computational complexity of Markov chain Monte Carlo methods', for further details.

Given that Conditions SZ and MO ensure polynomial complexity, the question of when they hold has to be considered. This is well understood in the case of the two-dimensional Ising model (1). Condition SZ holds for certain values of the parameters β and h: it does not hold where phase transition occurs, i.e. for h = 0 and β > β_c = 0.44068…, nor for β large and |h| small. Condition MO has been verified everywhere except where phase transition occurs. We conjecture generally that approximate sampling by means of Markov chain Monte Carlo methods from a Markov random field that does not undergo phase transition as n → ∞ has polynomial computational complexity. In particular, under the assumptions made on the Markov chain Monte Carlo method, the complexity is conjectured to be of order n log n.

A further argument supports our conjecture. If, for Λ large enough, the energy H_Λ has a unique absolute minimum H_Λ(x*) which does not depend on the boundary condition, and if H_Λ(x*) is separated from all the other values of H_Λ, then Condition MO holds for sufficiently large β; see Martinelli, Olivieri & Schonmann (1994). In this case the field cannot have phase transition either.
Note that, in order to hope to be able to verify any spatial mixing condition as n → ∞, we must assume that the range r of the target field does not grow with n. In Bayesian image analysis, if the degradation process is local and the prior distribution is a Markov random field, then the posterior distribution also has finite interaction range. The fact that in this case the data act as a spatially nonhomogeneous external field is often thought of as preventing the posterior field from undergoing phase transition, with the result that, given our conjecture, approximate polynomial sampling should hold. A more rigorous argument is presented in Aizenman & Wehr (1990). Among the energies studied by these authors for {−1, +1} variables is

    H(x) = −Σ_{i,j∈Λ: |i−j|=1} J x_i x_j − Σ_{i∈Λ} (h + εy_i) x_i,

where the data y_i are assumed to be realisations of independent and identically distributed random variables. It is shown that, for almost every realisation of these data and for all h and ε > 0, the corresponding field does not undergo phase transition as n → ∞.

3.2. Related results
A classical way to bound (6) is by uniform ergodicity: assume there exist t₀ > 0, ρ > 0 and a probability measure ν on Ω_Λ such that P_Λ^{τ,t₀}(x, y) ≥ ρν(y) for all x, y ∈ Ω_Λ. Jerrum & Sinclair (1993) study the partition function Z_Λ^τ of the Ising model. They consider a certain combinatorial problem, whose solution Z′ can be approximated in polynomial time and is linearly related to Z_Λ^τ. They conclude that there exists an algorithm for approximate sampling from the two-dimensional Ising model with external field h = h(n) > n^{−1} that runs in time O(n^{11} log n). Note that the assumption on h rules out the presence of phase transition.

4. SAMPLING IN EXPONENTIAL TIME
4.1. Reduced exponential complexity

Whenever Conditions SZ and MO do not hold, approximate convergence of Markov chain Monte Carlo may require a t* that increases faster than polynomially in n. We show that, if polynomial convergence does not hold, nevertheless, under certain assumptions on the interaction range, t* increases more slowly than full enumeration of Ω_Λ. Since we use the bound

    max_{x₀∈Ω_Λ} d{pr_Λ^τ(X(t) = · | X(0) = x₀), μ_Λ^τ(·)} ≤ λ₂^t / [2 min_{x∈Ω_Λ} μ_Λ^τ(x)^{1/2}],   (11)

given by Diaconis & Stroock (1991), we bound the second largest eigenvalue λ₂.

THEOREM 2 (Martinelli, 1994). Let Λ be a cube in Z^d with sides of length L. Assume that the range r of μ_Λ^τ is constant in n. Then

    1 − λ₂ ≥ C₁ exp(−βC₂L^{d−1}),

where C₁ is a constant independent of L and β, and C₂ is a constant independent of L. In the report by Frigessi, Martinelli & Stander mentioned in § 3.1, a version of the proof adapted to our setting, namely a discrete time Markov chain, a slightly more general assumption on the field, and single site updates, is presented. Combining Theorem 2 with (11) and (4) we obtain the following.

COROLLARY 1. Approximate sampling from a constant range Markov random field on a hypercube Λ in Z^d, with sides of length L, needs a time

    t* ≤ O[exp{βCn^{(d−1)/d}}]

for some constant C, where n = L^d is the number of sites, as n → ∞.
For the two-dimensional Ising model we obtain t* ≤ O{exp(βCn^{1/2})}. The same upper bound holds for the posterior model (2). The assumption that the range is constant in n is necessary. More precisely, one can see that λ₂ → 1 exponentially in n if r = O(n). A typical situation in which Corollary 1 does not hold is when the likelihood induces complete interaction among all sites in Λ, although the prior Markov field has only local interactions.

The bound is very poor if spatial mixing conditions for μ_Λ^τ hold. If they do not hold, it is important to investigate the tightness of the bound. In order to establish whether the complexity is exponential in n^{(d−1)/d} in some cases, we first need a lower bound on t*. To obtain this we need a lower bound on (6).

LEMMA 1. We have

    max_{x₀∈Ω_Λ} d{pr_Λ^τ(X(t) = · | X(0) = x₀), μ_Λ^τ(·)} ≥ ½|λ₂|^t.   (12)
We also require a lower bound on λ₂.

THEOREM 3 (Thomas, 1989). Let Λ be a box in Z^d with sides of length L. Let μ_Λ be given by (1) with free boundary conditions and h = 0. Then, for β sufficiently large,

    λ₂ ≥ 1 − C₃ exp(−βC₄L^{d−1}),

where C₃ is a constant independent of L and β, and C₄ is a constant independent of L.

From this bound on λ₂ and (12) we obtain the following corollary for the two-dimensional Ising model undergoing phase transition.

COROLLARY 2. Under the same assumptions as Theorem 3, for the two-dimensional Ising model approximate sampling has complexity

    t* ≥ exp(βCn^{1/2}), for some constant C > 0,
as n → ∞. So for the Ising model with h = 0 and β sufficiently large, t* is exponential in n^{1/2}.

4.2. Full exponential complexity

We discuss an example of target distributions for which approximate sampling needs the same time as complete enumeration of the state space Ω_Λ. For d = 1, we associate binary images with n pixels with the vertices of an n-dimensional cube. A single-site updating algorithm then corresponds to a random walk on the cube. Consider the target distribution μ derived from the uniform distribution by increasing the probability of each of the images (−1, …, −1) and (1, …, 1) by ½η, where 0 < η ≤ 1 − (2^{n−1})^{−1}:

    μ(x) = 2^{−n}(1 − η) + ½η   if x = (−1, …, −1) or x = (1, …, 1),
    μ(x) = 2^{−n}(1 − η)        otherwise.

For the Gibbs sampler the probability of entering a particular state from any given starting point is the same for all states with the same number of 1's. Thus conclusions about convergence can be drawn by considering the embedded distance chain on {0, 1, …, n}
Complexity of Markov chain Monte Carlo which records the number of l's and an appropriately transformed target distribution fi' and transition matrix P'. It can be shown that the eigenvalues of P' are the same as the eigenvalues of the original Gibbs sampler and hence all are positive. We now compute bounds for the second largest eigenvalue. From Proposition 1' of Diaconis & Stroock (1991) we obtain
Applying Cheeger's inequality as given in Diaconis & Stroock (1991), with their S set to {0}, we obtain 1
_ r
i n
\l + 2 -
Asymptotically, as n-> oo, we have 1and so A2->1 exponentially in n. From bounds (11) and (12) we conclude that t* is exponential in n. The range of the appropriately defined potentials of the target distribution /z grows with n. Hence \i cannot satisfy any spatial mixing condition leading to polynomiality, neither does it satisfy the assumption of Theorem 2. A different example is given by Matthews (1993). The target \i is a posterior density given data Y. A uniform prior for X is assumed over the n-dimensional hypercube C = (0, iy and the likelihood of Y given x is a mixture of an n-variate normal distribution with mean vector x and the uniform distribution on C. The posterior is then proportional to the likelihood. Because of the second element in the mixture, the range of the posterior potentials of n is n. By Theorem 2, stated for continuous state space, approximate convergence cannot take a time smaller than or equal to exponential in n*, as n-* oo. Matthews shows that the complexity of sampling by means of the Gibbs sampler with raster scans is exponential in n if a certain variance is small enough. The full range of the posterior in this case is inherited from the likelihood.
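The growth of λ₂ in the hypercube example can be checked numerically for small n by building the single-site heat-bath chain for μ directly, with μ taken as the uniform distribution rescaled by 1 − η plus mass ½η on each of the two constant images. The construction and all names below are ours; this is a sketch, not the paper's computation.

```python
import itertools
import numpy as np

def gibbs_matrix(n, eta):
    """Random-scan heat-bath chain on {-1,1}^n for the target that spreads
    mass 1 - eta uniformly and adds eta/2 to each of the two constant images."""
    states = list(itertools.product((-1, 1), repeat=n))
    index = {s: k for k, s in enumerate(states)}
    mu = np.full(len(states), (1.0 - eta) / 2.0 ** n)
    mu[index[tuple([-1] * n)]] += eta / 2.0
    mu[index[tuple([1] * n)]] += eta / 2.0
    P = np.zeros((len(states), len(states)))
    for s in states:
        for i in range(n):
            f = list(s)
            f[i] = -f[i]
            f = tuple(f)
            z = mu[index[s]] + mu[index[f]]
            P[index[s], index[f]] += mu[index[f]] / (z * n)  # flip site i
            P[index[s], index[s]] += mu[index[s]] / (z * n)  # hold
    return P, mu

def second_eigenvalue(P):
    vals = np.sort(np.linalg.eigvals(P).real)[::-1]
    return float(vals[1])

P4, mu4 = gibbs_matrix(4, 0.5)
P7, _ = gibbs_matrix(7, 0.5)
l4, l7 = second_eigenvalue(P4), second_eigenvalue(P7)
```

One should observe λ₂ increasing towards 1 with n: the bottleneck between the two modes at the constant images deepens as the relative weight of the connecting states shrinks like 2^{−n}.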
5. COMPLEXITY OF APPROXIMATING EXPECTATIONS
5.1. General

We now study the quality of the approximation f_t, given by (5), of the expectation E_{μ_Λ^τ}{f(X)}. For simplicity we drop all Λ, τ sub- and superscripts in this section. Let

    ‖f‖_∞ = max_{y∈Ω} |f(y)|,

and denote by var_μ{f(X)} the variance of f with respect to μ. We measure the error after t steps by the mean squared error (7) conditioned on the starting configuration x₀ ∈ Ω. Without loss of generality we may assume that E_μ{f(X)} = 0, and so the conditional error at time t is E_{x₀}(f_t²). The aim is to bound this error from above. Such a bound will of course strongly depend on the assumptions made about the distribution of X₀. There are two ways to proceed.
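On a chain small enough to enumerate, the conditional error E_{x₀}(f_t²) can be computed exactly from the transition matrix, which makes the two ways of treating x₀ concrete. The sketch below, including the two-state chain and all names, is ours, not the paper's.

```python
import numpy as np

def conditional_mse(P, f, x0, t):
    """Exact E_{x0}(f_t^2) for f_t = t^{-1} sum_{s=1}^t f(X(s)), assuming
    E_mu{f(X)} = 0, using
    E_{x0}[f(X(s)) f(X(u))] = sum_y P^s(x0, y) f(y) (P^{u-s} f)(y) for u >= s."""
    powers = [f]
    for _ in range(t):
        powers.append(P @ powers[-1])      # powers[k] = P^k f
    dist = np.zeros(len(f))
    dist[x0] = 1.0
    total = 0.0
    for s in range(1, t + 1):
        dist = dist @ P                    # law of X(s) given X(0) = x0
        base = dist * f                    # entries P^s(x0, y) f(y)
        total += base @ powers[0]          # diagonal term u = s
        for u in range(s + 1, t + 1):
            total += 2.0 * (base @ powers[u - s])
    return total / t ** 2

# mu-reversible two-state chain, mu = (2/3, 1/3); f is centred: E_mu{f(X)} = 0.
P = np.array([[0.9, 0.1], [0.2, 0.8]])
mu = np.array([2.0 / 3.0, 1.0 / 3.0])
f = np.array([1.0, -2.0])
errs = [conditional_mse(P, f, x0, 50) for x0 in (0, 1)]
worst = max(errs)                          # worst case over starting states
average = float(np.dot(mu, errs))          # starting state drawn from mu
```

Averaging the conditional errors under μ corresponds to starting the chain in equilibrium; taking the maximum over x₀ is the pessimistic alternative, and by construction it can only be larger.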
The first consists of the study of the expected asymptotic error

    E_μ{E_{x₀}(f_t²)} = Σ_{x₀∈Ω} μ(x₀) E_{x₀}(f_t²).   (13)

This corresponds to the assumption that the chain is started in equilibrium, and will lead to optimistic rates of convergence, based on the analysis of the correlation structure of the identically distributed sequence. One could argue that (13) is indeed the right quantity to study if the sample average is collected after a long enough burn-in, which allows us to consider x₀ as a genuine realisation from μ. We define the average complexity as

    t*_a = min[t > 0 : E_μ{E_{x₀}(f_s²)} ≤ ε var_μ{f(X)}, for all s ≥ t].

A second way to take the starting configuration into account is to analyse the worst case error

    max_{x₀∈Ω} E_{x₀}(f_t²).   (14)

This more relevant quantity leads to realistic rates of convergence. We shall bound (14) both when Condition SZ holds and when it does not, and estimate the corresponding complexity

    t*_w = min[t > 0 : max_{x₀∈Ω} E_{x₀}(f_s²) ≤ ε var_μ{f(X)}, for all s ≥ t].
5.2. Average complexity

THEOREM 4. If the assumptions of Theorem 1 hold and if var_μ{f(X)} ≤ O(n) as n → ∞, then t*_a ≤ O(n). If only Theorem 2 holds and var_μ{f(X)} ≤ O(n) as n → ∞, then t*_a ≤ O[exp{βCn^{(d−1)/d}}], where Λ is a cube in Z^d.

Under Condition SZ, Theorem 4 is very optimistic, since O(n) is even smaller than O(n log n), the expected time needed to visit every site. Indeed, since X₀ is assumed already in equilibrium, it is not necessary to update every site in order to get a good approximation of E_μ{f(X)} by means of f_t. The assumption on the growth of the equilibrium variance var_μ{f(X)} is very often satisfied. For instance, in image restoration it is commonly needed to approximate E(X_i) for each pixel i. If a finite number of colours is allowed, then f(x) = x_i is bounded, and so var_{μ_Λ^τ}{f(X)} is bounded by a constant in n. Theorem 4 is also valid under Condition MO instead of Condition SZ.

5.3. Worst case complexity

We now turn to the estimate of (14). First we note that (14) is of the same order as (13) as t → ∞; more precisely,

    lim_{t→∞} [max_{x₀∈Ω} E_{x₀}(f_t²)] / [E_μ{E_{x₀}(f_t²)}] = 1.   (15)
However, the time required to ensure that approximate convergence in (15) has been reached depends on n and can, in principle, be exponential in n. Accordingly, one cannot use this limit and the average case analysis in order to establish worst case complexity.

THEOREM 5. If the assumptions of Theorem 1 hold and if

    var_μ{f(X)} ≤ O(n log n),   ‖f‖_∞ ≤ O(n)

as n → ∞, then t*_w ≤ O{n(log n)²}.

In order for the polynomial result not to hold, the norm and variance of f would have to grow at least exponentially in n. Just as for sampling, if the target field is spatially mixing, the complexity is polynomial in n. The average case analysis was indeed optimistic. We can be more precise in the comparison of the two approaches. From (A15) of the Appendix it follows that, if
the assumptions of Theorem 5 hold, then

    max_{x₀∈Ω} E_{x₀}(f_t²) ≤ ε var_μ{f(X)}   (16)

for all t ≥ K_f n(log n)², where the constant K_f is independent of n.

APPENDIX

Proofs

Proof of Theorem 1. We show that (8) holds. Because of reversibility there exist right eigenfunctions 1 = φ₁, φ₂, …, φ_{|Ω|} of P such that

    ⟨φ_j, φ_k⟩_μ = δ_{jk},
where δ_{jk} = 1 if j = k, and 0 otherwise, and by definition

    Σ_y pr{X(q) = y | X(0) = x} φ_k(y) = λ_k^q φ_k(x).   (A3)
By spectral decomposition, any function f : Ω → R can be expressed as

    f = Σ_{k=1}^{|Ω|} c_k φ_k,   (A4)

where c_k = ⟨f, φ_k⟩_μ. Note that, if E_μ{f(X)} = 0, then c₁ = 0 and

    Σ_{k=2}^{|Ω|} c_k² = var_μ{f(X)}.   (A5)
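For any small reversible chain the relations (A3)-(A5) can be verified numerically by symmetrising P in L²(μ). The three-state chain and all names in the sketch below are ours.

```python
import numpy as np

# A small mu-reversible chain: from state i, propose one of the other two
# states uniformly and move with a heat-bath probability.
mu = np.array([0.5, 0.3, 0.2])
P = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        if i != j:
            P[i, j] = 0.5 * mu[j] / (mu[i] + mu[j])
    P[i, i] = 1.0 - P[i].sum()

# A = D P D^{-1} with D = diag(sqrt(mu)) is symmetric precisely because P is
# mu-reversible, so its eigenvectors are orthonormal; mapping them back by
# D^{-1} gives right eigenfunctions of P that are orthonormal in L^2(mu).
D = np.diag(np.sqrt(mu))
Dinv = np.diag(1.0 / np.sqrt(mu))
lam, U = np.linalg.eigh(D @ P @ Dinv)
order = np.argsort(lam)[::-1]
lam, U = lam[order], U[:, order]
Phi = Dinv @ U          # right eigenfunctions phi_k of P; phi_1 is constant

f = np.array([1.0, -2.0, 1.7])
f = f - mu @ f          # centre so that E_mu{f(X)} = 0
c = (mu * f) @ Phi      # coefficients c_k = <f, phi_k>_mu as in (A4)
```

One can then check orthonormality ⟨φ_j, φ_k⟩_μ = δ_{jk}, the propagation rule (A3) with P raised to a power, c₁ = 0, and the variance identity (A5).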
Now set t₁ = ⌊c₀n log n⌋, with c₀ > 0 chosen sufficiently large that the factor 1/μ(x₀)^{1/2} is absorbed by the exponential decay over the first t₁ steps. Then there exists ξ > 0 independent of n such that, for all f with E_μ{f(X)} = 0,

    max_{x₀∈Ω} |E_{x₀}[f{X(t)}]| ≤ 2[var_μ{f(X)}]^{1/2} exp{−ξ(t − ⌊c₀n log n⌋)/n}   (A6)
for t > ⌊c₀n log n⌋. To obtain the bound (8), use the fact that d(μ, ν) = max_B |μ(B) − ν(B)| and set f(x) = 1_B(x) − μ(B) in (A6), where B ⊆ Ω is arbitrary; this proves case (i). In case (ii), we have instead

    −λ_k ≤ exp(−ξ′/n)   (k = 2, …, |Ω|),

because of (A2) and the assumption (4). Therefore, the constant Δ that in case (i) was independent of n now becomes Δ = ξ′/(2n). Hence (A1) follows as before with Δ replaced by Δ/n. Accordingly (8) holds for all t ≥ n² log(Cn)/ξ′. □

Proof of Lemma 1. The following equality is known to hold in general:

    d{pr(X(t) = · | X(0) = x₀), μ(·)} = ½ max_{f: ‖f‖_∞ ≤ 1} |E_{x₀}[f{X(t)}] − E_μ{f(X)}|.

Let f_λ be a right eigenfunction of P with eigenvalue λ, normalised so that ‖f_λ‖_∞ = 1. If λ ≠ 1, it is easy to show that E_μ{f_λ(X)} = 0, since μ is the invariant distribution of P. Moreover, E_{x₀}[f_λ{X(t)}] = λ^t f_λ(x₀). Hence

    d{pr(X(t) = · | X(0) = x₀), μ(·)} ≥ ½|λ|^t |f_λ(x₀)|.

By maximising over x₀ ∈ Ω, we obtain (12), since max_{x₀∈Ω} |f_λ(x₀)| = ‖f_λ‖_∞ = 1. □

Proof of Theorem 4. The bound

    E_μ{E_{x₀}(f_t²)} ≤ 2 var_μ{f(X)} / {t(1 − λ₂)},
for every t ≥ 1, can be found for example in Frigessi, Hwang & Younes (1992). If Condition SZ holds, then using (A2) we obtain a bound of order var_μ{f(X)} n/t as n → ∞, which is at most ε var_μ{f(X)} once t is of order n. If Theorem 2 holds, the bound is of order var_μ{f(X)} exp(βC₂n^{(d−1)/d})/t, using the notation of § 3. The t*_a have the orders stated provided var_μ{f(X)} does not grow too fast in n. □
The proof of Theorem 5 is based on the following lemma.

LEMMA 3. Assume that the conditions of Theorem 1 hold. Let c₀ be a positive constant and t₁ ≤ t₂ be given.
(a) If t₁ ≥ ⌊c₀n log n⌋ and t₂ − t₁ ≥ ⌊c₀n log n⌋, then

    max_{x₀∈Ω} |E_{x₀}[f{X(t₁)} f{X(t₂)}]| ≤ var_μ{f(X)} exp{−ξ(t₂ − t₁ − ⌊c₀n log n⌋)/n}.