arXiv:1406.5392v1 [stat.ME] 20 Jun 2014
Rate optimality of Random walk Metropolis algorithm in high-dimension with heavy-tailed target distribution

Kengo KAMATANI∗

∗Supported in part by Grant-in-Aid for Young Scientists (B) 24740062.
Abstract
High-dimensional asymptotics of the random walk Metropolis-Hastings (RWM) algorithm is well understood for a class of light-tailed target distributions. We develop a corresponding study for heavy-tailed target distributions, such as the Student t-distribution or the stable distribution. The performance of the RWM algorithm depends heavily on the tail behaviour of the target distribution. The expected squared jumping distance (ESJD) is a common measure of efficiency in the light-tailed case, but it does not work in the heavy-tailed case since the ESJD is unbounded. For this reason, we use the rate of weak consistency as a measure of efficiency. When the number of dimensions is $d$, we show that the rate for the RWM algorithm is $d^2$ in the heavy-tailed case, whereas it is $d$ in the light-tailed case. We also show that the Gaussian RWM algorithm attains the optimal rate among all RWM algorithms; thus no heavy-tailed proposal distribution can improve the rate.
Keywords: Markov chain; Diffusion limit; Consistency; Monte Carlo; Stein’s method
1 Introduction
The Markov chain Monte Carlo (MCMC) method is a technique for evaluating complicated integrals using a random sequence generated by a Markov chain. In the Bayesian community, it is used to calculate integrals with respect to the posterior distribution. Although many new techniques have been introduced for this calculation, MCMC is still one of the most popular tools for the purpose. The random walk Metropolis (RWM) algorithm is the most widely used subclass of MCMC methods, due to its simplicity. It is a simple modification of a symmetric random-walk Markov chain. It is applicable to both discrete and continuous state spaces; we focus on the latter and take $\mathbb{R}^d$ as the state space in this paper. Another explanation for the appeal of the RWM algorithm is its generality. Any kind of symmetric random walk can be used for this algorithm: we can use a Gaussian or a Student t random walk, or a mixture of them. Even if we decide to use the Gaussian random walk, we still have a free choice of its covariance structure. Note that although the underlying random walk is not ergodic, the resulting Markov chain is usually ergodic (see Section 3.1 of Tierney [1994], for example). This generality means, in turn, that the practitioner has to choose one particular symmetric random-walk Markov chain among all possibilities. Although any choice may produce an ergodic Markov chain, the performance of the RWM algorithm depends on the random walk. There are many practical strategies for monitoring the performance of MCMC, and it is advisable to use them to choose an efficient random walk for the RWM algorithm. There is also a large body of theoretical work on the performance of MCMC; the details are described below.

Apart from this choice, the performance of MCMC heavily depends on the target distribution, which is the underlying probability measure of the integral we want to approximate. For example, in Roberts and Rosenthal [1998], the Metropolis-adjusted Langevin algorithm (MaLa) was shown to have a better rate of convergence than the Gaussian RWM algorithm for high-dimensional unimodal target distributions. On the other hand, if the target distribution is super-exponential, the Gaussian RWM algorithm can be geometrically ergodic while MaLa cannot. Thus there is no uniformly optimal choice of random walk for the RWM algorithm, and we should pay attention to the structure of the target distribution.

A simple classification of target distributions uses their tail behaviour. For light-tailed distributions, ergodic properties of the RWM algorithm were studied in Mengersen and Tweedie [1996], Roberts and Tweedie [1996] and Jarner and Hansen [2000] using Foster-Lyapunov type drift conditions. It was proved in Mengersen and Tweedie [1996] and Jarner and Hansen [2000] that an exponential or lighter tail is necessary and sufficient for geometric ergodicity of the RWM algorithm. These asymptotic results correspond to the analysis as $M\to\infty$, where $M$ is the number of iterations of the MCMC. On the other hand, Roberts et al. [1997] considered asymptotic properties of the RWM algorithm as the number of dimensions $d\to\infty$. We will refer to this as the high-dimensional asymptotic setting. This seemingly curious setting has brought useful insights for practitioners. Under this setting, with the Gaussian random walk, the asymptotically optimal choice of the covariance structure was obtained. In practice, this optimal choice can be found by monitoring a simple statistic, as described in Gelman et al. [1996]. Another interesting result is that the convergence rate is of order $d$; this can be interpreted as saying that the number of iterations $M$ should be proportional to $d$. This rate may vary among MCMC algorithms. For example, MaLa has rate of order $d^{1/3}$, which is better than the order $d$ of the RWM algorithm [Roberts and Rosenthal, 1998].

The assumptions in Roberts et al. [1997] are rather restrictive, and there has been considerable effort to generalize this result. For heavy-tailed target distributions, ergodic properties were studied in Jarner and Tweedie [2003] and Jarner and Roberts [2007]. Their results show that under a heavy-tailed target distribution the RWM algorithm cannot be geometrically ergodic, but can be polynomially ergodic. It was also shown that a random walk with a heavy-tailed increment distribution improves the rate of polynomial convergence. A natural question is whether the same holds in the high-dimensional setting: the convergence rate may or may not be worse than $d$, and it may or may not be improved by a heavy-tailed increment distribution. Answering these questions is quite important from a practical point of view.

Compared to ergodic properties, there are only a few works on high-dimensional asymptotics for heavy-tailed target distributions. One exception is Sherlock and Roberts [2009], who developed a theoretical analysis of the expected squared jumping distance, described later (see Section 2.2). Another related result is Neal and Roberts [2011], which concerns not a heavy-tailed target distribution but a heavy-tailed increment distribution for the random walk of the RWM algorithm. However, the natural question above has not yet been solved. The main difficulty is that, unlike the light-tailed case, there are (at least) two kinds of convergence rates for a heavy-tailed target distribution. This makes the expected squared jumping distance useless, and the optimality of RWM algorithms becomes difficult to define. In this paper, we concentrate on a class of heavy-tailed target distributions in the high-dimensional setting.
We follow the local consistency framework studied in Kamatani [2014] to overcome the difficulty mentioned above. The key result is that if we use the Gaussian random walk in the RWM algorithm, then the convergence rate is $d^2$, which is much worse than the rate $d$ of the light-tailed case. Surprisingly, this rate is optimal in a certain sense, and hence no heavy-tailed increment distribution can improve the rate.

The paper is organized as follows. The assumptions and some background are given in Section 2. Since the usual expected squared jumping distance does not make sense for heavy-tailed target distributions, we define the rate of convergence as in Kamatani [2014]. In Section 3, this notion is first examined for the light-tailed, Gaussian target distribution, and then applied to rotationally symmetric heavy-tailed distributions. Proofs are relegated to Section 4. In Section A, we give a short introduction to Stein's technique, which is the main tool for our proofs.

We use the following notation throughout the paper. The state space is $\mathbb{R}^d$ throughout, the Euclidean norm is denoted by $\|\cdot\|$ and the inner product by $\langle\cdot,\cdot\rangle$. Write $N_d(\mu,\Sigma)$ for the $d$-dimensional normal distribution with mean vector $\mu\in\mathbb{R}^d$ and variance-covariance matrix $\Sigma$, and let $\phi_d(x;\mu,\Sigma)$ be its probability distribution function. The $d\times d$ identity matrix is denoted by $I_d$. We also use the notation $\mathcal{L}(X)$ to denote the law of a random variable $X$. For $x\in\mathbb{R}$, write $x^+ = \max\{0,x\}$, $x^- = \max\{0,-x\}$, and write $[x]$ for the integer part of $x\ge 0$. Write the sup norm as $\|h\|_\infty = \sup_{z\in E}|h(z)|$ for $h:E\to\mathbb{R}$ on a state space $E$. If $h$ is absolutely continuous, we write $h'$ for its derivative.
2 Asymptotic properties of high-dimensional MCMC
2.1 Assumption for the target distribution
We consider a sequence of target distributions $(P_d)_{d\in\mathbb{N}}$ indexed by the number of dimensions $d$. For a given $d$, $P_d$ is a $d$-dimensional probability distribution that is a scale mixture of the normal distribution. Our asymptotic setting is that the number of dimensions $d$ tends to infinity while the mixing distribution $Q$ is unchanged. Let $Q(dy)$ be a probability measure on $[0,\infty)$ and let $P_d$ be the scale mixture of the normal distribution defined by
$$P_d = \mathcal{L}(X_0^d), \quad Q_d = \mathcal{L}(\|X_0^d\|^2/d), \tag{2.1}$$
where $X_0^d\mid Y \sim N_d(0, YI_d)$ and $Y\sim Q$. In particular, $P_d$ is rotationally symmetric, that is, it is invariant under all orthogonal transformations. We assume that the mixing distribution $Q$ is $\delta_1$ (the Dirac measure charging $1\in\mathbb{R}$) or that it satisfies the following assumption, where we write $f^{(n)}$ for the $n$-th derivative of a function $f$.

Assumption 1. The probability distribution $Q$ has a strictly positive probability distribution function $q(y)$. The probability distribution function $q(y)$ is twice continuously differentiable and $q^{(i)}(y)$ vanishes at $+0$ and $+\infty$ for $i = 0, 1, 2$. Moreover, $\lim_{y\to+\infty} yq(y) = 0$.

Let $g(x;\alpha,\nu)\propto x^{\nu-1}\exp(-\alpha x)$ be the probability distribution function of the Gamma distribution $G(\alpha,\nu)$. The following lemma says that the value of $p_d(x)$ depends only on $\|x\|$ and is monotone decreasing in $\|x\|$. This property is important for the random-walk Metropolis algorithm since it makes the acceptance probability function very simple.
Lemma 2.1. Under Assumption 1, both $P_d$ and $Q_d$ have probability distribution functions $p_d$ and $q_d$ that satisfy
$$p_d(x) \propto \|x\|^{2-d}\, q_d\!\left(\frac{\|x\|^2}{d}\right).$$
Moreover, $p_d(x_1) < p_d(x_2)$ if and only if $\|x_1\| > \|x_2\|$.

Proof. Since $P_d$ and $Q_d$ are a scale mixture of the normal distribution and of the chi-squared distribution, respectively, we have
$$p_d(x) = \int_0^\infty \phi_d(x; 0, zI_d)\,Q(dz), \qquad q_d(y) = \int_0^\infty g\!\left(y; \frac{d}{2z}, \frac{d}{2}\right)Q(dz),$$
and hence the first claim follows from the relation between the integrands, that is, $g\!\left(\frac{\|x\|^2}{d}; \frac{d}{2z}, \frac{d}{2}\right) \propto \|x\|^{d-2}\,\phi_d(x; 0, zI_d)$. The second claim comes from the fact that $\phi_d(x_1; 0, zI_d) < \phi_d(x_2; 0, zI_d)$ if and only if $\|x_1\| > \|x_2\|$.
The following lemma says that $q_d$ is very close to $q$ when $d$ is sufficiently large. A proof is given in Section C.

Lemma 2.2 (Convergence of probability density functions). Under Assumption 1, $\lim_{d\to\infty}\|q_d^{(i)} - q^{(i)}\|_\infty = 0$ for $i = 0, 1, 2$.

Example 1 (Student t-distribution). The probability distribution function of the t-distribution with $\nu>0$ degrees of freedom is
$$p_d(x) = \frac{\Gamma((\nu+d)/2)}{\Gamma(\nu/2)\,\nu^{d/2}\pi^{d/2}\,(1+\|x\|^2/\nu)^{(\nu+d)/2}}.$$
In this case, $Q$ is the inverse chi-squared distribution with $\nu$ degrees of freedom, with probability distribution function $q(y)\propto y^{-\nu/2-1}e^{-\nu/(2y)}$. It is straightforward to check that $q^{(i)}(y)$ vanishes at $+0$ and $+\infty$ for $i=0,1,2$ and that $\lim_{y\to+\infty}yq(y)=0$. For properties of the (multivariate) t-distribution, see Kotz and Nadarajah [2004].
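As a quick numerical sanity check of this mixture representation, the following sketch (added for illustration; the dimension, degrees of freedom and test points are arbitrary choices) compares the closed-form t-density with the integral $\int_0^\infty \phi_d(x;0,zI_d)\,q(z)\,dz$ computed by quadrature.

    import numpy as np
    from scipy.integrate import quad
    from scipy.special import gammaln

    def log_pd_t(r2, d, nu):
        # log of the closed-form d-dimensional t density, as a function of ||x||^2 = r2
        return (gammaln((nu + d) / 2) - gammaln(nu / 2)
                - 0.5 * d * np.log(nu * np.pi)
                - 0.5 * (nu + d) * np.log1p(r2 / nu))

    def q_inv_chi2(y, nu):
        # density of Y = nu / chi^2_nu, i.e. q(y) proportional to y^{-nu/2-1} e^{-nu/(2y)}
        return ((nu / 2) ** (nu / 2) / np.exp(gammaln(nu / 2))
                * y ** (-nu / 2 - 1) * np.exp(-nu / (2 * y)))

    def pd_mixture(r2, d, nu):
        # numerically integrate the normal scale mixture over the mixing density q
        integrand = lambda z: (2 * np.pi * z) ** (-d / 2) * np.exp(-r2 / (2 * z)) * q_inv_chi2(z, nu)
        val, _ = quad(integrand, 0, np.inf)
        return val

    d, nu = 5, 3.0
    for r2 in (0.5, 2.0, 10.0):
        print(r2, np.exp(log_pd_t(r2, d, nu)), pd_mixture(r2, d, nu))

The two values printed for each test point should agree to quadrature accuracy.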
Example 2 (Stable distribution). If $P_d$ is the rotationally symmetric $\alpha$-stable distribution with characteristic function $\int \exp(-i\langle t, x\rangle)\,P_d(dx) = \exp\bigl(-(\|t\|^2/2)^{\alpha/2}\bigr)$, then $Q$ is the $\alpha/2$-stable distribution on the half line with Laplace transform $\int \exp(-ty)\,Q(dy) = \exp(-|t|^{\alpha/2})$. Although there is no closed form for the probability density function $q(x)$, all derivatives of $q(x)$ are continuous and vanish at $0$ and $\infty$, and $q(x)\sim x^{-\alpha/2-1}$ as $x\to+\infty$. See Section 14 of Sato [1999].

Next we review some properties of the uniform distribution on the surface of the sphere $\{x\in\mathbb{R}^d;\ \|x\|^2 = d\}$. Let $U^d = (U_i^d)_{i=1,\dots,d}$ be uniformly distributed on this surface and let $V^d = (V_i^d)_{i=1,\dots,d}$ be an independent random variable on the surface. Then $d^{-1/2}\langle U^d, V^d\rangle = d^{-1/2}\sum_{i=1}^d U_i^d V_i^d$ has the same law as $U_1^d$. The properties of $U_1^d$ are well known: the law of $(U_1^d)^2/d$ is $\mathrm{Beta}(1/2,(d-1)/2)$, and $U_1^d$ has mean $0$ and variance $1$. Asymptotic normality results with sharp bounds can be found in Diaconis and Freedman [1987] for the total variation distance and in Chatterjee and Meckes [2008] for the Wasserstein distance.

Lemma 2.3. If $U^d = (U_i^d)_{i=1,\dots,d}$ is uniformly distributed on the surface of the sphere $\{x\in\mathbb{R}^d;\ \|x\|^2 = d\}$, then
$$\|\mathcal{L}(U_1^d) - N(0,1)\|_{\mathrm{TV}} \le \frac{8}{d-4}, \qquad \|\mathcal{L}(U_1^d) - N(0,1)\|_1 \le \frac{3}{d-1},$$
where $\|P-Q\|_{\mathrm{TV}} = 2\sup_A|P(A)-Q(A)|$ and $\|P-Q\|_1 = \sup_{f\in B_1}|P(f)-Q(f)|$, with $B_1$ the set of functions $f$ such that $\sup|f(x)-f(y)|/|x-y|\le 1$.
Since $P_d$ is rotationally symmetric in this paper, the above uniform bound is a key fact for our results. In addition to Lemma 2.3, we will use
$$E[|U_1^d|^\alpha] = \frac{d^{\alpha/2}\,B\bigl(\frac{\alpha+1}{2}, \frac{d-1}{2}\bigr)}{B\bigl(\frac12, \frac{d-1}{2}\bigr)} \;\to\; \frac{\Gamma\bigl(\frac{\alpha+1}{2}\bigr)}{\Gamma\bigl(\frac12\bigr)}\,2^{\alpha/2} \quad (d\to\infty) \tag{2.2}$$
for $\alpha > -1$, where we used Stirling's approximation.
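A Monte Carlo check of (2.2), added for illustration: sample $U^d$ uniformly on $\{\|x\|^2 = d\}$ by normalising a Gaussian vector and rescaling, and compare the empirical moment with the Beta-function expression and its limit; the values of $d$, $\alpha$ and the sample size below are arbitrary.

    import numpy as np
    from scipy.special import beta, gamma

    rng = np.random.default_rng(1)
    d, alpha, n = 50, 2.0, 200_000
    z = rng.standard_normal((n, d))
    u = np.sqrt(d) * z / np.linalg.norm(z, axis=1, keepdims=True)   # uniform on {||x||^2 = d}
    empirical = np.mean(np.abs(u[:, 0]) ** alpha)
    exact = d ** (alpha / 2) * beta((alpha + 1) / 2, (d - 1) / 2) / beta(0.5, (d - 1) / 2)
    limit = 2 ** (alpha / 2) * gamma((alpha + 1) / 2) / gamma(0.5)
    print(empirical, exact, limit)   # for alpha = 2 both exact value and limit equal 1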
2.2 Expected squared jumping distance
We say that a probability measure $S_d$ on $\mathbb{R}^d$ is a symmetric measure if $S_d(A) = S_d(-A)$, where $-A = \{-x;\ x\in A\}$, for any Borel set $A$, and we write $\mathcal{S}_d$ for the set of all symmetric probability measures on $\mathbb{R}^d$. Let $P_d$ be a probability measure on $\mathbb{R}^d$ with density $p_d(x)$. In this paper, a Markov chain $X^d = (X_m^d)_{m\in\mathbb{N}_0}$ is called the random-walk Metropolis (RWM) chain for the target $P_d$ if $X_0^d$ is an $\mathbb{R}^d$-valued random variable and, for $m\ge 1$,
$$Y_m^d = X_{m-1}^d + W_m^d, \quad W_m^d\sim S_d,$$
$$X_m^d = \begin{cases} Y_m^d & \text{with probability } \alpha_d(X_{m-1}^d, Y_m^d),\\ X_{m-1}^d & \text{with probability } 1-\alpha_d(X_{m-1}^d, Y_m^d),\end{cases}$$
where $\alpha_d(x,y) = \min\{1, p_d(y)/p_d(x)\}$. Write the law of $X^d$ as $\mathrm{RWM}(P_d, S_d)$ if $X_0^d\sim P_d$. It is well known that $X^d$ is a Markov chain reversible with respect to $P_d$; that is, if $X^d\sim\mathrm{RWM}(P_d,S_d)$, then the random vectors $(X_0^d, X_1^d,\dots,X_m^d)$ and $(X_m^d, X_{m-1}^d,\dots,X_0^d)$ have the same law. The expected squared jumping distance (ESJD) is formally defined in Sherlock and Roberts [2009] as a measure of efficiency of MCMC:
$$\mathrm{ESJD}(P_d, S_d) := E[\|X_1^d - X_0^d\|^2].$$
The main result of Roberts et al. [1997] is that the (finite-dimensional) law of $X^d$ converges weakly to a Langevin diffusion. The speed measure of the diffusion is proportional to $\frac{1}{d}\mathrm{ESJD}(P_d, S_d)$ in the limit, and hence the ESJD can be interpreted as a measure of efficiency (larger is better).
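To make the construction concrete, the following sketch (an illustration, not part of the paper's development) implements an RWM chain with Gaussian increments for the $d$-dimensional Student t target of Example 1 and estimates the ESJD by averaging the squared jumps; the constants $\nu=3$, $l=2.4$ and the proposal scaling $l/\sqrt{d}$ are illustrative assumptions.

    import numpy as np

    def rwm_chain(log_target, x0, n_iter, proposal_scale, rng):
        """Random-walk Metropolis chain with Gaussian increments N_d(0, proposal_scale^2 I_d)."""
        x = np.array(x0, dtype=float)
        lp = log_target(x)
        sq_jumps = np.empty(n_iter)          # ||X_m - X_{m-1}||^2 for each step
        for m in range(n_iter):
            w = proposal_scale * rng.standard_normal(x.shape[0])   # increment W_m ~ S_d
            y = x + w                                              # proposal Y_m = X_{m-1} + W_m
            lp_y = log_target(y)
            if np.log(rng.uniform()) < lp_y - lp:                  # accept w.p. min{1, p_d(Y_m)/p_d(X_{m-1})}
                x, lp, sq_jumps[m] = y, lp_y, np.dot(w, w)
            else:
                sq_jumps[m] = 0.0                                  # rejected moves contribute zero jump
        return x, sq_jumps

    rng = np.random.default_rng(0)
    d, nu, l = 20, 3.0, 2.4                                        # illustrative choices
    log_target = lambda x: -0.5 * (nu + d) * np.log1p(np.dot(x, x) / nu)   # t target of Example 1 (up to a constant)
    y0 = nu / rng.chisquare(df=nu)                                 # mixing variable Y ~ Q
    x0 = np.sqrt(y0) * rng.standard_normal(d)                      # stationary start X_0 ~ P_d
    _, sq_jumps = rwm_chain(log_target, x0, n_iter=20_000, proposal_scale=l / np.sqrt(d), rng=rng)
    print("estimated ESJD:", sq_jumps.mean())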
An interesting property of the ESJD is that an optimal proposal exists. Let $P_d$ satisfy (2.1) and recall that $Q_d$ is the law of $\|X_0^d\|^2/d$. Write $P_1$ for the cumulative distribution function of $X_0^1$. Suppose that $l^2 P_1(-l/2)$ is bounded, and let $l^*$ be its maximizer. Let $S_d^*\in\mathcal{S}_d$ be a distribution on $\{x\in\mathbb{R}^d;\ \|x\| = l^*\}$. The following is more or less well known.

Proposition 2.1. For each $d\in\mathbb{N}$, $\sup_{S_d\in\mathcal{S}_d}\mathrm{ESJD}(P_d, S_d) = \mathrm{ESJD}(P_d, S_d^*) = 2(l^*)^2 P_1(-l^*/2)$.

Proof. Suppose that $X_0^d\sim P_d$ and $W_1^d\sim S_d$. Let $Z^d = \langle X_0^d, W_1^d\rangle + \|W_1^d\|^2/2$.
By Lemma 2.1, $p_d(X_0^d+W_1^d) < p_d(X_0^d)$ if and only if $\|X_0^d+W_1^d\| > \|X_0^d\|$, that is, $Z^d > 0$. Since the conditional laws of $Z^d - \|W_1^d\|^2$ and $-Z^d$ given $W_1^d$ are the same,
$$E\Bigl[\frac{p_d(X_0^d+W_1^d)}{p_d(X_0^d)},\ Z^d>0\,\Big|\,W_1^d\Bigr] = \int_{Z^d>0} \frac{p_d(x+W_1^d)}{p_d(x)}\,p_d(x)\,dx = \int_{Z^d>0} p_d(x+W_1^d)\,dx = P(Z^d - \|W_1^d\|^2 > 0\mid W_1^d) = P(Z^d < 0\mid W_1^d).$$
By using the above equation,
$$E\Bigl[1\wedge \frac{p_d(X_0^d+W_1^d)}{p_d(X_0^d)}\,\Big|\,W_1^d\Bigr] = P(Z^d\le 0\mid W_1^d) + E\Bigl[\frac{p_d(X_0^d+W_1^d)}{p_d(X_0^d)},\ Z^d>0\,\Big|\,W_1^d\Bigr] = 2P(Z^d\le 0\mid W_1^d) = 2P_1(-\|W_1^d\|/2),$$
since $\langle X_0^d, W_1^d/\|W_1^d\|\rangle \sim P_1$. Thus, by definition,
$$\mathrm{ESJD}(P_d, S_d) = E\Bigl[\Bigl(1\wedge \frac{p_d(X_0^d+W_1^d)}{p_d(X_0^d)}\Bigr)\|W_1^d\|^2\Bigr] = 2E\bigl[\|W_1^d\|^2 P_1(-\|W_1^d\|/2)\bigr] \le 2(l^*)^2 P_1(-l^*/2).$$
The properties of the ESJD and of the expected acceptance ratio have been studied since the seminal work of Roberts et al. [1997]. However, in the heavy-tailed case, $l^2 P_1(-l/2)$ is not bounded in general, and hence it cannot be used as a measure of efficiency. Thus, for heavy-tailed target distributions, we need another measure of efficiency, and we use the framework of Kamatani [2014].
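The breakdown of the ESJD criterion can be seen numerically. In the sketch below (an illustration with an arbitrary grid of $l$ values), $P_1$ is the standard normal distribution function for the Gaussian target, so $l^2 P_1(-l/2) = l^2\Phi(-l/2)$ is bounded with a finite maximiser, while for a standard Cauchy marginal (the one-dimensional $t_1$ case) $l^2 P_1(-l/2)$ grows without bound.

    import numpy as np
    from scipy.stats import norm, cauchy

    ls = np.array([1.0, 2.0, 4.0, 8.0, 16.0, 64.0, 256.0])
    gauss = ls ** 2 * norm.cdf(-ls / 2)      # bounded, maximised near l ~ 2.4
    heavy = ls ** 2 * cauchy.cdf(-ls / 2)    # grows roughly linearly in l
    for l, g, h in zip(ls, gauss, heavy):
        print(f"l={l:7.1f}  Gaussian: {g:8.4f}  Cauchy: {h:10.2f}")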
2.3 Consistency for high-dimensional MCMC
In this section, we review the consistency of MCMC studied in Kamatani [2014]. Consider a sequence of Markov chains $\{\xi^d := (\xi_m^d;\ m\in\mathbb{N}_0)\}$ ($d\in\mathbb{N}$) with invariant probability measures $\{\Pi_d\}_d$. The sequence $\{\xi^d\}_d$ (or its law) is called consistent if
$$\frac{1}{M}\sum_{m=0}^{M-1} f(\xi_m^d) - \Pi_d(f) = o_P(1) \tag{2.3}$$
for any $M, d\to\infty$ and any bounded continuous function $f$. This says that the integral $\Pi_d(f)$ we want to calculate is approximated by the Monte Carlo estimate $\frac{1}{M}\sum_{m=0}^{M-1}f(\xi_m^d)$ after a reasonable number of iterations $M$. A regular Gibbs sampler should satisfy this type of property (more precisely, local consistency; see Kamatani [2014]) when $d$ is the sample size of the data. However, this is not always the case, as described at the end of Kamatani [2014]. In our setting, (2.3) is not satisfied in two respects: the state space is not the same for all $d\in\mathbb{N}$, and $M = M_d$ must grow at a certain rate.
Recall that the target distribution in the current study is rotationally symmetric, and hence there are two sufficient statistics:
$$\frac{X_0^d}{\|X_0^d\|} \quad\text{and}\quad \frac{\|X_0^d\|^2}{d}, \tag{2.4}$$
where $X_0^d\sim P_d$. For $d = 1, 2, \ldots$, let $X^d = (X_m^d;\ m\in\mathbb{N}_0)$ be an $\mathbb{R}^d$-valued stationary process with invariant distribution $P_d$ on $(\Omega,\mathcal{F},P)$. We consider asymptotic properties of $X^d$ as $d\to\infty$. As noted above, the state space of $X^d$ ($d\in\mathbb{N}$) changes with $d$, which is inconvenient for further analysis. To overcome this difficulty, we define a projection $\pi_E = \pi_{d,E}$ for a finite subset $E\subset\{1,\dots,d\}$ by
$$\pi_E(x) = (x_i)_{i\in E} \quad (x = (x_i)_{i=1,\dots,d}).$$
We write $\pi_k$ for $\pi_{\{1,\dots,k\}}$. We also define $r_d:\mathbb{R}^d\to\mathbb{R}$ to be either $\sqrt{d}(\|x\|^2/d - 1)$ or $\|x\|^2/d$, depending on the target distribution $P_d$. Roughly, the projections $\pi_E$ ($E\subset\{1,\dots,d\}$) and $r_d$ correspond to the statistics $X_m^d/\|X_m^d\|$ and $\|X_m^d\|$, respectively.

Definition 1 (Consistency). We say that a sequence of $\mathbb{R}^d$-valued Markov chains $\{X^d\}_{d\in\mathbb{N}}$ (and its law) is consistent if
$$\frac{1}{M_d}\sum_{m=0}^{M_d-1} f\circ\pi_{E_{dk}}(X_m^d) - P_d(f\circ\pi_{E_{dk}}) = o_P(1) \tag{2.5}$$
as $d\to\infty$, for any $k\in\mathbb{N}$, any $M_d\to\infty$, any bounded continuous function $f:\mathbb{R}^k\to\mathbb{R}$ and any $k$-element subset $E_{dk}$ of $\{1,\dots,d\}$. Also, for $r_d:\mathbb{R}^d\to\mathbb{R}$, we say that $X^d$ and its law are consistent with respect to $r_d$ if
$$\frac{1}{M_d}\sum_{m=0}^{M_d-1} f\circ r_d(X_m^d) - P_d(f\circ r_d) = o_P(1) \tag{2.6}$$
for any $M_d\to\infty$ and any bounded continuous function $f:\mathbb{R}\to\mathbb{R}$.

Note that consistency (2.5) does not imply consistency with respect to $r_d$ (2.6). These properties are, in fact, introduced mainly to motivate the weak consistency below, and the definition itself is impractical in our current setting: in the high-dimensional case ($d\to\infty$), consistency (with or without projection) is rarely satisfied. As in Kamatani [2013], we relax the condition on $M_d$ and introduce the rate of consistency, which is the key to our results.

Definition 2 (Weak consistency). We say that a sequence of $\mathbb{R}^d$-valued Markov chains $\{X^d\}_{d\in\mathbb{N}}$ (and its law) is weakly consistent with rate $T_d$ if (2.5) is satisfied for any $M_d\to\infty$ such that $M_d/T_d\to\infty$. If (2.6) is satisfied in place of (2.5), we say that $\{X^d\}$ and its law are weakly consistent with rate $T_d$ with respect to $r_d$.

In this paper, we compare different MCMC methods by the rate of weak consistency. This is just a formalization of the usual approach in this area; this type of comparison for high-dimensional MCMC dates back at least to Roberts and Rosenthal [1998]. An important remark is that the rates of weak consistency corresponding to the two sufficient statistics (2.4) may differ. In our heavy-tailed setting, the rates of weak consistency for $X_0^d\mapsto X_0^d/\|X_0^d\|$ and $X_0^d\mapsto\|X_0^d\|^2/d$ are $d$ and $d^2$, respectively (see Propositions 3.3 and 3.4).

There is a technical difficulty in the analysis of $(\pi_{E_{dk}}(X_m^d))_{m\in\mathbb{N}_0}$ since it is not a Markov chain. However, if we pair the sequence with $r_d(X_m^d)$, then it becomes a Markov chain. The proof is easy and is omitted for brevity.

Lemma 2.4. If $P_d$ is rotationally symmetric and $S_d$ is a symmetric measure, then both $(r_d(X_m^d))_{m\in\mathbb{N}_0}$ and $(\pi_{E_{dk}}(X_m^d), r_d(X_m^d))_{m\in\mathbb{N}_0}$ are Markov chains if $X^d\sim\mathrm{RWM}(P_d, S_d)$.
3 Main results
3.1 Weak consistency for Gaussian target distribution
First, we consider the asymptotic properties of $r_d(X_m^d)$. For the Gaussian case $P_d = N_d(0, I_d)$, we set $r_d:\mathbb{R}^d\to\mathbb{R}$ as
$$r_d(x) = \sqrt{d}\left(\frac{\|x\|^2}{d} - 1\right) \quad (x\in\mathbb{R}^d).$$
The law of $r_d(X_0^d)$, which is the stationary distribution of $Y^{d,1}$ defined below, tends to $N(0,2)$. This normal distribution is also the stationary distribution of the limiting Ornstein-Uhlenbeck process. Let $[x]$ be the integer part of $x\ge 0$, and let $\mu_k(\sigma) = E[|Z|^k\exp(-Z^+)]$ for $Z\sim N(\sigma^2/2,\sigma^2)$.
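The constants $\mu_k(\sigma)$ can be evaluated numerically. The sketch below (illustrative; the quadrature is an arbitrary implementation choice) computes $\mu_k(\sigma)$ and uses the identity $\mu_0(\sigma) = 2\Phi(-\sigma/2)$, which follows from a direct calculation for $Z\sim N(\sigma^2/2,\sigma^2)$, as a sanity check.

    import numpy as np
    from scipy.stats import norm
    from scipy.integrate import quad

    def mu(k, sigma):
        """mu_k(sigma) = E[|Z|^k exp(-Z^+)] for Z ~ N(sigma^2/2, sigma^2), by quadrature."""
        dens = lambda z: norm.pdf(z, loc=sigma ** 2 / 2, scale=sigma)
        integrand = lambda z: np.abs(z) ** k * np.exp(-max(z, 0.0)) * dens(z)
        val, _ = quad(integrand, -np.inf, np.inf)
        return val

    l = 2.4
    print(mu(0, l), 2 * norm.cdf(-l / 2))    # these two should agree
    print("sigma(l)^2 in (3.1):", 4 * mu(2, l))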
Proposition 3.1. Let $P_d = N_d(0, I_d)$. Set $X^d\sim\mathrm{RWM}(P_d, S_d)$ for $S_d = N_d(0, l_d^2 I_d)$ with $l_d^2 = l^2/d$ for some $l>0$, and let $Y_t^{d,1} = r_d(X_{[dt]}^d)$. Then $Y^{d,1}$ converges to $Y^1$, the solution of
$$dY_t^1 = -\frac{\sigma(l)^2}{4}Y_t^1\,dt + \sigma(l)\,dW_t; \quad Y_0^1\sim N(0,2), \tag{3.1}$$
where $\sigma(l)^2 = 4\mu_2(l)$ and $(W_t)_{t\ge0}$ is the standard Wiener process. In particular, $\mathrm{RWM}(P_d, S_d)$ has weak consistency with rate $T_d = d$ with respect to $r_d$.

As suggested by Lemma 2.1, the Gaussian proposal distribution attains the optimal rate for weak consistency with respect to $r_d$. Therefore no heavier-tailed proposal distribution can do better.

Theorem 3.1. Let $P_d = N_d(0, I_d)$. For $\mathrm{RWM}(P_d, S_d)$ with $S_d\in\mathcal{S}_d$, the optimal rate for weak consistency with respect to $r_d$ is $T_d = d$. In particular, $S_d = N_d(0, l_d^2 I_d)$ attains the optimal rate.

Next we consider the asymptotic properties of $\pi_{E_{dk}}(X_m^d)$. If $S_d$ is rotationally symmetric, then the law of $\{\pi_{E_{dk}}(X_m^d)\}_m$ does not depend on the choice of the $k$ elements $E_{dk}$ from $\{1,\dots,d\}$. Therefore, to prove (2.5), it is sufficient to consider $E_{dk}=\{1,\dots,k\}$, and in the following we only consider $\pi_k = \pi_{\{1,\dots,k\}}$. Note that in the following, $Y^2$ does not denote the square of $Y$.
Proposition 3.2. Let $P_d = N_d(0, I_d)$. Set $X^d\sim\mathrm{RWM}(P_d, S_d)$ for $S_d = N_d(0, l_d^2 I_d)$, and let $Y_t^d = (r_d(X_{[dt]}^d), \pi_k(X_{[dt]}^d))$. Then $Y^d$ converges to $(Y^1, Y^2)$, where $Y^1$ is the solution of (3.1) and $Y^2$ is the solution of
$$dY_t^2 = -\frac{\sigma(l)^2}{2}Y_t^2\,dt + \sigma(l)\,dB_t^k; \quad Y_0^2\sim N_k(0, I_k),$$
where $\sigma(l)^2 = l^2\mu_0(l)$ and $(B_t^k)_{t\ge0}$ is the $k$-dimensional standard Wiener process independent of $(W_t)_{t\ge0}$.

As in Theorem 3.1, this rate is optimal for weak consistency.
Theorem 3.2. For $P_d = N_d(0, I_d)$ and $S_d = N_d(0, l_d^2 I_d)$, $\mathrm{RWM}(P_d, S_d)$ is weakly consistent with rate $T_d = d$. This rate is optimal.
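An informal simulation check of the time scale in Proposition 3.1 (illustrative only; the dimensions, the constant $l$, the lag and the run lengths are arbitrary): the autocorrelation of $r_d(X_m)$ at lag $d$ should stabilise to a value that does not depend on $d$, reflecting the rate $T_d = d$.

    import numpy as np

    def rwm_gaussian(d, l, n_iter, rng):
        """RWM chain for N_d(0, I_d) with proposal N_d(0, (l^2/d) I_d); returns r_d(X_m)."""
        x = rng.standard_normal(d)                 # stationary start
        scale = l / np.sqrt(d)
        sq = np.dot(x, x)
        r = np.empty(n_iter)
        for m in range(n_iter):
            y = x + scale * rng.standard_normal(d)
            sq_y = np.dot(y, y)
            if np.log(rng.uniform()) < -0.5 * (sq_y - sq):   # Gaussian acceptance ratio
                x, sq = y, sq_y
            r[m] = np.sqrt(d) * (sq / d - 1.0)               # r_d(X_m)
        return r

    rng = np.random.default_rng(2)
    for d in (20, 80, 320):
        r = rwm_gaussian(d, l=2.4, n_iter=200 * d, rng=rng)
        c = r - r.mean()
        print(d, np.sum(c[:-d] * c[d:]) / np.sum(c ** 2))    # lag-d autocorrelation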
3.2 Weak consistency for heavy-tailed target distribution
Now we focus on heavy-tailed distributions, that is, $Q$ has a probability density $q$. In this case, unlike the Gaussian case, the projection $\sqrt{d}(\|x\|^2/d - 1)$ is not appropriate for the study of the trajectories of the MCMC, since the law of $\sqrt{d}(\|X_m^d\|^2/d - 1)$ is not tight as $d\to\infty$. Therefore we consider another projection,
$$r_d(x) = \frac{\|x\|^2}{d}.$$
Under this projection, the law of $r_d(X_m^d)$ tends to $Q$ as $d\to\infty$.
Proposition 3.3. Let $Q$ satisfy Assumption 1. Set $X^d\sim\mathrm{RWM}(P_d, S_d)$ for $S_d = N_d(0, l_d^2 I_d)$, and let $Y_t^{d,1} = r_d(X_{[d^2 t]}^d)$. Then $Y^{d,1}$ converges to a stationary ergodic process $Y^1$ that is the solution of
$$dY_t^1 = a(Y_t^1)\,dt + \cdots\, dW_t,$$
where $\ldots$ and $(W_t)_{t\ge0}$ is the standard Wiener process. In particular, $\mathrm{RWM}(P_d, S_d)$ has weak consistency with rate $T_d = d^2$ with respect to the projection $r_d$.

As in the case of the Gaussian target distribution, the above rate is optimal for weak consistency with respect to $r_d$. This is rather counterintuitive, since in practice and in the ergodic theory a heavier-tailed proposal sometimes works well. The interpretation will be discussed in Section 3.3.

Theorem 3.3. Let $Q$ satisfy Assumption 1. For $\mathrm{RWM}(P_d, S_d)$ with $S_d\in\mathcal{S}_d$, the rate of weak consistency with respect to the projection $r_d$ cannot be better than $d^{2-\epsilon}$, that is, $T_d/d^{2-\epsilon}\to\infty$ for any $\epsilon>0$. In particular, $S_d = N_d(0, l_d^2 I_d)$ attains the optimal rate $T_d = d^2$ in this sense.

Next we study the asymptotic properties of $\pi_E(X_m^d)$. The key fact is that the convergence rates for $\|X_m^d\|$ and $X_m^d/\|X_m^d\|$ are different. This makes the analysis a little more complicated. However, thanks to Lemma B.2, we can prove weak consistency even in this case.
Proposition 3.4. Let $Q$ satisfy Assumption 1. Set $X^d\sim\mathrm{RWM}(P_d, S_d)$ for $S_d = N_d(0, l_d^2 I_d)$, and let $Y_t^d = (r_d(X_{[dt]}^d), \pi_k(X_{[dt]}^d))$. Then $Y^d$ converges to $(y, Y^2(y))$, where $y\sim Q$ and $Y^2 = Y^2(y)$ is the solution of
$$dY_t^2 = -\frac{\sigma(l)^2}{2}Y_t^2\,dt + y^{1/2}\sigma(l)\,dB_t^k; \quad Y_0^2\sim N_k(0, yI_k),$$
where $\sigma(l)^2 = \bar l^2\mu_0(\bar l)$ with $\bar l^2 = l^2/y$, and $(B_t^k)_{t\ge0}$ is the $k$-dimensional standard Wiener process.

Note that the process $Y^2$ is stationary and ergodic conditionally on $y$; however, it is not ergodic unconditionally on $y$. To obtain an ergodic result, we should consider the time scaling $t\mapsto d^2 t$. Moreover, this time scaling corresponds to the optimal rate, as in Theorem 3.3.
Theorem 3.4. Let $Q$ satisfy Assumption 1. For $S_d = N_d(0, l_d^2 I_d)$, $\mathrm{RWM}(P_d, S_d)$ is weakly consistent with rate $T_d = d^2$. Moreover, the rate of weak consistency cannot be better than $d^{2-\epsilon}$ for any $\epsilon>0$, and hence $\mathrm{RWM}(P_d, S_d)$ attains the optimal rate in this sense.
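A rough simulation illustration of the $d^2$ time scale (not part of the formal development; the degrees of freedom, dimensions, run lengths and the log transform used to tame heavy tails are all ad hoc choices): for a multivariate t target, the autocorrelation of $\log r_d(X_m)$ at lag $d^2$ should be roughly stable in $d$, while at lag $d$ it stays close to one.

    import numpy as np

    def rwm_t(d, nu, l, n_iter, rng):
        """RWM chain for the d-dimensional t target of Example 1; returns ||X_m||^2 / d."""
        y0 = nu / rng.chisquare(df=nu)                  # mixing variable Y ~ Q
        x = np.sqrt(y0) * rng.standard_normal(d)        # stationary start X_0 ~ P_d
        scale = l / np.sqrt(d)
        log_p = lambda s: -0.5 * (nu + d) * np.log1p(s / nu)   # log target as a function of ||x||^2
        sq = np.dot(x, x)
        r = np.empty(n_iter)
        for m in range(n_iter):
            y = x + scale * rng.standard_normal(d)
            sq_y = np.dot(y, y)
            if np.log(rng.uniform()) < log_p(sq_y) - log_p(sq):
                x, sq = y, sq_y
            r[m] = sq / d
        return r

    def acf_at(series, lag):
        c = series - series.mean()
        return np.sum(c[:-lag] * c[lag:]) / np.sum(c ** 2)

    rng = np.random.default_rng(3)
    for d in (10, 20, 40):
        lr = np.log(rwm_t(d, nu=3.0, l=2.4, n_iter=200 * d * d, rng=rng))
        print(d, acf_at(lr, d), acf_at(lr, d * d))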
3.3 Discussion
• As shown in Theorems 3.3 and 3.4, the convergence rate for a heavy-tailed target distribution is $d^2$, which is much worse than that for a light-tailed target distribution. Although the rate is quite bad, it is optimal, as described in those theorems.

• The optimality of the Gaussian random-walk Metropolis algorithm is rather counterintuitive in two respects. First, Jarner and Roberts [2007] showed that a heavy-tailed increment distribution improves the polynomial rate of convergence. This may seem a contradiction, but it is not: the high-dimensional asymptotics concerns the local behaviour of the MCMC, whereas the ergodic theory concerns its global behaviour. Thus the conclusions of Theorem 3.4 and of Jarner and Roberts [2007] address two different properties. If we can choose the initial state $X_0^d$ properly (see Lemma 4 of Kamatani [2014] and Section B of Kamatani [2013]), then the local property may be more useful for the choice of $S_d$. On the other hand, if we have no idea how to choose $X_0^d$, the global property may provide more information about convergence. Second, in practice the usefulness of heavy-tailed increment distributions for heavy-tailed target distributions is also discussed. In our setting, the target distribution belongs to a class of unimodal rotationally symmetric distributions. If the target distribution is multimodal, then it may be better to use a heavy-tailed increment distribution, though a theoretical validation of this might be difficult. If the target distribution can be regarded as a perturbation of a rotationally symmetric distribution, then our conclusion may be applicable, and in that case the Gaussian random-walk Metropolis algorithm attains the optimal rate.

• Only the optimal rate was discussed here. For the light-tailed case, optimality of the limit process has also been considered in the literature. It may also be possible in our case to optimize the choice of $l$ when the increment distribution is $S_d = N_d(0, l_d^2 I_d)$. The optimal choice of $S_d$ over all of $\mathcal{S}_d$ in this sense is even more difficult, since the limit can be a diffusion, a pure-jump process or a diffusion with jumps. Further work is needed here.

• For a light-tailed target distribution, the expected squared jumping distance and the expected acceptance rate work well as practical measures of efficiency. For the heavy-tailed case, it is not easy to find such a statistic. One possibility is to estimate $E[\,|\log\|X_1^d\| - \log\|X_0^d\|\,|\,]$, which is finite even in the heavy-tailed case and is of order $d^{-1}$ (see the sketch after this list). If this value is moderately large, the MCMC may have good performance. If the target distribution is far from rotationally symmetric, we recommend normalising (properly scaling) the distribution in advance.

• There is no known good strategy for a heavy-tailed target distribution that improves the convergence rate $d^2$. MaLa behaves like RWM, as discussed in Jarner and Roberts [2007] and Kamatani [2009], and RWM with a heavy-tailed proposal does not improve the convergence rate. If the target is a perturbation of a stable distribution, then a discretization of the related stable process may work well; however, this is beyond the scope of this paper.
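The radial diagnostic mentioned in the list above can be estimated along a run. The sketch below (an illustrative implementation; the t target, $\nu=3$, $l=2.4$ and the run length are arbitrary) estimates $E[\,|\log\|X_1^d\| - \log\|X_0^d\|\,|\,]$; multiplying the estimate by $d$ gives a quantity that should be of comparable size across dimensions when the chain is locally efficient.

    import numpy as np

    def radial_diagnostic(d, nu, l, n_iter, rng):
        """Average |log||X_m|| - log||X_{m-1}||| along an RWM chain for the t target."""
        y0 = nu / rng.chisquare(df=nu)
        x = np.sqrt(y0) * rng.standard_normal(d)          # stationary start
        scale = l / np.sqrt(d)
        log_p = lambda s: -0.5 * (nu + d) * np.log1p(s / nu)
        sq = np.dot(x, x)
        total = 0.0
        for _ in range(n_iter):
            y = x + scale * rng.standard_normal(d)
            sq_y = np.dot(y, y)
            if np.log(rng.uniform()) < log_p(sq_y) - log_p(sq):
                total += 0.5 * abs(np.log(sq_y) - np.log(sq))   # |log||X_m|| - log||X_{m-1}|||
                x, sq = y, sq_y
        return total / n_iter

    rng = np.random.default_rng(4)
    for d in (10, 20, 40):
        est = radial_diagnostic(d, nu=3.0, l=2.4, n_iter=20_000, rng=rng)
        print(d, est, d * est)    # d * estimate should be roughly stable across d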
4 Proofs
Some notation is required for the proofs.

Random variables. Sections 4.1-4.3 deal with the Gaussian target distribution $P_d = N_d(0, I_d)$. In this case, we analyze the random variable
$$Z^d := \langle X_0^d, W_1^d\rangle + \frac{\|W_1^d\|^2}{2}, \quad X_0^d\sim P_d,\ W_1^d\sim S_d. \tag{4.1}$$
Note that $\|X_0^d+W_1^d\|^2 - \|X_0^d\|^2 = 2Z^d$, and the acceptance probability $\alpha_d(X_0^d, X_0^d+W_1^d)$ is $\alpha(Z^d) := \exp(-Z^{d+})$ for the Gaussian target distribution, where we write $Z^{d+} = \max\{Z^d, 0\}$. Moreover, the conditional distribution of $Z^d$ given $W_1^d$ is $N(\|W_1^d\|^2/2, \|W_1^d\|^2)$.
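For completeness, the one-line computation behind this claim: since $p_d(x)\propto e^{-\|x\|^2/2}$ for the Gaussian target,
$$\frac{p_d(X_0^d+W_1^d)}{p_d(X_0^d)} = \exp\Bigl(-\tfrac12\bigl(\|X_0^d+W_1^d\|^2 - \|X_0^d\|^2\bigr)\Bigr) = e^{-Z^d}, \qquad \min\{1, e^{-Z^d}\} = e^{-Z^{d+}}.$$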
Sections 4.4-4.6 deal with heavy-tailed target distributions. Write $y = r_d(X_0^d) = \|X_0^d\|^2/d$. In this case, we analyze the random variable
$$Z_d := \frac{\|X_0^d+W_1^d\|^2 - \|X_0^d\|^2}{2\|X_0^d\|^2/d} = \left\langle \frac{d^{1/2}X_0^d}{\|X_0^d\|}, \frac{d^{1/2}W_1^d}{\|X_0^d\|}\right\rangle + \frac{1}{2}\frac{\|d^{1/2}W_1^d\|^2}{\|X_0^d\|^2} =: \langle \bar X_0^d, \bar W_1^d\rangle + \frac{1}{2}\|\bar W_1^d\|^2.$$
Note that $\|X_0^d+W_1^d\|^2 - \|X_0^d\|^2 = 2Z_d y$. Also, by Lemma 2.1, the acceptance probability $\alpha_d(X_0^d, X_0^d+W_1^d)$ is
$$\alpha_d(Z_d; y) := \min\left\{\frac{p_d(X_0^d+W_1^d)}{p_d(X_0^d)}, 1\right\} = \min\left\{\left(\frac{\|X_0^d+W_1^d\|}{\|X_0^d\|}\right)^{2-d}\frac{q_d(\|X_0^d+W_1^d\|^2/d)}{q_d(\|X_0^d\|^2/d)},\ 1\right\} = \left(1+\frac{2Z_d^+}{d}\right)^{\frac{2-d}{2}}\frac{q_d(y + 2Z_d^+ y/d)}{q_d(y)}.$$
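A short check of the last expression, which also explains the argument of $q_d$: since $\|X_0^d\|^2 = dy$ and $\|X_0^d+W_1^d\|^2 - \|X_0^d\|^2 = 2Z_d y$, we have
$$\frac{\|X_0^d+W_1^d\|^2}{\|X_0^d\|^2} = 1 + \frac{2Z_d}{d}, \qquad \frac{\|X_0^d+W_1^d\|^2}{d} = y\Bigl(1 + \frac{2Z_d}{d}\Bigr).$$
By Lemma 2.1, the ratio $p_d(X_0^d+W_1^d)/p_d(X_0^d)$ is at least $1$ exactly when $Z_d\le 0$, so taking the minimum with $1$ amounts to replacing $Z_d$ by $Z_d^+$ in the displayed formula.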
For both the Gaussian and the heavy-tailed target distributions, we will use
$$U_1^d = \sqrt{d}\left\langle \frac{X_0^d}{\|X_0^d\|}, \frac{W_1^d}{\|W_1^d\|}\right\rangle.$$
In both cases, the conditional law of this random variable given $W_1^d$ is that of the first element of the uniform distribution on the surface of the sphere $\{x\in\mathbb{R}^d;\ \|x\|^2 = d\}$.

Probability measures. In the following six sections, we write
$$P_y(\,\cdot\,) = P(\,\cdot\mid r_d(X_0^d) = y), \quad P_{y,\sigma}(\,\cdot\,) = P(\,\cdot\mid r_d(X_0^d) = y,\ \|W_1^d\| = \sigma), \quad P_{x,y}(\,\cdot\,) = P(\,\cdot\mid \pi_k(X_0^d) = x,\ r_d(X_0^d) = y). \tag{4.2}$$
Write $N_\sigma$ for $N(\sigma^2/2, \sigma^2)$. Some important properties of the probability distribution $N_\sigma$ are described in Section A.

Constants. Let $W_{1,k}^d$ be the $k$-th element of the vector $W_1^d\in\mathbb{R}^d$. We will use the notation
$$y = r_d(X_0^d), \quad \sigma = \|W_1^d\|, \quad \sigma_k = |W_{1,k}^d|.$$
We have $\sigma^2 = \sum_k \sigma_k^2$.
We use the following result on triangular arrays, which is a slight modification of Lemma 9 of Genon-Catalot and Jacod [1993]. The proof follows the same lines as the proof of their lemma, via the Lenglart inequality, and is omitted. For $d\in\mathbb{N}$, let $\{\mathcal{G}_i^d\}_i$ be a filtration and let $\chi_{i,\alpha}^d$ be $\mathcal{G}_i^d$-measurable random variables indexed by $i\in\mathbb{N}_0$ and $\alpha\in A_d$. We say that $\chi_{i,\alpha}^d\to 0$ uniformly if the convergence $P(|\chi_{i,\alpha}^d| > \epsilon)\to 0$ ($d\to\infty$) holds uniformly in $\alpha\in A_d$ for every $\epsilon>0$. Let $T_d$ be an $\mathbb{N}_0$-valued sequence.

Lemma 4.1. The following two conditions imply $\sup_{i\le T_d}\bigl|\sum_{j\le i}\chi_{j,\alpha}^d\bigr|\xrightarrow{p}0$ uniformly: $\ldots$
Lemma 4.2. Let $\sigma>0$ and $y\in\mathbb{R}$ be such that $\sigma_d^2 := \sigma^2(1+d^{-1/2}y) > 0$. For any absolutely continuous $N_\sigma$-integrable function $h$,
$$\Bigl|E_{y,\sigma}\bigl[h(Z^d) - N_\sigma h + d^{-1/2}\sigma^2 y\,N_\sigma f'\bigr]\Bigr| \le d^{-1}\bigl(6 + 3yd^{-1/2} + 4y^2\sqrt{2/\pi}\bigr)\,\sigma\,\|h'\|_\infty,$$
where $f$ is the solution of Stein's equation (A.5). In particular, if $h(z) = ze^{-z^+}$, then
$$\Bigl|E_{y,\sigma}\Bigl[h(Z^d) + d^{-1/2}y\,\frac{\mu_2(\sigma)}{2}\Bigr]\Bigr| \le d^{-1}\bigl(6 + 3yd^{-1/2} + 4y^2\sqrt{2/\pi}\bigr)\,\sigma.$$

Proof. Set $\tilde Z^d = \sigma_d U_1 + \sigma^2/2$ with $U_1\sim N(0,1)$. By Lemma 2.3,
$$\bigl|E_{y,\sigma}[h(Z^d) - h(\tilde Z^d)]\bigr| \le \frac{3\sigma_d}{d-1}\|h'\|_\infty \le \frac{6\sigma_d}{d}\|h'\|_\infty \tag{4.6}$$
if $d-1\ge d/2$. If $f = \mathcal{F}(h, N_\sigma)$ is the solution of (A.5), then
$$E_{y,\sigma}[h(\tilde Z^d)] - N_\sigma h = \sigma^2 E_{y,\sigma}[f'(\tilde Z^d)] - E_{y,\sigma}\Bigl[\Bigl(\tilde Z^d - \frac{\sigma^2}{2}\Bigr)f(\tilde Z^d)\Bigr] = \sigma^2 E_{y,\sigma}[f'(\tilde Z^d)] - \sigma_d E_{y,\sigma}[U_1 f(\tilde Z^d)] = (\sigma^2-\sigma_d^2)\,E_{y,\sigma}[f'(\tilde Z^d)], \tag{4.7}$$
where the last equality comes from Stein's characterization (A.1). Now let $g = \mathcal{F}(f', N_\sigma)$. Applying the above equation with $f'$ in place of $h$, we have
$$\bigl|E_{y,\sigma}[f'(\tilde Z^d)] - N_\sigma f'\bigr| \le |\sigma^2-\sigma_d^2|\,\|g'\|_\infty.$$
Furthermore, by (A.6), we have $\|g'\|_\infty \le 4\sigma^{-2}\|f'\|_\infty \le 4\sigma^{-3}\sqrt{2/\pi}\,\|h'\|_\infty$, and hence
$$\bigl|E_{y,\sigma}[f'(\tilde Z^d)] - N_\sigma f'\bigr| \le |\sigma^2-\sigma_d^2|\,4\sigma^{-3}\sqrt{2/\pi}\,\|h'\|_\infty.$$
Therefore, by this inequality together with (4.6) and (4.7), we have
$$\Bigl|E_{y,\sigma}\bigl[h(Z^d) - N_\sigma h - (\sigma^2-\sigma_d^2)N_\sigma f'\bigr]\Bigr| \le d^{-1}\Bigl(6\sigma_d + d(\sigma^2-\sigma_d^2)^2\,4\sigma^{-3}\sqrt{2/\pi}\Bigr)\|h'\|_\infty.$$
By substituting $\sigma^2 - \sigma_d^2 = -d^{-1/2}\sigma^2 y$ and $\sigma_d \le \sigma(1+y/2\sqrt{d})$, the former inequality of the lemma follows. Finally, the last claim follows from $N_\sigma h = 0$, $\|h'\|_\infty\le 1$ and Lemma A.2 when $h(z) = ze^{-z^+}$.

Proof of Proposition 3.1. For $\epsilon<1$, let $K(\epsilon) = [\epsilon, \epsilon^{-1}]$ be a closed interval in $(0,\infty)$, and let $f$ be the solution of (A.5) for $h(z) = ze^{-z^+}$ with $\sigma = \|W_1^d\|$. Set
$$a_d(y) := dE_y[r_d(X_1^d) - r_d(X_0^d)], \quad b_d(y) := dE_y[(r_d(X_1^d)-r_d(X_0^d))^2], \quad c_d(y) := dE_y[(r_d(X_1^d)-r_d(X_0^d))^4]. \tag{4.8}$$
By Lemma 4.2,
$a_d(y) = -\mu_2(l)\,y + o(1)$ uniformly in $y\in K(\epsilon)$, by the dominated convergence theorem. For the diffusion part, we have
$$b_d(y) = 4E_y\bigl[(Z^d)^2 e^{-Z^{d+}}\bigr] =: 4E_y[h_2(Z^d)].$$
The conditional distribution of $Z^d$ given $y\in K(\epsilon)$ converges to $N_l$, and $h_2(Z^d)$ is uniformly integrable since $E_y[(U_1^d)^4] = 3d/(d+2)$ (by (2.2)) and $E_y[\sigma^4] = l^4(1+2d^{-1})$. Thus $b_d(y) = 4N_l h_2 + o(1) = 4\mu_2(l) + o(1)$ uniformly in $y\in K(\epsilon)$. Finally, we check that
$$c_d(y) = d^{-1}2^4\,E_y\bigl[(Z^d)^4 e^{-Z^{d+}}\bigr] \le d^{-1}2^4\,E_y[(Z^d)^4] = O(d^{-1}).$$
Thus the first claim follows from the convergence of Markov chains to a diffusion process (Theorem IX.4.21 of Jacod and Shiryaev [2003]). The last claim is a consequence of Lemma B.1.
4.2 Proof of Theorem 3.1
For two random variables $X$ and $Y$, we write $X\prec Y$ if there exists $C>0$ such that $P_{S_d}(|X| > \epsilon) \le C\,P_{S_d}(|Y| > \epsilon)$ for any $\epsilon>0$ and $S_d\in\mathcal{S}_d$. In the proof of Theorem 3.1, we use the fact that
$$\|X_1^d\|^2 - \|X_0^d\|^2 \;\prec\; \Bigl\langle X_0^d, \frac{W_1^d}{\|W_1^d\|}\Bigr\rangle^2. \tag{4.9}$$
Note that the random variable $\langle X_0^d, W_1^d/\|W_1^d\|\rangle$ follows the standard normal distribution, which is, of course, free from the choice of $S_d$. The claim (4.9) can be proved by reversibility:
$$P_{S_d}\bigl(\bigl|\|X_1^d\|^2 - \|X_0^d\|^2\bigr| > \epsilon\bigr) = 2P_{S_d}\bigl(\|X_1^d\|^2 - \|X_0^d\|^2 < -\epsilon\bigr) \le 2P_{S_d}\bigl(\|X_0^d+W_1^d\|^2 - \|X_0^d\|^2 < -\epsilon\bigr)$$
$$= 2P_{S_d}\Bigl(\|W_1^d\|^2 + 2\|W_1^d\|\Bigl\langle X_0^d, \frac{W_1^d}{\|W_1^d\|}\Bigr\rangle < -\epsilon\Bigr) = 2P_{S_d}\Bigl(\Bigl(\|W_1^d\| + \Bigl\langle X_0^d, \frac{W_1^d}{\|W_1^d\|}\Bigr\rangle\Bigr)^2 - \Bigl\langle X_0^d, \frac{W_1^d}{\|W_1^d\|}\Bigr\rangle^2 < -\epsilon\Bigr)$$
$$\le 2P_{S_d}\Bigl(-\Bigl\langle X_0^d, \frac{W_1^d}{\|W_1^d\|}\Bigr\rangle^2 < -\epsilon\Bigr).$$

Lemma 4.3. For $T_d/d\to 0$