OPTIMAL SCALING OF THE MALA ALGORITHM WITH IRREVERSIBLE PROPOSALS FOR GAUSSIAN TARGETS

arXiv:1702.01777v1 [stat.ME] 6 Feb 2017

MICHELA OTTOBRE, NATESH S. PILLAI, AND KONSTANTINOS SPILIOPOULOS

Abstract. It is known that reversible Langevin diffusions in confining potentials converge to equilibrium exponentially fast. Adding a divergence free component to the drift of a Langevin diffusion accelerates its convergence to stationarity. However such a stochastic differential equation (SDE) is no longer reversible. In this paper, we analyze the optimal scaling of MCMC algorithms constructed from discretizations of irreversible Langevin dynamics for high dimensional Gaussian target measures. In particular, we make use of Metropolis-Hastings algorithms where the proposal move is a time-step Euler discretization of an irreversible SDE. We call the resulting algorithm the ipMALA, in comparison to the classical MALA algorithm. We stress that the usual Metropolis-Hastings accept-reject mechanism makes the ipMALA chain reversible; thus, it is of interest to study the effect of irreversible proposals in these algorithms. In order to quantify how the cost of the algorithm scales with the dimension N, we prove invariance principles for the appropriately rescaled chain. In contrast to the usual MALA algorithm, we show that there could be three regimes asymptotically: (i) a diffusive regime, as in the MALA algorithm, (ii) a "transition" regime and (iii) a "fluid" regime where the limit is an ordinary differential equation. Numerical results are also given corroborating the theory.

1. Introduction
In this paper, we analyze the scaling properties of high dimensional Markov Chain Monte Carlo (MCMC) algorithms constructed using non-reversible Langevin diffusions. Consider a target measure
π(dx) = (1/Z) e^{−U(x)} dx,   Z = ∫_{R^d} e^{−U(x)} dx.   (1.1)
It is known that, under mild assumptions on the potential U(x), the Langevin stochastic differential equation (SDE)
dX_t = −∇U(X_t) dt + √2 dW_t   (1.2)

has π as its unique invariant measure and is π-ergodic. For appropriate test functions f : R^d → R, by the ergodic theorem, we have
(1/t) ∫_0^t f(X_s) ds −→ ∫ f(x) π(dx).   (1.3)

Thus the Langevin diffusion (1.2) is a fundamental tool for sampling from the target measure π, or for computing expectations of various functionals ∫ f dπ, in view of (1.3). The Langevin SDE is time reversible: its generator is a self-adjoint operator in L²(π).

Date: February 8, 2017. K.S. was partially supported by the National Science Foundation (NSF) CAREER award DMS 1550918. NSP thanks ONR for partial financial support.

For vector fields Γ such that div(Γ e^{−U}) = 0, diffusions of the form
dX_t = [−∇U(X_t) + Γ(X_t)] dt + √2 dW_t,   (1.4)
also have π as their invariant measure. The divergence free condition can be written as
div Γ = Γ · ∇U.   (1.5)
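To see where (1.5) comes from, one can expand the divergence of the weighted vector field (for smooth Γ and U); this is a standard one-line computation, not specific to the setting of this paper:
div(Γ e^{−U}) = e^{−U} div Γ − e^{−U} Γ · ∇U = e^{−U} (div Γ − Γ · ∇U),
so that div(Γ e^{−U}) = 0 holds if and only if div Γ = Γ · ∇U.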

Observe that, for Γ ≠ 0, the diffusions in (1.4) are non-reversible. Numerical discretisations of (1.2) or (1.4) do not necessarily inherit the ergodic properties of the continuous-time dynamics [19]. In particular, discretized processes may not converge at all, or may have an invariant measure different from π. One way of eliminating the above issues is to append the discretization with a Metropolis-Hastings accept-reject mechanism. This results in the MALA algorithm when the discretised SDE is (1.2) (see e.g. [17]), and in the ipMALA algorithm, which we introduce and study in this paper, when the discretised dynamics is the SDE (1.4). While the proposal comes from a non-reversible dynamics, having an accept-reject scheme makes the algorithm reversible again.
It is known that adding a non-reversible component to a Langevin SDE can accelerate its convergence to stationarity. Indications of this phenomenon are the main results of [9, 15, 16]. The main result in [9] states that, among the family of non-reversible diffusions (1.4), the one with the smallest spectral gap corresponds to Γ = 0. Similar results were established from an asymptotic variance and large deviations point of view in [15, 16]. Broadly speaking, it is a well documented principle that non-reversible dynamics have better ergodic properties than their reversible counterparts. This observation has sparked a significant amount of research in recent years, and many different approaches have been taken in order to exploit and analyse the beneficial effects of irreversibility in the algorithmic practice of MCMC methodology. In particular: i) irreversible algorithms have been proposed and analyzed in [2, 3, 4]; ii) algorithms that are obtained by discretizing irreversible Markov processes in such a way that the resulting Markov chain is still irreversible are studied in [8, 13]; iii) numerical algorithms that take advantage of the splitting of the equation into its reversible and irreversible parts are analyzed in [11, 6]. In addition, comparisons of MALA and of Langevin samplers without the accept-reject step have been performed in [5, 6]. In many cases of interest it has been observed that irreversible Langevin samplers have smaller mean square error when compared to the MALA algorithm, see [6]. The latter fact is related to the observation that the variance reduction achieved by the irreversible Langevin sampler is more significant than the error due to the bias of the irreversible Langevin sampler.
In this paper we take a different standpoint and analyse the exact performance of a Metropolis-Hastings algorithm where the proposal is based on discretizing (1.4) as opposed to (1.2). We call this algorithm the irreversible proposal MALA, or ipMALA for short. To quantify the efficiency of the ipMALA algorithm in high dimensions and compare it to the usual MALA algorithm, we study its optimal scaling properties [17, 18, 14]. Optimal scaling aims to find the "optimal size" of the local proposal variance as a function of the dimension. The optimality criterion varies for different algorithms, but a natural choice for algorithms that have a diffusion limit is the expected square jumping distance [18]. This criterion lets us quantify the cost of the algorithm in terms of the number of steps taken, in stationarity, to explore the target measure.
The basic mechanism for Metropolis-Hastings algorithms consists of employing a proposal transition density q(x, y) in order to produce a reversible chain {x_k}_{k=0}^∞ which has the target measure π as invariant distribution. At step k of the chain, a proposal move y_{k+1} is generated by using q(x, y), i.e., y_{k+1} ∼ q(x_k, ·). Then such a move is accepted with probability α(x_k, y_{k+1}), where
α(x_k, y_{k+1}) = min{1, [π(y_{k+1}) q(y_{k+1}, x_k)] / [π(x_k) q(x_k, y_{k+1})]}.   (1.6)
The MALA algorithm is a Metropolis-Hastings algorithm with proposal move generated by a time-step discretization of the Langevin equation (1.2). The ipMALA algorithm we wish to analyze obtains proposals by discretizing (1.4). As explained before, any nontrivial Γ satisfying the divergence free condition (1.5) will preserve the invariant measure. A convenient choice is to pick Γ such that
div Γ = 0   and   Γ · ∇U = 0.
A standard choice of Γ(x) is
Γ(x) = S ∇U(x),   (1.7)
where S is any antisymmetric matrix.
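For a constant antisymmetric S, both conditions are immediate to verify (a standard check, recorded here for the reader's convenience):
div(S∇U) = Σ_{i,j} S_{ij} ∂_i∂_j U = 0   and   ⟨S∇U, ∇U⟩ = 0,
where the first sum vanishes because S_{ij} = −S_{ji} while ∂_i∂_j U is symmetric in (i, j), and the second because S is antisymmetric.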

A more elaborate discussion on other possible choices of Γ(x) can be found in [15]. The meaning of these conditions is straightforward: the flow generated by Γ must preserve Lebesgue measure, since it is divergence-free; moreover, the micro-canonical measure on the surfaces {U = z} is preserved as well.
We make one further important assumption: for the rest of the paper, we focus exclusively on Gaussian target measures. We believe most of our analysis should carry over to the important case in which the target measure has a Radon-Nikodym derivative with respect to a Gaussian measure, using the methods in [12, 14].
The main result of the paper is as follows. We consider Gaussian target measures π^N ∼ N(0, C^N) on R^N, where C^N := diag{λ_1², ..., λ_N²} (see Section 2 for more details). Such a measure is clearly of the form (1.1) for a quadratic potential U. Therefore, in this case, the general form of an Euler discretization of (1.4) is given by
y^N_{k+1} = x^N_k − (σ_N²/2) x^N_k + σ_N^α C S x^N_k + σ_N C^{1/2} z^N_{k+1},
where z^N_{k+1} ∼ N(0, I_N), σ_N = ℓ/N^γ and α > 0. The quantity y^N_{k+1} is the proposal, and a Markov chain is then formed using the usual accept-reject mechanism. Let x^N_k be the resulting ipMALA Markov chain. Interestingly, depending on whether α is bigger than, equal to or smaller than two, we will have different limiting behaviours. Broadly speaking, we show the following:
• if γ < 1/6 then the acceptance probability degenerates to zero exponentially quickly, and thus this case is not of practical interest;
• if α ≥ 2 then the optimal value of γ is γ = 1/6. We prove that the continuous interpolant of two subsequent steps of the ipMALA algorithm, see (4.6) for the precise definition, has a diffusion limit, which however is different for the cases α > 2 and α = 2. In either case, though, the cost of the algorithm is still N^{1/3}, as in the standard MALA case;
• if 1 ≤ α < 2 then take γ ≥ 1/6. In this regime we show that the cost of the algorithm is of the order N^{αγ}. In particular, we show that the continuous interpolant, see (4.6), converges weakly to the solution of a deterministic ODE. Also, if γ > 1/6 then the acceptance probability tends to one exponentially quickly. Notice that if the product

αγ < 1/3 then there is a gain in efficiency for the ipMALA algorithm as compared to MALA. It is also interesting to point out here (see also Remark 4.2 for more details) that the ODE that we obtain in the limit in this regime can be related to Hamilton's ODE in HMC.
The paper is organized as follows. In Section 2 we introduce the notation used throughout the paper. In Section 3 we describe and motivate the algorithm that we will examine. Section 4 contains the rigorous statement of our main result and its main implications. In Section 5 we give a detailed heuristic argument to explain how the main result is obtained and why it should hold. In Section 6 we present some examples of potential choices for the antisymmetric matrix and the computation of the corresponding limiting acceptance probabilities. Section 7 contains extensive simulation studies that demonstrate the theoretical results. Rigorous proofs are relegated to the Appendix.

2. Preliminaries and Notation
Let (H, ⟨·,·⟩, ‖·‖) denote an infinite dimensional separable Hilbert space with the canonical norm derived from the inner product. Let C be a positive, trace class operator on H and {ϕ_j, λ_j²}_{j≥1} be the eigenfunctions and eigenvalues of C respectively, so that
Cϕ_j = λ_j² ϕ_j,   Σ_j λ_j² < ∞.   (2.1)

The eigenvalues λ_j are non-increasing. We assume a normalization under which {ϕ_j}_{j≥1} forms a complete orthonormal basis in H. Let π be a Gaussian probability measure on H with covariance operator C, that is,

π ∼ N(0, C).
If X^N is the finite dimensional space
H ⊃ X^N := span{ϕ_j}_{j=1}^N
spanned by the first N eigenvectors of the covariance operator (notice that the space X^N is isomorphic to R^N), then for any fixed N ∈ N and x ∈ H we denote by P^N(x) the projection of x on X^N. Moreover, the finite dimensional projection π^N of the measure π on X^N is given by the Gaussian measure π^N ∼ N(0, C^N), where C^N := diag{λ_1², ..., λ_N²} (or, more precisely, C^N := P^N ∘ C ∘ P^N). Thus we have
π^N := π(x^N) = π(x^{1,N}, ..., x^{N,N}) = (1/(√(2π))^N) ∏_{i=1}^N (1/λ_i) e^{−|x^{i,N}|²/(2λ_i²)}.   (2.2)

Let S̃ : H → H be a bounded linear operator. Thus there exists a constant κ > 0 such that
‖S̃x‖ ≤ κ‖x‖,   x ∈ H.   (2.3)
For any N ∈ N we can consider the projected operator S̃^N, defined as follows: S̃^N x := (P^N ∘ S̃ ∘ P^N)x. The operator S̃^N is also bounded on H. Since X^N is isomorphic to R^N, S̃^N can be represented by an N × N matrix. Throughout the paper we require the following: S̃ is such that for any N ∈ N, the matrix S̃^N can be expressed as the product of a symmetric matrix, namely C^N, and an antisymmetric matrix, S^N:
S̃^N = C^N S^N,   for every N ∈ N.   (2.4)

Throughout the paper we will use the following notation:
• x and y are elements of the Hilbert space H;
• the letter N is reserved to denote the dimensionality of the space X^N where the target measure π^N is supported;
• x^N is an element of X^N ≅ R^N (similarly for y^N and the noise ξ^N); the j-th component of the N-dimensional vector x^N (in the basis {ϕ_j}_{j=1}^N) is denoted by x^{j,N}.

We analyse Markov chains evolving in R^N. Because the dimensionality of the space in which the chain evolves will be a key fact, we want to keep track of both the dimension N and the step k of the chain. Therefore,
• x^N_k will denote the k-th step of the chain {x^N_k}_{k∈N} evolving in R^N;
• compatibly with the notation set above, x^{i,N}_k is the i-th component of the vector x^N_k ∈ R^N; i.e., x^{i,N}_k = ⟨x^N_k, ϕ_i⟩;
• two (double) sequences of real numbers {A^N_k} and {B^N_k} satisfy A^N_k ≲ B^N_k if there exists a constant K > 0 (independent of N and k) such that
A^N_k ≤ K B^N_k,
for all N and k such that {A^N_k} and {B^N_k} are defined. The notation ≲ is used for functions and random functions as well, with the constant K being independent of the argument of the function and of the randomness.

As we have already mentioned, ‖·‖ is the norm on H, namely
‖x‖² := Σ_{i=1}^∞ ⟨x, ϕ_i⟩².
With abuse of notation, we will also write
‖x^N‖² = Σ_{i=1}^N |x^{i,N}|².

We will also use the weighted norm ‖·‖_C, defined as follows:
‖x‖²_C := ⟨x, C^{−1}x⟩ = Σ_{i=1}^∞ |⟨x, ϕ_i⟩|²/λ_i²,
for all x ∈ H such that the above series is convergent; analogously,
‖x^N‖²_{C^N} := Σ_{i=1}^N |x^{i,N}|²/λ_i².
If A^N is an N × N matrix, A^N = {A^N_{i,j}}, we define
V_{A^N} := Σ_{i=1}^N λ_i² Σ_{j=1}^N (A^N_{i,j})² λ_j² = E_{π^N} ‖(C^N)^{1/2} A^N x^N‖² = E_{π^N} ⟨(C^N)^{1/2} z^N, A^N x^N⟩².   (2.5)

In the above, z^N ∼ N(0, I_N), x^N ∼ π^N and E_{π^N} denotes expectation with respect to all the sources of noise contained in the integrand. We will often also write E_k and E_x, to mean
E_k = E[·|x_k],   E_x = E[·|x_k = x].
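As an illustration of the identity (2.5) above, the two expressions for V_{A^N} are easy to compare numerically. The following sketch is only a Monte Carlo sanity check (it assumes NumPy; the choices of λ_i and of the antisymmetric matrix A^N are arbitrary and are not taken from this paper):

import numpy as np

rng = np.random.default_rng(0)
N = 20
lam = 1.0 / np.arange(1, N + 1)            # eigenvalues lambda_i (illustrative choice)
B = rng.standard_normal((N, N))
A = B - B.T                                 # an arbitrary antisymmetric matrix A^N

# closed form: V_{A^N} = sum_i lambda_i^2 sum_j (A^N_{i,j})^2 lambda_j^2
V_exact = np.sum(lam[:, None] ** 2 * A ** 2 * lam[None, :] ** 2)

# Monte Carlo estimate of E_{pi^N} ||(C^N)^{1/2} A^N x^N||^2 with x^N ~ N(0, C^N)
M = 200_000
x = rng.standard_normal((M, N)) * lam       # samples x^N ~ pi^N, coordinates x_i = lambda_i * rho_i
V_mc = np.mean(np.sum((lam * (x @ A.T)) ** 2, axis=1))

print(V_exact, V_mc)                        # the two values agree up to Monte Carlo error

The agreement simply reflects the computation E‖(C^N)^{1/2}A^N x^N‖² = Σ_i λ_i² Σ_j (A^N_{i,j})² λ_j² for x^N ∼ N(0, C^N).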

Finally, many of the objects of interest in this paper depend on the matrix S^N and on the parameters α and γ. When we want to stress such a dependence, we will add a subscript p; see for example the notation for the drift d_p in Theorem 4.1.

3. The ipMALA algorithm
As we have already mentioned, the classical MALA algorithm is a Metropolis-Hastings algorithm which employs a proposal that results from a one-step discretization of the Langevin dynamics (1.2). This is motivated by the fact that the dynamics (1.2) is ergodic with unique invariant measure given by π (1.1) and can therefore be used to sample from π. The Langevin dynamics that samples from our target of interest, the Gaussian measure π^N in (2.2), reads as follows:
dX_t = −(C^N)^{−1} X_t dt + √2 dW_t,   X_t ∈ X^N.
If ∆ is any positive definite, N-dimensional symmetric matrix, then the equation
dX_t = −∆(C^N)^{−1} X_t dt + √(2∆) dW_t
is still ergodic with invariant measure π^N in (2.2). Now notice that if C is a trace class operator, then C^{−1} is unbounded, so in the limit as N → ∞ either of the above two dynamics (and their discretizations) would lead to numerical instabilities. To avoid the appearance of unbounded operators, we can choose ∆ = C^N/2 and therefore consider the SDE
dX_t = −(1/2) X_t dt + (C^N)^{1/2} dW_t.
Discretizing the above and using such a discretization as a Metropolis-Hastings proposal would result in a well-posed MALA algorithm to sample from (2.2). However, as explained in the Introduction, here we want to analyze the MALA algorithm with irreversible proposal. As in (1.7), we next consider the non-reversible SDE
dX_t = [−(1/2) X_t + S^N (C^N)^{−1} X_t] dt + (C^N)^{1/2} dW_t,
where S^N is any N × N antisymmetric matrix. Again, to avoid the appearance of unbounded operators, we modify the irreversible part of the drift term and finally obtain the dynamics
dX_t = [−(1/2) X_t + C^N S^N X_t] dt + (C^N)^{1/2} dW_t.   (3.1)
Notice that, for any x^N ∈ X^N,
∇ · [(C^N S^N x^N) π^N] = Trace(C^N S^N) π^N − ⟨C^N S^N x^N, (C^N)^{−1} x^N⟩ π^N = 0,   (3.2)
having used the antisymmetry of S^N and the symmetry of C^N. Therefore, π^N is invariant for the dynamics (3.1). This justifies using the Metropolis-Hastings proposal (3.3) below, which is a one-step Euler discretization of the SDE (3.1).
We now describe the ipMALA algorithm. If the chain is in x^N_k at step k, the ipMALA algorithm has the proposal move
y^N_{k+1} = x^N_k − (σ_N²/2) x^N_k + σ_N^α C^N S^N x^N_k + σ_N (C^N)^{1/2} z^N_{k+1},   (3.3)
where
σ_N = ℓ/N^γ,   ℓ, γ > 0,   (3.4)
and α > 0. The choice of the parameters α, γ and ℓ will be discussed in Section 4 below. S^N can be any antisymmetric N × N matrix and z^N_{k+1} = (z^{1,N}_{k+1}, ..., z^{N,N}_{k+1}) is a vector of i.i.d. standard Gaussians. With this proposal, the acceptance probability is

β^N(x^N, y^N) = 1 ∧ e^{Q^N(x^N, y^N)},   (3.5)
where
Q^N(x^N, y^N) = −(σ_N²/8)(‖y^N‖²_{C^N} − ‖x^N‖²_{C^N}) + (σ_N^{2α−2}/2)(‖C^N S^N x^N‖²_{C^N} − ‖C^N S^N y^N‖²_{C^N}) + 2σ_N^{α−2}⟨S^N y^N, x^N⟩.
The above expression for Q^N is obtained with straightforward calculations, after observing that the proposal kernel q(x^N, y^N) implied by (3.3) is such that
q(x^N, y^N) ∝ exp(−(1/(2σ_N²)) ‖y^N − x^N + (σ_N²/2) x^N − σ_N^α C^N S^N x^N‖²_{C^N}).
If β̃^N ∼ Bernoulli(β^N(x^N, y^N)), then the Metropolis-Hastings chain {x^N_k}_{k∈N} resulting from using the proposal (3.3) can be written as follows:
x^N_{k+1} = β̃^N y^N_{k+1} + (1 − β̃^N) x^N_k = x^N_k + β̃^N (y^N_{k+1} − x^N_k).   (3.6)
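To fix ideas, a compact implementation of one run of the scheme (3.3)-(3.6) for the diagonal Gaussian target (2.2) might look as follows. This is only an illustrative sketch (it assumes NumPy), not the code used for the experiments of Section 7; the choices of C^N, S^N and of the parameters ℓ, γ, α are arbitrary, and the acceptance probability is computed from the log-ratio log[π^N(y)q(y, x)] − log[π^N(x)q(x, y)], which coincides with Q^N above.

import numpy as np

def ipmala(num_steps, N, ell, gamma, alpha, rng):
    """Sample from pi^N = N(0, C^N) using the ipMALA proposal (3.3)."""
    lam2 = 1.0 / np.arange(1, N + 1) ** 2       # eigenvalues lambda_i^2 of C^N (illustrative)
    C_half = np.sqrt(lam2)
    S = np.zeros((N, N))                         # an antisymmetric S^N: 2x2 rotation blocks
    for i in range(0, N - 1, 2):
        S[i, i + 1], S[i + 1, i] = 1.0, -1.0
    sigma = ell / N ** gamma                     # sigma_N = ell / N^gamma, cf. (3.4)

    def log_q(x, y):                             # log q(x, y), up to an additive constant
        mean = x - 0.5 * sigma ** 2 * x + sigma ** alpha * lam2 * (S @ x)
        return -0.5 * np.sum((y - mean) ** 2 / lam2) / sigma ** 2

    def log_pi(x):                               # log pi^N(x), up to an additive constant
        return -0.5 * np.sum(x ** 2 / lam2)

    x = C_half * rng.standard_normal(N)          # start in stationarity, x_0 ~ pi^N
    chain, accepted = [x.copy()], 0
    for _ in range(num_steps):
        z = rng.standard_normal(N)
        y = x - 0.5 * sigma ** 2 * x + sigma ** alpha * lam2 * (S @ x) + sigma * C_half * z
        logQ = log_pi(y) + log_q(y, x) - log_pi(x) - log_q(x, y)
        if np.log(rng.uniform()) < logQ:         # accept with probability 1 ^ e^{Q^N}, cf. (3.5)
            x = y
            accepted += 1
        chain.append(x.copy())
    return np.array(chain), accepted / num_steps

rng = np.random.default_rng(1)
chain, acc = ipmala(num_steps=20_000, N=50, ell=1.0, gamma=1 / 6, alpha=2.0, rng=rng)
print("empirical acceptance probability:", acc)

Running such a sketch for increasing N, and for the different combinations of α and γ discussed below, is a convenient way to reproduce qualitatively the behaviour summarised in Section 4 and in the simulation studies of Section 7.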

4. Main result and Implications
To understand the behaviour of the chain (3.6), we start by decomposing it into its drift and martingale part; that is, let ζ > 0 and write
x^N_{k+1} = x^N_k + (1/N^{ζγ}) d^N_p(x^N_k) + (1/N^{ζγ/2}) M^N_{k,p},   (4.1)
where the approximate drift d^N_p is defined as
d^N_p(x^N_k) := N^{ζγ} E_k(x^N_{k+1} − x^N_k)   (4.2)
and the approximate diffusion M^N_{k,p} is given by
M^N_{k,p} := N^{ζγ/2} [x^N_{k+1} − x^N_k − E_k(x^N_{k+1} − x^N_k)].
Using (3.3)-(3.4) and (3.6), we rewrite the approximate drift (4.2) as follows:
d^N_p(x^N_k) = N^{ζγ} E_k[(1 ∧ e^{Q^N})(−(ℓ²/(2N^{2γ})) x^N_k + (ℓ^α/N^{αγ}) C^N S^N x^N_k)]   (4.3)
             + N^{ζγ} E_k[(1 ∧ e^{Q^N})(ℓ/N^γ)(C^N)^{1/2} z^N_{k+1}].   (4.4)

Looking at (4.3) it is clear that we need to examine three different cases, namely
i) Diffusive regime: 2γ < αγ, i.e., α > 2;
ii) "Transition regime": 2γ = αγ, i.e., α = 2;
iii) Fluid regime: 2γ > αγ, i.e., α < 2.
The names of the above regimes will become clear after the statement of Theorem 4.1, see also Remark 4.2. We will show that, in order for the algorithm to have a well defined non-trivial limit, in the cases i) and ii) we should choose ζ = 2, whereas in case iii) we should choose ζ = α. For this reason, it is understood from now on that
ζ = α if α < 2   and   ζ = 2 if α ≥ 2.   (4.5)
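The role of ζ in (4.5) can be read off directly from (4.3): with σ_N = ℓ/N^γ, and ignoring the acceptance-probability factor E_k(1 ∧ e^{Q^N}), the two terms of the rescaled drift behave like
N^{ζγ} (ℓ²/(2N^{2γ})) x^N_k = O(N^{(ζ−2)γ})   and   N^{ζγ} (ℓ^α/N^{αγ}) C^N S^N x^N_k = O(N^{(ζ−α)γ}),
so the choice ζ = min(α, 2) makes the dominant term O(1): for α ≥ 2 the reversible part dominates (diffusive and transition regimes), while for α < 2 the irreversible part C^N S^N x^N_k dominates and the reversible part vanishes (fluid regime). This is only a heuristic reading of (4.3); the precise statements are in Theorem 4.1 and Section 5.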

With this observation in mind, we come to state the main result of this paper, Theorem 4.1. The assumptions under which Theorem 4.1 holds are detailed and motivated in Section 5.1. For the time being we state the theorem and, in Remark 4.2, we make several observations regarding practical implications of the statement. In order to state Theorem 4.1, let us now introduce the continuous interpolant of the chain {x^N_k}:
x^{(N)}(t) = (N^{ζγ} t − k) x^{k+1,N} + (k + 1 − N^{ζγ} t) x^{k,N},   t_k ≤ t < t_{k+1},   (4.6)
where t_k = k/N^{ζγ}. We stress that the operator S̃ appearing in the statement of the theorem is as in Section 2 (see (2.3)-(2.4)). In particular, the projected operator S̃^N is, for every N, the product of the matrix C^N and of the antisymmetric matrix S^N appearing in the proposal (3.3). In order to be able to properly motivate the assumptions of the paper but at the same time present our main result here, we defer the precise statement of the assumptions to Section 5, but we state the precise result below.

Theorem 4.1. Let Assumption 5.1, Assumption 5.2 and Condition 5.3 hold and let x_0 ∼ π. Then, as N → ∞, the continuous interpolant x^{(N)}(t) of the chain {x^N_k} (defined in (4.6)-(4.5) and (3.6), respectively), started from the state x^N_0 = P^N(x_0), converges weakly in C([0, T]; H) to the solution of the following equation
dx(t) = d_p(x(t)) dt + D_p dW_t^C,   x(t) ∈ H,   (4.7)
where W_t^C is an H-valued Brownian motion with covariance C. In addition, the drift coefficient d_p(x) : H → H and the diffusion constant D_p ≥ 0 are defined as follows:
d_p(x) := −(ℓ²/2) h_p x                          if γ = 1/6 and α > 2,
          −(ℓ²/2) h_p x + h_p ℓ² S̃x              if γ = 1/6 and α = 2,      (4.8)
          h_p ℓ^α S̃x                              if γ ≥ 1/6 and 1 ≤ α < 2,
and
D_p := ℓ √(h_p)   if γ = 1/6 and α ≥ 2,
       0           if γ ≥ 1/6 and 1 ≤ α < 2.      (4.9)
The real constants h_p and h̄_p are defined in (5.23) and in (5.27), respectively.
Proof. See Appendix A and B.

We now comment on the practical implications of Theorem 4.1.

Remark 4.2. The three limiting drifts (and corresponding diffusion coefficients) appearing in (4.8) correspond to the three different regimes i), ii) and iii), which we identified before the statement of the theorem. The choice γ = 1/6 in cases i) and ii) (i.e., in the diffusive and transition regime, respectively) and γ ≥ 1/6 in the fluid regime will be fully motivated in Section 5.1 below. The various results obtained in the different regimes are summarised in Tables 1 and 2 below. The constants c_j that the tables refer to are those appearing in Assumption 5.2.
• In the regimes i) and ii), the effective time-step implied by the interpolation (4.5)-(4.6) is N^{−2γ} = N^{−1/3}. Therefore, if γ = 1/6 and α ≥ 2, Theorem 4.1 implies that the optimal scaling for the proposal variance when the chain is in its stationary regime is of the order N^{−1/3}. That is, the cost of the algorithm (in terms of number of steps needed, in stationarity, to explore the state space) is O(N^{1/3}). This is the same scaling as obtained in [17, 14] for the MALA algorithm. Therefore, in the regimes i) and ii), the ipMALA algorithm has the same scaling properties as the usual MALA. We stress that the diffusion limits in cases i) and ii) are different (see also the tables below); however, either limiting diffusion admits our target measure as invariant measure. More discussion on this can be found in Section 5.2.
• When α < 2 and γ ≥ 1/6, the (rescaled) chain converges to a fluid limit; i.e., the limiting behaviour of the chain is described by an ODE. Such an ODE still admits our target as invariant measure. With the same reasoning as above, the main result implies that, in this regime, the cost of the algorithm (i.e., the number of steps needed, in stationarity, to explore the state space) is of the order N^{αγ}. Therefore, if we choose αγ < 1/3, we can potentially achieve an improvement with respect to standard MALA. This is in line with the results of [15, 16]. Furthermore, as will be demonstrated later on, in this regime one can choose α, γ and the matrix S^N such that the average acceptance probability tends to one. This can be achieved for example if the constant c_1 appearing in Assumption 5.2 is equal to zero. However, if we do so, then the constant h_p appearing in the limiting drift is zero. That is, the limiting behaviour of the chain is described by the trivial ODE ẋ = 0, see Table 2. This implies that, for large N, the chain will tend to get stuck. However, we also comment here that in the simulations that we performed we did not see the chain getting stuck when ẋ = 0, which may imply that one may need to go to very high dimensions to observe such issues.
• In regime iii), i.e., when α < 2 and γ ≥ 1/6, we get a deterministic ODE in the limit. As already mentioned, the target measure π is still an invariant measure for this ODE, see (3.2). This implies that the micro-canonical measure is being preserved on the level sets of the potential function {U = z}. This is reminiscent of Hamilton's ODEs in HMC, as one can potentially envision adding a velocity component and

thus constructing an HMC algorithm with the appropriate Hamiltonian function. We plan to pursue this in a subsequent work.

c_j ≠ 0 for j = 2, 3                      limiting dynamics                                              cost
γ = 1/6 (acc. prob. is O(1))
  α > 2          dx_t = −(ℓ²/2) h_p x_t dt + ℓ√(h_p) dW_t^C                                              N^{1/3}
  α = 2          dx_t = [−(ℓ²/2) h_p x_t + h_p ℓ² S̃x_t] dt + ℓ√(h_p) dW_t^C                             N^{1/3}
  1 ≤ α < 2      dx_t = h_p ℓ^α S̃x_t dt                                                                 N^{αγ} (< N^{1/3})
γ > 1/6 (acc. prob. is O(1))
  α ≥ 2          too costly                                                                              N^{αγ} (> N^{1/3})
  1 ≤ α < 2      dx_t = h_p ℓ^α S̃x_t dt                                                                 N^{αγ}
Table 1. Limiting dynamics and cost of the ipMALA algorithm in the various regimes.

Table 2. Summary of the limiting dynamics in the degenerate case in which the fluid limit reduces to dx_t = 0; see the second bullet of Remark 4.2 above.

is much larger than σ_N; so we expect that, asymptotically, the last term in (5.2) will be negligible. The same reasoning can be applied to the terms in (5.3). Therefore, irrespective of the choice of γ and α, we have the approximation
Q^N_α ≃ (σ_N^{2α}/2)(‖S̃^N x^N‖²_{C^N} − ‖S̃^N (C^N)^{1/2} z^N‖²_{C^N}) − 2σ_N^{α−1}⟨(C^N)^{1/2} z^N, S^N x^N⟩
      − 2σ_N^{2α−2}‖S̃^N x^N‖²_{C^N} − σ_N^{2α−1}⟨S̃^N (C^N)^{1/2} z^N, S̃^N x^N⟩_{C^N}
      − (σ_N^{4α−2}/2)‖(S̃^N)² x^N‖²_C − σ_N^{3α−1}⟨S̃^N (C^N)^{1/2} z^N, (S̃^N)² x^N⟩_C
      − σ_N^{3α−2}(1 − σ_N²/2)⟨S̃^N x^N, (S̃^N)² x^N⟩_{C^N}.   (5.6)

We further observe that, in stationarity (and again irrespective of the choice of γ, α and S^N) the term
(σ_N^{2α}/2)(‖S̃^N x^N‖²_{C^N} − ‖S̃^N (C^N)^{1/2} z^N‖²_{C^N})
is smaller than the term
−2σ_N^{2α−2}‖S̃^N x^N‖²_{C^N},
in the sense that it is always of lower order in N. Moreover, due to the skew-symmetry of S^N, the term (5.6) is identically zero:
⟨S̃^N x^N, (S̃^N)² x^N⟩_{C^N} = ⟨C^N S^N x^N, C^N S^N C^N S^N x^N⟩_{C^N} = ⟨C^N S^N x^N, S^N C^N S^N x^N⟩ = 0.

Therefore we can make the further approximation
Q^N_α ≃ −2σ_N^{α−1}⟨(C^N)^{1/2} z^N, S^N x^N⟩ − 2σ_N^{2α−2}‖S̃^N x^N‖²_{C^N}   (5.7)
      − (σ_N^{4α−2}/2)‖(S̃^N)² x^N‖²_{C^N} − σ_N^{2α−1}⟨S̃^N (C^N)^{1/2} z^N, S̃^N x^N⟩_{C^N}   (5.8)
      − σ_N^{3α−1}⟨S̃^N (C^N)^{1/2} z^N, (S̃^N)² x^N⟩_{C^N} =: R^N_α.   (5.9)
As a result of the above reasoning we then have the heuristic approximation
Q^N_α ≃ R^N_α.   (5.10)

At this point it is worth noticing the following:
E_{π^N}‖S̃^N x^N‖²_{C^N} = E_{π^N}‖(C^N)^{1/2} S^N x^N‖² = V_{S^N} = E_{π^N}⟨C^{1/2} z^N, S^N x^N⟩²,   (5.11)
E_{π^N}‖(S̃^N)² x^N‖²_{C^N} = V_{S^N S̃^N} = E_{π^N}⟨S̃^N (C^N)^{1/2} z^N, S̃^N x^N⟩²_{C^N},   (5.12)
where the first equality in (5.11) holds because
‖(C^N)^{1/2} S^N x^N‖² = ‖S̃^N x^N‖²_{C^N}   for all x,
and the others follow from (2.5). With the previous observations in mind, we now heuristically argue that the only sensible choice for γ is γ = 1/6 if α ≥ 2 and γ ≥ 1/6 if α < 2. In order to do so we look at the decomposition (5.1) of Q^N and recall the definition (3.5) of the acceptance probability; we then write
β^N = 1 ∧ (e^{Q̄^N} e^{Q^N_α}).
Let us start by looking at Q̄^N: if we start the chain in stationarity, i.e. x^N_0 ∼ π^N, then x^N_k ∼ π^N for every k ≥ 0. In particular, if x^N ∼ π^N then x^N can be represented as x^N = Σ_{i=1}^N λ_i ρ_i ϕ_i, where the ρ_i are i.i.d. N(0, 1). We can therefore use the law of large numbers and observe that
‖x^N‖²_{C^N} = Σ_{i=1}^N |ρ_i|² ≃ N.

Similarly, by the Central Limit Theorem (CLT), the term ⟨x^N, (C^N)^{1/2} z^N⟩_{C^N} is O(N^{1/2}) and, once divided by N^{1/2}, converges to a standard Gaussian as N → ∞. Again by the CLT we can see that also the term ‖x^N‖²_{C^N} − ‖z^N‖² is O(N^{1/2}) in stationarity. Therefore, recalling (3.4), one has
Q̄^N ≃ −(ℓ³/(4N^{3γ})) ⟨x^N, (C^N)^{1/2} z^N⟩_{C^N} − (ℓ⁶/(32N^{6γ})) ‖x^N‖²_{C^N}.
With these observations in place we can then argue the following:
• If γ = 1/6 then Q̄^N is O(1) (in particular, for large N it converges to a Gaussian with finite mean and variance, see (5.19)); therefore e^{Q̄^N} is O(1) as well.
• If γ < 1/6 then Q̄^N → −∞ (more precisely, E Q̄^N → −∞), hence e^{Q̄^N} → 0.
• If γ > 1/6 then Q̄^N → 0, hence e^{Q̄^N} → 1.
Let us now informally discuss the acceptance probability β^N in each of the above three cases, taking into account the behaviour of Q^N_α as well. To this end we point out that, by (5.11)-(5.12) and (5.7)-(5.9), as N → ∞, the (average, given x, of the) quantity Q^N_α can only tend to zero, go to −∞ or be O(1), but it cannot diverge to +∞. With this in mind,
a): If γ = 1/6 (that is, Q̄^N is O(1)) then one has the following two sub-cases
  a1): γ = 1/6 and either Q^N_α → 0 or Q^N_α is O(1). In this case the limiting acceptance probability does not degenerate to zero (and does not tend to one).
  a2): γ = 1/6 and Q^N_α → −∞. In this case β^N → 0.
b): If γ < 1/6 then e^{Q^N} can only tend to zero (whether Q^N_α → 0, tends to −∞ or is O(1)). This is not a good regime, as in this case β^N would overall tend to zero.
c): If γ > 1/6 we have three sub-cases
  c1): γ > 1/6 and Q^N_α is O(1). In this case β^N is O(1).
  c2): γ > 1/6 and Q^N_α → 0, in which case β^N → 1.
  c3): γ > 1/6 and Q^N_α → −∞, in which case β^N degenerates to zero.
We clearly want to rule out all the cases in which β^N tends to zero (because this implies that the chain is getting stuck). Therefore, if we want to "optimize" the behaviour of the chain, the only good choices are a1), c1) and, potentially, c2). However, considering that the cost of running the chain is either N^{2γ} (when α ≥ 2) or N^{αγ} (when α < 2), the case γ > 1/6 is always going to be sub-optimal if α ≥ 2; which is why in Theorem 4.1 we only consider γ > 1/6 if α < 2. We therefore want to make assumptions on S^N, γ and α which guarantee that our analysis falls only into one of the good regimes, i.e., which guarantee that Q^N_α is of order one (or tends to zero). Assumption 5.1 and Assumption 5.2 below are enforced in order to guarantee that this is indeed the case. To explain the nature of Assumption 5.1 below, we recall the approximation (5.10) and the expression for R^N_α, namely:
R^N_α = −2 (ℓ^{α−1}/N^{(α−1)γ}) ⟨(C^N)^{1/2} z^N, S^N x^N⟩ − 2 (ℓ^{2(α−1)}/N^{2(α−1)γ}) ‖S̃^N x^N‖²_{C^N}
      − (ℓ^{2(2α−1)}/(2N^{2γ(2α−1)})) ‖(S̃^N)² x^N‖²_C − (ℓ^{2α−1}/N^{(2α−1)γ}) ⟨S̃^N (C^N)^{1/2} z^N, S̃^N x^N⟩_C
      − (ℓ^{3α−1}/N^{(3α−1)γ}) ⟨S̃^N (C^N)^{1/2} z^N, (S̃^N)² x^N⟩_C.   (5.13)
Roughly speaking, Assumption 5.1 requires that a CLT should hold for all the scalar products in the above expression, while Assumption 5.2 requires a Law of Large Numbers (LLN) to hold for the remaining terms. We will first state these assumptions and then comment on them in Remark 5.5. We recall that S̃^N = C^N S^N so, since C^N is fixed by the problem and invertible, the following assumptions can be equivalently rephrased in terms of either S̃^N or S^N.

Assumption 5.1. The entries of the matrix S^N do not depend on N. There exist positive constants d_1, d_2, d_3 > 0 such that the sequence of matrices {S^N}_N (and the related {S̃^N}_N) satisfies the following conditions, for some α ≥ 1:
i)   (1/N^{(α−1)γ}) ⟨(C^N)^{1/2} z^N, S^N x^N⟩ −→^D N(0, d_1),
ii)  (1/N^{(2α−1)γ}) ⟨S̃^N (C^N)^{1/2} z^N, S̃^N x^N⟩_{C^N} −→^D N(0, d_2),
iii) (1/N^{(3α−1)γ}) ⟨S̃^N (C^N)^{1/2} z^N, (S̃^N)² x^N⟩_{C^N} −→^D N(0, d_3),

where −→^D denotes convergence in distribution.

Assumption 5.2. The parameter α ≥ 1 and the sequence of matrices {S^N}_N are such that
i)   lim_{N→∞} E_{π^N} | ‖S̃x^N‖²_C / N^{2(α−1)γ} − c_1 |² = 0,   (5.14)
ii)  lim_{N→∞} E_{π^N} | ‖S̃² x^N‖²_C / N^{2(2α−1)γ} − c_2 |² = 0,
iii) lim_{N→∞} E_{π^N} | ‖S̃^T S̃² x‖_C / N^{2(3α−1)γ} − c_3 |² = 0,
for some constants c_1, c_2, c_3 ≥ 0. We do not consider sequences {S^N} such that (5.14) is satisfied with c_1 = 0 and α = 1. That is, we consider sequences {S^N}_N such that (5.14) is satisfied with c_1 = 0 only in the cases in which this happens for α > 1, see Remark 5.5.
In view of (5.11)-(5.12), Assumption 5.2 implies the following three facts:
lim_{N→∞} E_{π^N} ‖S̃^N x^N‖²_{C^N} / N^{2(α−1)γ} = c_1,   (5.15)
lim_{N→∞} E_{π^N} ‖(S̃^N)² x^N‖²_{C^N} / N^{2(2α−1)γ} = c_2,   (5.16)
lim_{N→∞} E_{π^N} ‖(S̃^N)^T (S̃^N)² x^N‖_C / N^{2(3α−1)γ} = c_3.   (5.17)
Let us state a sufficient condition under which Assumption 5.1 holds.

Condition 5.3. Let z^N ∼ N(0, Id_N) and x^N ∼ π^N. The sequence of matrices {S^N}_N is such that for every h = 1, ..., N the σ-algebra generated by the random variable z^{h,N}(S^N x^N)_h is independent of the σ-algebra generated by the sum
Σ_{j=1}^{h−1} z^{j,N}(S^N x^N)_j.
In the above we stress that (S^N x^N) is a vector in R^N and (S^N x^N)_h is the h-th component of such a vector. In order for Assumption 5.1 to be satisfied, the same should hold for (S̃^N)^T S̃^N and for (S̃^N)^T (S̃^N)² as well. Notice that, because C^N is diagonal, if (S^N)² and (S^N)³ satisfy Condition 5.3, then also (S̃^N)^T (S̃^N) and (S̃^N)^T (S̃^N)² do. We give an example of a matrix satisfying all such conditions in Remark 5.5.
Finally, for technical reasons, linked merely to the proof of statement (ii) of Lemma B.2, we also assume the following.
Assumption 5.4. (i) Under π^N, the variables appearing in (5.15)-(5.17) have exponential moments, i.e., there exists a constant m, independent of N, such that
E_{π^N} exp(‖S̃^N x^N‖²_{C^N} / N^{2(α−1)γ}),  E_{π^N} exp(‖(S̃^N)² x^N‖²_{C^N} / N^{2(2α−1)γ}),  E_{π^N} exp(‖(S̃^N)^T (S̃^N)² x^N‖_C / N^{2(3α−1)γ}) < m.
(ii) There exists some r > 1 such that
E_{π^N} [ (Σ_{j=1}^N λ_j^6 ((S^N x^N)_j)^4) / N^{2γ(α−1)} ]^r < ∞, uniformly over N.

α > 2. Then the approximate drift of the chain is given by (4.3)-(4.4), where we take ζ = 2. We will prove that when α > 2 the addend (4.4) is asymptotically small. This happens because Q^N and z^N are asymptotically independent; in particular we will show that the correlations between Q^N and z^N decay faster than N^{−1/6}, that is
N^{1/6} E_k[(1 ∧ e^{Q^N}) C^{1/2} z^N_{k+1}] → 0
(the formal statement of the above heuristic is contained in (B.17) and (B.18)). Therefore
d^N_p(x^N_k) ≃ N^{1/3} E_k[(1 ∧ e^{Q^N})(−(ℓ²/(2N^{1/3})) x^N_k + (ℓ^α/N^{α/6}) S̃^N x^N_k)]
             = E_k[(1 ∧ e^{Q^N})](−(ℓ²/2) x^N_k) + E_k[(1 ∧ e^{Q^N})](ℓ^α/N^{(α−2)/6}) S̃^N x^N_k.   (5.25)
We use the definition (5.23) and the heuristic approximation (5.21) and observe that if α > 2 the second addend in (5.25) disappears in the limit; therefore the drift coefficient of the limiting diffusion is
d_p(x) = −(ℓ²/2) h_p x.
• If instead γ = 1/6 and α = 2, then the second addend in (5.25) does not disappear in the limit; on the contrary, with the same reasoning as in the previous case, it converges to h_p ℓ² S̃x. Moreover, the term in (4.4) is no longer asymptotically small and contributes to the limiting drift. The reason why this happens can be seen by recalling that Q^N = Q̄^N + Q^N_α: clearly, the correlations between Q̄^N and z^N are not affected by the choice of α; however, the value of α does affect the decay of the correlations between Q^N_α and z^N. In Appendix B we present a calculation which shows that (under appropriate conditions on S), if α = 2 then
N^{1/6} ℓ E_k[(1 ∧ e^{Q^N}) C^{1/2} z^N_{k+1}] ≃ −2ℓ² h̄_p S̃x_k,
where h̄_p is a real constant, namely h̄_p := E[e^Q 1_{Q<0}] with k > 1/2. Then, we can compute E_{π^N}‖S̃_3 x^N‖²_C =

N X

λ2i

i=1

=2

N X i=1

N X j=1

(Sij )2 λ2j =

X

i odd

i−2k |Si,i+1|2 (i + 1)−2k +

(2i − 1)−2k Ji2 (2i)−2k 19

X

i even

i−2k |Si,i−1 |2 (i − 1)−2k

In order to make sure that c1 6= 0, let us choose Ji = (2i − 1)k (2i)k i[2(α−1)γ−1]/2 with P 2(α−1)γ−1 , which is the partial sum of the α > 1. Then, we have that EπN kS˜3 xN k2C = 2 N i=1 i Riemann zeta function. P −s Let us recall that for s 6= 1 and ζN (s) = N one has the asymptotic Euler-Maclaurin i=1 i formula     X m 1 1 d(2r−1) 1 B2r d(2r−1) 1 1 dx − −1 + − ζN (s) = xs 2 (N + 1)s r! dx(2r−1) xs x=N +1 dx(2r−1) xs x=1 1 r=1 Z N +1 1 d2m 1 − B2m (x − xxy) 2m s dx (2m)! 1 dx x Z

N +1

where Bn are Bernoulli numbers and Bn (x) are Bernoulli polynomials. Evaluating now this at m = 1 we get the expression      1 1 1 1 s 1−s ζN (s) = 1− (N + 1) − 1 − −1 + 1−s 2 (N + 1)s 12 (N + 1)s+1 Z N +1 1 s(s + 1) B2 (x − xxy) s+2 dx (6.1) − 2 x 1 Using the Euler-Maclaurin formula we then compute by (6.1) with s = 1 − 2(α − 1)γ 2 kS˜3 xN k2 c1 = lim EπN 2(α−1)γC = lim EπN N →∞ N →∞ N

PN

2(α−1)γ−1 i=1 i N 2(α−1)γ

=

2 1 = . 1 − (1 − 2(α − 1)γ) (α − 1)γ

Next we compute c2 for this particular choice of the S matrix. We have

EπN kS˜32 xN k2C =

N X

λ2i

i=1

=2

j=1

N X

i odd

=2

N N N X X X 4 2 2 2 2 2 ˜ ˜ λ4i λ4i+1 J(i+1)/2 ((S S)ij ) λj = λi ((S S)ii ) λi = 2

N X

i=1

i−4k (i + 1)−4k (2(i + 1)/2 − 1)4k (2(i + 1)/2)4k −4k

i

(i + 1)

−4k

4k

(i) (i + 1)

i odd

2[2(α−1)γ−1] N  X i+1 =2 2 i odd =2

N X

i odd

i2[2(α−1)γ−1]

i=1

20

4k



i+1 2



2[2(α−1)γ−1]

i+1 2

2[2(α−1)γ−1]

As with the computation for c1 the latter is related to the partial sum of the Riemann zeta function. Using now (6.1) with s = −2[2(α − 1)γ − 1] P 4(α−1)γ−2 2 N kS˜32 xN k2C i=1 i c2 = lim EπN 2(2α−1)γ = lim N →∞ N →∞ N N 2(2α−1)γ (N + 1)1+2[2(α−1)γ−1] − 1 1 = lim 2 2(2α−1)γ N →∞ N 1 + 2[2(α − 1)γ − 1] =0 As far as c3 is concerned, we obtain by a computation similar to the computation for c2 that kS˜3T S˜32 xkC c3 = lim EπN 2(3α−1)γ =0 N →∞ N For the same reasons as in Remark 5.5 we also have di = ci for i = 1, 2, 3. Hence, 1 we have obtained that in this example d1 = c1 = (α−1)γ and d2 = c2 = d3 = c3 = 0.   6 6 ℓ 1 ℓ − a, 16 + b in (5.22), with a = 2ℓ2(α−1) (α−1)γ and Then, this implies that Q ∼ N − 32

1 . Hence, in this case b = 2a and thus hp = 0. b = 4ℓ2(α−1) (α−1)γ Next we discuss optimal acceptance probabilities. Since in this case b = 2a which then implies hp = 0, we can answer this question in the case of regimes i), i.e., α > 2 and ii) α = 2. In these cases we can find the ℓ that maximized the speed of the limiting diffusion, i.e., ℓ2 hp and then for the maximizing ℓ compute the limiting acceptance probability hp as given by (5.24). In Table 3 we record optimal acceptance probabilities hp derived in (5.24) for different choices for α when γ = 1/6 in the case the matrix S being S3 as defined above.

α            2      4      6      8      10     15     30
optimal h_p  0.234  0.574  0.702  0.767  0.803  0.848  0.884
Table 3. Optimal acceptance probability when S^N = S_3^N for different α's when γ = 1/6.

In particular, Table 3 shows that by increasing α, we can increase the optimal acceptance probability as desired. In Section 7 we present simulation studies based on using S_1^N, S_2^N and S_3^N confirming and illustrating the theoretical results.

7. Simulation studies
The goal of this section is to illustrate the theoretical findings of this paper via simulation studies. Upon close inspection of Theorem 4.1 it becomes apparent that closed form expressions of quantities such as optimal acceptance probabilities and optimal scalings are hard to obtain and depend on the choice of the sequence S^N. For this reason we will resort to simulations. As a measure of performance we choose, similarly to [17, 18], the expected squared jumping distance (ESJD) of the algorithm, defined as
ESJD = E|X^1_{k+1} − X^1_k|².

It is easy to see that maximizing the ESJD is equivalent to minimizing the first order lag autocorrelation function, ρ_1 = Corr_π(X_{k+1}, X_k). Hence, motivated by [18] and based on Theorem 4.1, we define the computational time as
CT_p = −N^{ζγ} / log ρ_1,   (7.1)
where ζ = 2 if α ≥ 2 and ζ = α if α < 2. We seek to minimize CT_p. A word of caution is necessary here. Even though CT_p as a measure of performance is easily justifiable in the case that one has a diffusion limit, i.e., when α ≥ 2, the situation is more complex for α < 2. By Theorem 4.1, if α < 2 the algorithm converges to a deterministic ODE limit and not to a diffusion limit. Hence, in this case the absolute jumping distance |X^1_{k+1} − X^1_k| may make more sense as a measure of performance in the limit as N → ∞. However, in the deterministic case, maximizing the absolute jumping distance or the squared jumping distance are equivalent tasks. Hence, working with the acceptance probability and with the ESJD, and as a consequence with CT_p as defined in (7.1), are reasonable criteria for quantifying convergence.
The simulations presented below were performed using an MPI C parallel code. The numerical results that follow report sample averages and sample standard deviations of quantities such as the acceptance probability and different observables. The reported values were obtained based on 1600 independent repetitions of the MALA and ipMALA algorithms with 10^4 steps for each independent realization. For each independent realization of the algorithm, the numerical values selected are the ones corresponding to minimizing CT_p. We also remark here that in all of the simulation studies CT_p was a convex function of the acceptance probability (this resembles the behavior in the standard MALA case, see [18]). Below we report the statistical estimator and the corresponding empirical standard deviation for the optimal acceptance probability. In addition, to demonstrate convergence to equilibrium, we also computed a few observables. We report below statistical estimators along with their corresponding statistical standard deviations for
θ_2 = E_{π^N} ‖X‖²   and   θ_3 = E_{π^N} Σ_{i=1}^N X_i³,
where with some abuse of notation we have denoted by X_i the i-th component of the vector X. First we record simulations for the antisymmetric matrix S^N = S_1^N, defined in Section 6. We remark here that we also ran the same simulations with S^N = S_2^N and the numerical results were statistically the same; so we only report the ones for S^N = S_1^N, see Tables 4-6. In the case of S^N = S_1^N, we have that c_i = d_i = 0 and thus the optimal acceptance probability in the limit as N → ∞ is the same as for standard MALA, i.e. around 0.574. However, as Tables 4-6 demonstrate, when γ = 1/6 (the optimal choice for standard MALA) this is realized for much larger values of N for ipMALA compared to MALA. On the other hand, if we take γ > 1/6 (which is not optimal for standard MALA), e.g. γ = 1/2 in Table 6, and focus on the fluid regime iii), we notice that CT_p is estimated to be smaller for ipMALA compared to MALA, indicating potential advantages in the computational time of exploration of the space.
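For completeness, the performance measure (7.1) is straightforward to estimate from a stored trajectory. The following is only a minimal sketch (assuming NumPy and using the first coordinate of the chain, as in the ESJD definition above); it is not the MPI code used for the reported experiments:

import numpy as np

def performance(chain_first_coord, N, zeta, gamma):
    """Estimate the ESJD, the lag-1 autocorrelation rho_1 and CT_p of (7.1)
    from the first coordinate X_k^1 of a stationary chain."""
    x = np.asarray(chain_first_coord)
    esjd = np.mean(np.diff(x) ** 2)                 # E|X_{k+1}^1 - X_k^1|^2
    rho1 = np.corrcoef(x[:-1], x[1:])[0, 1]         # Corr_pi(X_{k+1}, X_k)
    ct_p = -N ** (zeta * gamma) / np.log(rho1)      # CT_p = -N^{zeta*gamma} / log(rho_1)
    return esjd, rho1, ct_p

In practice one would average such estimates over many independent repetitions, as is done for the figures reported in Tables 4-9 below.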

γ = 1/6 standard MALA α = 1.2 α = 1.5 α = 1.8 α = 2.0 α = 2.5 α = 3.0 γ = 1/2 standard MALA α = 1.2 α = 1.5 α = 1.8 α = 2.0 α = 2.5 α = 3.0

\ (b hp , sd(h p )) (0.537, 0.043) (0.464, 0.039) (0.472, 0.041) (0.481, 0.041) (0.488, 0.040) (0.507, 0.043) (0.535, 0.044)

\ (θb2 , sd(θ 2 )) (1.071, 0.022) (1.064, 0.025) (1.063, 0.025) (1.061, 0.025) (1.061, 0.025) (1.059, 0.024) (1.061, 0.023)

\ (θb3 , sd(θ 3 )) (−0.0002, 0.062) (0.004, 0.071) (−0.0016, 0.073) (−0.0015, 0.074) (0.0005, 0.077) (−0.0041, 0.076) (0.00068, 0.077)

dp CT 0.725 1.289 1.175 1.070 1.005 1.038 1.063

(0.753, 0.008) (0.646, 0.008) (0.649, 0.008) (0.653, 0.009) (0.655, 0.009) (0.662, 0.011) (0.668, 0.012)

(1.071, 0.020) (1.065, 0.024) (1.066, 0.024) (1.063, 0.024) (1.064, 0.024) (1.063, 0.023) (1.064, 0.023)

(−0.0006, 0.066) (−0.0025, 0.082) (0.0025, 0.082) (−0.0011, 0.081) (0.0010, 0.083) (0.0011, 0.084) (0.00317, 0.078)

0.427 0.663 0.466 0.327 0.556 0.549 0.541

Table 4. Dimension N = 10 and S N = S1N .

γ = 1/6 standard MALA α = 1.2 α = 1.5 α = 1.8 α = 2.0 α = 2.5 α = 3.0 γ = 1/2 standard MALA α = 1.2 α = 1.5 α = 1.8 α = 2.0 α = 2.5 α = 3.0

\ (b hp , sd(h p )) (0.538, 0.052) (0.463, 0.047) (0.476, 0.048) (0.489, 0.049) (0.496, 0.052) (0.518, 0.051) (0.528, 0.053)

\ (θb2 , sd(θ 2 )) (1.054, 0.025) (1.046, 0.028) (1.046, 0.027) (1.046, 0.027) (1.046, 0.031) (1.046, 0.026) (1.047, 0.028)

\ (θb3 , sd(θ 3 )) (−0.0035, 0.086) (−0.0023, 0.100) (−0.0021, 0.097) (0.0016, 0.096) (0.0017, 0.094) (−0.0028, 0.096) (0.00031, 0.093)

dp CT 0.900 1.826 1.485 1.209 1.053 1.037 1.021

(0.949, 0.003) (0.788, 0.006) (0.832, 0.005) (0.865, 0.005) (0.883, 0.005) (0.914, 0.004) (0.931, 0.003)

(1.036, 0.038) (1.021, 0.043) (1.023, 0.041) (1.028, 0.041) (1.030, 0.041) (1.031, 0.038) (1.034, 0.037)

(−0.0015, 0.144) (−0.0007, 0.169) (0.0008, 0.166) (−0.0043, 0.158) (−0.0021, 0.158) (−0.0001, 0.153) (0.0002, 0.151)

0.776 1.326 0.684 0.361 0.873 0.828 0.803

Table 5. Dimension N = 50 and S N = S1N .

Let us next turn our attention to the behavior of the algorithm when S^N = S_3^N. In this case c_1 = 1/((α−1)γ) ≠ 0 but, as in the case of S_1^N or S_2^N, we still have that h̄_p = 0. However, as we saw in Table 3, the effect of the value of α on the optimal acceptance probabilities is significant. Let us demonstrate the behavior via a number of simulation studies, reported in Tables 7-9, as we did for the case of S^N = S_1^N. As we described in Section 6, such a computation makes sense mainly in the case α ≥ 2 and γ = 1/6. Of course one can compute via simulation empirical values for any value of α > 1 and γ > 0. The acceptance probabilities in the case α ∈ (1, 2) were very small, leading to estimators with large variances. Hence, we only report values for larger values of α, where one may expect to have results that are comparable to what standard MALA gives. In order to be consistent with Tables 4-6, we also present data for γ = 1/2.

γ = 1/6 standard MALA α = 1.2 α = 1.5 α = 1.8 α = 2.0 α = 2.5 α = 3.0 γ = 1/2 standard MALA α = 1.2 α = 1.5 α = 1.8 α = 2.0 α = 2.5 α = 3.0

\ (b hp , sd(h p )) (0.549, 0.054) (0.473, 0.049) (0.484, 0.054) (0.498, 0.054) (0.507, 0.056) (0.523, 0.054) (0.533, 0.057)

\ (θb2 , sd(θ 2 )) (1.045, 0.027) (1.037, 0.029) (1.036, 0.034) (1.037, 0.036) (1.041, 0.084) (1.041, 0.027) (1.039, 0.034)

\ (θb3 , sd(θ 3 )) (0.0015, 0.102) (−0.0012, 0.115) (−0.0028, 0.111) (−0.0021, 0.109) (0.0074, 0.235) (−0.0018, 0.106) (0.00019, 0.104)

dp CT 0.955 2.080 1.623 1.268 1.076 1.052 1.031

(0.975, 0.002) (0.809, 0.006) (0.865, 0.005) (0.904, 0.004) (0.922, 0.004) (0.952, 0.003) (0.966, 0.002)

(1.006, 0.050) (0.981, 0.055) (0.989, 0.053) (0.993, 0.054) (0.997, 0.049) (1.002, 0.049) (1.002, 0.049)

(0.0103, 0.206) (0.0039, 0.228) (0.0138, 0.216) (−0.0066, 0.213) (0.0004, 0.215) (0.0018, 0.211) (−0.0016, 0.204)

0.970 1.705 0.785 0.371 1.055 1.009 0.985

Table 6. Dimension N = 100 and S N = S1N .

γ = 1/6 standard MALA α=2 α=4 α=6 α=8 α = 10 α = 15 α = 30 γ = 1/2 standard MALA α=2 α=4 α=6 α=8 α = 10 α = 15 α = 30

\ (b hp , sd(h p )) (0.541, 0.043) (0.229, 0.056) (0.566, 0.065) (0.686, 0.063) (0.751, 0.051) (0.784, 0.052) (0.821, 0.045) (0.855, 0.026)

\ (θb2 , sd(θ 2 )) (1.073, 0.022) (0.935, 0.081) (1.026, 0.034) (1.045, 0.072) (1.051, 0.026) (1.056, 0.061) (1.059, 0.031) (1.069, 0.025)

\ (θb3 , sd(θ 3 )) (−0.001, 0.061) (−0.001, 0.211) (0.006, 0.141) (0.003, 0.101) (0.0001, 0.101) (0.004, 0.172) (−0.002, 0.096) (−0.003, 0.087)

dp CT 0.725 8.139 3.367 1.946 1.418 1.151 0.951 0.798

(0.753, 0.008) (0.205, 0.066) (0.555, 0.092) (0.691, 0.083) (0.761, 0.068) (0.808, 0.064) (0.866, 0.054) (0.924, 0.046)

(1.072, 0.020) (0.869, 0.118) (0.979, 0.087) (0.997, 0.056) (1.001, 0.038) (1.015, 0.036) (1.023, 0.032) (1.038, 0.067)

(−0.002, 0.068) (0.0002, 0.339) (−0.012, 0.242) (0.002, 0.174) (−0.008, 0.166) (0.002, 0.168) (−0.0003, 0.156) (−0.002, 0.206)

0.198 9.419 2.022 1.411 1.232 1.051 0.551 0.213

Table 7. Dimension N = 10 and S N = S3N .

Appendix A. Proof of Theorem 4.1 In this section we present the proof of our main results. The proof is based on diffusion approximation techniques analogous to those used in [12]. In [12] the authors consider the MALA algorithm with reversible proposal. That is, if we fix S = 0 in our paper and Ψ = 0 in their paper, the algorithms we consider coincide. For this reason we try to adopt a notation as similar as possible to the one used in [12] and, for the sake of brevity, we detail only the parts of the proof that differ from the work [12] and just sketch the rest. We start by recalling that, by Fernique’s theorem, EπN kxN kp = Ek(C N )1/2 z N kp . 1,

for all p ≥ 1.

This fact will be often implicitly used without mention in the remainder of the paper. 24

(A.1)

γ = 1/6 standard MALA α=2 α=4 α=6 α=8 α = 10 α = 15 α = 30 γ = 1/2 standard MALA α=2 α=4 α=6 α=8 α = 10 α = 15 α = 30

\ (b hp , sd(h p )) (0.538, 0.052) (0.216, 0.077) (0.563, 0.074) (0.689, 0.068) (0.748, 0.064) (0.784, 0.063) (0.824, 0.045) (0.849, 0.035)

\ (θb2 , sd(θ 2 )) (1.054, 0.025) (0.882, 0.177) (1.004, 0.041) (1.026, 0.035) (1.036, 0.042) (1.048, 0.098) (1.048, 0.031) (1.054, 0.025)

\ (θb3 , sd(θ 3 )) (0.004, 0.089) (−0.035, 0.451) (0.004, 0.180) (−0.009, 0.141) (0.003, 0.157) (0.006, 0.181) (−0.005, 0.123) (−0.0002, 0.114)

dp CT 0.912 8.453 3.143 2.024 1.512 1.269 0.991 0.843

(0.849, 0.002) (0.222, 0.102) (0.564, 0.013) (0.699, 0.023) (0.774, 0.071) (0.816, 0.034) (0.873, 0.088) (0.932, 0.073)

(1.038, 0.037) (0.681, 0.111) (0.837, 0.107) (0.887, 0.139) (1.001, 0.019) (0.967, 0.056) (0.948, 0.101) (0.958, 0.071)

(0.0004, 0.147) (0.025, 0.437) (0.008, 0.378) (0.003, 0.272) (0.005, 0.267) (−0.002, 0.263) (0.008, 0.264) (−0.001, 0.114)

0.211 3.396 1.696 1.405 1.269 1.101 0.781 0.583

Table 8. Dimension N = 50 and S N = S3N .

γ = 1/6 standard MALA α=2 α=4 α=6 α=8 α = 10 α = 15 α = 30 γ = 1/2 standard MALA α=2 α=4 α=6 α=8 α = 10 α = 15 α = 30

\ (b hp , sd(h p )) (0.538, 0.057) (0.225, 0.071) (0.559, 0.083) (0.688, 0.073) (0.749, 0.068) (0.784, 0.059) (0.823, 0.051) (0.851, 0.034)

\ (θb2 , sd(θ 2 )) (1.047, 0.036) (0.891, 0.178) (1.003, 0.039) (1.019, 0.047) (1.032, 0.036) (1.034, 0.038) (1.043, 0.038) (1.053, 0.033)

\ (θb3 , sd(θ 3 )) (0.001, 0.093) (0.003, 0.352) (−0.008, 0.180) (−0.001, 0.168) (−0.006, 0.155) (−0.001, 0.144) (−0.008, 0.131) (−0.0016, 0.124)

dp CT 0.938 7.654 3.467 2.012 1.488 1.245 0.981 0.942

(0.948, 0.002) (0.221, 0.101) (0.562, 0.015) (0.696, 0.034) (0.771, 0.082) (0.817, 0.067) (0.873, 0.093) (0.937, 0.075)

(1.019, 0.045) (0.782, 0.113) (0.843, 0.139) (0.841, 0.121) (0.951, 0.082) (0.989, 0.063) (1.003, 0.102) (0.908, 0.101)

(0.0001, 0.183) (0.015, 0.357) (0.012, 0.313) (0.001, 0.275) (−0.001, 0.217) (−0.003, 0.311) (−0.002, 0.235) (−0.013, 0.111)

0.211 3.145 1.551 1.299 1.187 1.127 0.967 0.486

Table 9. Dimension N = 100 and S N = S3N .

We also recall that the chain {xN k }k that we consider is defined in (3.6); the drift-martingale decomposition of such a chain is given in equation (4.1) . Let us start by recalling the definition of the continuous interpolant of the chain, equations (4.5)-(4.6), and by introducing the piecewise constant interpolant of the chain xN k , that is, x¯(N ) (t) = xN k

tk ≤ t < tk+1 ,

where tk = k/N ζγ . It is easy to see (see e.g. [13, Appendix A]) that Z t (N ) N x (t) = x0 + dN x(N ) (v))dv + wpN (t), p (¯ 0

25

(A.2)

where wpN (t) For any t ∈ [0, T ], we set wˆpN (t)

:= = +

1

:=

Z

Z

Z

0 t

0 t

0

N ζγ/2

k−1 X j=1

MjN + N ζγ/2 (t − tk )MkN .

t

 N (N )  dp (¯ x (v)) − dp (x(N ) (v)) dv + wpN (t)





 dN x(N ) (v)) − dp (¯ x(N ) (v)) dv p (¯

 dp (¯ x(N ) (v)) − dp (x(N ) (v)) dv + wpN (t).

With the above notation, we can then rewrite (A.2) as Z t (N ) N x (t) = x0 + dp (x(N ) (v))dv + wˆpN (t).

(A.3)

(A.4)

0

Let now C([0, T ]; H) denote the space of H-valued continuous functions, endowed with the uniform topology and consider the map I : H × C([0, T ]; H) −→ C([0, T ]; H) (x0 , η(t)) −→ x(t).

That is, I is the map that to every (x0 , η(t)) ∈ H × C([0, T ]; H) associates the (unique solution) of the equation Z t x(t) = x0 + dp (x(s))ds + η(t). (A.5) 0

(N )

I(xN ˆpN ). 0 ,w

From (A.4) it is clear that x = Notice that, under our continuity assumption ˜ on S, I is a continuous map. Therefore, in order to prove that x(N ) (t) converges weakly to x(t) (where x(t) is the solution of equation (A.5) with η(t) = Dp W C (t)), by the continuous mapping theorem we only need to prove that wˆpN converges weakly to Dp W C (t), where W C (t) is a H-valued C-Brownian motion. The weak convergence of wˆpN to Dp W C (t) is a consequence of (A.3) and of Lemmata A.1 and A.2. Then, we get the statement of Theorem 4.1. The proof of Lemma A.1 and Lemma A.2 is contained in the remained of this Appendix. Lemma A.1. Under Assumption 5.1, Assumption 5.2 and Condition 5.3, the following holds Z T EπN kdN x(N ) (t)) − dp (¯ x(N ) (t))kdt −→ 0, as N → ∞ (A.6) p (¯ 0

and

EπN

Z

0

T

kdp (¯ xN (t)) − dp (x(N ) (t))kdt −→ 0,

as N → ∞,

(A.7)

where the function dp (x) has been defined in the statement of Theorem 4.1. Lemma A.2. If Assumption 5.1, Assumption 5.2 and Condition 5.3 hold, then wpN (t) converges weakly in C([0, T ]; H) to Dp W C (t), where W C (t) is a H-valued C-Brownian motion and the constant Dp has been defined in the statement of Theorem 4.1. 26

Proof of Lemma A.1. We start by proving (A.7), which is simpler. The drift coefficient dp is globally Lipshitz; therefore, using (3.6), (3.3) and (3.4), if tk ≤ t < tk+1 , we have EπN kdp(¯ x(N ) (t)) − dp (x(N ) (t))k . EπN k¯ x(N ) (t) − x(N ) (t)k N N N . (N ζγ t − k) EπN kxN k+1 − xk k . Eπ N kyk+1 − xk k   1 1 1 N . + αγ EπN kxN Ek(C N )1/2 zk+1 k → 0. k k+ 2γ N N Nγ

Let us now come to the proof of (A.6). From (4.3)-(4.4), we have N N N dN p (x) − dp (x) = A1 + A2 + A3 − dp (x),

where

  ℓ2 N := N Ex (1 ∧ e ) − 2γ x 2N i h N (ζ−α)γ α Q N ˜N N A2 := N ℓ Ex (1 ∧ e )S x h i 1/2 N (B.17) (ζ−1)γ QN AN := N ℓE (1 ∧ e )C z = N (ζ−1)γ ℓǫN x 3 p (x). N AN 1



ζγ

QN

(A.8)

(A.9)

We split the function dp (x) in three (corresponding) parts:

dp (x) = A1 (x) + A2 (x) + A3 (x), with



2

− ℓ2 hp x if γ = 1/6 and α ≥ 2 0 otherwise  ˜ if γ ≥ 1/6 and α ≤ 2 hp ℓα Sx A2 := 0 otherwise  ˜ if γ ≥ 1/6 and 1 ≤ α ≤ 2 −2hp ℓα Sx A3 := 0 otherwise We therefore need to consecutively estimate the above three terms. • AN 1 − A1 : if α ≥ 2 (and γ = 1/6) we fix ζ = 2 and we have h   i QN N − h kxk + EπN kxN − xk EπN kAN − A k . E E 1 ∧ e x p 1 π 1 (A.1)  2 1/2 . EπN hN (x) − h + EπN kxN − xk −→ 0, p p A1 :=

(A.10)

as the first addend tends to zero by Lemma B.2 and the second addend tends to zero by definition (see also [14, equation (4.3)]). If α < 2 then ζ = α and we have N

(α−2)γ EπN kAN EπN kEx (1 ∧ eQ )xN k . N (α−2)γ → 0 . 1 k . N

(A.11)

• AN 2 − A2 : if α ≤ 2 then, recalling (4.5), a calculation analogous to the one in (A.10) gives the statement. If α > 2 then we can act as in (A.11). • AN 3 − A3 : by Lemma B.2 (and (4.5)), we have This concludes the proof.

EπN kAN 3 − A3 k → 0 as N → ∞. 27



Proof of Lemma A.2. The calculations here are standard so we only prove it for the case γ = 1/6 and α > 2. Let us recall the martingale difference given by (5.29). In the case 2 γ = 1/6 and α > 2, we have ζ = 2 and dp (x) = − ℓ2 hp x. Hence, the expression (5.29) becomes       −1/6 N N N −1/6 N 1 ˜ ˜ N + ℓβ˜N C 1/2 z N dp (xN dp (xk ) + N (1−α)/6 ℓα β˜N Sx Mk = N β k ) k+1 − N k hp     h i N ˜ N + N −1/6 1 hp dN (xN ) − β˜N dp (xN ) + N −(α−1)/6 ℓα β˜N Sx = ℓβ˜N C 1/2 zk+1 k p k k hp 2 By Lemma A.1, we have that EπN kdN p (x) − dp (x)k −→ 0 as N → ∞. This implies that 2 ˜N N −1/3 EπN khp dN p (x) − β dp (x)k → 0.

At the same time, we also notice that ˜ 2 . N −(α−1)/3 ℓ2α EπN kxk2 −→ 0 if α > 1. N −(α−1)/3 ℓ2α EπN kβ˜N Sxk   Hence, if we define MN (x) = Ex MkN ⊗ MkN |xN k = x , we obtain that up to a constant i h N 1/2 2 ˜ EπN Tr(MN (x)) − Ex kℓβ C zk ≤ N −1/3

Then, as in Lemma 4.8 of [14] we obtain that h i EπN ℓ2 hp Tr(C) − Ex kℓβ˜N C 1/2 zk2 → 0, as N → ∞

which then immediately implies that EπN ℓ2 hp Tr(C) − Tr(MN (x)) → 0, as N → ∞.

The latter result implies that the invariance principle of Proposition 5.1 of [1] holds, which then imply the statement of the lemma.  Appendix B. Auxiliary estimates Before starting the proof of Lemma B.2 we recall decompose QN (which has been defined in (5.1)-(5.5)) as follows: let ℓ6 ℓ3 −a− hx, C 1/2 z N iC 32 4N 3γ ℓ(2α−1) ℓα−1 − 2 (α−1)γ h(C N )1/2 z N , S N xN i − (2α−1)γ hS˜N (C N )1/2 z N , S˜N xN iC N N N (3α−1) ℓ − (3α−1)γ hS˜N (C N )1/2 z N , (S˜N )2 xN iC N . N

Z N := −

(B.1)

Then QN = Z N + eN ⋆ , 28

(B.2)

where ℓ6 ℓ6 − kxN k2C N 32 32 N 6γ ℓ2(2α−1) ℓ2(α−1) k(S˜N )2 xN k2C N + a − 2 (α−1)2γ kS˜N xN k2C N − 2γ(2α−1) N 2N + iN (x, z) + eN (x, z),

eN ⋆ :=

(B.3)

with N iN (x, z) := iN 1 (x, z) + i2 (x, z),

having defined  ℓ4 N 2 N 2 kx k N − kz k C 8N 4γ  ℓ2α  ˜N N 2 ˜N (C N )1/2 z N k2 N , k S x k − k S iN (x, z) := N 2 C C 2N 2αγ iN 1 (x, z) :=

(B.4) (B.5)

and ℓ2(α+1) ℓ5 N N 1/2 N N − hx , (C ) z i kS˜N xN k2C C 8N 5γ 4N 2(1+α)γ ℓ3+α ℓ(2α+1) ˜ N 1/2 N ˜N N N 1/2 N N N − h(C ) z , S x i + hS(C ) z , S x iC N . 4N (3+α)γ 2N (2α+1)γ

eN :=

(B.6)

Finally, we set e˜N (x, z) := eN +

ℓ2(α+1) kS˜N xN k2C N . 2(1+α)γ 4N

(B.7)

That is, e˜N contains only the addends of eN that depend on the noise z. Furthermore, we split QN (x, z) into the terms that contain z j,N and the terms that don’t, N Qj and QN j,⊥ , respectively; that is N QN = QN j + Qj,⊥ ,

where N N QN ˜N + (iN j := e 1 )j + (i2 )j + Zj ,

(B.8)

N N N N N having denoted by (iN 1 )j , (i2 )j and Zj , the part of i1 , i2 and Z , respectively, that depend on z j,N .

29

Lemma B.1. Let Assumption 5.1, Assumption 5.2 and Condition 5.3 hold; then, 1 1 + 4γ , for all α ≥ 1, γ ≥ 1/6 2/3 N N N X 2 −→ 0, as N → ∞ λ2j (iN 1 )j

2 i) EπN e˜N .

ii) N 1/3 EπN

(B.9)

j=1

2 . 1 , for all α ≥ 1, γ ≥ 1/6 iii) EπN iN 2 N 4γ N X 2 1/3 N λ2j ZjN −→ 0, as N → ∞, for all α > 2, γ = 1/6 iv) N Eπ

(B.10)

j=1

2 1 1 for all α ≥ 1, γ ≥ 1/6 (B.11) v) EπN eN . 2/3 + 4γ , N N 2 vi) EπN (Varx Z N )1/2 − (VarQ)1/2 −→ 0, as N → ∞, for all α ≥ 1, γ ≥ 1/6 . (B.12)

Proof of Lemma B.1. Recall that, under $\pi^N$, $x^{i,N} \sim \lambda_i \rho^i$, where $\{\rho^i\}_{i\in\mathbb N}$ are i.i.d. standard Gaussians. We now prove the statements of the lemma in turn.

Proof of i). Notice that
$$
E_{\pi^N}\Big|\langle x^N, (C^N)^{1/2} z^N\rangle_{C^N}\Big|^2 = E\Big|\sum_{i=1}^N \rho^i z^{i,N}\Big|^2 = N .
$$
Therefore, since $\gamma \ge 1/6$,
$$
E_{\pi^N}\Big|\frac{\ell^5}{8N^{5\gamma}}\langle x^N, (C^N)^{1/2} z^N\rangle_{C^N}\Big|^2 \;\lesssim\; N^{-2/3}. \qquad (B.13)
$$
Furthermore, using (5.15) (which follows from point i) of Assumption 5.2) and (5.11), we have
$$
\frac{1}{N^{(3+\alpha)2\gamma}}\, E_{\pi^N}\Big|\langle (C^N)^{1/2} z^N, S^N x^N\rangle\Big|^2 \;\lesssim\; N^{-8\gamma}; \qquad (B.14)
$$
similarly, using (5.16) (which follows from point ii) of Assumption 5.2) and (5.12),
$$
\frac{1}{N^{(2\alpha+1)2\gamma}}\, E_{\pi^N}\Big|\langle \tilde S (C^N)^{1/2} z^N, \tilde S^N x^N\rangle_C\Big|^2 \;\lesssim\; N^{-4\gamma}. \qquad (B.15)
$$
Now the first statement of the lemma is a consequence of (B.6)–(B.7) and of (B.13), (B.14) and (B.15) above.

Proof of ii). This is proved in [14], see the calculations after [14, equation (4.18)], so we omit it.

Proof of iii). This estimate follows again from Assumption 5.2, once we observe that if $x \sim \pi^N$ then $\|\tilde S^N x^N\|^2_{C^N}$ and $\|\tilde S^N (C^N)^{1/2} z^N\|^2_{C^N}$ are two independent random variables with the same distribution. With this observation in place, we have
$$
E_{\pi^N}\Big|\frac{\|\tilde S x^N\|^2_C - \|\tilde S C^{1/2} z^N\|^2_C}{N^{2\alpha\gamma}}\Big|^2
= E_{\pi^N}\Big|\frac{\|\tilde S x^N\|^2_C}{N^{2\alpha\gamma}} - \frac{c_1}{N^{2\gamma}} + \frac{c_1}{N^{2\gamma}} - \frac{\|\tilde S C^{1/2} z^N\|^2_C}{N^{2\alpha\gamma}}\Big|^2
\;\lesssim\; E_{\pi^N}\Big|\frac{\|\tilde S x^N\|^2_C}{N^{2\alpha\gamma}} - \frac{c_1}{N^{2\gamma}}\Big|^2
= N^{-4\gamma}\, E_{\pi^N}\Big|\frac{\|\tilde S x^N\|^2_C}{N^{2(\alpha-1)\gamma}} - c_1\Big|^2,
$$
which gives the claim.

Proof of iv). We recall that $Z_j^N$ has been introduced in (B.8). Using the antisymmetry of $S^N$ and the definition of $\tilde S^N$, we have
$$
-\langle \tilde S^N (C^N)^{1/2} z^N, \tilde S^N x^N\rangle_C = \langle (C^N)^{1/2} z^N, S^N \tilde S^N x^N\rangle
\quad\text{and}\quad
-\langle \tilde S^N (C^N)^{1/2} z^N, (\tilde S^N)^2 x\rangle_{C^N} = \langle (C^N)^{1/2} z^N, S^N (\tilde S^N)^2 x^N\rangle.
$$
We can therefore write an explicit expression for $Z_j^N$:
$$
Z_j^N = -\frac{\ell^3}{4N^{3\gamma}\lambda_j}\, x^{j,N} z^{j,N}
- 2\frac{\ell^{\alpha-1}}{N^{(\alpha-1)\gamma}}\,\lambda_j\, z^{j,N} (S^N x^N)_j
+ \frac{\ell^{2\alpha-1}}{N^{(2\alpha-1)\gamma}}\,\lambda_j\, z^{j,N} (S^N \tilde S^N x^N)_j
+ \frac{\ell^{3\alpha-1}}{N^{(3\alpha-1)\gamma}}\,\lambda_j\, z^{j,N} (S^N (\tilde S^N)^2 x^N)_j. \qquad (B.16)
$$
For the sake of clarity we stress again that in the above $(S^N x^N)$ is an $N$-dimensional vector and $(S^N x^N)_j$ is the $j$-th component of this vector. Therefore, recalling (2.3), (2.1), (A.1) and setting $\gamma = 1/6$, we have
$$
\sum_{j=1}^N \lambda_j^2\, E_{\pi^N}\big|Z_j^N\big|^2
\;\lesssim\; \frac{1}{N}\sum_{j=1}^N E\big|\rho^j z^{j,N}\big|^2
+ \frac{E_{\pi^N}}{N^{(\alpha-1)/3}}\sum_{j=1}^N \lambda_j^4\, (S^N x^N)_j^2
+ \frac{E_{\pi^N}}{N^{(2\alpha-1)/3}}\sum_{j=1}^N \lambda_j^4\, (S^N \tilde S^N x^N)_j^2
+ \frac{E_{\pi^N}}{N^{(3\alpha-1)/3}}\sum_{j=1}^N \lambda_j^4\, (S^N (\tilde S^N)^2 x^N)_j^2
$$
$$
\lesssim\; \frac{1}{N}
+ \frac{1}{N^{(\alpha-1)/3}}\, E_{\pi^N}\|\tilde S^N x^N\|^2
+ \frac{1}{N^{(2\alpha-1)/3}}\, E_{\pi^N}\|(\tilde S^N)^2 x^N\|^2
+ \frac{1}{N^{(3\alpha-1)/3}}\, E_{\pi^N}\|(\tilde S^N)^3 x^N\|^2
\;\lesssim\; \frac{1}{N} + \frac{1}{N^{(\alpha-1)/3}}\, E_{\pi^N}\|x^N\|^2 .
$$
Therefore,
$$
N^{1/3}\sum_{j=1}^N \lambda_j^2\, E_{\pi^N}\big|Z_j^N\big|^2 \;\lesssim\; \frac{1}{N^{2/3}} + \frac{1}{N^{(\alpha-2)/3}} \longrightarrow 0, \quad\text{when } \alpha > 2.
$$
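The antisymmetry manipulations used at the beginning of the proof of iv) (and again in the proof of vi) below) can be checked numerically. The following sketch assumes $\tilde S = C S$ with $S$ antisymmetric and $\langle u, v\rangle_C = \langle u, C^{-1} v\rangle$; these assumptions are consistent with the identities used in this appendix, but the authoritative definitions are the ones given in Section 5.

```python
# Numerical check (a sketch, not part of the proof) of the antisymmetry identities
#   -<S~ C^{1/2} z, S~ x>_C = <C^{1/2} z, S S~ x>   and   <S x, S~ x> = ||S~ x||_C^2.
# ASSUMPTIONS (not stated in this appendix): S~ = C S with S antisymmetric, and
# <u, v>_C = <u, C^{-1} v>.
import numpy as np

rng = np.random.default_rng(1)
N = 6
lam = 1.0 / np.arange(1, N + 1)
C = np.diag(lam**2)
C_inv = np.diag(lam**-2)
C_half = np.diag(lam)

A = rng.standard_normal((N, N))
S = A - A.T                      # antisymmetric S
S_tilde = C @ S                  # assumed relation between S~ and S

x = rng.standard_normal(N)
z = rng.standard_normal(N)

def ip_C(u, v):                  # <u, v>_C = <u, C^{-1} v>
    return u @ (C_inv @ v)

lhs1 = -ip_C(S_tilde @ (C_half @ z), S_tilde @ x)
rhs1 = (C_half @ z) @ (S @ (S_tilde @ x))
print(np.isclose(lhs1, rhs1))    # expected: True

lhs2 = (S @ x) @ (S_tilde @ x)
rhs2 = ip_C(S_tilde @ x, S_tilde @ x)
print(np.isclose(lhs2, rhs2))    # expected: True
```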

Proof of v). Follows from Assumption 5.2, from statement i) of this lemma and from (B.7).

Proof of vi). From (B.1),
$$
\mathrm{Var}_x(Z^N) = E_x\Big|
-\frac{\ell^3}{4N^{3\gamma}}\langle x, C^{1/2} z^N\rangle_C
- 2\frac{\ell^{\alpha-1}}{N^{(\alpha-1)\gamma}}\langle C^{1/2} z^N, S x\rangle
- \frac{\ell^{2\alpha-1}}{N^{(2\alpha-1)\gamma}}\langle \tilde S C_N^{1/2} z^N, \tilde S x\rangle_C
- \frac{\ell^{3\alpha-1}}{N^{(3\alpha-1)\gamma}}\langle \tilde S C_N^{1/2} z^N, \tilde S^2 x\rangle_C
\Big|^2 .
$$
Therefore,
$$
\mathrm{Var}_x(Z^N) = \frac{\ell^6}{16N^{6\gamma}}\, E_x\|x\|^2_C
+ 4\frac{\ell^{2(\alpha-1)}}{N^{2(\alpha-1)\gamma}}\, E_x\|C_N^{1/2} S x\|^2
+ \frac{\ell^{2(2\alpha-1)}}{N^{2(2\alpha-1)\gamma}}\, E_x\|C_N^{1/2} S \tilde S x\|^2
+ \frac{\ell^{2(3\alpha-1)}}{N^{2(3\alpha-1)\gamma}}\, E_x\|C_N^{1/2} S \tilde S^2 x\|^2
+ E_x r^N ,
$$
where $r^N$ contains all the cross-products in the expansion of the variance. By direct calculation and using the antisymmetry of $S$, one finds that most of these cross products vanish and we have
$$
E_x r^N := \frac{\ell^{2\alpha+2}}{2N^{3\gamma} N^{(2\alpha-1)\gamma}}\,\langle Sx, \tilde S x\rangle
+ 4\frac{\ell^{4\alpha-2}}{N^{2(2\alpha-1)\gamma}}\, E_x\|\tilde S^2 x\|^2_C .
$$
Observe that
$$
\langle Sx, \tilde S x\rangle = \|\tilde S x\|^2_C ;
$$
using this fact, Assumption 5.2 implies that the first addend in the above expression for $E_x r^N$ vanishes as $N \to \infty$. The second addend contributes instead to the limiting variance. Now straightforward calculations give the result. □

Set
$$
\epsilon_p^N(x) := E_x\Big[\big(1 \wedge e^{Q^N}\big)\, C_N^{1/2} z^N\Big] \qquad (B.17)
$$
and
$$
h_p^N(x) := E_x\Big[1 \wedge e^{Q^N}\Big].
$$

While $h_p$ (see (5.23)) is the limiting average acceptance probability, $h_p^N(x)$ is the local average acceptance probability.

Lemma B.2. Suppose that Assumption 5.1, Assumption 5.2 and Condition 5.3 hold. Then

(i) If $\alpha > 2$ and $\gamma = 1/6$,
$$
N^{1/3}\, E_{\pi^N}\|\epsilon_p^N(x)\|^2 \xrightarrow{\,N\to\infty\,} 0 ; \qquad (B.18)
$$

(ii) if $1 \le \alpha \le 2$ and $\gamma \ge 1/6$, then
$$
E_{\pi^N}\big\|N^{\gamma(\alpha-1)}\,\epsilon_p^N(x) + 2\ell^{\alpha-1}\,\bar h_p\,\tilde S x\big\|^2 \xrightarrow{\,N\to\infty\,} 0 , \qquad (B.19)
$$
where the constant $\bar h_p$ has been defined in (5.26);

(iii) if $\alpha \ge 1$, $\gamma \ge 1/6$ and $S^N$ is such that $(c_1, c_2, c_3) \ne (0, 0, 0)$, then
$$
E_{\pi^N}\big|h_p^N(x) - h_p\big|^2 \xrightarrow{\,N\to\infty\,} 0 ; \qquad (B.20)
$$

(iv) finally, if $1 < \alpha < 2$, $\gamma > 1/6$ and $S^N$ is such that $c_1 = c_2 = c_3 = 0$, then
$$
E_{\pi^N}\big|h_p^N(x) - 1\big|^2 \xrightarrow{\,N\to\infty\,} 0 ,
$$
i.e. the constant $h_p$ in (B.20) is equal to one. This means that, as $N \to \infty$, the acceptance probability tends to one.
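Lemma B.2 (iii)–(iv) describe the limiting behaviour of the average acceptance probability. The following Python sketch illustrates the quantity being controlled, namely $h_p^N(x) = E_x[1 \wedge e^{Q^N}]$ averaged along the chain, for a Metropolis–Hastings algorithm whose proposal is an Euler step of the irreversible dynamics (1.4) for a Gaussian target. It is only an illustration: the preconditioned proposal, the skew-symmetric perturbation $J_0$, the step size $h = \ell^2 N^{-1/3}$ and the eigenvalues $\lambda_j = 1/j$ are simplified stand-ins, not the exact ipMALA parametrization of Section 5.

```python
# Empirical average acceptance probability for an MH chain with an irreversible
# Euler proposal, targeting pi = N(0, C) with C = diag(lambda_j^2).
# The drift Gamma(x) = C^{1/2} J0 C^{-1/2} x is divergence-free with respect to pi
# (since C^{1/2} J0 C^{1/2} is antisymmetric); all numerical choices are illustrative.
import numpy as np

rng = np.random.default_rng(2)

def avg_acceptance(N, ell=1.5, n_steps=4000):
    lam = 1.0 / np.arange(1, N + 1)               # C^{1/2} = diag(lam)
    J0 = np.diag(np.ones(N - 1), 1) - np.diag(np.ones(N - 1), -1)   # skew-symmetric
    h = ell**2 / N**(1.0 / 3.0)                   # MALA-type step-size scaling

    def drift(u):
        # preconditioned drift: -C grad U(u) + Gamma(u) = -u + C^{1/2} J0 C^{-1/2} u
        return -u + lam * (J0 @ (u / lam))

    def log_pi(u):                                # log pi up to an additive constant
        return -0.5 * np.sum((u / lam)**2)

    def log_q(b, a):                              # log density of proposing b from a
        r = b - a - h * drift(a)                  # proposal covariance is 2h C
        return -np.sum((r / lam)**2) / (4.0 * h)

    x = lam * rng.standard_normal(N)              # start in stationarity
    acc = []
    for _ in range(n_steps):
        y = x + h * drift(x) + np.sqrt(2.0 * h) * lam * rng.standard_normal(N)
        log_ratio = log_pi(y) - log_pi(x) + log_q(x, y) - log_q(y, x)
        a = np.exp(min(0.0, log_ratio))           # local acceptance probability, cf. (B.17)
        acc.append(a)
        if rng.uniform() < a:
            x = y
    return float(np.mean(acc))

for N in (16, 64, 256):
    print(N, round(avg_acceptance(N), 3))
```

The printed values are empirical analogues of $E_{\pi^N} h_p^N(x)$; in the regime of item (iii) one expects them to stabilise around a constant as $N$ grows.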

Proof of Lemma B.2.
• Proof of (i). Acting as in [14, page 2349], we obtain
$$
\big|\langle \epsilon_p^N(x), \varphi_j\rangle\big|^2 \;\lesssim\; \lambda_j^2\, E_x\big|Q_j^N\big|^2 .
$$
Taking the sum over $j$ on both sides of the above then gives
$$
\|\epsilon_p^N(x)\|^2 \;\lesssim\; \sum_{j=1}^N \lambda_j^2\, E_x\big|Q_j^N\big|^2 .
$$
Therefore, if we show
$$
N^{1/3}\sum_{j=1}^N \lambda_j^2\, E_{\pi^N}\big|Q_j^N\big|^2 \xrightarrow{\,N\to\infty\,} 0 ,
$$
then (B.18) follows. From (B.8), it is clear that the above is a consequence of Lemma B.1 (in particular, it follows from (B.9)–(B.10)).

• Proof of (ii). Let us split $Q^N$ as follows:
$$
Q^N = R^N + e^N + i_2^N ,
$$

where $e^N$ and $i_2^N$ are defined in (B.6) and (B.5), respectively, while
$$
R^N := I^N + i_1^N + B^N + H^N , \qquad (B.21)
$$
having set
$$
I^N := -\frac{\ell^6}{32 N^{6\gamma}}\|x^N\|^2_C
- 2\frac{\ell^{2(\alpha-1)}}{N^{2\gamma(\alpha-1)}}\|\tilde S x^N\|^2_C
- \frac{\ell^{2(2\alpha-1)}}{2N^{2\gamma(2\alpha-1)}}\|\tilde S^2 x^N\|^2_C , \qquad (B.22)
$$
$$
B^N := -2\frac{\ell^{\alpha-1}}{N^{\gamma(\alpha-1)}}\langle C^{1/2} z^N, S x^N\rangle , \qquad (B.23)
$$
$$
H^N := -\frac{\ell^3}{4N^{3\gamma}}\langle x, C^{1/2} z^N\rangle_C
- \frac{\ell^{2\alpha-1}}{N^{\gamma(2\alpha-1)}}\langle \tilde S C_N^{1/2} z^N, \tilde S x^N\rangle_C
- \frac{\ell^{3\alpha-1}}{N^{\gamma(3\alpha-1)}}\langle \tilde S C_N^{1/2} z^N, \tilde S^2 x^N\rangle_C , \qquad (B.24)
$$
and $i_1^N$ is defined in (B.4). The $j$-th component of $N^{\gamma(\alpha-1)}\epsilon_p^N$ can therefore be expressed as follows:
$$
N^{\gamma(\alpha-1)}\epsilon_p^{j,N} = N^{\gamma(\alpha-1)}\, E_x\big[(1\wedge e^{Q^N})\,\lambda_j z^{j,N}\big]
= N^{\gamma(\alpha-1)}\, E_x\big[(1\wedge e^{R^N})\,\lambda_j z^{j,N}\big] + T_0^j , \qquad (B.25)
$$
where $T_0^j := \langle T_0, \varphi_j\rangle$ and
$$
T_0 := N^{\gamma(\alpha-1)}\, E_x\Big[\big((1\wedge e^{Q^N}) - (1\wedge e^{R^N})\big)\, C_N^{1/2} z^N\Big]. \qquad (B.26)
$$
We now decompose $R^N$ into a component which depends on $z^{j,N}$, $R_j^N$, and a component that does not depend on $z^{j,N}$, $R_{j,\perp}^N$:
$$
R^N = R_j^N + R_{j,\perp}^N ,
$$
with
$$
R_j^N := (i_1^N)_j + (B^N)_j + (H^N)_j ,
$$
having denoted by $(i_1^N)_j$, $(B^N)_j$ and $(H^N)_j$ the parts of $i_1^N$, $B^N$ and $H^N$, respectively, that depend on $z^{j,N}$. That is,
$$
(i_1^N)_j = -\frac{\ell^4}{8N^{4\gamma}}\,\big|z^{j,N}\big|^2 ;
$$
as for $(H^N)_j$, it suffices to notice that
$$
(B^N)_j + (H^N)_j = Z_j^N ,
$$
and the expression for $Z_j^N$ is detailed in (B.16) (just set $\alpha = 2$ in (B.16)). With this notation, from (B.25), we further write
$$
N^{\gamma(\alpha-1)}\, E_x\big[(1\wedge e^{Q^N})\,\lambda_j z^j\big]
= N^{\gamma(\alpha-1)}\, E_x\big[(1\wedge e^{R^N - (i_1^N)_j - H_j^N})\,\lambda_j z^j\big] + T_0^j + T_1^j , \qquad (B.27)
$$
where, like before, $T_1^j := \langle T_1, \varphi_j\rangle$ and
$$
T_1 := N^{\gamma(\alpha-1)}\,\ell\, E_x\Big[\big((1\wedge e^{R^N}) - (1\wedge e^{R^N - (i_1^N)_j - H_j^N})\big)\, C_N^{1/2} z^N\Big].
$$

We recall that the notation $E_x$ denotes the expected value given $x$, where the expectation is taken over all the sources of noise contained in the integrand. In order to further evaluate the RHS of (B.27) we calculate the expected value of the integrand with respect to the law of $z^j$ (we denote such an expected value by $E^{z^j}$ and use $E^{z^j_-}$ to denote expectation with respect to $z^N \setminus z^{j,N}$); to this end, we use the following lemma.

Lemma B.3. If $G$ is a normally distributed random variable with $G \sim \mathcal N(0,1)$, then
$$
E\Big[G\,\big(1 \wedge e^{\delta G + \mu}\big)\Big] = \delta\, e^{\mu + \delta^2/2}\,\Phi\Big(-\frac{\mu}{|\delta|} - |\delta|\Big),
$$
where $\Phi$ is the CDF of a standard Gaussian.
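Before applying the lemma, we note that its identity can be verified directly by simulation; the following minimal Python check is included only as a sanity check of the formula (the values of $\mu$ and $\delta$ are arbitrary).

```python
# Monte Carlo check (a sketch, not part of the proof) of the identity in Lemma B.3:
# for G ~ N(0,1),  E[ G (1 ∧ e^{δG+µ}) ] = δ e^{µ+δ²/2} Φ(-µ/|δ| - |δ|).
import numpy as np
from math import erf, exp, sqrt

def Phi(t):                          # standard Gaussian CDF
    return 0.5 * (1.0 + erf(t / sqrt(2.0)))

rng = np.random.default_rng(3)
G = rng.standard_normal(2_000_000)

for mu, delta in [(0.3, 0.7), (-1.2, 0.5), (0.8, -1.1)]:
    empirical = np.mean(G * np.minimum(1.0, np.exp(delta * G + mu)))
    closed_form = delta * exp(mu + delta**2 / 2) * Phi(-mu / abs(delta) - abs(delta))
    print(mu, delta, round(empirical, 4), round(closed_form, 4))
```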

We apply the above lemma with $\mu = R_{j,\perp}^N$ and $\delta = \delta_j^B$, where
$$
\delta_j^B := -2\frac{\ell^{\alpha-1}}{N^{\gamma(\alpha-1)}}\,\lambda_j\, (S x^N)_j .
$$

We therefore obtain
$$
N^{\gamma(\alpha-1)}\lambda_j\, E_x\Big[\big(1\wedge e^{R^N - (i_1^N)_j - H_j^N}\big)\, z^j\Big]
= N^{\gamma(\alpha-1)}\lambda_j\, E_x^{z^j_-}\bigg[\delta_j^B\, e^{R^N_{j,\perp} + (\delta_j^B)^2/2}\,\Phi\Big(-\frac{R^N_{j,\perp}}{|\delta_j^B|} - |\delta_j^B|\Big)\bigg]
$$
$$
= -2\ell^{\alpha-1}\lambda_j^2\, (S^N x^N)_j\, E_x^{z}\bigg[e^{R^N_{j,\perp} + (\delta_j^B)^2/2}\,\Phi\Big(-\frac{R^N_{j,\perp}}{|\delta_j^B|} - |\delta_j^B|\Big)\bigg]
= -2\ell^{\alpha-1}\lambda_j^2\, E_x\bigg[(Sx)_j\, e^{R^N_{j,\perp} + (\delta_j^B)^2/2}\,\Phi\Big(-\frac{R^N_{j,\perp}}{|\delta_j^B|} - |\delta_j^B|\Big)\,\mathbf 1_{\{R^N_{j,\perp}\,\cdots\}}\bigg]\,\cdots
$$

$\cdots > 2\gamma(\alpha - 1)$, the RHS of the above tends to zero by using (B.31). The term $(II)_{4b}$ can be dealt with in a completely analogous manner.

• The terms $T_4$ and $T_5$ can be studied similarly to what has been done in [10], see the calculations from equation (8.31) there, in particular the terms $e^{i,N}_{3,k}$, $e^{i,N}_{5,k}$. □

References

[1] Erich Berger. Asymptotic behaviour of a class of stochastic approximation procedures. Probability Theory and Related Fields, 71(4):517–552, 1986.
[2] Joris Bierkens. Non-reversible Metropolis-Hastings. Statistics and Computing, pages 1–16, 2015.
[3] A. Bouchard-Côté, S. J. Vollmer, and G. A. Pavliotis. The bouncy particle sampler: a non-reversible rejection-free Markov chain Monte Carlo method. Submitted, 2015.
[4] Persi Diaconis, Susan Holmes, and Radford M. Neal. Analysis of a nonreversible Markov chain sampler. Annals of Applied Probability, pages 726–752, 2000.
[5] A. B. Duncan, T. Lelièvre, and G. A. Pavliotis. Variance reduction using nonreversible Langevin samplers. Journal of Statistical Physics, 163(3):457–491, 2016.
[6] A. B. Duncan, G. A. Pavliotis, and K. Zygalakis. Nonreversible Langevin samplers: splitting schemes, analysis and implementation. Submitted, 2017.
[7] Aryeh Dvoretzky et al. Asymptotic normality for sums of dependent random variables. In Proc. 6th Berkeley Symp. Math. Statist. Probab., volume 2, pages 513–535, 1972.
[8] Alan M. Horowitz. A generalized guided Monte Carlo algorithm. Physics Letters B, 268(2):247–252, 1991.
[9] Chii-Ruey Hwang, Shu-Yin Hwang-Ma, and Shuenn-Jyi Sheu. Accelerating diffusions. The Annals of Applied Probability, 15(2):1433–1444, 2005.
[10] J. Kuntz, Michela Ottobre, and Andrew M. Stuart. Diffusion limit for the random walk Metropolis algorithm out of stationarity. arXiv preprint arXiv:1405.4896, 2014.
[11] Jianfeng Lu and Konstantinos Spiliopoulos. Multiscale integrators for stochastic differential equations and irreversible Langevin samplers. arXiv preprint arXiv:1606.09539, 2016.
[12] Jonathan C. Mattingly, Natesh S. Pillai, and Andrew M. Stuart. Diffusion limits of the random walk Metropolis algorithm in high dimensions. The Annals of Applied Probability, 22(3):881–930, 2012.
[13] Michela Ottobre, Natesh S. Pillai, Frank J. Pinski, and Andrew M. Stuart. A function space HMC algorithm with second order Langevin diffusion limit. Bernoulli, 22(1):60–106, 2016.

[14] Natesh S. Pillai, Andrew M. Stuart, and Alexandre H. Thiéry. Optimal scaling and diffusion limits for the Langevin algorithm in high dimensions. The Annals of Applied Probability, 22(6):2320–2356, 2012.
[15] Luc Rey-Bellet and Konstantinos Spiliopoulos. Irreversible Langevin samplers and variance reduction: a large deviations approach. Nonlinearity, 28(7):2081–2103, 2015.
[16] Luc Rey-Bellet and Konstantinos Spiliopoulos. Variance reduction for irreversible Langevin samplers and diffusion on graphs. Electronic Communications in Probability, 20, 2015.
[17] Gareth O. Roberts and Jeffrey S. Rosenthal. Optimal scaling of discrete approximations to Langevin diffusions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 60(1):255–268, 1998.
[18] Gareth O. Roberts and Jeffrey S. Rosenthal. Optimal scaling for various Metropolis-Hastings algorithms. Statistical Science, 16(4):351–367, 2001.
[19] Gareth O. Roberts and Richard L. Tweedie. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, pages 341–363, 1996.

Department of Mathematics, Heriot-Watt University, Edinburgh, EH14 4AS, Scotland
E-mail address: [email protected]

Department of Statistics, Harvard University, Cambridge, MA, 02138-2901, USA
E-mail address: [email protected]

Department of Mathematics and Statistics, Boston University, Boston, MA, 02215, USA
E-mail address: [email protected]
