On stochastic gradient Langevin dynamics with dependent data streams in the logconcave case ∗

arXiv:1812.02709v1 [math.ST] 6 Dec 2018

M. Barkhagen, N. H. Chau, É. Moulines, M. Rásonyi, S. Sabanis, Y. Zhang

December 7, 2018

Abstract. Stochastic Gradient Langevin Dynamics (SGLD) is a combination of a Robbins–Monro type algorithm with Langevin dynamics in order to perform data-driven stochastic optimization. In this paper, the SGLD method with fixed step size λ is considered in order to sample from a logconcave target distribution π, known up to a normalisation factor. We assume that unbiased estimates of the gradient from possibly dependent observations are available. It is shown that, for all ε > 0, the Wasserstein-2 distance of the n-th iterate of the SGLD algorithm from π is dominated by c_1(ε)[λ^{1/2−ε} + e^{−aλn}] with appropriate constants c_1(ε), a > 0.

1 Introduction

Sampling target distributions is an important topic in statistics and applied probability. In this paper, we are concerned with sampling from the distribution π defined by

π(A) := ∫_A e^{−U(x)} dx / ∫_{R^d} e^{−U(x)} dx,   A ∈ B(R^d),

where B(R^d) denotes the Borel sets of R^d and U : R^d → R_+ is continuously differentiable. One of the recursive schemes considered in this paper is the unadjusted Langevin algorithm. The idea is to construct a Markov chain which is the Euler discretization of a continuous-time diffusion process whose invariant distribution is π. More precisely, we consider the overdamped Langevin stochastic differential equation

dθ_t = −h(θ_t) dt + √2 dB_t,   (1)

with a (possibly random) initial condition θ_0 (independent of (B_t)_{t≥0}), where h := ∇U and (B_t)_{t≥0} is a d-dimensional Brownian motion. It is well-known that, under appropriate conditions, the Markov semigroup associated with the Langevin diffusion (1) is reversible with respect to π, and the rate of convergence to π is geometric in the total variation norm (see [20], [25] and Theorems 1.2 and 1.6 in [1]). The Euler–Maruyama discretization scheme for (1) is given by

θ̄_0^λ := θ_0,   θ̄_{n+1}^λ := θ̄_n^λ − λ h(θ̄_n^λ) + √(2λ) ξ_{n+1},   (2)

where (ξ_n)_{n∈N} is a sequence of independent standard d-dimensional Gaussian random variables (independent of θ_0), 0 < λ ≤ 1 is a fixed step size, and θ_0 is the R^d-valued random variable representing the initial value of both (2) and (1). When the step size λ is fixed, the homogeneous Markov chain (θ̄_n^λ)_{n∈N} converges (under suitable assumptions) to a distribution π_λ which differs from π but which, for small λ, is close to π in an appropriate sense (see [9]).

∗ All the authors were supported by The Alan Turing Institute, London under the EPSRC grant EP/N510129/1. N. H. C. and M. R. also enjoyed the support of the NKFIH (National Research, Development and Innovation Office, Hungary) grant KH 126505 and the "Lendület" grant LP 2015-6 of the Hungarian Academy of Sciences.
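To make the scheme concrete, the following is a minimal simulation sketch of the recursion (2) for a toy Gaussian target (U(x) = ‖x‖²/2, so h(θ) = θ and π = N(0, I_d)); the target and all numerical values are illustrative choices, not taken from the paper.

    # Unadjusted Langevin algorithm, recursion (2), for the toy target
    # U(x) = ||x||^2 / 2, i.e. h = grad U = identity. Illustrative values only.
    import numpy as np

    rng = np.random.default_rng(0)
    d, lam, n_steps = 2, 0.05, 10_000

    def h(theta):
        return theta  # gradient of U(x) = ||x||^2 / 2

    theta = np.zeros(d)  # theta_0
    samples = np.empty((n_steps, d))
    for n in range(n_steps):
        xi = rng.standard_normal(d)  # xi_{n+1} ~ N(0, I_d)
        theta = theta - lam * h(theta) + np.sqrt(2 * lam) * xi  # recursion (2)
        samples[n] = theta

    # For small lam, the invariant law pi_lambda is close to pi = N(0, I_d),
    # so the per-coordinate variance of the tail samples should be close to 1.
    print(samples[n_steps // 2:].var(axis=0))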


We now adopt a framework where the exact gradient h is unknown, but one can observe its unbiased estimates. Let H : R^d × R^m → R^d be a measurable function and let (X_n)_{n∈Z} be an R^m-valued (strict-sense) stationary process. Furthermore, we suppose that h(θ) = E[H(θ, X_0)], θ ∈ R^d (we implicitly assume the existence of this expectation). It is technically convenient to assume also that

X_n = g(ε_n, ε_{n−1}, . . .),   n ∈ Z,   (3)

for some i.i.d. sequence (ε_n)_{n∈Z} with values in a Polish space X and for a measurable function g : X^N → R^m. We assume in the sequel that θ_0, (ε_n)_{n∈Z} and (ξ_n)_{n∈N} are independent. For each 0 < λ ≤ 1, define the R^d-valued random process (θ_n^λ)_{n∈N} by the recursion

θ_0^λ := θ_0,   θ_{n+1}^λ := θ_n^λ − λ H(θ_n^λ, X_{n+1}) + √(2λ) ξ_{n+1}.   (4)

The goal of this work is to establish an upper bound on the Wasserstein distance between the target distribution π and its approximations (Law(θ_n^λ))_{n∈N}. We improve the rate of convergence with respect to [23], see also [29]. Data sequences are, in general, not necessarily i.i.d. or even Markovian. They may exhibit long memory, as in the case of many models in finance and queuing theory, see e.g. [28, 2]. It is thus crucial to ensure the validity of sampling procedures like (4) in such circumstances, too.

The paper is organized as follows. Section 2 introduces the theoretical concept of conditional L-mixing, which we will require for the process X. This notion accommodates a large class of (possibly non-Markovian) processes. In Section 3, assumptions and main results are presented. Section 4 discusses the contributions of our work. In Sections 5, 6 and 7 we analyze the properties of (1), (2) and (4), respectively. Certain proofs and auxiliary results are contained in Section 8.

Notation and conventions. The scalar product in R^d is denoted by ⟨·, ·⟩. We use ‖·‖ to denote the Euclidean norm (where the dimension of the space may vary). B(R^d) denotes the Borel σ-field of R^d. For each R ≥ 0 we denote by B(R) := {x ∈ R^d : ‖x‖ ≤ R} the closed ball of radius R around the origin. We are working on a probability space (Ω, F, P). For two sigma-algebras F_1, F_2, we define F_1 ∨ F_2 := σ(F_1 ∪ F_2). The expectation of a random variable X is denoted by EX. For any m ≥ 1, any R^m-valued random variable X and any 1 ≤ p < ∞, we set ‖X‖_p := E^{1/p}‖X‖^p. We denote by L^p the set of X with ‖X‖_p < ∞. The indicator function of a set A is denoted by 1_A. The Wasserstein distance of order p ∈ [1, ∞) between two probability measures µ and ν on B(R^d) is defined by

W_p(µ, ν) = ( inf_{π∈Π(µ,ν)} ∫_{R^d×R^d} ‖x − y‖^p dπ(x, y) )^{1/p},   (5)

where Π(µ, ν) is the set of couplings of (µ, ν); see e.g. [27] for more information about this distance.
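As an illustration of the recursion (4) with a genuinely dependent data stream, the following sketch runs many SGLD chains driven by an AR(1) process (a linear process with geometrically decaying coefficients, hence conditionally L-mixing by Example 2.2 below) and estimates the distance (5) empirically. The choices d = m = 1, H(θ, x) = θ + x and all numerical values are illustrative assumptions, not taken from the paper; with E[X_0] = 0 this gives h(θ) = θ and π = N(0, 1). In dimension one the W_2-optimal coupling is monotone, so the distance between two empirical measures with equal sample sizes reduces to sorted differences.

    # SGLD recursion (4) with an AR(1) data stream; empirical W2 estimate.
    # Toy setting (illustrative): H(theta, x) = theta + x, pi = N(0, 1).
    import numpy as np

    rng = np.random.default_rng(1)
    lam, n_steps, n_chains, rho = 0.02, 2_000, 5_000, 0.5

    theta = np.zeros(n_chains)
    x = rng.standard_normal(n_chains) / np.sqrt(1 - rho**2)  # stationary AR(1) start
    for _ in range(n_steps):
        x = rho * x + rng.standard_normal(n_chains)  # dependent data X_{n+1}
        xi = rng.standard_normal(n_chains)
        theta = theta - lam * (theta + x) + np.sqrt(2 * lam) * xi  # recursion (4)

    # Empirical W2 between Law(theta_n) and pi via the monotone 1-D coupling.
    target = rng.standard_normal(n_chains)  # exact samples from pi = N(0, 1)
    w2 = np.sqrt(np.mean((np.sort(theta) - np.sort(target)) ** 2))
    print(f"empirical W2 estimate: {w2:.3f}")  # expected to be O(sqrt(lam))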

2 Conditional L-mixing

L-mixing processes and random fields were introduced in [11]. They proved to be useful in tackling difficult problems of system identification, see e.g. [12, 13, 14, 15, 24]. In [5], in the context of stochastic gradient methods, the related concept of conditional L-mixing was introduced. We recall its definition below. We consider a filtered probability space equipped with a discrete-time filtration (H_n)_{n∈N} as well as with a decreasing sequence of sigma-fields (H_n^+)_{n∈N} such that H_n is independent of H_n^+, for all n. Fix an integer d ≥ 1 and let D ⊂ R^d be a set of parameters. A measurable function X : N × D × Ω → R^m is called a random field. We will drop dependence on ω ∈ Ω in the notation and write (X_n(θ))_{n∈N,θ∈D}. A random process (X_n)_{n∈N} corresponds to a random field where D is a singleton. A random field is L^r-bounded for some r ≥ 1 if

sup_{n∈N} sup_{θ∈D} ‖X_n(θ)‖_r < ∞.

For a family (Z_i)_{i∈I} of real-valued random variables, there exists one and (up to a.s. equality) only one random variable g = ess sup_{i∈I} Z_i such that:

(i) g ≥ Z_i a.s. for all i ∈ I;
(ii) if g′ is a random variable with g′ ≥ Z_i a.s. for all i ∈ I, then g′ ≥ g P-a.s.,

see e.g. [22, Proposition VI.1.1]. Now we define conditional L-mixing. For some r ≥ 1, let (X_n(θ))_{n∈N,θ∈D} be a random field bounded in L^r. Define, for each n ∈ N,

M_r^n(X) := ess sup_{θ∈D} sup_{m∈N} E^{1/r}[ ‖X_{n+m}(θ)‖^r | H_n ],
γ_r^n(τ, X) := ess sup_{θ∈D} sup_{m≥τ} E^{1/r}[ ‖X_{n+m}(θ) − E[X_{n+m}(θ) | H_{n+m−τ}^+ ∨ H_n]‖^r | H_n ],   τ ≥ 1,
Γ_r^n(X) := Σ_{τ=1}^∞ γ_r^n(τ, X).

When necessary, M_r^n(X, D), γ_r^n(τ, X, D) and Γ_r^n(X, D) are used to emphasize the dependence of these quantities on the domain D, which may vary.

Definition 2.1 (Conditional L-mixing). We call (X_n(θ))_{n∈N,θ∈D} uniformly conditionally L-mixing (UCLM) with respect to (H_n, H_n^+)_{n∈N} if (X_n(θ))_{n∈N} is adapted to (H_n)_{n∈N} for all θ ∈ D; for all r ≥ 1, it is L^r-bounded; and the sequences (M_r^n(X))_{n∈N}, (Γ_r^n(X))_{n∈N} are also L^r-bounded for all r ≥ 1. In the case of stochastic processes (when D is a singleton) the terminology "conditionally L-mixing process" will be used.

Example 2.2. Let us consider, for example, a linear process X_n = X_n(θ), θ ∈ D, such that

X_n := Σ_{k=0}^∞ a_k ε_{n−k},   n ∈ Z,

with singleton set D, scalars a_k, k ∈ N, and some sequence (ε_k)_{k∈Z} of i.i.d. R-valued random variables satisfying ‖ε_0‖_p < ∞ for all p ≥ 1. Let H_n = σ{ε_j, j ≤ n} and H_n^+ = σ{ε_j, j > n} for n ∈ N. If we further assume that |a_k| ≤ c(1 + k)^{−β}, k ∈ N, for some c > 0, β > 3/2, then [5, Lemma 4.3] shows that (X_n)_{n∈N} is a conditionally L-mixing process with respect to (H_n, H_n^+)_{n∈N}. If (X_n)_{n∈N} is conditionally L-mixing with respect to (H_n, H_n^+)_{n∈N} then so is (F(X_n))_{n∈N} for a Lipschitz-continuous function F, see Remark 2.3 of [5]. Finally, we know from Remark 7.3 of [10] that a broad class of functionals of geometrically ergodic Markov chains have the L-mixing property. It is possible to show, along the same lines, the conditional L-mixing property of these functionals, too.
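A small simulation of the linear process in Example 2.2, truncating the infinite sum at a level K; the constants c, β and K below are illustrative choices (any β > 3/2 fits the example).

    # Linear process X_n = sum_{k>=0} a_k eps_{n-k} with a_k = c (1 + k)^(-beta),
    # truncated at K terms. Illustrative parameters only.
    import numpy as np

    rng = np.random.default_rng(2)
    c, beta, K, n = 1.0, 2.0, 500, 10_000

    a = c * (1.0 + np.arange(K)) ** (-beta)  # coefficients a_0, ..., a_{K-1}
    eps = rng.standard_normal(n + K)  # i.i.d. innovations
    # Moving-average form: X_t = sum_k a_k eps_{t-k}, realised as a convolution.
    X = np.convolve(eps, a, mode="valid")[:n]

    # The summable coefficients make the dependence decay quickly; empirically:
    print([round(np.corrcoef(X[:-k], X[k:])[0, 1], 3) for k in (1, 5, 25)])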

3 Assumptions and main results

Let us define

G_n := σ{ε_j, j ≤ n},   G_n^+ := σ{ε_j, j ≥ n + 1},   ∀n ∈ N,

where (ε_n)_{n∈Z} is the noise sequence generating (X_n)_{n∈Z}, see (3) above.

Assumption 3.1. The process (X_n)_{n∈N} is conditionally L-mixing with respect to (G_n, G_n^+)_{n∈N}. Assume also ‖θ_0‖_p < ∞ for all p ≥ 1.

Assumption 3.2. There exist constants L_1, L_2 > 0 such that for all θ_1, θ_2 ∈ R^d and x_1, x_2 ∈ R^m,

‖H(θ_1, x_1) − H(θ_2, x_2)‖ ≤ L_1‖θ_1 − θ_2‖ + L_2‖x_1 − x_2‖.

Assumption 3.1 implies, in particular, that ‖X_0‖ ∈ L^r for any r ≥ 1; thus, under Assumptions 3.1 and 3.2, h(θ) := E[H(θ, X_0)], θ ∈ R^d, is indeed well-defined.

Assumption 3.3. There is a constant a > 0 such that for all θ_1, θ_2 ∈ R^d and x ∈ R^m,

⟨θ_1 − θ_2, H(θ_1, x) − H(θ_2, x)⟩ ≥ a‖θ_1 − θ_2‖².   (6)

Two important properties immediately follow from Assumptions 3.2 and 3.3.

(B1) The function h is Lipschitz-continuous:

‖h(θ_1) − h(θ_2)‖ ≤ L_1‖θ_1 − θ_2‖,   ∀θ_1, θ_2 ∈ R^d.   (7)

(B2) The function h is strongly convex: there exists a constant a > 0 such that

⟨θ_1 − θ_2, h(θ_1) − h(θ_2)⟩ ≥ a‖θ_1 − θ_2‖²,   ∀θ_1, θ_2 ∈ R^d,   (8)

which implies, due to Theorem 2.1.12 in [21],

⟨θ_1 − θ_2, h(θ_1) − h(θ_2)⟩ ≥ ã‖θ_1 − θ_2‖² + (1/(a + L_1))‖h(θ_1) − h(θ_2)‖²,   ∀θ_1, θ_2 ∈ R^d,

where ã = aL_1/(a + L_1).
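The co-coercivity bound implied by (B1)–(B2) can be checked numerically on a hypothetical linear example h(θ) = Aθ with a symmetric positive-definite A whose spectrum lies in [a, L_1]; all values below are illustrative.

    # Numerical check of the co-coercivity bound: <dtheta, dh> >=
    # atilde*||dtheta||^2 + ||dh||^2/(a + L1), for h(theta) = A theta.
    import numpy as np

    rng = np.random.default_rng(6)
    d, a, L1 = 4, 0.5, 3.0
    A = np.diag(rng.uniform(a, L1, d))  # any SPD matrix with spectrum in [a, L1]
    atilde = a * L1 / (a + L1)

    for _ in range(5):
        t1, t2 = rng.standard_normal(d), rng.standard_normal(d)
        dtheta, dh = t1 - t2, A @ (t1 - t2)
        lhs = dtheta @ dh
        rhs = atilde * dtheta @ dtheta + (dh @ dh) / (a + L1)
        assert lhs >= rhs - 1e-12
    print("co-coercivity bound holds on sampled pairs")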

Our aim is to estimate ‖θ_n^λ − θ̄_n^λ‖_2, uniformly in n ∈ N. To begin with, we present an example where explicit calculations are possible.

Example 3.4. Let H(θ, x) := θ + x and let (X_n)_{n∈Z} be a sequence of independent standard Gaussian random variables, independent of (ξ_n)_{n∈N}. We observe that the function H satisfies Assumptions 3.2 and 3.3. Take θ_0 := 0. It is straightforward to check that

θ̄_n^λ − θ_n^λ = Σ_{j=0}^{n−1} (1 − λ)^j λ X_{n−j},

which clearly has variance

Σ_{j=0}^{n−1} (1 − λ)^{2j} λ² = λ(1 − (1 − λ)^{2n}) / (2 − λ).

It follows that

sup_{n∈N} ‖θ̄_n^λ − θ_n^λ‖_2 = √( λ / (2 − λ) ).
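The closed-form value can be checked numerically: the Gaussian increments cancel in the difference, so D_n := θ̄_n^λ − θ_n^λ obeys D_{n+1} = (1 − λ)D_n + λX_{n+1} with D_0 = 0, and its standard deviation should approach √(λ/(2 − λ)). A minimal Monte Carlo sketch (all values illustrative):

    # Monte Carlo check of Example 3.4: std of D_n vs sqrt(lam / (2 - lam)).
    import numpy as np

    rng = np.random.default_rng(3)
    lam, n_steps, n_paths = 0.1, 500, 100_000

    D = np.zeros(n_paths)
    for _ in range(n_steps):
        D = (1 - lam) * D + lam * rng.standard_normal(n_paths)  # D_{n+1}

    print(D.std(), np.sqrt(lam / (2 - lam)))  # the two values should agree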

The example shows that the best estimate we may hope for is of order √λ. Our Theorem 3.5 below achieves this bound asymptotically as p → ∞. Let

λ′ := min{ 1/(a + L_1), 1/(2a), a^{1/(2p)}/(8pL_1²) }.

Then one obtains the main results as follows.

Theorem 3.5. Let Assumptions 3.1, 3.2 and 3.3 hold. For every even number p ≥ max{8, (a + 1)/a} and λ < λ′, there exists C°(p) > 0 (not depending on d) such that

‖θ_n^λ − θ̄_n^λ‖_2 ≤ C°(p) λ^{1/2 − 3/(2p)},   n ∈ N.   (9)

The proof of this theorem is postponed to Section 8.2.

Theorem 3.6. Let Assumptions 3.1, 3.2 and 3.3 hold and let λ < 1/(a + L_1) (due to Lemma 6.3). Then there is a probability measure π_λ such that

W_2(Law(θ̄_n^λ), π_λ) ≤ c_1 e^{−aλn},   n ∈ N,

and

W_2(π, π_λ) ≤ c√λ,

where c_1 is given explicitly in Lemma 6.3 and c is given explicitly in the proof. The proof of Theorem 3.6 is provided in Section 5.

The next corollary relates our findings in Theorems 3.5 and 3.6 to the problem of sampling from π.

Corollary 3.7. Let Assumptions 3.1, 3.2 and 3.3 hold. For each κ > 0, there exist constants c_1(κ), c_2(κ) > 0 such that, for each 0 < ε ≤ 1/2, one has W_2(Law(θ_n^λ), π) ≤ ε whenever λ < λ′ satisfies

λ ≤ c_1(κ) ε^{2+κ}   and   n ≥ (c_2(κ)/ε^{2+κ}) ln(1/ε),   (10)

where c_1(κ), c_2(κ) depend only on d, a, L_1 and L_2.

Proof. Denote C̃ := max{C°(p), c_1, c}. Take p large enough so that κ > 6/(p − 3) and thus χ := 3/p ≤ κ/(κ + 2) holds. Theorems 3.5 and 3.6 imply that

W_2(Law(θ_n^λ), π) ≤ W_2(Law(θ_n^λ), Law(θ̄_n^λ)) + W_2(Law(θ̄_n^λ), π_λ) + W_2(π_λ, π)
≤ C̃[λ^{1/2 − χ/2} + e^{−aλn} + λ^{1/2}]
≤ C̃[λ^{1/(2+κ)} + e^{−aλn}].

Choosing λ ≤ ε^{2+κ}/(2C̃)^{2+κ}, C̃λ^{1/(2+κ)} ≤ ε/2 holds. Now it remains to choose n large enough to have C̃e^{−aλn} ≤ ε/2 or, equivalently, aλn ≥ ln(2C̃/ε). Noting the choice of λ and ln(1/ε) ≥ ln 2 > 0, this is possible if

n ≥ (C̃/ε^{2+κ}) ln(1/ε).
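In practice, (10) translates into a simple recipe for choosing the step size and the number of iterations. The helper below is hypothetical: the constants C1, C2 are placeholders standing in for the unspecified c_1(κ), c_2(κ), not values derived in the paper.

    # Hypothetical helper for the prescription (10); C1, C2 are placeholders.
    import math

    def sgld_parameters(eps, kappa, C1=1.0, C2=1.0):
        lam = C1 * eps ** (2 + kappa)  # lambda <= c1(kappa) * eps^(2 + kappa)
        n = math.ceil(C2 * math.log(1 / eps) / eps ** (2 + kappa))
        return lam, n

    print(sgld_parameters(eps=0.1, kappa=0.5))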

Assumption 3.8. The process (X_n)_{n∈N} is i.i.d. with ‖X_0‖_p and ‖θ_0‖_p finite for every p ≥ 1.

Assumption 3.9. There exists a real-valued function α : R^m → R such that for all θ_1, θ_2 ∈ R^d and x ∈ R^m,

⟨θ_1 − θ_2, H(θ_1, x) − H(θ_2, x)⟩ ≥ α(x)‖θ_1 − θ_2‖²,

and E[α(X_0)] < ∞.

Corollary 3.10. Let Assumptions 3.2, 3.8 and 3.9 hold. For each κ > 0, there exist constants c_1(κ), c_2(κ) > 0 such that, for each 0 < ε ≤ 1/2, one obtains W_2(Law(θ_n^λ), π) ≤ ε whenever λ < λ′ satisfies

λ ≤ c_1(κ) ε^{2+κ}   and   n ≥ (c_2(κ)/ε^{2+κ}) ln(1/ε),   (11)

where c_1(κ), c_2(κ) depend only on d, E[α(X_0)], L_1 and L_2.

Proof. One notes that (B1) is still valid and that (B2) holds with a = E[α(X_0)]. Consequently, Theorems 3.5 and 3.6 remain true, and the desired result follows.

4 Related work and discussion

Rate of convergence. Corollary 3.7 significantly improves on some of the results in [23] in certain cases; compare also to [29]. In [23] the monotonicity assumption (6) is not imposed, only a dissipativity condition is required, and a more general recursive scheme is investigated. However, the input sequence (X_n)_{n∈N} is assumed i.i.d. In that setting, Theorem 2.1 of [23] applies to (4) (with the choice δ = 0, β = 1, d fixed, see also the last paragraph of Subsection 1.1 of [23]), and we get that W_2(Law(θ_n^λ), π) ≤ ε holds whenever λ ≤ c_3(ε/ln(1/ε))^4 and n ≥ (c_4/ε^4) ln^5(1/ε) with some c_3, c_4 > 0. Our results provide the sharper estimates (11) in a setting where (X_n)_{n∈N} may exhibit dependence. For the case of i.i.d. (X_n)_{n∈N} see also the very recent [17].

Choice of step size. It is pointed out in [25] that the ergodicity of (2) is sensitive to the step size λ. Lemma 6.3 of [18] gives an example in which the Euler–Maruyama discretization is transient. As pointed out in [18], under discretization the minorization condition is preserved for an appropriate sampling rate, while the Lyapunov condition may be lost. An invariant measure exists if the two conditions hold simultaneously, see Theorem 7.3 of [18] and also Theorem 3.2 of [25] for similar discussions. In our paper we follow the footsteps of [6] in imposing strong convexity of U together with Lipschitzness of its gradient, and thus we do obtain ergodicity of (2).

More on exponential rates. The convergence rate of the Euler–Maruyama scheme (2) to its stationary distribution is comparable to that of the Langevin SDE (1), see [25] or the discussion after Proposition 3.2 of [23]. However, [23] proved the exponential convergence of (1) under the condition that the inverse temperature parameter β satisfies β > 2/m, where m appears in the dissipativity condition (A.3) of [23]:

∃m > 0, b ≥ 0 :   ⟨θ, H(θ, z)⟩ ≥ m‖θ‖² − b,   ∀θ ∈ R^d.

Let us consider the same setting with strongly convex h and β = 1. The constant m, which controls the quadratic term, plays a similar role to a in our Assumption 3.3. In [23], m has to be larger than 2; i.e., outside a compact set, the function U should be sufficiently convex. From the present paper it becomes clear that for the approximation (2) to work, strong convexity (a > 0) is enough. Our result is stronger and not included in [23]; however, it requires global convexity. Note also that the dimension d has different effects on the convergence rate: it is exp(−λn e^{−Õ(d)}) in [23] and O(√d) exp(−aλn) in our case, see Lemma 6.3.

Horizon dependence. For the convergence of the Euler–Maruyama scheme (2) to the stationary distribution of the Langevin dynamics (1), [6] uses the total variation distance and Kullback–Leibler divergence to obtain a bound (up to constants in the exponent and in front of the expression) like e^{−λn} + √(nλ), where n is the time horizon. As explained in their Remark 1, this bound is not sharp and improving it is a challenging question. Theorem 3.6 (see also [9]) provides

W_2(δ_x R_λ^n, π) ≤ W_2(δ_x R_λ^n, π_λ) + W_2(π_λ, π) ≤ c e^{−aλn} + c√λ,   (12)

with some c > 0; see Section 6 for the definition of the operator R_λ. This provides a bound that is independent of n. Note also that the dependence on the dimension is of order √d, as in the result of [6], see Lemma 6.3 below.

5 Analysis for the Langevin diffusion (1)

By Theorem 2.1.8 of [21], U has a unique minimum at some point x* ∈ R^d. Note that due to the Lipschitz condition (B1), the SDE (1) has a unique strong solution. By [16], Section 5.4.C, Theorem 4.20, one constructs the associated strongly Markovian semigroup (P_t)_{t≥0} given for all t ≥ 0, x ∈ R^d and A ∈ B(R^d) by P_t(x, A) = P(θ_t ∈ A | θ_0 = x). Consider the infinitesimal generator A associated with the SDE (1), defined for all f ∈ C²(R^d) and x ∈ R^d by

A f(x) = −⟨h(x), ∇f(x)⟩ + ∆f(x),

where ∆f(x) = Σ_{i=1}^d ∂²f/∂x_i²(x) is the Laplacian.

Lemma 5.1. Let Assumptions 3.2 and 3.3 hold ((B1), (B2) are thereby implied). Consider the Lyapunov function V : R^d → [0, ∞) defined by V(x) = ‖x − x*‖^{2p}, p ≥ 1. Then the drift condition

A V(x) ≤ −c V(x) + cβ   (13)

is satisfied with c = ap and β = (2d + 4(p − 1))^p / a^p. Moreover, one obtains

sup_{t≥0} P_t V(x) ≤ V(x) + β.   (14)

Proof. For all x ∈ R^d, by (B2), we have

A V(x) = ( −2p⟨x − x*, h(x)⟩ + 2pd + 4p(p − 1) ) ‖x − x*‖^{2p−2} ≤ −2ap‖x − x*‖^{2p} + (2pd + 4p(p − 1))‖x − x*‖^{2p−2}.

For all ‖x − x*‖ ≥ √((2d + 4(p − 1))/a), one obtains A V(x) ≤ −ap V(x). For all ‖x − x*‖ ≤ √((2d + 4(p − 1))/a), A V(x) + ap V(x) ≤ p(2d + 4(p − 1))^p / a^{p−1}. Finally, by Theorem 1.1 in [19], one obtains (14).

Lemma 5.2. Let Assumptions 3.2 and 3.3 hold ((B1), (B2) are thereby implied).

(i) For all t ≥ 0 and x ∈ R^d,

∫_{R^d} ‖y − x*‖² P_t(x, dy) ≤ ‖x − x*‖² e^{−2at} + (d/a)(1 − e^{−2at}).

(ii) The stationary distribution π satisfies

∫_{R^d} ‖x − x*‖² π(dx) ≤ d/a.

Proof. See Proposition 1 of [9].

Lemma 5.3. Let Assumptions 3.2 and 3.3 hold ((B1), (B2) are thereby implied). Let (θ_t)_{t≥0} be the solution of the SDE (1) started at x ∈ R^d. For all t ≥ 0 and x ∈ R^d,

E[ ‖θ_t − x‖² | θ_0 = x ] ≤ dt(2 + L_1²t²/3) + (3/2) t² L_1² ‖x − x*‖².

Proof. See Lemma 19 of [9].
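For intuition, in the hypothetical Ornstein–Uhlenbeck case h(θ) = aθ (so x* = 0) the bound in Lemma 5.2 (i) holds with equality, which can be checked against a direct Euler simulation of (1); all numerical values below are illustrative.

    # Euler simulation of (1) with h(theta) = a * theta, compared with the
    # right-hand side of Lemma 5.2 (i). Equality holds for the OU process,
    # up to Monte Carlo and discretisation error.
    import numpy as np

    rng = np.random.default_rng(4)
    a, d, t, dt, n_paths = 0.7, 3, 1.5, 1e-3, 20_000
    x0 = np.full(d, 2.0)

    theta = np.tile(x0, (n_paths, 1))
    for _ in range(int(t / dt)):
        dB = rng.standard_normal((n_paths, d)) * np.sqrt(dt)
        theta += -a * theta * dt + np.sqrt(2.0) * dB  # SDE (1)

    mc = np.mean(np.sum(theta**2, axis=1))  # E ||theta_t - x*||^2 with x* = 0
    bound = np.sum(x0**2) * np.exp(-2 * a * t) + (d / a) * (1 - np.exp(-2 * a * t))
    print(mc, "vs", bound)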

Theorem 3.6, which gives the convergence rate of the Euler–Maruyama scheme (2) to the Langevin dynamics (1), is proven below. In order to have explicit constants, the argument in [9] is adapted to our model setting. The discrete-time process (θ̄_n^λ)_{n∈N} is lifted to a continuous-time process (θ̄_t^λ)_{t∈R_+} such that on a grid of size λ the distributions of the two processes coincide. To do so, let us define the grid t_n = nλ for each n ∈ N and, for t ∈ [t_n, t_{n+1}), set

θ_t = θ_{t_n} − ∫_{t_n}^t h(θ_s) ds + √2 (B_t − B_{t_n}),
θ̄_t^λ = θ̄_{t_n}^λ − ∫_{t_n}^t h(θ̄_{t_n}^λ) ds + √2 (B_t − B_{t_n}).   (15)

Note that at the grid points θ̄_{t_{n+1}}^λ = θ̄_{t_n}^λ − λ h(θ̄_{t_n}^λ) + √(2λ) ξ_{n+1}, where ξ_{n+1} = (B_{t_{n+1}} − B_{t_n})/√λ defines an i.i.d. sequence of standard Gaussian random variables.

Proof of Theorem 3.6. Let θ_0 have law π and θ̄_0 := x for some fixed x ∈ R^d. Let ζ_0 = π ⊗ δ_x. Note that t_0 = 0 and (θ_0, θ̄_0) are distributed according to ζ_0. Fix n ≥ 1. Then, using (B2), we obtain

‖θ_{t_{n+1}} − θ̄_{t_{n+1}}^λ‖²
= ‖θ_{t_n} − θ̄_{t_n}^λ − ∫_{t_n}^{t_{n+1}} (h(θ_s) − h(θ̄_{t_n}^λ)) ds‖²
= ‖θ_{t_n} − θ̄_{t_n}^λ‖² + ‖∫_{t_n}^{t_{n+1}} (h(θ_s) − h(θ̄_{t_n}^λ)) ds‖²
  − 2λ⟨θ_{t_n} − θ̄_{t_n}^λ, h(θ_{t_n}) − h(θ̄_{t_n}^λ)⟩ − 2⟨θ_{t_n} − θ̄_{t_n}^λ, ∫_{t_n}^{t_{n+1}} (h(θ_s) − h(θ_{t_n})) ds⟩
≤ ‖θ_{t_n} − θ̄_{t_n}^λ‖² + ‖∫_{t_n}^{t_{n+1}} (h(θ_s) − h(θ̄_{t_n}^λ)) ds‖²
  − 2ãλ‖θ_{t_n} − θ̄_{t_n}^λ‖² − (2λ/(a + L_1))‖h(θ_{t_n}) − h(θ̄_{t_n}^λ)‖²
  − 2⟨θ_{t_n} − θ̄_{t_n}^λ, ∫_{t_n}^{t_{n+1}} (h(θ_s) − h(θ_{t_n})) ds⟩.

Moreover,

‖∫_{t_n}^{t_{n+1}} (h(θ_s) − h(θ̄_{t_n}^λ)) ds‖²
= ‖λ(h(θ_{t_n}) − h(θ̄_{t_n}^λ)) + ∫_{t_n}^{t_{n+1}} (h(θ_s) − h(θ_{t_n})) ds‖²
≤ 2( λ²‖h(θ_{t_n}) − h(θ̄_{t_n}^λ)‖² + ‖∫_{t_n}^{t_{n+1}} (h(θ_s) − h(θ_{t_n})) ds‖² )
≤ 2( λ²‖h(θ_{t_n}) − h(θ̄_{t_n}^λ)‖² + λ ∫_{t_n}^{t_{n+1}} ‖h(θ_s) − h(θ_{t_n})‖² ds ),

where the last line follows from the Cauchy–Schwarz inequality. Thus, since λ < 1/(a + L_1), we have that

‖θ_{t_{n+1}} − θ̄_{t_{n+1}}^λ‖² ≤ (1 − 2ãλ)‖θ_{t_n} − θ̄_{t_n}^λ‖² + 2λ ∫_{t_n}^{t_{n+1}} ‖h(θ_s) − h(θ_{t_n})‖² ds
  + 2|⟨θ_{t_n} − θ̄_{t_n}^λ, ∫_{t_n}^{t_{n+1}} (h(θ_s) − h(θ_{t_n})) ds⟩|.

Using |⟨u, v⟩| ≤ ǫ‖u‖² + (4ǫ)^{−1}‖v‖² then yields

‖θ_{t_{n+1}} − θ̄_{t_{n+1}}^λ‖² ≤ (1 − 2λ(ã − ǫ))‖θ_{t_n} − θ̄_{t_n}^λ‖² + (2λ + (2ǫ)^{−1}) ∫_{t_n}^{t_{n+1}} ‖h(θ_s) − h(θ_{t_n})‖² ds.

Let (F_t)_{t≥0} denote the natural filtration of the Brownian motion (B_t)_{t≥0}. By (B1) and Lemma 5.3, we have that

E[ ‖θ_{t_{n+1}} − θ̄_{t_{n+1}}^λ‖² | F_{t_n} ]
≤ (1 − 2λ(ã − ǫ))‖θ_{t_n} − θ̄_{t_n}^λ‖² + (2λ + (2ǫ)^{−1}) ∫_{t_n}^{t_{n+1}} E[‖h(θ_s) − h(θ_{t_n})‖² | F_{t_n}] ds
≤ (1 − 2λ(ã − ǫ))‖θ_{t_n} − θ̄_{t_n}^λ‖² + (2λ + (2ǫ)^{−1}) L_1² ∫_{t_n}^{t_{n+1}} E[‖θ_s − θ_{t_n}‖² | F_{t_n}] ds
≤ (1 − 2λ(ã − ǫ))‖θ_{t_n} − θ̄_{t_n}^λ‖² + (2λ + (2ǫ)^{−1}) L_1² ∫_{t_n}^{t_{n+1}} ( 2d(s − t_n) + (3/2)(s − t_n)² L_1² ‖θ_{t_n} − x*‖² + (1/3) d L_1² (s − t_n)³ ) ds
≤ (1 − 2λ(ã − ǫ))‖θ_{t_n} − θ̄_{t_n}^λ‖² + (2λ + (2ǫ)^{−1}) L_1² ( dλ² + (1/2)λ³ L_1² ‖θ_{t_n} − x*‖² + (1/12) d L_1² λ⁴ ).

Taking ǫ = ã/2 and using induction, one obtains

E[ ‖θ_{t_{n+1}} − θ̄_{t_{n+1}}^λ‖² ] ≤ (1 − ãλ)^{n+1} ∫_{R^d} ‖y − x‖² π(dy)
  + Σ_{k=0}^{n} (2λ + ã^{−1}) L_1² ( dλ² + (1/12)λ⁴ d L_1² )(1 − ãλ)^k
  + Σ_{k=0}^{n} (2λ + ã^{−1}) L_1² (1/2)λ³ L_1² δ_k (1 − ãλ)^{n−k},

where δ_k := e^{−2at_k} E(‖θ_0 − x*‖²) + (d/a)(1 − e^{−2at_k}) ≤ d/a, due to Lemma 5.2 (ii). Hence

E[ ‖θ_{t_{n+1}} − θ̄_{t_{n+1}}^λ‖² ] ≤ (1 − ãλ)^{n+1} (‖x − x*‖² + d/a) + λ L_1² ã^{−1} (2λ + ã^{−1})( d + (1/12)λ² d L_1² ) + (1/2)λ² L_1⁴ (2λ + ã^{−1}) d/(aã).   (16)

By Lemma 6.3 below, for all x ∈ R^d, (δ_x R_λ^n)_{n≥0} converges in Wasserstein distance to π_λ as n → ∞; hence

W_2(π, π_λ) = lim_{n→∞} W_2(π, δ_x R_λ^n) ≤ c√λ,

where c = ( L_1² ã^{−1} (2λ + ã^{−1}) ( d + (1/12)λ² L_1² d + (1/2) L_1² λ d/a ) )^{1/2}.

Note that for the Langevin SDE (1), the Euler and Milstein schemes coincide, which implies that the optimal rate of convergence is 1 instead of 1/2. The bound provided in Theorem 3.6 can thus be improved under an additional assumption, see Subsection 8.3.

6 Ergodic properties of the recursive scheme (2)

For a fixed step size λ ∈ (0, 1), consider the Markov kernel R_λ given for all A ∈ B(R^d) and θ ∈ R^d by

R_λ(θ, A) = (4πλ)^{−d/2} ∫_A exp( −(4λ)^{−1} ‖y − θ + λh(θ)‖² ) dy.   (17)

The discrete-time Langevin recursion (2) is a time-homogeneous Markov chain: for any n ≥ 1 and any bounded (or non-negative) Borel function f : R^d → R,

E[ f(θ̄_n^λ) | θ̄_{n−1}^λ ] = R_λ f(θ̄_{n−1}^λ) = ∫_{R^d} f(y) R_λ(θ̄_{n−1}^λ, dy).

We say that a function V : R^d → [1, ∞) satisfies a Foster–Lyapunov drift condition for R_λ if there exist constants λ̄ > 0, α ∈ [0, 1), c > 0 such that for all λ ∈ (0, λ̄],

R_λ V ≤ α^λ V + λc.   (18)
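For intuition, in the hypothetical linear case h(θ) = aθ the kernel (17) is Gaussian with mean (1 − aλ)θ and covariance 2λI_d, so R_λV can be computed in closed form for V(x) = ‖x‖², and a condition of the form (18) can be verified directly with α = e^{−2a}, c = 2d (for aλ ≤ 1); a sketch with illustrative values:

    # One step of the kernel (17) with h(theta) = a*theta and V(x) = ||x||^2:
    # R_lam V(theta) = (1 - a*lam)^2 ||theta||^2 + 2*lam*d, and since
    # (1 - a*lam)^2 <= e^(-2*a*lam) for a*lam <= 1, the drift bound (18)
    # holds with alpha = e^(-2a), c = 2d.
    import numpy as np

    rng = np.random.default_rng(5)
    a, d, lam = 0.8, 2, 0.05
    theta = np.full(d, 3.0)

    xi = rng.standard_normal((100_000, d))
    y = theta - lam * a * theta + np.sqrt(2 * lam) * xi  # samples from R_lam(theta, .)
    mc = np.mean(np.sum(y**2, axis=1))  # Monte Carlo estimate of R_lam V(theta)

    exact = (1 - a * lam) ** 2 * np.sum(theta**2) + 2 * lam * d
    drift = np.exp(-2 * a * lam) * np.sum(theta**2) + lam * 2 * d  # alpha^lam V + lam*c
    print(mc, "~", exact, "<=", drift)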

The following lemma shows that (18) holds for a suitable polynomial Lyapunov function.

Lemma 6.1. Under Assumptions 3.1, 3.2 and 3.3, for each integer p ≥ 1, the function V(x) := ‖x‖^{2p} satisfies the Foster–Lyapunov drift condition for the Markov kernel R_λ.

Proof. This is almost identical to the proof of Lemma 7.1 below, hence omitted.

As a first consequence of Lemma 6.1, we have

Lemma 6.2. Under Assumptions 3.1, 3.2 and 3.3,

sup_{n∈N} E V(θ̄_n^λ) < ∞.

Proof. Using the fact that the function V satisfies the Foster–Lyapunov drift condition for R_λ (see Lemma 6.1), we compute

E V(θ̄_n^λ) = E[ E[V(θ̄_n^λ) | θ̄_{n−1}^λ] ] = E[ R_λ V(θ̄_{n−1}^λ) ] ≤ α^λ E[V(θ̄_{n−1}^λ)] + λc ≤ . . .
≤ α^{nλ} E V(θ̄_0^λ) + (α^{(n−1)λ} + . . . + α^λ + 1)λc ≤ α^{nλ} E V(θ̄_0^λ) + (1 − α^{nλ})(1 − α^λ)^{−1} λc.

Noting that sup_{0<λ≤1} (1 − α^λ)^{−1} λ < ∞, the claim follows.

For r > 2,

sup_{n<k≤n+m} E^{1/r}[ ‖ Σ_{j=n+1}^{k} b_j W_j ‖^r | H_n ] ≤ C_r ( Σ_{j=n+1}^{n+m} b_j² )^{1/2} √( M_r^n(W) Γ_r^n(W) ).   (25)

For each j ≥ 1 and θ ∈ B(j), the first inequality of (19) and Minkowski's inequality imply, for k ≥ n,

E^{1/r}[ ‖H(θ, X_k)‖^r | H_n ] ≤ L_1 j + L_2 E^{1/r}[ ‖X_k‖^r | H_n ] + ‖H(0, 0)‖,

and hence

M_r^n(H(θ, X), B(j)) ≤ L ( M_r^n(X) + j + L^{−1}‖H(0, 0)‖ ),

where L = max{L_1, L_2}. We also have, using Lemma 8.3, that

E^{1/r}[ ‖H(θ, X_k) − E[H(θ, X_k) | H_{k−τ}^+ ∨ H_n]‖^r | H_n ]
≤ 2 E^{1/r}[ ‖H(θ, X_k) − H(θ, E[X_k | H_{k−τ}^+ ∨ H_n])‖^r | H_n ]
≤ 2 L_2 E^{1/r}[ ‖X_k − E[X_k | H_{k−τ}^+ ∨ H_n]‖^r | H_n ],

which implies Γ_r^n(H(θ, X), B(j)) ≤ 2L_2 Γ_r^n(X).

Lemma 8.3. Let G, H ⊂ F be sigma-algebras and let X, Y be random variables in L^p such that Y is measurable with respect to H ∨ G. Then for any p ≥ 1,

E^{1/p}[ |X − E[X | H ∨ G]|^p | G ] ≤ 2 E^{1/p}[ |X − Y|^p | G ].

Proof. See Lemma 6.1 of [5].

8.2 Proof of Theorem 3.5

Clearly, since (X_n)_{n∈N} is conditionally L-mixing with respect to (G_n, G_n^+)_{n∈N}, it remains conditionally L-mixing with respect to (H_n, H_n^+)_{n∈N}, too. For each θ ∈ R^d and 0 ≤ i ≤ j, we recursively define

z^λ(i, i, θ) := θ,   z^λ(j + 1, i, θ) := z^λ(j, i, θ) − λ h(z^λ(j, i, θ)) + √(2λ) ξ_{j+1}.

Let T := ⌊1/λ⌋. We then set, for each n ∈ N and each nT ≤ k < (n + 1)T, z_k^λ := z^λ(k, nT, θ_{nT}^λ). Note that z_k^λ is then defined for all k ∈ N and θ̄_k^λ = z^λ(k, 0, θ_0).

Lemma 8.4. There is C♭ > 0 such that

sup_λ sup_{n∈N} ( ‖H(θ_n^λ, X_{n+1})‖_2 + ‖h(z_n^λ)‖_2 ) ≤ C♭.
