STATISTICAL THEORY OF NON-LINEAR STOCHASTIC SYSTEMS I.

László Gerencsér, Zalán Mátyás, Gábor Molnár-Sáska, Zsanett Orlovits

(Lecture Notes)

Computer and Automation Research Institute of the Hungarian Academy of Sciences (SZTAKI), Stochastic Systems Research Group, 2004. © All rights reserved.

Contents

1 A basic estimation problem
2 Parametric Markov processes
3 The basic algorithm
4 The ODE-method
5 $L^q$ estimates
6 Appendix

1 A basic estimation problem

The LMS method. Let $(x_n, y_n)$ be a second-order stationary process, $x_n \in \mathbb{R}^d$, $y_n \in \mathbb{R}$. Find $\theta \in \mathbb{R}^d$ such that $E(y_n - \theta^T x_n)^2$ is minimized. The problem is quadratic in $\theta$, and the solution is
$$\theta^* = (E x_n x_n^T)^{-1}(E x_n y_n).$$
A low-complexity procedure to find $\theta^*$ is the LMS method (least mean squares):
$$\theta_{n+1} = \theta_n + \gamma x_n (y_n - x_n^T \theta_n).$$
This is a widely studied procedure; for an excellent reference see [6]. The signed LMS method (more precisely: the signed-error LMS method) is
$$\theta_{n+1} = \theta_n + \gamma x_n \,\mathrm{sgn}(y_n - x_n^T \theta_n),$$
see Bucklew, Kurtz and Sethares [2]. A reformulation of the LMS problem: solve $E x_n(y_n - x_n^T \theta) = 0$.

Now we move to a Markovian framework. The classical setting of a Markov process: we are given a measurable space $(\mathcal{X}, \mathcal{A})$, where $\mathcal{X}$ is a separable, complete metric space (i.e. a Polish space).

Definition 1.1 A transition kernel $P(x; A)$, $x \in \mathcal{X}$, $A \in \mathcal{A}$, is a function of two variables such that
1. $P(x; \cdot)$ is a probability measure for all $x \in \mathcal{X}$;
2. $P(\cdot; A)$ is measurable for all $A \in \mathcal{A}$.

Definition 1.2 An $\mathcal{X}$-valued process $X = (X_n)$, $n = 0, 1, \ldots$, over some probability space $(\Omega, \mathcal{F}, Q)$ is a homogeneous Markov process with transition kernel $P(x; A)$ if $Q$-almost surely
$$Q(X_{n+1} \in A \mid X_n, \ldots, X_0) = Q(X_{n+1} \in A \mid X_n)$$
and
$$Q(X_{n+1} \in A \mid X_n = x) = P(x; A).$$

Remark 1.1 In BMP we have $\mathcal{X} = \mathbb{R}^k$, but the Euclidean structure is not really needed.
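As a quick illustration, here is a minimal simulation sketch of the LMS and signed-error LMS recursions; the data model, dimension, gain and sample size are illustrative assumptions, not part of the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, gamma = 3, 20000, 0.01
theta_true = np.array([1.0, -2.0, 0.5])

theta_lms = np.zeros(d)
theta_sgn = np.zeros(d)
for _ in range(N):
    x = rng.normal(size=d)                       # second-order stationary regressor
    y = theta_true @ x + 0.1 * rng.normal()      # noisy linear observation
    theta_lms += gamma * x * (y - x @ theta_lms)           # LMS step
    theta_sgn += gamma * x * np.sign(y - x @ theta_sgn)    # signed-error LMS step

print(theta_lms)   # both estimates end up close to theta_true
print(theta_sgn)
```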

A modern view of Markov processes: let $(\mathcal{U}, \mathcal{B})$ be a measurable space, and let
$$f : \mathcal{X} \times \mathcal{U} \to \mathcal{X}$$
be a measurable mapping. Let $U = (U_i)$, $i = 1, 2, \ldots$, be a $\mathcal{U}$-valued i.i.d. sequence over some probability space $(\Omega, \mathcal{F}, Q)$. Then the process defined by
$$X_{n+1} = f(X_n, U_{n+1}), \qquad X_0 = \xi,$$
with $\xi$ being independent of $U$, is a homogeneous Markov process. (Prove this!) The mapping $T_n : x \mapsto f(x, U_n)$ is a random mapping of $\mathcal{X}$ into itself. We can thus also write $X_{n+1} = T_{n+1} X_n$. The corresponding transition kernel is
$$P(x; A) = Q(Tx \in A). \tag{1}$$
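As a concrete sketch of a random-mapping realization (anticipating Theorem 1.1 below): a Gaussian AR(1) kernel can be driven by uniform noise through the inverse normal CDF. The coefficient values and the use of scipy's norm.ppf are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
a, b = 0.8, 1.0    # stable AR(1): X_{n+1} = a X_n + b e_{n+1}, e_n ~ N(0,1)

def f(x, u):
    # random mapping x -> f(x, u) with U uniform on [0, 1]:
    # norm.ppf(u) is standard normal, so f(x, U) realizes the Gaussian AR(1) kernel
    return a * x + b * norm.ppf(u)

x = 0.0
for _ in range(1000):
    x = f(x, rng.uniform())
print(x)   # a sample from (approximately) the stationary law N(0, b^2 / (1 - a^2))
```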

Conversely, if a given transition kernel can be written in this form, then we say that $T$ realizes $P(x; A)$. This is not just an example of a Markov process, as the following theorem shows.

Theorem 1.1 (see [7]) Let $\mathcal{X}$ be a Polish space, and let $\mathcal{A} = \mathcal{B}(\mathcal{X})$ be its Borel $\sigma$-algebra. Let $P(x; A)$ be a Markov kernel with $x \in \mathcal{X}$, $A \in \mathcal{A}$. Then $P$ can be realized by a random mapping $T$ of the form $T : x \mapsto f(x, U)$, where $U$ can be taken to be uniformly distributed on $[0, 1]$.

Discussion. The realization of Markov processes is described in Kifer [7], see also [9]. An open problem(?): assume that $P_\theta(x, A)$ is Hölder-continuous in $\theta$ in the Prohorov metric. Does it follow that there exists a realization $f$ such that $f_\theta(x, u)$ is Hölder-continuous in $\theta$ for all fixed $(x, u)$?

Remark 1.2 This is at first sight an unexpected result. Assume e.g. that $x_{n+1} = A x_n + B e_{n+1}$, where $x_n$ is vector-valued, $(e_n)$ is Gaussian i.i.d., and $A$ is stable, i.e. its eigenvalues are all in the open unit disc of the complex plane $\{z : |z| < 1\}$. Then it is easy to see that there exists a stationary solution. And now suddenly we claim that the very same process can be defined by a recursion of the form $x_{n+1} = f(x_n, U_{n+1})$, where $(U_n)$ is i.i.d. uniform on $[0, 1]$! Morale: the same transition kernel may have a number of different realizations!

Local averaging. Let $f : \mathcal{X} \to \mathbb{R}$ be a bounded, measurable function. Define a function $Pf$ by
$$Pf(x) = \int_{\mathcal{X}} f(y) P(x, dy),$$
or in BMP's notation:
$$\pi f(x) = \int_{\mathcal{X}} f(y)\,\pi(x, dy).$$

We can also write $Pf(x) = E(f(X_{n+1}) \mid X_n = x)$. $Pf$ is thus the local average of $f$ over the positions reached in one step starting from $x$.

Proposition 1.1 $Pf$ is a bounded measurable function. (Hint for proof: use the fact that $P(x, A)$ is measurable in $x$.)

A difference operator acting on $f$ is defined as $g = Pf - f$.

Definition 1.3 A probability measure $\mu$ on $(\mathcal{X}, \mathcal{A})$ is invariant for the transition kernel $P(x; A)$ if for all $A \in \mathcal{A}$
$$\mu(A) = \int_{\mathcal{X}} \mu(dx) P(x; A).$$

An alternative definition: if the distribution of $X_0$ is $\mu$, then the distribution of $X_1$ is also $\mu$.

Proposition 1.2 If the distribution of $X_0$ is invariant, then $(X_n)$ is stationary. (Prove it!)
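A numerical sanity check of invariance for the AR(1) kernel of the earlier sketch, whose invariant law is the centered Gaussian with variance $b^2/(1-a^2)$ (an assumed, standard fact for this illustrative model):

```python
import numpy as np

rng = np.random.default_rng(2)
a, b = 0.8, 1.0
sigma2 = b**2 / (1 - a**2)       # variance of the invariant law of the AR(1) kernel

x0 = rng.normal(0.0, np.sqrt(sigma2), size=100000)   # X_0 ~ mu
x1 = a * x0 + b * rng.normal(size=x0.size)           # one transition step

print(x0.mean(), x1.mean())      # both close to 0
print(x0.var(), x1.var())        # both close to sigma2: the law is preserved
```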

Proposition 1.3 Assume that $\mu$ is an invariant measure for the transition kernel $P(x; A)$. Let $f$ be an arbitrary bounded, measurable function, $f : \mathcal{X} \to \mathbb{R}$. Then
$$\int_{\mathcal{X}} (Pf - f)(x)\,\mu(dx) = 0.$$

Proof.
$$\int_{\mathcal{X}} Pf(x)\,\mu(dx) = \int_{\mathcal{X}} \mu(dx) \int_{\mathcal{X}} P(x, dy) f(y).$$
Interchanging the order of integration we get
$$\int_{\mathcal{X}} f(y) \int_{\mathcal{X}} \mu(dx) P(x, dy).$$
Now
$$\int_{\mathcal{X}} \mu(dx) P(x, dy) = \mu(dy)$$
due to the invariance of $\mu$; thus we get
$$\int_{\mathcal{X}} f(y)\,\mu(dy).$$
(This step is a bit sloppy, but it can be made precise.) This proves the proposition. $\Box$

Remark 1.3 The definition of $Pf$ can be extended to unbounded measurable functions, assuming that for all $x \in \mathcal{X}$
$$\int_{\mathcal{X}} |f(y)|\,P(x, dy) < \infty,$$

or equivalently, for every initial state $X_0 = x$, $E_x|f(X_1)| < \infty$. Problem: extend Proposition 1.3 to unbounded functions.

The Poisson equation. Assume that $\mu$ is an invariant measure on $(\mathcal{X}, \mathcal{A})$ for the transition kernel $P(x; A)$. Let $k : \mathcal{X} \to \mathbb{R}$ be a bounded measurable function such that
$$\int_{\mathcal{X}} k(x)\,\mu(dx) = 0.$$

The Poisson equation: find a bounded measurable function $f : \mathcal{X} \to \mathbb{R}$ such that
$$Pf - f = k.$$
The solution of a Poisson equation is a kind of integration of $k$ along the Markov process.

Remark 1.4 If $f$ is a solution of the Poisson equation, then for any constant $c$ the function $f + c$ is also a solution.

Remark 1.5 If the Poisson equation does have a bounded, measurable solution $f$ for all $k$ satisfying
$$\int_{\mathcal{X}} k(x)\,\mu(dx) = 0,$$
then for any bounded measurable function $k$ the modified Poisson equation $Pf - f = k - k_0$, with
$$k_0 = \int_{\mathcal{X}} k(x)\,\mu(dx),$$
has a bounded measurable solution.

The usefulness of Poisson equations is demonstrated by the following result.

Proposition 1.4 Assume that for $k : \mathcal{X} \to \mathbb{R}$ we have $Pf - f = k$, where $k, f$ are both bounded and measurable functions. Then
$$\frac{1}{N} \sum_{n=1}^{N} k(x_n)$$
converges to 0 almost surely.

Remark 1.6 Note that the existence of an invariant measure is not assumed!

Proof. Write
$$\sum_{n=1}^{N} k(x_n) = \sum_{n=1}^{N} \bigl(Pf(x_n) - f(x_n)\bigr).$$
Coupling the terms $Pf(x_n)$ and $f(x_{n+1})$, write the above sum as
$$\sum_{n=1}^{N-1} \bigl(Pf(x_n) - f(x_{n+1})\bigr) + Pf(x_N) - f(x_1).$$

Since $E[f(x_{n+1}) \mid x_n, \ldots, x_1] = Pf(x_n)$, the terms in the sum are bounded martingale differences, and $f, Pf$ are bounded, so the proposition follows from the law of large numbers for martingales, see [5]. $\Box$

Extension of the definition of $Pf$ to unbounded $f$:

Proposition 1.5 Assume that the initial distribution of $X_0$ is $\mu$ and $E_\mu|f(X_1)| < +\infty$. Then $Pf(x) = E[f(X_1) \mid X_0 = x]$ is well defined $\mu$-a.s.

Proof. Let $f = f^+ - f^-$, where $f^+, f^-$ are the positive and negative parts of $f$, respectively. Then
$$\int_{\mathcal{X}} Pf^+(x)\,\mu(dx) = \int_{\mathcal{X}} \mu(dx) \int_{\mathcal{X}} f^+(x_1) P(x, dx_1) = \int_{\mathcal{X}} f^+(x_1) \int_{\mathcal{X}} \mu(dx) P(x, dx_1) < +\infty$$

by the assumption. Thus $Pf^+$ is finite $\mu$-a.s. Similarly for $f^-$; thus the claim follows. $\Box$

We can rephrase the above as follows: if $f \in L^1(\mu)$ then also $Pf \in L^1(\mu)$.

Proposition 1.6 Let $p \geq 1$, and assume that $f \in L^p(\mu)$. Then also $Pf \in L^p(\mu)$, and $\|Pf\|_p \leq \|f\|_p$, where $\|\cdot\|_p$ denotes the $L^p(\mu)$-norm.

Proof. Let $f \geq 0$. Then by convexity (Jensen's inequality)
$$\left( \int_{\mathcal{X}} P(x, dy) f(y) \right)^p \leq \int_{\mathcal{X}} P(x, dy) f^p(y).$$
Integrating with respect to $\mu(dx)$ we get the claim as above. $\Box$

Extension of Proposition 1.4:

Proposition 1.7 Let $p > 1$ and assume that $Pf - f = k$ with $f \in L^p(\mu)$. Then
$$\frac{1}{N} \sum_{n=1}^{N} k(x_n)$$
converges to 0 almost surely when $X_0 \sim \mu$. (Prove it!)
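To illustrate Propositions 1.3 and 1.4 numerically, the following sketch works on a three-state chain (an arbitrary illustrative choice): it centers $k$ with respect to the invariant measure, solves the Poisson equation $Pf - f = k$, and checks that the empirical averages of $k(x_n)$ vanish.

```python
import numpy as np

rng = np.random.default_rng(3)
P = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.5, 0.3],
              [0.1, 0.4, 0.5]])        # illustrative transition matrix

# invariant distribution mu: left eigenvector of P for eigenvalue 1
w, v = np.linalg.eig(P.T)
mu = np.real(v[:, np.argmax(np.real(w))])
mu /= mu.sum()

k = np.array([1.0, -0.5, 2.0])
k = k - mu @ k                         # center k so that the mu-integral of k is 0

# Poisson equation  P f - f = k  (solution is unique up to an additive constant)
f = np.linalg.lstsq(P - np.eye(3), k, rcond=None)[0]
print(np.allclose(P @ f - f, k))       # True: centered k lies in the range of P - I

# Proposition 1.4: empirical averages of k along a trajectory vanish
x, s, N = 0, 0.0, 100000
for _ in range(N):
    x = rng.choice(3, p=P[x])
    s += k[x]
print(s / N)                           # close to 0
```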

2 Parametric Markov processes

Let $\theta \in D \subset \mathbb{R}^d$, where $D$ is an open domain. A parametric family of Markov transition kernels is a function $P_\theta(x; A)$ such that for each $(\theta, x)$, $P_\theta(x; \cdot)$ is a probability measure, and for each $A \in \mathcal{A}$, $P_\theta(x; A)$ is measurable in $(\theta, x)$.

Definition 2.1 $(X_n)$ is a Markov process with time-varying dynamics $(\theta_n)$ if for all $A \in \mathcal{A}$
$$P(X_{n+1} \in A \mid \mathcal{F}_n) = P_{\theta_n}(X_n, A),$$
where $\theta_n$ is $\mathcal{F}_n$-measurable, with $\mathcal{F}_n = \sigma\{X_0, \ldots, X_n\}$.

Alternative definition: a parametric family of Markov processes is given by a random mapping $T : D \times \mathcal{X} \to \mathcal{X}$ defined by $(\theta, x) \mapsto f_\theta(x, U)$, where $U$ is a $\mathcal{U}$-valued random variable and $f_\theta(x, u)$ is measurable in $(\theta, x, u)$. If $\theta$ is fixed, then the mapping $x \mapsto f_\theta(x, U)$ will be denoted by $T_\theta$.

Remark 2.1 By Theorem 1.1 above, if $\mathcal{X}$ is a Polish space, then $P_\theta(x, A)$ can be realized by a random mapping.

Consider local averaging for Markov processes with time-varying dynamics. Let $f(\theta, x)$, $\theta \in D$, $x \in \mathcal{X}$, be a non-negative measurable function.

Proposition 2.1 We have
$$E[f(\theta_n, X_{n+1}) \mid \mathcal{F}_n] = \int_{\mathcal{X}} f(\theta_n, y)\,P_{\theta_n}(X_n, dy).$$
Hint: the conditional distribution of $X_{n+1}$ given $\mathcal{F}_n$ is $P_{\theta_n}(X_n, dy)$.

3 The basic algorithm

Let $H = H(\theta, x)$ be an $\mathbb{R}^d$-valued function defined on $D \times \mathcal{X}$. We will use a dual interpretation of $H$: for each fixed $x$ it will be considered a vector field on $D$, and for each fixed $\theta$ it will be considered a function over $\mathcal{X}$. Assume for a moment that for any $\theta \in D$ the Markov chain with transition kernel $P_\theta(x, A)$ has a unique invariant probability measure $\mu_\theta$, and define
$$h(\theta) = \int_{\mathcal{X}} H(\theta, x)\,\mu_\theta(dx).$$
Then our objective is to solve the equation $h(\theta) = 0$.

The basic algorithm: let $\theta_0 = a$, $X_0 = x$ be the initial conditions and define recursively
$$\theta_{n+1} = \theta_n + \gamma_{n+1} H(\theta_n, X_{n+1}), \qquad X_{n+1} = f(\theta_n, X_n, U_{n+1}),$$
with step sizes $\gamma_n$ satisfying the following condition.

Condition A.1. $\gamma_n > 0$, $\gamma_n \searrow 0$ and $\sum_n \gamma_n = \infty$.

We assume the existence of a norm $|x|$ on $\mathcal{X}$, with the only property $|x| \geq 0$. The analysis of this algorithm is the objective of Part II of BMP.

Condition A.5.-0. We assume that the parametric Markov process $(X_n)$ with transition kernel $P_\theta(x, A)$ is $L^q$-stable for any $q \geq 1$, uniformly in $\theta$, in the following sense: for any $q \geq 1$, any compact set $Q \subset D$, for all $\theta \in Q$, $x \in \mathcal{X}$, we have for all $n \geq 1$
$$E_\theta[\,|X_n|^q \mid X_0 = x\,] \leq C_q(1 + |x|^q),$$
where $C_q$ also depends on $Q$. This assumption is implicit in Condition A.5. of BMP. The advantage of this condition is that it is defined purely in terms of the Markov process and not in terms of the basic algorithm.

Condition A.3. There exists a $q' \geq 1$ such that in any compact set $Q \subset D$ we have for all $\theta \in Q$
$$|H(\theta, x)| \leq C(1 + |x|^{q'}),$$
where $C$ depends on $Q$. Combining this with Condition A.5.-0, it follows that if the initial distribution of $X_0$ is $\mu$ and for some $q > 1$
$$\int_{\mathcal{X}} |x|^{qq'}\,\mu(dx) < +\infty,$$
then
$$\sup_n E|H(\theta, X_n)|^q < +\infty,$$
uniformly in $\theta$ for $\theta \in Q$.

Remark 3.1 It follows that if $X_0 = x$ is a constant, then for any $q \geq 1$ we have
$$\sup_n E|H(\theta, X_n)|^q < +\infty,$$
uniformly in $\theta$ for $\theta \in Q$.

In the literature we also find conditions on the existence of exponential moments, requiring that there exists an $\varepsilon > 0$ such that
$$\sup_n E \exp\Bigl( \varepsilon \Bigl| \frac{\partial}{\partial\theta} H(\theta, X_n) \Bigr| \Bigr) < +\infty,$$
uniformly in $\theta$ for $\theta \in Q$. The dual role of these two kinds of conditions will be explained later.

Condition A.4. (i). It will be assumed that for all $\theta \in D$ there exists a constant $h(\theta)$ such that the Poisson equation
$$(I - \Pi_\theta)\,\nu(x) = H(\theta, x) - h(\theta) \tag{P1}$$
has a solution. This solution will be denoted by $\nu_\theta = \nu_\theta(x)$.

Condition A.4. (iii)/a. There exists a $q' \geq 1$ such that for all compact domains $Q \subset D$ we have for $\theta \in Q$
$$|\nu_\theta(x)| \leq C(1 + |x|^{q'}),$$
where $C$ also depends on $Q$. The latter condition, combined with the condition that
$$E_\theta\bigl( |X_1|^{q'} \mid X_0 = x \bigr) < +\infty$$
for any $x \in \mathcal{X}$, implies that $\Pi_\theta \nu_\theta$ is well defined. In view of the above conditions and Proposition 1.4, it follows that for any fixed $\theta \in D$ and $X_0 = x$ we have
$$\lim_N \frac{1}{N} \sum_{n=1}^{N} H(\theta, X_n) = h(\theta)$$
with probability 1, i.e. $Q$-a.s.

Remark 3.2 The same conclusion holds $Q \times \mu$-a.s. if $X_0$ has an initial distribution $\mu$ such that
$$\int_{\mathcal{X}} |x|^q\,\mu(dx) < +\infty.$$

An important feature of the basic algorithm is that $H$ may be discontinuous in $\theta$. But we require that "averaging $H$" over $\mathcal{X}$, yielding $h$, will smooth out eventual discontinuities.

Condition A.4. (ii). $h$ is locally Lipschitz-continuous in $D$.

A similar, but less direct condition is imposed on the local average of $H$. Apply $\Pi_\theta$ to the Poisson equation and write $w_\theta(x) = \Pi_\theta \nu_\theta(x)$ to get
$$(I - \Pi_\theta)\,w_\theta(x) = \Pi_\theta H_\theta(x) - h(\theta), \tag{P2}$$
where $H_\theta$ denotes the partial function $H(\theta, x)$ with $\theta$ fixed.

Condition A.4. (iii)/b. There exist a $q' \geq 1$ and a $\lambda$ with $1/2 \leq \lambda \leq 1$ such that for all compact $Q \subset D$, $\theta, \theta' \in Q$ and any $x \in \mathcal{X}$ we have
$$|w_\theta(x) - w_{\theta'}(x)| \leq C|\theta - \theta'|^\lambda (1 + |x|^{q'}),$$
where $C$ depends on $Q$.

Remark 3.3 Connection between $\nu_\theta$ and $w_\theta$. If $\nu_\theta(x)$ is a solution of the Poisson equation
$$(I - \Pi_\theta)\,\nu(x) = H(\theta, x) - h(\theta),$$
then rearranging this we get
$$\nu_\theta(x) = \Pi_\theta \nu_\theta(x) + H_\theta(x) - h(\theta) = w_\theta(x) + H_\theta(x) - h(\theta).$$
Conversely, if $w_\theta(x)$ is a solution of the Poisson equation (P2), then
$$\nu_\theta(x) = w_\theta(x) + H_\theta(x) - h(\theta) \tag{P1'}$$
is a solution of the Poisson equation (P1). Indeed, it is easy to see that $\Pi_\theta \nu_\theta$ is well defined, and
$$(I - \Pi_\theta)\,\nu_\theta(x) = \bigl(\Pi_\theta H_\theta(x) - h(\theta)\bigr) + \bigl(H_\theta(x) - \Pi_\theta H_\theta(x)\bigr) = H_\theta(x) - h(\theta).$$
Condition A.4. (iii)/a and Condition A.5.-0 imply
$$|w_\theta(x)| \leq C(1 + |x|^{q'}) \tag{2}$$
for $\theta \in Q$, $x \in \mathcal{X}$. Conversely, (2) implies A.4. (iii)/a.

Example 3.1 We give an example of the smoothing effect of $\Pi_\theta$: let $d = 1$, $\mathcal{X} = \mathbb{R}$ and $H(\theta, x) = \mathrm{sgn}(\theta - x)$. Let $\Pi$ be independent of $\theta$; then
$$\Pi H_\theta(x) = \int \mathrm{sgn}(\theta - y)\,\Pi(x, dy).$$
If $\Pi(x, dy) = \varphi(y - x)\,dy$ with a smooth (infinitely differentiable) kernel $\varphi$, then $\Pi H_\theta(x)$ will be smooth with respect to $\theta$.
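A numerical illustration of this smoothing effect: assuming the kernel $\Pi(x, dy) = \varphi(y - x)\,dy$ with the standard normal density $\varphi$, the local average of the discontinuous $H(\theta, x) = \mathrm{sgn}(\theta - x)$ is the smooth function $2\Phi(\theta - x) - 1$.

```python
import numpy as np
from scipy.stats import norm

# H(theta, x) = sgn(theta - x) is discontinuous in theta; its local average
# under Pi(x, dy) = phi(y - x) dy, phi standard normal, is 2*Phi(theta - x) - 1.
x = 0.0
thetas = np.linspace(-3, 3, 7)

mc = [np.sign(t - norm.rvs(loc=x, size=100000, random_state=5)).mean()
      for t in thetas]                    # Monte Carlo estimate of Pi H_theta(x)
exact = 2 * norm.cdf(thetas - x) - 1      # smooth closed form

print(np.round(mc, 2))
print(np.round(exact, 2))                 # the two rows agree
```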

Remark 3.4 The verification of Condition A.4. is the subject of Chapter 2 of Part II of BMP.

The stopped process. We enforce the boundedness of the process $(\theta_n)$ by introducing
$$\tau(Q) = \min\{n : \theta_n \notin Q\}$$
and restricting our analysis to the time domain $0 \leq n \leq \tau(Q)$. We impose the following "boundedness condition", which is a purely technical condition, the verification of which will require substantial extra work.

Condition A.5. For all $q > 0$ and $Q \subset D$ compact, $\theta \in Q$ and all $n$ we have
$$E_{x,a}\{I(n < \tau)(1 + |X_{n+1}|^q)\} \leq C_q(1 + |x|^q),$$
where $C_q$ depends also on $Q$. What is this condition about? It extends A.5.-0 significantly by requiring that $(X_n)$ is $L^q$-stable, in a sense, even if it is driven by a time-varying dynamics determined by the sequence $(\theta_n)$!

Example 3.2 Let $\mathcal{X} = \mathbb{R}^k$, $d = 1$, and consider the Markov process
$$X_{n+1} = A(\theta) X_n + U_n,$$
where $(U_n)$ is i.i.d. and $E|U_n|^q < +\infty$ for all $q \geq 1$. Assume that $A(\theta)$ is stable for all $\theta \in D$, i.e. all its eigenvalues are strictly less than 1 in absolute value. Then Condition A.5.-0 is satisfied. However, it does not follow that for any sequence $(\theta_n) \in Q \subset D$ the process
$$X_{n+1} = A(\theta_n) X_n + U_n$$
will be $L^q$-bounded for any $X_0 = x$. (Hint: find two matrices $A_1, A_2$ such that both are stable but $A_1 A_2$ is not; a numerical sketch follows.)
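A sketch of the hint; the particular matrices are an illustrative choice. Each factor has spectral radius $1/2$, yet their product has spectral radius above 4, so alternating between the two stable dynamics already diverges without any noise.

```python
import numpy as np

A1 = np.array([[0.5, 2.0],
               [0.0, 0.5]])     # eigenvalues 0.5, 0.5: stable
A2 = A1.T                       # also stable

rho = lambda A: max(abs(np.linalg.eigvals(A)))
print(rho(A1), rho(A2), rho(A1 @ A2))    # 0.5, 0.5, about 4.49

# alternating the two stable dynamics (noise omitted) already blows up:
x = np.array([1.0, 1.0])
for n in range(20):
    x = (A1 if n % 2 else A2) @ x
print(np.linalg.norm(x))        # huge: the time-varying system is not L^q-bounded
```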

4 The ODE-method

Define the associated ODE as
$$\dot{y} = h(y), \qquad y_0 = a. \tag{ODE}$$
We expect that under suitable conditions $\theta_n$ will "track" the solution trajectory of (ODE). The definition and quantification of the tracking error is a basic step in the classical ODE method. In BMP a different route is taken: the starting point is a Lyapunov function for the ODE, and this is then used to analyse the convergence of our basic algorithm.

Let $\Phi : \mathbb{R}^d \to \mathbb{R}$ be a $C^2$ function such that
$$M_2 := \sup_{\theta \in \mathbb{R}^d} \|\Phi''(\theta)\| < +\infty.$$

Write our algorithm as
$$\theta_{n+1} = \theta_n + \gamma_{n+1} h(\theta_n) + \gamma_{n+1}\bigl(H(\theta_n, X_{n+1}) - h(\theta_n)\bigr)$$
and consider the evolution of $\Phi(\theta_n)$. Define
$$\varepsilon_n = \varepsilon_n(\Phi) = \Phi(\theta_{n+1}) - \Phi(\theta_n) - \gamma_{n+1}\Phi'(\theta_n)h(\theta_n).$$
Let
$$R(\theta, \theta') = \Phi(\theta') - \Phi(\theta) - (\theta' - \theta)\Phi'(\theta),$$
where $\Phi'(\theta)$ is the gradient of $\Phi$. Then
$$|R(\theta, \theta')| \leq \frac{1}{2} M_2 |\theta - \theta'|^2.$$

Remark 4.1 We need this global condition to control cases when $\theta$ exits $Q$.

Rewriting $\varepsilon_k$: subtract and add $(\theta_{k+1} - \theta_k)\Phi'(\theta_k)$. Thus we get
$$\varepsilon_k = R(\theta_k, \theta_{k+1}) + \Phi'(\theta_k)\bigl(\theta_{k+1} - \theta_k - \gamma_{k+1} h(\theta_k)\bigr).$$

Remark 4.2 Note that the "standard" local tracking error $\theta_{k+1} - \theta_k - \gamma_{k+1} h(\theta_k)$ shows up here. Thus any result along the lines of the classical ODE method can also be used for a Lyapunov-function analysis!

Taking into account the recursion defining $\theta_{k+1}$ and taking out $\gamma_{k+1}$, we get
$$\varepsilon_k = R(\theta_k, \theta_{k+1}) + \gamma_{k+1}\Phi'(\theta_k)\bigl(H(\theta_k, X_{k+1}) - h(\theta_k)\bigr).$$
We would like to estimate the partial sums of $\varepsilon_k$ for reasons that will become clear later in the Lyapunov analysis. The critical objects are the second terms. For $\theta_k = \theta = \mathrm{const.}$ and $\gamma_{k+1} = \gamma = \mathrm{const.}$ we could simply use the Poisson equation to deal with partial sums of these terms. The variation of $\theta_k$ and $\gamma_{k+1}$ requires additional devices. First, using Condition A.4. (i), write, assuming $\theta_k \in Q$,
$$H(\theta_k, X_{k+1}) - h(\theta_k) = \nu_{\theta_k}(X_{k+1}) - \Pi_{\theta_k}\nu_{\theta_k}(X_{k+1}).$$

Then generate a martingale difference on the right-hand side in the usual way by writing it as
$$\bigl(\nu_{\theta_k}(X_{k+1}) - (\Pi_{\theta_k}\nu_{\theta_k})(X_k)\bigr) + \bigl((\Pi_{\theta_k}\nu_{\theta_k})(X_k) - (\Pi_{\theta_k}\nu_{\theta_k})(X_{k+1})\bigr).$$
Taking into account the condition $|\nu_{\theta_k}(x)| \leq C(1 + |x|^{q'})$ and Condition A.5., we get that for any $q \geq 1$, $I(k < \tau)\nu_{\theta_k}(X_{k+1})$ is $L^q$-bounded uniformly in $k$, and thus
$$\Phi'(\theta_k) I(k < \tau)\bigl(\nu_{\theta_k}(X_{k+1}) - (\Pi_{\theta_k}\nu_{\theta_k})(X_k)\bigr)$$
is a uniformly $L^q$-bounded martingale difference. Define
$$\varepsilon^1_k = \gamma_{k+1}\Phi'(\theta_k) I(k < \tau)\bigl[\nu_{\theta_k}(X_{k+1}) - (\Pi_{\theta_k}\nu_{\theta_k})(X_k)\bigr].$$

Time-varying telescopic sums. Let $\Psi_\theta(x) = \Phi'(\theta)\Pi_\theta\nu_\theta(x)$ and consider
$$\sum_{k=0}^{n-1} \gamma_{k+1}\bigl(\Psi_{\theta_k}(X_k) - \Psi_{\theta_k}(X_{k+1})\bigr).$$
Using a variant of Abel summation (which is a discrete version of integration by parts) we get
$$\gamma_1\Psi_{\theta_0}(X_0) - \gamma_n\Psi_{\theta_{n-1}}(X_n) + \sum_{k=1}^{n-1} \gamma_{k+1}\bigl(\Psi_{\theta_k}(X_k) - \Psi_{\theta_{k-1}}(X_k)\bigr) + \sum_{k=1}^{n-1} (\gamma_{k+1} - \gamma_k)\Psi_{\theta_{k-1}}(X_k).$$

Note that in the first sum the $\Psi$-s are evaluated at the same $X_k$.

Remark 4.3 For $\theta_k = \theta = \mathrm{const.}$ we would get the usual Abel summation.

A comparison lemma. The lemma below has been used for the ODE analysis of stochastic approximation processes in [3]. Consider the ordinary differential equation
$$\dot{y}_t = F(t, y_t), \qquad y_s = \xi, \quad s \geq 1. \tag{3}$$
The solution of the above ODE will be denoted by $y(t, s, \xi)$ in the time interval where it exists and is unique.

Condition C. $F = (F(t, y))$ is defined for $t \geq 1$, $y \in D$, where $D \subset \mathbb{R}^p$ is an open domain, and $F$ is continuously differentiable in $(t, y)$. It is assumed that there exists a compact domain $D'' \subset D$ such that for all $\xi \in D''$ the solution $y(t, s, \xi) \in D$ is defined for all $1 \leq s \leq t < \infty$.

Lemma 4.1 Assume that Condition C is satisfied. Let $(x_t)$, $1 \leq s \leq t < \infty$, be a continuous, piecewise continuously differentiable curve such that $x_t \in D''$ for $t \geq s$ and $x_s = y_s = \xi \in D''$. Then for $t \geq s$
$$x_t - y_t = \int_s^t \frac{\partial}{\partial\xi}\,y(t, r, x_r)\,\bigl(\dot{x}_r - F(r, x_r)\bigr)\,dr. \tag{4}$$

Proof. We shall use subscripts to indicate partial derivatives. Write $z_r = y(t, r, x_r)$. Obviously the left-hand side of (4) can be written as $z_t - z_s$, and we have
$$z_t - z_s = \int_s^t \dot{z}_r\,dr = \int_s^t \bigl( y_r(t, r, x_r) + y_\xi(t, r, x_r)\,\dot{x}_r \bigr)\,dr. \tag{5}$$
Taking into account the equality $y_r(t, r, x_r) = -y_\xi(t, r, x_r) \cdot F(r, x_r)$, we get the lemma. $\Box$

The associated ODE. We have written our basic algorithm as
$$\theta_{n+1} = \theta_n + \gamma_{n+1} h(\theta_n) + \varepsilon_n. \tag{6}$$

A natural way to approximate this stochastic algorithm is to use the deterministic recursion
$$z_{n+1} = z_n + \gamma_{n+1} h(z_n), \qquad z_0 = a. \tag{7}$$
This recursion is the Euler approximation (with step sizes $\gamma_n$) of the ordinary differential equation
$$\dot{\bar{\theta}}(t) = h(\bar{\theta}(t)), \qquad \bar{\theta}(0) = a. \tag{8}$$
The general solution with $\bar{\theta}(s) = a$ is $\bar{\theta}(t; s, a)$. Let $T > 0$ be fixed. Define
$$m(T) = \inf\{k \geq 0 : \gamma_1 + \ldots + \gamma_{k+1} \geq T\}$$
and for $n = 1, \ldots, m(T)$ define
$$t_n = \sum_{k=1}^n \gamma_k.$$
Let $Q', Q''$ be compact sets such that $Q' \subset \mathrm{int}\,Q''$, and let
$$\delta_0 = \inf\{|\theta - \eta| : \theta \in Q',\ \eta \in Q''^c\}.$$

Condition A.6. There exist compact sets $Q' \subseteq Q'' \subseteq D$ such that
$$\bar{\theta}(t; 0, a) \in Q' \quad \text{for all } a \in Q',\ 0 \leq t \leq T.$$
We compare $\theta_n$ and $\bar{\theta}(t_n)$ for $n \leq \nu$, where
$$\nu = m(T) \wedge \tau(Q'').$$
For $\bar{\theta}(t_n)$ we have the recursion
$$\bar{\theta}(t_{n+1}) - \bar{\theta}(t_n) = \gamma_{n+1} h(\bar{\theta}(t_n)) + \alpha_n,$$
where, using the fact that $h$ is Lipschitz-continuous, we get $|\alpha_n| \leq L\gamma_{n+1}^2$. (Prove it! A sketch of the argument follows.) Alternatively, using the notation $\bar{\theta}_n = \bar{\theta}(t_n)$,
$$\bar{\theta}_{n+1} = \bar{\theta}_n + \gamma_{n+1} h(\bar{\theta}_n) + \alpha_n. \tag{9}$$
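A sketch of the requested bound, assuming $h$ is Lipschitz with constant $L_0$ and bounded by $M$ on the compact set containing the trajectory (the constant $L$ above then absorbs $L_0 M/2$):

```latex
% local Euler error of (9); assumes |h| <= M and h Lipschitz with constant L_0 on Q''
\alpha_n = \bar\theta(t_{n+1}) - \bar\theta(t_n) - \gamma_{n+1}\, h(\bar\theta(t_n))
         = \int_{t_n}^{t_{n+1}} \bigl( h(\bar\theta(s)) - h(\bar\theta(t_n)) \bigr)\, ds ,
\qquad
|\alpha_n| \le L_0 \int_{t_n}^{t_{n+1}} \bigl|\bar\theta(s) - \bar\theta(t_n)\bigr|\, ds
           \le L_0 M \int_{t_n}^{t_{n+1}} (s - t_n)\, ds
           = \tfrac12\, L_0 M\, \gamma_{n+1}^2 .
```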

Subtracting (9) from (6) and iterating the result, we are led to
$$\theta_n - \bar{\theta}_n = \sum_{k=0}^{n-1} \gamma_{k+1}\bigl(h(\theta_k) - h(\bar{\theta}_k)\bigr) + \sum_{k=0}^{n-1} (\varepsilon_k - \alpha_k).$$
Using again the Lipschitz condition on $h$ and multiplying both sides by $I(n \leq \nu)$, we get
$$I(n \leq \nu)|\theta_n - \bar{\theta}_n| \leq L \sum_{k=0}^{n-1} \gamma_{k+1} I(n \leq \nu)|\theta_k - \bar{\theta}_k| + \Bigl| \sum_{k=0}^{n-1} I(n \leq \nu)\varepsilon_k \Bigr| + L \sum_{k=0}^{n-1} \gamma_{k+1}^2.$$

Now we can use the discrete version of the Bellman–Gronwall lemma.

Theorem 4.1 (Discrete Bellman–Gronwall lemma) Let $(f_i, g_i)_{i=1}^n$ and $c \geq 0$ be non-negative real numbers with $f_0 = 0$, such that for $k = 1, \ldots, n$
$$f_k \leq \sum_{i=1}^k g_i f_{i-1} + c.$$
Then $f_n \leq f_n^*$, where $f_k^*$ is defined by
$$f_k^* = \sum_{i=1}^k g_i f_{i-1}^* + c. \tag{10}$$
It follows that
$$f_n \leq c \exp\Bigl( \sum_{i=2}^n g_i \Bigr).$$

Proof. Prove by induction that $f_n \leq f_n^*$. For the proof of $f_n^* \leq c \exp\bigl(\sum_{i=2}^n g_i\bigr)$: the dynamic form of equality (10) is
$$f_k^* = f_{k-1}^* + g_k f_{k-1}^*, \qquad \text{with } f_1^* = c,$$
from which
$$f_n^* = c \prod_{i=2}^n (1 + g_i).$$
The proof is completed using the inequality $1 + x \leq e^x$. $\Box$

Applying the Bellman–Gronwall lemma with $f_k = I(n \leq \nu)|\theta_k - \bar{\theta}_k|$ we get
$$I(n \leq \nu)|\theta_n - \bar{\theta}_n| \leq (U_{1n} + U_{2n}) \exp\Bigl( L \sum_{k=1}^n \gamma_k \Bigr),$$
where
$$U_{1n} = \sup_{1 \leq m \leq n} \Bigl| \sum_{k=0}^{m-1} I(m \leq \nu)\varepsilon_k \Bigr|, \qquad U_{2n} = L \sum_{k=0}^{m(T)-1} \gamma_{k+1}^2.$$
Take the supremum over $n$ for $n \leq m(T)$ on both sides. From here
$$E\Bigl\{ \sup_{n \leq \nu} I(n \leq \nu)|\theta_n - \bar{\theta}_n|^2 \Bigr\} \leq \bigl( 2E(U_{1n}^2) + 2U_{2n}^2 \bigr) \exp\Bigl( 2L \sum_{k=1}^n \gamma_k \Bigr).$$

Remark 4.4 Since at time $n = \tau(Q'')$ we have $|\theta_n - \bar{\theta}(t_n)| \geq \delta_0$, we get that for any $0 < \delta < \delta_0$
$$\Bigl\{ \sup_{n \leq \nu} |\theta_n - \bar{\theta}(t_n)| \geq \delta \Bigr\} = \Bigl\{ \sup_{n \leq m(T)} |\theta_n - \bar{\theta}(t_n)| \geq \delta \Bigr\}.$$
Now using the upper bound obtained for $E(U_{1n}^2)$ (see BMP [1], Part II, Proposition 7), the Cauchy–Schwarz inequality for $U_{2n}^2$, and applying the Markov inequality, we get the following theorem.

Theorem 4.2 Assume Conditions A.1.–A.5. are satisfied and $Q'$, $Q''$, $T$ and $\delta_0$ are as above. Then there exist constants $C, L, s$ such that for $0 < \delta < \delta_0$ and all $a \in Q'$, $x \in \mathcal{X}$ we have
$$P_{x,a}\Bigl( \sup_{n \leq m(T)} |\theta_n - \bar{\theta}_n| > \delta \Bigr) \leq \frac{C}{\delta^2}(1 + |x|^s)(1 + T)\exp(2LT) \sum_{k=1}^{m(T)} \gamma_k^2.$$
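A simulation sketch of the tracking statement for the scalar example used earlier: with i.i.d. standard normal regressors, $h(\theta) = \theta^* - \theta$, so the ODE solution is $\bar{\theta}(t) = \theta^* + (a - \theta^*)e^{-t}$; all concrete values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
theta_star, a = 1.5, 0.0
theta, t, max_err = a, 0.0, 0.0

for n in range(1, 5001):
    gamma = 1.0 / (n + 20)                       # small, square-summable steps
    phi = rng.normal()                           # i.i.d. regressor: h(theta) = theta* - theta
    y = theta_star * phi + 0.1 * rng.normal()
    theta += gamma * phi * (y - phi * theta)     # stochastic iterate theta_n
    t += gamma                                   # t_n = gamma_1 + ... + gamma_n
    theta_bar = theta_star + (a - theta_star) * np.exp(-t)   # exact ODE solution
    max_err = max(max_err, abs(theta - theta_bar))

print(max_err)   # small: the iterates track the ODE trajectory on [0, T]
```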

Shift in time. Let $P_{n;x,a}$ denote the conditional distribution of the process $(X_{n+k}, \theta_{n+k})$ with initial values $X_n = x$, $\theta_n = a$; i.e., we now start the process at time $n$. Let $\bar{\theta}(t; s, a)$ denote the solution of the ODE (8) with initial value $\bar{\theta}(s) = a$, and define
$$m(n, T) = \inf\{k \geq n : \gamma_{n+1} + \ldots + \gamma_{k+1} \geq T\}.$$
Then
$$P_{n;x,a}\Bigl( \sup_{n \leq k \leq m(n,T)} |\theta_k - \bar{\theta}(t_k; t_n, a)| > \delta \Bigr) \leq \frac{C}{\delta^2}(1 + |x|^s)(1 + T)\exp(2LT) \sum_{k=n}^{\infty} \gamma_k^2. \tag{11}$$
I.e., if $\sum \gamma_k^2 < \infty$, then the trajectories of the stochastic algorithm initialized at time $n$ follow the deterministic trajectories of the ODE with probability increasing to 1.

Asymptotic analysis. Assume that there exists a unique $\theta^* \in D$ such that $h(\theta^*) = 0$, and that this $\theta^*$ is a globally asymptotically stable equilibrium point. Then a converse Lyapunov theorem of Krasovskii [8] implies the following.

Condition A.7. There exist a twice continuously differentiable positive function $U : D \to \mathbb{R}_+$ and a constant $C \leq +\infty$ such that
(i) $U(\theta) \to C$ if $\theta \to \partial D$ or $|\theta| \to \infty$;
(ii) $U(\theta) < C$ for $\theta \in D$;
(iii) $U'(\theta) \cdot h(\theta) \leq 0$ for all $\theta \in D$.

Thus $D$ is a level set of $U$. Generally, for $c \in \mathbb{R}$ define the level sets
$$K(c) = \{\theta : U(\theta) \leq c\}.$$
Note that due to Condition A.7. (i), the level set $K(c)$ is compact whenever $c < C$. To analyze the algorithm we will use a level-crossing argument. Define the stopping times
$$\tau(c) = \inf\{n \geq 1 : \theta_n \notin K(c)\}, \qquad \nu(c) = \inf\{n \geq 1 : \theta_n \in K(c)\}.$$

Lemma 4.2 Assume Conditions A.1.–A.7. hold and let $c_0$ be such that $\theta^* \in K(c_0)$. Then for any $c_0 < c_1 < c_2 < C$, all $a \in K(c_2)$ and all $x \in \mathcal{X}$,
$$\nu(c_1) < \infty \quad P_{x,a}\text{-a.s. on } \{\tau(c_2) = +\infty\},$$
i.e. if $\theta_n$ does not exit $K(c_2)$ at all, then it enters $K(c_1)$ almost surely.

Proof. Let $\Phi$ be a twice continuously differentiable function on $\mathbb{R}^d$ with bounded second derivatives which coincides with $U$ on $K(c_2)$ (prove that there exists such a function!). Define the transformed error process as
$$\varepsilon_k(\Phi) = \Phi(\theta_{k+1}) - \Phi(\theta_k) - \gamma_{k+1}\Phi'(\theta_k)h(\theta_k).$$
We have to show that the set $\Omega_0 = \{\nu(c_1) = \tau(c_2) = +\infty\}$ has measure 0. Notice that $\tau(c_2) = +\infty$ implies $\tau = +\infty$. Choose $\alpha > 0$ such that $U'(\theta)h(\theta) < -\alpha$ for all $\theta$ satisfying $c_1 \leq U(\theta) \leq c_2$. Then on $\Omega_0$ we have
$$c_1 - c_2 \leq \Phi(\theta_{m(n,T)}) - \Phi(\theta_n) = \sum_{k=n}^{m(n,T)-1} \gamma_{k+1} U'(\theta_k)h(\theta_k) + \sum_{k=n}^{m(n,T)-1} \varepsilon_k(\Phi).$$
Since the first term is majorized by $-\alpha T$, we conclude that the series $\sum_k \varepsilon_k(\Phi)$ diverges. This, however, contradicts our earlier result stating that on $\{\tau = +\infty\}$ the former series converges almost surely. $\Box$

Theorem 4.3 Assume Conditions A.1.–A.7. hold and let $c_0$ be such that $\theta^* \in K(c_0)$. Then for any $0 < c < c_2 < C$, all $a \in K(c)$ and all $x \in \mathcal{X}$,
$$\theta_n \longrightarrow K(c_0) \quad P_{x,a}\text{-a.s. on } \{\tau(c_2) = +\infty\}.$$

Proof. Suppose indirectly that $\limsup U(\theta_n) > c$ for some $c > c_0$. Choose $c_0 < c_1 < c$ and define
$$\nu_1 = \inf\{n : \theta_n \in K(c_1)\}, \qquad \tau_1 = \inf\{n : \theta_n \notin K(c)\},$$
$$\nu_k = \inf\{n > \tau_{k-1} : \theta_n \in K(c_1)\}, \qquad \tau_k = \inf\{n > \nu_k : \theta_n \notin K(c)\}.$$
Due to the previous lemma, $\nu_k < \infty$ for any $k \geq 1$, while $\tau_k < \infty$ by our indirect assumption. Now choose $\Phi \in C^2(\mathbb{R}^d)$ in such a way that it coincides with $U$ on $K(c_2)$ and $\inf\{\Phi(\theta) : \theta \notin K(c_2)\} = c$. Noting that $\Phi(\theta_{\tau_n}) - \Phi(\theta_{\nu_n}) \geq c - c_1 > 0$ and using the same argument as in Lemma 4.2, we arrive at a contradiction. $\Box$

To see that our theorem is indeed useful, we need to know something about the measure of the set $\{\tau(c) = +\infty\}$. Using once again the train of thought of Lemma 4.2, a lower bound follows easily from the bound given for the $L^2$-norms of the transformed error process.

Lemma 4.3 Assume Conditions A.1.–A.7. hold and let $c_1 < c_2 < C$. Then there exist constants $B$ and $s$ such that for all $a \in K(c_1)$ and all $x \in \mathcal{X}$,
$$P_{x,a}\{\tau(c_2) < +\infty\} \leq B(1 + |x|^s) \sum_{k=1}^{\infty} \gamma_k^{1+\lambda}.$$
Putting together Lemma 4.3 and Theorem 4.3 and using a shift in time, we are led to the following result.

Theorem 4.4 Assume Conditions A.1.–A.7. hold and $c_0 \in \mathbb{R}$ is such that $\{\theta : U'(\theta)h(\theta) = 0\} \subseteq K(c_0)$. Then for any compact set $Q \subseteq D$ there exist constants $B$ and $s$ such that for all $n \geq 0$, all $a \in Q$ and all $x \in \mathcal{X}$,
$$P_{n,x,a}\{\theta_k \longrightarrow K(c_0)\} \geq 1 - B(1 + |x|^s) \sum_{k=n+1}^{\infty} \gamma_k^{1+\lambda}.$$

Relaxing Condition A.5.

Condition A'.5. For all $q \geq 1$ and $Q \subset D$ compact we have
(i') $\sup_{\theta \in Q} \int \Pi_\theta(x, dy)|y|^q \leq C|x|^q + \beta$;
in addition there exists a positive integer $r$ such that
(i) $\sup_{\theta \in Q} \int \Pi_\theta^r(x, dy)|y|^q \leq \bar{\alpha}|x|^q + \beta$, with $\bar{\alpha} < 1$.

Example 4.1 Consider a linear dynamics
$$X_{n+1} = A(\theta_n)X_n + U_{n+1}$$
such that $A(\theta)$ is stable for $\theta \in D$ and continuous in $\theta$; furthermore, for all $q \geq 1$, $E|U_n|^q < +\infty$. Then Condition A'.5. is satisfied. (Prove it!)

Lemma 4.4 For all positive integers $l$ there exists a constant $C_l$ such that for all $n \geq 1$
$$E(|X_{n+l}|^q \mid \mathcal{F}_n) \leq C_l(1 + |X_n|^q).$$

Proof. For $l = 1$ the proposition follows directly from Condition A'.5. For general $l$,
$$E(|X_{n+l}|^q \mid \mathcal{F}_{n+l-1}) \leq C_1(1 + |X_{n+l-1}|^q).$$
Taking conditional expectation with respect to $\mathcal{F}_n$ and using an inductive hypothesis for $l - 1$, we get the claim. $\Box$

Let now $\nu$ be a stopping time such that on $\{\nu \geq n + 1\}$ we have $\theta_n \in Q$. Then we have

Lemma 4.5 For all positive integers $l$ there exists a constant $C_l$ such that for all $n \geq 1$
$$E[I\{\nu \geq n + l\}|X_{n+l}|^q \mid \mathcal{F}_n] \leq C_l\,I\{\nu \geq n + 1\}(1 + |X_n|^q).$$

Proof. The key difference is at $l = 1$. Now we have
$$E[I\{\nu \geq n + 1\}|X_{n+1}|^q \mid \mathcal{F}_n] = I\{\nu \geq n + 1\}\,E[|X_{n+1}|^q \mid \mathcal{F}_n],$$
since $I\{\nu \geq n + 1\}$ is $\mathcal{F}_n$-measurable. Now
$$E[|X_{n+1}|^q \mid \mathcal{F}_n] = \int_{\mathcal{X}} |y|^q\,P_{\theta_n}(X_n, dy).$$
Multiplying by $I\{\nu \geq n + 1\}$ and taking the supremum over $\theta \in Q$ we get
$$I\{\nu \geq n + 1\}\,E[|X_{n+1}|^q \mid \mathcal{F}_n] \leq I\{\nu \geq n + 1\} \sup_{\theta \in Q} \int_{\mathcal{X}} |y|^q\,P_\theta(X_n, dy),$$
which is bounded by $I\{\nu \geq n + 1\}\,C_1(1 + |X_n|^q)$, as stated. For $l > 1$ we proceed by induction. We have
$$E[I\{\nu \geq n + l\}|X_{n+l}|^q \mid \mathcal{F}_{n+l-1}] \leq I\{\nu \geq n + l\}\,C_1(1 + |X_{n+l-1}|^q)$$
by the already proven inequality for $l = 1$. Increase the right-hand side by replacing $I\{\nu \geq n + l\}$ by $I\{\nu \geq n + l - 1\}$, take conditional expectation with respect to $\mathcal{F}_n$, and apply the inductive hypothesis to get an upper bound of the required form. $\Box$

5 $L^q$ estimates

In this section we investigate the problem discussed in Section 4 for the case $\sum \gamma_n^\alpha < \infty$, where $\alpha$ can be different from 2. Our first goal is to estimate the moments of the stopped partial sums of the transformed error process. For a fixed initial time $n$, define
$$\tau(Q) = \inf\{k \geq n : \theta_k \notin Q\}, \qquad \sigma(\epsilon) = \inf\{k \geq n : |\theta_k - \theta_{k-1}| > \epsilon\}, \qquad \nu(\epsilon, Q) = \min\{\tau(Q), \sigma(\epsilon)\}.$$

Theorem 5.1 We assume conditions (A1)–(A4) and (A5'). For any regular function $\phi$ with bounded second derivatives, any compact subset $Q \subset D$ and all $q \geq 2$, there exist constants $B, s, \epsilon_0 > 0$ ($\epsilon_0$ is independent of $\phi$) such that for all $\epsilon \leq \epsilon_0$, $T > 0$, $x$, $a$ we have
$$E\Bigl\{ \sup_{n < k \leq m(n,T)} I(k \leq \nu(\epsilon, Q)) \Bigl| \sum_{j=n}^{k-1} \varepsilon_j(\phi) \Bigr|^q \Bigr\} \leq B(1 + |x|^s) \sum_{k=n+1}^{m(n,T)} \gamma_k^{1+q/2}.$$

Theorem 5.2 We assume (A1)–(A4) and (A5'), and let $Q_1 \subset Q_2$ be compact subsets of $D$ with $\delta_0$ as above. For all $q \geq 2$ there exist constants $B, s$ such that for all $T > 0$, all $\delta < \delta_0$, all $a \in Q_1$ and all $x$,
$$P_{x,a}\Bigl\{ \sup_{n \leq m(T)} |\theta_n - \bar{\theta}(t_n)| \geq \delta \Bigr\} \leq \frac{B}{\delta^q}(1 + |x|^s)(1 + T)^{q-1}\exp(qLT) \sum_{k=1}^{m(T)} \gamma_k^{1+q/2}. \tag{15}$$
The proof of the theorem is based on Lemma 5.3 and the following lemmas.

Lemma 5.4 For any compact subset $Q$ of $D$ and $q \geq 2$, there exist constants $A, s$ such that for all $T > 0$, all $\epsilon \leq \epsilon_0(q, Q)$, all $a \in Q$ and all $x$, we have
$$P_{x,a}\bigl(\sigma(\epsilon) \leq \tau(Q),\ \sigma(\epsilon) \leq m(T)\bigr) \leq \frac{A}{\epsilon^q}(1 + |x|^s) \sum_{k=1}^{m(T)} \gamma_k^q. \tag{16}$$

Lemma 5.5 Given compact subsets $Q_1 \subset Q_2$ of $D$ and $q \geq 2$, there exist constants $A, s$ such that for all $T$, all $\epsilon \leq \epsilon_0(q, Q_2)$, all $a \in Q_1$ and all $x$,
$$E_{x,a}\Bigl\{ \sup_{n \leq \nu \wedge m(T)} |\theta_n - \bar{\theta}(t_n)|^q \Bigr\} \leq A(1 + |x|^s)(1 + T)^{q-1}\exp(qLT) \sum_{k=1}^{m(T)} \gamma_k^{1+q/2}. \tag{17}$$

In the following we analyze the asymptotic properties of the algorithm under the following assumption:

(A6') There exists $\alpha > 1$ such that $\sum_n \gamma_n^\alpha < \infty$.

We retain assumption (A7) and the notation $K(c) = \{\theta : U(\theta) \leq c\}$, $\tau(c) = \inf\{n : \theta_n \notin K(c)\}$, $\nu(c) = \inf\{n : \theta_n \in K(c)\}$. Let $q_0(\alpha) = \max\{2, 2(\alpha - 1)\}$; i.e., if $q \geq q_0(\alpha)$ then we have both $q \geq 2$ and $1 + q/2 \geq \alpha$. We shall suppose further that there exists a compact subset $F$ of $D$ satisfying
$$F = \{\theta : U(\theta) \leq c_0\} \supset \{\theta : U'(\theta)h(\theta) = 0\}. \tag{18}$$

Theorem 5.3 We assume (A1)–(A4), (A5'), (A6') and (A7), and suppose that $F$ is a compact set satisfying (18). Then for any compact subset $Q$ of $D$ and $q \geq q_0(\alpha)$, there exist constants $B, s$ such that for all $a \in Q$ and all $x$,
$$P_{x,a}(\theta_n \text{ converges to } F) \geq 1 - B(1 + |x|^s) \sum_{k \geq 1} \gamma_k^{1+q/2}. \tag{19}$$

The proof consists of two steps. The first step is to prove that the set $\{\sigma(\epsilon_0) = \infty, \tau(c_2) = \infty\}$ is large; the second step is to prove that on this set the convergence holds. These steps are expressed in the following statements.

Proposition 5.1 Given $c_1, c_2, q$ satisfying $c_0 < c_1 < c_2 < C$, $q \geq q_0(\alpha)$, there exist $\epsilon_0, B, s$ such that for all $a \in K(c_1)$ and all $x$,
$$P_{x,a}\bigl(\sigma(\epsilon_0) = \infty,\ \tau(c_2) = \infty\bigr) \geq 1 - B(1 + |x|^s) \sum_{k \geq 1} \gamma_k^{1+q/2}. \tag{20}$$

Consider the complement of $\{\sigma(\epsilon_0) = \infty, \tau(c_2) = \infty\}$:
$$\{\tau(c_2) < \infty,\ \sigma(\epsilon_0) \geq \tau(c_2)\} \cup \{\sigma(\epsilon_0) < \infty,\ \tau(c_2) \geq \sigma(\epsilon_0)\}.$$
Using Theorem 5.1 with $\epsilon_0 = \epsilon_0(q, K(c_2))$ and the following lemma, we get the first statement.

Lemma 5.6 Given $c_1, c_2, q$ such that $c_0 < c_1 < c_2 < C$, $q \geq q_0(\alpha)$, there exist $\epsilon_0, B, s$ such that for $\epsilon \leq \epsilon_0$ and $a \in K(c_1)$,
$$P_{x,a}\{\tau(c_2) < \infty,\ \sigma(\epsilon) \geq \tau(c_2)\} \leq B(1 + |x|^s) \sum_{k \geq 1} \gamma_k^{1+q/2}. \tag{21}$$

The convergence of the algorithm towards the compact set $F$ is expressed in the second step.

Proposition 5.2 Suppose $c$ and $\epsilon$ satisfy $c_0 < c < C$ and $\epsilon \leq \epsilon_0(q, K(c))$ for some $q \geq q_0(\alpha)$. Then for all $x$ and all $a$ in the interior of $K(c)$, $\theta_n$ converges $P_{x,a}$-a.s. towards $F$ on $\{\tau(c) = \infty, \sigma(\epsilon) = \infty\}$.

At the end of the section we combine Theorems 5.2 and 5.3 to give a detailed description of the convergence of $\theta_n$ when $D$ is the domain of attraction of a point $\theta^*$. We replace (A7) by (A7'):

(A7') There exists a positive function $U \in C^2$ on $D$ such that $U(\theta) \to C \leq \infty$ if $\theta \to \partial D$ or $|\theta| \to \infty$, and $U(\theta) < C$ if $\theta \in D$, which satisfies
1. $U(\theta^*) = 0$ and $U(\theta) > 0$ for $\theta \in D$, $\theta \neq \theta^*$;
2. $U'(\theta)h(\theta) < 0$ for all $\theta \in D$, $\theta \neq \theta^*$.

Clearly $F = \{\theta^*\}$ satisfies (18) with $c_0 = 0$.

Theorem 5.4 We assume (A1)–(A4) and (A5')–(A7'). Then for any compact subset $Q$ of $D$, all $q \geq q_0(\alpha)$ and all $\delta > 0$, there exist constants $B, s$ such that for all $n \geq 0$, all $a \in Q$ and all $x$,
$$P_{n,x,a}\Bigl( \sup_{k > n} |\theta_k - \bar{\theta}(t_k; t_n, a)| > \delta \Bigr) \leq B(1 + |x|^s) \sum_{k \geq n} \gamma_k^{1+q/2}. \tag{22}$$

References

[1] A. Benveniste, M. Métivier and P. Priouret. Adaptive Algorithms and Stochastic Approximations. Springer-Verlag, Berlin, 1990.

[2] J. A. Bucklew, T. G. Kurtz and W. A. Sethares. Weak convergence and local stability properties of fixed step size recursive algorithms. IEEE Transactions on Information Theory, 39(3):966–978, 1993.

[3] S. Geman. Some averaging and stability results for random differential equations. SIAM Journal on Applied Mathematics, 36:87–105, 1979.

[4] L. Gerencsér. Rate of convergence of recursive estimators. SIAM Journal on Control and Optimization, 30(5):1200–1227, 1992.

[5] P. Hall and C. C. Heyde. Martingale Limit Theory and Its Application. Academic Press, New York, 1980.

[6] J. A. Joslin and A. J. Heunis. Law of the iterated logarithm for a constant-gain linear stochastic gradient algorithm. SIAM Journal on Control and Optimization, 39:533–570, 2000.

[7] Y. Kifer. Ergodic Theory of Random Transformations. Progress in Probability and Statistics, vol. 10. Birkhäuser, 1986.

[8] N. N. Krasovskii. Stability of Motion. Stanford University Press, 1963.

[9] Hidden Markov Models, lecture notes (in Hungarian), www.sztaki.hu/molnarsg/hmm3.ps.

6 Appendix

A summary of assumptions. We made the following assumptions:

(A.1.) $(\gamma_n)_{n \geq 1}$ is a decreasing sequence of positive real numbers such that $\sum_n \gamma_n = +\infty$.

(A.2.) There exists a family $\{\Pi_\theta : \theta \in \mathbb{R}^d\}$ of transition probabilities $\Pi_\theta(x, A)$ on $\mathcal{X}$ such that for any measurable subset $A$ of $\mathcal{X}$ we have
$$P[X_{n+1} \in A \mid \mathcal{F}_n] = \Pi_{\theta_n}(X_n, A).$$

(A.3.) There exists a $q' \geq 1$ such that in any compact set $Q \subset D$ we have for all $\theta \in Q$
$$|H(\theta, x)| \leq C(1 + |x|^{q'}),$$
where $C$ depends on $Q$.

(A.4.)

(i) For all $\theta \in D$ there exists a constant $h(\theta)$ such that the Poisson equation
$$(I - \Pi_\theta)\,\nu(x) = H(\theta, x) - h(\theta)$$
has a solution. Denote the solution by $\nu_\theta = \nu_\theta(x)$.

(ii) $h$ is locally Lipschitz-continuous on $D$.

(iii)/a There exists a $q' \geq 1$ such that for all compact domains $Q \subset D$ we have for $\theta \in Q$
$$|\nu_\theta(x)| \leq C(1 + |x|^{q'}),$$
where $C$ also depends on $Q$.

(iii)/b There exist a $q' \geq 1$ and a $\lambda$ with $1/2 \leq \lambda \leq 1$ such that for all compact $Q \subset D$, $\theta, \theta' \in Q$ and any $x \in \mathcal{X}$ we have, for $w_\theta = \Pi_\theta\nu_\theta$,
$$|w_\theta(x) - w_{\theta'}(x)| \leq C|\theta - \theta'|^\lambda (1 + |x|^{q'}),$$
where $C$ depends on $Q$.

(A.5.-0.) The Markov process $(X_n)$ is $L^q$-stable for any $q \geq 1$, uniformly in $\theta$: for any $q \geq 1$, any compact set $Q \subset D$, for all $\theta \in Q$, $x \in \mathcal{X}$, we have for all $n \geq 1$
$$E_\theta[\,|X_n|^q \mid X_0 = x\,] \leq C_q(1 + |x|^q),$$
where $C_q$ also depends on $Q$.

(A.5.) For all $q > 0$ and $Q \subset D$ compact, $\theta \in Q$ and all $n$ we have
$$E_{x,a}\{I(n < \tau(Q))(1 + |X_{n+1}|^q)\} \leq C_q(1 + |x|^q),$$
where $C_q$ depends also on $Q$.

(A.6.) There exist compact sets $Q' \subseteq \mathrm{int}\,Q'' \subseteq D$ such that
$$\bar{\theta}(t; 0, a) \in Q' \quad \text{for all } a \in Q',\ 0 \leq t \leq T,$$
where $\bar{\theta}(t; 0, a)$ denotes the general solution of the associated ODE
$$\dot{\bar{\theta}}(t) = h(\bar{\theta}(t)), \qquad \bar{\theta}(0) = a.$$

(A.7.) There exist a twice continuously differentiable positive function $U : D \to \mathbb{R}_+$ and a constant $C \leq +\infty$ such that
(i) $U(\theta) \to C$ if $\theta \to \partial D$ or $|\theta| \to \infty$;
(ii) $U(\theta) < C$ for $\theta \in D$;
(iii) $U'(\theta) \cdot h(\theta) \leq 0$ for all $\theta \in D$.
