ASYMPTOTIC RECURRENCE AND WAITING TIMES FOR STATIONARY PROCESSES

Ioannis Kontoyiannis(1)(2)

Appeared in Journal of Theoretical Probability, 11, pp. 795-811, July 1998.

Let $X = \{X_n;\ n \in \mathbb{Z}\}$ be a discrete-valued stationary ergodic process distributed according to $P$, and let $x = (\ldots, x_{-1}, x_0, x_1, \ldots)$ denote a realization from $X$. We investigate the asymptotic behavior of the recurrence time $R_n$, defined as the first time that the initial $n$-block $x_1^n = (x_1, x_2, \ldots, x_n)$ reappears in the past of $x$. We identify an associated random walk, $-\log P(X_1^n)$, defined on the same probability space as $X$, and we prove a strong approximation theorem between $\log R_n$ and $-\log P(X_1^n)$. From this we deduce an almost sure invariance principle for $\log R_n$. As a byproduct of our analysis we get unified proofs for several recent results that were previously established using methods from ergodic theory, the theory of Poisson approximation, and the analysis of random trees. Similar results are proved for the waiting time $W_n$, defined as the first time until the initial $n$-block from one realization first appears in an independent realization generated by the same (or by a different) process.
KEY WORDS: Recurrence; strong approximation; almost-sure invariance principles.

1. INTRODUCTION

Recurrence properties are important in the study of stationary processes in probability theory, and of dynamical systems in ergodic theory. In this paper we investigate the asymptotic behavior of recurrence and waiting times for finite-valued stationary processes, under various mixing conditions. Let $X = \{X_n;\ n \in \mathbb{Z}\}$ be a stationary ergodic process on the space of infinite sequences $(S^\infty, \mathcal{B}, P)$, where $S$ is a finite set, $\mathcal{B}$ is the $\sigma$-field generated by finite-dimensional cylinders, and $P$ is a shift-invariant ergodic probability measure. By $x \in S^{\mathbb{Z}}$, $x = (\ldots, x_{-1}, x_0, x_1, \ldots)$ we denote an infinite realization of $X$, and for $i \le j$ we write $x_i^j$ for the finite substring $(x_i, x_{i+1}, \ldots, x_j)$ and $P(x_i^j)$ for the probability of the cylinder $\{y : y_i^j = x_i^j\}$. Similarly we write $X_i^j$ for the vector $(X_i, \ldots, X_j)$, $x_{-\infty}^j$ for the semi-infinite string $(\ldots, x_{j-1}, x_j)$, and so on. Given two independent realizations $x$, $y$, our main quantities of interest are the recurrence time $R_n$, defined as the first time until the opening string $x_1^n$ recurs in the past of $x$, and

(1) Durand 141A, Information Systems Lab, Electrical Engineering Department, Stanford University, Stanford CA 94305-4055, USA. Tel: (650) 723-4544. Email: [email protected]
(2) This work was partially supported by grants NSF #NCR-9205663, JSEP #DAAH04-94-G-0058, ARPA #J-FBI-94-218-2.


the waiting time $W_n$ until the opening string $x_1^n$ from the realization $x$ first appears in the independent realization $y$:
$$R_n = R_n(x) = \inf\{k \ge 1 : x_{-k+1}^{-k+n} = x_1^n\}$$
$$W_n = W_n(x_1^n, y) = \inf\{k \ge 1 : y_k^{k+n-1} = x_1^n\}.$$
There has been a lot of work on calculating the exact asymptotic behavior of $R_n$ and $W_n$. Wyner and Ziv, motivated by coding problems in information theory, drew a deep connection between these quantities and the entropy rate of the underlying process.(22) They proved that $R_n$ and $W_n$ both grow exponentially with $n$ and that the limiting rate is equal to the entropy rate $H = H(P) = \lim_n E[-\log P(X_0 \mid X_{-n}^{-1})]$. (Here and throughout the paper `log' will denote the logarithm to base 2, and `ln' the logarithm to base $e$.) Specifically, they showed that for stationary ergodic processes $(1/n)\log R_n$ converges to $H$ in probability, that for stationary ergodic Markov chains $(1/n)\log W_n \to H$ in probability, and they also suggested that these results hold in the almost sure sense. Indeed this was later established by Ornstein and Weiss,(13) who showed that for stationary ergodic processes
$$\frac{1}{n}\log R_n \to H \qquad P\text{-a.s.,} \tag{1.1}$$
and by Shields,(17) who showed that for functions of stationary ergodic Markov chains
$$\frac{1}{n}\log W_n \to H \qquad P \times P\text{-a.s.} \tag{1.2}$$
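To keep the two definitions concrete, here is a minimal brute-force sketch (our own illustration, not part of the paper; the function names, the Bernoulli source and all parameters are arbitrary choices) that computes $R_n$ and $W_n$ on simulated data:

```python
import random

def recurrence_time(past, future, n):
    """R_n: the first k >= 1 with x_{-k+1}^{-k+n} equal to x_1^n.

    `past` holds (..., x_{-1}, x_0) and `future` holds (x_1, x_2, ...);
    for k < n the candidate block overlaps x_1^n itself, which the
    definition allows, so we search in the concatenated sequence.
    """
    block = future[:n]
    seq = past + future
    start = len(past)                      # index of x_1 in seq
    for k in range(1, len(past) + 1):
        if seq[start - k : start - k + n] == block:
            return k
    return None                            # no recurrence inside the sample

def waiting_time(block, y):
    """W_n: the first k >= 1 with y_k^{k+n-1} equal to x_1^n (1-based k)."""
    n = len(block)
    for k in range(len(y) - n + 1):
        if y[k : k + n] == block:
            return k + 1
    return None

random.seed(0)
p = 0.3                                    # Bernoulli(p) source, arbitrary
x = [int(random.random() < p) for _ in range(200_000)]
y = [int(random.random() < p) for _ in range(100_000)]
past, future = x[:100_000], x[100_000:]
print("R_8 =", recurrence_time(past, future, 8))
print("W_8 =", waiting_time(future[:8], y))
```

For this memoryless source both quantities are typically of order $1/P(x_1^8)$, which is about $2^{8H} \approx 130$ here, in line with (1.1) and (1.2).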

The waiting time results were further extended, first by Nobel and Wyner,(12) who showed that the convergence in probability holds for processes that are $\psi$-mixing with a certain rate, and then by Marton and Shields,(11) who extended (1.2) to weak Bernoulli processes. Shields also provided a counter-example to show that (1.2) does not hold in the general ergodic case.(17)

Wyner and Ziv used in their analysis a theorem of Kac,(10) which can be phrased as follows: If $X$ is stationary ergodic, then for any opening string $x_1^n$ we have $E(R_n \mid X_1^n = x_1^n) = 1/P(x_1^n)$. This provides a strong formal connection between $R_n$ and $H$: Taking logarithms of both sides in Kac's theorem, dividing by $n$ and applying the Shannon-McMillan-Breiman theorem(5) yields
$$\lim_n \frac{1}{n}\log E(R_n \mid X_1^n) = \lim_n \frac{1}{n}\log[1/P(X_1^n)] = H \qquad \text{a.s.} \tag{1.3}$$
We can therefore rephrase the Wyner-Ziv-Ornstein-Weiss result (1.1) by saying that they strengthened (1.3) by removing the conditional expectation:
$$\lim_n \frac{1}{n}\log R_n = \lim_n \frac{1}{n}\log[1/P(X_1^n)] = H \qquad \text{a.s.} \tag{1.4}$$
The crucial observation here is that Eq. (1.4) can be thought of as a strong approximation result between $\log R_n$ and $-\log P(X_1^n)$:
$$\log[R_n P(X_1^n)] = o(n) \qquad \text{a.s.} \tag{1.5}$$
Our first result is a sharper form of (1.5), and a corresponding result for $W_n$.
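Kac's theorem itself is easy to probe by simulation. In the sketch below (our own illustration; all names and parameters are arbitrary) the source is i.i.d., so the past is independent of the opening block and the conditional expectation can be estimated by unconditional sampling:

```python
import random

def r_n(past, block):
    """First k >= 1 with (past+block)_{-k+1}^{-k+n} equal to the block."""
    seq, start, n = past + block, len(past), len(block)
    for k in range(1, len(past) + 1):
        if seq[start - k : start - k + n] == block:
            return k
    return len(past)          # truncation; essentially never happens here

random.seed(1)
p, trials = 0.4, 5_000
block = [1, 0, 1, 1, 0]                                   # a fixed opening x_1^5
prob = p ** sum(block) * (1 - p) ** (len(block) - sum(block))
avg = sum(r_n([int(random.random() < p) for _ in range(2_000)], block)
          for _ in range(trials)) / trials
print(f"E(R_n | X_1^n = x_1^n) ~ {avg:.1f},  1/P(x_1^n) = {1 / prob:.1f}")
```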

1.1 Main Results

For $-\infty \le i \le j \le \infty$ let $\mathcal{B}_i^j$ denote the $\sigma$-field generated by $X_i^j$ and define, for $d \ge 1$:
$$\gamma(d) = \max_{s\in S}\, E\left|\log P(X_0 = s \mid X_{-\infty}^{-1}) - \log P(X_0 = s \mid X_{-d}^{-1})\right|$$
$$\psi(d) = \sup\left\{\frac{|P(B \cap A) - P(B)P(A)|}{P(A)P(B)} : A \in \mathcal{B}_{-\infty}^0,\ B \in \mathcal{B}_d^\infty\right\}$$
$$\alpha(d) = \sup\left\{|P(B \cap A) - P(B)P(A)| : A \in \mathcal{B}_{-\infty}^0,\ B \in \mathcal{B}_d^\infty\right\},$$
where $0/0$ is interpreted as $0$. $X$ is called $\psi$-mixing if $\psi(d) \to 0$ as $d \to \infty$, and strongly mixing if $\alpha(d) \to 0$ as $d \to \infty$.

Theorem 1. Let $X$ be a finite-valued stationary ergodic process, and $\{c(n)\}$ an arbitrary sequence of non-negative constants such that $\sum_n n2^{-c(n)} < \infty$. For the recurrence times $R_n$ we have

(i) $\log[R_n P(X_1^n)] \le c(n)$, eventually a.s.
(ii) $\log[R_n P(X_1^n \mid X_{-\infty}^0)] \ge -c(n)$, eventually a.s.

For the waiting times $W_n$ we have

(iii) $\log[W_n P(X_1^n)] \ge -c(n)$, eventually a.s.

and if, in addition, $X$ is $\psi$-mixing, then

(iv) $\log[W_n P(X_1^n)] \le c(n)$, eventually a.s. with respect to the product measure $\mathbf{P} = P \times P$.
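Parts (i) and (ii) say that, up to $O(c(n))$ corrections, $\log R_n$ tracks the random walk $-\log P(X_1^n)$ pathwise. Here is a quick numerical illustration for a memoryless source (our own sketch, not from the paper; all parameters are arbitrary), printing $\log_2 R_n$ next to $-\log_2 P(X_1^n)$ along a single realization:

```python
import math, random

random.seed(2)
p, N = 0.3, 300_000                        # Bernoulli(p) source, long simulated past
x = [int(random.random() < p) for _ in range(N + 50)]
past_len = N                               # x[:N] is the past, x[N:] the future

def neg_log2_prob(block):
    # -log2 P(x_1^n) under the Bernoulli(p) measure
    ones = sum(block)
    return -(ones * math.log2(p) + (len(block) - ones) * math.log2(1 - p))

for n in (5, 8, 11, 14):
    block = x[past_len : past_len + n]
    k = next((k for k in range(1, past_len + 1)
              if x[past_len - k : past_len - k + n] == block), None)
    if k is None:
        print(n, "no recurrence within the simulated past")
    else:
        print(n, round(math.log2(k), 2), round(neg_log2_prob(block), 2))
```

Note that choices such as $c(n) = 3\log n$ satisfy $\sum_n n2^{-c(n)} < \infty$, so the theorem already pins $\log R_n$ down to within a few multiples of $\log n$ of $-\log P(X_1^n)$.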

From Theorem 1 we can easily deduce:

Corollary 1. Strong approximation: Let $X$ be a finite-valued stationary ergodic process.

(a) If $\sum_d \gamma(d) < \infty$ then for any $\delta > 0$, $\log[R_n P(X_1^n)] = o(n^\delta)$ a.s.
(b) If $X$ is $\psi$-mixing then for any $\delta > 0$, $\log[W_n P(X_1^n)] = o(n^\delta)$ a.s. with respect to the product measure $\mathbf{P} = P \times P$.
(c) In the general ergodic case we have $\log[R_n P(X_1^n)] = o(n)$ a.s.

The coefficients $\gamma(d)$ were introduced by Ibragimov in Ref. 8. If $X$ is a Markov chain of order $m \ge 1$ then $\gamma(d) = 0$ for all $d \ge m$, so the rate at which $\gamma(d)$ decays may be interpreted as a measure of how well $X$ can be approximated by finite-order Markov chains. Theorem 1 and Corollary 1 are proved in section 2.

We can now use Corollary 1 to read off the exact asymptotic behavior of $\log R_n$ and $\log W_n$ from that of $-\log P(X_1^n)$. This quantity can be interpreted in information theoretic terms as the ideal Shannon codeword length for the string $X_1^n$, and its asymptotics are well-understood. If $X$ is ergodic, the Shannon-McMillan-Breiman theorem says that $(-1/n)\log P(X_1^n)$ converges almost surely to $H$, and combining this with Corollary 1 we get (1.1), and also (1.2) in the case when $X$ is $\psi$-mixing. If $X$ is a Markov chain or, more generally, if it satisfies certain conditions on the rate of decay of $\gamma(d)$ and $\alpha(d)$, then $-\log P(X_1^n)$ behaves like the partial sum sequence of a strongly mixing stationary process (Ref. 15, chapter 9), and so it satisfies a central limit theorem, a law of the iterated logarithm, their infinite dimensional (functional) counterparts, as well as an almost sure invariance principle. Combining this with Corollary 1 gives us almost sure invariance principles for $\log R_n$ and $\log W_n$: Let $\{R(t);\ t \ge 0\}$ denote the continuous-time path obtained by letting $R(t) = 0$ for $t < 1$, and $R(t) = [\log R_{\lfloor t\rfloor} - \lfloor t\rfloor H]$ for $t \ge 1$, where $\lfloor t\rfloor$ denotes the largest integer not exceeding $t$.

Theorem 2. Almost sure invariance principle: Let $X$ be a finite-valued stationary process such that $\gamma(d) = O(d^{-336})$ and $\alpha(d) = O(d^{-48})$. The following series converges:
$$\sigma^2 = E\left[-\log P(X_0 \mid X_{-\infty}^{-1}) - H\right]^2 + 2\sum_{k=1}^{\infty} E\left[(-\log P(X_0 \mid X_{-\infty}^{-1}) - H)(-\log P(X_k \mid X_{-\infty}^{k-1}) - H)\right].$$
If $\sigma^2 > 0$, then without loss of generality in the sense of Strassen, there exists a Brownian motion $\{B(t);\ t \ge 0\}$ such that for any $\lambda < 1/294$,
$$R(t) - \sigma B(t) = O(t^{1/2-\lambda}) \qquad \text{a.s.}$$
Corresponding results hold for the waiting times $W_n$ in place of $R_n$, under the additional assumption that $X$ is $\psi$-mixing.

The phrase "without loss of generality in the sense of Strassen" means, as usual, that without changing its distribution, $R(t)$ can be redefined on a richer probability space that contains a Brownian motion such that Theorem 2 holds. Theorem 2 is an immediate consequence of combining our Corollary 1 with Theorem 9.1 of Philipp and Stout.(15)

1.2 Consequences

The numerous corollaries that can be derived from the almost sure invariance principles of Theorem 2 are well-known:(18) $\log R_n$ and $\log W_n$ satisfy a central limit theorem and a law of the iterated logarithm (and also their functional counterparts; see Corollary 2, below), as well as stronger results such as
$$\limsup_{n\to\infty} \frac{\sum_{k=1}^n |\log R_k - kH|}{\sigma\sqrt{2n^3\,\mathrm{LLg}(n)}} = 3^{-1/2} \qquad \text{a.s.,}$$
where $\mathrm{LLg}(\cdot)$ denotes the function $\mathrm{LLg}(x) = \ln\ln(x \vee e)$.

Corollary 2. Under the assumptions of Theorem 2, if $\sigma^2 > 0$:

(i) CLT:
$$\frac{\log R_n - nH}{\sigma\sqrt{n}} \xrightarrow{\ D\ } N(0,1).$$
Moreover, the sequence of processes
$$\left\{\frac{R(nt)}{\sigma\sqrt{n}};\ t \in [0,1]\right\},\quad n \ge 1,$$
converges in distribution to standard Brownian motion.

(ii) LIL:
$$\limsup_{n\to\infty} \frac{\log R_n - nH}{\sigma\sqrt{2n\,\mathrm{LLg}(n)}} = 1 \qquad \text{a.s.}$$
Moreover, with probability one, the sequence of sample paths
$$\left\{\frac{R(nt)}{\sigma\sqrt{2n\,\mathrm{LLg}(n)}};\ t \in [0,1]\right\},\quad n \ge 1,$$
is relatively compact in the topology of uniform convergence, and the set of its limit points is the collection $\mathcal{R}$ of all absolutely continuous functions $r : [0,1] \to \mathbb{R}$ such that $r(0) = 0$ and $\int_0^1 (dr/dt)^2\,dt \le 1$.

(iii) Corresponding results hold for the waiting times $W_n$ in place of $R_n$, under the additional assumption that $X$ is $\psi$-mixing.
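For an i.i.d. source the series defining $\sigma^2$ collapses to a single term, with $H = E[-\log p(X_0)]$ and $\sigma^2 = \mathrm{Var}(-\log p(X_0))$, so the one-dimensional CLT in part (i) can be probed directly. A rough Monte Carlo sketch (our own illustration; the parameters are arbitrary, and with $n$ this small the agreement is only approximate):

```python
import math, random

random.seed(3)
p, n, past_len, trials = 0.25, 12, 200_000, 200
H = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
sigma = math.sqrt(p * (-math.log2(p) - H) ** 2
                  + (1 - p) * (-math.log2(1 - p) - H) ** 2)

zs = []
for _ in range(trials):
    x = [int(random.random() < p) for _ in range(past_len + n)]
    block = x[past_len:]
    k = next((k for k in range(1, past_len + 1)
              if x[past_len - k : past_len - k + n] == block), None)
    if k is not None:                      # skip the rare truncated trials
        zs.append((math.log2(k) - n * H) / (sigma * math.sqrt(n)))

mean = sum(zs) / len(zs)
std = math.sqrt(sum((z - mean) ** 2 for z in zs) / (len(zs) - 1))
print(f"{len(zs)} z-scores: mean {mean:.2f}, std {std:.2f} (target: 0 and 1)")
```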

In the special case when $X$ is an irreducible, aperiodic Markov chain, the one-dimensional version of the central limit theorem for $W_n$ was first proved by Wyner,(21) who remarked that his methods can be modified to handle the case of $R_n$ as well.

1.3 Match Lengths

The story of the asymptotics of $R_n$ and $W_n$ can equivalently be told in terms of match lengths along a realization. Given a realization $x$ we define $L_m$ as the length $n$ of the shortest prefix $x_1^n$ that does not appear starting anywhere else in the previous $m$ positions $x_{-m+1}^0$:
$$L_m = L_m(x) = \inf\{n \ge 1 : x_1^n \ne x_{-j+1}^{-j+n},\ \text{for all}\ j = 1, 2, \ldots, m\}.$$
Following Wyner and Ziv(22) we observe that $R_n > m$ if and only if $L_m \le n$, and consequently all asymptotic results about $R_n$ can be translated into results about $L_m$: The almost sure convergence of $(1/n)\log R_n$ to $H$ is equivalent to
$$\frac{L_m}{\log m} \to \frac{1}{H} \qquad \text{a.s.,} \tag{1.6}$$
and the central limit theorem and law of the iterated logarithm for $\log R_n$ (Corollary 2) translate to:

Corollary 3. Under the assumptions of Corollary 2:

(i) CLT:
$$\frac{L_m - \frac{\log m}{H}}{\sigma H^{-3/2}\sqrt{\log m}} \xrightarrow{\ D\ } N(0,1).$$

(ii) LIL:
$$\limsup_{m\to\infty} \frac{L_m - \frac{\log m}{H}}{\sigma H^{-3/2}\sqrt{2\log m\,\mathrm{LLg}(\log m)}} = 1 \qquad \text{a.s.}$$

In the case of the waiting time, given two independent realizations $x$, $y$ from $X$, the dual quantity is the length $n$ of the shortest prefix $x_1^n$ of $x$ that does not appear in $y$ starting anywhere in $y_1^m$:
$$M_m = M_m(x, y) = \inf\{n \ge 1 : x_1^n \ne y_j^{j+n-1},\ \text{for all}\ j = 1, 2, \ldots, m\}.$$
Here $W_n > m$ if and only if $M_m \le n$, and results about $W_n$ can be equivalently stated in terms of $M_m$. In particular, (1.6) and the results of Corollary 3 hold with $M_m$ in place of $L_m$ when $X$ is $\psi$-mixing.
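Both $L_m$ and $M_m$ are computable by direct search, which gives a hands-on check of (1.6). A small sketch (our own illustration; the source and all parameters are arbitrary):

```python
import math, random

def match_length(past, future, m):
    """L_m: shortest n with x_1^n != x_{-j+1}^{-j+n} for every j = 1..m."""
    seq, start = past + future, len(past)
    n = 1
    while True:
        block = seq[start : start + n]
        if all(seq[start - j : start - j + n] != block for j in range(1, m + 1)):
            return n
        n += 1

random.seed(4)
p = 0.35
H = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
x = [int(random.random() < p) for _ in range(400_000)]
past, future = x[:200_000], x[200_000:]
for m in (100, 1_000, 10_000, 100_000):
    lm = match_length(past, future, m)
    print(m, lm, round(lm / math.log2(m), 3), "vs 1/H =", round(1 / H, 3))
```

The printed ratios $L_m/\log m$ should drift toward $1/H$, though slowly, as the iterated-logarithm fluctuations above suggest.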

1.4 History

Some brief remarks about the history of these results are in order here. The first explicit connection between match lengths and entropy seems to have been made by Pittel,(16) whose results are phrased in terms of path lengths in random trees. Aldous and Shields(1) first pointed out the relationship between the random tree interpretations of these results and coding algorithms. Recurrence times in relation to coding theory first appeared in Willems(20) and Wyner and Ziv.(22) Wyner and Ziv discovered the results (1.1) and (1.2), which were formally established by Ornstein and Weiss(13) and Shields,(17) using methods from ergodic theory. The waiting time results were further extended to mixing processes by Nobel and Wyner(12) and Marton and Shields.(11) In the Markov case, Wyner(21) used the Chen-Stein method for Poisson approximation and Markov coupling to prove the one-dimensional central limit theorem for $W_n$. Szpankowski(19) made explicit the equivalence between match lengths along random sequences and feasible paths in random trees.

The approach introduced in this paper provides a probabilistic framework for studying the asymptotic behavior of $R_n$ and $W_n$. From Theorem 2 we can deduce strong results that were not previously known, as well as several known results that were previously established using involved arguments and methods from other areas. Moreover, and perhaps more importantly, Theorem 1 tells us why these results are true: because, in a strong pointwise sense, the recurrence time is asymptotically equal to the reciprocal of the probability of the recurring string. We should also mention that this approach can be extended to random fields on $\mathbb{Z}^d$, though, of course, new subtleties arise in this case regarding the conditional structure of the measures and their mixing rates.

Some of the ideas in the proof of Theorem 1 can be traced to the work of Wyner and Ziv,(22) Ornstein and Weiss,(13) and Shields.(17) We also mention that ideas related to the use of $-\log P(X_1^n)$ (or a similar random walk) as an approximating sequence were used by Ibragimov(8) in proving a refinement to the Shannon-McMillan-Breiman theorem, by Barron(3) in proving the Shannon source coding theorem in the almost sure sense, and by Algoet and Cover(2) in an elementary proof of the Shannon-McMillan-Breiman theorem. Apart from their theoretical interest, the results in this paper may be relevant to several areas of applications such as coding theory,(20,22) DNA sequence analysis,(14) and string searching algorithms.(7,9)

In section 2 we prove our main result, Theorem 1, and we deduce Corollary 1 from it. In section 3 we briefly discuss the special case when $X$ is a Markov chain, and we give an explicit characterization (Theorem 3) of the degenerate case when the asymptotic variance in Theorem 2 is zero. Section 4 contains an extension of our waiting time results to the case when the independent realizations $x$ and $y$ are produced by different processes, and section 5 contains the proof of Theorem 3.

2. STRONG APPROXIMATION

We first deduce Corollary 1 from Theorem 1 and then we give the proof of Theorem 1.

Proof of Corollary 1. For part (a) let $\delta > 0$ be arbitrary; since $\sum_n n2^{-n^\delta} < \infty$ for any $\delta > 0$, taking $c(n) = n^\delta$ in (i) and (ii) of Theorem 1 we get
$$\limsup_{n\to\infty} \frac{1}{n^\delta}\log[R_n P(X_1^n)] \le 0 \qquad \text{a.s.} \tag{2.1}$$
$$\liminf_{n\to\infty} \frac{1}{n^\delta}\log[R_n P(X_1^n \mid X_{-\infty}^0)] \ge 0 \qquad \text{a.s.} \tag{2.2}$$
Therefore to prove (a) it suffices to show
$$\log P(X_1^n) - \log P(X_1^n \mid X_{-\infty}^0) = O(1) \qquad \text{a.s.} \tag{2.3}$$
Observe that $|\log P(X_1^n) - \log P(X_1^n \mid X_{-\infty}^0)| \le \sum_{i=1}^n |\log P(X_i \mid X_1^{i-1}) - \log P(X_i \mid X_{-\infty}^{i-1})|$. Taking expectations of both sides we get $E|\log P(X_1^n) - \log P(X_1^n \mid X_{-\infty}^0)| \le \sum_{i=1}^n \gamma(i)$, and since $\sum_{i=1}^\infty \gamma(i) < \infty$, this implies Eq. (2.3).

Part (b) follows immediately from (iii) and (iv) of Theorem 1, upon noticing that $\sum_n n2^{-n^\delta} < \infty$ for any $\delta > 0$.

For part (c), taking $\delta = 1$ in Eqs. (2.1) and (2.2) we see that to prove (c) it suffices to show that
$$\frac{1}{n}\left[\log P(X_1^n) - \log P(X_1^n \mid X_{-\infty}^0)\right] \to 0 \qquad \text{a.s.} \tag{2.4}$$
By the Shannon-McMillan-Breiman theorem, the first term converges almost surely to $-H$, and the second term is equal to $(1/n)\sum_{i=1}^n \log P(X_i \mid X_{-\infty}^{i-1})$, which also converges almost surely to $-H$, by the ergodic theorem applied to $-\log P(X_0 \mid X_{-\infty}^{-1})$, whose mean is $H$. This proves (2.4) and completes the proof of Corollary 1. □

Proof of Theorem 1. Part (i). Given an arbitrary positive constant $K$, by Markov's inequality and Kac's theorem,
$$P(R_n > K \mid X_1^n = x_1^n) \le \frac{E(R_n \mid X_1^n = x_1^n)}{K} = \frac{1}{K\,P(x_1^n)},$$
for any opening sequence $x_1^n$ with non-zero probability. Since $P(x_1^n)$ is constant with respect to the conditional measure $P(\,\cdot \mid X_1^n = x_1^n)$ we can let $K = 2^{c(n)}/P(x_1^n)$ to get
$$P\left(\log[R_n P(X_1^n)] > c(n) \mid X_1^n = x_1^n\right) = P\left(R_n > 2^{c(n)}/P(x_1^n) \mid X_1^n = x_1^n\right) \le 2^{-c(n)}.$$
Averaging over all opening patterns $x_1^n \in S^n$,
$$P\left(\log[R_n P(X_1^n)] > c(n)\right) \le 2^{-c(n)},$$
and the Borel-Cantelli lemma gives (i).

Part (ii). We now condition on the infinite past $X_{-\infty}^0$ instead of the opening string $X_1^n$. Fix any $x_{-\infty}^0$ and consider
$$P\left(\log[R_n(X)P(X_1^n \mid X_{-\infty}^0)] < -c(n) \,\Big|\, X_{-\infty}^0 = x_{-\infty}^0\right) = P\left(\left\{z_1^n \in S^n : P(X_1^n = z_1^n \mid X_{-\infty}^0) < \frac{2^{-c(n)}}{R_n(x_{-\infty}^0 \cdot z_1^n)}\right\} \,\Big|\, X_{-\infty}^0 = x_{-\infty}^0\right),$$
where $\cdot$ denotes concatenation of strings. If we let $G_n = G_n(x_{-\infty}^0)$ denote the set
$$\left\{z_1^n \in S^n : P(z_1^n \mid x_{-\infty}^0) < 2^{-c(n)}/R_n(x_{-\infty}^0 \cdot z_1^n)\right\},$$
then the above probability can be written as
$$\sum_{z_1^n \in G_n} P(z_1^n \mid x_{-\infty}^0) \le \sum_{z_1^n \in G_n} \frac{2^{-c(n)}}{R_n(x_{-\infty}^0 \cdot z_1^n)} \le 2^{-c(n)} \sum_{z_1^n \in S^n} \frac{1}{R_n(x_{-\infty}^0 \cdot z_1^n)}. \tag{2.5}$$
Since $x_{-\infty}^0$ is fixed, for each $j \ge 1$ there is at most one string $z_1^n$ in $S^n$ with $R_n(x_{-\infty}^0 \cdot z_1^n) = j$, so the sum in Eq. (2.5) is bounded above by
$$\sum_{j=1}^{s^n} \frac{1}{j} \le Dn,$$
for some positive constant $D$, where $s = |S|$ is the cardinality of $S$. Therefore
$$P\left(\log[R_n P(X_1^n \mid X_{-\infty}^0)] < -c(n) \,\Big|\, X_{-\infty}^0 = x_{-\infty}^0\right) \le Dn2^{-c(n)},$$

and since this bound is independent of $x_{-\infty}^0$ and summable over $n$, from the Borel-Cantelli lemma we deduce (ii).

Part (iii). Consider the joint process $(X, Y)$ distributed according to the product measure $\mathbf{P} = P \times P$. Given an arbitrary constant $K > 1$, for any opening string $x_1^n$ with non-zero probability we have
$$\mathbf{P}(W_n < K \mid X_1^n = x_1^n) \le \sum_{j=1}^{\lfloor K\rfloor} \mathbf{P}(W_n = j \mid X_1^n = x_1^n) \le \sum_{j=1}^{\lfloor K\rfloor} P(Y_j^{j+n-1} = x_1^n) \le K\,P(x_1^n).$$
Setting $K = 2^{-c(n)}/P(x_1^n)$ gives
$$\mathbf{P}\left(\log[W_n P(X_1^n)] < -c(n) \mid X_1^n = x_1^n\right) \le 2^{-c(n)}.$$

(If $K = 2^{-c(n)}/P(x_1^n) \le 1$ then $\mathbf{P}(W_n < K \mid X_1^n = x_1^n) = 0$, since $W_n \ge 1$ by definition, so the above bound trivially holds.) This is independent of $x_1^n$, so by the Borel-Cantelli lemma we get (iii).

Part (iv). Let $\eta \in (0,1)$ be arbitrary and choose $d$ such that $\psi(d) < \eta$. Fix an integer $N$ large enough so that $2^{c(n)} \ge 2(n+d)$ for all $n \ge N$, fix an $n \ge N$, and let $K \ge 2(n+d)$ be arbitrary. Then for any sequence $x_1^n$ with non-zero probability, we can expand
$$\mathbf{P}(W_n > K \mid X_1^n = x_1^n) = P\left(Y_1^n \ne x_1^n,\ Y_2^{n+1} \ne x_1^n,\ \ldots,\ Y_K^{K+n-1} \ne x_1^n\right)$$
$$\le [1 - P(x_1^n)] \prod_{j=1}^{\lfloor K/(n+d)\rfloor - 1} P\left(Y_{j(n+d)+1}^{j(n+d)+n} \ne x_1^n \,\Big|\, Y_{i(n+d)+1}^{i(n+d)+n} \ne x_1^n,\ 0 \le i < j\right) = [1 - P(x_1^n)] \prod_{j=1}^{\lfloor K/(n+d)\rfloor - 1} \left[1 - P(B_j \mid A_j)\right], \tag{2.6}$$
where $A_j$ and $B_j$ are the events
$$A_j = \left\{Y_{i(n+d)+1}^{i(n+d)+n} \ne x_1^n,\ i = 0, 1, \ldots, j-1\right\} \in \mathcal{B}_1^{j(n+d)-d}$$
$$B_j = \left\{Y_{j(n+d)+1}^{j(n+d)+n} = x_1^n\right\} \in \mathcal{B}_{j(n+d)+1}^\infty.$$
By the choice of $d$ and stationarity we have $P(B_j \mid A_j) \ge (1-\eta)P(B_j) = (1-\eta)P(x_1^n)$, for all $j$, and substituting in Eq. (2.6) we get
$$\mathbf{P}(W_n > K \mid X_1^n = x_1^n) \le \left[1 - (1-\eta)P(x_1^n)\right]^{\lfloor K/(n+d)\rfloor} \le \frac{1}{\theta}\left[1 - (1-\eta)P(x_1^n)\right]^{\frac{K}{n+d}}.$$
For any $n \ge N$ let $K = 2^{c(n)}/P(x_1^n) \ge 2(n+d)$ to obtain:
$$\mathbf{P}\left(\log[W_n P(X_1^n)] > c(n) \mid X_1^n = x_1^n\right) \le \frac{1}{\theta}\left[1 - (1-\eta)P(x_1^n)\right]^{\frac{2^{c(n)}}{P(x_1^n)(n+d)}} \le \frac{1}{\theta}\,\theta^{\frac{(1-\eta)2^{c(n)}}{n+d}},$$
where $\theta = \sup\{(1-z)^{1/z};\ 0 < z \le 1-\eta\} < 1$. Since $\sum_n n2^{-c(n)} < \infty$, $(1/n)2^{c(n)} \to \infty$ as $n \to \infty$, and we can choose $N'$ large enough such that $\theta^{-1}\theta^{(1-\eta)2^{c(n)}/(n+d)} \le 2n2^{-c(n)}$ for all $n \ge N'$. Consequently,
$$\mathbf{P}\left(\log[W_n P(X_1^n)] > c(n) \mid X_1^n = x_1^n\right) \le 2n2^{-c(n)},$$
for all $n \ge M = \max\{N, N'\}$. Since this bound is independent of $x_1^n$,
$$\sum_{n\ge 1}\mathbf{P}\left(\log[W_n P(X_1^n)] > c(n)\right) \le M + 2\sum_{n\ge M} n2^{-c(n)} < \infty,$$
and the Borel-Cantelli lemma gives (iv) and completes the proof of the Theorem. □

Remark. In the proofs of (ii) and (iii) only the stationarity (and not the ergodicity) of X was used.

3. MARKOV CHAINS

If $X$ is a stationary irreducible aperiodic Markov chain, then $\psi(d)$ and $\alpha(d)$ both decay exponentially fast, and $\gamma(d) = 0$ for all $d \ge 1$, so $X$ satisfies all the conditions of Theorem 1 and Corollary 1. Consider the chain $\tilde{X} = \{\tilde{X}_n = (X_n, X_{n+1});\ n \in \mathbb{Z}\}$ with state-space $T = \{(s,t) \in S \times S : P(X_{i+1} = t \mid X_i = s) > 0\}$; $\tilde{X}$ is also stationary, irreducible and aperiodic. Let
$$f(s,t) = -\log P(X_{i+1} = t \mid X_i = s), \qquad f : T \to \mathbb{R},$$
and observe that here the entropy rate $H$ of $X$ is equal to $Ef(\tilde{X}_i)$, so that, with probability one, $[-\log P(X_1^n) - nH]$ behaves like the sequence of partial sums of a centered bounded function of a Markov chain, up to a bounded term:
$$-\log P(X_1^n) - nH = \sum_{i=1}^{n-1}\left[f(\tilde{X}_i) - Ef(\tilde{X}_i)\right] + \left[-\log P(X_1) - H\right]. \tag{3.1}$$
In this case the asymptotic variance of Theorem 2 reduces to
$$\sigma^2 = \lim_{n\to\infty}\frac{1}{n}\mathrm{Var}\left(-\log P(X_1^n)\right). \tag{3.2}$$

The following characterization of the degenerate case $\sigma^2 = 0$ was stated by Yushkevich.(23) We supply a proof of a slightly stronger result in section 5.

Theorem 3. Let $X$ be a finite-valued stationary irreducible aperiodic Markov chain with entropy rate $H$, and let $\sigma^2$ be defined by Eq. (3.2). Then $\sigma^2 = 0$ if and only if every string $X_1^{n+1}$ that starts and ends in some fixed state $j \in S$ has probability (given that $X_1 = j$) either zero or $q^n$, for some constant $q$ depending on $X$.
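The dichotomy can be seen numerically by estimating $(1/n)\mathrm{Var}(-\log P(X_1^n))$. In the sketch below (our own construction, not from the paper) the first chain is the golden-mean shift with its maximal-entropy transition probabilities, which have the degenerate form $p_{ij} = 2^{-H}v_i/v_j$ established in section 5; the second is a generic chain for contrast:

```python
import math, random

def simulate_neglogp(P, pi, n, trials, rng):
    """Monte Carlo samples of -log2 P(X_1^n) for a stationary 2-state chain."""
    out = []
    for _ in range(trials):
        s = 0 if rng.random() < pi[0] else 1
        ll = -math.log2(pi[s])
        for _ in range(n - 1):
            t = 0 if rng.random() < P[s][0] else 1
            ll += -math.log2(P[s][t])
            s = t
        out.append(ll)
    return out

def var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

rng = random.Random(5)
z = (math.sqrt(5) - 1) / 2                   # 2^{-H} for the golden-mean chain
degenerate = [[z, 1 - z], [1.0, 0.0]]        # transitions of the form 2^{-H} v_i / v_j
pi_deg = [1 / (2 - z), (1 - z) / (2 - z)]
generic = [[0.8, 0.2], [0.3, 0.7]]
pi_gen = [0.6, 0.4]                          # solves pi = pi P
for name, P, pi in [("degenerate", degenerate, pi_deg), ("generic", generic, pi_gen)]:
    for n in (200, 800):
        v = var(simulate_neglogp(P, pi, n, 2000, rng)) / n
        print(name, n, round(v, 4))
```

For the first chain the normalized variance should decay toward zero as $n$ grows, while for the second it should stabilize near a positive $\sigma^2$.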

4. WAITING TIMES BETWEEN DIFFERENT PROCESSES

Let $X$, $Y$ be two independent stationary processes distributed according to the measures $P$ and $Q$, respectively, with values in the finite set $S$. We consider the waiting time $W_n(x_1^n, y)$ until the opening string $x_1^n$ in the realization $x$ of the $X$-process first appears in an independent realization $y$ produced by the $Y$-process. We generalize our earlier results about $W_n$ to this case, under the additional natural assumption that all finite-dimensional marginals $P_n$ of $P$ are dominated by the corresponding marginals $Q_n$ of $Q$. If this is not satisfied then there will exist finite strings $x_1^n$ such that $P(x_1^n) > 0$ but $Q(x_1^n) = 0$, and $W_n$ will be infinite with positive probability. The analog of Theorem 1 (parts (iii) and (iv)) reads:

Theorem 4. Let $X$, $Y$ be independent stationary processes, distributed according to $P$, $Q$, respectively. Assume that $P_n \ll Q_n$ for all $n$, and write $\mathbf{P}$ for the product measure $P \times Q$. For any sequence $\{c(n)\}$ of non-negative constants such that $\sum_n n2^{-c(n)} < \infty$,
$$\log[W_n Q(X_1^n)] \ge -c(n), \qquad \text{eventually } \mathbf{P}\text{-a.s.,}$$
and if, in addition, $Y$ is $\psi$-mixing, then
$$\log[W_n Q(X_1^n)] \le c(n), \qquad \text{eventually } \mathbf{P}\text{-a.s.}$$

The proof of Theorem 4 is a simple modification of the proof of the corresponding waiting time results in Theorem 1. Simply replace $P$ by $Q$ throughout the arguments that lead to (iii) and (iv), and note that under the additional assumption that $P_n \ll Q_n$ for all $n$ it suffices to condition on opening strings $x_1^n$ of non-zero $P_n$-probability. Now assume $X$ is ergodic and $Y$ is a Markov chain, and define the relative entropy rate between $P$ and $Q$ as
$$D(P\|Q) = \lim_{n\to\infty} E_P\left[\log\frac{P(X_0 \mid X_{-n}^{-1})}{Q(X_0 \mid X_{-n}^{-1})}\right].$$
If for any $\epsilon > 0$ we let $c(n) = \epsilon n$ in Theorem 4, apply the generalized Shannon-McMillan-Breiman theorem(4) and let $\epsilon$ decrease to zero, we get:

Corollary 4. If $X$ is ergodic and $Y$ is a Markov chain,
$$\lim_{n\to\infty}\frac{1}{n}\log W_n = H(P) + D(P\|Q) \qquad \mathbf{P}\text{-a.s.}$$
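For memoryless $P$ and $Q$ the limit $H(P) + D(P\|Q)$ is just the cross entropy $E_P[-\log q(X_0)]$, so Corollary 4 is easy to check numerically. A sketch (our own illustration; the two Bernoulli parameters are arbitrary):

```python
import math, random

random.seed(6)
p, q = 0.3, 0.6                     # X ~ i.i.d. Bernoulli(p), Y ~ i.i.d. Bernoulli(q)
target = -(p * math.log2(q) + (1 - p) * math.log2(1 - q))   # H(P) + D(P||Q)

def waiting_time(block, y):
    n = len(block)
    return next((k + 1 for k in range(len(y) - n + 1)
                 if y[k : k + n] == block), None)

x = [int(random.random() < p) for _ in range(20)]
y = [int(random.random() < q) for _ in range(1_000_000)]
for n in (6, 10, 14):
    w = waiting_time(x[:n], y)
    rate = None if w is None else round(math.log2(w) / n, 3)
    print(n, rate, "target:", round(target, 3))
```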

Next, suppose that $X$ and $Y$ are both stationary Markov chains and that $X$ is irreducible and aperiodic. Define a continuous-time process $\{q(t);\ t \ge 0\}$ by letting $q(t) = 0$ for $t < 1$, and $q(t) = [-\log Q(X_1^{\lfloor t\rfloor}) - \lfloor t\rfloor(H(P) + D(P\|Q))]$ for $t \ge 1$. Consider the chain $\tilde{X}$ of section 3, let $g : T \to \mathbb{R}$ be defined by
$$g(s,t) = -\log Q(X_{i+1} = t \mid X_i = s),$$
and notice that $Eg(\tilde{X}_i) = H(P) + D(P\|Q)$. As in Eq. (3.1), we can think of $[-\log Q(X_1^n) - n(H(P) + D(P\|Q))]$ as the sequence of partial sums of a centered bounded function of a Markov chain (up to a bounded term):
$$\sum_{i=1}^{n-1}\left[g(\tilde{X}_i) - Eg(\tilde{X}_i)\right] + \left[-\log Q(X_1) - (H(P) + D(P\|Q))\right].$$
From well-known Markov chain results (Ref. 15, Theorem 10.1) it follows that $\{q(t)\}$ satisfies an almost sure invariance principle with asymptotic variance $\tilde{\sigma}^2 = \lim_n \frac{1}{n}\mathrm{Var}(-\log Q(X_1^n))$. We can therefore combine this with Theorem 4 to obtain an analog of Theorem 2 in the case $P \ne Q$: Let $W(t) = 0$ for $t < 1$ and $W(t) = [\log W_{\lfloor t\rfloor} - \lfloor t\rfloor(H(P) + D(P\|Q))]$ for $t \ge 1$.

Corollary 5. Let $X$ and $Y$ be stationary Markov chains and suppose that $X$ is irreducible and aperiodic. If $\tilde{\sigma}^2 = \lim_n \frac{1}{n}\mathrm{Var}(-\log Q(X_1^n)) > 0$, then, without loss of generality in the sense of Strassen, there exists a Brownian motion $\{B(t);\ t \ge 0\}$ such that for any $\lambda > 1/4$,
$$W(t) - \tilde{\sigma}B(t) = O(t^\lambda) \qquad \text{a.s.}$$

In the special case where both $X$ and $Y$ are sequences of independent and identically distributed random variables, Wyner(21) proved Corollary 4 and the one-dimensional version of the central limit theorem in Corollary 5.

5. PROOF OF THEOREM 3

In this section we prove the following strengthened version of Theorem 3: $\sigma^2 = 0$ if and only if all the non-zero transition probabilities from state $i$ to state $j$ are of the form $2^{-H}v_i/v_j$, for some positive constants $v_i$, $i \in S$. Theorem 3 then follows with $q = 2^{-H}$.

We begin by deriving a generalization of a formula due to Fréchet(6) for the asymptotic variance of Markov chains. Let $Z = \{Z_n;\ n \in \mathbb{Z}\}$ be a stationary irreducible aperiodic Markov chain with finite state-space $T$, stationary distribution $(q_i)_{i\in T}$, and $k$th order transition probabilities $(q_{ij}^{(k)})_{i,j\in T}$. Let $f$ be a real-valued function on $T$ and write $\bar{f}(\cdot)$ for $f(\cdot) - Ef(Z_1)$. Define
$$\sigma^2 = \lim_{n\to\infty}\frac{1}{n}\mathrm{Var}\left(\sum_{i=1}^n f(Z_i)\right) = E\left(\bar{f}(Z_1)\right)^2 + 2\sum_{k=1}^\infty E\left(\bar{f}(Z_1)\bar{f}(Z_{k+1})\right) = \sum_{j\in T}\bar{f}(j)^2 q_j + 2\sum_{k=1}^\infty\sum_{i,j\in T} q_i q_{ij}^{(k)}\bar{f}(i)\bar{f}(j). \tag{5.1}$$
Letting $s_{ij} = \sum_{k=1}^\infty\left[q_{ij}^{(k)} - q_j\right] < \infty$ (for $i,j \in T$), the second term above becomes
$$2\sum_{i,j} q_i s_{ij}\bar{f}(i)\bar{f}(j) = 2\sum_i q_i\bar{f}(i)\gamma_i,$$
where $\gamma_i = \sum_j s_{ij}\bar{f}(j)$ (for $i \in T$), and substituting this in Eq. (5.1) gives
$$\sigma^2 = \sum_i q_i\left(\bar{f}(i) + \gamma_i\right)^2 - \sum_i q_i\gamma_i^2 = \sum_j q_j\left[\sum_i q_{ji}\left(\bar{f}(i) + \gamma_i\right)^2 - \gamma_j^2\right]. \tag{5.2}$$
Expanding $\sum_i q_{ji}\gamma_i$,
$$\sum_i q_{ji}\gamma_i = \sum_i q_{ji}\sum_m s_{im}\bar{f}(m) = \sum_m \bar{f}(m)\sum_i q_{ji}\sum_{k\ge 1}\left(q_{im}^{(k)} - q_m\right) = \sum_m \bar{f}(m)\sum_{k\ge 1}\left(q_{jm}^{(k+1)} - q_m\right)$$
$$= \sum_m \bar{f}(m)\left[\sum_{k\ge 1}\left(q_{jm}^{(k)} - q_m\right) - (q_{jm} - q_m)\right] = \sum_m s_{jm}\bar{f}(m) - \sum_m q_{jm}\bar{f}(m) = \gamma_j - \sum_m q_{jm}\bar{f}(m), \tag{5.3}$$
and so
$$\sum_i q_{ji}\left(\bar{f}(i) + \gamma_i\right)^2 = \sum_i q_{ji}\left[(\bar{f}(i) + \gamma_i - \gamma_j) + \gamma_j\right]^2 = \sum_i q_{ji}\left[(\bar{f}(i) + \gamma_i - \gamma_j)^2 + \gamma_j^2\right], \tag{5.4}$$
since by Eq. (5.3) the cross terms vanish:
$$\sum_i q_{ji}\,2\gamma_j\left(\bar{f}(i) + \gamma_i - \gamma_j\right) = 2\gamma_j\left(\sum_i q_{ji}\bar{f}(i) - \gamma_j + \sum_i q_{ji}\gamma_i\right) = 2\gamma_j\left(\sum_i q_{ji}\bar{f}(i) - \gamma_j + \gamma_j - \sum_m q_{jm}\bar{f}(m)\right) = 0.$$
Substituting Eq. (5.4) into (5.2) and interchanging $i$ and $j$ yields
$$\sigma^2 = \sum_j q_j\sum_i q_{ji}\left(\bar{f}(i) + \gamma_i - \gamma_j\right)^2, \tag{5.5}$$
which is the generalization of Fréchet's formula for the variance.

Now consider the chain $\tilde{X}$ defined in section 3. For $i,j \in S$ we write $p_i = P(X_1 = i)$ and $p_{ij} = P(X_2 = j \mid X_1 = i)$, so that $\tilde{X}$ has stationary distribution $(q_{ij}) = (p_i p_{ij})$ and transition probabilities $(q_{ij,kl}) = (\delta_{jk}p_{kl})$. Let $f$ be defined as in section 3. Since here $\gamma_{ij} = \gamma_j$ is independent of $i$, using Eq. (5.5) we get
$$\sigma^2 = \sum_{(i,j)\in T} p_i p_{ij}\sum_{(k,l)\in T}\delta_{jk}p_{kl}\left(\bar{f}(k,l) + \gamma_{kl} - \gamma_{ij}\right)^2 = \sum_{(i,j)\in T} p_i p_{ij}\sum_{l\in S:\,p_{jl}>0} p_{jl}\left(\bar{f}(j,l) + \gamma_l - \gamma_j\right)^2.$$
For any $(j,l) \in T$ we have $p_{jl} > 0$, so $\sigma^2 = 0$ if and only if $\bar{f}(j,l) + \gamma_l - \gamma_j = 0$ for every $(j,l) \in T$, that is, if and only if $p_{jl} = 2^{-H}2^{\gamma_l - \gamma_j}$ for every non-zero transition probability. The result stated at the beginning of this section follows, with $v_i = 2^{-\gamma_i}$, $i \in S$. The converse is obvious. □
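Formula (5.5) is directly computable for any finite chain, since $s_{ij} = \sum_{k\ge 1}(q_{ij}^{(k)} - q_j)$ is available in closed form through the fundamental matrix $Z = (I - Q + \mathbf{1}q)^{-1}$. The sketch below (our own illustration, using numpy; it relies on the standard identity $Z = I + \sum_{k\ge 1}(Q^k - \mathbf{1}q)$ for finite ergodic chains) evaluates $\sigma^2$ for the bivariate chain $\tilde{X}$ of section 3 and checks it against a Monte Carlo estimate of $(1/n)\mathrm{Var}(-\log P(X_1^n))$:

```python
import numpy as np

# A generic 2-state chain (arbitrary example) and its stationary distribution.
P = np.array([[0.8, 0.2],
              [0.3, 0.7]])
pi = np.array([0.6, 0.4])                 # solves pi P = pi

# The bivariate chain X~ on T = {(s,t) : p_st > 0}, as in section 3.
T = [(s, t) for s in range(2) for t in range(2) if P[s, t] > 0]
m = len(T)
Q = np.zeros((m, m))                      # q_{(s,t),(k,l)} = delta_{tk} p_{kl}
for a, (s, t) in enumerate(T):
    for b, (k, l) in enumerate(T):
        if t == k:
            Q[a, b] = P[k, l]
q = np.array([pi[s] * P[s, t] for (s, t) in T])    # stationary law of X~
f = np.array([-np.log2(P[s, t]) for (s, t) in T])
H = q @ f                                 # entropy rate, in bits
fbar = f - H

# s_ij = sum_{k>=1} (q_ij^(k) - q_j), via Z = (I - Q + 1 q)^{-1}.
Z = np.linalg.inv(np.eye(m) - Q + np.outer(np.ones(m), q))
gamma = (Z - np.eye(m)) @ fbar

# Formula (5.5): sigma^2 = sum_j q_j sum_i q_ji (fbar_i + gamma_i - gamma_j)^2.
sigma2 = sum(q[j] * sum(Q[j, i] * (fbar[i] + gamma[i] - gamma[j]) ** 2
                        for i in range(m)) for j in range(m))
print("H =", round(H, 4), " sigma^2 =", round(sigma2, 4))

# Monte Carlo check against (3.2): Var(-log2 P(X_1^n)) / n.
rng = np.random.default_rng(7)
vals = []
for _ in range(400):
    s = rng.choice(2, p=pi)
    ll = -np.log2(pi[s])
    for _ in range(1999):
        t = rng.choice(2, p=P[s])
        ll += -np.log2(P[s, t])
        s = t
    vals.append(ll)
print("MC estimate:", round(np.var(vals, ddof=1) / 2000, 4))
```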

ACKNOWLEDGEMENTS

The author wishes to thank Tom Cover and Tze Leung Lai for their helpful comments and suggestions, Amir Dembo for his encouragement and Benjamin Weiss for pointing out an error in an earlier version of our statement of Yushkevich's Theorem.

REFERENCES

1. Aldous, D. and Shields, P. C. (1988). A diffusion limit for a class of randomly-growing binary trees. Prob. Th. Rel. Fields 79, 509-542.
2. Algoet, P. H. and Cover, T. M. (1988). A sandwich proof of the Shannon-McMillan-Breiman theorem. Ann. Probab. 16, 876-898.
3. Barron, A. R. (1985a). Logically smooth density estimation. Ph.D. Thesis, Dept. of Electrical Engineering, Stanford University.

4. Barron, A. R. (1985b). The strong ergodic theorem for densities: Generalized Shannon-McMillan-Breiman theorem. Ann. Probab. 13, 1292-1303.
5. Breiman, L. (1957). The individual ergodic theorem of information theory. Ann. Math. Statist. 28, 809-811. See also, Breiman, L. (1960). A correction to "The individual ergodic theorem of information theory". Ann. Math. Statist. 31, 809-810.
6. Fréchet, M. (1938). Recherches théoriques modernes sur le calcul des probabilités, vol. II, Théorie des événements en chaîne dans le cas d'un nombre fini d'états possibles. Gauthier-Villars, Paris (French).
7. Guibas, L. and Odlyzko, A. M. (1981). Periods in strings. J. Combin. Theory Ser. A 31, 19-42.
8. Ibragimov, I. A. (1962). Some limit theorems for stationary processes. Theory Prob. Appl. 7, 349-382.
9. Jacquet, P. and Szpankowski, W. (1994). Autocorrelation of words and its applications. J. Combin. Theory Ser. A 66, 237-269.
10. Kac, M. (1947). On the notion of recurrence in discrete stochastic processes. Bull. Amer. Math. Soc. 53, 1002-1010.
11. Marton, K. and Shields, P. C. (1995). Almost-sure waiting time results for weak and very weak Bernoulli processes. Ergod. Th. & Dynam. Sys. 15, 951-960.
12. Nobel, A. and Wyner, A. D. (1992). A recurrence theorem for dependent processes with applications to data compression. IEEE Trans. Inform. Theory 38, 1561-1564.
13. Ornstein, D. S. and Weiss, B. (1993). Entropy and data compression schemes. IEEE Trans. Inform. Theory 39, 78-83.
14. Pevzner, P., Borodovsky, M. and Mironov, A. (1991). Linguistics of nucleotide sequences: the significance of deviations from mean statistical characteristics and prediction of the frequency of occurrence of words. J. Biomol. Struct. Dynam. 6, 1013-1026.
15. Philipp, W. and Stout, W. (1975). Almost sure invariance principles for partial sums of weakly dependent random variables. Memoirs of the AMS, vol. 2, issue 2, no. 161.
16. Pittel, B. (1985). Asymptotical growth of a class of random trees. Ann. Probab. 13, 414-427.
17. Shields, P. C. (1993). Waiting times: positive and negative results on the Wyner-Ziv problem. J. Theor. Probab. 6, 499-519.
18. Strassen, V. (1964). An almost sure invariance principle for the law of the iterated logarithm. Z. Wahrsch. Verw. Gebiete 3, 23-32.
19. Szpankowski, W. (1993). Asymptotic properties of data compression and suffix trees. IEEE Trans. Inform. Theory 39, 1647-1659.
20. Willems, F. M. J. (1989). Universal data compression and repetition times. IEEE Trans. Inform. Theory 35, 54-58.
21. Wyner, A. J. (1993). String matching theorems and applications to data compression and statistics. Ph.D. Thesis, Dept. of Statistics, Stanford University.
22. Wyner, A. D. and Ziv, J. (1989). Some asymptotic properties of the entropy of a stationary ergodic data source with applications to data compression. IEEE Trans. Inform. Theory 35, 1250-1258.


23. Yushkevich, A. A. (1953). On limit theorems connected with the concept of the entropy of Markov chains. Uspehi Mat. Nauk 8, 177-180 (Russian).
