On the Relationship Between Process and Pattern Entropy Rate

George M. Gemelos* and Tsachy Weissman
Department of Electrical Engineering, Stanford University
350 Serra Mall, Stanford, CA 94305, USA
Email: {ggemelos, tsachy}@stanford.edu

Abstract— The recent study of pattern sequences and their compressibility properties has led to interest in the entropy rate of pattern sequences of stochastic processes, and its relationship to the entropy rate of the original process. In this work we give a complete characterization of this relationship for general discrete Markov processes as well as for a broad family of Markov and additive noise processes over a continuous alphabet.

I. INTRODUCTION

In their recent work [5], Orlitsky et al. consider the compression of sequences with unknown alphabet size. This work, among others, has created interest in examining random processes with arbitrary alphabets which may a priori be unknown. One can think of this as the problem of reading a foreign language for the first time. As one begins to parse characters, one's knowledge of the alphabet grows. Since the characters in the alphabet initially have no meaning beyond the order in which they appear, one can relabel them by the order of their first appearance. Given a string, we refer to the relabeled string as the pattern associated with the original string.

Example 1: Assume that the following English word is parsed into a pattern by a non-English speaker:

information

The associated pattern is 1, 2, 3, 4, 5, 6, 7, 8, 1, 4, 2.

We abstract this as follows: given a stochastic process {X_i}_{i≥1}, we create a pattern process {Z_i}_{i≥1}. The question we focus on in this work is how the entropy rate of a sequence and that of its pattern relate. More specifically, our goal is to study the relationship between the entropy rate H(X) of the original process¹ {X_i} and the entropy rate H(Z) of the associated pattern process.

* This work was partly supported by NSF grant CCR-0311633.
¹ Throughout this work, X_m^n will denote the sequence X_m, X_{m+1}, ..., X_n. If not specified, m will be assumed to be 1. Furthermore, H(X) will denote entropy rate throughout this work, regardless of the discreteness of the distributions of {X^n} (it can thus be regarded as ∞ when these are not discrete).
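To make the relabeling in Example 1 concrete, the following minimal Python sketch computes the pattern of a finite string by assigning each symbol the index of its first appearance. The function is our own illustration and not part of the paper.

```python
def pattern(seq):
    """Relabel the symbols of seq by order of first appearance."""
    index, out = {}, []
    for s in seq:
        if s not in index:          # a previously unseen symbol gets the next label
            index[s] = len(index) + 1
        out.append(index[s])
    return out

# The word from Example 1.
assert pattern("information") == [1, 2, 3, 4, 5, 6, 7, 8, 1, 4, 2]
```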

The connection between the entropy rate of the pattern and that of the original process was first studied for finite alphabet i.i.d. processes by Shamir et al. in [9]. The results in [9] gave bounds on the entropy rate of the pattern in terms of the entropy rate of the original process. These bounds were further extended to infinite alphabet i.i.d. processes in [8], [7]. These results were in the form of bounds, and did not completely characterize the relationship between the entropy rate of an i.i.d. process and that of its associated pattern. The first complete characterization of this relationship for the general i.i.d. case, as well as for discrete Markov, noise-corrupted, and finite alphabet stationary ergodic processes, was given in [2]. Subsequent independent work by Orlitsky et al. in [4] also characterized this relationship for i.i.d. processes. The finite alphabet stationary ergodic result of [2] was later extended to general discrete stationary ergodic processes in [3], and independently in [6]. In this work we present the discrete Markov and additive noise results of [2], and also treat a family of continuous Markov processes. Although discrete Markov processes fall under the more general Theorem 2 of [3], which deals with discrete stationary ergodic processes, there is insight to be gained by examining the Markov case on its own. In particular, while the proof of Theorem 2 of [3] relies heavily on a version of the Shannon-McMillan-Breiman theorem for countably infinite alphabets found in [1], no such heavy machinery is necessary for the proof of Theorem 1. This is due to the inherent structure of a Markov process, and it makes the Markov case an interesting example in its own right. It is also this structure which allows for the extension of the results to a broad family of Markov processes with uncountable alphabets which we present here. Such extensions have not yet been found for general stationary ergodic processes with uncountable alphabets. In Section II we characterize this relationship for the case of a general discrete Markov process as well as for a broad family of Markov processes over an uncountable alphabet. In Section III we look at additive noise-corrupted processes with uncountable alphabets. We conclude in Section IV with a brief summary of our results.


II. MARKOV PROCESSES

The entropy rate of a Markov process is well known. What can be said about the entropy rate of the associated pattern process? We first look at the case of a first order Markov process with components in a countable alphabet.

Theorem 1: Let {X_i} be a stationary ergodic first order Markov process on the countable alphabet A and let {Z_i} be the associated pattern process. Then H(X) = H(Z).

Proof: Let µ be the stationary distribution of the Markov process and let P_x(y) = Pr{X_{t+1} = y | X_t = x} for all x, y ∈ A. The data processing inequality implies H(X^n) ≥ H(Z^n) for all n. Hence H(X) ≥ H(Z). To complete the proof it remains to show H(X) ≤ H(Z), for which we will need the following two lemmas. The first, whose elementary proof we omit, is

Lemma 1: Let {A_n} and {B_n} be two sequences of events such that lim_{n→∞} Pr{A_n} = 1 and lim_{n→∞} Pr{B_n} = b. Then lim_{n→∞} Pr{A_n ∩ B_n} = b.

The second is

Lemma 2: Given any B ⊆ A such that |B| < ∞,

$$ H(Z) \geq \sum_{b \in B} \mu(b)\, H(\Phi_B[P_b]), $$

where we define the distribution (probability mass function)

$$ \Phi_B[P_b](x) \triangleq 1_B(x)\, P_b(x) + P_b(B^c)\, \delta_{x_o}(x), $$

for an arbitrary x_o ∉ B.² Here δ_{x_o} is used to denote the distribution which places unit mass on x_o.

Proof of lemma: Let A(x^n) ≜ {a ∈ A : ∃ i ∈ [1, n], x_i = a}. For B ⊆ A finite and b ∈ B, define D_n^B(b) = {x^n ∈ A^n : B ⊆ A(x^n), x_n = b}. Then

$$
\begin{aligned}
H(Z) &= \lim_{n\to\infty} H(Z_{n+1} \mid Z^n) \\
&\geq \lim_{n\to\infty} H(Z_{n+1} \mid X^n) \\
&\geq \lim_{n\to\infty} \int H(Z_{n+1} \mid X^n = x^n)\, dP_{X^n} \\
&\geq \lim_{n\to\infty} \sum_{b \in B} \int_{D_n^B(b)} H(Z_{n+1} \mid X^n = x^n)\, dP_{X^n} \\
&\geq \lim_{n\to\infty} \sum_{b \in B} \Pr\{D_n^B(b)\}\, H(\Phi_B[P_b]) \\
&= \sum_{b \in B} H(\Phi_B[P_b]) \lim_{n\to\infty} \Pr\{D_n^B(b)\} \\
&\overset{(a)}{=} \sum_{b \in B} \mu(b)\, H(\Phi_B[P_b]),
\end{aligned}
$$

where (a) follows by Lemma 1 since, by ergodicity, Pr{B ⊆ A(X^n)} → 1. □

² Throughout this work, 1_A will denote the indicator function on the set A, while ½_A will denote the indicator random variable on the event A.
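To illustrate the quantity appearing in Lemma 2, here is a small Python sketch that forms the lumped distribution Φ_B[P_b] and evaluates the finite-alphabet lower bound on H(Z). The transition rows and stationary weights are made up for illustration and are not taken from the paper.

```python
import math

def lump(P, B, x_o):
    """Phi_B[P]: keep P on B and move the remaining mass P(B^c) onto a single point x_o outside B."""
    out = {x: P.get(x, 0.0) for x in B}
    out[x_o] = sum(p for x, p in P.items() if x not in B)
    return out

def entropy(P):
    """Shannon entropy in bits of a probability mass function given as a dict."""
    return -sum(p * math.log2(p) for p in P.values() if p > 0)

# Hypothetical transition rows P_b of a chain on {0, 1, 2, 3, ...}, restricted to the finite set B.
P = {0: {0: 0.5, 1: 0.25, 2: 0.125, 3: 0.0625, 4: 0.0625},
     1: {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25},
     2: {0: 0.8, 1: 0.1, 2: 0.05, 3: 0.05}}
mu = {0: 0.5, 1: 0.3, 2: 0.2}             # assumed stationary weights of the states in B (illustrative)
B, x_o = {0, 1, 2}, "other"
lower_bound = sum(mu[b] * entropy(lump(P[b], B, x_o)) for b in B)
print(lower_bound)                        # finite-alphabet lower bound on H(Z), as in Lemma 2
```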

Now let {B_k} be a sequence of sets such that B_k ⊆ A, |B_k| < ∞ for all k, and

$$ \lim_{k\to\infty} \sum_{b \in B_k} \sum_{a \in B_k} -\mu(b) P_b(a) \log P_b(a) = \sum_{b \in A} \sum_{a \in A} -\mu(b) P_b(a) \log P_b(a), \qquad (1) $$

regardless of the finiteness of both sides of the equation. Note that since the above summands are all positive, such a sequence {B_k} can always be found. Lemma 2 gives us

$$ H(Z) \geq \sum_{b \in B_k} \mu(b)\, H(\Phi_{B_k}[P_b]) \quad \forall\, k. $$

Hence, by taking k → ∞, we get

$$
\begin{aligned}
H(Z) &\geq \lim_{k\to\infty} \sum_{b \in B_k} \mu(b)\, H(\Phi_{B_k}[P_b]) \\
&\geq \lim_{k\to\infty} \sum_{b \in B_k} \mu(b) \sum_{a \in B_k} -P_b(a) \log P_b(a) \\
&= \lim_{k\to\infty} \sum_{b \in B_k} \sum_{a \in B_k} -\mu(b) P_b(a) \log P_b(a) \\
&\overset{(a)}{=} \sum_{b \in A} \sum_{a \in A} -\mu(b) P_b(a) \log P_b(a) \\
&= H(X_{i+1} \mid X_i) \\
&\overset{(b)}{=} H(X), \qquad (2)
\end{aligned}
$$

where (a) comes from (1) and (b) from the fact that {X_i} is a stationary first-order Markov process. □

One should note that the proof of Theorem 1 can easily be extended to Markov processes of any order. Hence, without going through the proof, we state the following:

Theorem 2: Let {X_i} be a stationary ergodic Markov process (of any order) on the countable alphabet A, and let {Z_i} be the associated pattern process. Then H(X) = H(Z).

Having looked at the general discrete Markov process, a natural question is whether one can say anything about Markov processes with uncountable alphabets. Although we are unable to completely characterize the relationship between the entropy rate of the original process and that of its pattern process for a general Markov process, the following theorem treats a family of Markov processes with uncountable alphabets. Before we state the theorem, let us generalize some of the notation used in Theorem 1. Given an mth order Markov process {X_i} on R, for x^m ∈ R^m let f_{x^m} be the kernel associated with the state x^m. We denote the set of point masses of f_{x^m} by S_{x^m} = {y ∈ R : Pr{X_{m+1} = y | X^m = x^m} > 0}.

Theorem 3: Let {X_i} be a Markov process on R with order m such that there exists S ⊂ R with S_{x^m} = S for all x^m ∈ R^m and Φ_S[f_{x^m}] = Φ_S[f_{y^m}] for all x^m, y^m ∉ S^m. Let {Z_i} be the pattern process associated with {X_i}.


Define the process {X̃_i} by

X̃_i = X_i ½_{{X_i ∈ S}} + x_o ½_{{X_i ∉ S}},

for an arbitrary x_o ∈ S^c. If |S| < ∞ and {X̃_i} is stationary ergodic, then H(Z) = H(X̃).

Note that the condition that {X̃_i} be stationary ergodic is implied by {X_i} being stationary ergodic. It is therefore a weaker condition.

Proof: If |S| = 0, w.p. 1 the process {X_i} does not repeat and therefore H(Z) = 0. Similarly, if |S| = 0, the process {X̃_i} is a constant and therefore H(X̃) = 0. Hence H(Z) = H(X̃). We now look at the case where |S| > 0. We observe that since Φ_S[f_{x^m}] = Φ_S[f_{y^m}] for all x^m, y^m ∉ S^m, {X̃_i} is a finite alphabet Markov process of order m. Hence Theorem 2 and the fact that {X̃_i} is stationary ergodic give

$$ H(\tilde{X}) = H(\tilde{Z}), \qquad (3) $$

where {Z̃_i} is the pattern process associated with {X̃_i}.

For x ∈ S define the waiting time I_x = inf{i : X_i = x}. Given {I_x}_{x∈S} we know the first appearance of every point in S. Hence we know the first appearance of every point except those which are assigned zero probability by every kernel, i.e., all but those that appear at most once w.p. 1. Therefore, given Z̃^n and {I_x}_{x∈S} we can reconstruct Z^n w.p. 1 for all n. Similarly, given Z^n and {I_x}_{x∈S} we can reconstruct Z̃^n for all n. Hence

$$ H(Z^n) \leq H(\tilde{Z}^n) + \sum_{x \in S} H(I_x) \quad \forall\, n \qquad (4) $$

and

$$ H(Z^n) + \sum_{x \in S} H(I_x) \geq H(\tilde{Z}^n) \quad \forall\, n. \qquad (5) $$
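As a sanity check on the reconstruction argument behind (4), the following Python sketch (our own illustration, with a made-up mixed kernel) rebuilds the pattern Z^n from the reduced pattern Z̃^n together with the first-appearance times {I_x}_{x∈S}, relying on the fact that continuous draws almost surely never repeat.

```python
import random

S, x_o = {0.0, 1.0}, -1.0                    # point masses and the placeholder used to define X~

def pattern(seq):
    index, out = {}, []
    for s in seq:
        if s not in index:
            index[s] = len(index) + 1
        out.append(index[s])
    return out

def sample_path(n, rng):
    """Toy mixed kernel: hit 0 or 1 with probability 1/4 each, otherwise draw a uniform value."""
    path = []
    for _ in range(n):
        u = rng.random()
        if u < 0.25:
            path.append(0.0)
        elif u < 0.5:
            path.append(1.0)
        else:
            path.append(rng.random())        # continuous part: almost surely never repeats
    return path

def reconstruct(z_tilde, first_times):
    """Rebuild Z^n from Z~^n and the first-appearance times {I_x} (0-based indices here)."""
    first_seen = {}
    for i, s in enumerate(z_tilde):
        first_seen.setdefault(s, i)
    # The Z~-symbol whose first appearance is not one of the I_x is the lumped symbol x_o.
    lumped = {s for s, i in first_seen.items() if i not in set(first_times.values())}
    z, label_of, next_label = [], {}, 1
    for s in z_tilde:
        if s in lumped:                      # each lumped position hides a fresh, never-repeated value
            z.append(next_label); next_label += 1
        elif s not in label_of:              # first appearance of a point mass x in S
            label_of[s] = next_label; z.append(next_label); next_label += 1
        else:                                # a repeat of a point mass keeps its original label
            z.append(label_of[s])
    return z

rng = random.Random(0)
x = sample_path(200, rng)
x_tilde = [v if v in S else x_o for v in x]
I = {v: x.index(v) for v in S if v in x}     # I_x = first time x appears
assert reconstruct(pattern(x_tilde), I) == pattern(x)
```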

To complete the proof, we shall need to establish:

Claim 1: H(I_x) < ∞ for all x ∈ S.

Proof of Claim 1: Given x ∈ S, define d_min = min{Pr{X_{m+1} = x | X^m = y^m} : y^m ∈ R^m} and d_max = max{Pr{X_{m+1} = x | X^m = y^m} : y^m ∈ R^m}. Note that since |S| ∈ (0, ∞) and Φ_S[f_{x^m}] = Φ_S[f_{y^m}] for all x^m, y^m ∉ S^m, d_min and d_max are well defined. By the definition of S, d_min > 0.

First let us consider the case where d_max = 1. Then there exist an x ∈ S and a y^m ∈ R^m such that Pr{X_{m+1} = x | X^m = y^m} = 1 and S = S_{y^m} = {x}. We will first look at the case where y^m ∈ S^m. Since S = {x}, if y^m ∈ S^m, then y^m = x^m. Therefore f_{x^m} = δ_x and once the state x^m is reached it cannot be exited. Hence in order for {X̃_i} to be irreducible, which is required for the process to be ergodic, it must place zero or unit probability on being in state x^m. By the construction of S and the fact that x ∈ S, Pr{X^m = x^m} > 0 and therefore Pr{X̃^m = x^m} > 0. Hence X_i = x w.p. 1 and H(I_x) = 0. Let us now examine the case where y^m ∉ S^m. Since Φ_S[f_{x̂^m}] = Φ_S[f_{ŷ^m}] for all x̂^m, ŷ^m ∉ S^m, if x̂^m ∉ S^m, then Pr{X_{m+1} = x | X^m = x̂^m} = 1. Noting that S = {x}, we can conclude that if X_i ≠ x, then w.p. 1 X_{i+1} = x. Hence I_x ≤ 2 and H(I_x) ≤ log 2.

We now consider the case where d_max < 1. Since, regardless of the state y^m, Pr{X_{m+1} = x | X^m = y^m} ∈ [d_min, d_max], for all i ∈ N

$$ \Pr\{I_x = i\} \in \left[ d_{\min}(1 - d_{\max})^{i-1},\; d_{\max}(1 - d_{\min})^{i-1} \right]. $$

Hence

$$
\begin{aligned}
H(I_x) &= \sum_{i=1}^{\infty} \Pr\{I_x = i\} \log\frac{1}{\Pr\{I_x = i\}} \\
&\leq \sum_{i=1}^{\infty} d_{\max}(1 - d_{\min})^{i-1} \log\frac{1}{d_{\min}(1 - d_{\max})^{i-1}} \\
&= -d_{\max} \sum_{i=1}^{\infty} (1 - d_{\min})^{i-1} \log(d_{\min}) - d_{\max} \sum_{i=1}^{\infty} (i-1)(1 - d_{\min})^{i-1} \log(1 - d_{\max}) \\
&= -d_{\max} \log(d_{\min}) \sum_{i=0}^{\infty} (1 - d_{\min})^{i} - d_{\max} \log(1 - d_{\max}) \sum_{i=0}^{\infty} i\,(1 - d_{\min})^{i} \\
&= -\frac{d_{\max} \log(d_{\min})}{d_{\min}} - \frac{d_{\max}(1 - d_{\min}) \log(1 - d_{\max})}{d_{\min}^{2}}. \qquad (6)
\end{aligned}
$$

Since d_min, d_max ∈ (0, 1), (6) implies H(I_x) < ∞. □
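For a quick numerical check of (6), the sketch below (our own, with arbitrary illustrative values of d_min and d_max, entropies in bits) compares the closed-form bound with the exact entropy of a geometric waiting time, which is what I_x reduces to when the hitting probability does not depend on the state.

```python
import math

def H_geometric(p):
    """Exact entropy (bits) of a geometric waiting time Pr{I = i} = p(1-p)^(i-1), i >= 1."""
    hb = -p * math.log2(p) - (1 - p) * math.log2(1 - p)
    return hb / p

def waiting_time_bound(dmin, dmax):
    """Closed-form upper bound (6) on H(I_x), in bits."""
    return (-dmax * math.log2(dmin) / dmin
            - dmax * (1 - dmin) * math.log2(1 - dmax) / dmin ** 2)

dmin, dmax = 0.1, 0.3
for p in (0.1, 0.2, 0.3):                    # any constant hitting probability in [dmin, dmax]
    assert H_geometric(p) <= waiting_time_bound(dmin, dmax)
print(waiting_time_bound(dmin, dmax))        # finite, which is what (7) below relies on
```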

Claim 1 therefore gives

$$ \lim_{n\to\infty} \frac{H(I_x)}{n} = 0 \quad \forall\, x \in S. \qquad (7) $$

Combining (4), (5), and (7), and noting that |S| < ∞, gives H(Z) = H(Z̃). Equation (3) then completes the proof of Theorem 3. □

Example 2: Let {X_i} be a first order Markov process on [0, 1] with the following transition kernels, represented as generalized densities:

$$ f_0(y) = \frac{3}{4}\delta_0(y) + \frac{1}{4}\delta_1(y), $$

$$ f_1(y) = \frac{1}{4}\delta_0(y) + \frac{1}{2}\delta_1(y) + \frac{1}{4}, $$

and for x ∈ (0, 1)

$$ f_x(y) = \frac{1}{4}\delta_0(y) + \frac{1}{4}\delta_1(y) + \frac{3}{4}\,1_{\{(x-1/2)(y-1/2)>0\}}(y) + \frac{1}{4}\,1_{\{(x-1/2)(y-1/2)\leq 0\}}(y). $$

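As an aside, here is a short Python sketch (ours, not part of the paper) that builds the lumped three-state chain obtained by applying Φ_S to these kernels on {0, 1, x_o} and computes its entropy rate numerically; it anticipates the value derived next.

```python
import math

# Lumped chain X~ on {0, 1, x_o}: rows are Phi_S[f_0], Phi_S[f_1], Phi_S[f_x] from Example 2,
# where the continuous part of each kernel is collapsed onto the extra state x_o.
rows = {0: {0: 0.75, 1: 0.25, "x_o": 0.0},
        1: {0: 0.25, 1: 0.5, "x_o": 0.25},
        "x_o": {0: 0.25, 1: 0.25, "x_o": 0.5}}

# Stationary distribution by power iteration (the chain is ergodic, as Example 2 asserts).
mu = {s: 1 / 3 for s in rows}
for _ in range(500):
    mu = {t: sum(mu[s] * rows[s][t] for s in rows) for t in rows}

def H(p):
    return -sum(q * math.log2(q) for q in p.values() if q > 0)

rate = sum(mu[s] * H(rows[s]) for s in rows)
print(mu)                                    # approximately {0: 1/2, 1: 1/3, x_o: 1/6}
print(rate, 7 / 4 - (3 / 8) * math.log2(3))  # both approximately 1.1556 bits
```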

The process is readily seen to satisfy the conditions of Theorem 3. The stationary distribution of the above kernels is easily verified to be

$$ \mu(x) = \frac{1}{2}\delta_0(x) + \frac{1}{3}\delta_1(x) + \frac{1}{6}. $$

Applying Theorem 3 gives

$$ H(Z) = H(\tilde{X}) = \frac{7}{4} - \frac{3}{8}\log(3) \approx 1.1556. $$

III. ADDITIVE WHITE NOISE-CORRUPTED PROCESSES

We now consider the case of a noise-corrupted process. Let {X_i} be a stationary ergodic process and {Y_i} be its noise-corrupted version. Here we assume i.i.d. additive noise {N_i}, with X_i, Y_i, and N_i taking values in R. Let S_Y and S_N denote the sets of point masses of Y_i and N_i, respectively. We also define the process

Ñ_i = N_i ½_{{N_i ∈ S_N}} + (n_o − X_i) ½_{{N_i ∈ S_N^c}},

for an arbitrary point n_o ∉ S_Y.

Theorem 4: Let {X_i} be a finite alphabet stationary ergodic process and assume |S_N| < ∞. Let {Y_i} and {Ỹ_i} denote, respectively, the process {X_i} corrupted by the additive noise {N_i} and {Ñ_i}. Then the entropy rates of the pattern processes associated, respectively, with {Y_i} and {Ỹ_i}, namely {Z_i} and {Z̃_i}, are equal.

The main idea behind the proof of Theorem 4 is that, by ergodicity and the fact that {X_i} is a discrete process, as n grows large the only symbols to be seen once should be those where N_i ∈ S_N^c. Hence, by grouping the symbols in Z^n which appear only once into a super symbol, we asymptotically generate {Z̃_i}.

Proof of Theorem 4: If Pr{N_i ∈ S_N} = 1 then N_i = Ñ_i and there is nothing to prove, so we will assume that Pr{N_i ∈ S_N} < 1. Define

Ẑ(n)_i = Z_i ½_{{∃ j ∈ [1,n]\i : Z_i = Z_j}} + y_o ½_{{∃ j ∈ [1,n]\i : Z_i = Z_j}^c}

for some arbitrary y_o ∈ S_Y. Clearly Ẑ(n) uniquely determines Z^n and vice versa so, in particular,

$$ H(Z^n) = H(\hat{Z}(n)) \quad \forall\, n > 0. \qquad (8) $$

Define

$$ I_{n_o} = \inf\{i > 0 : N_i \in S_N^c\}. $$

We also observe the following: if Z̃_i ≠ Z̃_{I_{n_o}} then Y_i = Ỹ_i, and if Z̃_i = Z̃_{I_{n_o}} then w.p. 1 Y_i ≠ Y_j for all j ≠ i. Hence we can construct Ẑ(n) from Z̃^n and I_{n_o} w.p. 1. Therefore

$$ H(\tilde{Z}^n, I_{n_o}) \geq H(\hat{Z}(n)) \quad \forall\, n > 0 $$

and consequently,

$$ H(\tilde{Z}^n) + H(I_{n_o}) \geq H(\hat{Z}(n)) \quad \forall\, n > 0. \qquad (9) $$

Since I_{n_o} is the waiting time for the first appearance of an element from S_N^c in the i.i.d. process {N_i}, it is geometrically distributed, and in particular has finite entropy. Therefore

$$ \lim_{n\to\infty} \frac{H(I_{n_o})}{n} = 0, $$

which combined with (8) and (9) gives

$$ H(\tilde{Z}) \geq H(Z). \qquad (10) $$

Defining C(n)_i = ½_{{Ẑ(n)_i = y_o} ∩ {Y_i ∈ S_Y}}, we make the following observations: w.p. 1, Y_i = Ỹ_i iff C(n)_i = 1 or Ẑ(n)_i ≠ y_o, and Ỹ_i = n_o iff C(n)_i = 0 and Ẑ(n)_i = y_o. From these observations we conclude that given C(n) and Ẑ(n) we can reconstruct Z̃^n w.p. 1 for all n > 0. Hence, for all n > 0,

$$
\begin{aligned}
H(\tilde{Z}^n) &\leq H(\hat{Z}(n), C(n)) \\
&\leq H(\hat{Z}(n)) + H(C(n)) \\
&\overset{(a)}{=} H(Z^n) + H(C(n)) \\
&\leq H(Z^n) + \sum_{i=1}^{n} H(C(n)_i), \qquad (11)
\end{aligned}
$$

where (a) comes from (8). Let

$$
\begin{aligned}
Pe_i^{(n)} &= \Pr\{Y_i \in S_Y,\; Y_j \neq Y_i\ \forall\, j \in [1,n]\setminus i\}, \\
P_i^{(n)} &= \Pr\{Y_j \neq Y_i\ \forall\, j \in [1,n]\setminus i \mid Y_i \in S_Y\}, \\
P_i^{m,n}(y) &= \Pr\{Y_j \neq y\ \forall\, j \in [m,n]\setminus i \mid Y_i = y\}.
\end{aligned}
$$

Then

$$
\begin{aligned}
Pe_i^{(n)} &= \Pr\{Y_i \in S_Y\}\, P_i^{(n)} \\
&= \Pr\{Y_i \in S_Y\} \sum_{y \in S_Y} P_i^{1,n}(y)\, \Pr\{Y_i = y \mid Y_i \in S_Y\} \\
&\leq \Pr\{Y_i \in S_Y\} \sum_{y \in S_Y} P_i^{1,n}(y).
\end{aligned}
$$

Without loss of generality assume that i < n/2. Then

$$
\begin{aligned}
Pe_i^{(n)} &\leq \Pr\{Y_i \in S_Y\} \sum_{y \in S_Y} P_i^{i+1,\, i+n/2-1}(y) \\
&\overset{(a)}{\leq} \Pr\{Y_1 \in S_Y\} \sum_{y \in S_Y} P_1^{2,\, n/2}(y), \qquad (12)
\end{aligned}
$$

where (a) follows from the stationarity of {Y_i}. Let

$$ Pe^{(n)} = \Pr\{Y_1 \in S_Y\} \sum_{y \in S_Y} P_1^{2,\, n/2}(y). $$

Therefore we have

$$ Pe^{(n)} \geq Pe_i^{(n)} \quad \forall\, i. \qquad (13) $$

By ergodicity, for all y ∈ S_Y we have

$$ \lim_{n\to\infty} P_1^{2,\, n/2}(y) = \lim_{n\to\infty} \Pr\{Y_j \neq y\ \forall\, j \in [2, n/2] \mid Y_1 = y\} = 0, $$


and since |S_Y| < ∞, (12) gives lim_{n→∞} Pe^{(n)} = 0. Hence there exists an N such that Pe^{(n)} < 1/2 for all n > N, and (13) implies that

$$ H_B(Pe_i^{(n)}) \leq H_B(Pe^{(n)}) \quad \forall\, n \geq N, \qquad (14) $$

where H_B is the binary entropy function. Noting that E[C(n)_i] = Pe_i^{(n)}, so that H(C(n)_i) = H_B(Pe_i^{(n)}), (11) gives

$$ \frac{H(\tilde{Z}^n)}{n} \leq \frac{H(Z^n)}{n} + \frac{1}{n} \sum_{i=1}^{n} H_B(Pe_i^{(n)}) \quad \forall\, n > 0, $$

which combined with (14) gives

$$ \frac{H(\tilde{Z}^n)}{n} \leq \frac{H(Z^n)}{n} + H_B(Pe^{(n)}) \quad \forall\, n \geq N. \qquad (15) $$

Since lim_{n→∞} Pe^{(n)} = 0, (15) gives

$$ H(\tilde{Z}) \leq H(Z). $$

Combining (10) and (15) completes the proof. □

The following is directly implied by Theorem 2 of [3] and Theorem 4.

Corollary 5: Let {X_i} be a finite alphabet stationary ergodic process, and let Y_i = X_i + N_i, where {N_i} is an i.i.d. sequence and S_N and {Ỹ_i} are defined as in Theorem 4. If |S_N| < ∞, then H(Z) = H(Ỹ).

Note that in the case where {N_i} is a discrete i.i.d. process, Corollary 5 agrees with Theorem 2 of [3], and in the case where N_i has no point masses, then, as our intuition would suggest, Corollary 5 gives H(Z) = 0. Having verified that Corollary 5 is in agreement with previous results, let us examine a case where previous theorems do not apply.

Example 3: Let {X_i} be a first order Markov process on the set {1, 2} and let {N_i} be i.i.d., independent of {X_i}, distributed according to the density

$$ f_N(n) = \frac{1}{\sqrt{8\pi}} e^{-n^2/2} + \frac{1}{2}\delta_0(n), $$

where δ_0 denotes a unit mass on 0. Further let Y_i = X_i + N_i and let {Z_i} be its associated pattern process. Since {Y_i} is a hidden Markov process with memory on a continuous alphabet, previous results fail to capture the relationship between H(Y) and H(Z). However, Corollary 5 gives

$$ H(Z) = H(\tilde{Y}), \qquad (16) $$

where Ỹ_i is the ternary hidden Markov process given by X_i with probability 1/2 and an arbitrary n_o ∉ {1, 2} with probability 1/2. We can also use Corollary 5 to relate H(Z) to the entropy rate H(X) of the process {X_i}. Noting that {Ỹ_i} is simply {X_i} with erasures, we let I_i denote the event of erasure at time i. Then

$$
\begin{aligned}
H(\tilde{Y}_n \mid \tilde{Y}^{n-1}) &\overset{(a)}{=} H(\tilde{Y}_n, I_n \mid \tilde{Y}^{n-1}) \\
&= H(I_n) + H(\tilde{Y}_n \mid \tilde{Y}^{n-1}, I_n) \\
&\overset{(b)}{=} H(I_n) + \Pr\{I_n = 0\}\, H(X_n \mid \tilde{Y}^{n-1}, I_n = 0) \\
&\overset{(c)}{\geq} H(I_n) + \Pr\{I_n = 0\}\, H(X_n \mid X^{n-1}, I_n = 0) \\
&\overset{(d)}{=} 1 + \frac{1}{2} H(X_n \mid X^{n-1}), \qquad (17)
\end{aligned}
$$

where (a) follows from the fact that given Ỹ_i we know I_i, (b) from the fact that given I_i = 1, Ỹ_i is a constant and given I_i = 0, Ỹ_i = X_i, (c) from a combination of the fact that X_n is independent of I^n and the fact that conditioning decreases entropy, and finally, (d) follows from the fact that {I_i} is an i.i.d. Bernoulli(1/2) process, independent of the process {X_i}. Combining (16) and (17) we get H(Z) ≥ (1/2)H(X) + 1. Note that this holds with equality when {X_i} is i.i.d., as is readily seen to be implied by Theorem 1 of [3].

IV. CONCLUSION

Although previous results characterize the relationship between the entropy rate of a source and that of its pattern process for discrete stationary ergodic processes, we explored the special case of discrete Markov processes. We observed that the inherent structure of a Markov process allows this relationship to be characterized without the countably infinite-alphabet Shannon-McMillan-Breiman-type statements used in the proof of the more general case. This observation allowed us to extend the relationship to a broad family of Markov processes with uncountable alphabets. We also characterized this relationship for certain noise-corrupted sources.

REFERENCES

[1] K. L. Chung, "A note on the ergodic theorem of information theory," Ann. Math. Stat., vol. 32, pp. 612-614, 1961.
[2] G. Gemelos and T. Weissman, "On the entropy rate of pattern processes," Hewlett-Packard Tech. Report HPL-2004-159, September 2004.
[3] G. Gemelos and T. Weissman, "On the entropy rate of pattern processes," Proceedings of the 2005 Data Compression Conference, Snowbird, Utah, March 2005, pp. 233-242.
[4] A. Orlitsky, N. P. Santhanam, K. Viswanathan, and J. Zhang, "Limit results on pattern entropy," Proceedings of the IEEE Information Theory Workshop, San Antonio, Texas, October 2004.
[5] A. Orlitsky, N. P. Santhanam, and J. Zhang, "Universal compression of memoryless sources over unknown alphabets," IEEE Transactions on Information Theory, vol. IT-50, no. 7, pp. 1469-1481, July 2004.
[6] A. Orlitsky, N. P. Santhanam, K. Viswanathan, and J. Zhang, "Limit results on pattern entropy," submitted to IEEE Transactions on Information Theory.
[7] G. I. Shamir, "Sequence-patterns entropy and infinite alphabets," Proceedings of the 2004 Allerton Conference on Communication, Control and Computing.
[8] G. I. Shamir, "Sequential universal lossless techniques for compression of patterns and their description length," Proceedings of the 2004 Data Compression Conference, pp. 419-428.
[9] G. I. Shamir and L. Song, "On the entropy of patterns of i.i.d. sequences," Proceedings of the 2003 Allerton Conference on Communication, Control, and Computing.

