Generalized Source Coding Theorems and Hypothesis Testing: Part I - Information Measures

Po-Ning Chen
Dept. of Communications Engineering
National Chiao Tung University
1001, Ta-Hsueh Road, Hsin Chu
Taiwan 30050, R.O.C.

Fady Alajaji
Dept. of Mathematics and Statistics
Queen's University
Kingston, Ontario K7L 3N6
Canada

Key Words: entropy, mutual information, divergence, ε-capacity

Abstract

Expressions for the ε-entropy rate, ε-mutual information rate and ε-divergence rate are introduced. These quantities, which consist of the quantiles of the asymptotic information spectra, generalize the inf/sup-entropy/information/divergence rates of Han and Verdú. The algebraic properties of these information measures are rigorously analyzed, and examples illustrating their use in the computation of the ε-capacity are presented. In Part II of this work, these measures are employed to prove general source coding theorems for block codes and the general formula of the Neyman-Pearson hypothesis testing type-II error exponent subject to upper bounds on the type-I error probability.

I. Introduction and Motivation

Entropy, divergence and mutual information are without a doubt the most important information theoretic quantities. They constitute the fundamental measures upon which information theory is founded. Given a discrete random variable $X$ with distribution $P_X$, its entropy is defined by [3]
$$H(X) \triangleq -\sum_x P_X(x)\,\log_2 P_X(x) = E_{P_X}\!\left[-\log_2 P_X(X)\right].$$

$H(X)$ is a measure of the average amount of uncertainty in $X$. The divergence, on the other hand, measures the relative distance between the distributions of two random variables $X$ and $\hat{X}$ that are defined on the same alphabet $\mathcal{X}$:
$$D(X\|\hat{X}) \triangleq E_{P_X}\!\left[\log_2 \frac{P_X(X)}{P_{\hat{X}}(X)}\right].$$

As for the mutual information $I(X;Y)$ between random variables $X$ and $Y$, it represents the average amount of information that $Y$ contains about $X$. It is defined as the divergence between the joint distribution $P_{XY}$ and the product distribution $P_X P_Y$:
$$I(X;Y) \triangleq D(P_{XY}\|P_X P_Y) = E_{P_{XY}}\!\left[\log_2 \frac{P_{XY}(X,Y)}{P_X(X)\,P_Y(Y)}\right].$$
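As a quick numerical companion to these three definitions (an illustration added here, not part of the original paper; the helper names and the particular pmfs are arbitrary choices), the following Python sketch evaluates $H(X)$, $D(X\|\hat{X})$ and $I(X;Y)$ for explicitly given finite distributions.

```python
import numpy as np

def entropy(p):
    """H(X) = -sum_x p(x) log2 p(x), with the convention 0 log 0 = 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

def divergence(p, q):
    """D(X||X^) = sum_x p(x) log2(p(x)/q(x)); assumes q(x) > 0 wherever p(x) > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return np.sum(p[nz] * np.log2(p[nz] / q[nz]))

def mutual_information(p_xy):
    """I(X;Y) = D(P_XY || P_X x P_Y) for a joint pmf given as a 2-D array."""
    p_xy = np.asarray(p_xy, dtype=float)
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    return divergence(p_xy.ravel(), (p_x * p_y).ravel())

# Example distributions (arbitrary):
p = [0.5, 0.25, 0.25]          # P_X
q = [1/3, 1/3, 1/3]            # P_X^ on the same alphabet
joint = [[0.4, 0.1],           # P_XY on {0,1} x {0,1}
         [0.1, 0.4]]
print(entropy(p), divergence(p, q), mutual_information(joint))
```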

More generally, consider an input process $X$ defined by a sequence of finite-dimensional distributions [11]: $X \triangleq \{X^n = (X_1^{(n)}, \ldots, X_n^{(n)})\}_{n=1}^{\infty}$. Let $Y \triangleq \{Y^n = (Y_1^{(n)}, \ldots, Y_n^{(n)})\}_{n=1}^{\infty}$ be the corresponding output process induced by $X$ via the channel $W \triangleq \{W^n = P_{Y^n|X^n} : \mathcal{X}^n \to \mathcal{Y}^n\}_{n=1}^{\infty}$, which is an arbitrary sequence of $n$-dimensional conditional distributions from $\mathcal{X}^n$ to $\mathcal{Y}^n$, where $\mathcal{X}$ and $\mathcal{Y}$ are the input and output alphabets, respectively. The entropy rate for the source $X$ is defined by [2], [3]
$$H(X) \triangleq \lim_{n\to\infty} \frac{1}{n} E\left[-\log P_{X^n}(X^n)\right],$$
assuming the limit exists. Similarly, the divergence and mutual information rates are given by
$$D(X\|\hat{X}) \triangleq \lim_{n\to\infty} \frac{1}{n} E\left[\log \frac{P_{X^n}(X^n)}{P_{\hat{X}^n}(X^n)}\right]$$
and
$$I(X;Y) \triangleq \lim_{n\to\infty} \frac{1}{n} E\left[\log \frac{P_{X^nY^n}(X^n,Y^n)}{P_{X^n}(X^n)\,P_{Y^n}(Y^n)}\right],$$
respectively.
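The following sketch (again an added illustration, using an arbitrary two-state Markov source of our own choosing) evaluates the defining limit of the entropy rate exactly via the chain rule $H(X^n) = H(X_1) + \sum_{k=2}^{n} H(X_k|X_{k-1})$, and compares the normalized block entropies with the limiting value.

```python
import numpy as np

def h2(p):
    """Entropy (in bits) of a probability vector p."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

# Two-state Markov source (arbitrary parameters).
T = np.array([[0.9, 0.1],
              [0.3, 0.7]])      # T[i, j] = P(X_{k+1} = j | X_k = i)
mu = np.array([1.0, 0.0])       # distribution of X_1

def entropy_rate_upto(n):
    """(1/n) E[-log P_{X^n}(X^n)] for the Markov source, computed exactly
    via the chain rule: H(X^n) = H(X_1) + sum_k H(X_k | X_{k-1})."""
    total, dist = h2(mu), mu.copy()
    for _ in range(n - 1):
        total += np.sum(dist * [h2(row) for row in T])  # adds H(X_{k+1} | X_k)
        dist = dist @ T                                  # marginal of X_{k+1}
    return total / n

# The normalized block entropies approach the stationary conditional entropy.
pi = np.array([0.75, 0.25])     # stationary distribution of T
print([round(entropy_rate_upto(n), 4) for n in (1, 10, 100, 1000)])
print(round(np.sum(pi * [h2(row) for row in T]), 4))    # limiting value H(X)
```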

The above quantities have an operational significance established via Shannon's coding theorems when the stochastic systems under consideration satisfy certain regularity conditions (such as stationarity and ergodicity, or information stability) [9], [11]. However, in more complicated situations, such as when the systems are non-stationary (with time-varying statistics), these information rates are no longer valid and lose their operational significance. This results in the need to establish new information measures which appropriately characterize the operational limits of arbitrary stochastic systems. This is achieved in [10] and [11], where Han and Verdú introduce the notions of inf/sup-entropy/information rates and illustrate the key role these information measures play in proving a general lossless (block) source coding theorem and a general channel coding theorem. More specifically, they demonstrate that for an arbitrary finite-alphabet source $X$, the minimum achievable (block) source coding rate is given by the sup-entropy rate $\bar{H}(X)$, defined as the limsup in probability of $(1/n)\log[1/P_{X^n}(X^n)]$ [10]. They also establish in [11] the formulas of the ε-capacity $C_\varepsilon$ and the capacity $C$ of arbitrary single-user channels without feedback (not necessarily information stable, stationary, ergodic, etc.). More specifically, they show that

$$\sup_X \sup\{R : F_X(R) < \varepsilon\} \le C_\varepsilon \le \sup_X \sup\{R : F_X(R) \le \varepsilon\}$$
and
$$C = \sup_X \underline{I}(X;Y),$$
where
$$F_X(R) \triangleq \limsup_{n\to\infty} \Pr\left[\frac{1}{n}\, i_{X^nY^n}(X^n;Y^n) \le R\right],$$
$(1/n)\, i_{X^nY^n}(X^n;Y^n)$ is the sequence of normalized information densities defined by
$$\frac{1}{n}\, i_{X^nY^n}(x^n;y^n) \triangleq \frac{1}{n}\log \frac{P_{Y^n|X^n}(y^n|x^n)}{P_{Y^n}(y^n)},$$
and $\underline{I}(X;Y)$ is the inf-information rate between $X$ and $Y$, which is defined as the liminf in probability of $(1/n)\, i_{X^nY^n}(X^n;Y^n)$.

Definition ([8],[11]): Given $0 < \varepsilon < 1$, an $(n, M, \varepsilon)$ code for the channel $W$ has blocklength $n$, $M$ codewords, and average (decoding) error probability not larger than $\varepsilon$. A non-negative number $R$ is an $\varepsilon$-achievable rate if, for every $\gamma > 0$, there exist, for all $n$ sufficiently large, $(n, M, \varepsilon)$ codes with rate $\frac{1}{n}\log M > R - \gamma$. The supremum of all $\varepsilon$-achievable rates is called the $\varepsilon$-capacity, $C_\varepsilon$. The capacity $C$ is the supremum of the rates that are $\varepsilon$-achievable for all $0 < \varepsilon < 1$, and hence $C = \lim_{\varepsilon \downarrow 0} C_\varepsilon$. In other words, $C_\varepsilon$ is the largest rate at which information can be conveyed over the channel such that the probability of decoding error is below a fixed threshold $\varepsilon$, for sufficiently large blocklengths. Furthermore, $C$ represents the largest rate at which information can be transmitted over the channel with asymptotically vanishing error probability.

By adopting the same technique as in [10] (also in [11]), general expressions for the capacity of single-user channels with feedback and for Neyman-Pearson type-II error exponents are derived in [5] and [4], respectively. Furthermore, an application of the type-II error exponent formula to the non-feedback and feedback channel reliability functions is demonstrated in [4] and [6].

The above inf/sup-entropy/information rates are expressed in terms of the liminf/limsup in probability of the normalized entropy/information densities. The liminf in probability of a sequence of random variables is defined as follows [10]: if $A_n$ is a sequence of random variables, then its liminf in probability is the largest extended real number $\underline{U}$ such that for all $\gamma > 0$,
$$\lim_{n\to\infty} \Pr[A_n \le \underline{U} - \gamma] = 0. \qquad (1.1)$$

Similarly, its limsup in probability is the smallest extended real number $\bar{U}$ such that for all $\gamma > 0$,
$$\lim_{n\to\infty} \Pr[A_n \ge \bar{U} + \gamma] = 0. \qquad (1.2)$$
Note that these two quantities are always defined; if they are equal, then the sequence of random variables converges in probability to a constant. It is straightforward to deduce that equations (1.1) and (1.2) are respectively equivalent to
$$\liminf_{n\to\infty} \Pr[A_n \le \underline{U} - \gamma] = \limsup_{n\to\infty} \Pr[A_n \le \underline{U} - \gamma] = 0 \qquad (1.3)$$
and
$$\liminf_{n\to\infty} \Pr[A_n \ge \bar{U} + \gamma] = \limsup_{n\to\infty} \Pr[A_n \ge \bar{U} + \gamma] = 0. \qquad (1.4)$$

We can observe, however, that there might exist cases of interest where only the liminfs of the probabilities in (1.3) and (1.4) are equal to zero, while the limsups do not vanish. There are also other cases where both the liminfs and limsups in (1.3)-(1.4) do not vanish, but are upper bounded by a prescribed threshold. Furthermore, there are situations where the interval $[\underline{U}, \bar{U}]$ contains more than one point, e.g., when $A_n$ converges in distribution to another random variable. Hence, the points within the interval $[\underline{U}, \bar{U}]$ might possess a Shannon-theoretic operational meaning when, for example, $A_n$ is the normalized entropy density of a given source. The above remarks constitute the motivation for this work, in which we generalize Han and Verdú's information rates and prove general data compression and hypothesis testing theorems that are the counterparts of their ε-capacity channel coding theorem [11]. In Part I, we propose generalized versions of the inf/sup-entropy/information/divergence rates. We analyze in detail the algebraic properties of these information measures, and we illustrate their use in the computation of the ε-capacity of arbitrary additive-noise channels. In Part II of this paper [7], we utilize these quantities to establish general source coding theorems for arbitrary finite-alphabet sources, and the general expression of the Neyman-Pearson type-II error exponent.
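As a concrete instance of the situation described above (our own illustration, not an example from the paper), consider a stationary non-ergodic binary source: a fair coin selects, once and for all, between i.i.d. Bernoulli(0.1) and i.i.d. Bernoulli(0.4) operation. The normalized entropy density $(1/n)\log[1/P_{X^n}(X^n)]$ then converges in distribution to a two-valued random variable, so the interval $[\underline{U}, \bar{U}]$ is non-degenerate and the intermediate points acquire operational meaning. The Monte Carlo sketch below estimates the asymptotic CDF of the normalized entropy density for this source; all parameters are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
p_modes = (0.1, 0.4)     # the two Bernoulli modes (arbitrary choice)
n, trials = 2000, 5000

def normalized_entropy_density(x):
    """-(1/n) log2 P_{X^n}(x^n) for the equally weighted two-mode mixture."""
    k, n = x.sum(), x.size
    logs = [k * np.log2(p) + (n - k) * np.log2(1 - p) for p in p_modes]
    m = max(logs)  # log-sum-exp (base 2) for numerical stability
    return -(m + np.log2(0.5 * sum(2.0 ** (l - m) for l in logs))) / n

samples = []
for _ in range(trials):
    p = p_modes[rng.integers(2)]              # pick a mode once per sequence
    samples.append(normalized_entropy_density(rng.random(n) < p))
samples = np.array(samples)

def hb(p):  # binary entropy in bits
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

# Empirical Pr{A_n <= theta}: ~0 below hb(0.1), ~1/2 between the modes, ~1 above hb(0.4).
for theta in (0.3, (hb(0.1) + hb(0.4)) / 2, 1.1):
    print(round(theta, 3), round(np.mean(samples <= theta), 3))
```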

II. Generalized Information Measures

Definition 2.1 (Inf/sup-spectrum) If $\{A_n\}_{n=1}^{\infty}$ is a sequence of random variables, then its inf-spectrum $u(\cdot)$ and its sup-spectrum $\bar{u}(\cdot)$ are defined by
$$u(\theta) \triangleq \liminf_{n\to\infty} \Pr\{A_n \le \theta\}$$
and
$$\bar{u}(\theta) \triangleq \limsup_{n\to\infty} \Pr\{A_n \le \theta\}.$$
In other words, $u(\theta)$ and $\bar{u}(\theta)$ are respectively the liminf and the limsup of the cumulative distribution function (CDF) of $A_n$. Note that by definition the CDF of $A_n$, namely $\Pr\{A_n \le \theta\}$, is non-decreasing and right-continuous. However, for $u(\theta)$ and $\bar{u}(\theta)$, only the non-decreasing property remains.

Definition 2.2 (Quantile of inf/sup-spectrum) For any $0 \le \delta \le 1$, the quantiles $\underline{U}_\delta$ and $\bar{U}_\delta$ of the sup-spectrum and the inf-spectrum are defined by
$$\underline{U}_\delta \triangleq \begin{cases} -\infty, & \text{if } \{\theta : \bar{u}(\theta) \le \delta\} = \emptyset, \\ \sup\{\theta : \bar{u}(\theta) \le \delta\}, & \text{otherwise,} \end{cases}$$
and
$$\bar{U}_\delta \triangleq \begin{cases} -\infty, & \text{if } \{\theta : u(\theta) \le \delta\} = \emptyset, \\ \sup\{\theta : u(\theta) \le \delta\}, & \text{otherwise,} \end{cases}$$
respectively. It follows from the above definitions that $\underline{U}_\delta$ and $\bar{U}_\delta$ are right-continuous and non-decreasing in $\delta$.

Note that the liminf in probability $\underline{U}$ and the limsup in probability $\bar{U}$ of $A_n$ satisfy
$$\underline{U} = \underline{U}_0 \quad \text{and} \quad \bar{U} = \bar{U}_{1^-},$$
respectively, where the superscript "$-$" denotes a strict inequality in the definition of $\bar{U}_1$; i.e.,
$$\bar{U}_{1^-} \triangleq \sup\{\theta : u(\theta) < 1\}.$$
Note also that
$$-\infty \le \underline{U} \le \underline{U}_\delta \le \bar{U}_\delta \le \bar{U}.$$

Remark that $\underline{U}_\delta$ and $\bar{U}_\delta$ always exist. Furthermore, if $\underline{U}_\delta = \bar{U}_\delta$ for all $\delta \in [0,1]$, then the sequence of random variables $A_n$ converges in distribution to a random variable $A$, provided the sequence of distributions of $A_n$ is tight. It is pertinent to also point out that, even if we do not require right-continuity as a fundamental property of a CDF, the spectrums $u(\cdot)$ and $\bar{u}(\cdot)$ are not necessarily legitimate CDFs of (conventional real-valued) random variables, since there might exist cases where the "probability mass escapes to infinity" (cf. [1, page 346]). A necessary and sufficient condition for $u(\cdot)$ and $\bar{u}(\cdot)$ to be conventional CDFs (without requiring right-continuity) is that the sequence of distribution functions of $A_n$ be tight [1, page 346]; tightness is guaranteed if the alphabet of $A_n$ is finite. Note also that the usual definition of the quantile function of a non-decreasing function $F(\cdot)$ is slightly different from ours [1, page 190]: it is given by $\sup\{\theta : F(\theta) < \delta\}$. If $F(\cdot)$ is strictly increasing, this quantile function is simply the inverse $F^{-1}(\cdot)$.


For a better understanding of the quantities defined above, we depict them in Figure 1.

[Figure 1: The asymptotic CDFs of a sequence of random variables $\{A_n\}_{n=1}^{\infty}$: $\bar{u}(\theta)$ = sup-spectrum of $A_n$; $u(\theta)$ = inf-spectrum of $A_n$. The quantiles $\underline{U}$, $\underline{U}_\delta$, $\bar{U}_\delta$, and $\bar{U}$ are marked on the $\theta$-axis.]
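To make Definition 2.2 computable, the sketch below (added here for illustration; the toy spectra and helper names are our own) approximates the inf/sup-spectra on a grid from the CDFs of $A_n$ at several blocklengths, and then evaluates the quantiles $\underline{U}_\delta = \sup\{\theta : \bar{u}(\theta) \le \delta\}$ and $\bar{U}_\delta = \sup\{\theta : u(\theta) \le \delta\}$. With finitely many blocklengths the liminf/limsup over $n$ can only be approximated by a tail minimum/maximum.

```python
import numpy as np

def spectra(cdf_rows):
    """Approximate inf- and sup-spectrum from a (num_n, num_theta) array whose
    rows are the CDFs Pr{A_n <= theta} on a common theta grid, for large n.
    liminf/limsup over n are approximated by min/max over the given rows."""
    cdf_rows = np.asarray(cdf_rows, dtype=float)
    return cdf_rows.min(axis=0), cdf_rows.max(axis=0)   # u(theta), u-bar(theta)

def quantile(spectrum, thetas, delta):
    """sup{theta : spectrum(theta) <= delta} on the grid (-inf if the set is empty)."""
    idx = np.nonzero(spectrum <= delta)[0]
    return -np.inf if idx.size == 0 else thetas[idx[-1]]

# Toy spectra: A_n concentrates near 0 for even n and near 1 for odd n.
thetas = np.linspace(-0.5, 1.5, 401)
cdf_even = (thetas >= 0).astype(float)       # CDF of a point mass at 0
cdf_odd = (thetas >= 1).astype(float)        # CDF of a point mass at 1
u, u_bar = spectra([cdf_even, cdf_odd])

for delta in (0.0, 0.5, 0.99):
    lower = quantile(u_bar, thetas, delta)   # underline-U_delta (quantile of sup-spectrum)
    upper = quantile(u, thetas, delta)       # bar-U_delta (quantile of inf-spectrum)
    print(delta, lower, upper)
```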

In the above definitions, if we let the random variable $A_n$ equal the normalized entropy density of an arbitrary source $X$, we obtain two generalized entropy measures for $X$: the δ-inf-entropy rate $\underline{H}_\delta(X)$ and the δ-sup-entropy rate $\bar{H}_\delta(X)$, as described in Table 1. Note that the inf-entropy rate $\underline{H}(X)$ and the sup-entropy rate $\bar{H}(X)$ introduced in [10] are special cases of the δ-inf/sup-entropy rate measures:
$$\bar{H}(X) = \bar{H}_{1^-}(X) \quad \text{and} \quad \underline{H}(X) = \underline{H}_0(X).$$

Analogously, for an arbitrary channel $W \triangleq P_{Y|X}$ with input $X$ and output $Y$ (or, respectively, for two observations $X$ and $\hat{X}$), if we replace $A_n$ by the normalized information density (resp. by the normalized log-likelihood ratio), we get the δ-inf/sup-information rates (resp. the δ-inf/sup-divergence rates) shown in Table 1. The algebraic properties of these newly defined information measures are investigated in the next section.

III. Properties of the Generalized Information Measures

Lemma 3.1 Consider two arbitrary random sequences $\{A_n\}_{n=1}^{\infty}$ and $\{B_n\}_{n=1}^{\infty}$. Let $\bar{u}(\cdot)$ and $u(\cdot)$ denote respectively the sup-spectrum and the inf-spectrum of $\{A_n\}_{n=1}^{\infty}$. Similarly, let $\bar{v}(\cdot)$ and $v(\cdot)$ denote respectively the sup-spectrum and the inf-spectrum of $\{B_n\}_{n=1}^{\infty}$. Define
$$\underline{U}_\alpha \triangleq \sup\{\theta : \bar{u}(\theta) \le \alpha\}, \qquad \bar{U}_\alpha \triangleq \sup\{\theta : u(\theta) \le \alpha\},$$
$$\underline{V}_\beta \triangleq \sup\{\theta : \bar{v}(\theta) \le \beta\}, \qquad \bar{V}_\beta \triangleq \sup\{\theta : v(\theta) \le \beta\},$$
$$(\underline{U+V})_{\alpha+\beta} \triangleq \sup\{\theta : (\overline{u+v})(\theta) \le \alpha+\beta\},$$
$$(\overline{U+V})_{\alpha+\beta} \triangleq \sup\{\theta : (u+v)(\theta) \le \alpha+\beta\},$$
where
$$(\overline{u+v})(\theta) \triangleq \limsup_{n\to\infty} \Pr\{A_n + B_n \le \theta\}$$
and
$$(u+v)(\theta) \triangleq \liminf_{n\to\infty} \Pr\{A_n + B_n \le \theta\}.$$
Then the following statements hold.

1. $\underline{U}_\alpha$ and $\bar{U}_\alpha$ are both non-decreasing functions of $\alpha \in [0,1]$.

2. For $\alpha \ge 0$, $\beta \ge 0$, and $1 \ge \alpha + \beta$,
$$(\underline{U+V})_{\alpha+\beta} \ge \underline{U}_\alpha + \underline{V}_\beta \qquad (3.5)$$
and
$$(\overline{U+V})_{\alpha+\beta} \ge \underline{U}_\alpha + \bar{V}_\beta. \qquad (3.6)$$

3. For $\alpha \ge 0$, $\beta \ge 0$, and $1 > \alpha + \beta$,
$$(\underline{U+V})_\alpha \le \underline{U}_{\alpha+\beta} + \bar{V}_{(1-\beta)^-} \qquad (3.7)$$
and
$$(\overline{U+V})_\alpha \le \bar{U}_{\alpha+\beta} + \bar{V}_{(1-\beta)^-}. \qquad (3.8)$$

Proof: The proof of property 1 follows directly from the definitions of $\underline{U}_\alpha$ and $\bar{U}_\alpha$ and the fact that the inf-spectrum and the sup-spectrum are non-decreasing in $\theta$.

To show (3.5), we first observe that
$$\Pr\{A_n + B_n \le \underline{U}_\alpha + \underline{V}_\beta\} \le \Pr\{A_n \le \underline{U}_\alpha\} + \Pr\{B_n \le \underline{V}_\beta\}.$$
Then
$$\limsup_{n\to\infty} \Pr\{A_n + B_n \le \underline{U}_\alpha + \underline{V}_\beta\} \le \limsup_{n\to\infty} \Pr\{A_n \le \underline{U}_\alpha\} + \limsup_{n\to\infty} \Pr\{B_n \le \underline{V}_\beta\} \le \alpha + \beta,$$
which, by the definition of $(\underline{U+V})_{\alpha+\beta}$, yields (3.5).

Similarly, we have
$$\Pr\{A_n + B_n \le \underline{U}_\alpha + \bar{V}_\beta\} \le \Pr\{A_n \le \underline{U}_\alpha\} + \Pr\{B_n \le \bar{V}_\beta\}.$$
Then
$$\liminf_{n\to\infty} \Pr\{A_n + B_n \le \underline{U}_\alpha + \bar{V}_\beta\} \le \limsup_{n\to\infty} \Pr\{A_n \le \underline{U}_\alpha\} + \liminf_{n\to\infty} \Pr\{B_n \le \bar{V}_\beta\} \le \alpha + \beta,$$
which, by the definition of $(\overline{U+V})_{\alpha+\beta}$, proves (3.6).

To show (3.7), we remark from (3.5), applied to the sequences $\{A_n + B_n\}$ and $\{-B_n\}$, that $(\underline{U+V})_\alpha + (\underline{-V})_\beta \le (\underline{U+V-V})_{\alpha+\beta} = \underline{U}_{\alpha+\beta}$. Hence,
$$(\underline{U+V})_\alpha \le \underline{U}_{\alpha+\beta} - (\underline{-V})_\beta.$$
(Note that the cases $\alpha+\beta = 1$ or $\beta = 1$ are not allowed here because they may result in $\underline{U}_1 = \bar{V}_1 = \infty$, and the subtraction of two infinite terms is undefined. That is why the condition $1 \ge \alpha+\beta$ of property 2 is replaced by $1 > \alpha+\beta$ in property 3.)

The proof is completed by showing that
$$(\underline{-V})_\beta \ge -\bar{V}_{(1-\beta)^-}. \qquad (3.9)$$
By definition,
$$(\overline{-v})(\theta) \triangleq \limsup_{n\to\infty}\Pr\{-B_n \le \theta\} = \limsup_{n\to\infty}\Pr\{B_n \ge -\theta\} = 1 - \liminf_{n\to\infty}\Pr\{B_n < -\theta\}.$$
Fix any $\gamma > 0$. Since $\bar{V}_{(1-\beta)^-} + \gamma/2 > \bar{V}_{(1-\beta)^-} = \sup\{\theta : v(\theta) < 1-\beta\}$, we have $v(\bar{V}_{(1-\beta)^-} + \gamma/2) \ge 1-\beta$, and therefore
$$(\overline{-v})\big(-\bar{V}_{(1-\beta)^-} - \gamma\big) \le 1 - \liminf_{n\to\infty}\Pr\{B_n \le \bar{V}_{(1-\beta)^-} + \gamma/2\} = 1 - v(\bar{V}_{(1-\beta)^-} + \gamma/2) \le \beta.$$
Consequently, $-\bar{V}_{(1-\beta)^-} - \gamma \le (\underline{-V})_\beta$ for every $\gamma > 0$, which proves (3.9). Combining (3.9) with the previous inequality gives (3.7).

Finally, to show (3.8), we observe from (3.6), applied to the sequences $\{-B_n\}$ and $\{A_n + B_n\}$, that $(\overline{U+V})_\alpha + (\underline{-V})_\beta \le (\overline{U+V-V})_{\alpha+\beta} = \bar{U}_{\alpha+\beta}$. Hence,
$$(\overline{U+V})_\alpha \le \bar{U}_{\alpha+\beta} - (\underline{-V})_\beta.$$

Using (3.9), we obtain the desired result. $\Box$

If we take $\alpha = \beta = 0$ in (3.5) and (3.7), we obtain
$$(\underline{U+V})_0 \ge \underline{U}_0 + \underline{V}_0 \quad \text{and} \quad (\underline{U+V})_0 \le \underline{U}_0 + \bar{V}_{1^-},$$
which means that the liminf in probability of the sequence of random variables $A_n + B_n$ is lower [resp. upper] bounded by the liminf in probability of $A_n$ plus the liminf [resp. limsup] in probability of $B_n$. This fact is used in [11] to show that
$$\underline{H}(Y) - \bar{H}(Y|X) \le \underline{I}(X;Y) \le \underline{H}(Y) - \underline{H}(Y|X),$$
which is a special case of property 3 of Lemma 3.2. The next lemmas establish some analogous properties of the generalized information measures.
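A toy numerical check of the special case just discussed (added here; the sequences are artificial): take $A_n$ and $B_n$ deterministic, with $A_n$ alternating between 0 and 1 and $B_n = 2 - A_n$, so that $A_n + B_n \equiv 2$. For deterministic sequences, the liminf/limsup in probability reduce to the ordinary liminf/limsup of the values.

```python
import numpy as np

N = 10_000
n = np.arange(1, N + 1)
a = (n % 2).astype(float)        # A_n equals 1 for odd n, 0 for even n
b = 2.0 - a                      # B_n equals 1 for odd n, 2 for even n
s = a + b                        # A_n + B_n is identically 2

def liminf(x, tail=1000):        # for constants, liminf in probability = liminf of the values
    return min(x[-tail:])

def limsup(x, tail=1000):
    return max(x[-tail:])

print(liminf(a) + liminf(b), "<=", liminf(s), "<=", liminf(a) + limsup(b))
# 1.0 <= 2.0 <= 2.0 : consistent with (3.5) and (3.7) at alpha = beta = 0.
```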

Lemma 3.2 For $\alpha$, $\beta$, $\alpha+\beta \in [0,1)$, the following statements hold.

1. $\underline{H}_\alpha(X) \ge 0$, and $\underline{H}_\alpha(X) = 0$ if and only if the sequence $\{X^n = (X_1^{(n)}, \ldots, X_n^{(n)})\}_{n=1}^{\infty}$ is ultimately deterministic (in probability). (This property also applies to $\bar{H}_\alpha(X)$, $\underline{I}_\alpha(X;Y)$, $\bar{I}_\alpha(X;Y)$, $\underline{D}_\alpha(X\|\hat{X})$, and $\bar{D}_\alpha(X\|\hat{X})$.)

2. $\underline{I}_\alpha(X;Y) = \underline{I}_\alpha(Y;X)$ and $\bar{I}_\alpha(X;Y) = \bar{I}_\alpha(Y;X)$.

3.
$$\underline{I}_\alpha(X;Y) \le \underline{H}_{\alpha+\beta}(Y) - \underline{H}_\beta(Y|X), \qquad (3.10)$$
$$\underline{I}_\alpha(X;Y) \le \bar{H}_{\alpha+\beta}(Y) - \bar{H}_\beta(Y|X), \qquad (3.11)$$
$$\bar{I}_\alpha(X;Y) \le \bar{H}_{\alpha+\beta}(Y) - \underline{H}_\beta(Y|X), \qquad (3.12)$$
$$\underline{I}_{\alpha+\beta}(X;Y) \ge \underline{H}_\alpha(Y) - \bar{H}_{(1-\beta)^-}(Y|X), \qquad (3.13)$$
and
$$\bar{I}_{\alpha+\beta}(X;Y) \ge \bar{H}_\alpha(Y) - \bar{H}_{(1-\beta)^-}(Y|X). \qquad (3.14)$$

4. $0 \le \underline{H}_\alpha(X) \le \bar{H}_\alpha(X) \le \log|\mathcal{X}|$, where each $X_i^{(n)} \in \mathcal{X}$, $i = 1,\ldots,n$, $n = 1,2,\ldots$, and $\mathcal{X}$ is finite.

5. $\underline{I}_\alpha(X; Y, Z) \ge \underline{I}_\alpha(X; Z)$.

Proof: Property 1 holds because
$$\Pr\left\{\frac{1}{n}\log\frac{1}{P_{X^n}(X^n)} < 0\right\} = 0,$$
$$\Pr\left\{\frac{1}{n}\log\frac{dP_{X^n}}{dP_{\hat{X}^n}}(X^n) < -\gamma\right\} \le \exp\{-n\gamma\},$$
and
$$\Pr\left\{\frac{1}{n}\log\frac{dP_{X^nY^n}}{d(P_{X^n}\times P_{Y^n})}(X^n,Y^n) < -\gamma\right\} \le \exp\{-n\gamma\}.$$

Property 2 is an immediate consequence of the definition.

To show the inequalities in property 3, we first remark that
$$\frac{1}{n}\, h_{Y^n}(Y^n) = \frac{1}{n}\, i_{X^nY^n}(X^n;Y^n) + \frac{1}{n}\, h_{X^nY^n}(Y^n|X^n),$$
where $\frac{1}{n}\, h_{X^nY^n}(Y^n|X^n) \triangleq -\frac{1}{n}\log P_{Y^n|X^n}(Y^n|X^n)$. With this fact, (3.10) follows directly from (3.5), (3.11) and (3.12) follow from (3.6), (3.13) follows from (3.7), and (3.14) follows from (3.8).

Property 4 follows from the fact that $\bar{H}_\delta(\cdot)$ is non-decreasing in $\delta$, so that $\bar{H}_\alpha(X) \le \bar{H}_{1^-}(X) = \bar{H}(X)$, and from the fact that $\bar{H}(X)$ is the minimum achievable (i.e., with asymptotically negligible probability of decoding error) fixed-length coding rate for $X$, as seen in [7, Theorem 3.2] and [10].

Property 5 can be proved using the fact that
$$\frac{1}{n}\, i_{X^nY^nZ^n}(X^n; Y^n, Z^n) = \frac{1}{n}\, i_{X^nZ^n}(X^n; Z^n) + \frac{1}{n}\, i_{X^nY^nZ^n}(X^n; Y^n | Z^n).$$
By applying (3.5) and letting $\beta = 0$, we obtain the desired result. $\Box$

Lemma 3.3 (Data processing lemma) Fix $\alpha \in [0,1)$. Suppose that for every $n$, $X_1^n$ and $X_3^n$ are conditionally independent given $X_2^n$. Then
$$\underline{I}_\alpha(X_1; X_3) \le \underline{I}_\alpha(X_1; X_2).$$

Proof: By property 5 of Lemma 3.2, we get
$$\underline{I}_\alpha(X_1; X_3) \le \underline{I}_\alpha(X_1; X_2, X_3) = \underline{I}_\alpha(X_1; X_2),$$
where the equality holds because
$$\frac{1}{n}\log\frac{dP_{X_1^nX_2^nX_3^n}}{d(P_{X_1^n}\times P_{X_2^nX_3^n})}(x_1^n, x_2^n, x_3^n) = \frac{1}{n}\log\frac{dP_{X_1^nX_2^n}}{d(P_{X_1^n}\times P_{X_2^n})}(x_1^n, x_2^n). \qquad \Box$$
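As a sanity illustration of the data processing inequality in its familiar single-letter form (the lemma itself concerns the α-inf-information rates of processes; this simpler finite check is added here and is not from the paper), the sketch below builds a random Markov triple $X_1 \to X_2 \to X_3$ on small alphabets and verifies $I(X_1;X_3) \le I(X_1;X_2)$.

```python
import numpy as np

rng = np.random.default_rng(1)

def mutual_information(p_xy):
    """Single-letter I(X;Y) in bits from a joint pmf given as a 2-D array."""
    p_xy = np.asarray(p_xy, dtype=float)
    p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)
    nz = p_xy > 0
    return np.sum(p_xy[nz] * np.log2(p_xy[nz] / np.outer(p_x, p_y)[nz]))

def random_stochastic(rows, cols):
    m = rng.random((rows, cols))
    return m / m.sum(axis=1, keepdims=True)

# Random Markov chain X1 -> X2 -> X3 on alphabets of sizes 3, 4, 3.
p1 = rng.random(3); p1 /= p1.sum()          # P(X1)
k12 = random_stochastic(3, 4)               # P(X2 | X1)
k23 = random_stochastic(4, 3)               # P(X3 | X2)

p12 = p1[:, None] * k12                     # joint of (X1, X2)
p13 = p12 @ k23                             # joint of (X1, X3), valid by the Markov property

print(mutual_information(p13), "<=", mutual_information(p12))
```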

Lemma 3.4 (Optimality of independent inputs) Fix $\alpha \in [0,1)$. Consider a finite-alphabet, discrete memoryless channel, i.e., $P_{Y^n|X^n} = \prod_{i=1}^n P_{Y_i|X_i}$ for all $n$. For any input $X$ and its corresponding output $Y$,
$$\underline{I}_\alpha(X;Y) \le \underline{I}_\alpha(\bar{X};\bar{Y}) = \underline{I}(\bar{X};\bar{Y}),$$
where $\bar{Y}$ is the output due to $\bar{X}$, which is an independent process with the same first-order statistics as $X$, i.e., $P_{\bar{X}^n} = \prod_{i=1}^n P_{X_i}$.

Proof: First, we observe that
$$\frac{1}{n}\log\frac{dP_{Y^n|X^n}}{dP_{Y^n}}(X^n,Y^n) + \frac{1}{n}\log\frac{dP_{Y^n}}{dP_{\bar{Y}^n}}(Y^n) = \frac{1}{n}\log\frac{dP_{Y^n|X^n}}{dP_{\bar{Y}^n}}(X^n,Y^n).$$
In other words,
$$\frac{1}{n}\log\frac{dP_{X^nY^n}}{d(P_{X^n}\times P_{Y^n})}(X^n,Y^n) + \frac{1}{n}\log\frac{dP_{Y^n}}{dP_{\bar{Y}^n}}(Y^n) = \frac{1}{n}\log\frac{dP_{X^nY^n}}{d(P_{X^n}\times P_{\bar{Y}^n})}(X^n,Y^n).$$
By evaluating the above terms under $P_{X^nY^n}$ and letting
$$\bar{z}(\theta) \triangleq \limsup_{n\to\infty} P_{X^nY^n}\left\{\frac{1}{n}\log\frac{dP_{X^nY^n}}{d(P_{X^n}\times P_{\bar{Y}^n})}(X^n,Y^n) \le \theta\right\}$$
and
$$\underline{Z}_\alpha(X;\bar{Y}) \triangleq \sup\{\theta : \bar{z}(\theta) \le \alpha\},$$
we obtain from (3.5) (with $\beta = 0$) that
$$\underline{Z}_\alpha(X;\bar{Y}) \ge \underline{I}_\alpha(X;Y) + \underline{D}(Y\|\bar{Y}) \ge \underline{I}_\alpha(X;Y),$$
since $\underline{D}(Y\|\bar{Y}) \ge 0$ by property 1 of Lemma 3.2.

Note that the summability property of $(1/n)\log[dP_{X^nY^n}/d(P_{X^n}\times P_{\bar{Y}^n})](X^n,Y^n)$ (i.e., it equals $(1/n)\sum_{i=1}^n \log[dP_{\bar{X}_i\bar{Y}_i}/d(P_{\bar{X}_i}\times P_{\bar{Y}_i})](X_i,Y_i)$), the Chebyshev inequality, and the finiteness of the channel alphabets imply that
$$\underline{I}(\bar{X};\bar{Y}) = \underline{I}_\alpha(\bar{X};\bar{Y}) \quad \text{and} \quad \underline{Z}(X;\bar{Y}) = \underline{Z}_\alpha(X;\bar{Y}).$$
It finally remains to show that
$$\underline{Z}(X;\bar{Y}) \le \underline{I}(\bar{X};\bar{Y}),$$
which is proved in [11, Theorem 10]. $\Box$

IV. Examples for the Computation of ε-Capacity

In [11], Verdú and Han establish the general formulas for the channel capacity and the ε-capacity. In terms of the ε-inf-information rate, the expression for the ε-capacity becomes
$$\sup_X \underline{I}_{\varepsilon^-}(X;Y) \le C_\varepsilon \le \sup_X \underline{I}_\varepsilon(X;Y),$$
where $\varepsilon \in (0,1)$.

We now provide examples for the computation of $C_\varepsilon$. They are basically an extension of some of the examples provided in [11] for the computation of the channel capacity. Let the alphabet be binary, $\mathcal{X} = \mathcal{Y} = \{0,1\}$, and let every output be given by
$$Y_i = X_i \oplus Z_i,$$
where $\oplus$ represents modulo-2 addition and $Z$ is an arbitrary binary random process independent of $X$. To compute the ε-capacity we use the results of property 3 of Lemma 3.2:

$$\underline{I}_\varepsilon(X;Y) \ge \underline{H}_0(Y) - \bar{H}_{(1-\varepsilon)^-}(Y|X) = \underline{H}(Y) - \bar{H}_{(1-\varepsilon)^-}(Y|X) \qquad (4.15)$$
and
$$\underline{I}_\varepsilon(X;Y) \le \min\left\{\underline{H}_{\varepsilon+\gamma}(Y) - \underline{H}_\gamma(Y|X),\; \bar{H}_{\varepsilon+\gamma}(Y) - \bar{H}_\gamma(Y|X)\right\}, \qquad (4.16)$$
where $\varepsilon \ge 0$, $\gamma \ge 0$ and $1 > \varepsilon + \gamma$. The lower bound in (4.15) follows directly from (3.13) (by taking $\alpha = 0$ and $\beta = \varepsilon$). The upper bounds in (4.16) follow from (3.10) and (3.11), respectively.

$$C_\varepsilon \le \sup_X \underline{I}_\varepsilon(X;Y) \le \sup_X \left\{\underline{H}_{\varepsilon+\gamma}(Y) - \underline{H}_\gamma(Y|X)\right\}.$$
Since the above inequality holds for all $0 \le \gamma < 1 - \varepsilon$, we have
$$C_\varepsilon \le \inf_{0 \le \gamma < 1-\varepsilon}\, \sup_X \left\{\underline{H}_{\varepsilon+\gamma}(Y) - \underline{H}_\gamma(Y|X)\right\}.$$
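To make these bounds concrete, here is a hypothetical numerical illustration (our own construction; the noise model and all parameters below are assumptions, not taken from the paper). Let the noise $Z$ be stationary but non-ergodic: a fair coin selects, once and for all, between i.i.d. Bernoulli($p_1$) and i.i.d. Bernoulli($p_2$) operation. With an i.i.d. uniform input, $Y$ is i.i.d. uniform, so the entropy spectrum of $Y$ is degenerate at 1 bit, while the spectrum of the noise entropy density has two mass points at $h_b(p_1)$ and $h_b(p_2)$, where $h_b(\cdot)$ denotes the binary entropy function. For this particular noise and input, the quantile bounds above collapse to $1 - \underline{H}_{(1-\varepsilon)^-}(Z)$ bits, which is $1 - h_b(p_2)$ for $\varepsilon < 1/2$ and $1 - h_b(p_1)$ for $\varepsilon > 1/2$; the sketch below checks this heuristically by Monte Carlo estimation of the noise spectrum.

```python
import numpy as np

rng = np.random.default_rng(0)
p1, p2 = 0.05, 0.3          # the two noise modes (hypothetical choices)
n, trials = 4000, 4000

def noise_entropy_density(z):
    """-(1/n) log2 P_{Z^n}(z^n) for the equally weighted two-mode noise."""
    k, n = z.sum(), z.size
    logs = [k * np.log2(p) + (n - k) * np.log2(1 - p) for p in (p1, p2)]
    m = max(logs)  # log-sum-exp (base 2) for numerical stability
    return -(m + np.log2(0.5 * sum(2.0 ** (l - m) for l in logs))) / n

samples = np.array([
    noise_entropy_density(rng.random(n) < (p1 if rng.integers(2) == 0 else p2))
    for _ in range(trials)
])

def quantile(delta):
    """Approximate sup{theta : Pr{(1/n) h_{Z^n}(Z^n) <= theta} <= delta} (crude empirical quantile)."""
    s = np.sort(samples)
    idx = int(np.floor(delta * len(s)))
    return s[min(idx, len(s) - 1)]

def hb(p):
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

# Estimated epsilon-capacity (bits/channel use) for the uniform input:
# 1 - H_{(1-eps)^-}(Z), with the strict quantile approximated by a small offset.
for eps in (0.1, 0.25, 0.6, 0.9):
    print(eps, round(1.0 - quantile(1.0 - eps - 1e-9), 3))
print("expected plateaus:", round(1 - hb(p2), 3), "and", round(1 - hb(p1), 3))
```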