Lecture Notes in Information Theory Volume II by Po-Ning Chen† and Fady Alajaji‡



Department of Electrical Engineering Institute of Communication Engineering National Chiao Tung University 1001, Ta Hsueh Road Hsin Chu, Taiwan 30056 Republic of China Email: [email protected]

Department of Mathematics & Statistics, Queen’s University, Kingston, ON K7L 3N6, Canada Email: [email protected] December 7, 2014

c Copyright by

Po-Ning Chen† and Fady Alajaji‡ December 7, 2014

Preface The reliable transmission of information bearing signals over a noisy communication channel is at the heart of what we call communication. Information theory—founded by Claude E. Shannon in 1948—provides a mathematical framework for the theory of communication; it describes the fundamental limits to how efficiently one can encode information and still be able to recover it with negligible loss. This course will examine the basic concepts of this theory. What follows is a tentative list of topics to be covered. 1. Volume I: (a) Fundamentals of source coding (data compression): Discrete memoryless sources, entropy, redundancy, block encoding, variable-length encoding, Kraft inequality, Shannon code, Huffman code. (b) Fundamentals of channel coding: Discrete memoryless channels, mutual information, channel capacity, coding theorem for discrete memoryless channels, weak converse, channel capacity with output feedback, the Shannon joint source-channel coding theorem. (c) Source coding with distortion (rate distortion theory): Discrete memoryless sources, rate-distortion function and its properties, rate-distortion theorem. (d) Other topics: Information measures for continuous random variables, capacity of discrete-time and band-limited continuous-time Gaussian channels, rate-distortion function of the memoryless Gaussian source, encoding of discrete sources with memory, capacity of discrete channels with memory. (e) Fundamental backgrounds on real analysis and probability (Appendix): The concept of set, supremum and maximum, infimum and minimum, boundedness, sequences and their limits, equivalence, probability space, random variable and random process, relation between a source and a random process, convergence of sequences of random variables, ergodicity and laws of large numbers, central limit theorem, concavity and convexity, Jensen’s inequality.


2. Volume II: (a) General information measures: Information spectrum and quantile and their properties, Rényi's information measures. (b) Advanced topics of lossless data compression: Fixed-length lossless data compression theorem for arbitrary sources, variable-length lossless data compression theorem for arbitrary sources, entropy of English, Lempel-Ziv code. (c) Measure of randomness and resolvability: Resolvability and source coding, approximation of output statistics for arbitrary channels. (d) Advanced topics of channel coding: Channel capacity for arbitrary single-user channels, optimistic Shannon coding theorem, strong capacity, ε-capacity. (e) Advanced topics of lossy data compression. (f) Hypothesis testing: Error exponent and divergence, large deviations theory, Berry-Esseen theorem. (g) Channel reliability: Random coding exponent, expurgated exponent, partitioning exponent, sphere-packing exponent, the asymptotic largest minimum distance of block codes, Elias bound, Varshamov-Gilbert bound, Bhattacharyya distance. (h) Information theory of networks: Distributed detection, data compression over distributed sources, capacity of multiple access channels, degraded broadcast channel, Gaussian multiple terminal channels. As shown from the list, the lecture notes are divided into two volumes. The first volume is suitable for a 12-week introductory course such as that given at the Department of Mathematics and Statistics, Queen's University at Kingston, Canada. It also meets the need of a fundamental course for senior undergraduates, such as that given at the Department of Computer Science and Information Engineering, National Chi Nan University, Taiwan. For an 18-week graduate course, such as that given in the Department of Communications Engineering, National Chiao-Tung University, Taiwan, the lecturer can selectively add advanced topics covered in the second volume to enrich the lecture content and provide a more complete and advanced view of information theory to students. The authors are very much indebted to all people who provided insightful comments on these lecture notes. Special thanks are devoted to Prof. Yunghsiang S. Han of the Department of Computer Science and Information Engineering at National Chi Nan University, Taiwan, for his enthusiasm in testing these lecture notes at his school and for providing the authors with valuable feedback.


Notes to readers. In these notes, all the assumptions, claims, conjectures, corollaries, definitions, examples, exercises, lemmas, observations, properties, and theorems are numbered under the same counter for ease of their searching. For example, the lemma that immediately follows Theorem 2.1 will be numbered as Lemma 2.2, instead of Lemma 2.1. In addition, you may obtain the latest version of the lecture notes from http://shannon.cm.nctu.edu.tw. Interested readers are welcome to return comments to [email protected].


Acknowledgements Thanks are given to our families for their full support during the period of writing these lecture notes.


Table of Contents

1 Introduction
  1.1 Notations

2 Generalized Information Measures for Arbitrary System Statistics
  2.1 Spectrum and quantile
  2.2 Properties of quantile
  2.3 Generalized information measures
  2.4 Properties of generalized information measures
  2.5 Examples for the computation of general information measures
  2.6 Rényi's information measures

3 General Lossless Data Compression Theorems
  3.1 Fixed-length data compression codes for arbitrary sources
  3.2 Generalized AEP theorem
  3.3 Variable-length lossless data compression codes that minimize the exponentially weighted codeword length
    3.3.1 Criterion for optimality of codes
    3.3.2 Source coding theorem for Rényi's entropy
  3.4 Entropy of English
    3.4.1 Markov estimate of entropy rate of English text
    3.4.2 Gambling estimate of entropy rate of English text
      A) Sequential gambling
  3.5 Lempel-Ziv code revisited

4 Measure of Randomness for Stochastic Processes
  4.1 Motivation for resolvability: measure of randomness of random variables
  4.2 Notations and definitions regarding resolvability
  4.3 Operational meanings of resolvability and mean-resolvability
  4.4 Resolvability and source coding

5 Channel Coding Theorems and Approximations of Output Statistics for Arbitrary Channels
  5.1 General models for channels
  5.2 Variations of capacity formulas for arbitrary channels
    5.2.1 Preliminaries
    5.2.2 ε-capacity
    5.2.3 General Shannon capacity
    5.2.4 Strong capacity
    5.2.5 Examples
  5.3 Structures of good data transmission codes
  5.4 Approximations of output statistics: resolvability for channels
    5.4.1 Motivations
    5.4.2 Notations and definitions of resolvability for channels
    5.4.3 Results on resolvability and mean-resolvability for channels

6 Optimistic Shannon Coding Theorems for Arbitrary Single-User Systems
  6.1 Motivations
  6.2 Optimistic source coding theorems
  6.3 Optimistic channel coding theorems
  6.4 Examples
    6.4.1 Information stable channels
    6.4.2 Information unstable channels

7 Lossy Data Compression
  7.1 General lossy source compression for block codes

8 Hypothesis Testing
  8.1 Error exponent and divergence
    8.1.1 Composition of sequence of i.i.d. observations
    8.1.2 Divergence typical set on composition
    8.1.3 Universal source coding on compositions
    8.1.4 Likelihood ratio versus divergence
    8.1.5 Exponent of Bayesian cost
  8.2 Large deviations theory
    8.2.1 Tilted or twisted distribution
    8.2.2 Conventional twisted distribution
    8.2.3 Cramér's theorem
    8.2.4 Exponent and moment generating function: an example
  8.3 Theories on large deviations
    8.3.1 Extension of Gärtner-Ellis upper bounds
    8.3.2 Extension of Gärtner-Ellis lower bounds
    8.3.3 Properties of (twisted) sup- and inf-large deviation rate functions
  8.4 Probabilistic subexponential behavior
    8.4.1 Berry-Esseen theorem for compound i.i.d. sequence
    8.4.2 Berry-Esseen theorem with a sample-size dependent multiplicative coefficient for i.i.d. sequence
    8.4.3 Probability bounds using the Berry-Esseen inequality
  8.5 Generalized Neyman-Pearson hypothesis testing

9 Channel Reliability Function
  9.1 Random-coding exponent
    9.1.1 The properties of the random coding exponent
  9.2 Expurgated exponent
    9.2.1 The properties of the expurgated exponent
  9.3 Partitioning bound: an upper bound for channel reliability
  9.4 Sphere-packing exponent: an upper bound of the channel reliability
    9.4.1 Problem of sphere-packing
    9.4.2 Relation of sphere-packing and coding
    9.4.3 The largest minimum distance of block codes
      A) Distance-spectrum formula on the largest minimum distance of block codes
      B) Determination of the largest minimum distance for a class of distance functions
      C) General properties of the distance-spectrum function
      D) General Varshamov-Gilbert lower bound
    9.4.4 Elias bound: a single-letter upper bound formula on the largest minimum distance for block codes
    9.4.5 Gilbert bound and Elias bound for Hamming distance and binary alphabet
    9.4.6 Bhattacharyya distance and expurgated exponent
  9.5 Straight line bound

10 Information Theory of Networks
  10.1 Lossless data compression over distributed sources for block codes
    10.1.1 Full decoding of the original sources
    10.1.2 Partial decoding of the original sources
  10.2 Distributed detection
    10.2.1 Neyman-Pearson testing in parallel distributed detection
    10.2.2 Bayes testing in parallel distributed detection systems
  10.3 Capacity region of multiple access channels
  10.4 Degraded broadcast channel
  10.5 Gaussian multiple terminal channels

List of Tables

2.1 Generalized entropy measures where δ ∈ [0, 1]
2.2 Generalized mutual information measures where δ ∈ [0, 1]
2.3 Generalized divergence measures where δ ∈ [0, 1]

List of Figures

2.1 The asymptotic CDFs of a sequence of random variables {A_n}
2.2 The spectrum h̄(θ) for Example 2.7
2.3 The spectrum ī(θ) for Example 2.7
2.4 The limiting spectrums of (1/n)h_{Z^n}(Z^n) for Example 2.8
2.5 The possible limiting spectrums of (1/n)i_{X^n,Y^n}(X^n; Y^n) for Example 2.8
3.1 Behavior of the probability of block decoding error as block length n goes to infinity for an arbitrary source X
3.2 Illustration of the generalized AEP theorem
3.3 Notations used in the Lempel-Ziv coder
4.1 Source generator: {X_t}, t ∈ (0, 1), with P_{X_t}(0) = 1 − t, selected by an independent selector Z
4.2 The ultimate CDF of −(1/n) log P_{X^n}(X^n): Pr{h_b(Z) ≤ t}
5.1 The ultimate CDFs of (1/n) log P_{N^n}(N^n)
5.2 The ultimate CDF of the normalized information density for Example 5.14, Case B)
5.3 The communication system
5.4 The simulated communication system
7.1 λ_{Z,f(Z)}(D + γ) > ε ⇒ sup[θ : λ_{Z,f(Z)}(θ) ≤ ε] ≤ D + γ
7.2 The CDF of (1/n)ρ_n(Z^n, f_n(Z^n)) for the probability-of-error distortion measure
8.1 The geometric meaning of Sanov's theorem
8.2 The divergence view on hypothesis testing
8.3 Function of (π/6)u − h(u)
8.4 The Berry-Esseen constant as a function of the sample size n (n plotted in log scale)
9.1 BSC with crossover probability ε and input distribution (p, 1 − p)
9.2 Random coding exponent for the BSC with crossover probability 0.2
9.3 Expurgated and random coding exponents for the BSC with crossover probability 0.2, over the range (0, 0.192745)
9.4 Expurgated and random coding exponents for the BSC with crossover probability 0.2, over the range (0, 0.006)
9.5 Partitioning, random coding and expurgated exponents for the BSC with crossover probability 0.2
9.6 (a) The shaded area is U_m^c; (b) the shaded area is A_{m,m'}^c
9.7 Λ̄_X(R) asymptotically lies between ess inf (1/n)μ_n(X̂^n, X^n) and a conditional expectation of (1/n)μ_n(X̂^n, X^n), for R̄_p(X) < R < R̄_0(X)
9.8 General curve of Λ̄_X(R)
9.9 Function of sup_{s>0}[−sR − s log(2s(1 − e^{−1/2s}))]
9.10 Function of sup_{s>0}[−sR − s log((2 + e^{−1/s})/4)]
10.1 The multi-access channel
10.2 Distributed detection with n senders
10.3 Distributed detection in S_n
10.4 Bayes error probabilities associated with ĝ and ḡ
10.5 Upper and lower bounds on e*_NP(α) in Case B

Chapter 1 Introduction This volume will talk about some advanced topics in information theory. The mathematical background on which these topics are based can be found in Appendices A and B of Volume I.

1.1

Notations

Here, we clarify some of the easily confused notations used in Volume II of the lecture notes. For a random variable X, we use PX to denote its distribution. For convenience, we will use interchangeably the two expressions for the probability of X = x, i.e., Pr{X = x} and PX (x). Similarly, for the probability of a set characterizing through an inequality, such as f (x) < a, its probability mass will be expressed by either PX {x ∈ X : f (x) < a} or Pr {f (X) < a} . In the second expression, we view f (X) as a new random variable defined through X and a function f (·). Obviously, the above expressions can be applied to any legitimate function f (·) defined over X , including any probability function PXˆ (·) (or log PXˆ (x)) of a ˆ Therefore, the next two expressions denote the probability random variable X. of f (x) = PXˆ (x) < a evaluated under distribution PX : PX {x ∈ X : f (x) < a} = PX {x ∈ X : PXˆ (x) < a} and Pr {f (X) < a} = Pr {PXˆ (X) < a} .


As a result, if we write  PX, ˆ Yˆ (x, y) U ¯δ + γ for some γ > 0. Then by definition of U ¯ +− , lim inf n→∞ Pr[An < Suppose U δ δ ¯ ¯ ¯δ +γ] < δ ≤ δ Uδ +γ] < δ, which implies lim inf n→∞ Pr[An ≤ Uδ +γ/2] ≤ lim inf n→∞ Pr[An < U + ¯ ¯ ¯ and violates the definition of Uδ . This completes the proof of Uδ− ≤ Uδ . To prove that ¯+ ≤ U ¯δ , note that from the definition of U ¯ + , we have that for any ε > 0, lim inf n→∞ Pr[An < U δ δ + ¯ ¯ + − 2ε] ≤ δ, implying that U ¯ + − 2ε ≤ U ¯δ . The Uδ − ε] ≤ δ and hence lim inf n→∞ Pr[An ≤ U δ δ + + ¯ ¯ ¯ latter yields that Uδ ≤ inf ε>0 [Uδ − 2ε] = Uδ , which completes the proof. Proving that ¯ +− ≤ U ¯δ− follows a similar argument. U δ ¯δ− = limξ↑δ U ¯ξ . Their equality can be proved by first observing It is worth noting that U  ¯δ− ≥ limξ↑δ U ¯ξ by their definitions, and then assuming that γ , U ¯δ− − limξ↑δ U ¯ξ > 0. that U ¯ξ < U ¯ξ + γ/2 ≤ (U ¯δ− ) − γ/2 implies that u((U ¯δ− ) − γ/2) > ξ for ξ arbitrarily close to δ Then U ¯δ− ) − γ/2) ≥ δ, contradicting to the definition of U ¯δ − . from below, which in turn implies u((U ¯δ− and limξ↑δ U ¯δ for convenience. In the lecture notes, we will interchangeably use U ¯ +− and U ¯ + will not be used in defining our general information The final note is that U δ δ measures. They are introduced only for mathematical interests. 3

Note that the usual definition of the quantile function φ(δ) of a non-decreasing function F (·) is slightly different from our definition [1, pp. 190], where φ(δ) , sup{θ : F (θ) < δ}. Remark that if F (·) is strictly increasing, then the quantile is nothing but the inverse of F (·): φ(δ) = F −1 (δ).
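The difference between the two conventions only matters at the jump points of F. The following small sketch (ours, not part of the notes; the step function and grid are arbitrary choices) compares sup{θ : F(θ) ≤ δ} with the textbook quantile sup{θ : F(θ) < δ}.

```python
import numpy as np

# Step CDF F of a random variable taking values {1, 2, 3} with probs .2/.5/.3.
xs = np.linspace(0.0, 4.0, 4001)
F = lambda t: 0.2 * (t >= 1) + 0.5 * (t >= 2) + 0.3 * (t >= 3)

def quantile_leq(delta):
    """sup{theta : F(theta) <= delta}, the convention used in these notes."""
    return xs[F(xs) <= delta].max()

def quantile_lt(delta):
    """sup{theta : F(theta) < delta}, the convention of [1, pp. 190]."""
    return xs[F(xs) < delta].max()

for d in (0.2, 0.5, 0.7):
    print(d, quantile_leq(d), quantile_lt(d))
# At delta = 0.2 (a value attained by F) the "<" version stops just below 1,
# while the "<=" version runs up to just below 2; at delta = 0.5 (not attained
# by F) the two conventions agree.
```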


satisfies4 $U = \lim_{\delta\downarrow 0} U_\delta = U_0$. Also, the limsup in probability $\bar U$ (cf. [4]), defined as the smallest extended real number such that for all $\xi > 0$,
$$\lim_{n\to\infty}\Pr[A_n \ge \bar U + \xi] = 0,$$
is exactly5
$$\bar U = \lim_{\delta\uparrow 1}\bar U_\delta = \sup\{\theta : u(\theta) < 1\}.$$

Straightforwardly by their definitions, $U \le U_\delta \le \bar U_\delta \le \bar U$ for $\delta \in [0, 1)$. Remark that $U_\delta$ and $\bar U_\delta$ always exist. Furthermore, if $U_\delta = \bar U_\delta$ for all $\delta$ in $[0, 1]$, the sequence of random variables $A_n$ converges in distribution to a random variable $A$, provided the sequence of $A_n$ is tight. For a better understanding of the quantities defined above, we depict them in Figure 2.1.

Footnote 4: It is obvious from their definitions that $\lim_{\delta\downarrow 0} U_\delta \ge U_0 \ge U$. The equality of $\lim_{\delta\downarrow 0} U_\delta$ and $U$ can be proved by contradiction by first assuming $\gamma \triangleq \lim_{\delta\downarrow 0} U_\delta - U > 0$. Then $\bar u(U + \gamma/2) \le \delta$ for arbitrarily small $\delta > 0$, which immediately implies $\bar u(U + \gamma/2) = 0$, contradicting the definition of $U$.

Footnote 5: Since $1 = \lim_{n\to\infty}\Pr\{A_n < \bar U + \xi\} \le \lim_{n\to\infty}\Pr\{A_n \le \bar U + \xi\} = u(\bar U + \xi)$, it is straightforward that $\bar U \ge \sup\{\theta : u(\theta) < 1\} = \lim_{\delta\uparrow 1}\bar U_\delta$. The equality of $\bar U$ and $\lim_{\delta\uparrow 1}\bar U_\delta$ can be proved by contradiction by first assuming that $\gamma \triangleq \bar U - \lim_{\delta\uparrow 1}\bar U_\delta > 0$. Then $1 \ge u(\bar U - \gamma/2) > \delta$ for $\delta$ arbitrarily close to 1, which implies $u(\bar U - \gamma/2) = 1$. Accordingly, by
$$1 \ge \liminf_{n\to\infty}\Pr\{A_n < \bar U - \gamma/4\} \ge \liminf_{n\to\infty}\Pr\{A_n \le \bar U - \gamma/2\} = u(\bar U - \gamma/2) = 1,$$
we obtain the desired contradiction.


[Figure: the asymptotic CDFs $\bar u(\cdot)$ and $u(\cdot)$ of $\{A_n\}$ plotted against $\theta$, with the points $U$, $\bar U_0$, $\bar U_\delta$ and $\bar U_{1^-}$ marked on the horizontal axis and the level $\delta$ marked on the vertical axis.]

Figure 2.1: The asymptotic CDFs of a sequence of random variables $\{A_n\}_{n=1}^{\infty}$; $\bar u(\cdot)$ = sup-spectrum of $A_n$, $u(\cdot)$ = inf-spectrum of $A_n$, and $\bar U_{1^-} = \lim_{\xi\uparrow 1}\bar U_\xi$.
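The spectra and their quantiles can be estimated numerically. The sketch below is our own illustration (not from the notes): it approximates the inf-spectrum and the quantile sup{θ : u(θ) ≤ δ} by Monte Carlo for A_n = −(1/n) log P_{X^n}(X^n) of a two-component mixed Bernoulli source, with a few finite block lengths standing in for the limits in Definitions 2.1 and 2.2.

```python
import numpy as np

rng = np.random.default_rng(0)

def h_b(p):
    """Binary entropy in nats."""
    return -p * np.log(p) - (1 - p) * np.log(1 - p)

def sample_An(n, trials=4000):
    """Draws of A_n = -(1/n) log P_{X^n}(X^n) for a source that picks
    Bernoulli(0.1) or Bernoulli(0.4) once, each with probability 1/2."""
    p_true = np.where(rng.random(trials) < 0.5, 0.1, 0.4)
    k = rng.binomial(n, p_true)                      # number of ones in X^n
    comp = [k * np.log(p) + (n - k) * np.log(1 - p) for p in (0.1, 0.4)]
    log_pmix = np.logaddexp(np.log(0.5) + comp[0], np.log(0.5) + comp[1])
    return -log_pmix / n

def inf_spectrum(samples_by_n, theta):
    """Stand-in for u(theta) = liminf_n Pr[A_n <= theta]: take the minimum
    over the block lengths actually simulated."""
    return min(float(np.mean(a <= theta)) for a in samples_by_n)

def quantile_of_inf_spectrum(samples_by_n, delta, grid):
    """Stand-in for the quantile sup{theta : u(theta) <= delta}."""
    feasible = [t for t in grid if inf_spectrum(samples_by_n, t) <= delta]
    return max(feasible) if feasible else float("-inf")

samples = [sample_An(n) for n in (200, 400, 800)]
grid = np.linspace(0.0, 1.0, 401)
for delta in (0.1, 0.4, 0.6, 0.9):
    print(delta, round(quantile_of_inf_spectrum(samples, delta, grid), 3))
print("h_b(0.1) =", round(h_b(0.1), 3), " h_b(0.4) =", round(h_b(0.4), 3))
# The estimated quantiles sit near h_b(0.1) for delta below the 1/2 mixture
# weight and near h_b(0.4) above it, matching the two-step limiting spectrum.
```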

2.2

Properties of quantile

∞ ¯(·) Lemma 2.3 Consider two random sequences, {An }∞ n=1 and {Bn }n=1 . Let u ∞ and u(·) be respectively the sup- and inf-spectrums of {An }n=1 . Similarly, let v¯(·) and v(·) denote respectively the sup- and inf-spectrums of {Bn }∞ n=1 . Define ∞ ¯ Uδ and Uδ be the quantiles of the sup- and inf-spectrums of {An }n=1 ; also define Vδ and V¯δ be the quantiles of the sup- and inf-spectrums of {Bn }∞ n=1 . Now let (u + v)(·) and (u + v)(·) denote the sup- and inf-spectrums of sum sequence {An + Bn }∞ n=1 , i.e.,

(u + v)(θ) , lim sup Pr{An + Bn ≤ θ}, n→∞

and (u + v)(θ) , lim inf Pr{An + Bn ≤ θ}. n→∞

Again, define (U + V )δ and (U + V )δ be the quantiles with respect to (u + v)(·) and (u + v)(·). Then the following statements hold. 1. Uδ and U¯δ are both non-decreasing and right-continuous functions of δ for δ ∈ [0, 1].


2. limδ↓0 Uδ = U0 and limδ↓0 U¯δ = U¯0 . 3. For δ ≥ 0, γ ≥ 0, and δ + γ ≤ 1, (U + V )δ+γ ≥ Uδ + Vγ ,

(2.2.1)

(U + V )δ+γ ≥ Uδ + V¯γ .

(2.2.2)

and 4. For δ ≥ 0, γ ≥ 0, and δ + γ ≤ 1, (U + V )δ ≤ Uδ+γ + V¯(1−γ) ,

(2.2.3)

(U + V )δ ≤ U¯δ+γ + V¯(1−γ) .

(2.2.4)

and

Proof: The proof of property 1 follows directly from the definitions of Uδ and U¯δ and the fact that the inf-spectrum and the sup-spectrum are non-decreasing in δ. The proof of property 2 can be proved by contradiction as follows. Suppose limδ↓0 Uδ > U0 + ε for some ε > 0. Then for any δ > 0, u¯(U0 + ε/2) ≤ δ. Since the above inequality holds for every δ > 0, and u¯(·) is a non-negative function, we obtain u¯(U0 + ε/2) = 0, which contradicts to the definition of U0 . We can prove limδ↓0 U¯δ = U¯0 in a similar fashion. To show (2.2.1), we observe that for α > 0, lim sup Pr{An + Bn ≤ Uδ + Vγ − 2α} n→∞

≤ lim sup (Pr{An ≤ Uδ − α} + Pr{Bn ≤ Vγ − α}) n→∞

≤ lim sup Pr{An ≤ Uδ − α} + lim sup Pr{Bn ≤ Vγ − α} n→∞

n→∞

≤ δ + γ, which, by definition of (U + V )δ+γ , yields (U + V )δ+γ ≥ Uδ + Vγ − 2α. The proof is completed by noting that α can be made arbitrarily small.


Similarly, we note that for α > 0, lim inf Pr{An + Bn ≤ Uδ + V¯γ − 2α} n→∞

 ≤ lim inf Pr{An ≤ Uδ − α} + Pr{Bn ≤ V¯γ − α} n→∞

≤ lim sup Pr{An ≤ Uδ − α} + lim inf Pr{Bn ≤ V¯γ − α} n→∞

n→∞

≤ δ + γ, which, by definition of (U + V )δ+γ and arbitrarily small α, proves (2.2.2). To show (2.2.3), we first observe that (2.2.3) trivially holds when γ = 1 (and δ = 0). It remains to prove its validity under γ < 1. Remark from (2.2.1) that (U + V )δ + (−V )γ ≤ (U + V − V )δ+γ = Uδ+γ . Hence, (U + V )δ ≤ Uδ+γ − (−V )γ . (Note that the case γ = 1 is not allowed here because it results in U1 = (−V )1 = ∞, and the subtraction between two infinite terms is undefined. That is why we need to exclude the case of γ = 1 for the subsequent proof.) The proof is then completed by showing that −(−V )γ ≤ V¯(1−γ) . By definition, (−v)(θ) , lim sup Pr {−Bn ≤ θ} n→∞

= 1 − lim inf Pr {Bn < −θ} n→∞

Then V¯(1−γ) , sup{θ : v(θ) ≤ 1 − γ} = sup{θ : lim inf Pr[Bn ≤ θ] ≤ 1 − γ} n→∞

≥ sup{θ : lim inf Pr[Bn < θ] < 1 − γ} (cf. footnote 1) n→∞

= sup{−θ : lim inf Pr[Bn < −θ] < 1 − γ} n→∞

= sup{−θ : 1 − lim sup Pr[−Bn ≤ θ] < 1 − γ} n→∞

= sup{−θ : lim sup Pr[−Bn ≤ θ] > γ} n→∞

= − inf{θ : lim sup Pr[−Bn ≤ θ] > γ} n→∞

= − sup{θ : lim sup Pr[−Bn ≤ θ] ≤ γ} n→∞

= − sup{θ : (−v)(θ) ≤ γ} = −(−V )γ .


(2.2.5)

Finally, to show (2.2.4), we again note that it is trivially true for γ = 1. We then observe from (2.2.2) that (U + V )δ + (−V )γ ≤ (U + V − V )δ+γ = U¯δ+γ . Hence, (U + V )δ ≤ U¯δ+γ − (−V )γ . 2

Using (2.2.5), we have the desired result.

2.3

Generalized information measures

In Definitions 2.1 and 2.2, if we let the random variable An equal the normalized entropy density6 1 1 hX n (X n ) , − log PX n (X n ) n n of an arbitrary source n  o∞ (n) (n) n (n) X = X = X1 , X2 , . . . , Xn , n=1

we obtain two generalized entropy measures for X, i.e., δ-inf-entropy rate Hδ (X) = quantile of sup-spectrum of

1 hX n (X n ) n

¯ δ (X) = quantile of inf-spectrum of 1 hX n (X n ). δ-sup-entropy rate H n ¯ Note that the inf-entropy-rate H(X) and the sup-entropy-rate H(X) introduced in [4] are special cases of the δ-inf/sup-entropy rate measures: H(X) = H0 (X),

¯ ¯ δ (X). and H(X) = lim H δ↑1

In concept, we can image that if the random variable (1/n)h(X n ) exhibits a limiting distribution, then the sup-entropy rate is the right-margin of the support of that limiting distribution. For example, suppose that the limiting distribution ¯ of (1/n)hX n (X n ) is positive over (−2, 2), and zero, otherwise. Then H(X) = 2. Similarly, the inf-entropy rate is the left margin of the support of the limiting random variable lim supn→∞ (1/n)hX n (X n ), which is −2 for the same example. Analogously, for an arbitrary channel n  o∞ (n) W = (Y |X) = W n = W1 , . . . , Wn(n) , n=1

6

The random variable hX n (X n ) , − log PX n (X n ) is named the entropy density. Hence, the normalized entropy density is equal to hX n (X n ) divided or normalized by the blocklength n.


or more specifically, PW = PY |X = {PY n |X n }∞ n=1 with input X and output Y , if we replace An in Definitions 2.1 and 2.2 by the normalized information density 1 1 1 PX n ,Y n (X n , Y n ) n n n n n n n n iX W (X ; Y ) = iX ,Y (X ; Y ) , log , n n n PX n (X n )PY n (Y n ) we get the δ-inf/sup-information rates, denoted respectively by Iδ (X; Y ) and I¯δ (X; Y ), as: Iδ (X; Y ) = quantile of sup-spectrum of

1 iX n W n (X n ; Y n ) n

1 I¯δ (X; Y ) = quantile of inf-spectrum of iX n W n (X n ; Y n ). n Similarly, for a simple hypothesis testing system with arbitrary observation statistics for each hypothesis,  H0 : PX H1 : PXˆ we can replace An in Definitions 2.1 and 2.2 by the normalized log-likelihood ratio n 1 ˆ n ) , 1 log PX n (X ) dX n (X n kX n n PXˆ n (X n )

ˆ and to obtain the δ-inf/sup-divergence rates, denoted respectively by Dδ (XkX) ˆ ¯ Dδ (XkX), as ˆ = quantile of sup-spectrum of 1 dX n (X n kX ˆ n) Dδ (XkX) n ˆ = quantile of inf-spectrum of 1 dX n (X n kX ¯ δ (XkX) ˆ n ). D n The above replacements are summarized in Tables 2.1–2.3.
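As a companion to Tables 2.1–2.3, here is a small Monte Carlo sketch of our own (a single large block length stands in for the limsup) that estimates the information sup-spectrum and the δ-inf-information rate for a memoryless binary symmetric channel with fair-coin input.

```python
import numpy as np

rng = np.random.default_rng(1)

def info_density_samples(n, eps, trials=5000):
    """(1/n) i_{X^n W^n}(X^n; Y^n) for a memoryless BSC with crossover eps and
    fair-coin input: the output is also fair-coin, so each symbol contributes
    log(2(1-eps)) if received correctly and log(2 eps) if flipped."""
    flips = rng.random((trials, n)) < eps
    per_symbol = np.where(flips, np.log(2 * eps), np.log(2 * (1 - eps)))
    return per_symbol.mean(axis=1)

def sup_spectrum(samples, theta):
    """Stand-in for i_bar(theta) = limsup_n Pr[(1/n) i(X^n; Y^n) <= theta]."""
    return float(np.mean(samples <= theta))

def delta_inf_information_rate(samples, delta, grid):
    """I_delta(X;Y) = sup{theta : i_bar(theta) <= delta}, as in Table 2.2."""
    feasible = [t for t in grid if sup_spectrum(samples, t) <= delta]
    return max(feasible) if feasible else float("-inf")

eps = 0.1
samples = info_density_samples(n=2000, eps=eps)
grid = np.linspace(0.0, np.log(2), 201)
for delta in (0.01, 0.5, 0.99):
    print(delta, round(delta_inf_information_rate(samples, delta, grid), 3))
print("log 2 - h_b(eps) =",
      round(np.log(2) + eps * np.log(eps) + (1 - eps) * np.log(1 - eps), 3))
# This channel is information stable, so the spectrum concentrates at
# log 2 - h_b(eps) and every delta-quantile lands close to that single value.
```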

2.4

Properties of generalized information measures

In this section, we will introduce the properties regarding the general information measured defined in the previous section. We begin with the generalization of a simple property I(X; Y ) = H(Y ) − H(Y |X). By taking δ = 0 and letting γ ↓ 0 in (2.2.1) and (2.2.3), we obtain (U + V ) ≥ U0 + lim Vγ ≥ U + V γ↓0


Entropy Measures system

arbitrary source X

norm. entropy density entropy sup-spectrum entropy inf-spectrum

1 1 hX n (X n ) , − log PX n (X n ) n n  1 n ¯ hX n (X ) ≤ θ h(θ) , lim sup Pr n n→∞   1 n hX n (X ) ≤ θ h(θ) , lim inf Pr n→∞ n

δ-inf-entropy rate

¯ Hδ (X) , sup{θ : h(θ) ≤ δ}

δ-sup-entropy rate

¯ δ (X) , sup{θ : h(θ) ≤ δ} H

sup-entropy rate

¯ ¯ δ (X) H(X) , limδ↑1 H

inf-entropy rate

H(X) , H0 (X)

Table 2.1: Generalized entropy measures where δ ∈ [0, 1].

and (U + V ) ≤ lim Uγ + lim V¯(1−γ) = U + V¯ , γ↓0

γ↓0

which mean that the liminf in probability of a sequence of random variables An + Bn is upper bounded by the liminf in probability of An plus the limsup in probability of Bn , and is lower bounded by the sum of the liminfs in probability of An and Bn . This fact is used in [5] to show that ¯ |X), I(X; Y ) +H(Y |X) ≤ H(Y ) ≤ I(X; Y ) + H(Y or equivalently, ¯ |X) ≤ I(X; Y ) ≤ H(Y ) −H(Y |X). H(Y ) − H(Y Other properties of the generalized information measures are summarized in the next Lemma. Lemma 2.4 For finite alphabet X , the following statements hold. ¯ δ (X) ≥ 0 for δ ∈ [0, 1]. 1. H ˆ ¯ δ (XkX), (This property also applies to Hδ (X), I¯δ (X; Y ), Iδ (X; Y ), D ˆ and Dδ (XkX).)


Mutual Information Measures arbitrary channel PW = PY |X with input X and output Y 1 norm. information density iX n W n (X n ; Y n ) n PX n ,Y n (X n , Y n ) 1 log , n PX n (X n ) × PY n (Y n )   1 n n information sup-spectrum ¯i(θ) , lim sup Pr iX n W n (X ; Y ) ≤ θ n n→∞   1 n n information inf-Spectrum i(θ) , lim inf Pr iX n W n (X ; Y ) ≤ θ n→∞ n system

δ-inf-information rate

Iδ (X; Y ) , sup{θ : ¯i(θ) ≤ δ}

δ-sup-information rate

I¯δ (X; Y ) , sup{θ : i(θ) ≤ δ}

sup-information rate

¯ I(X; Y ) , limδ↑1 I¯δ (X; Y )

inf-information rate

I(X; Y ) , I0 (X; Y )

Table 2.2: Generalized mutual information measures where δ ∈ [0, 1].

2. Iδ (X; Y ) = Iδ (Y ; X) and I¯δ (X; Y ) = I¯δ (Y ; X) for δ ∈ [0, 1]. 3. For 0 ≤ δ < 1, 0 ≤ γ < 1 and δ + γ ≤ 1, Iδ (X; Y ) ≤ Hδ+γ (Y ) −Hγ (Y |X),

(2.4.1)

¯ δ+γ (Y ) − H ¯ γ (Y |X), Iδ (X; Y ) ≤ H ¯ δ+γ (Y ) −Hδ (Y |X), I¯γ (X; Y ) ≤ H

(2.4.2)

¯ (1−γ) (Y |X), Iδ+γ (X; Y ) ≥ Hδ (Y ) − H

(2.4.4)

¯ δ (Y ) − H ¯ (1−γ) (Y |X). I¯δ+γ (X; Y ) ≥ H

(2.4.5)

(2.4.3)

and (Note that the case of (δ, γ) = (1, 0) holds for (2.4.1) and (2.4.2), and the case of (δ, γ) = (0, 1) holds for (2.4.3), (2.4.4) and (2.4.5).) ¯ δ (X) ≤ log |X | for δ ∈ [0, 1), where each X (n) takes values 4. 0 ≤ Hδ (X) ≤ H i in X for i = 1, . . . , n and n = 1, 2, . . ..


Divergence Measures ˆ arbitrary sources X and X

system An : norm. log-likelihood ratio divergence sup-spectrum divergence inf-spectrum

n 1 ˆ n ) , 1 log PX n (X ) dX n (X n kX n n PXˆ n (X n )   1 n ˆn ¯ d(θ) , lim sup Pr dX n (X kX ) ≤ θ n n→∞   1 n ˆn dX n (X kX ) ≤ θ d(θ) , lim inf Pr n→∞ n

δ-inf-divergence rate

¯ ≤ δ} ˆ , sup{θ : d(θ) Dδ (XkX)

δ-sup-divergence rate

ˆ , sup{θ : d(θ) ≤ δ} ¯ δ (XkX) D

sup-divergence rate

ˆ , limδ↑1 D ˆ ¯ ¯ δ (XkX) D(Xk X)

inf-divergence rate

ˆ , D0 (XkX) ˆ D(XkX)

Table 2.3: Generalized divergence measures where δ ∈ [0, 1].

5. Iδ (X, Y ; Z) ≥ Iδ (X; Z) for δ ∈ [0, 1]. Proof: Property 1 holds because   1 n Pr − log PX n (X ) < 0 = 0, n     1 1 PX n (X n ) PX n (xn ) n n Pr < −ν = PX n x ∈ X : < −ν log log n PXˆ n (X n ) n PXˆ n (xn ) X PX n (xn ) =  n −nν } xn ∈X n : PX n (xn )γ n PXi (Xi )PYi (Yi ) PXi (Xi )PYi (Yi ) i=1

→ 0, for any γ > 0, where the convergence to zero follows from the Chebyshev’s inequality and the finiteness of the channel alphabets (or more directly, the finiteness of individual variance). Consequently, z¯(θ) = 1 for   n n 1X PXi Wi (Xi , Yi ) 1X EXi Wi log = lim inf I(Xi ; Yi ); θ > lim inf n→∞ n n→∞ n P (X )P (Y ) X i Y i i i i=1 i=1 Pn and z¯(θ) = 0 for θ < lim inf n→∞ (1/n) i=1 I(Xi ; Yi ), which implies n

X ¯ Y¯ ) = Zδ (X; ¯ Y¯ ) = lim inf 1 Z(X; I(Xi ; Yi ) n→∞ n i=1 for any δ ∈ [0, 1). Similarly, we can show that n

X ¯ Y¯ ) = Iδ (X; ¯ Y¯ ) = lim inf 1 I(X; I(Xi ; Yi ) n→∞ n i=1 for any δ ∈ [0, 1). Accordingly, ¯ Y¯ ) = Z(X; ¯ Y¯ ) ≥ Iδ (X; Y ). I(X; 2 7

The claim can be justified as follows. X X PY1 (y1 ) = PX n (xn )PW n (y n |xn ) y2n ∈Y n−1 xn ∈X n

=

X

=

X

X

X

y2n ∈Y n−1

n−1 xn 2 ∈X

PX1 (x1 )PW1 (y1 |x1 )

x1 ∈X

=

X

PX2n |X1 (xn2 |x1 )PW2n (y2n |xn2 )

n−1 y2n ∈Y n−1 xn 2 ∈X

x1 ∈X

X

X

PX1 (x1 )PW1 (y1 |x1 )

PX2n W2n |X1 (xn2 , y2n |x1 )

PX1 (x1 )PW1 (y1 |x1 ).

x1 ∈X

Hence, for a channel with PW n (y n |xn ) = on the respective input marginal.

Qn

i=1

PWi (yi |xi ), the output marginal only depends


2.5

Examples for the computation of general information measures

Let the alphabet be binary X = Y = {0, 1}, and let every output be given by (n)

Yi

(n)

= Xi

(n)

⊕ Zi

where ⊕ represents the modulo-2 addition operation, and Z is an arbitrary binary random process, independent of X. Assume that X is a Bernoulli uniform input, i.e., an i.i.d. random process with equal-probably marginal distribution. Then it can be derived that the resultant Y is also Bernoulli uniform distributed, no matter what distribution Z has. To compute Iε (X; Y ) for ε ∈ [0, 1) ( I1 (X; Y ) = ∞ is known), we use the results of property 3 in Lemma 2.4: ¯ (1−ε) (Y |X), Iε (X; Y ) ≥ H0 (Y ) − H

(2.5.1)

¯ ε+γ (Y ) − H ¯ γ (Y |X). Iε (X; Y ) ≤ H

(2.5.2)

and where 0 ≤ ε < 1, 0 ≤ γ < 1 and ε + γ ≤ 1. Note that the lower bound in (2.5.1) and the upper bound in (2.5.2) are respectively equal to −∞ and ∞ for ε = 0 and ε + γ = 1, which become trivial bounds; hence, we further restrict ε > 0 and ε + γ < 1 respectively for the lower and upper bounds. Thus for ε ∈ [0, 1), Iε (X; Y ) ≤

inf

0≤γ 0, the typeI R´enyi’s mutual information of order α is defined by:  ! i X Xh  1  1−α α min PX (x) log PY |X (y|x)PYˆ (y) , if α 6= 1; I(X; Y ; α) , PYˆ α − 1 x∈X y∈Y    lim I(X; Y ; α) = I(X; Y ), if α = 1, α→1

where the minimization is taken over all PYˆ under fixed PX and PY |X . Definition 2.12 (type-II R´ enyi’s mutual information) For α > 0, the type-II R´enyi’s mutual information of order α is defined by: J(X; Y ; α) , min D(PX,Y kPX × PYˆ ; α) PYˆ   !1/α   X X   α log   , for α 6= 1; PX (x)PYα|X (y|x) α − 1 = y∈Y x∈X    lim J(X; Y ; α) = I(X; Y ), for α = 1. α→1


Some elementary properties of R´enyi’s entropy and divergence are summarized below. Lemma 2.13 For finite alphabet X , the following statements hold. 1. 0 ≤ H(X; α) ≤ log |X |; the first equality holds if, and only if, X is deterministic, and the second equality holds if, and only if, X is uniformly distributed over X . 2. H(X; α) is strictly decreasing in α unless X is uniformly distributed over its support {x ∈ X : PX (x) > 0}. 3. limα↓0 H(X; α) = log |{x ∈ X : PX (x) > 0}|. 4. limα→∞ H(X; α) = − log maxx∈X PX (x). ˆ α) ≥ 0 with equality holds if, and only if, PX = P ˆ . 5. D(XkX; X ˆ α) = ∞ if, and only if, either 6. D(XkX; {x ∈ X : PX (x) > 0 and PXˆ (x) > 0} = ∅ or {x ∈ X : PX (x) > 0} 6⊂ {x ∈ X : PXˆ (x) > 0} for α ≥ 1. ˆ α) = − log P ˆ {x ∈ X : PX (x) > 0}. 7. limα↓0 D(XkX; X 8. If PX (x) > 0 implies PXˆ (x) > 0, then ˆ α) =  lim D(XkX;

α→∞

max

PX (x) log P (x) .

x∈X : PX ˆ (x)>0

ˆ X

9. I(X; Y ; α) ≥ J(X; Y ; α) for 0 < α < 1, and I(X; Y ; α) ≤ J(X; Y ; α) for α > 1. Lemma 2.14 (data processing lemma for type-I R´ enyi’s mutual information) Fix α > 0. If X → Y → Z, then I(X; Y ; α) ≥ I(X; Z; α).
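To make Lemma 2.13 concrete, here is a short sketch of our own. It assumes the standard order-α formulas H(X; α) = (1/(1−α)) log Σ_x P_X(x)^α and D(X‖X̂; α) = (1/(α−1)) log Σ_x P_X(x)^α P_X̂(x)^{1−α}, which are the quantities these properties describe, and spot-checks properties 2–5 numerically.

```python
import numpy as np

def renyi_entropy(p, alpha):
    """Renyi entropy of order alpha (nats); alpha = 1 falls back to Shannon."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    if np.isclose(alpha, 1.0):
        return -np.sum(p * np.log(p))
    return np.log(np.sum(p ** alpha)) / (1.0 - alpha)

def renyi_divergence(p, q, alpha):
    """Renyi divergence D(P||Q) of order alpha (nats); assumes supp(P) ⊆ supp(Q)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    if np.isclose(alpha, 1.0):
        return np.sum(p[mask] * np.log(p[mask] / q[mask]))
    return np.log(np.sum(p[mask] ** alpha * q[mask] ** (1.0 - alpha))) / (alpha - 1.0)

p = np.array([0.5, 0.25, 0.125, 0.125])
q = np.array([0.25, 0.25, 0.25, 0.25])

# Property 2: H(X; alpha) is strictly decreasing in alpha for this non-uniform X.
print([round(renyi_entropy(p, a), 4) for a in (0.25, 0.5, 1.0, 2.0, 8.0)])
# Property 3: alpha -> 0 approaches log |support|.
print(round(renyi_entropy(p, 1e-6), 4), round(np.log(4), 4))
# Property 4: alpha -> infinity approaches -log max_x P_X(x).
print(round(renyi_entropy(p, 200.0), 4), round(-np.log(0.5), 4))
# Property 5: D >= 0, with equality when the two distributions coincide.
print(round(renyi_divergence(p, q, 0.5), 4), renyi_divergence(q, q, 0.5))
```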


Bibliography

[1] P. Billingsley, Probability and Measure, 2nd edition, New York, NY: John Wiley and Sons, 1995.

[2] L. L. Campbell, "A coding theorem and Rényi's entropy," Information and Control, vol. 8, pp. 423–429, 1965.

[3] R. M. Gray, Entropy and Information Theory, New York: Springer-Verlag, 1990.

[4] T. S. Han and S. Verdú, "Approximation theory of output statistics," IEEE Trans. on Information Theory, vol. IT-39, no. 3, pp. 752–772, May 1993.

[5] S. Verdú and T. S. Han, "A general formula for channel capacity," IEEE Trans. on Information Theory, vol. IT-40, no. 4, pp. 1147–1157, Jul. 1994.


Chapter 3 General Lossless Data Compression Theorems In Volume I of the lecture notes, we already know that the entropy rate 1 H(X n ) n→∞ n lim

is the minimum data compression rate (nats per source symbol) for arbitrarily small data compression error for block coding of a stationary-ergodic source. We also mentioned that in more complicated situations where the source becomes non-stationary, the quantity limn→∞ (1/n)H(X n ) may not exist and can no longer be used to characterize the compressibility of the source. This results in the need to establish a new entropy measure which appropriately characterizes the operational limits of arbitrary stochastic systems, which was done in the previous chapter. The role of a source code is to represent the output of a source efficiently. Specifically, a source code design aims to minimize the source description rate of the code subject to a fidelity criterion constraint. One commonly used fidelity criterion is to place an upper bound on the probability of decoding error Pe . If Pe is made arbitrarily small, we obtain a traditional (almost) error-free source coding system1 . Lossy data compression codes are a larger class of codes in the sense that the fidelity criterion used in the coding scheme is a general distortion measure. In this chapter, we only demonstrate the bounded-error data compression theorems for arbitrary (not necessarily stationary, ergodic, information stable, etc.) sources. The general lossy data compression theorems will be introduced in subsequent chapters.

Recall that completely error-free data compression is required only for variable-length codes. A lossless block data compression code only requires that the compression error can be made arbitrarily small, i.e., asymptotically error-free.


3.1

Fixed-length data compression codes for arbitrary sources

Equipped with the general information measures, we herein demonstrate a generalized Asymptotic Equipartition Property (AEP) Theorem and establish expressions for the minimum ε-achievable (fixed-length) coding rate of an arbitrary source X. Here, we have made an implicit assumption in the following derivation, which is the source alphabet X is finite2 . Definition 3.1 (cf. Definition 4.2 and its associated footnote in volume I) An (n, M ) block code for data compression is a set ∼Cn , {c1 , c2 , . . . , cM } consisting of M sourcewords3 of block length n (and a binary-indexing codeword for each sourceword ci ); each sourceword represents a group of source symbols of length n. Definition 3.2 Fix ε ∈ [0, 1]. R is an ε-achievable data compression rate for a source X if there exists a sequence of block data compression codes {∼Cn = (n, Mn )}∞ n=1 with 1 lim sup log Mn ≤ R, n→∞ n and lim sup Pe (∼Cn ) ≤ ε, n→∞ n

where Pe (∼Cn ) , Pr (X ∈ / ∼Cn ) is the probability of decoding error. The infimum of all ε-achievable data compression rate for X is denoted by Tε (X). Lemma 3.3 (Lemma 1.5 in [3]) Fix a positive integer n. There exists an (n, Mn ) source block code ∼Cn for PX n such that its error probability satisfies   1 1 n Pe (∼Cn ) ≤ Pr hX n (X ) > log Mn . n n 2

Actually, the theorems introduced also apply for sources with countable alphabets. We ¯ ε (X) = ∞) that might assume finite alphabets in order to avoid uninteresting cases (such as H arise with countable alphabets. 3

In Definition 4.2 of volume I, the (n, M ) block data compression code is defined by M codewords, where each codeword represents a group of sourcewords of length n. However, we can actually pick up one source symbol from each group, and equivalently define the code using these M representative sourcewords. Later, it will be shown that this viewpoint does facilitate the proving of the general source coding theorem.


Proof: Observe that
$$1 \;\ge \sum_{\{x^n\in\mathcal X^n:\,(1/n)h_{X^n}(x^n)\le(1/n)\log M_n\}} P_{X^n}(x^n) \;\ge \sum_{\{x^n\in\mathcal X^n:\,(1/n)h_{X^n}(x^n)\le(1/n)\log M_n\}} \frac{1}{M_n} \;=\; \left|\left\{x^n\in\mathcal X^n : \frac1n h_{X^n}(x^n) \le \frac1n\log M_n\right\}\right|\cdot\frac{1}{M_n}.$$
Therefore, $|\{x^n\in\mathcal X^n : (1/n)h_{X^n}(x^n) \le (1/n)\log M_n\}| \le M_n$. We can then choose a code
$$\mathcal C_n \supset \left\{x^n\in\mathcal X^n : \frac1n h_{X^n}(x^n) \le \frac1n\log M_n\right\}$$
with $|\mathcal C_n| = M_n$ and
$$P_e(\mathcal C_n) = 1 - P_{X^n}\{\mathcal C_n\} \le \Pr\left[\frac1n h_{X^n}(X^n) > \frac1n\log M_n\right]. \qquad\Box$$

Lemma 3.4 (Lemma 1.6 in [3]) Every $(n, M_n)$ source block code $\mathcal C_n$ for $P_{X^n}$ satisfies
$$P_e(\mathcal C_n) \ge \Pr\left[\frac1n h_{X^n}(X^n) > \frac1n\log M_n + \gamma\right] - \exp\{-n\gamma\},$$
for every $\gamma > 0$.

Proof: It suffices to prove that
$$1 - P_e(\mathcal C_n) = \Pr\{X^n\in\mathcal C_n\} < \Pr\left[\frac1n h_{X^n}(X^n) \le \frac1n\log M_n + \gamma\right] + \exp\{-n\gamma\}.$$


Clearly,
$$\begin{aligned}
\Pr\{X^n\in\mathcal C_n\}
&= \Pr\left[X^n\in\mathcal C_n \text{ and } \tfrac1n h_{X^n}(X^n) \le \tfrac1n\log M_n+\gamma\right] + \Pr\left[X^n\in\mathcal C_n \text{ and } \tfrac1n h_{X^n}(X^n) > \tfrac1n\log M_n+\gamma\right]\\
&\le \Pr\left[\tfrac1n h_{X^n}(X^n) \le \tfrac1n\log M_n+\gamma\right] + \Pr\left[X^n\in\mathcal C_n \text{ and } \tfrac1n h_{X^n}(X^n) > \tfrac1n\log M_n+\gamma\right]\\
&= \Pr\left[\tfrac1n h_{X^n}(X^n) \le \tfrac1n\log M_n+\gamma\right] + \sum_{x^n\in\mathcal C_n} P_{X^n}(x^n)\cdot\mathbf 1\!\left\{\tfrac1n h_{X^n}(x^n) > \tfrac1n\log M_n+\gamma\right\}\\
&= \Pr\left[\tfrac1n h_{X^n}(X^n) \le \tfrac1n\log M_n+\gamma\right] + \sum_{x^n\in\mathcal C_n} P_{X^n}(x^n)\cdot\mathbf 1\!\left\{P_{X^n}(x^n) < \tfrac{1}{M_n}\exp\{-n\gamma\}\right\}\\
&< \Pr\left[\tfrac1n h_{X^n}(X^n) \le \tfrac1n\log M_n+\gamma\right] + |\mathcal C_n|\,\tfrac{1}{M_n}\exp\{-n\gamma\}\\
&= \Pr\left[\tfrac1n h_{X^n}(X^n) \le \tfrac1n\log M_n+\gamma\right] + \exp\{-n\gamma\}. \qquad\Box
\end{aligned}$$
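The two lemmas sandwich the optimal block-code error probability between quantities that depend only on the distribution of the normalized entropy density. The following exhaustive sketch is a toy check of our own (the ternary source, code size and γ are arbitrary choices); it uses the fact that an optimal (n, M) code keeps the M most probable sourcewords.

```python
import itertools
import numpy as np

# Toy check of Lemmas 3.3 and 3.4 for a small i.i.d. ternary source.
p = np.array([0.6, 0.3, 0.1])
n, M, gamma = 8, 64, 0.05

probs = np.array([np.prod([p[s] for s in seq])
                  for seq in itertools.product(range(3), repeat=n)])
h = -np.log(probs) / n                       # normalized entropy density of each x^n

pe_opt = 1.0 - np.sort(probs)[::-1][:M].sum()      # optimal (n, M) code: keep top M

upper = probs[h > np.log(M) / n].sum()                               # Lemma 3.3
lower = probs[h > np.log(M) / n + gamma].sum() - np.exp(-n * gamma)  # Lemma 3.4

print(f"Lemma 3.4 lower bound {lower:.3f} <= optimal Pe {pe_opt:.3f} "
      f"<= Lemma 3.3 upper bound {upper:.3f}")
```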

We now apply Lemmas 3.3 and 3.4 to prove a general source coding theorem for block codes.

Theorem 3.5 (general source coding theorem) For any source $X$,
$$T_\varepsilon(X) = \begin{cases}\displaystyle\lim_{\delta\uparrow(1-\varepsilon)} \bar H_\delta(X), & \text{for } \varepsilon\in[0,1);\\ 0, & \text{for } \varepsilon = 1.\end{cases}$$

Proof: The case of $\varepsilon = 1$ follows directly from the definition; hence, the proof focuses only on the case $\varepsilon\in[0,1)$.

1. Forward part (achievability): $T_\varepsilon(X) \le \lim_{\delta\uparrow(1-\varepsilon)}\bar H_\delta(X)$.

We need to prove the existence of a sequence of block codes $\{\mathcal C_n\}_{n\ge1}$ such that for every $\gamma>0$,
$$\limsup_{n\to\infty}\frac1n\log|\mathcal C_n| \le \lim_{\delta\uparrow(1-\varepsilon)}\bar H_\delta(X)+\gamma$$


and

lim sup Pe (∼Cn ) ≤ ε. n→∞

Lemma 3.3 ensures the existence (for any γ > 0) of a source block code ¯ δ (X) + γ)}e) with error probability ∼Cn = (n, Mn = dexp{n(limδ↑(1−ε) H   1 1 n hX n (X ) > log Mn Pe (∼Cn ) ≤ Pr n n   1 n ¯ hX n (X ) > lim Hδ (X) + γ . ≤ Pr δ↑(1−ε) n Therefore, 

 1 n ¯ δ (X) + γ lim sup Pe (∼Cn ) ≤ lim sup Pr hX n (X ) > lim H δ↑(1−ε) n n→∞ n→∞   1 n ¯ δ (X) + γ hX n (X ) ≤ lim H = 1 − lim inf Pr n→∞ δ↑(1−ε) n ≤ 1 − (1 − ε) = ε, where the last inequality follows from     1 n ¯ δ (X) = sup θ : lim inf Pr hX n (X ) ≤ θ < 1 − ε . lim H n→∞ δ↑(1−ε) n

(3.1.1)

¯ δ (X) 2. Converse part: Tε (X) ≥ limδ↑(1−ε) H ¯ δ (X) > 0. We will prove Assume without loss of generality that limγ↑(1−ε) H ¯ δ (X). Then the converse by contradiction. Suppose that Tε (X) < limδ↑(1−ε) H ¯ (∃ γ > 0) Tε (X) < limδ↑(1−ε) Hδ (X) − 4γ. By definition of Tε (X), there exists a sequence of codes ∼Cn such that   1 ¯ δ (X) − 4γ + γ lim sup log |∼Cn | ≤ lim H (3.1.2) δ↑(1−ε) n→∞ n and lim sup Pe (∼Cn ) ≤ ε.

(3.1.3)

n→∞

(3.1.2) implies that 1 ¯ δ (X) − 2γ log |∼Cn | ≤ lim H δ↑(1−ε) n for all sufficiently large n. Hence, for those n satisfying the above inequality and also by Lemma 3.4,   1 1 n Pe (∼Cn ) ≥ Pr hX n (X ) > log |∼Cn | + γ − e−nγ n n     1 n ¯ ≥ Pr hX n (X ) > lim Hδ (X) − 2γ + γ − e−nγ . δ↑(1−ε) n

32

Therefore, 

 1 n ¯ δ (X) − γ > ε, lim sup Pe (∼Cn ) ≥ 1 − lim inf Pr hX n (X ) ≤ lim H n→∞ δ↑(1−ε) n n→∞ where the last inequality follows from (3.1.1). Thus, a contradiction to (3.1.3) is obtained. 2 A few remarks are made based on the previous theorem. ¯ δ (X) = H(X). ¯ • Note that as ε = 0, limδ↑(1−ε) H Hence, the above theorem generalizes the block source coding theorem in [4], which states that the minimum achievable fixed-length source coding rate of any finite-alphabet ¯ source is H(X). • Consider the special case where −(1/n) log PX n (X n ) converges in probability to a constant H, which holds for all information stable sources4 . In this case, both the inf- and sup-spectrums of X degenerate to a unit step function: ( 1, if θ > H; u(θ) = 0, if θ < H, ¯ ε (X) = H for all ε ∈ [0, 1). where H is the source entropy rate. Thus, H Hence, general source coding theorem reduces to the conventional source coding theorem. • More generally, if −(1/n) log PX n (X n ) converges in probability to a random variable Z whose cumulative distribution function (cdf) is FZ (·), then the minimum achievable data compression rate subject to decoding error being no greater than ε is ¯ δ (X) = sup {R : FZ (R) < 1 − ε} . lim H

δ↑(1−ε)

Therefore, the relationship between the code rate and the ultimate optimal error probability is also clearly defined. We further explore the case in the next example. 4

n  o∞ (n) (n) A source X = X n = X1 , . . . , Xn

is said to be information stable if H(X n ) =

n=1

E [− log PX n (xn )] > 0 for all n, and   − log PX n (xn ) − 1 > ε = 0, lim Pr n→∞ H(X n ) for every ε > 0. By the definition, any stationary-ergodic source with finite n-fold entropy is information stable; hence, it can be viewed a generalized source model for stationary-ergodic sources.

33

Example 3.6 Consider a binary source X with each X n is Bernoulli(Θ) distributed, where Θ is a random variable defined over (0, 1). This is a stationary but non-ergodic source [1]. We can view the source as a mixture of Bernoulli(θ) processes where the parameter θ ∈ Θ = (0, 1), and has distribution PΘ [1, Corollary 1]. Therefore, it can be shown by ergodic decomposition theorem (which states that any stationary source can be viewed as a mixture of stationary-ergodic sources) that −(1/n) log PX n (X n ) converges in probability to a random variable Z = hb (Θ) [1], where hb (x) , −x log2 (x)−(1−x) log2 (1−x) is the binary entropy function. Consequently, the cdf of Z is FZ (z) = Pr{hb (Θ) ≤ z}; and the minimum achievable fixedlength source coding rate with compression error being no larger than ε is sup{R : FZ (R) < 1 − ε} = sup{R : Pr{hb (Θ) ≤ R} < 1 − ε}. • From the above example, or from Theorem 3.5, it shows that the strong converse theorem (which states that codes with rate below entropy rate will ultimately have decompression error approaching one, cf. Theorem 4.6 of Volume I of the lecture notes) does not hold in general. However, one can always claim the weak converse statement for arbitrary sources. Theorem 3.7 (weak converse theorem) For any block code sequence ¯ of ultimate rate R < H(X), the probability of block decoding failure Pe cannot be made arbitrarily small. In other words, there exists ε > 0 such that Pe is lower bounded by ε infinitely often in block length n. The possible behavior of the probability of block decompression error of an arbitrary source is depicted in Figure 3.1. As shown in the Figure, ¯ ¯ 0 (X), where H(X) ¯ there exist two bounds, denoted by H(X) and H is the tight bound for lossless data compression rate. In other words, it is possible to find a sequence of block codes with compression rate larger than ¯ H(X) and the probability of decoding error is asymptotically zero. When ¯ ¯ 0 (X), the minimum the data compression rate lies between H(X) and H probability of decoding error achievable is bounded below by a positive absolute constant in (0, 1) infinitely often in blocklength n. In the case that ¯ 0 (X), the probability of decoding the data compression rate is less than H error of all codes will eventually go to 1 (for n infinitely often). This fact tells the block code designer that all codes with long blocklength are ¯ 0 (X). From the strong bad when data compression rate is smaller than H converse theorem, the two bounds in Figure 3.1 coincide for memoryless sources. In fact, these two bounds coincide even for stationary-ergodic sources.

34

n (i.o.)

n→∞ Pe is lower Pe −−−→0 bounded (i.o. in n) ¯ 0 (X) ¯ H H(X)

Pe −−−→1

R

Figure 3.1: Behavior of the probability of block decoding error as block length n goes to infinity for an arbitrary source X.

We close this section by remarking that the definition that we adopt for the ε-achievable data compression rate is slightly different from, but equivalent to, the one used in [4, Def. 8]. The definition in [4] also brings the same result, which was separately proved by Steinberg and Verd´ u as a direct consequence of Theorem 10(a) (or Corollary 3) in [5]. To be precise, they showed that Tε (X), ¯ v (2ε) (cf. Def. 17 in [5]). By a simple denoted by Te (ε, X) in [5], is equal to R derivation, we obtain: ¯ v (2ε) Te (ε, X) = R     1 n = inf θ : lim sup PX n − log PX n (X ) > θ ≤ ε n n→∞     1 n = inf θ : lim inf PX n − log PX n (X ) ≤ θ ≥ 1 − ε n→∞ n     1 n = sup θ : lim inf PX n − log PX n (X ) ≤ θ < 1 − ε n→∞ n ¯ δ (X). = lim H δ↑(1−ε)

Note that Theorem 10(a) in [5] is a lossless data compression theorem for arbitrary sources, which the authors show as a by-product of their results on finite-precision resolvability theory. Specifically, they proved T0 (X) = S(X) [4, ¯ Thm. 1] and S(X) = H(X) [4, Thm. 3], where S(X) is the resolvability5 of an arbitrary source X. Here, we establish Theorem 3.5 in a different and more direct way.

3.2

Generalized AEP theorem

For discrete memoryless sources, the data compression theorem is proved by choosing the codebook ∼Cn to be the weakly δ-typical set and applying the Asymp5

The resolvability, which is a measure of randomness for random variables, will be introduced in subsequent chapters.

35

totic Equipartition Property (AEP) which states that (1/n)hX n (X n ) converges to H(X) with probability one (and hence in probability). The AEP – which implies that the probability of the typical set is close to one for sufficiently large n – also holds for stationary-ergodic sources. It is however invalid for more general sources – e.g., non-stationary, non-ergodic sources. We herein demonstrate a generalized AEP theorem. Theorem 3.8 (generalized asymptotic equipartition property for arbitrary sources) Fix ε ∈ [0, 1). Given an arbitrary source X, define   1 n n n Tn [R] , x ∈ X : − log PX n (x ) ≤ R . n Then for any δ > 0, the following statements hold. 1.  ¯ ε (X) − δ] ≤ ε lim inf Pr Tn [H

(3.2.1)

 ¯ ε (X) + δ] > ε lim inf Pr Tn [H

(3.2.2)

n→∞

2. n→∞

3. The number of elements in ¯ ε (X) + δ] \ Tn [H ¯ ε (X) − δ], Fn (δ; ε) , Tn [H denoted by |Fn (δ; ε)|, satisfies  ¯ ε (X) + δ) , |Fn (δ; ε)| ≤ exp n(H

(3.2.3)

where the operation A \ B between two sets A and B is defined by A \ B , A ∩ B c with B c denoting the complement set of B. 4. There exists ρ = ρ(δ) > 0 and a subsequence {nj }∞ j=1 such that  ¯ ε (X) − δ) . |Fn (δ; ε)| > ρ · exp nj (H

(3.2.4)

Proof: (3.2.1) and (3.2.2) follow from the definitions. For (3.2.3), we have
$$1 \;\ge \sum_{x^n\in F_n(\delta;\varepsilon)} P_{X^n}(x^n) \;\ge \sum_{x^n\in F_n(\delta;\varepsilon)} \exp\left\{-n(\bar H_\varepsilon(X)+\delta)\right\} \;=\; |F_n(\delta;\varepsilon)|\exp\left\{-n(\bar H_\varepsilon(X)+\delta)\right\}.$$


[Figure: the set $T_n[\bar H_\varepsilon(X)+\delta]$ drawn as a large region containing $T_n[\bar H_\varepsilon(X)-\delta]$, with the ring between them labeled $F_n(\delta;\varepsilon)$.]

Figure 3.2: Illustration of the generalized AEP theorem. $F_n(\delta;\varepsilon) \triangleq T_n[\bar H_\varepsilon(X)+\delta]\setminus T_n[\bar H_\varepsilon(X)-\delta]$ is the dashed region.

It remains to show (3.2.4). (3.2.2) implies that there exist ρ = ρ(δ) > 0 and N1 such that for all n > N1 ,  ¯ ε (X) + δ] > ε + 2ρ. Pr Tn [H Furthermore, (3.2.1) implies that for the previously chosen ρ, there exist a subsequence {n0j }∞ j=1 such that o n ¯ ε (X) − δ] < ε + ρ. Pr Tn0j [H Therefore, for all n0j > N1 , o n ¯ ε (X) − δ] ¯ ε (X) + δ] \ Tn0 [H ρ < Pr Tn0j [H j  ¯ ε (X) − δ] exp −n0j (H ¯ ε (X) − δ) ¯ ε (X) + δ] \ Tn0 [H < Tn0j [H j  0 0 ¯ ε (X) − δ) . = Fnj (δ; ε) exp −nj (H 0 The desired subsequence {nj }∞ j=1 is then defined as n1 is the first nj > N1 , and n2 is the second n0j > N1 , etc. 2 With the illustration depicted in Figure 3.2, we can clearly deduce that Theorem 3.8 is indeed a generalized version of the AEP since:

• The set ¯ (X) + δ] \ Tn [H ¯ ε (X) − δ] Fn (δ; ε) , Tn [H  ε  n n 1 n ¯ ε (X) ≤ δ = x ∈ X : − log PX n (x ) − H n is nothing but the weakly δ-typical set.


• (3.2.1) and (3.2.2) imply that qn , Pr{Fn (δ; ε)} > 0 infinitely often in n. • (3.2.3) and (3.2.4) imply that the number of  sequences in Fn (δ; ε) (the ¯ dashed region) is approximately equal to exp nHε (X) , andthe probabi ¯ ε (X) . lity of each sequence in Fn (δ; ε) can be estimated by qn · exp −nH ¯ ε (X) is indepen• In particular, if X is a stationary-ergodic source, then H ¯ ε (X) = Hε (X) = H for all ε ∈ [0, 1), where H is dent of ε ∈ [0, 1) and, H the source entropy rate H = lim

n→∞

1 E [− log PX n (X n )] . n

¯ ε (X) = Hε (X) for all ε ∈ In this case, (3.2.1)-(3.2.2) and the fact that H [0, 1) imply that the probability of the typical set Fn (δ; ε) is close to one (for n sufficiently large), and (3.2.3) and (3.2.4) imply that there are about enH typical sequences of length n, each with probability about e−nH . Hence we obtain the conventional AEP. • The general source coding theorem can also be proved in terms of the generalized AEP theorem. For details, readers can refer to [2].
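The stationary-ergodic special case is easy to check numerically. The sketch below is our own illustration (the Bernoulli parameter, δ and block lengths are arbitrary choices); it groups the sequences of an i.i.d. binary source by composition and reports the probability and the size of the weakly δ-typical set.

```python
import math
import numpy as np

# Empirical look at the conventional AEP for an i.i.d. Bernoulli(0.3) source,
# grouping sequences by their number of ones (their composition).
p1, delta = 0.3, 0.1
H = -(p1 * np.log(p1) + (1 - p1) * np.log(1 - p1))   # entropy rate, nats/symbol

for n in (20, 30, 40):
    ks = np.arange(n + 1)
    counts = np.array([math.comb(n, int(k)) for k in ks], dtype=float)
    logp = ks * np.log(p1) + (n - ks) * np.log(1 - p1)  # log P_{X^n}(x^n) with k ones
    h = -logp / n                                       # normalized entropy density
    typ = np.abs(h - H) <= delta                        # weakly delta-typical compositions
    prob = float((counts[typ] * np.exp(logp[typ])).sum())
    size = float(counts[typ].sum())
    print(f"n={n}: Pr(F_n)={prob:.3f}, |F_n|={size:.3e}, "
          f"bounds [{np.exp(n*(H-delta)):.3e}, {np.exp(n*(H+delta)):.3e}]")
# Pr(F_n) climbs toward 1 and |F_n| sits between exp{n(H-delta)} and
# exp{n(H+delta)}; since H_eps(X) = H for all eps here, this is exactly the
# picture described by (3.2.1)-(3.2.4).
```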

3.3

3.3.1

Variable-length lossless data compression codes that minimize the exponentially weighted codeword length

Criterion for optimality of codes

In the usual discussion of the source coding theorem, one chooses the criterion to minimize the average codeword length. Implicit in the use of average code length as a criterion of performance is the assumption that cost varies linearly with codeword length. This is not always the case. In some papers, another cost/penalty function of codeword length is introduced which implies that the cost is an exponential function of codeword length. Obviously, linear cost is a limiting case of the exponential cost function. One of the exponential cost function introduced is: ! X 1 PX (x)et·`(cx ) , L(t) , log t x∈X where t is a chosen positive constant, PX is the distribution function of source X, cx is the binary codeword for source symbol x, and `(·) is the length of the


binary codeword. The criterion for optimal code now becomes that a code is said to be optimal if its cost L(t) is the smallest among all possible codes. The physical meaning of the aforementioned cost function is roughly disP cussed below. When t → 0, L(0) = x∈X PX (x)`(cx ) which is the average codeword length. In the case of t → ∞, L(∞) = maxx∈X `(cx ), which is the maximum codeword length for all binary codewords. As you may have noticed, longer codeword length has larger weight in the sum of L(t). P In other words, the minimization of L(t) is equivalent to the minimization of x∈X PX (x)et`(cx ) ; hence, the weight for codeword cx is et`(cx ) . For the minimization operation, it is obvious that events with smaller weight is more preferable since it contributes less in the sum of L(t). Therefore, with the minimization of L(t), it is less likely to have codewords with long code lengths. In practice, system with long codewords usually introduce complexity in encoding and decoding, and hence is somewhat non-feasible. Consequently, the new criterion, to some extent, is more suitable to physical considerations.

3.3.2

Source coding theorem for R´ enyi’s entropy

Theorem 3.9 (source coding theorem for R´ enyi’s entropy) The minimum cost L(t) attainable for uniquely decodable codes6 is the R´enyi’s entropy of order 1/(1 + t), i.e., !   X 1/(1+t) 1+t 1 log PX (x) . = H X; (1 + t) t x∈X Example 3.10 Given a source X. If we want to design an optimal lossless code with maxx∈X `(cx ) being smallest, the cost of the optimal code is H(X; 0) = log |X |.
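A small numerical companion to Theorem 3.9 (our own sketch, stated in bits rather than nats since the codewords are binary): assign each symbol the Shannon length of the escort distribution Q(x) ∝ P_X(x)^{1/(1+t)}. These integer lengths satisfy the Kraft inequality and bring the cost L(t) to within one bit of the Rényi entropy of order 1/(1+t), mirroring the usual Shannon-code argument.

```python
import numpy as np

def renyi_entropy_bits(p, alpha):
    p = np.asarray(p, dtype=float)
    return np.log2(np.sum(p ** alpha)) / (1.0 - alpha)

def exponential_cost_bits(p, lengths, t):
    """L(t) = (1/t) log2 sum_x P(x) 2^{t l(c_x)} for binary codeword lengths l."""
    return np.log2(np.sum(p * 2.0 ** (t * lengths))) / t

p = np.array([0.55, 0.25, 0.12, 0.05, 0.03])

for t in (0.25, 1.0, 4.0):
    alpha = 1.0 / (1.0 + t)
    # Campbell-style lengths: Shannon lengths for the escort distribution
    # Q(x) proportional to P(x)^alpha; they satisfy Kraft by construction.
    q = p ** alpha / np.sum(p ** alpha)
    lengths = np.ceil(-np.log2(q))
    assert np.sum(2.0 ** (-lengths)) <= 1.0 + 1e-12     # Kraft inequality
    L = exponential_cost_bits(p, lengths, t)
    H = renyi_entropy_bits(p, alpha)
    print(f"t={t}: H_1/(1+t) = {H:.3f} <= L(t) = {L:.3f} < {H + 1:.3f}")
# Theorem 3.9 identifies the best attainable L(t) with the Renyi entropy of
# order 1/(1+t); these integer-length codes land within one bit of it.
```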

3.4

Entropy of English

The compression of English text is not only a practical application but also an interesting research topic. One of the main problems in the compression of English text (in principle) is that its statistical model is unclear; therefore, its entropy rate cannot be immediately computed. To estimate the data compression bound of English text, various stochastic approximations to English have been proposed. One can

For its definition, please refer to Section 4.3.1 of Volume I of the lecture notes.


then design a code, according to the estimated stochastic model, to compress English text. It is obvious that the better the stochastic approximation, the better the resulting compression. Assumption 3.11 For data compression, we assume that the English text contains only 26 letters and the space symbol. In other words, an upper case letter is treated the same as its lower case counterpart, and special symbols, such as punctuation, are ignored.

3.4.1

Markov estimate of entropy rate of English text

According to the source coding theorem, the first thing that a data compression code designer shall do is to estimate the (δ-sup) entropy rate of the English text. One can start by modeling the English text as a Markov source and computing the entropy rate according to the estimated Markov statistics.

Zero-order Markov approximation. It has been shown that the frequency of letters in English is far from uniform; for example, the most common letter, E, has Pempirical(E) ≈ 0.13, but the least common letters, Q and Z, have Pempirical(Q) ≈ Pempirical(Z) ≈ 0.001. Therefore, the zero-order Markov approximation apparently does not fit our need in estimating the entropy rate of the English text.

First-order Markov approximation. The frequency of pairs of letters is also far from uniform; the most common pair, TH, has frequency about 0.037. It is fun to know that Q is always followed by U. A higher-order approximation is possible. However, the database may be too large to be handled. For example, a 2nd-order approximation requires 27^3 = 19683 entries, and one may need millions of samples to make an accurate estimate of its probability.

Here are some examples of Markov approximations to English from Shannon's original paper. Note that the sequences of English letters are generated according to the approximated statistics.

1. Zero-order Markov approximation: The symbols are drawn independently with equiprobable distribution. Example: XFOML RXKHRJEFJUJ ZLPWCFWKCYJ EFJEYVKCQSGXYD QPAAMKBZAACIBZLHJQD

40

2. 1st-order Markov approximation: The symbols are drawn independently. Frequency of letters matches the 1st-order Markov approximation of English text. Example: OCRO HLI RGWR NMIELWIS EU LL NBNESEBYA TH EEI ALHENHTTPA OOBTTVA NAH BRL 3. 2nd-order Markov approximation: Frequency of pairs of letters matches English text. Example: ON IE ANTSOUTINYS ARE T INCTORE ST BE S DEAMY ACHIN D ILONASIVE TUCOOWE AT TEASONARE FUSO TIZIN AN DY TOBE SEACE CTISBE 4. 3rd-order Markov approximation: Frequency of triples of letters matches English text. Example: IN NO IST LAT WHEY CRATICT FROURE BERS GROCID PONDENOME OF DEMONSTURES OF THE REPTAGIN IS REGOACTIONA OF CRE 5. 4th-order Markov approximation: Frequency of quadruples of letters matches English text. Example: THE GENERATED JOB PROVIDUAL BETTER TRAND THE DISPLAYED CODE, ABOVERY UPONDULTS WELL THE CODE RST IN THESTICAL IT DO HOCK BOTHE MERG. Another way to simulate the randomness of English text is to use wordapproximation. 1. 1st-order word approximation. Example: REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME CAN DIFFERENT NATURAL HERE HE THE A IN CAME THE TO OF TO EXPERT GRAY COME TO FURNISHES THE LINE MESSAGE HAD BE THESE. 2. 2nd-order word approximation. Example: THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED

41

From the above results, it is obvious that the approximations get closer and closer to resembling English for higher order approximation. Using the above model, we can then compute the empirical entropy rate of English text. order of the letter approximation model zero order 1st order 4th order

entropy rate log2 27 = 4.76 bits per letter 4.03 bits per letter 2.8 bits per letter

One final remark on Markov estimate of English statistics is that the results are not only useful in compression but also helpful in decryption.

3.4.2

Gambling estimate of entropy rate of English text

In this section, we will show that a good gambler is also a good data compressor! A) Sequential gambling Given an observed sequence of letters, x1 , x2 , . . . , xk , a gambler needs to bet on the next letter Xk+1 (which is now a random variable) with all the money in hand. It is not necessary for him to put all the money on the same outcome (there are 27 of them, i.e., 26 letters plus “space”). For example, he is allowed to place part of his money on one possible outcome and the rest of the money on another. The only constraint is that he should bet all his money. Let b(xk+1 |x1 , . . . , xk ) be the ratio of his money, which be bet on the letter xk+1 , and assume that at first, the amount of money that the gambler has is 1. Then X b(xk+1 |x1 , . . . , xk ) = 1, xk+1 ∈X

and (∀ xk+1 ∈ {a, b, . . . , z, SPACE}) b(xk+1 |x1 , . . . , xk ) ≥ 0. When the next letter appears, the gambler will be paid 27 times the bet on the letter.

42

Let Sn be the wealth of the gambler after n bets. Then S1 = 27 · b1 (x1 ) S2 = 27 · [b2 (x2 |x1 ) × S1 ] S3 = 27 · [b3 (x3 |x1 , x2 ) × S2 ] .. . n Y n Sn = 27 · bk (xk |x1 , . . . , xk−1 ) k=1 n

= 27 · b(x1 , . . . , xn ), where n

b(x ) = b(x1 , . . . , xn ) =

n Y

bk (xk |x1 , . . . , xk−1 ).

k=1

We now wish to show that high value of Sn lead to high data compression. Specifically, if a gambler with some gambling policy yields wealth Sn , the data can be saved up to log2 Sn bits. Lemma 3.12 If a proportional gambling policy results in wealth E[log2 Sn ], there exists a data compression code for English-text source X n which yields average codeword length being smaller than log2 (27) −

2 1 E [log2 Sn ] + bits, n n

where X n represents the random variables of the n bet outcomes. Proof: 1. Ordering : Index the English letter as index(a) index(b) index(c)

= = = .. .

index(z) = index(SPACE) =

0 1 2 25 26.

For any two sequences xn and xˆn in X , {a, b, . . . , z, SPACE}, we say xn ≥ xˆn if n X

i−1

index(xi ) × 27

i=1



n X i=1

43

index(ˆ xi ) × 27i−1 .

2. Shannon-Fano-Elias coder : Apply Shannon-Fano-Elias coder (cf. Section 4.3.3 of Volume I of the lecture notes) to the gambling policy b(xn ) according to the ordering defined in step 1. Then the codeword length for xn is (d− log2 (b(xn ))e + 1) bits. 3. Data compression : Now observe that E [log2 Sn ] = E [log2 (27n b(xn ))] = n log2 (27) + E[log2 b(X n )]. Hence, the average codeword length `¯ is upper bounded by ! X 1 − PX n (xn ) log2 b(xn ) + 2 `¯ ≤ n xn ∈X n 1 2 = − E [log2 b(X n )] + n n 2 1 = log2 (27) − E[log2 Sn ] + . n n 2 According to the concept behind the source coding theorem, the entropy rate of the English text should be upper bounded by the average codeword length of any (variable-length) code, which in turns should be bounded above by log2 (27) −

1 2 E[log2 Sn ] + . n n

Equipped with the proportional gambling model, one can find the bound of the entropy rate of English text by a properly designed gambling policy. An experiment using the book, Jefferson the Virginian by Dumas Malone, as the database resulted in an estimate of 1.34 bits per letter for the entropy rate of English.

3.5

Lempel-Ziv code revisited

In Section 4.3.4 of Volume I of the lecture notes, we have introduced the famous Lempel-Ziv coder, and states that the coder is universally good for stationary sources. In this section, we will establish the concept behind it. For simplicity, we assume that the source alphabet is binary, i.e., X = {0, 1}. The optimality of the Lempel-Ziv code can actually be extended to any stationary source with finite alphabet.

44

Lemma 3.13 The number c(n) of distinct strings in the Lempel-Ziv parsing of a binary sequence satisfies √

2n − 1 ≤ c(n) ≤

2n , log2 n

where the upper bound holds for n ≥ 213 , and the lower bound is valid for every n. Proof: The upper bound can be proved as follows. For fixed n, the number of distinct strings is maximized when all the phrases are as short as possible. Hence, in the extreme case, c(n) of them

z }| { n = 1 + 2 + 2 + 3 + 3 + 3 + 3 + · · ·, which implies that k+1 X

j

j2 ≥ n ≥

j=1

k X

j2j ,

(3.5.1)

j=1

where k is the integer satisfying 2k+1 − 1 > c(n) ≥ 2k − 1.

(3.5.2)

Observe that k+1 X

j

k+2

j2 = k2

+ 2 and

k X

j=1

j2j = (k − 1)2k+1 + 2.

j=1

Now from (3.5.1), we obtain for k ≥ 7 (which will be justified later by n ≥ 213 ), n ≤ k2k+2 + 2 < 22(k−1)

and n ≥ (k − 1)2k+1 + 2 ≥ (k − 1)2k+1 .

The proof of the upper bound is then completed by noting that the maximum c(n) for fixed n should satisfy c(n) < 2k+1 − 1 ≤ 2k+1 ≤

n 2n ≤ . k−1 log2 (n)

Again for the lower bound, we note that the number of distinct strings is minimized when all the phrases are as long as possible. Hence, in the extreme case, c(n) of them

c(n) z }| { X c(n)[c(n) + 1] [c(n) + 1]2 n = 1 + 2 + 3 + 4 + ··· ≤ j= ≤ , 2 2 j=1

45

which implies that c(n) ≥



2n − 1.

Note that when n ≥ 213 , c(n) ≥ 27 − 1, which implies that the assumption of k ≥ 7 in (3.5.2) is valid for n ≥ 213 . 2 The condition n ≥ 213 is equivalent to compressing a binary file of size larger than 1K bytes, which, in practice, is a frequently encountered situation. Since what concerns us is the asymptotic optimality of the coder as n goes to infinity, n ≥ 213 certainly becomes insignificant in such consideration. Lemma 3.14 (entropy upper bound by a function of its mean) A nonnegative integer-valued source X with mean µ and entropy H(X) satisfies H(X) ≤ (µ + 1) log2 (µ + 1) − µ log2 µ. Proof: The lemma follows directly from the result that the geometric distribution maximizes the entropy of non-negative integer-valued source with given mean, which is proved as follows. For geometric distribution with mean µ,  z µ 1 , for z = 0, 1, 2, . . . , PZ (z) = 1+µ 1+µ its entropy is H(Z) =

∞ X

−PZ (z) log2 PZ (z)

z=0 ∞ X



1+µ PZ (z) log2 (1 + µ) + z · log2 = µ z=0



1+µ = log2 (1 + µ) + µ log2 µ   ∞ X 1+µ = PX (z) log2 (1 + µ) + z · log2 , µ z=0 where the last equality holds for any non-negative integer-valued source X with mean µ. So, H(X) − H(Z) = =

∞ X x=0 ∞ X

PX (x)[− log2 PX (x) + log2 PZ (x)] PX (x) log2

x=0

PZ (x) PX (x)

= −D(PX kPZ ) ≤ 0,

46

with equality holds if, and only if, X ≡ Z. 2 Before the introduction of the main theorems, we address some notations used in their proofs. Give the source x−(k−1) , . . . , x−1 , x0 , x1 , . . . , xn , and suppose x1 , . . . , xn is Lempel-Ziv-parsed into c distinct strings, y 1 , . . . , y c . Let νi be the location of the first bit of y i , i.e., y i , xνi , . . . , xνi+1 −1 . 1. Define si = xνi −k , . . . , xνi −1 as the k bits preceding y i . 2. Define c`,s be the number of strings in y 1 , . . . , y c with length ` and proceeding state s. 3. Define Qk be the k-th order Markov approximation of the stationary source X, i.e., 0 Qk (x1 , . . . , xn |x0 , . . . , x−(k−1) ) , PX1n |X−(k−1) (xn , . . . , x1 |x0 , . . . , x−(k−1) ), 0 where PX1n |X−(k−1) is the true (stationary) distribution of the source.

For ease of understanding, these notations are graphically illustrated in Figure 3.3. It is easy to verify that n X X

c`,s = c,

and

n X X

`=1 s∈X k

` × c`,s = n.

(3.5.3)

`=1 s∈X k

Lemma 3.15 For any Lempel-Ziv parsing of the source x1 . . . xn , we have log2 Qk (x1 , . . . , xn |s) ≤

n X `=1

47

−c`,s log2 c`,s .

(3.5.4)

s2 s3 s1 z }| { }| { z }| { z x−(k−1) , . . . , x−1 (x1 , . . . , xν2 −(k−1) , . . . , xν2 −1 )(xν2 , . . . , xν3 −(k−1) , . . . , xν3 −1 ) | {z } | {z } y1 y2 s z }|c { . . . (xνc−1 , . . . , xνc −(k−1) , . . . , xνc −1 )(xνc , . . . , xn ) | {z } | {z } yc y c−1 Figure 3.3: Notations used in Lempel-Ziv coder. Proof: By the k-th order Markov property of Qk , log2 Qk (x1 , . . . , xn |s) = log2 Q(y 1 , . . . , y c |x) c Y = log2 PX νi+1 −νi |X 0 1

i=1

=

c X

=

1



(y i |si ) 

X

X

log2 PX1` |X−(k−1) (y i |si ) 0

 n X X `=1 s∈X k



−(k−1)

(y i |si )



`=1 s∈X k

=

−(k−1)

log2 PX νi+1 −νi |X 0

i=1 n X

!

n X

X

`=1 s∈X k n X X `=1 s∈X k

{i : νi+1 −νi =` and si =s}

 c`,s 

 1 c`,s

X {i : νi+1 −νi =` and si =s}

X

c`,s log2

{i : νi+1 −νi =` and si =s}

 c`,s log2

log2 PX1` |X−(k−1) (y i |si ) 0

1 c`,s



 1 P ` 0 (y |si ) (3.5.5) c`,s X1 |X−(k−1) i

 (3.5.6)

where (3.5.5) follows from Jensen’s inequality and the concavity of log2 (·), and (3.5.6) holds since probability sum is no greater than one. 2 Theorem 3.16 Fix a stationary source X. Given any observations x1 , x2 , x3 , . . ., c log2 c lim sup ≤ H(Xk+1 |Xk , . . . , X1 ), n n→∞ for any integer k, where c = c(n) is the number of distinct Lempel-Ziv parsed strings of x1 , x2 , . . . , xn .

48

Proof: Lemma 3.15 gives that log2 Qk (x1 , . . . , xn |s) ≤ =

n X X

−c`,s log2 c`,s

`=1 s∈X k n X X `=1 s∈X n X

= −c

n c`,s X X −c`,s log2 + −c`,s log2 c c k k `=1 s∈X

X c`,s c`,s log2 − c log2 c. c c k

(3.5.7)

`=1 s∈X

Denote by L and S the random variables with distribution PL,S (`, s) =

c`,s , c

for which the sum-to-one property of the distribution is justified by (3.5.3). Also, from (3.5.3), we have E[L] =

n X X `=1 s∈X k



n c`,s = . c c

Therefore, by independent bound for entropy (cf. Theorem 2.19 in Volume I of the lecture notes) and Lemma 3.14, we get H(L, S) ≤ H(L) + H(S) ≤ {(E[L] + 1) log2 (E[L] + 1) − E[L] log2 E[L]} + log2 |X |k h n n   n ni = + 1 log2 + 1 − log2 +k c c c  c    n n n/c + 1 = log2 + 1 + log2 +k c c n/c  n  n c + 1 + log2 1 + + k, = log2 c c n which, together with Lemma 3.13, implies that for n ≥ 213 , n   c c c c H(L, S) ≤ log2 + 1 + log2 1 + + k n n c  n n  2 2 2 n ≤ log2 √ + 1 + log 2 1 + + k. log2 n log2 n log2 n 2n − 1 Finally, we can re-write (3.5.7) as c log2 c 1 c ≤ − log2 Qk (x1 , . . . , xn |x) + H(L, S). n n n

49

n on As a consequence, by taking the expectation value with respect to X−(k−1) both sides of the above inequality, we obtain

lim sup n→∞

 c log2 c 1  ≤ lim sup E − log2 Qk (X1 , . . . , Xn |X−(k−1) , . . . , X0 ) n n→∞ n = H(Xk+1 |Xk , . . . , X1 ). 2

Theorem 3.17 (main result) Let `(x1 , . . . , xn ) be the Lempel-Ziv codeword length of an observatory sequence x1 , . . . , xn , which is drawn from a stationary source X. Then 1 1 lim sup `(x1 , . . . , xn ) ≤ lim H(X n ). n→∞ n n→∞ n Proof: Let c(n) be the number of the parsed distinct strings, then 1 1 `(x1 , . . . , xn ) = c(n) (dlog2 c(n)e + 1) n n 1 c(n) (log2 c(n) + 2) ≤ n c(n) log2 c(n) c(n) = +2 . n n From Lemma 3.13, we have lim sup n→∞

c(n) 2 ≤ lim sup = 0. n n→∞ log2 n

From Theorem 3.16, we have for any integer k, lim sup n→∞

c(n) log2 c(n) ≤ H(Xk+1 |Xk , . . . , X1 ). n

Hence, 1 lim sup `(x1 , . . . , xn ) ≤ H(Xk+1 |Xk , . . . , X1 ) n→∞ n for any integer k. The theorem is completed by applying Theorem 4.10 in Volume I of the lecture notes, which states that for a stationary source X, its entropy rate always exists and is equal to 1 H(X n ) = lim H(Xk+1 |Xk , . . . , X1 ). n→∞ n k→∞ lim

2 We conclude the discussion in the section into the next corollary.

50

Corollary 3.18 The Lempel-Ziv coders asymptotically achieves the entropy rate of any (unknown) stationary source. The Lempel-Ziv code is often used in practice to compress data which cannot be characterized in a simple statistical model, such as English text or computer source code. It is simple to implement, and has an asymptotic rate approaching the entropy rate (if it exists) of the source, which is known to be the lower bound of the lossless data compression code rate. This code can be used without knowledge of the source distribution provided the source is stationary. Some well-known examples of its implementation are the compress program in UNIX and the arc program in DOS, which typically compresses ASCII text files by about a factor of 2.

51

Bibliography [1] F. Alajaji and T. Fuja, “A communication channel modeled on contagion,” IEEE Trans. on Information Theory, vol. IT–40, no. 6, pp. 2035–2041, Nov. 1994. [2] P.-N. Chen and F. Alajaji, “Generalized source coding theorems and hypothesis testing,” Journal of the Chinese Institute of Engineering, vol. 21, no. 3, pp. 283-303, May 1998. [3] T. S. Han, Information-Spectrum Methods in Information Theory, (in Japanese), Baifukan Press, Tokyo, 1998. [4] T. S. Han and S. Verd´ u, “Approximation theory of output statistics,” IEEE Trans. on Information Theory, vol. IT–39, no. 3, pp. 752–772, May 1993. [5] Y. Steinberg and S. Verd´ u, “Simulation of random processes and ratedistortion theory,” IEEE Trans. on Information Theory, vol. IT–42, no. 1, pp. 63–86, Jan. 1996.

52

Chapter 4 Measure of Randomness for Stochastic Processes In the previous chapter, it is shown that the sup-entropy rate is indeed the minimum lossless data compression ratio achievable for block codes. Hence, to find an optimal block code becomes a well-defined mission since for any source with well-formulated statistical model, the sup-entropy rate can be computed and such quantity can be used as a criterion to evaluate the optimality of the designed block code. In the very recent work of Verd´ u and Han [2], they found that, other than the minimum lossless data compression ratio, the sup-entropy rate actually has another operational meaning, which is called resolvability. In this chapter, we will explore the new concept in details.

4.1

Motivation for resolvability : measure of randomness of random variables

In simulations of statistical communication systems, generation of random variables by a computer algorithm is very essential. The computer usually has the access to a basic random experiment (through pre-defined Application Programing Interface), which generates equally likely random values, such as rand( ) that generates a real number uniformly distributed over (0, 1). Conceptually, random variables with complex models are more difficult to generate by computerw than random variables with simple models. Question is how to quantify the “complexity” of generating the random variables by computers. One way to define such “complexity” measurement is: Definition 4.1 The complexity of generating a random variable is defined as the number of random bits that the most efficient algorithm requires in order

53

to generate the random variable by computer that has the access to a equally likely random experiment. To understand the above definition quantitatively, a simple example is demonstrated below. Example 4.2 Consider the generation of the random variable with probability masses PX (−1) = 1/4, PX (0) = 1/2, and PX (1) = 1/4. An algorithm is written as: Flip-a-fair-coin; \\ one random bit If “Head”, then output 0; else { Flip-a-fair-coin; \\ one random bit If “Head”, then output −1; else output 1; } On the average, the above algorithm requires 1.5 coin flips, and in the worstcase, 2 coin flips are necessary. Therefore, the complexity measure can take two fundamental forms: worst-case or average-case over the range of outcomes of the random variables. Note that we did not show in the above example that the algorithm is the most efficient one in the sense of using minimum number of random bits; however, it is indeed an optimal algorithm because it achieves the lower bound of the minimum number of random bits. Later, we will show that such bound for average minimum number of random bits required for generating the random variables is the entropy, which is exactly 1.5 bits in the above example. As for the worse-case bound, a new terminology, resolution, will be introduced. As a result, the above algorithm also achieves the lower bound of the worst-case complexity, which is the resolution of the random variable.

4.2

Notations and definitions regarding to resolvability

Definition 4.3 (M-type) For any positive integer M , a probability distribution P is said to be M -type if   1 2 for all ω ∈ Ω. P (ω) ∈ 0, , , . . . , 1 M M

54

Definition 4.4 (resolution of a random variable) The resolution1 R(X) of a random variable X is the minimum log(M ) such that PX is M -type. If PX is not M -type for any integer M , then R(X) = ∞. As revealed previously, a random source needs to be resolved (meaning, can be generated by a computer algorithm with access to equal-probable random experiments). As anticipated, a random variable with finite resolution is resolvable by computer algorithms. Yet, it is possible that the resolution of a random variable is infinity. A quick example is the random variable X with distribution PX (0) = 1/π and PX (1) = 1 − 1/π. (X does not belong to any M -type for finite M .) In such case, one can alternatively choose another computer-resolvable random variable, which resembles the true source within some acceptable range, to simulate the original one. One criterion that can be used as a measure of resemblance of two random variables is the variational distance. As for the same example in the above e with distribution P e (0) = 1/3 and paragraph, choose a random variable X X e ≈ 0.03, and X e is 3-type, which is computerPXe (1) = 2/3. Then kX − Xk resolvable2 . Definition 4.5 (variational distance) The variational distance (or `1 distance) between tow distributions P and Q defined on common measurable space (Ω, F) is X kP − Qk , |P (ω) − Q(ω)|. ω∈Ω 1

If the base of the logarithmic operation is 2, the resolution is measured in bits; however, if natural logarithm is taken, nats becomes the basic measurement unit of resolution. 2

A program that generates M -type random variable for any M satisfying log2 (M ) being a e is as follows (in C positive integer is straightforward. A program that generates the 3-type X language). even = False; while (1) {Flip-a-fair-coin; if (Head) {if (even==True) { output 0; even=False;} else {output 1; even = True;} } else {if (even==True) even=False; else even=True; } }

55

\\ one random bit

(Note that an alternative way to formulate the variational distance is: X kP − Qk = 2 · sup |P (E) − Q(E)| = 2 [P (x) − Q(x)].  E∈F x∈X : P (x)≥Q(x)

This two definitions are actually equivalent.) Definition 4.6 (ε-achievable resolution) Fix ε ≥ 0. R is an ε-achievable ˜ satisfies resolution for input X if for all γ > 0, there exists X e 0, there exists X satisfies 1 e n ) < R + γ and kX n − X e n k < ε, R(X n for all sufficiently large n. Definition 4.8 (ε-resolvability for X) Fix ε > 0. The ε-resolvability for input X, denoted by Sε (X), is the minimum ε-achievable resolution rate of the same input, i.e., n ˜ and N )(∀ n > N ) Sε (X) , min R : (∀ γ > 0)(∃X  1 n n n e e R(X ) < R + γ and kX − X k < ε . n 3

Note that our definition of resolution rate is different from its original form (cf. Definition 7 in [2] and the statements following Definition 7 of the same paper for its modified Definition for specific input X), which involves an arbitrary channel model W . Readers may treat our definition as a special case of theirs over identity channel.

56

Here, we define Sε (X) using the “minimum” instead of a more general “infimum” operation is simply because Sε (X) indeed belongs to the range of the minimum operation, i.e., n ˜ and N )(∀ n > N ) Sε (X) ∈ R : (∀ γ > 0)(∃X  1 n n n e e R(X ) < R + γ and kX − X k < ε . n Similar convention will be applied throughout the rest of this chapter. Definition 4.9 (resolvability for X) The resolvability for input X, denoted by S(X), is S(X) , lim Sε (X). ε↓0

From the definition of ε-resolvability, it is obvious non-increasing in ε. Hence, the resolvability can also be defined using supremum operation as: S(X) , sup Sε (X). ε>0

The resolvability is pertinent to the worse-case complexity measure for random variables (cf. Example 4.2, and the discussion following it). With the entropy function, the information theorists also define the ε-mean-resolvability and mean-resolvability for input X, which characterize the average-case complexity of random variables. Definition 4.10 (ε-mean-achievable resolution rate) Fix ε ≥ 0. R is an ˜ ε-mean-achievable resolution rate for input X if for all γ > 0, there exists X satisfies 1 e n ) < R + γ and kX n − X e n k < ε, H(X n for all sufficiently large n. Definition 4.11 (ε-mean-resolvability for X) Fix ε > 0. The ε-mean-resolvability for input X, denoted by S¯ε (X), is the minimum ε-mean achievable resolution rate for the same input, i.e., n ˜ and N )(∀ n > N ) S¯ε (X) , min R : (∀ γ > 0)(∃X  1 n n n e ) < R + γ and kX − X e k0

The only difference between resolvability and mean-resolvability is that the former employs resolution function, while the latter replaces it by entropy function. Since entropy is the minimum average codeword length for uniquely decodable codes, an explanation for mean-resolvability is that the new random e can be resolvable through realizing the optimal variable-length code variable X e is approxifor it. You can think of the probability mass of each outcome of X −` mately 2 where ` is the codeword length of the optimal lossless variable-length e Such probability mass can actually be generated by flipping fair code for X. coins ` times, and the average number of fair coin flipping for this outcome is indeed ` × 2−` . As you may expect, the mean-resolvability is shown to be the average complexity of a random variable.

4.3

Operational meanings of resolvability and mean-resolvability

The operational meanings for the resolution and entropy (a new operational meaning for entropy other than the one from source coding theorem) follow the next theorem. Theorem 4.13 For a single random variable X, 1. the worse-case complexity is lower-bounded by its resolution R(X) [2]; 2. the average-case complexity is lower-bounded by its entropy H(X), and is upper-bounded by entropy H(X) plus 2 bits [3]. Next, we reveal the operational meanings for resolvability and mean-resolvability in source coding. We begin with some useful lemmas that are useful in characterizing the resolvability. Lemma 4.14 (bound on variational distance) For every µ > 0,   P (x) >µ . kP − Qk ≤ 2µ + 2 · PX x ∈ X : log Q(x)

58

Proof: kP − Qk X = 2 

[P (x) − Q(x)]

x∈X : P (x)≥Q(x)



X

= 2 

[P (x) − Q(x)]

x∈X : log[P (x)/Q(x)]≥0

 X

 = 2

[P (x) − Q(x)]



x∈X : log[P (x)/Q(x)]>µ



X

+ 

 [P (x) − Q(x)]

x∈X : µ≥log[P (x)/Q(x)]≥0

 X

 ≤ 2 

P (x)

x∈X : log[P (x)/Q(x)]>µ

 

X

+ 

P (x) 1 −



Q(x)   P (x)

x∈X : µ≥log[P (x)/Q(x)]≥0

   P (x) ≤ 2 P x ∈ X : log >µ Q(x) X

+ 

   P (x)  P (x) log  Q(x)

x∈X : µ≥log[P (x)/Q(x)]≥0

(by fundamental inequality)    P (x)  ≤ 2 P x ∈ X : log >µ + Q(x) 

 X

x∈X : µ≥log[P (x)/Q(x)]≥0

   P (x) = 2 P x ∈ X : log >µ Q(x)   P (x) +µ · PX x ∈ X : µ ≥ log ≥ 0} Q(x)     P (x) = 2 P x ∈ X : log >µ +µ . Q(x)

59

 P (x) · µ

2 Lemma 4.15   1 1 n n n n e ) = 1, PXe n x ∈ X : − log PXe n (x ) ≤ R(X n n for every n. e n ), Proof: By definition of R(X e n )} PXe n (xn ) ≥ exp{−R(X for all xn ∈ X n . Hence, for all xn ∈ X n , 1 1 e n ). − log PXe n (xn ) ≤ R(X n n 2

The lemma then holds.

Theorem 4.16 The resolvability for input X is equal to its sup-entropy rate, i.e., ¯ S(X) = H(X). Proof: ¯ 1. S(X) ≥ H(X). ¯ It suffices to show that S(X) < H(X) contradicts to Lemma 4.15. ¯ Suppose S(X) < H(X). Then there exists δ > 0 such that ¯ S(X) + δ < H(X). Let

 D0 ,

 1 n x ∈ X : − log PX n (x ) ≥ S(X) + δ . n n

n

¯ By definition of H(X), lim sup PX n (D0 ) > 0. n→∞

Therefore, there exists α > 0 such that lim sup PX n (D0 ) > α, n→∞

60

which immediately implies PX n (D0 ) > α infinitely often in n. Select 0 < ε < min{α2 , 1} and observe that Sε (X) ≤ S(X), we can choose e n to satisfy X 1 e n ) < S(X) + δ R(X n 2 for sufficiently large n. Define

e nk < ε and kX n − X

D1 , {xn ∈ X n : PX n (xn ) > 0 √ and PX n (xn ) − PXe n (xn ) ≤ ε · PX n (xn ) . Then PX n (D1c ) = PX n {xn ∈ X n : PX n (xn ) = 0 √ or PX n (xn ) − PXe n (xn ) > ε · PX n (xn ) ≤ PX n {xn ∈ X n : PX n (xn ) = 0} √  +PX n xn ∈ X n : PX n (xn ) − PXe n (xn ) > ε · PX n (xn ) √  = PX n xn ∈ X n : PX n (xn ) − PXe n (xn ) > ε · PX n (xn ) X = PX n (xn )  √ n xn ∈X n : PX n (xn ) 0, which holds infinitely often in n; and every xn0 in D1 ∩ D0 satisfies √ PXe n (xn0 ) ≥ (1 − ε)PX n (xn0 ) and 1 1 1 1 √ − log PXe n (xn0 ) ≥ − log PX n (xn0 ) + log n n n 1+ ε 1 1 √ ≥ (S(X) + δ) + log n 1+ ε δ ≥ S(X) + , 2

61

(4.3.1)

√ for n > (2/δ) log(1 + ε). Therefore, for those n that (4.3.1) holds,   1 1 n n n n e ) PXe n x ∈ X : − log PXe n (x ) > R(X n n   δ 1 n n n ≥ PXe n x ∈ X : − log PXe n (x ) > S(X) + n 2 ≥ PXe n (D1 ∩ D0 ) ≥ (1 − ε1/2 )PX n (D1 ∩ D0 ) > 0, which contradicts to the result of Lemma 4.15. ¯ 2. S(X) ≤ H(X). ˜ for arbitrary γ > 0 such that It suffices to show the existence of X e nk = 0 lim kX n − X

n→∞

e n is an M -type distribution with and X    ¯ M = exp n(H(X) + γ) . en = X e n (X n ) be uniformly distributed over a set Let X G , {Uj ∈ X n : j = 1, . . . , M } which drawn randomly (independently) according to PX n . Define   1 µ n n n ¯ D , x ∈ X : − log PX n (x ) > H(X) + γ + . n n For each G chosen, we obtain from Lemma 4.14 that e nk kX n − X  PXe n (xn ) 2µ + 2 · PXe n x ∈ X : log >µ PX n (xn )   1/M n 2µ + 2 · PXe n x ∈ G : log > µ (since PXe n (G c ) = 0) PX n (xn )   1 µ n n ¯ 2µ + PXe n x ∈ G : − log PX n (x ) > H(X) + γ + n n 2µ + PXe n (G ∩ D) 1 2µ + |G ∩ D| . M 

≤ = = = =

n

n

62

Since G is chosen randomly, we can take the expectation values (w.r.t. the random G) of the above inequality to obtain: h i e n k ≤ 2µ + 1 EG [|G ∩ D|] . EG kX n − X M Observe that each Uj is either in D or not in D, and will contribute weight 1/M when it is in D. From the i.i.d. assumption of {Uj }M j=1 , we can then evaluate (1/M )EG [|G ∩ D|] by4

= =

= =

1 EG [|G ∩ D|] M M − 1 M −1 1 PXMn [D] + PX n [D]PX n [Dc ] + · · · + PX n [D]PXMn−1 [Dc ] M M 1 M PXMn [D] + (M − 1)PXMn−1 [D]PX n [Dc ] M  + · · · + PX n [D]PXMn−1 [Dc ] 1 (M PX n [D]) M PX n [D].

Hence, h i e n k ≤ 2µ + lim sup PX n [D] = 2µ, lim sup EG kX n − X n→∞

n→∞

which implies h i e nk = 0 lim sup EG kX n − X

(4.3.2)

n→∞

since µ can be chosen arbitrarily small. (4.3.2) therefore guarantees the ˜ existence of the desired X. 2 Theorem 4.17 For any X, 1 ¯ S(X) = lim sup H(X n ). n→∞ n Proof: 4

Readers may imagine that there are cases: where |G ∩ D| = M , |G ∩ D| = M − 1, . . ., M −1 M |G ∩D| = 1 and |G ∩D| = 0, respectively with drawing probability PX (D)PX n (Dc ), n (D), PX n M −1 c M c n . . ., PX (D)PX n (D ) and PX n (D ), and with expectation quantity (i.e., (1/M )|G ∩ D|) being M/M , (M − 1)/M , . . ., 1/M and 0.

63

¯ 1. S(X) ≤ lim supn→∞ (1/n)H(X n ). It suffices to prove that S¯ε (X) ≤ lim supn→∞ (1/n)H(X n ) for every ε > 0. ˜ such that This is equivalent to show that for all γ > 0, there exists X 1 e n ) < lim sup 1 H(X n ) + γ H(X n n→∞ n and e nk < ε kX n − X ˜ = X, for sufficiently large n. This can be trivially achieved by letting X since for sufficiently many n, 1 1 H(X n ) < lim sup H(X n ) + γ n n→∞ n and kX n − X n k = 0. ¯ 2. S(X) ≥ lim supn→∞ (1/n)H(X n ). ¯ Observe that S(X) ≥ S¯ε (X) for any 0 < ε < 1/2. Then for any γ > 0 e n such that and all sufficiently large n, there exists X 1 e n ) < S(X) ¯ H(X +γ n

(4.3.3)

and e n k < ε. kX n − X e n k ≤ ε ≤ 1/2 implies Using the fact [1, pp. 33] that kX n − X n

e n )| ≤ ε log |X | , |H(X n ) − H(X ε and (4.3.3), we obtain 1 1 1 e n ) < S(X) ¯ H(X n ) − ε log |X | + ε log ε ≤ H(X + γ, n n n which implies that 1 ¯ + γ. lim sup H(X n ) − ε log |X | < S(X) n→∞ n Since ε and γ can be taken arbitrarily small, we have 1 ¯ S(X) ≥ lim sup H(X n ). n→∞ n 2

64

4.4

Resolvability and source coding

In the previous chapter, we have proved that the lossless data compression rate ¯ for block codes is lower bounded by H(X). We also show in Section 4.3 that ¯ H(X) is also the resolvability for source X. We can therefore conclude that resolvability is equal to the minimum lossless data compression rate for block codes. We will justify their equivalence directly in this section. As explained in AEP theorem for memoryless source, the set Fn (δ) contains approximately enH(X) elements, and the probability for source sequences being outside Fn (δ) will eventually goes to 0. Therefore, we can binary-index those source sequences in Fn (δ) by  log enH(X) = nH(X) nats, and encode the source sequences outside Fn (δ) by a unique default binary codeword, which results in an asymptotically zero probability of decoding error. This is indeed the main idea for Shannon’s source coding theorem for block codes. By further exploring the above concept, we found that the key is actually the existence of a set An = {xn1 , xn2 , . . . , xnM } with M ≈ enH(X) and PX n (Acn ) → 0. Thus, if we can find such typical set, the Shannon’s source coding theorem for block codes can actually be generalized to more general sources, such as nonstationary sources. Furthermore, extension of the theorems to codes of some specific types becomes feasible. Definition 4.18 (minimum ε-source compression rate for fixed-length codes) R is the ε-source compression rate for fixed-length codes if there exists n a sequence of sets {An }∞ n=1 with An ⊂ X such that lim sup n→∞

1 log |An | ≤ R and n

lim sup PX n [Acn ] ≤ ε. n→∞

Tε (X) is the minimum of all such rates. Note that the definition of Tε (X) is equivalent to the one in Definition 3.2. Definition 4.19 (minimum source compression rate for fixed-length codes) T (X) represents the minimum source compression rate for fixed-length codes, which is defined as: T (X) , lim Tε (X). ε↓0

65

Definition 4.20 (minimum source compression rate for variable-length codes) R is an achievable source compression rate for variable-length codes if there exists a sequence of error-free prefix codes {∼Cn }∞ n=1 such that 1 lim sup `n ≤ R n→∞ n where `n is the average codeword length of ∼Cn . T¯(X) is the minimum of all such rates. Recall that for a single source, the measure of its uncertainty is entropy. Although the entropy can also be used to characterize the overall uncertainty of a random sequence X, the source coding however concerns more on the “average” entropy of it. So far, we have seen four expressions of “average” entropy: 1 1 X lim sup H(X n ) , lim sup −PX n (xn ) log PX n (xn ); n→∞ n n→∞ n n x ∈X n 1 X 1 H(X n ) , lim inf −PX n (xn ) log PX n (xn ); n→∞ n n→∞ n xn ∈X n     1 n ¯ H(X) , inf β : lim sup PX n − log PX n (X ) > β = 0 ; β∈< n n→∞     1 n H(X) , sup α : lim sup PX n − log PX n (X ) < α = 0 . n n→∞ α∈< lim inf

If lim

n→∞

1 1 1 H(X n ) = lim sup H(X n ) = lim inf H(X n ), n→∞ n n n→∞ n

¯ then limn→∞ (1/n)H(X n ) is named the entropy rate of the source. H(X) and H(X) are called the sup-entropy rate and inf-entropy rate, which were introduced in Section 2.3. ¯ ¯ Next we will prove that T (X) = S(X) = H(X) and T¯(X) = S(X) = n lim supn→∞ (1/n)H(X ) for a source X. The operational characterization of lim inf n→∞ (1/n)H(X n ) and H(X) will be introduced in Chapter 6. Theorem 4.21 (equality of resolvability and minimum source coding rate for fixed-length codes) ¯ T (X) = S(X) = H(X). ¯ Proof: Equality of S(X) and H(X) is already given in Theorem 4.16. Also, ¯ T (X) = H(X) can be obtained from Theorem 3.5 by letting ε = 0. Here, we provide an alternative proof for T (X) = S(X).

66

1. T (X) ≤ S(X). If we can show that, for any ε fixed, Tε (X) ≤ S2ε (X), then the proof is completed. This claim is proved as follows. ˜ • By definition of S2ε (X), we know that for any γ > 0, there exists X and N such that for n > N , 1 e n ) < S2ε (X) + γ and kX n − X e n k < 2ε. R(X n  e n ) < S2ε (X) + γ, • Let An , xn : PXe n (xn ) > 0 . Since (1/n)R(X e n )} < exp{n(S2ε (X) + γ)}. |An | ≤ exp{R(X Therefore, lim sup n→∞

1 log |An | ≤ S2ε (X) + γ. n

• Also, e n k = 2 sup |PX n (E) − P e n (E)| 2ε > kX n − X X E⊂X n

≥ 2|PX n (Acn ) − PXe n (Acn )| = 2PX n (Acn ), (sincePXe n (Acn ) = 0). Hence, lim supn→∞ PX n (Acn ) ≤ ε. • Since S2ε (X) + γ is just one of the rates that satisfy the conditions of the minimum ε-source compression rate, and Tε (X) is the smallest one of such rates, Tε (X) ≤ S2ε (X) + γ for any γ > 0. 2. T (X) ≥ S(X). Similarly, if we can show that, for any ε fixed, Tε (X) ≥ S3ε (X), then the proof is completed. This claim can be proved as follows. • Fix α > 0. By definition of Tε (X), we know that for any γ > 0, there exists N and a sequence of sets {An }∞ n=1 such that for n > N , 1 log |An | < Tε (X) + γ n

67

and PX n (Acn ) < ε + α.

• Choose Mn to satisfy exp{n(Tε (X) + 2γ)} ≤ Mn ≤ exp{n(Tε (X) + 3γ)}. Also select one element xn0 from Acn . Define a new random variable e n as follows: X   0, if xn 6∈ {xn0 } ∪ An ; k(xn ) PXe n (xn ) = , if xn ∈ {xn0 } ∪ An ,  Mn where

 n n (x )e, if xn ∈ An ;  dMn PXX k(xn ) , k(xn ), if xn = xn0 .  Mn − xn ∈An

e n satisfies the next four properties: It can then be easily verified that X e n is Mn -type; (a) X (b) PXe n (xn0 ) ≤ PX n (Acn ) < ε + α, since xn0 ∈ Acn ; (c) for all xn ∈ An , n P e n (xn ) − PX n (xn ) = dMn PX n (x )e − PX n (xn ) ≤ 1 . X Mn Mn

(d) PXe n (An ) + PXe n (xn0 ) = 1. • Consequently, 1 e n ) ≤ Tε (X) + 3γ, R(X n

68

and e nk = kX n − X

X P e n (xn ) − PX n (xn ) + P e n (xn0 ) − PX n (xn0 ) X

X

xn ∈An

X

+

P e n (xn ) − PX n (xn ) X

xn ∈Acn −{xn 0}



X   P e n (xn ) − PX n (xn ) + P e n (xn0 ) + PX n (xn0 ) X

X

X

P e n (xn ) − PX n (xn )

xn ∈An

+

X

xn ∈Acn −{xn 0}



1 + PXe n (xn0 ) + PX n (xn0 ) Mn xn ∈An X + PX n (xn ) X

xn ∈Acn −{xn 0}

=

X |An | + PXe n (xn0 ) + PX n (xn ) Mn xn ∈Ac n

exp{n(Tε (X) + γ)} ≤ + (ε + α) + PX n (Acn ) exp{n(Tε (X) + 2γ)} ≤ e−nγ + (ε + α) + (ε + α) ≤ 3(ε + α), for n ≥ − log(ε + α)/γ. • Since Tε (X) is just one of the rates that satisfy the conditions of 3(ε+ α)-resolvability, and S3(ε+α) (X) is the smallest one of such quantities, S3(ε+α) (X) ≤ Tε (X). The proof is completed by noting that α can be made arbitrarily small. 2 This theorem tells us that the minimum source compression ratio for fixedlength code is the resolvability, which in turn is equal to the sup-entropy rate. Theorem 4.22 (equality of mean-resolvability and minimum source coding rate for variable-length codes) 1 ¯ T¯(X) = S(X) = lim sup H(X n ). n→∞ n ¯ Proof: Equality of S(X) and lim supn→∞ (1/n)H(X n ) is already given in Theorem 4.17.

69

¯ 1. S(X) ≤ T¯(X). Definition 4.20 states that there exists, for all γ > 0 and all sufficiently large n, an error-free variable-length code whose average codeword length `n satisfies 1 `n < T¯(X) + γ. n Moreover, the fundamental source coding lower bound for a uniquely decodable code (cf. Theorem 4.18 of Volume I of the lecture notes) is H(X n ) ≤ `n . ˜ = X, we obtain kX n − X e n k = 0 and Thus, by letting X 1 e n ) = 1 H(X n ) < T¯(X) + γ, H(X n n which concludes that T¯(X) is an ε-achievable mean-resolution rate of X for any ε > 0, i.e., ¯ S(X) = lim S¯ε (X) ≤ T¯(X). ε→0

¯ 2. T¯(X) ≤ S(X). ¯ Observe that S¯ε (X) ≤ S(X) for 0 < ε < 1/2. Hence, by taking γ satisfying 2ε log |X | > γ > ε log |X | and for all sufficiently large n, there exists e n such that X 1 e n ) < S(X) ¯ H(X +γ n and e n k < ε. kX n − X (4.4.1) On the other hand, Theorem 4.22 of Volume I of the lecture notes proves the existence of an error-free prefix code for X n with average codeword length `n satisfies `n ≤ H(X n ) + 1 (bits). e n k ≤ ε ≤ 1/2 implies By the fact [1, pp. 33] that kX n − X e n )| ≤ ε log2 |H(X n ) − H(X

|X |n , ε

and (4.4.1), we obtain 1 1 1 `n ≤ H(X n ) + n n n 1 e n ) + ε log2 |X | − 1 ε log2 ε + 1 ≤ H(X n n n 1 1 ¯ ≤ S(X) + γ + ε log2 |X | − ε log2 ε + n n ¯ ≤ S(X) + 2γ,

70

if n > (1 − ε log2 ε)/(γ − ε log2 |X |). Since γ can be made arbitrarily small, ¯ S(X) is an achievable source compression rate for variable-length codes; and hence, ¯ T¯(X) ≤ S(X). 2 Again, the above theorem tells us that the minimum source compression ratio for variable-length code is the mean-resolvability, and the mean-resolvability is exactly lim supn→∞ (1/n)H(X n ). ¯ Note that lim supn→∞ (1/n)H(X n ) ≤ H(X), which follows straightforwardly by the fact that the mean of the random variable −(1/n) log PX n (X n ) is no greater than its right margin of the support. Also note that for stationaryergodic source, all these quantities are equal, i.e., 1 ¯ ¯ T (X) = S(X) = H(X) = T¯(X) = S(X) = lim sup H(X n ). n→∞ n We end this chapter by computing these quantities for a specific example. Example 4.23 Consider a binary random source X1 , X2 , . . . where {Xi }∞ i=1 are independent random variables with individual distribution PXi (0) = Zi and PXi (1) = 1 − Zi , where {Zi }∞ i=1 are pair-wise independent with common uniform marginal distribution over (0, 1). You may imagine that the source is formed by selecting from infinitely many binary number generators as shown in Figure 4.1. The selecting process Z is independent for each time instance. It can be shown that such source is not stationary. Nevertheless, by means of similar argument as AEP theorem, we can show that: −

log PX (X1 ) + log PX (X2 ) + · · · + log PX (Xn ) → hb (Z) in probability, n

where hb (a) , −a log2 (a) − (1 − a) log2 (1 − a) is the binary entropy function. To compute the ultimate average entropy rate in terms of the random variable hb (Z), it requires that −

log PX (X1 ) + log PX (X2 ) + · · · + log PX (Xn ) → hb (Z) in mean, n

71

Source Generator .. . source Xt1 Z

.. .

?

source Xt2

-

Selector

. . . , X2 , X1 -

.. . source Xt t∈I .. .

Figure 4.1: Source generator: {Xt }t∈I (I = (0, 1)) is an independent random process with PXt (0) = 1 − PXt (1) = t, and is also independent of the selector Z, where Xt is outputted if Z = t. Source generator of each time instance is independent temporally.

which is a stronger result than convergence in probability. With the fundamental properties for convergence, convergence-in-probability implies convergence-inmean provided the sequence of random variables is uniformly integrable, which

72

is true for −(1/n)

Pn

log PX (Xi ): # " n 1 X log PX (Xi ) sup E n n>0 i=1 i=1

n

≤ sup n>0

1X E [|log PX (Xi )|] n i=1

= sup E [|log PX (X)|] , because of i.i.d. of {Xi }ni=1 n>0

= E [|log PX (X)|]    = E E |log PX (X)| Z  Z 1  = E |log PX (X)| Z = z dz 0 Z 1  = z| log(z)| + (1 − z)| log(1 − z)| dz Z0 1 log(2)dz = log(2). ≤ 0

We therefore have:   E − 1 log PX n (X n ) − E[hb (Z)] n   1 ≤ E − log PX n (X n ) − hb (Z) → 0. n Consequently, 1 lim sup H(X n ) = E[hb (Z)] = 0.5 nats or 0.721348 bits. n→∞ n However, it can be shown that the ultimate cumulative distribution function of −(1/n) log PX n (X n ) is P r[hb (Z) ≤ t] for t ∈ [0, log(2)] (cf. Figure 4.2). The sup-entropy rate of X should be log(2) nats or 1 bit (which is the right-margin of the ultimate CDF of −(1/n) log PX n (X n )). Hence, for this unstationary source, the minimum average codeword length for fixed-length codes and variable-length codes are different, which are 0.859912 bit and 1 bit, respectively.

73

1

0

log(2) nats

0

Figure 4.2: The ultimate CDF of −(1/n) log PX n (X n ): P r{hb (Z) ≤ t}.

74

Bibliography [1] I. Csiszar and J. K¨orner, Information Theory: Coding Theorems for Discrete Memoryless Systems, Academic, New York, 1981. [2] T. S. Han and S. Verd´ u, “Approximation theory of output statistics,” IEEE Trans. on Information Theory, vol. IT–39, no. 3, pp. 752–772, May 1993. [3] D. E. Knuth and A. C. Yao, “The complexity of random number generation,” in Proceedings of Symposium on New Directions and Recent Results in Algorithms and Complexity. New York: Academic Press, 1976.

75

Chapter 5 Channel Coding Theorems and Approximations of Output Statistics for Arbitrary Channels The Shannon channel capacity in Volume I of the lecture notes is derived under the assumption that the channel is memoryless. With moderate modification of the proof, this result can be extended to stationary-ergodic channels for which the capacity formula becomes the maximization of the mutual information rate: 1 lim sup I(X n ; Y n ). n→∞ X n n Yet, for more general channels, such as non-stationary or non-ergodic channels, a more general expression for channel capacity needs to be derived.

5.1

General models for channels

The channel transition probability in its most general form is denoted by {W n = PY n |X n }∞ n=1 , which is abbreviated by W for convenience. Similarly, the input and output random processes are respectively denoted by X and Y .

5.2 5.2.1

Variations of capacity formulas for arbitrary channels Preliminaries

Now, similar to the definitions of sup- and inf- entropy rates for sequence of sources, the sup- and inf- (mutual-)information rates are respectively defined

76

by1 ¯ I(X; Y ) , sup{θ : i(θ) < 1} and I(X; Y ) , sup{θ : ¯i(θ) ≤ 0}, where

 i(θ) , lim inf P r n→∞

1 iX n W n (X n ; Y n ) ≤ θ n



is the inf-spectrum of the normalized information density,   1 n n ¯i(θ) , lim sup P r iX n W n (X ; Y ) ≤ θ n n→∞ is the sup-spectrum of the normalized information density, and iX n W n (xn ; y n ) , log

PY n |X n (y n |xn ) PY n (y n )

is the information density. In 1994, Verd´ u and Han [2] have shown that the channel capacity in its most general form is C , sup I(X; Y ). X

In their proof, they showed the achievability part in terms of Feinstein’s Lemma, and provide a new proof for the converse part. In this section, we will not adopt the original proof of Verd´ u and Han in the converse theorem. Instead, we will use a new and tighter bound [1] established by Poor and Verd´ u in 1995. Definition 5.1 (fixed-length data transmission code) An (n, M ) fixedlength data transmission code for channel input alphabet X n and output alphabet Y n consists of 1. M informational messages intended for transmission; 1

In the paper of Verd´ u and Han [2], these two quantities are defined by:     1 n n ¯ I(X; Y ) , inf β : (∀ γ > 0) lim sup PX n W n iX n W n (X ; Y ) > β + γ = 0 β∈< n n→∞

and 



I(X; Y ) , sup α : (∀ γ > 0) lim sup PX n W n α∈
0 and input distribution PX n on X n , there exists an (n, M ) block code for the transition probability PW n = PY n |X n that its average error probability Pe (∼Cn ) satisfies   1 1 n n iX n W n (X ; Y ) < log M + γ + e−nγ . Pe (∼Cn ) < P r n n Proof: Step 1: Notations. Define   1 1 n n n n n n G , (x , y ) ∈ X × Y : iX n W n (x ; y ) ≥ log M + γ . n n Let ν , e−nγ + PX n W n (G c ).

78

The Feinstein’s Lemma obviously holds if ν ≥ 1, because then   1 1 n n Pe (∼Cn ) ≤ 1 ≤ ν , P r iX n W n (X ; Y ) < log M + γ + e−nγ . n n So we assume ν < 1, which immediately results in PX n W n (G c ) < ν < 1, or equivalently, PX n W n (G) > 1 − ν > 0. Therefore, PX n (A , {xn ∈ X n : PY n |X n (Gxn |xn ) > 1 − ν}) > 0, where Gxn , {y n ∈ Y n : (xn , y n ) ∈ G}, because if PX n (A) = 0, (∀ xn with PX n (xn ) > 0) PY n |X n (Gxn |xn ) ≤ 1 − ν X ⇒ PX n (xn )PY n |X n (Gxn |xn ) = PX n W n (G) ≤ 1 − ν. xn ∈X n

Step 2: Encoder. Choose an xn1 in A (Recall that PX n (A) > 0.) Define Γ1 = Gxn1 . (Then PY n |X n (Γ1 |xn1 ) > 1 − ν.) Next choose, if possible, a point xn2 ∈ X n without replacement (i.e., xn2 cannot be identical to xn1 ) for which  PY n |X n Gxn2 − Γ1 xn2 > 1 − ν, and define Γ2 , Gxn2 − Γ1 . Continue in the following way as for codeword i: choose xni to satisfy ! i−1 [ PY n |X n Gxni − Γj xni > 1 − ν, j=1

and define Γi , Gxni −

Si−1

j=1

Γj .

Repeat the above codeword selecting procedure until either M codewords have been selected or all the points in A have been exhausted. Step 3: Decoder. Define the decoding rule as  i, if y n ∈ Γi n φ(y ) = arbitraryy, otherwise.

79

Step 4: Probability of error. For all selected codewords, the error probability given codeword i is transmitted, λe|i , satisfies λe|i ≤ PY n |X n (Γci |xni ) < ν. (Note that (∀ i) PX n |X n (Γi |xni ) ≥ 1 − ν by step 2.) Therefore, if we can show that the above codeword selecting procedures will not terminated before M , then M 1 X Pe (∼Cn ) = λe|i < ν. M i=1 Step 5: Claim. The codeword selecting procedure in step 2 will not terminated before M . proof: We will prove it by contradiction. Suppose the above procedure terminated before M , say at N < M . Define the set N [ F, Γi ∈ Y n . i=1

Consider the probability PX n W n (G) = PX n W n [G ∩ (X n × F)] + PX n W n [G ∩ (X n × F c )].

(5.2.1)

Since for any y n ∈ Gxni , PY n (y n ) ≤

PY n |X n (y n |xni ) , M · enγ

we have PY n (Γi ) ≤ PY n (Gxni ) 1 −nγ ≤ e PY n |X n (Gxni ) M 1 −nγ e . ≤ M So the first term of the right hand side in (5.2.1) can be upper bounded by PX n W n [G ∩ (X n × F)] ≤ PX n W n (X n × F) = PY n (F) N X = PY n (Γi ) i=1

≤ N×

80

1 −nγ N −nγ e = e . M M

As for the second term of the right hand side in (5.2.1), we can upper bound it by X PX n W n [G ∩ (X n × F c )] = PX n (xn )PY n |X n (Gxn ∩ F c |xn ) xn ∈X n

=

X

PX n (xn )PY n |X n

Gxn −

xn ∈X n



X

N [ i=1

! Γi xn

n

PX n (x )(1 − ν) ≤ (1 − ν),

xn ∈X n

where the last step follows since for all xn ∈ X n , ! N [ PY n |X n Gxn − Γi xn ≤ 1 − ν. i=1

(Because otherwise we could find the (N + 1)-th codeword.) Consequently, PX n W n (G) ≤ (N/M )e−nγ + (1 − ν). By definition of G, PX n W n (G) = 1 − ν + e−nγ ≤

N −nγ e + (1 − ν), M

which implies N ≥ M , a contradiction.

2

Lemma 5.4 (Poor-Verd´ u Lemma [1]) Suppose X and Y are random variables, where X taking values on a finite (or coutably infinite) set X = {x1 , x2 , x3 , . . .} . The minimum probability of error Pe in estimating X from Y satisfies  Pe ≥ (1 − α)PX,Y (x, y) ∈ X × Y : PX|Y (x|y) ≤ α for each α ∈ [0, 1]. Proof: It is known that the minimum-error-probability estimate e(y) of X when receiving y is e(y) = arg max PX|Y (x|y). (5.2.2) x∈X

Therefore, the error probability incurred in testing among the values of X is

81

given by 1 − Pe = P r{X = e(Y )}  Z  X = PX|Y (x|y) dPY (y) Y  x : x=e(y)

 Z  = max PX|Y (x|y) dPY (y) x∈X Y  Z  = max fx (y) dPY (y) x∈X Y   = E max fx (Y ) , x∈X

where fx (y) , PX|Y (x|y). Let hj (y) be the j-th element in the re-ordering set of {fx1 (y), fx2 (y), fx3 (y), . . .} according to ascending element values. In other words, h1 (y) ≥ h2 (y) ≥ h3 (y) ≥ · · · and {h1 (y), h2 (y), h3 (y), . . .} = {fx1 (y), fx2 (y), fx3 (y), . . .}. Then 1 − Pe = E[h1 (Y )].

(5.2.3)

For any α ∈ [0, 1], we can write Z PX,Y {(x, y) ∈ X × Y : fx (y) > α} =

PX|Y {x ∈ X : fx (y) > α} dPY (y). Y

Observe that PX|Y {x ∈ X : fx (y) > α} =

X

PX|Y (x|y) · 1[fx (y) > α]

x∈X

= =

X x∈X ∞ X j=1

82

fx (y) · 1[fx (y) > α] hj (y) · 1[hj (y) > α],

where 1(·) is the indicator function2 . Therefore, ∞ X

Z PXY {(x, y) ∈ X × Y : fx (y) > α} = Y

! hj (y) · 1[hj (y) > α] dPY (y)

j=1

Z h1 (y) · 1(h1 (y) > α)dPY (y)

≥ Y

= E[h1 (Y ) · 1(h1 (Y ) > α)]. It remains to relate E[h1 (Y ) · 1(h1 (Y ) > α)] with E[h1 (Y )], which is exactly 1 − Pe . For any α ∈ [0, 1] and any random variable U with P r{0 ≤ U ≤ 1} = 1, U ≤ α + (1 − α) · U · 1(U > α) holds with probability one. This can be easily proved by upper-bounding U in terms of α when 0 ≤ U ≤ α, and α + (1 − α)U , otherwise. Thus E[U ] ≤ α + (1 − α)E[U · 1(U > α)]. By letting U = h1 (Y ), together with (5.2.3), we finally obtain (1 − α)PXY {(x, y) ∈ X × Y : fx (y) > α} ≥ ≥ = =

(1 − α)E[h1 (Y ) · 1[h1 (Y ) > α]] E[h1 (Y )] − α (1 − Pe ) − α (1 − α) − Pe . 2

Corollary 5.5 Every ∼Cn = (n, M ) code satisfies    1 1 −nγ n n Pe (∼Cn ) ≥ 1 − e Pr iX n W n (X ; Y ) ≤ log M − γ n n for every γ > 0, where X n places probability mass 1/M on each codeword, and Pe (∼Cn ) denotes the error probability of the code. Proof: Taking α = e−nγ in Lemma 5.4, and replacing X and Y in Lemma 5.4 2

I.e., if fx (y) > α is true, 1(·) = 1; else, it is zero.

83

by its n-fold counterparts, i.e., X n and Y n , we obtain    Pe (∼Cn ) ≥ 1 − e−nγ PX n W n (xn , y n ) ∈ X n × Y n : PX n |Y n (xn |y n ) ≤ e−nγ    PX n |Y n (xn |y n ) e−nγ n n n n −nγ PX n W n (x , y ) ∈ X × Y : = 1−e ≤ 1/M 1/M   n n  PX n |Y n (x |y ) e−nγ n n n n −nγ PX n W n (x , y ) ∈ X × Y : ≤ = 1−e PX n (xn ) 1/M  −nγ n n n n = 1−e PX n W n [(x , y ) ∈ X × Y :  PX n |Y n (xn |y n ) 1 1 log ≤ log M − γ n PX n (xn ) n    1 1 −nγ n n = 1−e Pr iX n W n (X ; Y ) ≤ log M − γ . n n 2 Definition 5.6 (ε-achievable rate) Fix ε ∈ [0, 1]. R ≥ 0 is an ε-achievable rate if there exists a sequence of ∼Cn = (n, Mn ) channel block codes such that lim inf n→∞

1 log Mn ≥ R n

and lim sup Pe (∼Cn ) ≤ ε. n→∞

Definition 5.7 (ε-capacity Cε ) Fix ε ∈ [0, 1]. The supremum of ε-achievable rates is called the ε-capacity, Cε . It is straightforward for the definition that Cε is non-decreasing in ε, and C1 = log |X |. Definition 5.8 (capacity C) The channel capacity C is defined as the supremum of the rates that are ε-achievable for all ε ∈ [0, 1]. It follows immediately from the definition3 that C = inf 0≤ε≤1 Cε = limε↓0 Cε = C0 and that C is the 3

The proof of C0 = limε↓0 Cε can be proved by contradiction as follows. Suppose C0 + 2γ < limε↓0 Cε for some γ > 0. For any positive integer j, and by definition of C1/j , there exists Nj and a sequence of ∼Cn = (n, Mn ) code such that for n > Nj , eCn = (1/n) log Mn > C1/j − γ > C0 + γ and Pe (∼Cn ) < 2/j. Construct a sequence of codes ∼ j−1 j ˜ n ) as: ∼ eCn = ∼Cn , if max ˜ (n, M i=1 Ni ≤ n < maxi=1 Ni . Then lim supn→∞ (1/n) log Mn ≥ C0 +γ e and lim inf n→∞ Pe (∼Cn ) = 0, which contradicts to the definition of C0 .

84

supremum of all the rates R for which there exists a sequence of ∼Cn = (n, Mn ) channel block codes such that lim inf n→∞

1 log Mn ≥ R, n

and lim sup Pe (∼Cn ) = 0. n→∞

5.2.2

ε-capacity

Theorem 5.9 (ε-capacity) For 0 < ε < 1, the ε-capacity Cε for arbitrary channels satisfies Cε = sup Iε (X; Y ). X

Proof: 1. Cε ≥ supX Iε (X; Y ). Fix input X. It suffices to show the existence of ∼Cn = (n, Mn ) data transmission code with rate Iε (X; Y ) − γ
0. (Because if such code exists, then lim supn→∞ Pe (∼Cn ) ≤ ε and lim inf n→∞ (1/n) log Mn ≥ Iε (X; Y )−γ, which implies Cε ≥ Iε (X; Y ).) From Lemma 5.3, there exists an ∼Cn = (n, Mn ) code whose error probability satisfies   1 γ 1 n n iX n W n (X ; Y ) < log Mn + + e−nγ/4 Pe (∼Cn ) < P r n n 4    1 γ γ n n iX n W n (X ; Y ) < Iε (X; Y ) − + + e−nγ/4 ≤ Pr n 2 4   1 γ n n ≤ Pr iX n W n (X ; Y ) < Iε (X; Y ) − + e−nγ/4 . n 4 Since     1 n n Iε (X; Y ) , sup R : lim sup P r iW n W n (X ; Y ) ≤ R ≤ ε , n n→∞

85

we obtain 

 1 γ n n lim sup P r iX n W n (X ; Y ) < Iε (X; Y ) − ≤ ε. n 4 n→∞ Hence, the proof of the direct part is completed by noting that   1 γ n n iX n W n (X ; Y ) < Iε (X; Y ) − lim sup Pe (∼Cn ) ≤ lim sup P r n 4 n→∞ n→∞ + lim sup e−nγ/4 n→∞

= ε. 2. Cε ≤ supX Iε (X; Y ). Suppose that there exists a sequence of ∼Cn = (n, Mn ) codes with rate strictly larger than supX Iε (X; Y ) and lim supn→∞ Pe (∼Cn ) ≤ ε. Let the ultimate code rate for this code be supX Iε (X; Y ) + 3ρ for some ρ > 0. Then for sufficiently large n, 1 log Mn > sup Iε (X; Y ) + 2ρ. n X Since the above inequality holds for every X, it certainly holds if taking ˆ n which places probability mass 1/Mn on each codeword, i.e., input X 1 ˆ Yˆ ) + 2ρ, log Mn > Iε (X; n

(5.2.4)

ˆ Then from Corolwhere Yˆ is the channel output due to channel input X. lary 5.5, the error probability of the code satisfies    1 1 n n −nρ ˆ ; Yˆ ) ≤ log Mn − ρ Pe (∼Cn ) ≥ 1 − e Pr i ˆ n n (X n X W n    1 −nρ n n ˆ Yˆ ) + ρ , ˆ ; Yˆ ) ≤ Iε (X; ≥ 1−e Pr i ˆ n n (X n X W where the last inequality follows from (5.2.4), which by taking the limsup of both sides, we have   1 n ˆn ˆ ˆ ˆ ε ≥ lim sup Pe (∼Cn ) ≥ lim sup P r i ˆ n n (X ; Y ) ≤ Iε (X; Y ) + ρ > ε, n X W n→∞ n→∞ and a desired contradiction is obtained.

86

2

5.2.3

General Shannon capacity

Theorem 5.10 (general Shannon capacity) The channel capacity C for arbitrary channel satisfies C = sup I(X; Y ). X

Proof: Observe that C = C0 = lim Cε . ε↓0

Hence, from Theorem 5.9, we note that for ε ∈ (0, 1), C ≥ sup lim Iδ (X; Y ) ≥ sup I0 (X; Y ) = sup I(X; Y ). X

δ↑ε

X

X

It remains to show that C ≤ supX I(X; Y ). Suppose that there exists a sequence of ∼Cn = (n, Mn ) codes with rate strictly larger than supX I(X; Y ) and error probability tends to 0 as n → ∞. Let the ultimate code rate for this code be supX I(X; Y ) + 3ρ for some ρ > 0. Then for sufficiently large n, 1 log Mn > sup I(X; Y ) + 2ρ. n X Since the above inequality holds for every X, it certainly holds if taking input ˆ n which places probability mass 1/Mn on each codeword, i.e., X 1 ˆ Yˆ ) + 2ρ, log Mn > I(X; n

(5.2.5)

ˆ Then from Corollary where Yˆ is the channel output due to channel input X. 5.5, the error probability of the code satisfies    1 1 n ˆn −nρ ˆ Pe (∼Cn ) ≥ 1 − e i ˆ n n (X ; Y ) ≤ log Mn − ρ Pr n X W n    1 −nρ n ˆn ˆ ˆ ˆ ≥ 1−e Pr i ˆ n n (X ; Y ) ≤ I(X; Y ) + ρ , (5.2.6) n X W where the last inequality follows from (5.2.5). Since, by assumption, Pe (∼Cn ) ˆ Yˆ ), thereby vanishes as n → ∞, but (5.2.6) cannot vanish by definition of I(X; a desired contradiction is obtained. 2

87

5.2.4

Strong capacity

Definition 5.11 (strong capacity) Define the strong converse capacity (or strong capacity) CSC as the infimum of the rates R such that for all channel block codes ∼Cn = (n, Mn ) with lim inf n→∞

1 log Mn ≥ R, n

we have lim inf Pe (∼Cn ) = 1. n→∞

Theorem 5.12 (general strong capacity) ¯ CSC , sup I(X; Y ). X

5.2.5

Examples

With these general capacity formulas, we can now compute the channel capacity for some of the non-stationary or non-ergodic channels, and analyze their properties. Example 5.13 (Shannon capacity) Let the input and output alphabets be {0, 1}, and let every output Yi be given by: Yi = Xi ⊕ Ni , where “⊕” represents modulo-2 addition operation. Assume the input process X and the noise process N are independent. A general relation between the entropy rate and mutual information rate can be derived from (2.4.2) and (2.4.4) as: ¯ |X) ≤ I(X; Y ) ≤ H(Y ¯ ) − H(Y ¯ |X). H(Y ) − H(Y Since N n is completely determined from Y n under the knowledge of X n , ¯ |X) = H(N ¯ H(Y ). Indeed, this channel is a symmetric channel. Therefore, an uniform input yields ¯ ) = log(2) uniform output (Bernoulli with parameter (1/2)), and H(Y ) = H(Y nats. We thus have ¯ C = log(2) − H(N ) nats. We then compute the channel capacity for the next two cases of noises.

88

Case A) If N is a non-stationary binary independent sequence with P r{Ni = 1} = pi , then by the uniform boundedness (in i) of the variance of random variable − log PNi (Ni ), namely, Var[− log PNi (Ni )] ≤ E[(log PNi (Ni ))2 ]   ≤ sup pi (log pi )2 + (1 − pi )(log(1 − pi ))2 0 0. Therefore, H(N ) = lim supn→∞ (1/n) ni=1 hb (pi ). Consequently, n

1X C = log(2) − lim sup hb (pi ) nats/channel usage. n→∞ n i=1 This result is illustrated in Figures 5.1.

···

H(N )

cluster points

¯ H(N )

-

Figure 5.1: The ultimate CDFs of (1/n) log PN n (N n ).

Case B) If N has the same distribution as the source process in Example 4.23, then H̄(N) = log(2) nats, which yields zero channel capacity.

Example 5.14 (strong capacity) Continue from Example 5.13. To obtain the strong capacity of channel W, we first relate the CDF of the normalized information density to that of the noise process, with respect to the uniform input that maximizes the information rate (in this case, P_{X^n}(x^n) = P_{Y^n}(y^n) = 2^{−n}):

    Pr[ (1/n) log ( P_{Y^n|X^n}(Y^n|X^n) / P_{Y^n}(Y^n) ) ≤ θ ]
      = Pr[ (1/n) log P_{N^n}(N^n) − (1/n) log P_{Y^n}(Y^n) ≤ θ ]
      = Pr[ (1/n) log P_{N^n}(N^n) ≤ θ − log(2) ]
      = Pr[ −(1/n) log P_{N^n}(N^n) ≥ log(2) − θ ]
      = 1 − Pr[ −(1/n) log P_{N^n}(N^n) < log(2) − θ ].                    (5.2.7)

Case A) From (5.2.7), it is immediate that

    C_SC = log(2) − liminf_{n→∞} (1/n) Σ_{i=1}^n hb(pi)   nats/channel usage.

Case B) From (5.2.7) and also from Figure 5.2, C_SC = log(2) nats/channel usage.

In Case B of Example 5.14, we have derived the ultimate CDF of the normalized information density, which is depicted in Figure 5.2. This limiting CDF is called the spectrum of the normalized information density. As indicated in Figure 5.2, the channel capacity is 0 nats/channel usage and the strong capacity is log(2) nats/channel usage; hence, the operational meaning of the two end points is clear. The remaining question is: what is the operational meaning of the function values between 0 and log(2)? The answer follows from Definition 5.7 of a new capacity-related quantity. In practice, it may not be easy to design a block code that transmits information with (asymptotically) no error through a very noisy channel at a rate equal to the channel capacity. However, if we admit some errors in transmission (say, an error probability bounded above by 0.001), we have a better chance of coming up with a feasible block code.

Example 5.15 (ε-capacity) Continue from Case B of Example 5.13. Let the spectrum of the normalized information density be i(θ). Then the ε-capacity of this channel is the (generalized) inverse of i(·), i.e., Cε = i^{−1}(ε).


[Figure 5.2: The ultimate CDF of the normalized information density for Example 5.14, Case B); the CDF rises from 0 to 1 over the range 0 to log(2) nats.]

Note that the Shannon capacity can be written as

    C = lim_{ε↓0} Cε,

and, in general, the strong capacity satisfies

    C_SC ≥ lim_{ε↑1} Cε.

In this example, the latter inequality holds with equality.
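To give the ε-capacity a concrete face, the following sketch (an added illustration; the two-state mixture channel, its parameters, and the Monte Carlo estimate are all assumptions, not the channel of Example 5.14) samples the normalized information density of a binary additive-noise channel whose noise is, for the whole block, either Bernoulli(0.05) or Bernoulli(0.4) with probability 1/2 each, and reads Cε off (approximately) as the ε-quantile of the resulting spectrum.

import numpy as np

rng = np.random.default_rng(0)

def log_bern_prob(bits, p):
    """log probability of a 0/1 string under i.i.d. Bernoulli(p)."""
    k = bits.sum()
    return k * np.log(p) + (bits.size - k) * np.log(1 - p)

def sample_info_density(n, p_good=0.05, p_bad=0.4):
    """One sample of (1/n) log [P_{Y^n|X^n}/P_{Y^n}] for Y_i = X_i xor N_i with
    uniform input and block-mixed noise: the whole block is BSC(p_good) or
    BSC(p_bad), each with probability 1/2."""
    p = p_good if rng.random() < 0.5 else p_bad
    noise = (rng.random(n) < p).astype(int)
    # mixture noise law: P_{N^n} = 0.5*Bern(p_good)^n + 0.5*Bern(p_bad)^n
    log_pn = np.logaddexp(np.log(0.5) + log_bern_prob(noise, p_good),
                          np.log(0.5) + log_bern_prob(noise, p_bad))
    return np.log(2) + log_pn / n      # uniform input => P_{Y^n} = 2^{-n}

n, trials = 2000, 4000
samples = np.array([sample_info_density(n) for _ in range(trials)])

for eps in (0.25, 0.75):
    # epsilon-capacity ~ epsilon-quantile of the information-density spectrum
    print(f"C_{eps} ~ {np.quantile(samples, eps):.4f} nats/channel use")

For this mixture the spectrum has two jumps, so the estimates should sit near log(2) − hb(0.4) for ε < 1/2 and near log(2) − hb(0.05) for ε ≥ 1/2.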

5.3

Structures of good data transmission codes

The channel capacity of a discrete memoryless channel is shown to be

    C ≜ max_{P_X} I(P_X, Q_{Y|X}).

Let P*_X be the optimizer of the above maximization, so that

    C = max_{P_X} I(P_X, Q_{Y|X}) = I(P*_X, Q_{Y|X}).

Here, the performance measure of a code is taken to be the average error probability, namely

    Pe(∼Cn) = (1/M) Σ_{i=1}^{M} Pe(∼Cn | x^n_i),


if the codebook is ∼Cn ≜ {x^n_1, x^n_2, . . . , x^n_M}. By the random coding argument, a deterministic good code with arbitrarily small error probability and rate less than the channel capacity must exist. The question is: what is the relationship between such a good code and the optimizer P*_X? It is widely believed that if the code is good (with rate close to capacity and low error probability), then the output statistics P_{Ỹ^n} induced by equally-likely use of the codewords must approximate the output distribution P_{Y^n} induced by the input distribution achieving the channel capacity. This is reflected in the next theorem.

Theorem 5.16 For any channel W^n = (Y^n|X^n) with finite input alphabet and capacity C that satisfies the strong converse (i.e., C = C_SC), the following statement holds. Fix γ > 0 and a sequence of block codes {∼Cn = (n, Mn)}_{n=1}^∞ with

    (1/n) log Mn ≥ C − γ/2

and vanishing error probability (i.e., the error probability approaches zero as the blocklength n tends to infinity). Then

    (1/n) ||Ỹ^n − Y^n|| ≤ γ   for all sufficiently large n,

where Ỹ^n is the output due to the block code and Y^n is the output due to the input X^n that satisfies I(X^n; Y^n) = max over all input distributions of I(X^n; Y^n). To be specific,

    P_{Ỹ^n}(y^n) = Σ_{x^n ∈ ∼Cn} P_{X̃^n}(x^n) P_{W^n}(y^n|x^n) = Σ_{x^n ∈ ∼Cn} (1/Mn) P_{W^n}(y^n|x^n)

and

    P_{Y^n}(y^n) = Σ_{x^n ∈ X^n} P_{X^n}(x^n) P_{W^n}(y^n|x^n).

Note that the above theorem holds for arbitrary channels, not only for discrete memoryless channels.

One may ask whether a result in the spirit of the above theorem can be proved for the input statistics rather than the output statistics. The answer is negative; hence, the statement that the input statistics of any good code must approximate those that maximize the mutual information is erroneously taken for granted. (We do not, however, rule out the existence of good codes whose input statistics do approximate those that maximize the mutual information.) To see this, simply compare the normalized entropy of X^n with that of X̃^n (which is uniformly distributed over the codewords) for a discrete memoryless channel:

    (1/n) H(X^n) − (1/n) H(X̃^n) = (1/n) H(X^n|Y^n) + (1/n) I(X^n; Y^n) − (1/n) log(Mn)
                                 = H(X|Y) + I(X; Y) − (1/n) log(Mn)
                                 = H(X|Y) + C − (1/n) log(Mn).

A good code with vanishing error probability exists for (1/n) log(Mn) arbitrarily close to C; hence, we can find a good code sequence satisfying

    lim_{n→∞} [ (1/n) H(X^n) − (1/n) H(X̃^n) ] = H(X|Y).

Since H(X|Y) is in general positive (a quick example is the BSC with crossover probability p, which yields H(X|Y) = −p log(p) − (1−p) log(1−p)), the two input distributions do not necessarily resemble each other.
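A small numerical check of the last claim (added here; the crossover value is an arbitrary choice): for a BSC(p) the capacity-achieving input is uniform, so the per-letter entropy gap between the capacity-achieving input and a good code's uniform-over-codewords input is log 2 minus the code rate, which approaches H(X|Y) = hb(p) as the rate approaches capacity.

import math

def hb(p):                        # binary entropy in nats
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

p = 0.11                          # BSC crossover probability (illustrative value)
C = math.log(2) - hb(p)           # capacity in nats/channel use
rate = 0.999 * C                  # a good code operating just below capacity

# per-letter entropy of the capacity-achieving input minus that of the code's
# uniform-over-codewords input:  log 2 - (1/n) log M_n = log 2 - rate
gap = math.log(2) - rate
print("H(X|Y) = hb(p)          =", round(hb(p), 4), "nats")
print("entropy gap at rate ~ C =", round(gap, 4), "nats")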

5.4

5.4.1

Approximations of output statistics: resolvability for channels Motivations

The discussion in the previous section motivates the problem of finding an input distribution that is equally distributed over a subset of the input alphabet and whose induced output statistics are close to the output due to the input that maximizes the mutual information. Since such approximations are usually performed by computers, it is natural to connect the approximation of input and output statistics with the concept of resolvability.

5.4.2

Notations and definitions of resolvability for channels

In a data transmission system as shown in Figure 5.3, suppose that the source, channel and output are respectively denoted by X n , (X1 , . . . , Xn ), W n , (W1 , . . . , Wn ),


and Y n , (Y1 , . . . , Yn ), where Wi has distribution PYi |Xi . To simulate the behavior of the channel, a computer-generated input may be necessary as shown in Figure 5.4. As stated in Chapter 4, such computergenerated input is based on an algorithm formed by a few basic uniform random experiments, which has finite resolution. Our goal is to find a good computere n such that the corresponding output Ye n is very close to the generated input X true output Y n . . . . , X3 , X2 , X1 -

true source

PY n |X n true channel

. . . , Y3 , Y2 , Y1 -

true output

Figure 5.3: The communication system.

[Figure 5.4: The simulated communication system: a computer-generated source emitting . . . , X̃3, X̃2, X̃1 feeds the true channel P_{Y^n|X^n}, which produces the corresponding output . . . , Ỹ3, Ỹ2, Ỹ1.]

Definition 5.17 (ε-resolvability for input X and channel W) Fix ε > 0, and suppose that the (true) input random variable and (true) channel statistics are X and W = (Y|X), respectively. Then the ε-resolvability Sε(X, W) for input X and channel W is defined by

    Sε(X, W) ≜ min{ R : (∀ γ > 0)(∃ X̃ and N)(∀ n > N)  (1/n) R(X̃^n) < R + γ  and  ||Y^n − Ỹ^n|| < ε },

where P_{Ỹ^n} = P_{X̃^n} P_{W^n}. (The definitions of the resolution R(·) and the variational distance ||(·) − (·)|| are given in Definitions 4.4 and 4.5.) Note that if we take the channel W^n to be an identity channel for all n, namely X^n = Y^n and P_{Y^n|X^n}(y^n|x^n) is either 1 or 0, then the ε-resolvability for input X and channel W reduces to the source ε-resolvability for X:

    Sε(X, W_Identity) = Sε(X).


Similar reductions apply to all of the following definitions.

Definition 5.18 (ε-mean-resolvability for input X and channel W) Fix ε > 0, and suppose that the (true) input random variable and (true) channel statistics are X and W, respectively. Then the ε-mean-resolvability S̄ε(X, W) for input X and channel W is defined by

    S̄ε(X, W) ≜ min{ R : (∀ γ > 0)(∃ X̃ and N)(∀ n > N)  (1/n) H(X̃^n) < R + γ  and  ||Y^n − Ỹ^n|| < ε },

where P_{Ỹ^n} = P_{X̃^n} P_{W^n} and P_{Y^n} = P_{X^n} P_{W^n}.

Definition 5.19 (resolvability and mean-resolvability for input X and channel W) The resolvability and mean-resolvability for input X and channel W are defined respectively as

    S(X, W) ≜ sup_{ε>0} Sε(X, W)   and   S̄(X, W) ≜ sup_{ε>0} S̄ε(X, W).

Definition 5.20 (resolvability and mean-resolvability for channel W) The resolvability and mean-resolvability for channel W are defined respectively as

    S(W) ≜ sup_X S(X, W)   and   S̄(W) ≜ sup_X S̄(X, W).

5.4.3

Results on resolvability and mean-resolvability for channels

Theorem 5.21

    S(W) = C_SC = sup_X Ī(X; Y).

It is a reasonable inference that if no computer algorithm can produce the desired output statistics using the specified number of random nats, then all codes at that rate should be bad codes.

Theorem 5.22

    S̄(W) = C = sup_X I(X; Y).


The operational relation between the resolvability (or mean-resolvability) and the capacities of channels is not yet clear; this could be an interesting problem to think about.


Bibliography

[1] H. V. Poor and S. Verdú, “A lower bound on the probability of error in multihypothesis testing,” IEEE Trans. on Information Theory, vol. IT-41, no. 6, pp. 1992–1994, Nov. 1995.

[2] S. Verdú and T. S. Han, “A general formula for channel capacity,” IEEE Trans. on Information Theory, vol. IT-40, no. 4, pp. 1147–1157, Jul. 1994.


Chapter 6 Optimistic Shannon Coding Theorems for Arbitrary Single-User Systems The conventional definitions of the source coding rate and channel capacity require the existence of reliable codes for all sufficiently large blocklengths. Alternatively, if it is required that good codes exist for infinitely many blocklengths, then optimistic definitions of source coding rate and channel capacity are obtained. In this chapter, formulas for the optimistic minimum achievable fixed-length source coding rate and the minimum ε-achievable source coding rate for arbitrary finite-alphabet sources are established. The expressions for the optimistic capacity and the optimistic ε-capacity of arbitrary single-user channels are also provided. The expressions of the optimistic source coding rate and capacity are examined for the class of information stable sources and channels, respectively. Finally, examples for the computation of optimistic capacity are presented.

6.1

Motivations

The conventional definition of the minimum achievable fixed-length source coding rate T (X) (or T0 (X)) for a source X (cf. Definition 3.2) requires the existence of reliable source codes for all sufficiently large blocklengths. Alternatively, if it is required that reliable codes exist for infinitely many blocklengths, a new, more optimistic definition of source coding rate (denoted by T (X)) is obtained [11]. Similarly, the optimistic capacity C¯ is defined by requiring the existence of reliable channel codes for infinitely many blocklengths, as opposed to the definition of the conventional channel capacity C [12, Definition 1]. This concept of optimistic source coding rate and capacity has recently been investigated by Verd´ u et.al for arbitrary (not necessarily stationary, ergodic, information stable, etc.) sources and single-user channels [11, 12]. More specifically, they establish an additional operational characterization for the optimistic


minimum achievable source coding rate (T (X) for source X) by demonstrating that for a given channel, the classical statement of the source-channel separation theorem1 holds for every channel if T (X) = T (X) [11]. In a dual fashion, they also show that for channels with C¯ = C, the classical separation theorem holds for every source. They also conjecture that T (X) and C¯ do not seem to admit a simple expression. In this chapter, we demonstrate that T (X) and C¯ do indeed have a general formula. The key to these results is the application of the generalized supinformation rate introduced in Chapter 2 to the existing proofs by Verd´ u and Han [12] of the direct and converse parts of the conventional coding theorems. We also provide a general expression for the optimistic minimum ε-achievable source coding rate and the optimistic ε-capacity. For the generalized sup/inf-information/entropy rates which will play a key role in proving our optimistic coding theorems, readers may refer to Chapter 2.

6.2

Optimistic source coding theorems

In this section, we provide the optimistic source coding theorems. They are shown based on two new bounds due to Han [7] on the error probability of a source code as a function of its size. Interestingly, these bounds constitute the natural counterparts of the upper bound provided by Feinstein’s Lemma and the Verd´ u-Han lower bound [12] to the error probability of a channel code. Furthermore, we show that for information stable sources, the formula for T (X) reduces to 1 T (X) = lim inf H(X n ). n→∞ n This is in contrast to the expression for T (X), which is known to be 1 T (X) = lim sup H(X n ). n→∞ n The above result leads us to observe that for sources that are both stationary and information stable, the classical separation theorem is valid for every channel. In [11], Vembu et.al characterize the sources for which the classical separation theorem holds for every channel. They demonstrate that for a given source X, 1

By the “classical statement of the source-channel separation theorem,” we mean the following. Given a source X with (conventional) source coding rate T(X) and a channel W with capacity C, the source X can be reliably transmitted over W if T(X) < C; conversely, if T(X) > C, then X cannot be reliably transmitted over W. By reliable transmissibility of the source over the channel, we mean that there exists a sequence of joint source-channel codes such that the decoding error probability vanishes as the blocklength n → ∞ (cf. [11]).


the separation theorem holds for every channel if its optimistic minimum achievable source coding rate (T (X)) coincides with its conventional (or pessimistic) minimum achievable source coding rate (T (X)); i.e., if T (X) = T (X). We herein establish a general formula for T (X). We prove that for any source X, T (X) = limHδ (X). δ↑1

We also provide the general expression for the optimistic minimum ε-achievable source coding rate. We show these results based on two new bounds due to Han (one upper bound and one lower bound) on the error probability of a source code [7, Chapter 1]. The upper bound (Lemma 3.3) consists of the counterpart of Feinstein’s Lemma for channel codes, while the lower bound (Lemma 3.4) consists of the counterpart of the Verd´ u-Han lower bound on the error probability of a channel code ([12, Theorem 4]). As in the case of the channel coding bounds, both source coding bounds (Lemmas 3.3 and 3.4) hold for arbitrary sources and for arbitrary fixed blocklength. Definition 6.1 An (n, M ) fixed-length source code for X n is a collection of M n-tuples ∼Cn = {cn1 , . . . , cnM }. The error probability of the code is Pe (∼Cn ) , P r [X n 6∈ ∼Cn ] . Definition 6.2 (optimistic ε-achievable source coding rate) Fix 0 < ε < 1. R ≥ 0 is an optimistic ε-achievable rate if, for every γ > 0, there exists a sequence of (n, Mn ) fixed-length source codes ∼Cn such that lim sup n→∞

1 log Mn ≤ R n

and lim inf Pe (∼Cn ) ≤ ε. n→∞

The infimum of all ε-achievable source coding rates for source X is denoted by T ε (X). Also define T (X) , sup0 γ = 0 ∀γ > 0, H(X n ) n→∞ where H(X n ) = E[hX n (X n )] is the entropy of X n . Lemma 6.5 Every information source X satisfies T (X) = lim inf n→∞

1 H(X n ). n

Proof: 1. [T (X) ≥ lim inf n→∞ (1/n)H(X n )] Fix ε > 0 arbitrarily small. Using the fact that hX n (X n ) is a non-negative bounded random variable for finite alphabet, we can write the normalized block entropy as   1 1 n n H(X ) = E hX n (X ) n n    1 1 n n = E hX n (X ) 1 0 ≤ hX n (X ) ≤ limHδ (X) + ε δ↑1 n n    1 1 n n +E hX n (X ) 1 hX n (X ) > limHδ (X) + ε .(6.2.1) δ↑1 n n From the definition of limε↑1 Hδ (X), it directly follows that the first term in the right hand side of (6.2.1) is upper bounded by limδ↑1 Hδ (X) + ε, and that the liminf of the second term is zero. Thus T (X) = limHδ (X) ≥ lim inf n→∞

δ↑1

2. [T (X) ≤ lim inf n→∞ (1/n)H(X n )]


1 H(X n ). n

Fix ε > 0. For infinitely many n,   hX n (X n ) −1>ε Pr H(X n )    1 1 n n hX n (X ) > (1 + ε) H(X ) = Pr n n    1 1 n n hX n (X ) > (1 + ε) lim inf H(X ) + ε . ≥ Pr n→∞ n n Since X is information stable, we obtain that    1 1 n n lim inf P r hX n (X ) > (1 + ε) lim inf H(X ) + ε = 0. n→∞ n→∞ n n By the definition of limδ↑1 Hδ (X), the above implies that   1 n T (X) = limHδ (X) ≤ (1 + ε) lim inf H(X ) + ε . n→∞ n δ↑1 The proof is completed by noting that ε can be made arbitrarily small. 2 It is worth pointing out that if the source X is both information stable and stationary, the above Lemma yields 1 H(X n ). n→∞ n

T (X) = T (X) = lim

This implies that given a stationary and information stable source X, the classical separation theorem holds for every channel.

6.3

Optimistic channel coding theorems

In this section, we state without proving the general expressions for the opti¯ of arbitrary singlemistic ε-capacity2 (C¯ε ) and for the optimistic capacity (C) user channels. The proofs of these expressions are straightforward once the right definition (of I¯ε (X; Y )) is made. They employ Feinstein’s Lemma and the PoorVerd´ u bound, and follow the same arguments used in Theorem 5.10 to show the general expressions of the conventional channel capacity C = sup I0 (X; Y ) = sup I(X; Y ), X

X

The authors would like to point out that the expression of C¯ε was also separately obtained in [10, Theorem 7]. 2


and the conventional ε-capacity sup lim Iδ (X; Y ) ≤ Cε ≤ sup Iε (X; Y ). X

δ↑ε

X

We close this section by proving the formula of C¯ for information stable channels. Definition 6.6 (optimistic ε-achievable rate) Fix 0 < ε < 1. R ≥ 0 is an optimistic ε-achievable rate if there exists a sequence of ∼Cn = (n, Mn ) channel block codes such that 1 lim inf log Mn ≥ R n→∞ n and lim inf Pe (∼Cn ) ≤ ε. n→∞

Definition 6.7 (optimistic ε-capacity C¯ε ) Fix 0 < ε < 1. The supremum of optimistic ε-achievable rates is called the optimistic ε-capacity, C¯ε . It is straightforward for the definition that Cε is non-decreasing in ε, and C1 = log |X |. ¯ The optimistic channel capacity C¯ Definition 6.8 (optimistic capacity C) is defined as the supremum of the rates that are ε-achievable for all ε ∈ [0, 1]. It follows immediately from the definition that C = inf 0≤ε≤1 Cε = limε↓0 Cε = C0 and that C is the supremum of all the rates R for which there exists a sequence of ∼Cn = (n, Mn ) channel block codes such that lim inf n→∞

1 log Mn ≥ R, n

and lim inf Pe (∼Cn ) = 0. n→∞

Theorem 6.9 (optimistic ε-capacity formula) Fix 0 < ε < 1. The optimistic ε-capacity C¯ε satisfies sup lim I¯δ (X; Y ) ≤ C¯ε ≤ sup I¯ε (X; Y ). X

δ↑ε

(6.3.1)

X

Note that actually C¯ε = supX I¯ε (X; Y ), except possibly at the points of discontinuities of supX I¯ε (X; Y ) (which are countable). Theorem 6.10 (optimistic capacity formula) The optimistic capacity C¯ satisfies C¯ = sup I¯0 (X; Y ). X


We next investigate the expression of C¯ for information stable channels. The expression for the capacity of information stable channels is already known (cf. for example [11]) C = lim inf Cn . n→∞

¯ We prove a dual formula for C. Definition 6.11 (Information stable channels [6, 8]) A channel W is said to be information stable if there exists an input process X such that 0 < Cn < ∞ for n sufficiently large, and   iX n W n (X n ; Y n ) − 1 > γ = 0 lim sup P r nCn n→∞ for every γ > 0. Lemma 6.12 Every information stable channel W satisfies 1 C¯ = lim sup sup I(X n ; Y n ). n→∞ X n n Proof: 1. [C¯ ≤ lim supn→∞ supX n (1/n)I(X n ; Y n )] By using a similar argument as in the proof of [12, Theorem 8, property h)], we have 1 I¯0 (X; Y ) ≤ lim sup sup I(X n ; Y n ). n→∞ X n n Hence, 1 C¯ = sup I¯0 (X; Y ) ≤ lim sup sup I(X n ; Y n ). n→∞ X n n X 2. [C¯ ≥ lim supn→∞ supX n (1/n)I(X n ; Y n )] ˜ is the input process that makes the channel information stable. Suppose X Fix ε > 0. Then for infinitely many n,   1 n n ˜ PX˜ n W n i ˜ n n (X ; Y ) ≤ (1 − ε)(lim sup Cn − ε) n X W n→∞ " # n n ˜ ;Y ) i ˜ n n (X ≤ PX˜ n W n X W < (1 − ε)Cn n " # ˜ n; Y n) iX˜ n W n (X = PX˜ n W n − 1 < −ε . nCn


Since the channel is information stable, we get that   1 n n ˜ ; Y ) ≤ (1 − ε)(lim sup Cn − ε) = 0. i ˜ n n (X lim inf PX˜ n W n n→∞ n X W n→∞ ¯ the above immediately implies that By the definition of C, ˜ Y ) ≥ (1 − ε)(lim sup Cn − ε). C¯ = sup I¯0 (X; Y ) ≥ I¯0 (X; n→∞

X

Finally, the proof is completed by noting that ε can be made arbitrarily small. 2 Observations: • It is known that for discrete memoryless channels, the optimistic capacity C¯ is equal to the (conventional) capacity C [12, 5]. The same result holds for modulo−q additive noise channels with stationary ergodic noise. However, in general, C¯ ≥ C since I¯0 (X; Y ) ≥ I(X; Y ) [3, 4]. • Remark that Theorem 11 in [11] holds if, and only if, sup I(X; Y ) = sup I¯0 (X; Y ). X

X

Furthermore, note that, if C¯ = C and there exists an input distribution ¯ PXˆ that achieves C, then PXˆ also achieves C.

6.4

Examples

¯ The first two We provide four examples to illustrate the computation of C and C. examples present information stable channels for which C¯ > C. The third example shows an information unstable channel for which C¯ = C. These examples indicate that information stability is neither necessary nor sufficient to ensure that C¯ = C or thereby the validity of the classical source-channel separation theorem. The last example illustrates the situation where 0 < C < C¯ < CSC < log2 |Y|, where CSC is the channel strong capacity3 . We assume in this section that all logarithms are in base 2 so that C and C¯ are measured in bits. 3

The strong (or strong converse) capacity CSC (cf. [2]) is defined as the infimum of the numbers R for which for all (n, Mn ) codes with (1/n) log Mn ≥ R, lim inf n→∞ Pe (∼Cn ) = 1. This definition of CSC implies that for any sequence of (n, Mn ) codes with lim inf n→∞ (1/n) log Mn > CSC , Pe (∼Cn ) > 1 − ε for every ε > 0 and for n sufficiently large. It is shown in [2] that ¯ CSC = limε↑1 C¯ε = supX I(X; Y ).


6.4.1

Information stable channels

Example 6.13 Consider a nonstationary channel W such that at odd time instances n = 1, 3, · · · , W n is the product of the transition distribution of a binary symmetric channel with crossover probability 1/8 (BSC(1/8)), and at even time instances n = 2, 4, 6, · · · , W n is the product of the distribution of a BSC(1/4). It can be easily verified that this channel is information stable. Since the channel is symmetric, a Bernoulli(1/2) input achieves Cn = supX n (1/n)I(X n ; Y n ); thus ( 1 − hb (1/8), for n odd; Cn = 1 − hb (1/4), for n even, where hb (a) , −a log2 a − (1 − a) log2 (1 − a) is the binary entropy function. Therefore, C = lim inf n→∞ Cn = 1 − hb (1/4) and C¯ = lim supn→∞ Cn = 1 − hb (1/8) > C. Example 6.14 Here we use the information stable channel provided in [11, Section III] to show that C¯ > C. Let N be the set of all positive integers. Define the set J as J

, {n ∈ N : 22i+1 ≤ n < 22i+2 , i = 0, 1, 2, . . .} = {2, 3, 8, 9, 10, 11, 12, 13, 14, 15, 32, 33, · · · , 63, 128, 129, · · · , 255, · · · }.

Consider the following nonstationary symmetric channel W . At times n ∈ J , Wn is a BSC(0), whereas at times n 6∈ J , Wn is a BSC(1/2). Put W n = ˆ n. W1 × W2 × · · · × Wn . Here again Cn is achieved by a Bernoulli(1/2) input X We then obtain n J(n) 1 1X ˆ , I(Xi ; Yi ) = [J(n) · 1 + (n − J(n)) · 0] = Cn = n i=1 n n where J(n) , |J ∩ {1, 2, · · · , n}|. It can be shown that  2 2blog2 nc 1  + , for blog2 nc odd; J(n)  1 − 3 × n 3n = blog2 nc 2  n  2×2 − , for blog2 nc even. 3 n 3n Consequently, C = lim inf n→∞ Cn = 1/3 and C¯ = lim supn→∞ Cn = 2/3.
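The oscillation of Cn = J(n)/n is easy to verify directly; the short sketch below (added as a sanity check, not part of the original notes) counts J(n) and prints the ratio at blocklengths n = 2^k − 1, where it is alternately closest to its liminf 1/3 and its limsup 2/3.

def J_count(n):
    """|J intersect {1,...,n}| with J = {m : 2^(2i+1) <= m < 2^(2i+2), i = 0,1,2,...}."""
    total, i = 0, 0
    while 2 ** (2 * i + 1) <= n:
        lo, hi = 2 ** (2 * i + 1), 2 ** (2 * i + 2) - 1
        total += min(hi, n) - lo + 1
        i += 1
    return total

# C_n = J(n)/n oscillates; its liminf (= C) is 1/3 and its limsup (= C-bar) is 2/3.
for k in range(3, 16):
    n = 2 ** k - 1
    print(f"k={k:2d}  n={n:6d}  J(n)/n = {J_count(n)/n:.4f}")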

6.4.2

Information unstable channels

Example 6.15 The Polya-contagion channel: Consider a discrete additive channel with binary input and output alphabet {0, 1} described by Yi = Xi ⊕ Zi ,


i = 1, 2, · · · ,

where Xi, Yi and Zi are respectively the i-th input, output and noise, and ⊕ represents modulo-2 addition. Suppose that the input process is independent of the noise process. Also assume that the noise sequence {Zn}_{n≥1} is drawn according to the Polya contagion urn scheme [1, 9] as follows: an urn originally contains R red balls and B black balls with R < B; successive draws are made from the urn, and after each draw 1 + ∆ balls of the same color as the one just drawn are returned to the urn (∆ > 0). The noise sequence {Zi} corresponds to the outcomes of the draws: Zi = 1 if the i-th ball drawn is red, and Zi = 0 otherwise. Let ρ ≜ R/(R + B) and δ ≜ ∆/(R + B). It is shown in [1] that the noise process {Zi} is stationary and nonergodic; thus the channel is information unstable. From Lemma 2 and Section IV in [4, Part I], we obtain

    1 − H̄_{1−ε}(X) ≤ Cε ≤ 1 − lim_{δ↑(1−ε)} H̄_δ(X)

and

    1 − H_{1−ε}(X) ≤ C̄ε ≤ 1 − lim_{δ↑(1−ε)} H_δ(X).

It has been shown [1] that −(1/n) log P_{X^n}(X^n) converges in distribution to the continuous random variable V ≜ hb(U), where U is beta-distributed with parameters (ρ/δ, (1 − ρ)/δ) and hb(·) is the binary entropy function. Thus

    H̄_{1−ε}(X) = lim_{δ↑(1−ε)} H̄_δ(X) = H_{1−ε}(X) = lim_{δ↑(1−ε)} Hδ(X) = F_V^{−1}(1 − ε),

where F_V(a) ≜ Pr{V ≤ a} is the cumulative distribution function of V and F_V^{−1}(·) is its inverse [1]. Consequently,

    Cε = C̄ε = 1 − F_V^{−1}(1 − ε),

and

    C = C̄ = lim_{ε↓0} [ 1 − F_V^{−1}(1 − ε) ] = 0.
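The quantities in this example are easy to evaluate numerically. The sketch below (an added illustration; the parameter values ρ = 1/3 and δ = 1 are assumptions) draws samples of V = hb(U) with U ~ Beta(ρ/δ, (1 − ρ)/δ) and estimates Cε = 1 − F_V^{-1}(1 − ε) in bits for a few values of ε.

import numpy as np

rng = np.random.default_rng(1)

def hb2(u):
    """Binary entropy in bits, with hb2(0) = hb2(1) = 0."""
    u = np.clip(u, 1e-12, 1 - 1e-12)
    return -u * np.log2(u) - (1 - u) * np.log2(1 - u)

rho, delta = 1/3, 1.0      # assumed urn parameters: rho = R/(R+B), delta = Delta/(R+B)
U = rng.beta(rho / delta, (1 - rho) / delta, size=200_000)
V = hb2(U)                 # -(1/n) log2 of the noise probability converges in law to V

for eps in (0.1, 0.5, 0.9):
    # C_eps = C-bar_eps = 1 - F_V^{-1}(1 - eps)   (bits/channel use)
    c_eps = 1.0 - np.quantile(V, 1.0 - eps)
    print(f"eps = {eps}:  C_eps ~ {c_eps:.3f} bits/channel use")

Since F_V^{-1}(1 − ε) tends to the essential supremum of V (one bit) as ε → 0, the printed values reproduce C = C̄ = 0 in the limit.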

˜ 1, W ˜ 2 , . . . consist of the channel in Example 6.14, and let Example 6.16 Let W ˆ 1, W ˆ 2 , . . . consist of the channel in Example 6.15. Define a new channel W as W follows: ˜ i and W2i−1 = W ˆ i for i = 1, 2, · · · . W2i = W As in the previous examples, the channel is symmetric, and a Bernoulli(1/2) input maximizes the inf/sup information rates. Therefore for a Bernoulli(1/2)


input X, we have   PW n (Y n |X n ) 1 log ≤θ Pr n PY n (Y n )      PWˆ i (Y i |X i ) 1 PW˜ i (Y i |X i )   Pr log + log ≤θ ,   2i PY i (Y i ) PY i (Y i )     if n = 2i;      = i+1 i+1 i i P |X ) P 1 i (Y |X )  i+1 (Y ˜ ˆ W W  log + log ≤θ ,  Pr   2i + 1 PY i (Y i ) PY i+1 (Y i+1 )    if n = 2i + 1;     1 1  i  1 − P r − log PZ i (Z ) < 1 − 2θ + J(i) ,   i i     if n = 2i;      = 1 1 1  i+1  1 − Pr − log PZ i+1 (Z ) < 1 − 2 − θ+ J(i) ,    i+1 i+1 i+1    if n = 2i + 1.  The fact that −(1/i) log[PZ i (Z i )] converges in distribution to the continuous random variable V , hb (U ), where U is beta-distributed with parameters (ρ/δ, (1 − ρ)/δ), and the fact that lim inf n→∞

1 1 J(n) = n 3

and

1 2 lim sup J(n) = 3 n→∞ n

imply that  i(θ) , lim inf P r n→∞

1 PW n (Y n |X n ) log ≤θ n PY n (Y n )



 = 1 − FV

 5 − 2θ , 3

and ¯i(θ) , lim sup P r



n→∞

1 PW n (Y n |X n ) log ≤θ n PY n (Y n )



 = 1 − FV

 4 − 2θ . 3

Consequently, 5 1 C¯ε = − FV−1 (1 − ε) 6 2 Thus 0 ε > 0. R is a ε-achievable data compression rate at distortion D for a source Z if there exists a sequence of data compression codes fn (·) with 1 lim sup log kfn k ≤ R, n→∞ n and   sup θ : λZ,f (Z) (θ) ≤ ε ≤ D.

(7.1.1)

Note that (7.1.1), which can be re-written as:     1 n n inf θ : lim sup P r ρn (Z , fn (Z )) > θ < 1 − ε ≤ D, n n→∞ is equivalent to stating that the limsup of the probability of excessive distortion (i.e., distortion larger than D) is smaller than 1 − ε. The infimum ε-achievable data compression rate at distortion D for Z is denoted by Tε (D, Z).


Theorem 7.6 (general data compression theorem) Fix D > 0 and 1 > ε > 0. Let Z and {ρn (·, ·)}n≥1 be given. Then Rε (D) ≤ Tε (D, Z) ≤ Rε (D − γ), for any γ > 0, where ˆ ¯ Rε (D) , n Z), i o I(Z; h inf PZ|Z : sup θ : λZ,Zˆ (θ) ≤ ε ≤ D ˆ where the infimum is taken over all conditional distributions PZ|Zˆ for which the joint distribution PZ,Zˆ = PZ PZ|Zˆ satisfies the distortion constraint. Proof: 1. Forward part (achievability): Tε (D, Z) ≤ Rε (D−γ) or Tε (D+γ, Z) ≤ Rε (D). Choose γ > 0. We will prove the existence of a sequence of data compression codes with 1 lim sup log kfn k ≤ Rε (D) + 2γ, n→∞ n and   sup θ : λZ,f (Z) (θ) ≤ ε ≤ D + γ. be the distribution achieving Rε (D). step 1: Let PZ|Z ˜ step 2: Let R = Rε (D) + 2γ. Choose M = enR codewords of blocklength n independently according to PZ˜n , where X z n |z n ), PZ n (z n )PZ˜n |Z n (˜ zn) = PZ˜n (˜ z n ∈Z n

and denote the resulting random set by Cn . step 3: For a given Cn , we denote by A(Cn ) the set of sequences z n ∈ Z n such that there exists zˆn ∈ Cn with 1 ρn (z n , zˆn ) ≤ D + γ. n step 4: Claim: lim sup EZ˜n [PZ n (Ac (Cn ))] < ε. n→∞

The proof of this claim will be provided as a lemma following this theorem. Therefore there exists (a sequence of) Cn∗ such that lim sup PZ n (Ac (Cn∗ )) < ε. n→∞


step 5: Define a sequence of codes {fn }n≥1 by ( ρn (z n , z˜n ), if z n ∈ A(Cn∗ ); arg min n ∈C ∗ n z ˜ n fn (z ) = 0, otherwise, where 0 is a fixed default n-tuple in Zˆn . Then

  1 n n n n z ∈ Z : ρn (z , fn (z )) ≤ D + γ ⊃ A(Cn∗ ), n

since (∀ z n ∈ A(Cn∗ )) there exists z˜n ∈ Cn∗ such that (1/n)ρn (z n , z˜n ) ≤ D+γ, which by definition of fn implies that (1/n)ρn (z n , fn (z n )) ≤ D + γ. step 6: Consequently,   1 n n n n λ(Z,f (Z)) (D + γ) = lim inf PZ n z ∈ Z : ρn (z , f (z )) ≤ D + γ n→∞ n ∗ ≥ lim inf PZ n (A(Cn )) n→∞

= 1 − lim sup PZ n (Ac (Cn∗ )) n→∞

> ε. Hence,   sup θ : λZ,f (Z) (θ) ≤ ε ≤ D + γ, where the last step is clearly depicted in Figure 7.1. This proves the forward part. 2. Converse part: Tε (D, Z) ≥ Rε (D). We show that for any sequence of encoders {fn (·)}∞ n=1 , if   sup θ : λZ,f (Z) (θ) ≤ ε ≤ D, then lim sup n→∞

1 log kfn k ≥ Rε (D). n

Let n

n

PZ˜n |Z n (˜ z |z ) ,



1, if z˜n = fn (z n ); 0, otherwise.

Then to evaluate the statistical properties of the random sequence {(1/n)ρn (Z n , fn (Z n )}∞ n=1


[Figure 7.1: λ_{Z,f(Z)}(D + γ) > ε implies sup{θ : λ_{Z,f(Z)}(θ) ≤ ε} ≤ D + γ; the CDF λ_{Z,f(Z)}(θ) crosses the level ε before θ = D + γ.]

under distribution PZ n is equivalent to evaluating those of the random sequence {(1/n)ρn (Z n , Z˜ n )}∞ ˜ n . Therefore, n=1 under distribution PZ n ,Z Rε (D) ,

˜ ¯ inf I(Z; Z) : sup[θ : λ {PZ|Z (θ) ≤ ε] ≤ D} ˜ ˜ Z,Z

˜ ¯ I(Z; Z) ˜ ˜ −H(Z|Z) ¯ Z) H( ˜ ¯ Z) H( 1 ≤ lim sup log kfn k, n→∞ n

≤ ≤ ≤

where the second inequality follows from (2.4.3) (with γ ↑ 1 and δ = 0), and the ˜ third inequality follows from the fact that H(Z|Z) ≥ 0. 2 Lemma 7.7 (cf. Proof of Theorem 7.6) lim sup EZ˜n [PZ n (Ac (Cn∗ ))] < ε. n→∞

Proof: step 1: Let  D(ε) , sup θ : λZ,Z˜ (θ) ≤ ε .


Define A(ε) n,γ

Since

 ,

1 ρn (z n , z˜n ) ≤ D(ε) + γ n  1 n n ˜ +γ . ¯ Z) and iZ n ,Z˜n (z , z˜ ) ≤ I(Z; n

(z n , z˜n ) :

   1 n ˜n (ε) lim inf P r D , ρn (Z , Z ) ≤ D + γ > ε, n→∞ n

and 



lim inf P r E , n→∞

1 ˜ +γ ¯ i n ˜n (Z n ; Z˜ n ) ≤ I(Z; Z) n Z ,Z

 = 1,

we have lim inf P r(A(ε) n,γ ) = lim inf P r(D ∩ E) n→∞

n→∞

≥ lim inf P r(D) + lim inf P r(E) − 1 n→∞

n→∞

> ε + 1 − 1 = ε. (ε)

step 2: Let K(z n , z˜n ) be the indicator function of An,γ :  (ε) 1, if (z n , z˜n ) ∈ An,γ ; n n K(z , z˜ ) = 0, otherwise. step 3: By following a similar argument in step 4 of the achievability part of Theorem 8.3 of Volume I, we obtain X X EZ˜n [PZ n (Ac (Cn∗ ))] = PZ˜n (Cn∗ ) PZ n (z n ) ∗ Cn

=

∗) z n 6∈A(Cn

X

X

PZ n (z n )

z n ∈Z n

PZ˜n (Cn∗ )

∗ :z n 6∈A(C ∗ ) Cn n

M

 =

X

PZ n (z n ) 1 −

z n ∈Z n



PZ˜n (y n )K(z n , z˜n )

z˜n ∈Zˆn



X

X ¯

˜

PZ n (z n ) 1 − e−n(I(Z;Z)+γ)

z n ∈Z n

M ×

X

PZ˜n |Z n (˜ z n |z n )K(z n , z˜n )

z˜n ∈Zˆn

≤ 1−

X X

PZ n (z n )PZ˜n |Z n (z n , z˜n )K(z n , z˜n )

z n ∈Z n z˜n ∈Zˆn  n(R−Rε (D)−γ)

+ exp −e


.

Therefore, lim sup EZ˜n [PZ n (An (Cn∗ ))] ≤ 1 − lim inf P r(A(ε) n,γ ) n→∞

n→∞

< 1 − ε. 2 For the probability-of-error distortion measure ρn : Z n → Z n , namely,  n, if z n 6= z˜n ; n n ρn (z , z˜ ) = 0, otherwise, we define a data compression code fn : Z n → Z n based on a chosen (asymptotic) lossless fixed-length data compression code book ∼Cn ⊂ Z n : ( z n , if z n ∈ ∼Cn ; fn (z n ) = 0, if z n 6∈ ∼Cn , where 0 is some default element in Z n . Then (1/n)ρn (z n , fn (z n )) is either 1 or 0 which results in a cumulative distribution function as shown in Figure 7.2. Consequently, for any δ ∈ [0, 1),   1 n n ρn (Z , fn (Z )) ≤ δ = P r {Z n = fn (Z n )} . Pr n 1s

[Figure 7.2: The CDF of (1/n)ρn(Z^n, fn(Z^n)) for the probability-of-error distortion measure: it jumps to Pr{Z^n = fn(Z^n)} at distortion 0 and to 1 at distortion 1.]


By comparing the (asymptotic) lossless and lossy fixed-length compression theorems under the probability-of-error distortion measure, we observe that ˆ ¯ inf I(Z; Z) {PZ|Z (θ) ≤ ε] ≤ δ} : sup[θ : λ ˆ ˆ Z,Z   δ ≥ 1; 0, ˆ ¯ = inf I(Z; Z), δ < 1,  n n ˆ {PZ|Z : lim inf P r{Z = Z } > ε} ˆ n→∞  δ ≥ 1;  0, ˆ ¯ inf I(Z; Z), δ < 1, = n n  ˆ {PZ|Z : lim sup P r{Z 6= Z } ≤ 1 − ε} ˆ

Rε (δ) =

n→∞

where λZ,Zˆ (θ) =

n o ( n n ˆ lim inf P r Z = Z , 0 ≤ θ < 1; n→∞

θ ≥ 1.

1,

In particular, in the extreme case where ε goes to one, ¯ H(Z) = {PZ|Z ˆ

ˆ ¯ I(Z; Z). inf n n ˆ : lim sup P r{Z 6= Z } = 0} n→∞

Therefore, in this case, the data compression theorem reduces (as expected) to the asymptotic lossless fixed-length data compression theorem.


Chapter 8 Hypothesis Testing 8.1

Error exponent and divergence

Divergence can be adopted as a measure of how similar two distributions are. Operationally, it is the exponent of the type-II error probability in a hypothesis testing system with fixed test level. To be rigorous, we first define the exponent.

Definition 8.1 (exponent) A real number a is said to be the exponent for a sequence of non-negative quantities {an}_{n≥1} converging to zero if

    a = lim_{n→∞} [ −(1/n) log an ].

Operationally, the exponent indicates the exponential rate of convergence of the sequence an: for any γ > 0,

    e^{−n(a+γ)} ≤ an ≤ e^{−n(a−γ)}

for n large enough.

Recall that, in proving the channel coding theorem, the probability of decoding error of channel block codes can be made arbitrarily close to zero when the code rate is less than the channel capacity. This result can be written mathematically as

    Pe(∼C*n) → 0 as n → ∞,

provided R = limsup_{n→∞} (1/n) log ||∼C*n|| < C, where ∼C*n is the optimal code of blocklength n. The theorem only tells us that the decoding error vanishes as the blocklength increases; it does not reveal how fast the decoding error approaches zero. In other words, we do not know the rate of convergence of the decoding error. Sometimes this information is very important, for instance when deciding what blocklength suffices to achieve a given error bound. The first step in investigating the rate of convergence of the decoding error is to compute its exponent, provided the decoding error decays to zero exponentially fast (it indeed does for memoryless channels). This exponent, as a function of the rate, is called the channel reliability function and will be discussed in the next chapter.

For hypothesis testing problems, the type-II error probability at a fixed test level also decays to zero as the number of observations increases. As it turns out, its exponent is the divergence of the null hypothesis distribution against the alternative hypothesis distribution.

Lemma 8.2 (Stein's lemma) For a sequence of i.i.d. observations X^n drawn from either the null hypothesis distribution P_{X^n} or the alternative hypothesis distribution P_{X̂^n}, the type-II error satisfies, for every ε ∈ (0, 1),

    lim_{n→∞} −(1/n) log βn*(ε) = D(P_X || P_X̂),

where βn∗ (ε) = minαn ≤ε βn , and αn and βn represent the type I and type II errors, respectively. Proof: [1. Forward Part] In the forward part, we prove that there exists an acceptance region for null hypothesis such that 1 lim inf − log βn (ε) ≥ D(PX kPXˆ ). n→∞ n step 1: divergence typical set. For any δ > 0, define divergence typical set as   n 1 n (x ) P X n An (δ) , x : log − D(PX kPXˆ ) < δ . n PXˆ n (xn ) Note that in divergence typical set, PXˆ n (xn ) ≤ PX n (xn )e−n(D(PX kPXˆ )−δ) . step 2: computation of type I error. By weak law of large number, PX n (An (δ)) → 1. Hence, αn = PX n (Acn (δ)) < ε, for sufficiently large n.


step 3: computation of type II error. βn (ε) = PXˆ n (An (δ)) X = PXˆ n (xn ) xn ∈An (δ)



X

PX n (xn )e−n(D(PX kPXˆ )−δ)

xn ∈An (δ)

X

≤ e−n(D(PX kPXˆ )−δ)

PX n (xn )

xn ∈An (δ)

≤ e−n(D(PX kPXˆ )−δ) (1 − αn ). Hence, 1 1 − log βn (ε) ≥ D(PX kPXˆ ) − δ + log(1 − αn ), n n which implies 1 lim inf − log βn (ε) ≥ D(PX kPXˆ ) − δ. n→∞ n The above inequality is true for any δ > 0; therefore 1 lim inf − log βn (ε) ≥ D(PX kPXˆ ). n→∞ n [2. Converse Part] In the converse part, we will prove that for any acceptance region for null hypothesis Bn satisfying the type I error constraint, i.e., αn (Bn ) = PX n (Bnc ) ≤ ε, its type II error βn (Bn ) satisfies 1 lim sup − log βn (Bn ) ≤ D(PX kPXˆ ). n n→∞ βn (Bn ) = PXˆ n (Bn ) ≥ PXˆ n (Bn ∩ An (δ)) X ≥ PXˆ n (xn ) xn ∈Bn ∩An (δ)



X

PX n (xn )e−n(D(PX kPXˆ )+δ)

xn ∈Bn ∩An (δ)

= ≥ ≥ ≥

e−n(D(PX kPXˆ )+δ) PX n (Bn ∩ An (δ)) e−n(D(PX kPXˆ )+δ) [1 − PX n (Bnc ) − PX n (Acn (δ))] e−n(D(PX kPXˆ )+δ) [1 − αn (Bn ) − PX n (Acn (δ))] e−n(D(PX kPXˆ )+δ) [1 − ε − PX n (Acn (δ))] .


Hence, 1 1 − log βn (Bn ) ≤ D(PX kPXˆ ) + δ + log [1 − ε − PX n (Acn (δ))] , n n which implies that 1 lim sup − log βn (Bn ) ≤ D(PX kPXˆ ) + δ. n n→∞ The above inequality is true for any δ > 0; therefore, 1 lim sup − log βn (Bn ) ≤ D(PX kPXˆ ). n n→∞ 2
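Stein's lemma can be checked numerically for simple hypotheses. The sketch below (added for illustration; the Bernoulli parameters and the width δ of the divergence-typical region are arbitrary choices) builds the acceptance region An(δ) for Bern(p0) versus Bern(p1), computes the exact type-I and type-II error probabilities, and shows −(1/n) log βn settling near D(P_X || P_X̂).

import math

def D_bern(p, q):
    """Divergence D(Bern(p) || Bern(q)) in nats."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def log_binom_pmf(n, k, p):
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))

p0, p1, delta = 0.30, 0.50, 0.02     # null P_X = Bern(p0), alternative P_X-hat = Bern(p1)
D = D_bern(p0, p1)

for n in (200, 800, 3200):
    alpha, beta = 0.0, 0.0
    for k in range(n + 1):
        # the normalized log-likelihood ratio depends on x^n only through k = #ones
        llr = (k * math.log(p0 / p1) + (n - k) * math.log((1 - p0) / (1 - p1))) / n
        if abs(llr - D) < delta:     # divergence-typical acceptance region A_n(delta)
            beta += math.exp(log_binom_pmf(n, k, p1))    # type-II error mass
        else:
            alpha += math.exp(log_binom_pmf(n, k, p0))   # type-I error mass
    print(f"n={n:5d}  alpha = {alpha:.4f}   -(1/n) log beta = {-math.log(beta)/n:.4f}   D = {D:.4f}")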

8.1.1

Composition of sequence of i.i.d. observations

Stein's lemma gives the exponent of the type-II error probability at a fixed test level. This exponent, which is the divergence of the null hypothesis distribution against the alternative hypothesis distribution, is independent of the type-I error bound ε for i.i.d. observations. Under the i.i.d. assumption, the probability of each sequence x^n depends only on its composition, which is defined as the |X|-dimensional vector

    ( #1(x^n)/n, #2(x^n)/n, . . . , #k(x^n)/n ),

where X = {1, 2, . . . , k} and #i(x^n) is the number of occurrences of symbol i in x^n. The probability of x^n can therefore be written as

    P_{X^n}(x^n) = P_X(1)^{#1(x^n)} × P_X(2)^{#2(x^n)} × · · · × P_X(k)^{#k(x^n)}.

Note that #1(x^n) + · · · + #k(x^n) = n. Since the composition of a sequence determines its probability, all sequences with the same composition have the same statistical properties and hence should be treated alike. Instead of manipulating the sequences of observations through typical-set-like concepts, we may focus on their compositions. As it turns out, such an approach yields simpler proofs and better geometric interpretations for i.i.d. observations. (It should be pointed out that when the composition alone cannot determine the probability, this viewpoint does not seem to be effective.)

Lemma 8.3 (polynomial bound on the number of compositions) The number of compositions increases polynomially fast in n, while the number of possible sequences increases exponentially fast.


Proof: Let Pn denotes the set of all possible compositions for xn ∈ X n . Then since each numerator of the components in |X |-dimensional composition vector ranges from 0 to n, it is obvious that |Pn | ≤ (n + 1)|X | which increases polynomially with n. However, the number of possible sequences is |X |n which is of exponential order w.r.t. n. 2 Each composition (or |X |-dimensional vector) actually represents a possible probability mass distribution. In terms of the composition distribution, we can compute the exponent of the probability of those observations that belong to the same composition. Lemma 8.4 (probability of sequences of the same composition) The probability of the sequences of composition C with respect to distribution PX n satisfies 1 e−nD(PC kPX ) ≤ PX n (C) ≤ e−nD(PC kPX ) , (n + 1)|X | where PC is the composition distribution for composition C, and C (by abusing notation without ambiguity) is also used to represent the set of all sequences (in X n ) of composition C. Proof: For any sequence xn of composition   #k(xn ) #1(xn ) #2(xn ) , ,..., C= = (PC (1), PC (2), . . . , PC (k)) , n n n where k = |X |, n

PX n (x ) = =

=

k Y i=1 k Y x=1 k Y

PX (i)#i(x

n)

PX (x)n·PC (x) en·PC (x) log PX (x)

x=1 P en x∈X PC (x) log PX (x) P −n( x∈X PC (x) log[PC (x)/PX (x)]−PC (x) log PC (x))

= = e = e−n(D(PC kPX )+H(PC )) . Similarly, we have

PC (xn ) = e−n[D(PC kPC )+H(PC )] = e−nH(PC ) .


(8.1.1)

(PC originally represents a one-dimensional distribution over X . Here, without ambiguity, we re-use the Q same notation of PC (·) to represent its n-dimensional n extension, i.e., PC (x ) = ni=1 PC (xi ).) We can therefore bound the number of sequence in C as X 1 ≥ PC (xn ) xn ∈C

X

=

e−nH(PC )

xn ∈C −nH(PC )

= |C|e

,

which implies |C| ≤ enH(PC ) . Consequently, X PX n (C) = PX n (xn ) xn ∈C

=

X

e−n(D(PC kPX )+H(PC ))

xn ∈C −n(D(PC kPX )+H(PC ))

= |C|e ≤ enH(PC ) e−n(D(PC kPX )+H(PC )) = e−n·D(PC kPX ) .

(8.1.2)

This ends the proof of the upper bound. From (8.1.2), we observe that to derive a lower bound of PX n (C) is equivalent to find a lower bound of |C|, which requires the next inequality. For any composition C 0 , and its corresponding composition set C 0 , Q |C| i∈X PC (i)n·PC (i) PC (C) Q = PC (C 0 ) |C 0 | i∈X PC (i)n·PC0 (i) n! Q Q PC (i)n·PC (i) j∈X (n · PC (j))! = × Q i∈X n·PC 0 (i) n! i∈X PC (i) Q j∈X (n · PC 0 (j))! Q Y (n · PC 0 (j))! PC (i)n·PC (i) = × Q i∈X n·PC 0 (i) (n · PC (j))! i∈X PC (i) j∈X Y Y ≥ (n · PC (j))n·PC0 (j)−n·PC (j) PC (i)n·PC (i)−n·PC0 (i) j∈X

i∈X

m! nm using the inequality ≥ n; n! n P n j∈X (PC 0 (j)−PC (j)) = n = n0 = 1.


(8.1.3)

Hence, 1 =

X



X

PC (C 0 )

{C 0 }

{C 0 }

=

X

max PC (C 00 ) 00 {C }

PC (C)

(8.1.4)

{C 0 }

≤ (n + 1)|X | PC (C) = (n + 1)|X | |C|e−nH(PC ) ,

(8.1.5)

where (8.1.4) and (8.1.5) follow from (8.1.3) and (8.1.1), respectively. Therefore, a lower bound of |C| is obtained as: |C| ≥

1 enH(PC ) . (n + 1)|X | 2
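For a small alphabet, the two bounds of Lemma 8.4 can be verified exhaustively. The sketch below (an added check; the source distribution and blocklength are arbitrary) computes P_{X^n}(C) exactly through the multinomial coefficient for every composition C and confirms (n+1)^{-|X|} e^{-nD(P_C||P_X)} ≤ P_{X^n}(C) ≤ e^{-nD(P_C||P_X)}.

import math
from itertools import product

def kl(q, p):
    """D(q||p) in nats, with 0 log 0 = 0."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

n = 12
PX = (0.5, 0.3, 0.2)                    # illustrative i.i.d. source on |X| = 3 symbols
k = len(PX)

for head in product(range(n + 1), repeat=k - 1):
    last = n - sum(head)
    if last < 0:
        continue
    counts = head + (last,)
    coeff = math.factorial(n)
    for c in counts:
        coeff //= math.factorial(c)     # multinomial coefficient = |C|
    prob = coeff * math.prod(p ** c for p, c in zip(PX, counts))   # P_{X^n}(C)
    PC = tuple(c / n for c in counts)
    bound = math.exp(-n * kl(PC, PX))
    assert prob <= bound + 1e-12                       # upper bound of Lemma 8.4
    assert prob >= bound / (n + 1) ** k - 1e-12        # lower bound of Lemma 8.4
print("Lemma 8.4 bounds verified for all", (n + 1) * (n + 2) // 2, "compositions (n=12, |X|=3).")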

Theorem 8.5 (Sanov’s Theorem) For a set of compositions En and a given true distribution PX n , the exponent of PX n (En ) is given by min D(PC kPX ),

C∈En

where PC is the composition distribution for composition C. Proof: PX n (En ) =

X

PX n (C)

C∈En



X

e−nD(PC kPX )

C∈En



X C∈En

=

X

max e−nD(PC kPX )

C∈En

e−n·minC∈En D(PC kPX )

C∈En

≤ (n + 1)|X | e−n·minC∈En D(PC kPX ) . Let C ∗ ∈ En satisfying that D(PC ∗ kPX ) = min D(PC kPX ). C∈En


PX n (En ) =

X

PX n (C)

C∈En



X C∈En



1 e−nD(PC kPX ) (n + 1)|X |

1 e−nD(PC∗ kPX ) . (n + 1)|X |

Therefore,

    (n + 1)^{−|X|} e^{−n D(P_{C*} || P_X)} ≤ P_{X^n}(En) ≤ (n + 1)^{|X|} e^{−n min_{C∈En} D(P_C || P_X)},

which implies

    lim_{n→∞} −(1/n) log P_{X^n}(En) = min_{C∈En} D(P_C || P_X).   □

The geometric explanation of Sanov's theorem is depicted in Figure 8.1. If we adopt the divergence as the measure of distance in the distribution space (the outer triangular region in Figure 8.1), then En is just a set of (empirical) distributions. For any distribution P_X outside En, we can measure the shortest divergence distance from P_X to En; this shortest distance is exactly the exponent of P_{X^n}(En). The following examples show that Sanov's viewpoint is quite effective, especially for i.i.d. observations.

[Figure 8.1: The geometric meaning of Sanov's theorem: within the probability simplex, the exponent min_{C∈En} D(P_C || P_X) is the shortest divergence distance from P_X to the set En, which contains empirical distributions such as P_{C1} and P_{C2}.]

Example 8.6 Suppose one wants to roughly estimate the probability that the average of the throws is greater than or equal to 4 when tossing a fair die n times. Observe that whether this requirement is satisfied depends only on the composition of the observations. Let En be the set of compositions that satisfy the requirement, and let PX be the distribution of a fair die. Then En can be written as

    En = { C : Σ_{i=1}^{6} i P_C(i) ≥ 4 }.

To minimize D(P_C || P_X) over C ∈ En, we can use the Lagrange multiplier technique (since the divergence is convex in its first argument) with the constraints on P_C being

    Σ_{i=1}^{6} i P_C(i) = k   and   Σ_{i=1}^{6} P_C(i) = 1

for each attainable value of k with 4 ≤ k ≤ 6. So the problem becomes minimizing

    Σ_{i=1}^{6} P_C(i) log [ P_C(i) / P_X(i) ] + λ1 ( Σ_{i=1}^{6} i P_C(i) − k ) + λ2 ( Σ_{i=1}^{6} P_C(i) − 1 ).

By taking derivatives, we find that the minimizer should be of the form

    P_C(i) = e^{λ1 · i} / Σ_{j=1}^{6} e^{λ1 · j},

where λ1 is chosen to satisfy

    Σ_{i=1}^{6} i P_C(i) = k.                                              (8.1.6)

Since the above holds for all attainable k ≥ 4, it suffices to take the smallest such k as our solution, i.e., k = 4. Finally, by solving (8.1.6) for k = 4 numerically, the minimizer is

    P_{C*} = (0.1031, 0.1227, 0.1461, 0.1740, 0.2072, 0.2468),

and the exponent of the desired probability is D(P_{C*} || P_X) = 0.0433 nat. Consequently, P_{X^n}(En) ≈ e^{−0.0433 n}.
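The numerical solution of (8.1.6) for k = 4 is easily reproduced; the sketch below (added here; the bisection scheme is just one convenient root-finder) recovers the tilted minimizer and the exponent quoted above.

import math

def tilted(lmbda):
    """Exponentially tilted distribution P_C(i) proportional to exp(lmbda * i), i = 1..6."""
    w = [math.exp(lmbda * i) for i in range(1, 7)]
    s = sum(w)
    return [wi / s for wi in w]

def mean(dist):
    return sum(i * p for i, p in zip(range(1, 7), dist))

# Solve sum_i i*P_C(i) = 4 for lambda by bisection (the mean is increasing in lambda).
lo, hi = 0.0, 2.0
for _ in range(100):
    mid = 0.5 * (lo + hi)
    if mean(tilted(mid)) < 4.0:
        lo = mid
    else:
        hi = mid
PC = tilted(0.5 * (lo + hi))

PX = [1/6] * 6                                          # fair die
D = sum(p * math.log(p / q) for p, q in zip(PC, PX))
print("P_C* =", [round(p, 4) for p in PC])              # ~ (0.1031, ..., 0.2468)
print("exponent D(P_C*||P_X) =", round(D, 4), "nats")   # ~ 0.0433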

8.1.2

Divergence typical set on composition

In Stein’s lemma, one of the most important steps is to define the divergence typical set, which is   1 PX n (xn ) n An (δ) , x : log − D(PX kPXˆ ) < δ . n PXˆ n (xn )


One of the key features of the divergence typical set is that its probability eventually tends to 1 with respect to P_{X^n}. In terms of compositions, we can define another divergence typical set with the same feature as

    Tn(δ) ≜ { x^n ∈ X^n : D(P_{C_{x^n}} || P_X) ≤ δ },

where C_{x^n} represents the composition of x^n. The above typical set can be re-written in terms of composition sets as

    Tn(δ) ≜ { C : D(P_C || P_X) ≤ δ }.

P_{X^n}(Tn(δ)) → 1 is justified by

    1 − P_{X^n}(Tn(δ)) = Σ_{C : D(P_C||P_X) > δ} P_{X^n}(C)
                       ≤ Σ_{C : D(P_C||P_X) > δ} e^{−n D(P_C||P_X)}    (from Lemma 8.4)
                       ≤ Σ_{C : D(P_C||P_X) > δ} e^{−nδ}
                       ≤ (n + 1)^{|X|} e^{−nδ}                          (cf. Lemma 8.3).
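The same enumeration of compositions also lets us compare the exact probability 1 − P_{X^n}(Tn(δ)) with the polynomial-times-exponential bound (n+1)^{|X|} e^{−nδ}. A minimal sketch (an added illustration with an arbitrary three-letter source and an arbitrary δ):

import math
from itertools import product

def kl(q, p):
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

PX, delta = (0.5, 0.3, 0.2), 0.2
for n in (40, 80, 160):
    outside = 0.0
    for head in product(range(n + 1), repeat=2):
        c3 = n - sum(head)
        if c3 < 0:
            continue
        counts = head + (c3,)
        if kl(tuple(c / n for c in counts), PX) <= delta:
            continue                                   # composition inside T_n(delta)
        coeff = math.factorial(n)
        for c in counts:
            coeff //= math.factorial(c)
        outside += coeff * math.prod(p ** c for p, c in zip(PX, counts))
    bound = (n + 1) ** len(PX) * math.exp(-n * delta)
    print(f"n={n:3d}  1 - P(T_n(delta)) = {outside:.3e}   bound = {bound:.3e}")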

8.1.3

Universal source coding on compositions

It is sometimes not possible to know the true source distribution. In such case, instead of finding the best lossless data compression code for a specific source distribution, one may turn to design a code which is good for all possible candidate distributions (of some specific class.) Here, “good” means that the average codeword length achieves the (unknown) source entropy for all sources with the distributions in the considered class. To be more precise, let n

fn : X →

∞ [

{0, 1}i

i=1

be a fixed encoding function. Then for any possible source distribution PX n in the considered class, the resultant average codeword length for this specific code satisfies 1X PX n (xn )`(fn (xn )) → H(X), n i=1 as n goes to infinity, where `(·) represents the length of fn (xn ). Then fn is said to be a universal code for the source distributions in the considered class.


Example 8.7 (universal encoding using composition) First, binary-index the compositions using log2 (n + 1)|X | bits, and denote this binary index for composition C by a(C). Note that the number of compositions is at most (n + 1)|X | . Let Cxn denote the composition with respect to xn , i.e. xn ∈ Cxn . For each composition C, we know that the number of sequence xn in C is at most 2n·H(PC ) (Here, H(PC ) is measured in bits. I.e., the logarithmic base in entropy computation is 2. See the proof of Lemma 8.4). Hence, we can binary-index the elements in C using n · H(PC ) bits. Denote this binary index for elements in C by b(C). Define a universal encoding function fn as fn (xn ) = concatenation{a(Cxn ), b(Cxn )}. Then this encoding rule is a universal code for all i.i.d. sources. Proof: All the logarithmic operations in entropy and divergence are taken to be base 2 throughout the proof. For any i.i.d. distribution with marginal PX , its associated average codeword length is X X `¯n = PX n (xn )`(a(Cxn )) + PX n (xn )`(b(Cxn )) xn ∈X n



X

xn ∈X n

PX n (xn ) · log2 (n + 1)|X | +

PX n (xn ) · n · H(PCxn )

xn ∈X n

xn ∈X n

≤ |X | · log2 (n + 1) +

X

X

PX n (C) · n · H(PC ).

{C}

Hence, the normalized average code length is upper-bounded by |X | × log2 (n + 1) X 1¯ `n ≤ + PX n (C)H(PC ). n n {C}

Since the first term of the right hand side of the above inequality vanishes as n tends to infinity, only the second term has contribution to the ultimate normalized average code length, which in turn can be bounded above using typical set


Tn (δ) as: X

PX n (C)H(PC )

{C}

=

X

PX n (C)H(PC ) +

{C∈Tn (δ)}

≤ ≤

X

PX n (C)H(PC )

{C6∈Tn (δ)}

max

{C : D(PC kPX )≤δ/ log(2)}

max

{C : D(PC kPX )≤δ/ log(2)}

X

H(PC ) +

PX n (C)H(PC )

{C : D(PC kPX )>δ/ log(2)}

X

H(PC ) +

2−nD(PC kPX ) H(PC ),

{C : D(PC kPX )>δ/ log(2)}

(From Lemma 8.4) ≤ ≤ ≤

max

{C : D(PC kPX )≤δ/ log(2)}

max

{C : D(PC kPX )≤δ/ log(2)}

max

{C : D(PC kPX )≤δ/ log(2)}

X

H(PC ) +

e−nδ H(PC )

{C : D(PC kPX )>δ/ log(2)}

X

H(PC ) +

e−nδ log2 |X |

{C : D(PC kPX )>δ/ log(2)}

H(PC ) + (n + 1)|X | e−nδ log2 |X |,

where the second term of the last step vanishes as n → ∞. (Note that when base-2 logarithm is taken in divergence instead of natural logarithm, the range [0, δ] in Tn (δ) should be replaced by [0, δ/ log(2)].) It remains to show that max

{C : D(PC kPX )≤δ/ log(2)}

H(PC ) ≤ H(X) + γ(δ),

where γ(δ) only depends on δ, and approaches zero as δ → 0. Since D(PC kPX ) ≤ δ/ log(2), p √ kPC − PX k ≤ 2 · log(2) · D(PC kPX ) ≤ 2δ, where kPC − PX k represents the variational distance of PC against PX , which implies that for all x ∈ X , X |PC (x) − PX (x)| ≤ |PC (x) − PX (x)| x∈X

= kPC − PX k √ ≤ 2δ. Fix 0 < ε < min{ min{x : PX (x)>0} PX (x), 2/e2 }. Choose δ = ε2 /8. Then for all x ∈ X , |PC (x) − PX (x)| ≤ ε/2.


For those x satisfying PX (x) ≥ ε, the following is true. From mean value theorem, namely, d(u log u) |t log t − s log s| ≤ max · |t − s|, u∈[t,s]∪[s,t] du we have

≤ ≤ ≤

=

|P (x) log PC (x) − PX (x) log PX (x)| C d(u log u) max u∈[PC (x),PX (x)]∪[P · |PC (x) − PX (x)| du X (x),PC (x)] d(u log u) ε max 2 u∈[PC (x),PX (x)]∪[P du X (x),PC (x)] max d(u log u) ε u∈[ε/2,1] 2 du (Since PX (x) + ε/2 ≥ PC (x) ≥ PX (x) − ε/2 ≥ ε − ε/2 = ε/2 and PX (x) ≥ ε > ε/2.) ε · |1 + log(ε/2)| , 2

where the last step follows from ε < 2/e2 . For those x satisfying PX (x) = 0, PC (x) < ε/2. Hence, |PC (x) log PC (x) − PX (x) log PX (x)| = |PC (x) log PC (x)| ε ε ≤ − log , 2 2 since ε ≤ 2/e2 < 2/e. Consequently, 1 X |PC (x) log PC (x) − PX (x) log PX (x)| log(2) x∈X   |X | ε ε ε · |1 + log(ε/2)| ≤ max − log , log(2) 2 2 2 ( r ) √ |X | δ 1 ≤ max − log(2δ), 2δ · 1 + log(2δ) log(2) 2 2

|H(PC ) − H(X)| ≤

, γ(δ). 2
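The two-part universal code of Example 8.7 is straightforward to realize in code. The sketch below (an illustrative implementation added here; the exact index assignments are left abstract and only the description length is computed) spends about |X| log2(n+1) bits on the composition and log2 |C| ≤ n H(P_C) bits on the index within the composition class, and its per-symbol rate approaches the entropy of whichever i.i.d. source actually produced the data.

import math, random

random.seed(0)

def encode_length_bits(x, alphabet):
    """Length (in bits) of the two-part universal code: composition index + index in class."""
    n = len(x)
    counts = [x.count(a) for a in alphabet]
    part1 = math.ceil(len(alphabet) * math.log2(n + 1))          # index of the composition
    log_class_size = math.log2(math.factorial(n)) - sum(
        math.log2(math.factorial(c)) for c in counts)            # log2 |C| <= n*H(P_C)
    part2 = math.ceil(log_class_size)                            # index within the class
    return part1 + part2

alphabet = ("a", "b", "c")
PX = (0.6, 0.3, 0.1)                      # true source, unknown to the encoder
H = -sum(p * math.log2(p) for p in PX)    # entropy in bits/symbol

for n in (100, 1000, 10000):
    x = random.choices(alphabet, weights=PX, k=n)
    rate = encode_length_bits(x, alphabet) / n
    print(f"n={n:6d}  universal rate = {rate:.3f} bits/symbol   H(X) = {H:.3f}")

Note that the encoder never uses PX; the rate adapts to the source automatically, which is precisely the universality claim.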


8.1.4

Likelihood ratio versus divergence

Recall that the Neyman-Pearson lemma indicates that the optimal test for two hypothesis (H0 : PX n against H1 : PXˆ n ) is of the form PX n (xn ) > τ PXˆ n (xn )
0,   s s Φ(x0 − s) − H(x0 − s) ≥ Φ(x0 ) − √ − H(x0 ) = η − √ , 2π 2π and then obtain √ H

x0 −

1 ≥ η−√ 2π x η = −√ , 2 2π

! ! √ 2π 2π η − x − Φ x0 − η−x 2 2 ! √ 2π η+x 2

√ for |x| < η 2π/2. Together with the fact that H(x) − Φ(x) ≥ −η for all


x ∈ 0 for E[Yn ] < 0, we obtain

= = ≥ ≥

P r(Yn > 0) Z ∞ dFn (y) 0 Z ∞ Mn (θ) e−θ sn (θ) y dHn (y) Z0 α Mn (θ) e−θ sn (θ) y dHn (y), for some α > 0 specified later 0 Z α −θ sn (θ) α Mn (θ)e dHn (y) 0 −θ sn (θ) α

= Mn (θ)e [H(α) − H(0)] −θ sn (θ) α ≥ Mn (θ)e "

# 4(n − d − 1)ˆ ρ(θ) √  . × Φ(α) − Φ(0) − Cn,d √ ˆ 2 (θ)sn (θ) π 2(n − d) − 3 2 σ

The proof is completed by letting α be the solution of 4(n − d − 1)ˆ ρ(θ) 1 √  Φ(α) = Φ(0) + Cn,d √ + . π 2(n − d) − 3 2 σ ˆ 2 (θ)sn (θ) |θ|sn (θ) 2 The upper bound obtained in Lemma 8.31 in fact depends only upon the ratio of (µ/σ) or (1/2)(µ2 /σ 2 ). Therefore, we can rephrase Lemma 8.31 as the next corollary.


P Corollary 8.32 Let Yn = ni=1 Xi be the sum of independent random variables, among which {Xi }di=1 are identically Gaussian distributed with mean µ > 0 and variance σ 2 , and {Xi }ni=d+1 have common distribution as min{Xi , 0}. Let γ , (1/2)(µ2 /σ 2 ). Then P r {Yn ≤ 0} ≤ B (d, (n − d); γ) where

√ 4πγ d √ ; ≥ 1 − −γ √ An,d (γ)Mn (γ), if B (d, (n − d); γ) , n e − 4πγΦ( 2γ)  1 − Bn,d (γ)Mn (γ), otherwise, ! 1 4Cn,d (n − d − 1)˜ ρ(λ) √  An,d (γ) , min √ +√  ,1 , √ 2π λ − 2γ s˜n (λ) π 2(n − d) − 3 2 σ ˜ 2 (λ)˜ sn (λ)  

n p 1 exp − λ − 2γ s˜n (λ) Bn,d (γ) , √ λ − 2γ s˜n (λ) × Φ−1

4Cn,d (n − d − 1)˜ ρ(λ) 1 1 √ +√ + √ 2 λ − 2γ s˜n (θ) π[2(n − d) − 3 2]˜ σ 2 (λ)˜ sn (λ) 2 /2

Mn (λ) , e−nγ edλ

h

Φ (−λ) eλ

2 /2

+ eγ Φ

p in−d 2γ ,

d nd n 1 2 √ − λ + , √ n − d (n − d)2 n − d 1 + 2πλeγ Φ( 2γ)  λ d(n + d) 2 n √ 1− λ ρ˜(λ) , √ γ (n − d) [1 + 2πλe Φ( 2γ)] (n − d)2   n2 2 2 2 λ + 2 e−d(2n−d)λ /(n−d) +2 2 (n − d)   p √ d n2 γ 2 − λ + 3 2πλe Φ( 2γ) n − d (n − d)2     √ 2n n2 n 2 λ2 /2 − λ +3 2πλe Φ − λ , n − d (n − d)2 n−d   d 1 2 2 √ λ + , s˜n (λ) , n − √ n−d 1 + 2πλeγ Φ( 2γ)

σ ˜ 2 (λ) , −

and λ is the unique positive solution of   p 1 d d (1/2)λ2 λe Φ(−λ) = √ 1− − eγ Φ( 2γ)λ, n n 2π provided that n − d ≥ 3.


!) ,

Proof: The notations used in this proof follows those in Lemma 8.31. √ Let λ , (µ/σ) + σθ = 2γ + σθ. Then the moment generating functions of X1 and Xd+1 , min{X1 , 0} are respectively p  2 2 2γ . MX1 (λ) = e−γ eλ /2 and MXd+1 (λ) = Φ (−λ) e−γ eλ /2 + Φ P Therefore, the moment generating function of Yn = ni=1 Xi becomes h p in−d −nγ dλ2 /2 λ2 /2 γ (λ) = e e Φ (−λ) e + e Φ Mn (λ) = MXd 1 (λ)MXn−d 2γ . d+1 (8.4.16) Since λ and θ have a one-to-one correspondence, solving ∂Mn (θ)/∂θ = 0 is equivalent to solving ∂Mn (λ)/∂λ = 0, which from (8.4.16) results in   p d d 1 (1/2)λ2 (8.4.17) 1− − eγ Φ( 2γ). e Φ(−λ) = √ n n 2πλ As it turns out, the solution λ of the above equation depends only on γ. Next we derive σ ˆ 2 (θ) ρˆ(θ) σ ˜ (λ) , , ρ˜(λ) , 3 σ 2 θ=(λ−√2γ)/σ σ θ=(λ−√2γ)/σ 2

and s˜2n (λ)

s2n (θ) . , σ 2 θ=(λ−√2γ)/σ

2

By replacing e(1/2)λ Φ(−λ) with   p d 1 d √ − eγ Φ( 2γ) 1− n n 2πλ (cf. (8.4.17)), we obtain d n nd 1 √ − λ2 + , (8.4.18) √ 2 n − d (n − d) n − d 1 + 2πλeγ Φ( 2γ)  n λ d(n + d) 2 √ ρ˜(λ) = 1− λ √ (n − d) [1 + 2πλeγ Φ( 2γ)] (n − d)2   n2 2 2 2 +2 λ + 2 e−d(2n−d)λ /(n−d) 2 (n − d)   p √ d n2 2 γ λ + 3 − 2πλe Φ( 2γ) n − d (n − d)2     √ 2n n2 n 2 λ2 /2 − λ +3 2πλe Φ − λ ,(8.4.19) n − d (n − d)2 n−d

σ ˜ 2 (λ) = −

175

and  = n −

 d 1 2 √ λ + . (8.4.20) √ n−d 1 + 2πλeγ Φ( 2γ) √ Taking (8.4.18), (8.4.19), (8.4.20) and θ = (λ − 2γ)/σ into An,d and Bn,d in Lemma 8.31, we obtain: ! 1 4(n − d − 1)˜ ρ(λ) √  An,d = min √ + Cn,d √  ,1 √ 2π λ − 2γ s˜n (λ) π 2(n − d) − 3 2 σ ˜ 2 (λ)˜ sn (λ) s˜2n (λ)

and Bn,d

n p 1 exp − λ − 2γ s˜n (λ) = √ λ − 2γ s˜n (λ) × Φ−1

4Cn,d (n − d − 1)˜ ρ(λ) 1 √ Φ(0) + √ + √ λ − 2γ s˜n (θ) π[2(n − d) − 3 2]˜ σ 2 (λ)˜ sn (λ)

!) ,

which depends only on λ and γ, as desired. Finally, a simple derivation yields E[Yn ] = dE[X1 ] + (n − d)E[Xd+1 ]  p h p p i √ −γ = σ d 2γ + (n − d) −(1/ 2π)e + 2γΦ(− 2γ) , and hence,

√ d 4πγ √ . E[Yn ] ≥ 0 ⇔ ≥ 1 − −γ √ n e − 4πγΦ( 2γ) 2

We close this section by remarking that the parameters in the upper bounds of Corollary 8.32, although they have complex expressions, are easily computerevaluated, once their formulas are established.

8.5

Generalized Neyman-Pearson Hypothesis Testing

The general expression of the Neyman-Pearson type-II error exponent subject to a constant bound on the type-I error has been proved for arbitrary observations. In this section, we will state the results in terms of the ε-inf/sup-divergence rates. Theorem 8.33 (Neyman-Pearson type-II error exponent for a fixed test level [3]) Consider a sequence of random observations which is assumed to

176

have a probability distribution governed by either PX (null hypothesis) or PXˆ (alternative hypothesis). Then, the type-II error exponent satisfies ˆ ˆ ≤ lim sup − 1 log β ∗ (ε) ≤ D ¯ ε (XkX) ¯ δ (XkX) lim D n δ↑δ n n→∞ ˆ ≤ lim inf − 1 log β ∗ (ε) ≤ Dε (XkX) ˆ lim Dδ (XkX) n n→∞ δ↑ε n where βn∗ (ε) represents the minimum type-II error probability subject to a fixed type-I error bound ε ∈ [0, 1). The general formula for Neyman-Pearson type-II error exponent subject to an exponential test level has also been proved in terms of the ε-inf/sup-divergence rates. Theorem 8.34 (Neyman-Pearson type-II error exponent for an exponential test level) Fix s ∈ (0, 1) and ε ∈ [0, 1). It is possible to choose decision regions for a binary hypothesis testing problem with arbitrary datawords of blocklength n, (which are governed by either the null hypothesis distribution PX or the alternative hypothesis distribution PXˆ ,) such that 1 ˆ (s) kX) ˆ and lim sup − 1 log αn ≥ D(1−ε) (X ˆ (s) kX), ¯ ε (X lim inf − log βn ≥ D n→∞ n n n→∞ (8.5.1) or 1 ˆ (s) kX) ˆ and lim sup − 1 log αn ≥ D ˆ (s) kX), ¯ (1−ε) (X lim inf − log βn ≥ Dε (X n→∞ n n n→∞ (8.5.2)  (s) ∞ (s) ˆ where X exhibits the tilted distributions P defined by ˆ n n=1 X

(s) dPXˆ n (xn )

and

  1 dPX n n , exp s log (x ) dPXˆ n (xn ), Ωn (s) dPXˆ n

  dPX n n Ωn (s) , exp s log (x ) dPXˆ n (xn ). dPXˆ n Xn Z

Here, αn and βn are the type-I and type-II error probabilities, respectively. ˜ to represent X ˆ (s) . We only prove (8.5.1); Proof: For ease of notation, we use X (8.5.2) can be similarly demonstrated.

177

(s)

By definition of dPXˆ n (·), we have       1 1 1 1 1 1 n ˆn n n e e d e n (X kX ) + d e n (X kX ) = − log Ωn (s) . s n X 1−s n X s(1 − s) n (8.5.3) ¯ Let Ω , lim supn→∞ (1/n) log Ωn (s). Then, for any γ > 0, ∃N0 such that ∀ n > N0 , 1 ¯ + γ. log Ωn (s) < Ω n From (8.5.3), dXe n kXˆ n (θ) 

, = = ≤ = =

 1 n n e kX ˆ )≤θ lim inf P r d e n (X n→∞ n X       1 1 1 θ 1 n n e kX ) − d e n (X log Ωn (s) ≤ lim inf P r − n→∞ 1−s n X s(1 − s) n s    1 e n kX n ) ≥ − 1 − s θ − 1 1 log Ωn (s) lim inf P r dXe n (X n→∞ n s s n   1 1−s 1¯ γ n n e lim inf P r d e n (X kX ) > − θ− Ω− n→∞ n X s s s   1−s 1¯ γ 1 n n e d e n (X kX ) ≤ − θ− Ω− 1 − lim sup P r n X s s s n→∞   1¯ γ 1−s 1 − d¯Xe n kX n − θ− Ω − . s s s

Thus, ˜ X) ˆ , sup{θ : d e n ˆ n (θ) ≤ ε} ¯ ε (Xk D X kX     1−s 1 ¯ ¯ ≥ sup θ : 1 − dXe n kX n − θ − (Ω + γ) < ε s s   s 0 1 ¯ 0 ¯ (Ω + γ) − θ : dXe n kX n (θ ) > 1 − ε = sup − 1−s 1−s 1 ¯ s (Ω + γ) − inf{θ0 : d¯Xe n kX n (θ0 ) > 1 − ε} = − 1−s 1−s 1 ¯ s = − (Ω + γ) − sup{θ0 : d¯Xe n kX n (θ0 ) ≤ 1 − ε} 1−s 1−s 1 ¯ s ˜ = − (Ω + γ) − D1−ε (XkX). 1−s 1−s Finally, choosing the acceptance region for null hypothesis as   dPXe n n 1 n n ˜ ˆ ¯ x ∈ X : log (x ) ≥ Dε (XkX) , n dPXˆ n

178

we obtain:  βn = PXˆ n

 n o dPXe n n 1 ˜ ˆ ˜ X) ˆ , ¯ ¯ ε (Xk log (X ) ≥ Dε (XkX) ≤ exp −nD n dPXˆ n

and  dPXe n n 1 ˜ X) ˆ ¯ ε (Xk log (X ) < D PX n n dPXˆ n   dPXe n n 1 1 ¯ s ˜ PX n log (X ) < − (Ω + γ) − D1−ε (XkX) n dPXˆ n 1−s 1−s      dPXe n n 1 1 −1 1 log (X ) − log Ωn (s) PX n 1−s n dPX n s(1 − s) n  ¯ +γ Ω 1 ˜ D1−ε (XkX) Ω + . PX n n dPX n s n s 

αn = ≤ =

=

Then, for n > N0 ,  dPXe n n 1 ˜ log (X ) > D1−ε (XkX) ≤ PX n n dPX n n o ˜ ≤ exp −nD1−ε (XkX) . 

αn

Consequently, 1 ˆ (s) kX). ˆ (s) kX) ˆ and lim sup − 1 log αn ≥ D(1−ε) (X ¯ ε (X lim inf − log βn ≥ D n→∞ n n n→∞ 2

179

Bibliography [1] P. Billingsley. Probability and Measure, 2nd edition, New York, NY: John Wiley and Sons, 1995. [2] James A. Bucklew. Large Deviation Techniques in Decision, Simulation, and Estimation. Wiley, New York, 1990. [3] P.-N. Chen, “General formulas for the Neyman-Pearson type-II error exponent subject to fixed and exponential type-I error bound,” IEEE Trans. Info. Theory, vol. 42, no. 1, pp. 316–323, November 1993. [4] Jean-Dominique Deuschel and Daniel W. Stroock. Large Deviations. Academic Press, San Diego, 1989. [5] W. Feller, An Introduction to Probability Theory and its Applications, 2nd edition, New York, John Wiley and Sons, 1970.

180

Chapter 9 Channel Reliability Function The channel coding theorem (cf. Chapter 5) shows that block codes with arbitrarily small probability of block decoding error exist at any code rate smaller than the channel capacity C. However, the result mentions nothing on the relation between the rate of convergence for the error probability and code rate (specifically for rate less than C.) For example, if C = 1 bit per channel usage, and there are two reliable codes respectively with rates 0.5 bit per channel usage and 0.25 bit per channel usage, is it possible for one to adopt the latter code even if it results in lower information transmission rate? The answer is affirmative if a higher error exponent is required. In this chapter, we will first re-prove the channel coding theorem for DMC, using an alternative and more delicate method that describes the dependence between block codelength and the probability of decoding error.

9.1

Random-coding exponent

Definition 9.1 (channel block code) An (n, M ) code for channel W n with input alphabet X and output alphabet Y is a pair of mappings f : {1, 2, · · · , M } → X n

and g : Y n → {1, 2, · · · , M }.

Its average error probability is given by M 1 X Pe (n, M ) = M m=1 4

X

W n (y n |f (m)).

{y n :g(y n )6=m}

Definition 9.2 (channel reliability function [1]) For any R ≥ 0, define the channel reliability function E ∗ (R) for a channel W as the largest scalar β > 0 such that there exists a sequence of (n, Mn ) codes with 1 β ≤ lim inf − log Pe (n, Mn ) n→∞ n

181

and R ≤ lim inf n→∞

1 log Mn . n

(9.1.1)

From the above definition, it is obvious that E ∗ (0) = ∞; hence; in deriving the bounds on E ∗ (R), we only need to consider the case of R > 0. Definition 9.3 (random-coding exponent) The random coding exponent for DMC with generic distribution PY |X is defined by  !1+s  X X 1/(1+s) . Er (R) , sup sup −sR − log PX (x)PY |X (y|x) 0≤s≤1 PX

y∈Y

x∈X

Theorem 9.4 (random coding exponent) For DMC with generic transition probability PY |X , E ∗ (R) ≥ Er (R) for R ≥ 0. Proof: The theorem holds trivially at R = 0 since E ∗ (0) = ∞. It remains to prove the theorem under R > 0. Similar to the proof of channel capacity, the codeword is randomly selected according to some distribution PXe n . For fixed R > 0, choose a sequence of {Mn }n≥1 with R = limn→∞ (1/n) log Mn . Step 1: Maximum likelihood decoder. Let {c1 , . . . , cMn } ∈ X n denote the set of n-tuple block codewords selected, and let the decoding partition for symbol m (namely, the set of channel outputs that classify to m) be Um , {y n : PY n |X n (y n |cm ) > PY n |X n (y n |cm0 ),

for all m0 6= m}.

Those channel outputs that are on the boundary, i.e., for some m and m0 ,

PY n |X n (y n |cm ) = PY n |X n (y n |cm0 ),

will be arbitrarily assigned to either m or m0 . Step 2: Property of indicator function for s ≥ 0. Let φm be the indicator function of Um . Then1 for all s ≥ 0, ( 1/(1+s) )s  n X n |X n (y |cm0 ) P Y . 1 − φm (y n ) ≤ n P Y n |X n (y |cm ) 0 0 m 6=m,1≤m ≤M n

1

P

i

1/(1+s)

ai

s

is no less than 1 when one of {ai } is no less than 1.

182

Step 3: Probability of error given codeword cm is transmitted. Let Pe|m denote the probability of error given codeword cm is transmitted. Then X Pe|m ≤ PY n |X n (y n |cm ) y n 6∈Um

X

=

PY n |X n (y n |cm )[1 − φm (y n )]

y n ∈Y n

( X





X

PY n |X n (y n |cm )

m0 6=m,1≤m0 ≤Mn

y n ∈Y n

PY n |X n (y n |cm0 ) PY n |X n (y n |cm )

1/(1+s) )s )s

( X

=

X

1/(1+s)

PY n |X n (y n |cm )

1/(1+s)

PY n |X n (y n |cm0 )

.

m0 6=m,1≤m0 ≤Mn

y n ∈Y n

Step 4: Expectation of Pe|m . E[Pe|m ] ( " X 1/(1+s) PY n |X n (y n |cm ) ≤ E " =

=

X

1/(1+s) PY n |X n (y n |cm0 )

m0 6=m,1≤m0 ≤Mn

y n ∈Y n

X

)s #

( 1/(1+s) PY n |X n (y n |cm )

E

1/(1+s)

PY n |X n (y n |cm0 )

m0 6=m,1≤m0 ≤Mn

y n ∈Y n

X

)s # X

h

1/(1+s)

i

)s #

"(

E PY n |X n (y n |cm ) E

X m0 6=m,1≤m0 ≤M

y n ∈Y n

1/(1+s)

PY n |X n (y n |cm0 )

,

n

where the latter step follows because {cm }1≤m≤M are independent random variables. Step 5: Jensen’s inequality for 0 ≤ s ≤ 1, and bounds on probability of error. By Jensen’s inequality, when 0 ≤ s ≤ 1, E[ts ] ≤ (E[t])s .

183

Therefore, E[Pe|m ] "( h i X 1/(1+s) ≤ E PY n |X n (y n |cm ) E ≤

h

1/(1+s)

E PY n |X n (y n |cm )

i

=

#!s

" X

E

1/(1+s)

PY n |X n (y n |cm0 )

m0 6=m,1≤m0 ≤Mn

y n ∈Y n

h

X

1/(1+s)

PY n |X n (y n |cm0 )

m0 6=m,1≤m0 ≤Mn

y n ∈Y n

X

)s # X

1/(1+s)

E PY n |X n (y n |cm )

i

i 1/(1+s) n 0 E PY n |X n (y |cm ) h

X

!s

m0 6=m,1≤m0 ≤Mn

y n ∈Y n

Since the codewords are selected with identical distribution, the expectations h i 1/(1+s) E PY n |X n (y n |cm ) should be the same for each m. Hence, E[Pe|m ] h i X 1/(1+s) n E PY n |X n (y |cm ) ≤

m0 6=m,1≤m0 ≤Mn

y n ∈Y n

=

! h i s 1/(1+s) E PY n |X n (y n |cm0 )

X

i h is 1/(1+s) 1/(1+s) E PY n |X n (y n |cm ) (Mn − 1)E PY n |X n (y n |cm ) h

X y n ∈Y n

= (Mn − 1)s

i1+s X  h 1/(1+s) E PY n |X n (y n |cm ) y n ∈Y n

≤ Mns

i1+s X  h 1/(1+s) E PY n |X n (y n |cm ) y n ∈Y n

!1+s = Mns

X

X

y n ∈Y n

xn ∈X n

1/(1+s)

PXe n (xn )PY n |X n (y n |xn )

Since the upper bound of E[Pe|m ] is no longer dependent on m, the expected average error probability E[Pe ] can certainly be bounded by the same bound, namely, !1+s X X 1/(1+s) n n s n E[Pe ] ≤ Mn PXe n (x )PY n |X n (y |x ) , (9.1.2) y n ∈Y n

xn ∈X n

which immediately implies the existence of an (n, Mn ) code satisfying: !1+s X X 1/(1+s) PXe n (xn )PY n |X n (y n |xn ) . Pe (n, Mn ) ≤ Mns y n ∈Y n

xn ∈X n

184

By using the fact that PXe n and PY n |X n are product distributions with identical marginal, and taking logarithmic operation on both sides of the above inequality, we obtain: !1+s X X s 1 1/(1+s) PXe (x)PY |X (y|x) − log Pe (n, Mn ) ≥ − log Mn − log , n n y∈Y x∈X which implies !1+s X X 1 1/(1+s) PXe (x)PY |X (y|x) . lim inf − log Pe (n, Mn ) ≥ −sR − log n→∞ n y∈Y x∈X The proof is completed by noting that the lower bound holds for any s ∈ [0, 1] and any PXe . 2

9.1.1

The properties of random coding exponent

We re-formulate Er (R) to facilitate the understanding of its behavior as follows. Definition 9.5 (random-coding exponent) The random coding exponent for DMC with generic distribution PY |X is defined by Er (R) , sup [−sR + E0 (s)] , 0≤s≤1

where

 E0 (s) , sup − log PX

!1+s  X X y∈Y

1/(1+s)

PX (x)PY |X

(y|x)

.

x∈X

Based on the reformulation, the properties of Er (R) can be realized via the analysis of function E0 (s). Lemma 9.6 (properties of Er (R)) 1. Er (R) is convex and non-increasing in R; hence, it is a strict decreasing and continuous function of R ∈ [0, ∞). 2. Er (C) = 0, where C is channel capacity, and Er (C − δ) > 0 for all 0 < δ < C. 3. There exists Rcr such that for 0 < R < Rcr , the slope of Er (R) is −1.

185

Proof: 1. Er (R) is convex and non-increasing, since it is the supremum of affine functions −sR+E0 (s) with non-positive slope −s. Hence, Er (R) is strictly decreasing and continuous for R > 0. 2. −sR + E0 (s) equals zero at R = E0 (s)/s for 0 < s ≤ 1. Hence by the continuity of Er (R), Er (R) = 0 for R = sup0 0 for R < sup0 0. Randomly select 2Mn codewords according to some distribution PXe n . For fixed R > 0, choose a sequence of {Mn }n≥1 with R = limn→∞ (1/n) log Mn . Step 1: Maximum likelihood decoder. Let {c1 , . . . , c2Mn } ∈ X n denote the set of n-tuple block codewords selected, and let the decoding partition for symbol m (namely, the set of channel outputs that classify to m) be Um = {y n : PY n |X n (y n |cm ) > PY n |X n (y n |cm0 ),

for all m0 6= m}.

Those channel outputs that are on the boundary, i.e., for some m and m0 ,

PY n |X n (y n |cm ) = PY n |X n (y n |cm0 ),

will be arbitrarily assigned to either m or m0 . Step 2: Property of indicator function for s = 1. Let φm be the indicator function of Um . Then for all s > 0. )s ( X  PY n |X n (y n |cm0 ) 1/(1+s) . 1 − φm (y n ) ≤ PY n |X n (y n |cm ) m0 6=m (Note that this step is the same as random coding exponent, except only s = 1 is considered.) By taking s = 1, we have ( ) X  PY n |X n (y n |cm0 ) 1/2 1 − φm (y n ) ≤ . n P Y n |X n (y |cm ) 0 m 6=m

190

Step 3: Probability of error given codeword cm is transmitted. Let Pe|m (∼C2Mn ) denote the probability of error given codeword cm is transmitted. Then Pe|m (∼C2Mn ) X ≤ PY n |X n (y n |cm ) y n 6∈Um

=

X

PY n |X n (y n |cm )[1 − φm (y n )]

y n ∈Y n

1/2 ) n n |X n (y |cm0 ) P Y PY n |X n (y n |cm ) PY n |X n (y n |cm ) y n ∈Y n m0 6=m,1≤m0 ≤2Mn ( ) X X  PY n |X n (y n |cm0 ) 1/2 PY n |X n (y n |cm ) PY n |X n (y n |cm ) y n ∈Y n 1≤m0 ≤2Mn (  1/2 ) n X X n |X n (y |cm0 ) P Y PY n |X n (y n |cm ) PY n |X n (y n |cm ) y n ∈Y n 1≤m0 ≤2Mn ( ) X X q PY n |X n (y n |cm )PY n |X n (y n |cm0 ) (





=

=

X



X

1≤m0 ≤2Mn

y n ∈Y n

Step 4: Standard inequality for s0 = 1/ρ ≥ 1. It is known that for any 0 < ρ = 1/s0 ≤ 1, we have !ρ X X ρ ai ≤ ai , i

i

for all non-negative sequence3 {ai }. P ρ P Proof: Let f (ρ) = ( i aρi ) / . Then we need to show f (ρ) ≥ 1 when 0 < ρ ≤ 1. j aj P P Let pi = ai /( k ak ), and hence, ai = pi ( k ak ). Take it to the numerator of f (ρ): P ρ P ρ i( i pP k ak ) f (ρ) = ρ ( j aj ) X ρ = pi 3

i ρ i log(pi )pi

P

Now since ∂f (ρ)/∂ρ = ≤ 0 (which implies f (ρ) is non-increasing in ρ) and f (1) = 1, we have the desired result. 2

191

Hence, ( X

ρ Pe|m (∼C2Mn ) ≤

1≤m0 ≤2Mn

X q

X 1≤m0 ≤2Mn

)!ρ

y n ∈Y n

( ≤

X q PY n |X n (y n |cm )PY n |X n (y n |cm0 )

)ρ PY n |X n (y n |cm )PY n |X n (y n |cm0 )

y n ∈Y n

ρ Step 5: Expectation of Pe|m ( ∼C2Mn ). ρ E[Pe|m (∼C2Mn )] " ( )ρ # X X q ≤ E PY n |X n (y n |cm )PY n |X n (y n |cm0 ) 1≤m0 ≤2Mn

y n ∈Y n

)ρ # X q PY n |X n (y n |cm )PY n |X n (y n |cm0 )

"( X

=

E

1≤m0 ≤2M

y n ∈Y n

n

"(

X q

= 2Mn E

)ρ # PY n |X n (y n |cm )PY n |X n (y n |cm0 )

.

y n ∈Y n

Step 6: Lemma 9.8. From Lemma 9.8, there exists one codebook with size Mn such that Pe|m (∼CMn ) ρ ≤ 21/ρ E 1/ρ [Pe|m (∼C2Mn )] )ρ #!1/ρ "( X q PY n |X n (y n |cm )PY n |X n (y n |cm0 ) ≤ 21/ρ 2Mn E y n ∈Y n

"( = (4Mn )1/ρ E 1/ρ

X q PY n |X n (y n |cm )PY n |X n (y n |cm0 )

)ρ #

y n ∈Y n

( 0 0 = (4Mn )s E s 

X q PY n |X n (y n |cm )PY n |X n (y n |cm0 )

)1/s0 

y n ∈Y n

 0 = (4Mn )s 

X

X

PXe n (xn )PXe n ((x0 )n )

xn ∈X n (x0 )n ∈X n

( ×

X q PY n |X n (y n |xn )PY n |X n (y n |(x0 )n )

y n ∈Y n

192

)1/s0 s0  .



By using the fact that PXe n and PY n |X n are product distributions with identical marginal, and taking logarithmic operation on both sides of the above inequality, followed by taking the liminf operation, the proof is completed by the fact that the lower bound holds for any s0 ≥ 1 and any PXe . 2 As a result of the expurgated exponent obtained, it only improves the random coding exponent at lower rate. It is even worse than random coding exponent for rate higher Rcr . One possible reason for its being worse at higher rate may be due to the replacement of the indicator-function argument )s ( X  PY n |X n (y n |cm0 ) 1/(1+s) . 1 − φm (y n ) ≤ n P Y n |X n (y |cm ) 0 m 6=m by PY n |X n (y n |cm0 ) 1 − φm (y ) ≤ PY n |X n (y n |cm ) n



1/2 .

Note that )s X  PY n |X n (y n |cm0 ) 1/(1+s) X  PY n |X n (y n |cm0 ) 1/2 min ≤ . s>0 PY n |X n (y n |cm ) PY n |X n (y n |cm ) m0 6=m m0 6=m (

The improvement of expurgated bound over random coding bound at lower rate may also reveal the possibility that the contribution of “bad codes” under unbiased random coding argument is larger at lower rate; hence, suppressing the weights from “bad codes” yield better result.

9.2.1

The properties of expurgated exponent

Analysis of Eex (R) is very similar to that of Er (R). We first re-write its formula. Definition 9.11 (expurgated exponent) The expurgated exponent for DMC with generic distribution PY |X is defined by Eex (R) , max [−sR + Ex (s)] , s≥1

where  Ex (s) , max −s log PX

XX x∈X

PX (x)PX (x0 )

x0 ∈X

Xq y∈Y

193

!1/s  PY |X (y|x)PY |X (y|x0 )

.

The properties of Eex (R) can be realized via the analysis of function Ex (s) as follows. Lemma 9.12 (properties of Eex (R)) 1. Eex (R) is non-increasing. 2. Eex (R) is convex in R. (Note that the first two properties imply that Eex (R) is strictly decreasing.) 3. There exists Rcr such that for R > Rcr , the slope of Er (R) is −1. Proof: 1. Again, the slope of Eex (R) is equal to the maximizer s∗ for Eex (R) times −1. Therefore, the slope of Eex (R) is non-positive, which certainly implies the desired property. 2. Similar to the proof of the same property for random coding exponent. 3. Similar to the proof of the same property for random coding exponent. 2 Example 9.13 For BSC with crossover probability ε, the expurgated exponent becomes Eex (R) =

   1/s  p 2 2 + (1 − p) max max −sR − s log p + 2p(1 − p) 2 ε(1 − ε)

1≥p≥0 s≥1

where (p, 1 − p) is the input distribution. Note that the input distribution achieving Eex (R) is uniform, i.e., p = 1/2. The expurgated exponent, as well as random coding exponent, for ε = 0.2 is depicted in Figures 9.3 and 9.4.

9.3

Partitioning bound: an upper bounds for channel reliability

In the previous sections, two lower bounds of the channel reliability, random coding bound and expurgated bound, are discussed. In this section, we will start to introduce the upper bounds of the channel reliability function for DMC, which includes partitioning bound, sphere-packing bound and straight-line bound.

194

0.12 0.1 0.08 0.06 0.04 0.02 C 0 0

0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16

0.192745

Figure 9.3: Expurgated exponent (solid line) and random coding exponent (dashed line) for BSC with crossover probability 0.2 (over the range of (0, 0.192745)).

The partitioning bound relies heavily on theories of binary hypothesis testing. This is because the receiver end can be modeled as: H0 : cm = codeword transmitted H1 : cm 6= codeword transmitted, where m is the final decision made by receiver upon receipt of some channel outputs. The channel decoding error given that codeword m is transmitted becomes the type II error, which can be computed using the theory of binary hypothesis testing. Definition 9.14 (partitioning bound) For a DMC with marginal PY |X , Ep (R) , max

min

PX {PYe |X : I(PX ,PYe |X )≤R}

D(PYe |X kPY |X |PX ).

One way to explain the above definition of partitioning bound is as follows. For a given i.i.d. source with marginal PX , to transmit it into a dummy channel with rate R and dummy capacity I(PX , PYe |X ) ≤ R is expected to have poor performance (note that we hope the code rate to be smaller than capacity). We can expect that those “bad” noise patterns whose composition transition probability PYe |X satisfies I(PX , PYe |X ) ≤ R will contaminate the codewords more

195

0.11

0.105

0.1 0

0.001

0.002

0.003

0.004

0.005

0.006

Figure 9.4: Expurgated exponent (solid line) and random coding exponent (dashed line) for BSC with crossover probability 0.2 (over the range of (0, 0.006)).

seriously than other noise patterns. Therefore for any codebook with rate R, the probability of decoding error is lower bounded by the probability of these “bad” noise patterns, since the “bad” noise patterns anticipate to unrecoverably distort the codewords. Accordingly, the channel reliability function is upper bounded by the exponent of the probability of these “bad” noise patterns. Sanov’s theorem already reveal that for any set consisting of all elements with the same compositions, its exponent is equal to the minimum divergence of the composition distribution against the true distribution. This is indeed the case for the set of “bad” noise patterns. Therefore, the exponent of the probability of these “bad” noise patterns is given as: min

{PYe |X : I(PX ,PYe |X )≤R}

D(PYe |X kPY |X |PX ).

We can then choose the best input source to yield the partitioning exponent. The partitioning exponent can be easily re-formulated in terms of divergences. This is because the mutual information can actually be written in the form of

196

divergence. I(PX , PYe |X ) ,

XX x

=

X

PX (x)PYe |X (y|x) log

y

PX (x)

x

=

X

X

PX Ye (x, y) PX (x)PYe (y) PYe |X (y|x)

PYe |X (y|x) log

PYe (y)

y

PX (x)D(PYe |X kPYe |X = x)

x

= D(PYe |X kPYe |PX ) Hence, the partitioning exponent can be written as: Ep (R) , max

min

PX {PYe |X : D(PYe |X kPYe |PX )≤R}

D(PYe |X kPY |X |PX ).

In order to apply the theory of hypothesis testing, the partitioning bound needs to be further re-formulated. Observation 9.15 (partitioning bound) If PXe and PYe |X are distributions P that achieves Ep (R), and PYe (y) = x∈X PXe (x)PYe |X (y|x), then Ep (R) , max

min

PX {PYˆ |X : D(PYˆ |X kPYe |PX )≤R}

D(PYˆ |X kPY |X |PX ).

In addition, the distribution PYˆ |X that achieves Ep (R) is a tilted distribution between PY |X and PYe , i.e., Ep (R) , max D(PYλ |X kPY |X |PX ), PX

where PYλ |X is a tilted distribution4 between PY |X and PYe , and λ is the solution of D(PYλ kPYe |PX ) = R. Theorem 9.16 (partitioning bound) For a DMC with marginal PY |X , for any ε > 0 arbitrarily small, E(R + ε) ≤ Ep (R). 4

PYλ |X (y|x) , P

PYeλ (y)PY1−λ |X (y|x)

y 0 ∈Y

0 PYeλ (y 0 )PY1−λ |X (y |x)

197

.

Proof: Step 1: Hypothesis testing. Let code size Mn = d2enR e ≤ en(R+ε) for n > ln2/ε, and let {c1 , . . . , cMn } be the best codebook. (Note that using a code with smaller size should yield a smaller error probability, and hence, a larger exponent. So, if this larger exponent is upper bounded by Ep (R), the error exponent for code size exp{R + ε} should be bounded above by Ep (R).) Suppose that PXe and PYe |X are the distributions achieving Ep (R). Define P PYe (y) = x∈X PXe (x)PYe |X (y|x). Then for any (mutually-disjoint) decoding partitions at the output site, U1 , U2 , . . . , UMn , X PYe n (Um0 ) = 1, 1≤m0 ≤Mn

which implies the existence of m satisfying PYe n (Um ) ≤

2 . Mn

(Note that there are at least Mn /2 codewords satisfying this condition.) Construct the binary hypothesis testing problem of H0 : PYe n (·) against H1 : PY n |X n (·|cm ), c (i.e., the acand choose the acceptance region for null hypothesis as Um ceptance region for alternative hypothesis is Um ), the type I error

αn , P r(Um |H0 ) = PYe n (Um ) ≤

2 , Mn

and the type II error c c βn , P r(Um |H1 ) = PY n |X n (Um |cm ),

which is exactly, Pe|m , the probability of decoding error given codeword cm is transmitted. We therefore have   2 −nR Pe|m ≥ min βn ≥ min βn . Since ≤e αn ≤2/Mn Mn αn ≤e−nR

198

Then from Theorem 8.34, the best exponent

5

for Pe|m is

n

lim sup n→∞

1X D(PYλ,i kPY |X (·|ai )), n i=1

(9.3.1)

where ai is the ith component of cm , PYλ,i is the tilted distribution between PY |X (·|ai ) and PYe , and n

lim inf n→∞

1X D(PYλ,i kPYe ) = R. n i=1

If we denote the composition distribution with respect to cm as P (cm ) , and observe that the number of different tilted distributions (i.e., PYλ,i ) is at most |X |, the quantity of (9.3.1) can be further upper bounded by n

1X D(PYλ,i kPY |X (·|ai )) lim sup n→∞ n i=1 X ≤ lim sup P (cm ) (x)D(PYλ,x kPY |X (·|x)) n→∞

x∈X

≤ max D(PYλ kPY |X |PX ) PX

= D(PYλ kPY |X |PXe ) = Ep (R).

(9.3.2)

Besides, D(PYλ kPYe |PXe ) = R. Step 2: A good code with rate R. Within the Mn ≈ 2enR codewords, there are at least Mn /2 codewords satisfying PYe n (Um ) ≤

2 , Mn

and hence, their Pe|m should all satisfy (9.3.2). Consequently, if B , {m : those m satisfying step 1}, 5

According to Theorem 8.34, we need to consider the quantile function of the CDF of dPYλn 1 log (Y n ). n dPY n |X=cm

Since both PY n |X (·|cm ) and PYe n are product distributions, PYλn is a product distribution, too. Therefore, by the similar reason as in Example 5.13, the CDF becomes a bunch of unit functions (see Figure 2.4), and its largest quantile becomes (9.3.1).

199

then Pe ,

1 Mn

X

Pe|m

1≤m≤Mn

1 X Pe|m Mn m∈B 1 X min Pe|m ≥ Mn m∈B m∈B ≥

1 |B| min Pe|m m∈B Mn 1 ≥ min Pe|m 2 m∈B 1 = Pe|m0 , (m0 is the minimizer.) 2 =

which implies 1 1 E(R + ε) ≤ lim sup − log Pe ≤ lim sup − log Pe|m0 ≤ Ep (R). n n n→∞ n→∞ 2 The partitioning upper bound can be re-formulated to shape similar as random coding lower bound. Lemma 9.17  Ep (R) = max max −sR − log s≥0

PX

!1+s  X X y∈Y

PX (x)PY |X (y|x)1/(1+s)



x∈X

= max [−sR − E0 (s)] . s≥0

Recall that the random coding exponent is max0 0, there exists a universal constant α = α(γ) ∈ (0, 1) (independent of blocklength n) and a codebook sequence {∼Cn,αM }n≥1 such that 1 min dm (∼Cn,αM ) n 0≤m≤αM −1     1 > inf a ∈ < : lim sup P r Dm > a = 0 − γ, (9.4.2) n n→∞ for infinitely many n; 2. for any γ > 0, there exists a universal constant α = α(γ) ∈ (0, 1) (independent of blocklength n) and a codebook sequence {∼Cn,αM }n≥1 such that 1 min dm (∼Cn,αM ) n 0≤m≤αM −1     1 > inf a ∈ < : lim inf P r Dm > a = 0 − γ, (9.4.3) n→∞ n for sufficiently large n. Proof: We will only prove (9.4.2). (9.4.3) can be proved by simply following the same procedure. Define

    1 ¯ Dm > a = 0 . LX (R) , inf a : lim sup P r n n→∞

Let 1(A) be the indicator function of a set A, and let   1 ¯ φm , 1 Dm > LX (R) − γ . n ¯ X (R), By definition of L 

 1 ¯ lim sup P r Dm > LX (R) − γ > 0. n n→∞   ¯ X (R) − γ . Then for infinitely many n, Let 2α , lim supn→∞ P r (1/n)Dm > L   1 ¯ X (R) − γ > α. Pr Dm > L n

209

For those n that satisfies the above inequality, "M −1 # M −1 X X E φm = E [φm ] > αM, m=0

m=0

which implies that among all possible selections, there exist (for infinite many n) a codebook ∼Cn,M in which αM codewords satisfy φm = 1, i.e., 1 ¯ X (R) − γ dm (∼Cn,M ) > L n for at least αM codewords in the codebook ∼Cn,M . The collection of these αM codewords is a desired codebook for the validity of (9.4.2). 2 Our second lemma concerns the spectrum of (1/n)Dm . Lemma 9.23 Let each codeword be independently selected through the distriˆ n is independent of, and has the same distribution bution PX n . Suppose that X as, X n . Then    M 1 1  ˆ n n Pr Dm > a ≥ P r µn X , X > a . n n 

Proof: Let C (n) m denote the m-th randomly selected codeword. From the definition of Dm , we have   (n) 1 Pr Dm > a C m n ( ) 1  (n) (n)  = Pr min µn C mˆ , C m > a C (n) 0≤m≤M ˆ −1 n m m6 ˆ =m   Y (n) 1  (n) (n)  = Pr µn C mˆ , C m > a C m (9.4.4) n 0≤m≤M ˆ −1 m6 ˆ =m

M −1   n 1 n n ˆ = Pr µn (X , X ) > a X , n where (9.4.4) holds because   1  (n) (n)  µn C mˆ , C m n 0≤m≤M ˆ −1 m6 ˆ =m

210

(n) is conditionally independent given C m . Hence,   1 Dm > a Pr n M −1 Z   1 n n ˆ = Pr µn ( X , x ) > a dPX n (xn ) n n X M Z   1 n n ˆ ,x ) > a Pr µn ( X dPX n (xn ) ≥ n Xn "  M # n 1  ˆ n n = EX n Pr µn X , X > a X n    n 1  ˆ n n M ≥ EX n P r µn X , X > a X n   M 1  ˆ n n µn X , X > a , = Pr n

(9.4.5)

where (9.4.5) follows from Lyapounov’s inequality [1, page 76], i.e., E 1/M [U M ] ≥ E[U ] for a non-negative random variable U . 2 We are now ready to prove the main theorem of the paper. For simplicity, ˆ n and X n are used specifically to denote two indepenthroughout the article, X dent random variables having common distribution PX n . Theorem 9.24 (distance-spectrum formula) ¯ X (R) ≥ lim sup sup Λ X

n→∞

and sup ΛX (R) ≥ lim inf X

n→∞

dn,M ¯ X (R + δ) ≥ sup Λ n X

(9.4.6)

dn,M ≥ sup ΛX (R + δ) n X

(9.4.7)

for every δ > 0, where ¯ X (R) , inf {a ∈ < : Λ )   M 1 ˆ n, X n) > a lim sup P r µn (X =0 n n→∞ and ΛX (R) , inf {a ∈ < : )   M 1 ˆ n, X n) > a lim inf P r µn (X =0 . n→∞ n

211

Proof: 1. lower bound. Observe that in Lemma 9.22, the rate only decreases by the amount − log(α)/n when employing a code ∼Cn,αM . Also note that for any δ > 0, − log(α)/n < δ for sufficiently large n. These observations, together with Lemma 9.23, imply the validity of the lower bound. 2. upper bound. Again, we will only prove (9.4.6), since (9.4.7) can be proved by simply following the same procedure. To show that the upper bound of (9.4.6) holds, it suffices to prove the existence of X such that ¯ X (R) ≥ lim sup dn,M . Λ n n→∞ ∗ Let X n be uniformly distributed over one of the optimal codes ∼Cn,M . (By “optimal,” we mean that the code has the largest minimum distance among all codes of the same size.) Define

λn ,

1 ∗ ¯ , lim sup λn . min dm (∼Cn,M ) and λ n 0≤m≤M −1 n→∞

Then for any δ > 0, ¯−δ λn > λ

for infinitely many n.

(9.4.8)

For those n satisfying (9.4.8),   1  ˆ n n ¯ µn X , X > λ − δ Pr n   1  ˆ n n ≥ Pr µn X , X ≥ λn n n o ˆ n 6= X n = 1 − 1 , ≥ Pr X M which implies   M 1 n n ¯−δ ˆ ,X ) > λ µn ( X lim sup P r n n→∞  M 1 ≥ lim sup 1 − M n→∞  enR 1 = lim sup 1 − nR = e−1 > 0. e n→∞ ¯ − δ. Since δ can be made arbitrarily small, the ¯ X (R) ≥ λ Consequently, Λ upper bound holds. 2

212

¯ X (R) is non-increasing in R, and hence, the number of Observe that supX Λ discontinuities is countable. This fact implies that ¯ X (R) = lim sup(R + δ) sup Λ δ↓0

X

X

except for countably many R. Similar argument applies to supX ΛX (R). We can then re-phrase the above theorem as appeared in the next corollary. Corollary 9.25 lim sup n→∞

 resp.

dn,M ¯ X (R) = sup Λ n X

dn,M = sup ΛX (R) lim inf n→∞ n X



except possibly at the points of discontinuities of ¯ X (R) (resp. sup ΛX (R)), sup Λ X

X

which are countable. From the above theorem (or corollary), we can characterize the largest minimum distance of deterministic block codes in terms of the distance spectrum. We ¯ X (·) and ΛX (·) will thus name it distance-spectrum formula. For convenience, Λ be respectively called the sup-distance-spectrum function and the inf-distancespectrum function in the remaining article. We conclude this section by remarking that the distance-spectrum formula obtained above indeed have an analogous form to the information-spectrum formulas of the channel capacity and the optimistic channel capacity. Furthermore, by taking the distance metric to be the n-fold Bhattacharyya distance [2, Definition 5.8.3], an upper bound on channel reliability [2, Theorem 10.6.1] can be obtained, i.e., 1 lim sup − log Pe (n, M = enR ) ≤ n n→∞ ( " X 1 sup inf a : lim sup P r − log n n→∞ X y n ∈Y n  iM 1/2 1/2 n ˆn n n PY n |X n (y |X )PY n |X n (y |X ) > a =0

213

and 1 lim inf − log Pe (n, M = enR ) ≤ n→∞ (n " X 1 sup inf a : lim inf P r − log n→∞ n X y n ∈Y n  iM 1/2 1/2 n ˆn n n PY n |X n (y |X )PY n |X n (y |X ) > a =0 , where PY n |X n is the n-dimensional channel transition distribution from code alphabet X n to channel output alphabet Y n , and Pe (n, M ) is the average probability of error for optimal channel code of blocklength n and size M . Note that the formula of the above channel reliability bound is quite different from those formulated in terms of the exponents of the information spectrum (cf. [6, Sec. V] and [17, equation (14)]). B) Determination of the largest minimum distance for a class of distance functions In this section, we will present a minor class of distance functions for which the optimization input X for the distance-spectrum function can be characterized, and thereby, the ultimate largest minimum distance among codewords can be established in terms of the distance-spectrum formula. A simple example for which the largest minimum distance can be derived in terms of the new formula is the probability-of-error distortion measure, which is defined  0, if xˆn = xn ; n n µn (ˆ x ,x ) = n, if xˆn 6= xn . It can be easily shown that ¯ X (R) sup Λ X

≤ inf {a ∈ < : ) M   1 ˆ n, X n) > a µn (X =0 lim sup sup P r n n→∞ X n  1, for 0 ≤ R < log |X |; = 0, for R ≥ log |X |, and the upper bound can be achieved by letting X be uniformly distributed over the code alphabet. Similarly,  1, for 0 ≤ R < log |X |; sup ΛX (R) = 0, for R ≥ log |X |. X

214

Another example for which the optimizer X of the distance-spectrum function can be characterized is the separable distance function defined below. Definition 9.26 (separable distance functions) µn (ˆ xn , xn ) , fn (|gn (ˆ xn ) − gn (xn )|), where fn (·) and gn (·) are real-valued functions. Next, we derive the basis for finding one of the optimization distributions for   1 n n ˆ ,X ) > a sup P r µn (X n Xn under separable distance functionals. o n ˆ Lemma 9.27 Define G(α) , inf U P r U − U ≤ α , where the infimum is taken over all (Uˆ , U ) pair having independent and identical distribution on [0, 1]. Then for j = 2, 3, 4, . . . and 1/j ≤ α < 1/(j − 1), 1 G(α) = . j In addition, G(α) is achieved by uniform distribution over   1 2 j−2 0, , ,..., ,1 . j−1 j−1 j−1 Proof:

≥ ≥

=



n o P r Uˆ − U ≤ α  1 ˆ P r U − U ≤ j (  2 ) j−2   X i + 1 i Pr Uˆ , U ∈ , j j i=0 ( )    j − 1 2 +P r Uˆ , U ∈ ,1 j   2 j−2  X i i+1 Pr U ∈ , j j i=0    2 j−1 + Pr U ∈ ,1 j 1 . j

215

Achievability of G(α) by uniform distribution over   1 2 j−2 0, , ,..., ,1 j−1 j−1 j−1 can be easily verified, and hence, we omit here.

2

Lemma 9.28 For 1/j ≤ α < 1/(j − 1), n o 1 1 1 ≤ inf P r Uˆn − Un ≤ α ≤ + , Un j j n + 0.5 where the infimum is taken over all (Uˆn , Un ) pair having independent and identical distribution on   1 2 n−1 0, , , . . . , ,1 , n n n and j = 2, 3, 4, . . . etc. Proof: The lower bound follows immediately from Lemma 9.27. To prove the upper bound, let Un∗ be uniformly distributed over   ` 2` k` 0, , , . . . , , 1 , n n n where ` = bnαc + 1 and k is an integer satisfying n n −1≤k < . n/(j − 1) + 1 n/(j − 1) + 1 (Note that   n n +1≥ + 1 ≥ `.) j−1 j−1 Then n o n o inf P r Uˆn − Un ≤ α ≤ P r Uˆn∗ − Un∗ ≤ α Un

=

1 k+2

1 1 +1 1/(j − 1) + 1/n 1 (j − 1)2 1 = + 2 j j n + (j − 1)/j 1 1 ≤ + . j n + 0.5



216

2 Based on the above Lemmas, we can then proceed to compute the asymptotic largest minimum distance among codewords of the following examples. It needs to be pointed out that in these examples, our objective is not to attempt to solve any related problems of practical interests, but simply to demonstrate the computation of the distance spectrum function for general readers. Example 9.29 Assume that the n-tuple code alphabet is {0, 1}n . Let the n-fold distance function be defined as µn (ˆ xn , xn ) , |u(ˆ xn ) − u(xn )| where u(xn ) represents the number of 1’s in xn . Then ¯ X (R) sup Λ X  ≤ inf a ∈ < : lim sup n→∞

) M  1 ˆn =0 sup P r u(X ) − u(X n ) > a n n X  = inf a ∈ < : lim sup 

n→∞

) M   1 ˆn n =0 1 − infn P r u(X ) − u(X ) ≤ a X n  ≤ inf a ∈ < : lim sup n→∞  oexp{nR}  n ˆ =0 1 − inf P r U − U ≤ a U ( )  exp{nR} 1 = inf a ∈ < : lim sup 1 − = 0 = 0. d1/ae n→∞ Hence, the asymptotic largest minimum distance among block codewords is zero. This conclusion is not surprising because the code with nonzero minimal distance should contain the codewords of different Hamming weights and the whole number of such words is n + 1. Example 9.30 Assume that the code alphabet is binary. Define µn (ˆ xn , xn ) , log2 (|gn (ˆ xn ) − gn (xn )| + 1) ,

217

where gn (xn ) = xn−1 · 2n−1 + xn−2 · 2n−2 + · · · + x1 · 2 + x0 . Then ¯ X (R) sup Λ X  ≤ inf a ∈ < : lim sup sup n→∞

Xn

) M     1 ˆ n ) − gn (X n ) + 1 > a log2 gn (X Pr =0 n  ≤ inf a ∈ < : lim sup n→∞ )   2na − 1 exp{nR} ˆ 1 − inf P r U − U ≤ n =0 U 2 −1    exp{nR}           1     =0 = inf a ∈ < : lim sup 1 −   2n − 1  n→∞       na 2 −1 R = 1− . log 2

(9.4.9)

By taking X ∗ to be the one under which UK , gn (X n )/(2n − 1) has the distribution as used in the proof of the upper bound of Lemma 9.28, where K , 2n − 1,

218

we obtain ¯ X ∗ (R) Λ  ≥ inf a ∈ < : lim sup n→∞

exp{nR}



  1 1 −  1  −   2n − 1 K + 0.5  2na − 1  = inf a ∈ < : lim sup

     =0    

n→∞

exp{nR}



  1 1 −  1  −  n  n 2 −1 2 − 0.5  2na − 1 R . = 1− log 2

     =0    

This proved the achievability of (9.4.9). C) General properties of distance-spectrum function ¯ X (R) and ΛX (R). We next address some general functional properties of Λ ¯ X (R) and ΛX (R) are non-increasing and right-continuous Lemma 9.31 1. Λ functions of R. 2.

¯ X (R) ≥ D ¯ 0 (X) , lim sup ess inf 1 µn (X ˆ n, X n) Λ (9.4.10) n n→∞ 1 ˆ n, X n) (9.4.11) ΛX (R) ≥ D0 (X) , lim inf ess inf µn (X n→∞ n where ess inf represents essential infimum9 . In addition, equality holds for (9.4.10) and (9.4.11) respectively when ¯ 0 (X) , lim sup − 1 log P r{X ˆ n = X n} R>R n n→∞

9

For a given random variable Z, its essential infimum is defined as ess inf Z , sup{z : P r[Z ≥ z] = 1}.

219

(9.4.12)

and

1 ˆ n = X n }, R > R0 (X) , lim inf − log P r{X n→∞ n provided that (∀ xn ∈ X n ) min µn (ˆ xn , xn ) = µn (xn , xn ) = constant. n n x ˆ ∈X

(9.4.13)

(9.4.14)

3. ¯ X (R) ≤ D ¯ p (X) Λ

(9.4.15)

ΛX (R) ≤ Dp (X)

(9.4.16)

¯ p (X); and for R > R for R > Rp (X), where ¯ p (X) , lim sup 1 E[µn (X ˆ n , X n )|µn (X ˆ n , X n ) < ∞] D n→∞ n 1 ˆ n , X n )|µn (X ˆ n , X n ) < ∞] E[µn (X n→∞ n ¯ p (X) , lim sup − 1 log P r{µn (X ˆ n , X n ) < ∞} R n n→∞

Dp (X) , lim inf

and

1 ˆ n , X n ) < ∞}. Rp (X) , lim inf − log P r{µn (X n→∞ n In addition, equality holds for (9.4.15) and (9.4.16) respectively when R = ¯ p (X) and R = Rp (X), provided that [µn (X ˆ n , X n )|µn (X ˆ n , X n ) < ∞] has R the large deviation type of behavior, i.e., for all δ > 0, 1 lim inf − log Ln > 0, n→∞ n where

 Ln , P r

(9.4.17)

 1 (Yn − E[Yn |Yn < ∞]) ≤ −δ Yn < ∞ , n

ˆ n , X n ). and Yn , µn (X ¯ p (X), Λ ¯ X (R) = ∞. Similarly, for 0 < R < Rp (X), 4. For 0 < R < R ΛX (R) = ∞. ¯ X (R) will be provided. The properties Proof: Again, only the proof regarding Λ of ΛX (R) can be proved similarly. 1. Property 1 follows by definition.

220

2. (9.4.10) can be proved as follows. Let 1 ˆ n , X n ), en (X) , ess inf µn (X n ¯ 0 (X) = lim supn→∞ en (X). Observe that for any δ > 0 and and hence, D for infinitely many n,   M 1 n n ˆ ¯ Pr µn (X , X ) > D0 (X) − 2δ n M   1 n n ˆ , X ) > en (X) − δ µn ( X = 1. ≥ Pr n ¯ X (R) ≥ D ¯ 0 (X)−2δ for arbitrarily small δ > 0. This completes Therefore, Λ the proof of (9.4.10). To prove the equality condition for (9.4.10), it suffices to show that for any δ > 0, ¯ X (R) ≤ D ¯ 0 (X) + δ Λ (9.4.18) for

n o 1 n n ˆ R > lim sup − log P r X 6= X . n n→∞ By the assumption on the range of R, there exists γ > 0 such that n o 1 ˆ n 6= X n + γ for sufficiently large n. (9.4.19) R > − log P r X n ¯ 0 (X) + δ/2 and (9.4.19) (of which Then for those n satisfying en (X n ) ≤ D there are sufficiently many)   M 1 n n ˆ ,X ) > D ¯ 0 (X) + δ Pr µn ( X n   M 1 δ n n n ˆ , X ) > en (X ) + ≤ Pr µn ( X n 2  n oM ˆ n 6= X n (9.4.20) ≤ Pr X  n oM ˆ n = Xn = 1 − Pr X , where (9.4.20) holds because (9.4.14). Consequently,   M 1 n n ˆ ,X ) > D ¯ 0 (X) + δ lim sup P r µn (X =0 n n→∞ which immediately implies (9.4.18).

221

¯ p (X) = ∞. Thus, without loss of generality, we 3. (9.4.15) holds trivially if D ¯ assume that Dp (X) < ∞. (9.4.15) can then be proved by observing that for any δ > 0 and sufficiently large n, M   1 2 ¯ Yn > (1 + δ) · Dp (X) Pr n M   1 1 Yn > (1 + δ) · E[Yn |Yn < ∞] ≤ Pr n n = (P r {Yn > (1 + δ) · E[Yn |Yn < ∞]})M = (P r {Yn = ∞} + P r {Yn < ∞} P r {Yn > (1 + δ) · E[Yn |Yn < ∞]| Yn < ∞})M  M 1 ≤ P r {Yn = ∞} + P r {Yn < ∞} 1+δ  M δ = 1− P r {Yn < ∞} , 1+δ

(9.4.21)

where (9.4.21) follows from Markov’s inequality. Consequently, for R > ¯ p (X), R   M 1 n n 2 ˆ , X ) > (1 + δ) · D ¯ p (X) lim sup P r µn (X = 0. n n→∞ ¯ p (X), it suffices to show To prove the equality holds for (9.4.15) at R = R ¯ ¯ ¯ ¯ X (R) is rightthe achievability of ΛX (R) to Dp (X) by R ↓ Rp (X), since Λ continuous. This can be shown as follows. For any δ > 0, we note from (9.4.17) that there exists γ = γ(δ) such that for sufficiently large n,   1 1 Pr Yn − E[Yn |Yn < ∞] ≤ −δ Yn < ∞ ≤ e−nγ . n n

222

Therefore, for infinitely many n, M   1 ¯ p (X) − 2δ Yn > D Pr n   M 1 1 ≥ Pr Yn > E[Yn |Yn < ∞] − δ n n = (P r {Yn = ∞} + P r {Yn < ∞} M  1 1 Yn > E[Yn |Yn < ∞] − δ Yn < ∞ Pr n n    1 1 = 1 − Pr Yn − E[Yn |Yn < ∞] ≤ −δ Yn < ∞ n n P r {Yn < ∞})M ≥

M 1 − e−nγ · P r {Yn < ∞} .

¯ p (X) < R < R ¯ p (X) + γ, Accordingly, for R M   1 ¯ p (X) − 2δ > 0, lim sup P r Yn > D n n→∞ which in turn implies that ¯ X (R) ≥ D ¯ p (X) − 2δ. Λ ¯ p (X). This completes the proof of achievability of (9.4.15) by R ↓ R 4. This is an immediate consequence of   M 1 n n ˆ (∀ L > 0) Pr µn ( X , X ) > L n  n oM n n ˆ ≥ 1 − P r µn (X , X ) < ∞ . 2 Remarks. • A weaker condition for (9.4.12) and (9.4.13) is R > lim sup n→∞

1 1 log |Sn | and R > lim inf log |Sn |, n→∞ n n

where Sn , {xn ∈ X n : PX n (xn ) > 0}. This indicates an expected result that when the rate is larger than log |X |, the asymptotic largest minimum ¯ 0 (X) (resp. distance among codewords remains at its smallest value supX D supX D0 (X)), which is usually zero.

223

¯ X (R) and the spec• Based on Lemma 9.31, the general relation between Λ n n ˆ trum of (1/n)µn (X , X ) can be illustrated as in Figure 9.7, which shows ¯ X (R) lies asymptotically within that Λ ess inf

1 ˆn n µ(X , X ) n

and

1 ˆ n , X n )|µn (X ˆ n , X n )] E[µn (X n ¯ p (X) < R < R ¯ 0 (X). Similar remarks can be made on ΛX (R). On for R ¯ X (R) (similarly for ΛX (R)) can be the other hand, the general curve of Λ plotted as shown in Figure 9.8. To summarize, we remark that under fairly general situations,  ¯ p (X); for 0 < R < R   =∞  ¯ at R = Rp (X); ¯ X (R) = Dp (X) Λ ¯ ¯ ∈ (D0 (X), Dp (X)] for Rp (X) < R < R0 (X);    ¯ 0 (X) =D for R ≥ R0 (X). • A simple universal upper bound on the largest minimum distance among block codewords is the Plotkin bound. Its usual expression is given by [10] for which a straightforward generalization (cf. Appendix B) is 1 ˆ n , X n )]. sup lim sup E[µn (X n n→∞ X Property 3 of Lemma 9.31, however, provides a slightly better form for the general Plotkin bound. We now, based on Lemma 9.31, calculate the distance-spectrum function of the following examples. The first example deals with the case of infinite code alphabet, and the second example derives the distance-spectrum function under unbounded generalized distance measure. Example 9.32 (continuous code alphabet) Let X = [0, 1), and let the marginal distance metric be µ(x1 , x2 ) = min {1 − |x1 − x2 |, |x1 − x2 |} . Note that the metric is nothing but treating [0, 1) as a circle (0 and 1 are glued together), and then to measure the shorter distance between two positions. Also, the additivity property is assumed for the n-fold distance function, Pn n n i.e., µn (ˆ x , x ) , i=1 µ(ˆ xi , xi ).

224

Using the product X of uniform distributions over X , the sup-distancespectrum function becomes  ¯ X (R) = inf a ∈ < : lim sup Λ n→∞  ( n )!M  X 1 ˆ 1 − Pr µ(Xi , Xi ) ≤ a =0 .  n i=1

By Cram´er Theorem [4], 1 lim − P r n→∞ n where

(

n

1X ˆ µ(Xi , Xi ) ≤ a n i=1

) = IX (a),

n o ˆ −t·µ(X,X) IX (a) , sup −ta − log E[e ] t>0

is the large deviation rate function10 . Since IX (a) is convex in a, there exists a supporting line to it satisfying h ∗ i ˆ IX (a) = −t∗ a − log E e−t ·µ(X,X) , which implies h i ∗ ˆ a = −s∗ IX (a) − s∗ · log E e−µ(X,X)/s for s∗ , 1/t∗ > 0; or equivalently, the inverse function of IX (·) is given as h i ∗ ˆ −1 ∗ ∗ −µ(X,X)/s IX (R) = −s R − s · log E e (9.4.22) n h io ˆ = sup −sR − s · log E e−µ(X,X)/s , s>0

where11 the last step follows from the observation that (9.4.22) is also a support−1 ing line to the convex IX (R). Consequently, ¯ X (R) = inf {a ∈ < : IX (a) < R} Λ    = sup −sR − s · log 2s · 1 − e−1/2s ,

(9.4.23)

s>0 10

We take the range of supremum to be [t > 0] (instead of [t ∈ 0

(9.4.24)

where n o ˆ IX (a) , sup −ta − log E[e−t·µ(X,X) ] t>0   2 + e−t = sup −ta − log . 4 t>0 This curve is plotted in Figure 9.10. It is worth noting that there exists a region obtaining     inf a ∈ < : sup −ta − log E e−tZ < R t>0 n h io = sup −sR − s · log E e−Z/s , s>0

for a random variable Z so that readers do not have to refer to literatures regarding to error exponent function or channel reliability exponent function for the validity of the above equality. This equality will be used later in Examples 9.33 and 9.36, and also, equations (9.4.27) and (9.4.29).

226

that the sup-distance-spectrum function is infinity. This is justified by deriving ¯ 0 (X) = − log P r{X ˆ = X} = log 2; R ¯ 0 (X) = ess infµ(X, ˆ X) = 0; D ¯ p (X) = − log P r{µ(X, ˆ X) < ∞} = log 4 ; R 3 1 ¯ p (X) = E[µ(X, ˆ X)|µ(X, ˆ X) < ∞] = . D 3 One can draw the same conclusion by simply taking the derivative of (9.4.24) with respect to s, and obtaining that the derivative −R − log 2 + e

−1/s



e−1/s − + log(4) s (2 + e−1/s )

is always positive when R < log(4/3). Therefore, when 0 < R < log(4/3), the distance-spectrum function is infinity. From the above two examples, it is natural to question whether the formula of the largest minimum distance can be simplified to the quantile function12 of the large deviation rate function (cf. (9.4.23) and (9.4.24)), especially when the distance functional is symmetric and additive. Note that the quantile function of the large deviation rate function is exactly the well-known Varshamov-Gilbert lower bound (cf. the next section). This inquiry then becomes to find the answer of an open question: under what conditions is the Varshamov-Gilbert lower bound tight? Some insight on this inquiry will be discussed in the next section. D) General Varshamov-Gilbert lower bound In this section, a general Varshamov-Gilbert lower bound will be derived directly from the distance-spectrum formulas. Conditions under which this lower bound is tight will then be explored. ¯ X (R) and ΛX (R)) Lemma 9.34 (large deviation formulas for Λ  ¯ X (R) = inf a ∈ < : `¯X (a) < R Λ and ΛX (R) = inf {a ∈ < : `X (a) < R} 12

Note that the usual definition [1, page 190] of the quantile function of a non-decreasing function F (·) is defined as: sup{θ : F (θ) < δ}. Here we adopt its dual definition for a nonincreasing function I(·) as: inf{a : I(a) < R}. Remark that if F (·) is strictly increasing (resp. I(·) is strictly decreasing), then the quantile is nothing but the inverse of F (·) (resp. I(·)).

227

where `¯X (a) and `X (a) are respectively the sup- and the inf-large deviation ˆ n , X n ), defined as spectrums of (1/n)µn (X   1 1 n n ¯ ˆ µn ( X , X ) ≤ a `X (a) , lim sup − log P r n n n→∞ and

1 `X (a) , lim inf − log P r n→∞ n



 1 n n ˆ ,X ) ≤ a . µn (X n

¯ X (R). All the properties of Proof: We will only provide the proof regarding Λ ΛX (R) can be proved by following similar arguments.  ¯ , inf a ∈ < : `¯X (a) < R . Then for any γ > 0, Define λ ¯ + γ) < R `¯X (λ

(9.4.25)

¯ − γ) ≥ R. `¯X (λ

(9.4.26)

and Inequality (9.4.25) ensures the existence of δ = δ(γ) > 0 such that for sufficiently large n,   1 n n ¯ + γ ≥ e−n(R−δ) , ˆ ,X ) ≤ λ µn ( X Pr n which in turn implies 



1 ¯+γ ˆ n, X n) > λ lim sup P r µn (X n n→∞ enR ≤ lim sup 1 − e−n(R−δ) = 0.

M

n→∞

¯ + γ. On the other hand, (9.4.26) implies the existence of ¯ X (R) ≤ λ Hence, Λ ∞ subsequence {nj }j=1 satisfying   1 1 n n ¯ − γ ≥ R, ˆ j, X j) ≤ λ µn ( X lim − log P r j→∞ nj nj j which in turn implies 



1 ¯−γ ˆ n, X n) > λ lim sup P r µn (X n n→∞ enj R ≥ lim sup 1 − e−nj R = e−1 .

M

j→∞

¯ − γ. Since γ is arbitrary, the lemma therefore holds. 2 Accordingly, ΛX (R) ≥ λ

228

¯ X (·) (resp. The above lemma confirms that the distance spectrum function Λ ΛX (·)) is exactly the quantile of the sup- (resp. inf-) large deviation spectrum ˆ n , X n )}n>1 Thus, if the large deviation spectrum is known, so is of {(1/n)µn (X the distance spectrum function. By the generalized G¨artner-Ellis upper bound derived in [5, Thm. 2.1], we obtain `¯X (a) ≥ inf I X (x) = I X (a) [x≤a]

and `X (a) ≥ inf I¯X (x) = I¯X (a), [x≤a]

where the equalities follow from the convexity (and hence, continuity and strict decreasing) of I X (x) , sup[θ0

Some remarks on the Varshamov-Gilbert bound obtained above are given below. Remarks.

229

• One can easily see from [2, page 400], where the Varshamov-Gilbert bound is given under Bhattacharyya distance and finite code alphabet, that ¯ X (R) sup G X

and sup GX (R) X

are the generalization of the conventional Varshamov-Gilbert bound. • Since 0 ≤ exp{−µn (ˆ xn , xn )/s} ≤ 1 for s > 0, the function exp{−µ(·, ·)/s} is always integrable. Hence, (9.4.27) and (9.4.29) can be evaluated under any non-negative measurable function µn (·, ·). In addition, no assumption on the alphabet space X is needed in deriving the lower bound. Its full generality can be displayed using, again, Examples 2.8 and 9.33, which result in exactly the same curves as shown in Figures 9.9 and 9.10. ¯ X (R) and GX (R) are both the pointwise supremum of • Observe that G a collection of affine functions, and hence, they are both convex, which immediately implies their continuity and strict decreasing property on the interior of their domains, i.e., ¯ X (R) < ∞} and {R : GX (R) < ∞}. {R : G However, as pointed out in [5], `¯X (·) and `X (·) are not necessarily con¯ X (·) and vex, which in turns hints the possibility of yielding non-convex Λ ΛX (·). This clearly indicates that the Varshamov-Gilbert bound is not tight whenever the asymptotic largest minimum distance among codewords is non-convex. An immediate improvement from [5] to the Varshamov-Gilbert bound is to employ the twisted large deviation rate function (instead of I¯X (·)) J¯X,h (x) ,

sup

[θ · h(x) − ϕ¯X (θ; h)]

{θ∈< : ϕ ¯X (θ;h)>−∞}

and yields a potentially non-convex Varshamov-Gilbert-type bound, where h(·) is a continuous real-valued function, and h i 1 ˆn n ϕ¯X (θ; h) , lim sup log E en·θ·h(µn (X ,X )/n) . n→∞ n Question of how to find a proper h(·) for such improvement is beyond the scope of this paper and hence is deferred for further study. ¯ X (R) > G ¯ X (R) by a simple example. • We now demonstrate that Λ

230

Example 9.36 Assume binary P code alphabet X = {0, 1}, and n-fold n n Hamming distance µn (ˆ x , x ) = ni=1 µ(ˆ xi , xi ). Define a measurable function as follows:  if 0 ≤ µn (ˆ xn , xn ) < αn;  0, αn, if αn ≤ µn (ˆ xn , xn ) < 2αn; µ ˆn (ˆ xn , xn ) ,  ∞, if 2αn ≤ µn (ˆ xn , xn ), where 0 < α < 1/2 is a universal constant. Let X be the product of ˆ i , Xi ) for 1 ≤ i ≤ n. uniform distributions over X , and let Yi , µ(X Then

= =

=

=

ϕX (−1/s) h i 1 ˆn n lim inf log E e−ˆµn (X ,X )/s n→∞ n " ! ˆ n, X n) µn (X 1 0 = sup min {s[IY (α) − R], s[IY (2α) − R] + α} s>0  ∞, for 0 ≤ R < IY (2α);    α(I (α) − R) Y , for IY (2α) ≤ R < IY (α); =  I (α) − IY (2α) Y   0, for R ≥ IY (α).

231

¯ X (R). Since We next derive Λ   1 n n ˆ ,X ) ≤ a Pr µ ˆ n (X n    Y1 + · · · + Yn   < α , 0 ≤ a < α;  Pr 0 ≤ n   = Y1 + · · · + Yn   < 2α , α ≤ a < 2α,  Pr 0 ≤ n we obtain

  ∞, if 0 ≤ R < IY (2α); ¯ α, if IY (2α) ≤ R < IY (α); ΛX (R) =  0, if IY (α) ≤ R.

¯ X (R) > G ¯ X (R) for IY (2α) < R < IY (α). Consequently, Λ • One of the problems that remain open in the combinatorial coding theory is the tightness of the asymptotic Varshamov-Gilbert bound for the binary code and the Hamming distance [3, page vii]. As mentioned in Section I, it is already known that the asymptotic Varshamov-Gilbert bound is in general not tight, e.g., for algebraic-geometric code with large code alphabet size and Hamming distance. Example 9.36 provides another example to confirm the untightness of the asymptotic Varshamov-Gilbert bound for simple binary code and quantized Hamming measure. By the generalized G¨artner-Ellis lower bound derived in [5, Thm. 2.1], we conclude that ¯ X (R) = G ¯ X (R)) `¯X (R) = I X (R) (or equivalently Λ if  ϕ (θ + t) − ϕX (θ)  [  ¯ 0 (X), D ¯ p (X) ∈ D lim sup X , t t↓0 θ0

ˆ n , X n )/s}], and X ˆ n ≡ X n are where ϕ¯X (s) , lim supn→∞ (1/n) log E[exp{−µn (X independent. For binary alphabet,  y