arXiv:1402.0608v1 [cs.IT] 4 Feb 2014

Variable-length compression allowing errors (extended)

Victoria Kostina, Yury Polyanskiy, Sergio Verdú

Abstract
This paper studies the fundamental limits of the minimum average length of variable-length compression when a nonzero error probability ǫ is tolerated. We give non-asymptotic bounds on the minimum average length in terms of Erokhin's rate-distortion function, and we use those bounds to obtain a Gaussian approximation of the speed of approach to the limit, which is quite accurate for all but small blocklengths:

(1 − ǫ) k H(S) − √(k V(S)/(2π)) e^{−(Q⁻¹(ǫ))²/2}

where Q⁻¹(·) is the functional inverse of the Q-function and V(S) is the source dispersion. A nonzero error probability thus not only reduces the asymptotically achievable rate by a factor of 1 − ǫ, but also causes this asymptotic limit to be approached from below, i.e. a larger source dispersion and shorter blocklengths are beneficial. Further, we show that variable-length lossy compression under an excess-distortion constraint exhibits similar properties.
Index Terms—variable-length compression, lossy compression, single-shot, dispersion, anti-redundancy, Shannon theory.
I. INTRODUCTION AND SUMMARY OF RESULTS
Let S be a discrete random variable to be compressed into a variable-length binary string. We denote the set of all binary strings (including the empty string) by {0, 1}⋆ and the length of a
string a ∈ {0, 1}⋆ by ℓ(a). The codes considered in this paper fall under the following paradigm.

This work was supported in part by the Center for Science of Information (CSoI), an NSF Science and Technology Center, under Grant CCF-0939370.
February 5, 2014
DRAFT
Definition 1 ((L, ǫ) code). A variable-length (L, ǫ) code for source S defined on a finite or countably infinite alphabet M is a pair of possibly random transformations P_{W|S} : M → {0, 1}⋆ and P_{Ŝ|W} : {0, 1}⋆ → M such that¹

P[S ≠ Ŝ] ≤ ǫ    (1)

E[ℓ(W)] ≤ L    (2)
The corresponding fundamental limit is

L⋆_S(ǫ) ≜ inf {L : ∃ an (L, ǫ) code}    (3)
Lifting the prefix condition in variable-length coding is discussed in [1], [2]. In particular, in the zero-error case we have [3], [4]

H(S) − log2(H(S) + 1) − log2 e ≤ L⋆_S(0)    (4)
L⋆_S(0) ≤ H(S)    (5)
while [1] shows that in the i.i.d. case (with a non-lattice distribution of S; otherwise o(1) becomes O(1))

L⋆_{S^k}(0) = k H(S) − (1/2) log2(8πe V(S) k) + o(1)    (6)
where V(S) is the varentropy of P_S, namely the variance of the information

ı_S(S) = log2 (1/P_S(S))    (7)
Under the rubric of "weak variable-length source coding," T. S. Han [5], [6, Section 1.8] considers the asymptotic fixed-to-variable (M = S^k) almost-lossless version of the foregoing setup with vanishing error probability and prefix encoders. Among other results, Han showed that the minimum average length L_{S^k}(ǫ) of prefix-free encoding of a stationary ergodic source with entropy rate H behaves as

lim_{ǫ→0} lim_{k→∞} (1/k) L_{S^k}(ǫ) = H    (8)
¹ Note that L need not be an integer.
Koga and Yamamoto [7] characterized the asymptotically achievable rates of variable-length prefix codes with non-vanishing error probability and, in particular, showed that for finite-alphabet i.i.d. sources with distribution S,

lim_{k→∞} (1/k) L_{S^k}(ǫ) = (1 − ǫ) H(S)    (9)
The benefit of variable-length vs. fixed-length coding for a given ǫ is clear from (9): the latter satisfies a strong converse, so any rate below the entropy is fatal. Allowing both nonzero error and variable-length coding is interesting not only conceptually but also on account of several important generalizations. For example, the variable-length counterpart of Slepian–Wolf coding, considered e.g. in [8], is particularly relevant in universal settings and has a radically different (and practically uninteresting) zero-error version. Another important generalization where nonzero error is inevitable is variable-length joint source-channel coding without or with feedback. For the latter, Polyanskiy et al. [9] showed that allowing a nonzero error probability boosts the ǫ-capacity of the channel, while matching the transmission length to channel conditions accelerates the rate of approach to that asymptotic limit. The use of nonzero-error compressors is also of interest in hashing [10]. The purpose of Section II of this paper is to give non-asymptotic bounds on the fundamental limit (3), and to apply those bounds to analyze the speed of approach to the limit in (9), which also holds without the prefix condition. Specifically, we show that (cf. (4)–(5))

L⋆_S(ǫ) = H(S, ǫ) ± log2 H(S)    (10)
        = E[⟨ı_S(S)⟩_ǫ] ± log2 H(S)    (11)
where

H(S, ǫ) ≜ min_{P_{Z|S} : P[S≠Z] ≤ ǫ} I(S; Z)    (12)
is Erokhin's function [11], and the ǫ-cutoff random transformation acting on a real-valued random variable X is defined as

⟨X⟩_ǫ ≜ { X,                          X < η
        { X w.p. 1 − α, 0 w.p. α,     X = η    (13)
        { 0,                          X > η

where η and α ∈ [0, 1) are uniquely determined by

P[X > η] + α P[X = η] = ǫ    (14)
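The ǫ-cutoff in (13)–(14) is straightforward to compute mechanically for a discrete distribution: scan the outcomes from the largest value down, accumulating tail mass until ǫ is exhausted, and randomize at the boundary point η. A minimal sketch (the function name and interface are ours, not the paper's):

```python
from collections import defaultdict

def epsilon_cutoff_mean(values, probs, eps):
    """E[<X>_eps]: zero out the upper eps-tail of X, splitting the boundary
    atom eta so that P[X > eta] + alpha * P[X = eta] = eps, as in (14)."""
    pmf = defaultdict(float)
    for v, p in zip(values, probs):   # merge duplicate values
        pmf[v] += p
    tail = 0.0
    for eta, p_eta in sorted(pmf.items(), reverse=True):  # largest X first
        if tail + p_eta >= eps:
            alpha = (eps - tail) / p_eta      # randomization at the boundary
            below = sum(v * p for v, p in pmf.items() if v < eta)
            return below + (1 - alpha) * eta * p_eta
        tail += p_eta
    return 0.0
```

For a constant X = log2 M (the equiprobable case), this returns (1 − ǫ) log2 M, in agreement with (38) below.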
The code that achieves (10) essentially discards “rich” source realizations with ıS (S) > η and encodes the rest losslessly assigning them in the order of decreasing probabilities to the elements of {0, 1}⋆ ordered lexicographically.
For memoryless sources with S_i ∼ S, we show that the speed of approach to the limit in (9) is given by the following result: each of L⋆_{S^k}(ǫ), H(S^k, ǫ) and E[⟨ı_{S^k}(S^k)⟩_ǫ] equals

(1 − ǫ) k H(S) − √(k V(S)/(2π)) e^{−(Q⁻¹(ǫ))²/2} + O(log k)    (15)
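The leading terms of (15), normalized by k, are easy to evaluate numerically. The sketch below does so for a Bernoulli(p) source (this helper and its name are ours; Q⁻¹ is obtained from the inverse normal cdf):

```python
import math
from statistics import NormalDist

def approx_rate(k, p, eps):
    """(1/k)[(1-eps) k H(S) - sqrt(k V(S)/(2 pi)) e^{-(Qinv(eps))^2/2}]
    for a Bernoulli(p) source; H in bits, V is the varentropy."""
    H = -p * math.log2(p) - (1 - p) * math.log2(1 - p)
    i0, i1 = -math.log2(1 - p), -math.log2(p)     # information values
    V = (1 - p) * i0**2 + p * i1**2 - H**2        # varentropy of i_S(S)
    qinv = NormalDist().inv_cdf(1 - eps)          # Q^{-1}(eps)
    return (1 - eps) * H - math.sqrt(V / (2 * math.pi * k)) * math.exp(-qinv**2 / 2)
```

As k grows the value increases monotonically toward (1 − ǫ)H(S) from below, which is the "approach from below" phenomenon discussed in Section II-E.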
To gain some insight into the form of (15), note that if the source is memoryless, the information in S^k is a sum of i.i.d. random variables, and by the central limit theorem

ı_{S^k}(S^k) = Σ_{i=1}^k ı_S(S_i)    (16)
             ≈ N(k H(S), k V(S)) in distribution,    (17)

while for Gaussian X

E[⟨X⟩_ǫ] = (1 − ǫ) E[X] − √(Var[X]/(2π)) e^{−(Q⁻¹(ǫ))²/2}    (18)

Our result in (15) underlines that not only does ǫ > 0 allow for a (1 − ǫ) reduction in asymptotic
rate (as found in [7]) but, in contrast to [12]–[14], a larger source dispersion is beneficial. This curious property is further discussed in Section II-E.

In Section III, we generalize the setting to allow a general distortion measure in lieu of the Hamming distortion in (1). More precisely, we replace (1) by the excess-distortion probability constraint P[d(S, Z) > d] ≤ ǫ. In this setting, refined asymptotics of the minimum achievable lengths of variable-length lossy prefix codes almost surely operating at distortion d were studied in [15] (pointwise convergence) and in [16], [17] (convergence in mean). Our main result in the lossy case is that (15) generalizes simply by replacing H(S) and V(S) by the corresponding rate-distortion and rate-dispersion functions, and Erokhin's function by

R_S(d, ǫ) ≜ min_{P_{Z|S} : P[d(S,Z)>d] ≤ ǫ} I(S; Z)    (19)
and the ǫ-cutoff of information by that of the d-tilted information [14], ⟨ȷ_S(S, d)⟩_ǫ.

II. ALMOST LOSSLESS VARIABLE-LENGTH COMPRESSION
A. Optimal code

In the zero-error case the optimum variable-length compressor without prefix constraints f⋆_S is known explicitly [3], [18]²: a deterministic mapping that assigns the elements of M (labeled without loss of generality as the positive integers) ordered in decreasing probabilities to {0, 1}⋆ ordered lexicographically. The decoder is just the inverse of this injective mapping. This code is optimal in the strong stochastic sense that the cumulative distribution function of the length of any other code cannot lie above the one achieved with f⋆_S. The length function of the optimum code is [3]

ℓ(f⋆_S(m)) = ⌊log2 m⌋    (20)
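The lexicographic enumeration of {0, 1}⋆ behind f⋆_S has a one-line implementation: writing m in binary and dropping the leading 1 maps 1 → ∅, 2 → '0', 3 → '1', 4 → '00', and so on, with length exactly ⌊log2 m⌋ as in (20). A sketch (the function name is ours):

```python
import math

def f_star(m):
    """Optimal zero-error encoding of the m-th most likely outcome (m >= 1):
    the binary expansion of m with its leading 1 removed enumerates
    {0,1}* lexicographically by length."""
    assert m >= 1
    return bin(m)[3:]  # bin(m) = '0b1...'; drop '0b' and the leading 1
```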
In order to generalize this code to the nonzero-error setting, we take advantage of the fact that in our setting error detection is not required at the decoder. This allows us to retain the same decoder as in the zero-error case. As far as the encoder is concerned, to save on length on a given set of realizations which we are willing to fail to recover correctly, it is optimal to assign them all to ∅. Moreover, since we have the freedom to choose the set that we want to recover correctly (subject to a constraint on its probability being at least 1 − ǫ), it is optimal to include all the most likely realizations (whose encodings according to f⋆_S are shortest). If we are fortunate enough that ǫ is such that Σ_{m=1}^M P_S(m) = 1 − ǫ for some M, then the code f(m) = f⋆_S(m) for m = 1, . . . , M and f(m) = ∅ for m > M is optimal.
Formally, notice that for a given encoder P_{W|S}, the optimal decoder is always deterministic; denote its output by Ŝ = g(W). Consider w0 ∈ {0, 1}⋆ \ {∅} and a source realization m with P_{W|S}(w0|m) > 0. If g(w0) ≠ m, the average length can be decreased, without affecting the probability of error, by setting P_{W|S}(w0|m) = 0 and adjusting P_{W|S}(∅|m) accordingly. This argument implies that the optimal encoder has at most one source realization m mapping to each w0 ≠ ∅. Next, let m0 = g(∅) and by a similar argument conclude that P_{W|S}(∅|m0) = 1. But then, interchanging m0 and 1 leads to the same or better probability of error and shorter
² The construction in [18] omits the empty string.
average length, which implies that the optimal encoder maps 1 to ∅. Continuing in the same manner for m0 = g(0), g(1), . . . , g(f⋆_S(M)), we derive that the optimal code maps f(m) = f⋆_S(m), m = 1, . . . , M. Finally, assigning the remaining source outcomes, whose total mass is ǫ, to ∅ shortens the average length without affecting the error probability, so optimally f(m) = ∅, m > M.

To describe an optimum construction that holds without the foregoing fortuitous choice of ǫ, let M be the minimum M such that Σ_{m=1}^M P_S(m) ≥ 1 − ǫ, let η = ⌊log2 M⌋, let f(m) = f⋆_S(m) if ⌊log2 m⌋ < η and f(m) = ∅ if ⌊log2 m⌋ > η, and assign the outcomes with ⌊log2 m⌋ = η to ∅ with probability α and to the lossless encoding f⋆_S(m) with probability 1 − α, where α is chosen so that³

ǫ = α Σ_{m∈M : ⌊log2 m⌋=η} P_S(m) + Σ_{m∈M : ⌊log2 m⌋>η} P_S(m)    (21)
  = E[ε⋆(S)]    (22)

where

ε⋆(m) = { 0,  ℓ(f⋆_S(m)) < η
        { α,  ℓ(f⋆_S(m)) = η    (23)
        { 1,  ℓ(f⋆_S(m)) > η
We have shown that the output of the optimal encoder has structure⁴

W(m) = { f⋆_S(m),  ⟨ℓ(f⋆_S(m))⟩_ǫ > 0
       { ∅,        otherwise    (24)
³ It does not matter exactly how the encoder implements the randomization on the boundary as long as, conditioned on ⌊log2 S⌋ = η, the probability that S is mapped to ∅ is α. In the deterministic code with the fortuitous choice of ǫ described above, α is the ratio of the probability of the set {m ∈ M : m > M, ⌊log2 m⌋ = η} to that of {m ∈ M : ⌊log2 m⌋ = η}.

⁴ If error detection is required and ǫ ≥ P_S(1), then f⋆_S(m) in the right side of (24) is replaced by f⋆_S(m + 1). Similarly, if error detection is required and P_S(j) > ǫ ≥ P_S(j + 1), then f⋆_S(m) in the right side of (24) is replaced by f⋆_S(m + 1) as long as m ≥ j, and ∅ in the right side of (24) is replaced by f⋆_S(j).
and that the minimum average length is given by

L⋆_S(ǫ) = E[⟨ℓ(f⋆_S(S))⟩_ǫ]    (25)
        = L⋆_S(0) − max_{ε(·) : E[ε(S)]≤ǫ} E[ε(S) ℓ(f⋆_S(S))]    (26)
        = L⋆_S(0) − E[ε⋆(S) ℓ(f⋆_S(S))]    (27)
where the optimization in (26) is over ε : Z₊ → [0, 1], and the optimal error profile ε⋆(·) that achieves (26) is given by (23).

An immediate consequence is that in the region of large error probability, ǫ > 1 − P_S(1), we have M = 1, all outcomes are mapped to ∅, and therefore L⋆_S(ǫ) = 0. At the other extreme, if ǫ = 0, then M = |M| and [2]

L⋆_S(0) = E[ℓ(f⋆_S(S))] = Σ_{i=1}^∞ P[S ≥ 2^i]    (28)
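The characterization (25)–(27) is directly computable: sort the pmf in decreasing order, assign outcome m the length ⌊log2 m⌋, and discard an ǫ-mass of the longest codewords. The sketch below (our helper, not the paper's notation) also lets us verify (28) at ǫ = 0:

```python
import math

def min_avg_length(probs, eps):
    """L*_S(eps) via (25): order the pmf decreasingly, give outcome m the
    length floor(log2 m) of the optimal zero-error code, then zero out an
    eps-mass of the largest lengths (the eps-cutoff of the length rv)."""
    probs = sorted(probs, reverse=True)
    lam = [(math.floor(math.log2(m)), p) for m, p in enumerate(probs, start=1)]
    out, budget = 0.0, eps
    for l, p in sorted(lam, key=lambda t: t[0], reverse=True):  # longest first
        drop = min(p, budget)   # mass discarded (mapped to the empty string)
        budget -= drop
        out += l * (p - drop)
    return out
```

At ǫ = 0 the result coincides with the tail-sum formula (28); for ǫ > 0 the average length strictly decreases, and for ǫ ≥ 1 − P_S(1) it is zero.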
B. Non-asymptotic bounds

Expression (25) is not always convenient to work with. The next result generalizes the bounds in (4) and (5) to ǫ > 0; the role of the entropy is taken over by Erokhin's function.

Theorem 1. If 0 ≤ ǫ < 1 − P_S(1), then the minimum achievable average length satisfies

H(S, ǫ) − log2(H(S, ǫ) + 1) − log2 e ≤ L⋆_S(ǫ)    (29)
L⋆_S(ǫ) ≤ H(S, ǫ) + ǫ log2(H(S) + ǫ) + ǫ log2(e/ǫ) + 2 h(ǫ)    (30)

where H(S, ǫ) is Erokhin's function defined in (12), and h(x) = x log2(1/x) + (1 − x) log2(1/(1 − x)) is the binary entropy function. If ǫ > 1 − P_S(1), then L⋆_S(ǫ) = 0.

Note that we recover (4) and (5) by particularizing Theorem 1 to ǫ = 0. Erokhin's function [11] can be parametrically represented as follows:
H(S, ǫ) = Σ_{m=1}^M P_S(m) log2(1/P_S(m)) − (1 − ǫ) log2(1/(1 − ǫ)) − (M − 1) η log2(1/η)    (31)

with the integer M and η > 0 determined by ǫ through

Σ_{m=1}^M P_S(m) = 1 − ǫ + (M − 1) η    (32)
In particular, H(S, 0) = H(S), and if S is equiprobable on an alphabet of M letters, then

H(S, ǫ) = log2 M − ǫ log2(M − 1) − h(ǫ)    (33)
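As a quick sanity check of (33) against the cutoff bounds of Theorem 2 below: for the equiprobable source, E[⟨ı_S(S)⟩_ǫ] = (1 − ǫ) log2 M (see (38)), so (33) must not exceed it. A minimal sketch (helper names are ours):

```python
import math

def h2(x):
    """Binary entropy function h(x) in bits."""
    return 0.0 if x in (0.0, 1.0) else -x * math.log2(x) - (1 - x) * math.log2(1 - x)

def erokhin_equiprobable(M, eps):
    """(33): H(S, eps) = log2 M - eps*log2(M-1) - h(eps), S uniform on M letters."""
    return math.log2(M) - eps * math.log2(M - 1) - h2(eps)
```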
In principle, it may seem surprising that L⋆_S(ǫ) is connected to H(S, ǫ) in the way dictated by Theorem 1, which implies that whenever the unnormalized quantity H(S, ǫ) is large it must be close to the minimum average length. After all, the objectives of minimizing the input/output dependence and minimizing the description length of Ŝ appear to be disparate, and in fact (24) and the conditional distribution achieving (12) are quite different: although in both cases S and its approximation coincide on the most likely outcomes, the number of retained outcomes is different, and, to lessen dependence, errors in the optimizing conditional in (12) do not favor m = 1 or any particular outcome of S. Unfortunately, a direct asymptotic analysis of both quantities L⋆_S(ǫ) and H(S, ǫ) is challenging. The next result tightly bounds these quantities in terms of the ǫ-cutoff of information, ⟨ı_S(S)⟩_ǫ, a random variable that is easy to deal with.

Theorem 2. If 0 ≤ ǫ < 1 − P_S(1), then the minimum achievable average length satisfies

E[⟨ı_S(S)⟩_ǫ] + L⋆_S(0) − H(S) ≤ L⋆_S(ǫ)    (34)
L⋆_S(ǫ) ≤ E[⟨ı_S(S)⟩_ǫ]    (35)

and Erokhin's function satisfies

E[⟨ı_S(S)⟩_ǫ] − ǫ log2(L⋆_S(0) + ǫ) − 2 h(ǫ) − ǫ log2(e/ǫ) ≤ H(S, ǫ)    (36)
H(S, ǫ) ≤ E[⟨ı_S(S)⟩_ǫ]    (37)
If ǫ > 1 − P_S(1), then L⋆_S(ǫ) = H(S, ǫ) = 0.

Example. If S is equiprobable on an alphabet of cardinality M, then

⟨ı_S(S)⟩_ǫ = { log2 M,  w.p. 1 − ǫ
             { 0,       otherwise    (38)
C. Proofs of Theorems 1 and 2

Proof of the converse bound (29): The entropy of the output string W ∈ {0, 1}⋆ of an arbitrary compressor S → W → Ŝ with P[S ≠ Ŝ] ≤ ǫ satisfies

H(W) ≥ I(S; W) ≥ I(S; Ŝ) ≥ H(S, ǫ)    (39)

where the rightmost inequality holds in view of (12). Noting that the identity mapping W → W is a lossless variable-length code, we lower-bound its average length as

H(W) − log2(H(W) + 1) − log2 e ≤ L⋆_W(0)    (40)
L⋆_W(0) ≤ E[ℓ(W)]    (41)

where (40) follows from (4). The function of H(W) on the left side of (40) is monotonically increasing if H(W) > log2(e/2) ≈ 0.44 bits, and it is positive if H(W) > 3.66 bits. Therefore, it is safe to further weaken the bound in (40) by invoking (39). This concludes the proof of (29).

By applying [1, Theorem 1] to W, we can get a sharper lower bound (which is always positive):

ψ⁻¹(H(S, ǫ)) ≤ L⋆_S(ǫ)    (42)

where ψ⁻¹ is the inverse of the monotonic function on the positive real line

ψ(x) = x + (1 + x) log2(1 + x) − x log2 x    (43)
Proof of the achievability bound (30): First, notice that the constraint in (12) is achieved with equality:

H(S, ǫ) = min_{P_{Z|S} : P[S≠Z]=ǫ} I(S; Z)    (44)

Indeed, given P[S ≠ Z] ≤ ǫ we may define Z′ such that S → Z → Z′ and P[S ≠ Z′] = ǫ (for example, by probabilistically mapping non-zero values of Z to Z′ = 0). Fix P_{Z|S} satisfying the constraint in (44). Denote for brevity

Λ ≜ ℓ(f⋆_S(S))    (45)
E ≜ 1{S ≠ Z}    (46)
ε(i) ≜ P[S ≠ Z | Λ = i]    (47)
We proceed to lower-bound the mutual information between S and Z:

I(S; Z) = I(S; Z, Λ) − I(S; Λ|Z)    (48)
        = H(S) − H(Λ|Z) − H(S|Z, Λ)    (49)
        = H(S) − I(Λ; E|Z) − H(Λ|Z, E) − H(S|Z, Λ)    (50)
        ≥ L⋆_S(ǫ) + H(S) − L⋆_S(0) − ǫ log2(L⋆_S(0) + ǫ) − ǫ log2(e/ǫ) − 2 h(ǫ)    (51)
where (51) follows from I(Λ; E|Z) ≤ h(ǫ) and the chains (52)–(53) and (55)–(59) below. First,

H(S|Z, Λ) ≤ E[ε(Λ) Λ + h(ε(Λ))]    (52)
          ≤ L⋆_S(0) − L⋆_S(ǫ) + h(ǫ)    (53)

where (52) is by Fano's inequality: conditioned on Λ = i, S can take at most 2^i values, so

H(S|Z, Λ = i) ≤ i ε(i) + h(ε(i))    (54)

and (53) follows from (26), (44) and the concavity of h(·). The third term in (50) is upper-bounded as follows:

H(Λ|Z, E) = ǫ H(Λ|Z, E = 1)    (55)
          ≤ ǫ H(Λ|S ≠ Z)    (56)
          ≤ ǫ (log2(1 + E[Λ|S ≠ Z]) + log2 e)    (57)
          ≤ ǫ (log2(1 + E[Λ]/ǫ) + log2 e)    (58)
          = ǫ log2(e/ǫ) + ǫ log2(L⋆_S(0) + ǫ)    (59)
where (55) follows since H(Λ|Z, E = 0) = 0, (56) holds because conditioning decreases entropy, (57) follows by maximizing entropy under a mean constraint (the maximum is achieved by the geometric distribution), (58) follows by upper-bounding P[S ≠ Z] E[Λ|S ≠ Z] ≤ E[Λ], and (59) applies (28). Finally, since the right side of (51) does not depend on Z, we may minimize the left side over P_{Z|S} satisfying the constraint in (44) to obtain

L⋆_S(ǫ) ≤ H(S, ǫ) + L⋆_S(0) − H(S) + ǫ log2(L⋆_S(0) + ǫ) + 2 h(ǫ) + ǫ log2(e/ǫ)    (60)
which leads to (30) via Wyner's bound (5).

Proof of Theorem 2: Similarly to (26), we have the variational characterization

E[⟨ı_S(S)⟩_ǫ] = H(S) − max_{ε(·) : E[ε(S)]≤ǫ} E[ε(S) ı_S(S)]    (61)

where ε(·) takes values in [0, 1]. Noting that the ordering P_S(1) ≥ P_S(2) ≥ . . . implies

⌊log2 m⌋ ≤ ı_S(m)    (62)

we obtain (34)–(35) by comparing (26) and (61) via (62). The bound in (36) follows from (60) and (34). Showing (37) involves the suboptimal choice (in (12))

Z = { S,   ⟨ı_S(S)⟩_ǫ > 0
    { S̄,  ⟨ı_S(S)⟩_ǫ = 0    (63)

where P_{S S̄} = P_S × P_S, i.e. S̄ is an independent copy of S.
D. Asymptotics for memoryless sources

Theorem 3. Assume that:
• P_{S^k} = P_S × . . . × P_S;
• the third absolute moment of ı_S(S) is finite.

For any 0 ≤ ǫ ≤ 1 and k → ∞, each of L⋆_{S^k}(ǫ), H(S^k, ǫ) and E[⟨ı_{S^k}(S^k)⟩_ǫ] equals

(1 − ǫ) k H(S) − √(k V(S)/(2π)) e^{−(Q⁻¹(ǫ))²/2} + θ(k)    (64)

where the remainder term satisfies

− log2 k + O(log2 log2 k) ≤ θ(k) ≤ O(1)    (65)
Proof: If the source is memoryless, the information in S^k is a sum of i.i.d. random variables as indicated in (16), and Theorem 3 follows by applying the next lemma to the bounds in Theorem 2.
Lemma 1. Let X₁, X₂, . . . be a sequence of independent random variables with a common distribution P_X and a finite third absolute moment. Then for any 0 ≤ ǫ ≤ 1 and k → ∞ we have

E[⟨Σ_{i=1}^k X_i⟩_ǫ] = (1 − ǫ) k E[X] − √(k Var[X]/(2π)) e^{−(Q⁻¹(ǫ))²/2} + O(1)    (66)
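Lemma 1 can be checked numerically without any sampling: for i.i.d. Bernoulli(q) terms the sum is binomial, so E[⟨Σ_k⟩_ǫ] is computable exactly and can be compared with the right side of (66). A sketch under that Bernoulli assumption (helper names are ours):

```python
import math
from statistics import NormalDist

def exact_cutoff_mean(k, q, eps):
    """E[<X_1+...+X_k>_eps] computed exactly for i.i.d. Bernoulli(q) X_i:
    zero out an eps-mass of the largest values of the binomial sum."""
    pmf = [math.comb(k, j) * q**j * (1 - q)**(k - j) for j in range(k + 1)]
    out, budget = 0.0, eps
    for j in range(k, -1, -1):        # largest values first
        drop = min(pmf[j], budget)    # boundary atom is split, as in (14)
        budget -= drop
        out += j * (pmf[j] - drop)
    return out

def gaussian_approx(k, q, eps):
    """Right side of (66) without the O(1) term; E[X] = q, Var[X] = q(1-q)."""
    qinv = NormalDist().inv_cdf(1 - eps)
    return (1 - eps) * k * q - math.sqrt(k * q * (1 - q) / (2 * math.pi)) * math.exp(-qinv**2 / 2)
```

For moderate k the two agree to within a small additive constant, consistent with the O(1) remainder.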
Proof: Appendix A.

Remark 1. Applying (6) to (34), for finite-alphabet sources the lower bound on the remainder in (64) for L⋆_{S^k}(ǫ) is improved to

θ(k) ≥ −(1/2) log2 k + O(1)    (67)

For H(S^k, ǫ), the lower bound is in fact θ(k) ≥ −ǫ log2 k + O(1), and for E[⟨ı_{S^k}(S^k)⟩_ǫ], θ(k) = O(1).
Remark 2. If the source has a finite alphabet, we can sketch an alternative proof of Theorem 3 using the method of types. By concavity and symmetry, it is easy to see that the optimal coupling that achieves H(S^k, ǫ) satisfies the following property: the error profile

ǫ(s^k) ≜ P[Z^k ≠ S^k | S^k = s^k]    (68)

is constant on each k-type (see [19, Chapter 2] for types). Denote the type of s^k by P̂_{s^k} and its size by M(s^k). We then have the following chain:

I(S^k; Z^k) = I(S^k, P̂_{S^k}; Z^k)    (69)
            = I(S^k; Z^k | P̂_{S^k}) + O(log k)    (70)
            ≥ E[(1 − ǫ(S^k)) log M(S^k)] + O(log k)    (71)

where (70) follows since there are only polynomially many types, and (71) follows from (33). Next, (71) is to be minimized over all ǫ(S^k) satisfying E[ǫ(S^k)] ≤ ǫ. The solution of this linear optimization is easy: ǫ(s^k) is 1 for all types with M(s^k) exceeding a certain threshold, and 0 otherwise. In other words, we get

H(S^k, ǫ) = (1 − ǫ) E[log M(S^k) | M(S^k) ≤ γ] + O(log k)    (72)
where γ is chosen so that P[M(S^k) > γ] = ǫ. Using the known relation between the size of a type and its entropy, log M(s^k) = k H(P̂_{s^k}) + O(log k), and from the central limit theorem, cf. [12], [20], we get

k H(P̂_{S^k}) =_d k H(S) + √(k V(S)) U + O(log k)    (73)
U ∼ N(0, 1)    (74)

Thus, putting together (72), (73) and (74), after some algebra (64) follows.

E. Discussion

Theorem 3 exhibits an unusual phenomenon in which the dispersion term improves the achievable average rate. As illustrated in Fig. 1, a nonzero error probability ǫ decreases the average achievable rate, as the source outcomes falling into the shaded area are assigned length 0. The center of probabilistic mass shifts to the left when the ǫ-tail of the distribution is chopped off. Since the size of this shift is proportional to the width of the distribution, shorter blocklengths and larger dispersions help to achieve a lower average rate.
[Figure 1 shows the pmf of (1/k) ℓ(f⋆(S^k)), with the upper ǫ-tail shaded and the rates (1/k) L⋆_{S^k}(ǫ) and (1/k) L⋆_{S^k}(0) marked.]

Fig. 1. The benefit of nonzero ǫ and dispersion.
For a source of biased coin flips, Fig. 2 depicts the exact average rate of the optimal code as well as the approximation in (64). Both curves are monotonically increasing in k. The dispersion term in (64) vanishes quickly with ǫ. More precisely, as ǫ → 0, we have (Appendix B)

(1/√(2π)) e^{−(Q⁻¹(ǫ))²/2} = ǫ √(2 log_e(1/ǫ)) + o(ǫ)    (75)
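The asymptotic relation (75) is easy to probe numerically: the ratio of the two sides should approach 1 from below as ǫ → 0 (slowly, at logarithmic speed). A minimal sketch (helper names are ours; Q⁻¹ via the inverse normal cdf):

```python
import math
from statistics import NormalDist

def f(eps):
    """f(eps) = exp(-(Qinv(eps))^2 / 2) / sqrt(2*pi), cf. (142) in Appendix B."""
    qinv = NormalDist().inv_cdf(1 - eps)   # Q^{-1}(eps)
    return math.exp(-qinv**2 / 2) / math.sqrt(2 * math.pi)

def leading_term(eps):
    """Leading term of (75): eps * sqrt(2 * ln(1/eps))."""
    return eps * math.sqrt(2 * math.log(1 / eps))
```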
Therefore, a refined analysis of the higher-order terms in the expansion (64) is desirable in order to obtain an approximation that is accurate even at short blocklengths. Inspired by [21], in Fig. 2 we adopt the following value for the remainder in (64):

θ(k) = (1 − ǫ) [ (log2 k)/2 − (1/2) log2(4e³π) + p/(1 − 2p) + (1/(1 − 2p)) log2((1 − p)/(2(1 − 2p))) + 1/p ]    (76)
where p is the coin bias. This choice proves to yield a remarkably good approximation, accurate for blocklengths as short as 20. Figure 4 plots the bounds to H(S^k, ǫ) for biased coin flips.

III. LOSSY VARIABLE-LENGTH COMPRESSION
A. The setup

In the basic setup of lossy compression, we are given a source alphabet M, a reproduction alphabet M̂, a distortion measure d : M × M̂ → [0, +∞] to assess the fidelity of reproduction, and a probability distribution of the object S to be compressed.
Definition 2 ((L, d, ǫ) code). A variable-length (L, d, ǫ) lossy code for {S, d} is a pair of random transformations P_{W|S} : M → {0, 1}⋆ and P_{Z|W} : {0, 1}⋆ → M̂ such that

P[d(S, Z) > d] ≤ ǫ    (77)

E[ℓ(W)] ≤ L    (78)
The goal of this section is to characterize the minimum achievable average length compatible with the given tolerable error ǫ:

L⋆_S(d, ǫ) ≜ inf {L : ∃ an (L, d, ǫ) code}    (79)
[Figure 2 plots the rate (1/k) L⋆_{S^k}(ǫ) and the approximation (64), in bits per bit, versus k from 0 to 100, for ǫ = 10⁻⁴ and ǫ = 0.1.]

Fig. 2. Average rate achievable for variable-rate almost-lossless encoding of a memoryless binary source with bias p = 0.11 and two values of ǫ. For ǫ < 10⁻⁴, the resulting curves are almost indistinguishable from the ǫ = 10⁻⁴ curve.
Note that unlike the lossless setup in Section II, the optimal encoding and decoding mappings do not admit, in general, an explicit description. Section III-B reviews some background facts from rate-distortion theory, Section III-C presents single-shot results, and Section III-D focuses on the asymptotics.

B. A bit of rate-distortion theory

The minimal mutual information quantity

R_S(d) ≜ inf_{P_{Z|S} : E[d(S,Z)] ≤ d} I(S; Z)    (80)
[Figure 3 plots, in bits per bit versus k from 0 to 500: the exact rate (1/k) L⋆_{S^k}(ǫ), the upper bounds (30) and (35), and the lower bounds (42) and (29).]

Fig. 3. Bounds to the average rate achievable for variable-rate almost-lossless encoding of a memoryless binary source with bias p = 0.11 and ǫ = 0.1. The lower bound in (29) is virtually indistinguishable from a weakening of (34) using (4).
characterizes the minimum asymptotically achievable rate in both fixed-length compression under the average or excess distortion constraint and variable-length lossy compression under the almost-sure distortion constraint [22], [23]. We assume throughout that the following basic assumptions are met.

(A) R_S(d) is finite for some d, i.e. d_min < ∞, where

d_min ≜ inf {d : R_S(d) < ∞}    (81)
[Figure 4 plots, in bits per bit versus k from 0 to 100, for ǫ = 0.01 and ǫ = 0.1: (1/k) H(S^k, ǫ) computed via (31), the upper bound (37), and the lower bound (36).]

Fig. 4. Bounds to Erokhin's function for a memoryless binary source with bias p = 0.11.
(B) The distortion measure is such that there exists a finite set E ⊂ M̂ such that

E[min_{z∈E} d(S, z)] < ∞    (82)
The following characterization of R_S(d), due to Csiszár [24], will be instrumental.

Theorem 4 (Characterization of R_S(d) [24, Theorem 2.3]). For each d > d_min it holds that

R_S(d) = max_{J(s), λ} {E[J(S)] − λ d}    (83)

where the maximization is over J(s) ≥ 0 and λ ≥ 0 satisfying the constraint

E[exp{J(S) − λ d(S, z)}] ≤ 1  ∀z ∈ M̂    (84)
Let (J_S(s), λ_S) attain the maximum in the right side of (83). If there exists a transition probability kernel P_{Z⋆|S} that actually achieves the infimum in the right side of (80), then [24]

J_S(s) = ı_{S;Z⋆}(s; z) + λ_S d(s, z)    (85)
       = − log2 E[exp(−λ_S d(s, Z⋆))]    (86)

where (85) holds for P_{Z⋆}-a.e. z, the expectation in (86) is with respect to the unconditional distribution of Z⋆, and the usual information density is denoted by

ı_{S;Z}(s; z) ≜ log2 (dP_{Z|S=s}(z)/dP_Z)    (87)

Note from (86) that, by the concavity of the logarithm,

0 ≤ J_S(s) ≤ λ_S E[d(s, Z⋆)] log2 e    (88)
The random variable that plays the key role in characterizing the nonasymptotic fundamental limit of lossy data compression is the d-tilted information in s ∈ M [14]:

ȷ_S(s, d) ≜ J_S(s) − λ_S d    (89)

It follows from (83) that

R_S(d) = E[ȷ_S(S, d)]    (90)

Much like the information in s ∈ M, which quantifies the number of bits necessary to represent s losslessly, the d-tilted information in s quantifies the number of bits necessary to represent s within distortion d, in a sense that goes beyond the average in (90) [14], [15]. Particularizing (84), we observe that the d-tilted information satisfies

E[exp(ȷ_S(S, d) + λ_S d − λ_S d(S, z))] ≤ 1    (91)
Using Markov's inequality and (86), it is easy to see that the d-tilted information is linked to the probability that Z⋆ falls within distortion d from s ∈ M:

ȷ_S(s, d) ≤ log2 (1/P_{Z⋆}(B_d(s)))    (92)

where

B_d(s) ≜ {z ∈ M̂ : d(s, z) ≤ d}    (93)

Moreover, under regularity conditions the reverse inequality in (92) can be closely approached [15, Proposition 3].
C. Nonasymptotic bounds

The next result provides nonasymptotic bounds to the minimum achievable average length.

Theorem 5 (Bounds to L⋆_S(d, ǫ)). The minimal average length under the excess-distortion criterion satisfies

R_S(d, ǫ) − log2(R_S(d, ǫ) + 1) − log2 e ≤ L⋆_S(d, ǫ)    (94)
L⋆_S(d, ǫ) ≤ inf_{P_Z} E[⟨− log2 P_Z(B_d(S))⟩_ǫ]    (95)

where B_d(s) is the distortion-d ball around s (formally defined in (93)), R_S(d, ǫ) is the minimal information quantity defined in (19), and the infimum is over all distributions on M̂.

Proof: The converse bound in (94) is shown by repeating the argument in (39)–(41).
To show the achievability bound in (95), consider the (d, ǫ) code that, given an infinite list of codewords z₁, z₂, . . ., outputs the first d-close match to S as long as S is not too atypical. Specifically, the encoder outputs the lexicographic binary encoding (including the empty string) of

{ W,  ⟨− log2 P_Z(B_d(S))⟩_ǫ > 0
{ 1,  otherwise    (96)

where

W = min {m : d(S, z_m) ≤ d}    (97)
The encoded length, averaged over both the source and all codebooks with codewords Z₁, Z₂, . . . drawn i.i.d. from P_Z, is upper-bounded by

E[⌊log2 W⌋ 1{⟨− log2 P_Z(B_d(S))⟩_ǫ > 0}]
  ≤ E[log2 W · 1{⟨− log2 P_Z(B_d(S))⟩_ǫ > 0}]    (98)
  = E[1{⟨− log2 P_Z(B_d(S))⟩_ǫ > 0} E[log2 W | S]]    (99)
  ≤ E[1{⟨− log2 P_Z(B_d(S))⟩_ǫ > 0} log2 E[W | S]]    (100)
  = E[⟨− log2 P_Z(B_d(S))⟩_ǫ]    (101)

where
• (100) is by Jensen's inequality;
• (101) holds because, conditioned on S = s and averaged over codebooks, W has a geometric distribution with success probability P_Z(B_d(s)).
It follows that there is at least one codebook that yields the encoded length not exceeding the expectation in (101).
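The mechanics of this random-coding argument can be simulated. The sketch below (our own toy setup, not from the paper) draws i.i.d. uniform codewords for a binary source with Hamming distortion: the index W of the first d-close codeword is geometric with success probability P_Z(B_d(s)), so its sample mean approaches 1/P_Z(B_d(s)), and the Jensen step (100) holds for the samples:

```python
import math
import random

def ball_mass(k, d):
    """P_Z(B_d(s)) for Z uniform on {0,1}^k and Hamming distortion <= d;
    by symmetry the ball mass is the same for every center s."""
    r = math.floor(d * k)
    return sum(math.comb(k, j) for j in range(r + 1)) / 2**k

def first_match_index(s, d, rng):
    """W of (97): index of the first i.i.d. uniform codeword within distortion d of s."""
    k, r = len(s), math.floor(d * len(s))
    w = 1
    while True:
        z = [rng.randint(0, 1) for _ in range(k)]
        if sum(a != b for a, b in zip(s, z)) <= r:
            return w
        w += 1

rng = random.Random(0)           # fixed seed for reproducibility
k, d = 12, 0.25
p_ball = ball_mass(k, d)
s = [rng.randint(0, 1) for _ in range(k)]
samples = [first_match_index(s, d, rng) for _ in range(2000)]
```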
Remark 3. The minimum average length of d-semifaithful codes is bounded by

R_S(d, 0) − log2(R_S(d, 0) + 1) − log2 e ≤ L⋆_S(d, 0)    (102)
L⋆_S(d, 0) ≤ inf_{P_Z} E[− log2 P_Z(B_d(S))]    (103)
           ≤ H_d(S)    (104)
           ≤ R_S(d, 0) + log2(R_S(d, 0) + 1) + C    (105)
where H_ǫ(S) is the ǫ-entropy of the source S [26]:

H_ǫ(S) ≜ min_{f : M → M̂, d(S, f(S)) ≤ ǫ a.s.} H(f(S))    (106)
(104) holds because for any f satisfying the constraint in (106) [26, Lemma 9]

P_{f(S)}(f(s)) ≤ P_{f(S)}(B_ǫ(s))    (107)

and (105), where C is a universal constant, holds whenever d is a metric, by [27, Theorem 2].

The function R_S(d, ǫ) is challenging to compute and analyze. The gateway to its analysis is the next result, which implies that under regularity assumptions

R_S(d, ǫ) = E[⟨ȷ_S(S, d)⟩_ǫ] ± O(log2 R_S(d))    (108)
Theorem 6 (Bounds to R_S(d, ǫ)). For all d > d_min we have

E[⟨ȷ_S(S, d)⟩_ǫ] − log2(R_S(d) − R′_S(d) d + 1) − log2 e − h(ǫ) ≤ R_S(d, ǫ)    (109)
R_S(d, ǫ) ≤ inf_{P_Z} E[⟨− log2 P_Z(B_d(S))⟩_ǫ]    (110)

Proof: Appendix C.
Remark 4. As follows from Lemma 3 in Appendix C, in the special case where

ȷ_S(S, d) = R_S(d) a.s.    (111)

which in particular includes the equiprobable source under a permutation distortion measure (e.g. symbol error rate) [25], the lower bound in (109) can be tightened to

R_S(d, ǫ) ≥ (1 − ǫ) R_S(d) − h(ǫ)    (112)
Remark 5. If ǫ = 0, we have the following basic bounds:

R_S(d) ≤ R_S(d, 0)    (113)
R_S(d, 0) ≤ H_d(S)    (114)
Remark 6. Similar to (61), note the variational characterization

E[⟨ȷ_S(S, d)⟩_ǫ] = R_S(d) − max_{ε : M → [0,1], E[ε(S)] ≤ ǫ} E[ε(S) ȷ_S(S, d)]    (115)

from which it follows via (92) that

E[⟨ȷ_S(S, d)⟩_ǫ] ≤ E[⟨− log2 P_{Z⋆}(B_d(S))⟩_ǫ]    (116)
E[⟨− log2 P_{Z⋆}(B_d(S))⟩_ǫ] ≤ E[⟨ȷ_S(S, d)⟩_ǫ] + E[− log2 P_{Z⋆}(B_d(S))] − R_S(d)    (117)
where P_{Z⋆} is the output distribution that achieves R_S(d).

D. Asymptotic analysis

In this section we assume that the following conditions are satisfied.
(i) The source {S_i} is stationary and memoryless, P_{S^k} = P_S × . . . × P_S.
(ii) The distortion measure is separable: d(s^k, z^k) = (1/k) Σ_{i=1}^k d(s_i, z_i).
(iii) The distortion level satisfies d_min < d < d_max, where d_min is defined in (81) and d_max = inf_{z∈M̂} E[d(S, z)], the expectation being with respect to the unconditional distribution of S.
(iv) E[d¹²(S, Z⋆)] < ∞, where the expectation is with respect to P_S × P_{Z⋆}, and Z⋆ achieves the rate-distortion function R_S(d).
If (i)–(iii) are satisfied, then λ_{S^k} = k λ_S and P_{Z^{k⋆}|S^k} = P_{Z⋆|S} × . . . × P_{Z⋆|S}, where P_{Z⋆|S} achieves R_S(d). Moreover, even if R_S(d) is not achieved by any conditional distribution,

ȷ_{S^k}(s^k, d) = Σ_{i=1}^k ȷ_S(s_i, d)    (118)
Theorem 7. Under assumptions (i)–(iv), for any 0 ≤ ǫ ≤ 1, each of L⋆_{S^k}(d, ǫ), R_{S^k}(d, ǫ) and E[⟨ȷ_{S^k}(S^k, d)⟩_ǫ] equals

(1 − ǫ) k R(d) − √(k V(d)/(2π)) e^{−(Q⁻¹(ǫ))²/2} + θ(k)    (119)

where

V(d) = Var[ȷ_S(S, d)]    (120)

is the rate-dispersion function, and the remainder term satisfies

− 2 log2 k + O(log log k) ≤ θ(k) ≤ (1/2) log2 k + O(1)    (121)
Proof: The case ǫ = 0 was shown in [16]. For ǫ > 0, note that due to (88), assumption (iv) implies that the twelfth (and thus the third) absolute moment of ȷ_S(S, d) is finite, and the expansion for E[⟨ȷ_{S^k}(S^k, d)⟩_ǫ] follows from (118) and Lemma 1. The converse direction is now immediate from (94) and (109). The achievability direction follows by an application of Lemma 2 below
to (95) and (110).

Lemma 2. Let 0 < ǫ ≤ 1. Under assumptions (i)–(iv),

E[⟨− log2 P_{Z^{k⋆}}(B_d(S^k))⟩_ǫ] = (1 − ǫ) k R(d) − √(k V(d)/(2π)) e^{−(Q⁻¹(ǫ))²/2} + θ(k)    (122)

where

− O(1) ≤ θ(k) ≤ (1/2) log2 k + O(1)    (123)

Proof: Appendix D.
APPENDIX A

PROOF OF LEMMA 1
The following non-uniform strengthening of the Berry–Esseen inequality will be instrumental.

Theorem 8 (Bikelis (1966), e.g. [28]). Fix a positive integer k. Let X_i, i = 1, . . . , k, be independent, E[X_i] = 0, E[|X_i|³] < ∞. Then, for any real t,

| P[ Σ_{i=1}^k X_i > t √(k V_k) ] − Q(t) | ≤ B_k / (√k (1 + |t|³))    (124)

where

V_k = (1/k) Σ_{i=1}^k E[|X_i|²]    (125)
T_k = (1/k) Σ_{i=1}^k E[|X_i|³]    (126)
B_k = c₀ T_k / V_k^{3/2}    (127)
and c₀ is a positive constant. Denote for brevity

Σ_k ≜ Σ_{i=1}^k X_i    (128)

If Var[X] = 0, then

E[⟨Σ_k⟩_ǫ] = (1 − ǫ) k E[X]    (129)

and we are done. If Var[X] > 0, notice that

(1 − ǫ) k E[X] − E[⟨Σ_k⟩_ǫ] = E[(Σ_k − k E[X]) 1{Σ_k ≥ η}] + (1 − α)(k E[X] − η) P[Σ_k = η]    (130)

where η and α are those in (14). Applying Theorem 8 to (14), we observe that

η = k E[X] + √(k Var[X]) Q⁻¹(ǫ) + b_k    (131)

where b_k = O(1), and that the second term in the right side of (130) is O(1). To evaluate the first term, assume for now that ǫ < 1/2,
so that the random variable (Σk − kH(X)) 1 {Σk ≥ η} is DRAFT
24
nonnegative for large enough k, and write its expectation as an integral of its complementary cdf:
$$ E\left[(\Sigma_k - kE[X])\, 1\{\Sigma_k \ge \eta\}\right] $$
$$ = \int_0^\infty P[\Sigma_k > \eta + t]\, dt \qquad (132) $$
$$ = \int_{b_k}^\infty P\left[\Sigma_k > kE[X] + \sqrt{k\,\mathrm{Var}[X]}\, Q^{-1}(\epsilon) + t\right] dt \qquad (133) $$
$$ = \int_0^\infty P\left[\Sigma_k > kE[X] + \sqrt{k\,\mathrm{Var}[X]}\, Q^{-1}(\epsilon) + t\right] dt + O(1) \qquad (134) $$
$$ = \sqrt{k\,\mathrm{Var}[X]} \int_0^\infty P\left[\Sigma_k > kE[X] + \sqrt{k\,\mathrm{Var}[X]}\,(Q^{-1}(\epsilon) + r)\right] dr + O(1) \qquad (135) $$
$$ = \sqrt{k\,\mathrm{Var}[X]} \int_0^\infty Q\left(Q^{-1}(\epsilon) + r\right) dr + O(1) \qquad (136) $$
$$ = \sqrt{k\,\mathrm{Var}[X]}\; \frac{1}{\sqrt{2\pi}} \int_{Q^{-1}(\epsilon)}^\infty x\, e^{-\frac{x^2}{2}}\, dx + O(1) \qquad (137) $$
$$ = \sqrt{k\,\mathrm{Var}[X]}\; \frac{1}{\sqrt{2\pi}}\, e^{-\frac{(Q^{-1}(\epsilon))^2}{2}} + O(1) \qquad (138) $$
where (136) follows by applying Theorem 8 to the integrand in the left side and observing that
$$ \int_0^\infty \frac{dr}{1 + \left|Q^{-1}(\epsilon) + r\right|^3} < \infty $$
The case ǫ > 1/2
is shown in an analogous manner, writing the first term in (130) as
$$ E\left[(\Sigma_k - kE[X])\, 1\{\Sigma_k \ge \eta\}\right] = E\left[(kE[X] - \Sigma_k)\, 1\{\Sigma_k < \eta\}\right] \qquad (140) $$
where the random variable in the right side is positive for large enough k. For ǫ = 1/2, write
$$ E\left[(\Sigma_k - kE[X])\, 1\{\Sigma_k \ge \eta\}\right] \le E\left[(\Sigma_k - kE[X])\, 1\{\Sigma_k - kE[X] \ge \max\{0, b_k\}\}\right] + E\left[(\Sigma_k - kE[X])\, 1\{-|b_k| \le \Sigma_k - kE[X] \le 0\}\right] \qquad (141) $$
where the second term is O(1/√k) and the first term is evaluated in the same manner as (132).

APPENDIX B
PROOF OF (75)
Denote for brevity
$$ f(\epsilon) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{(Q^{-1}(\epsilon))^2}{2}} \qquad (142) $$
Direct computation yields
$$ f(\epsilon) = -\frac{1}{(Q^{-1})'(\epsilon)} \qquad (143) $$
$$ f'(\epsilon) = Q^{-1}(\epsilon) \qquad (144) $$
$$ f''(\epsilon) = -\frac{1}{f(\epsilon)} \qquad (145) $$
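The identities (144) and (145) are easy to sanity-check numerically, since f(ǫ) is the standard normal density evaluated at Q⁻¹(ǫ). A minimal finite-difference sketch (step sizes and test points are arbitrary choices):

```python
from statistics import NormalDist
import math

N = NormalDist()

def f(eps: float) -> float:
    """f(eps) = (1/sqrt(2 pi)) exp(-Q^{-1}(eps)^2 / 2), i.e. the standard
    normal density evaluated at Q^{-1}(eps) = N.inv_cdf(1 - eps)."""
    return N.pdf(N.inv_cdf(1.0 - eps))

def fd1(g, x, h=1e-6):
    """Central first finite difference."""
    return (g(x + h) - g(x - h)) / (2.0 * h)

def fd2(g, x, h=1e-4):
    """Central second finite difference."""
    return (g(x + h) - 2.0 * g(x) + g(x - h)) / (h * h)

eps = 0.2
assert abs(fd1(f, eps) - N.inv_cdf(1.0 - eps)) < 1e-4  # (144): f'(eps) = Q^{-1}(eps)
assert abs(fd2(f, eps) + 1.0 / f(eps)) < 1e-3          # (145): f''(eps) = -1/f(eps)
```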
Furthermore, using the bounds
$$ \frac{x}{\sqrt{2\pi}\,(1+x^2)}\, e^{-\frac{x^2}{2}} < Q(x) < \frac{1}{\sqrt{2\pi}\, x}\, e^{-\frac{x^2}{2}}, \quad x > 0 \qquad (146) $$
we infer that as ǫ → 0,
$$ Q^{-1}(\epsilon) = \sqrt{2\log_e \frac{1}{\epsilon} + O\left(\log_e \log_e \frac{1}{\epsilon}\right)} \qquad (147) $$
Finally,
$$ \lim_{\epsilon \to 0} \frac{f(\epsilon) - \epsilon\sqrt{2\log_e \frac{1}{\epsilon}}}{\epsilon} = \lim_{\epsilon \to 0} \frac{f(\epsilon) - \epsilon f'(\epsilon)}{\epsilon} \qquad (148) $$
$$ = \lim_{\epsilon \to 0} \left(-f''(\epsilon)\,\epsilon\right) \qquad (149) $$
$$ = \lim_{\epsilon \to 0} \frac{\epsilon}{f(\epsilon)} \qquad (150) $$
$$ = \lim_{\epsilon \to 0} \frac{1}{Q^{-1}(\epsilon)} \qquad (151) $$
$$ = 0 \qquad (152) $$
where
• (148) is due to (144) and (147);
• (149) is by l'Hôpital's rule;
• (150) applies (145);
• (151) is by l'Hôpital's rule and (144).

APPENDIX C
PROOF OF THEOREM 6
Given P_S and d, denote for an arbitrary P_{Γ|S}
$$ R_{S|\Gamma=\gamma}(d, \epsilon) \triangleq \min_{P_{Z|S}:\; P[d(S,Z)>d \,\mid\, \Gamma=\gamma] \le \epsilon} I(S; Z \mid \Gamma = \gamma) \qquad (153) $$
In the proof of the converse bound in (109), the following result will be instrumental.

Lemma 3. Suppose P_S, d > d_min and F ⊆ M are such that for all s ∈ F,
$$ \jmath_S(s, d) \ge r \quad \text{a.s.} \qquad (154) $$
for some real r. Then
$$ R_{S|S\in F}(d, \epsilon) \ge \left| (1-\epsilon)\, r + (1-\epsilon)\log_2 P[S \in F] - h(\epsilon) \right|^+ \qquad (155) $$

Proof: Denote
$$ p_S(z) \triangleq P\left[d(S, z) \le d \mid S \in F\right] \qquad (156) $$
$$ p \triangleq \sup_{z \in \widehat{M}} p_S(z) \qquad (157) $$
If ǫ > 1 − p, then R_{S|S∈F}(d, ǫ) = 0, so in the sequel we focus on the nontrivial case
$$ \epsilon \le 1 - p \qquad (158) $$
To lower-bound the left side of (155), we weaken the supremum in (83) by selecting a suitable pair (J(s), λ) satisfying the constraint in (84). Specifically, we choose
$$ \exp(-\lambda) = \frac{\epsilon\, p}{(1-\epsilon)(1-p)} \qquad (159) $$
$$ \exp(J(s)) = \exp(J) \triangleq \frac{1-\epsilon}{p}, \quad s \in F \qquad (160) $$
To verify that the condition (84) is satisfied, we substitute the choice in (159) and (160) into the left side of (84) to obtain
$$ (1-\epsilon)\,\frac{p_S(z)}{p} + \epsilon\,\frac{1-p_S(z)}{1-p} = 1 - \left(p - p_S(z)\right)\left[\frac{1-\epsilon}{p} - \frac{\epsilon}{1-p}\right] \qquad (161) $$
$$ \le 1 \qquad (162) $$
where (162) is due to (158) and the observation that the expression in square brackets in the
right side of (161) is nonnegative. Plugging (159) and (160) into (83), we conclude that
$$ R_{S|S\in F}(d, \epsilon) \ge J - \lambda\epsilon \qquad (163) $$
$$ = d(\epsilon \,\|\, 1-p) \qquad (164) $$
$$ \ge (1-\epsilon)\log_2 \frac{1}{p} - h(\epsilon) \qquad (165) $$
$$ \ge (1-\epsilon)\, r + (1-\epsilon)\log_2 P[S \in F] - h(\epsilon) \qquad (166) $$
where (166) is due to
$$ p_S(z) \le E\left[\exp\left(\lambda_S d - \lambda_S d(S, z)\right) \mid S \in F\right] \qquad (167) $$
$$ \le E\left[\exp\left(\jmath_S(S, d) + \lambda_S d - \lambda_S d(S, z) - r\right) \mid S \in F\right] \qquad (168) $$
$$ \le \frac{\exp(-r)}{P[S \in F]}\, E\left[\exp\left(\jmath_S(S, d) + \lambda_S d - \lambda_S d(S, z)\right)\right] \qquad (169) $$
$$ \le \frac{\exp(-r)}{P[S \in F]} \qquad (170) $$
where λ_S ≜ −R′_S(d), and
• (167) is Markov's inequality;
• (168) applies (154);
• (170) is equivalent to (91).
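The algebra behind (159)–(165) can be checked numerically. The sketch below uses arbitrary illustrative values of ǫ and p satisfying (158), reads exp(·) in (159)–(160) base 2 (consistent with rates in bits), and takes the left side of (84) to be the convex combination displayed in (161); it confirms that J − λǫ equals the binary divergence d(ǫ‖1 − p) and that the lower bound (165) holds:

```python
import math

def h2(x: float) -> float:
    """Binary entropy function, in bits."""
    return -x * math.log2(x) - (1.0 - x) * math.log2(1.0 - x)

def dbin(a: float, b: float) -> float:
    """Binary relative entropy d(a || b), in bits."""
    return a * math.log2(a / b) + (1.0 - a) * math.log2((1.0 - a) / (1.0 - b))

eps, p = 0.05, 0.3                                     # illustrative; eps <= 1 - p as in (158)
lam = -math.log2(eps * p / ((1.0 - eps) * (1.0 - p)))  # (159), with exp/log base 2
J = math.log2((1.0 - eps) / p)                         # (160)

# (163)-(164): J - lam * eps equals d(eps || 1 - p).
assert abs((J - lam * eps) - dbin(eps, 1.0 - p)) < 1e-12

# (161)-(162): for any p_S(z) in [0, p] the combination is at most 1.
for ps in (0.0, 0.1, 0.2, 0.3):
    assert (1.0 - eps) * ps / p + eps * (1.0 - ps) / (1.0 - p) <= 1.0 + 1e-12

# (165): d(eps || 1 - p) >= (1 - eps) log2(1/p) - h(eps).
assert dbin(eps, 1.0 - p) >= (1.0 - eps) * math.log2(1.0 / p) - h2(eps) - 1e-12
```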
Proof of Theorem 6: We start with the converse bound in (109). Note first that, similar to (44), the constraint in (19) is achieved with equality. Denoting the random variable
$$ \Gamma = \lfloor \jmath_S(S, d) \rfloor + 1 \qquad (171) $$
we may write
$$ I(S; Z) = I(S, \Gamma; Z) \qquad (172) $$
$$ = I(S; Z \mid \Gamma) + I(\Gamma; Z) \qquad (173) $$
so
$$ R_S(d, \epsilon) \ge \min_{P_{Z|S}:\; P[d(S,Z)>d] \le \epsilon} I(S; Z \mid \Gamma) \qquad (174) $$
$$ = \min_{\epsilon(\cdot):\; E[\epsilon(\Gamma)] \le \epsilon}\; \sum_{j=0}^{\infty} P_\Gamma(j)\, R_{S|\Gamma=j}(d, \epsilon(j)) \qquad (175) $$
We apply Lemma 3 to lower-bound each term of the sum by
$$ R_{S|\Gamma=j}(d, \epsilon(j)) \ge \left| (1-\epsilon(j))\, j + (1-\epsilon(j)) \log_2 P_\Gamma(j) - h(\epsilon(j)) \right|^+ \qquad (176) $$
to obtain
$$ R_S(d, \epsilon) \ge \min_{\epsilon(\cdot):\; E[\epsilon(\Gamma)] \le \epsilon} \left\{ E\left[(1-\epsilon(\Gamma))\, \jmath_S(S, d)\right] - E\left[h(\epsilon(\Gamma))\right] \right\} - H(\Gamma) \qquad (177) $$
$$ \ge \min_{\epsilon(\cdot):\; E[\epsilon(\Gamma)] \le \epsilon} E\left[(1-\epsilon(\Gamma))\, \jmath_S(S, d)\right] - H(\Gamma) - h(\epsilon) \qquad (178) $$
$$ \ge E\left[\langle \jmath_S(S, d) \rangle_\epsilon\right] - H(\Gamma) - h(\epsilon) \qquad (179) $$
$$ \ge E\left[\langle \jmath_S(S, d) \rangle_\epsilon\right] - \log_2\left(E[\jmath_S(S)] + 1\right) - \log_2 e - h(\epsilon) \qquad (180) $$
where (177) uses (92), (178) is by concavity of h(·), (179) is due to (115), and (180) holds because Γ + λ_S d ≥ ȷ_S(S) ≥ 0, and the entropy of a random variable on Z₊ with a given mean is maximized by that of the geometric distribution.

To show the upper bound in (110), fix an arbitrary distribution P_Z̄ and define the conditional probability distribution P_{Z|S} through⁵
$$ P_{Z|S=s}(z) = \begin{cases} \dfrac{1\{d(s,z)\le d\}}{P_{\bar Z}(B_d(s))}\; P_{\bar Z}(z), & \left\langle -\log_2 P_{\bar Z}(B_d(s)) \right\rangle_\epsilon > 0 \\ P_{\bar Z}(z), & \text{otherwise} \end{cases} \qquad (181) $$
By the definition of P_{Z|S},
P [d(S, Z) > d] ≤ ǫ
(182)
Upper-bounding the minimum in (19) with the choice of P_{Z|S} in (181), we obtain the following nonasymptotic bound:
$$ R_S(d, \epsilon) \le I(S; Z) \qquad (183) $$
$$ = D\left(P_{Z|S} \,\|\, P_{\bar Z} \mid P_S\right) - D\left(P_Z \,\|\, P_{\bar Z}\right) \qquad (184) $$
$$ \le D\left(P_{Z|S} \,\|\, P_{\bar Z} \mid P_S\right) \qquad (185) $$
$$ = E\left[\left\langle -\log_2 P_{\bar Z}(B_d(S)) \right\rangle_\epsilon\right] \qquad (186) $$

⁵ Note that in general P_S → P_{Z|S} ↛ P_{\bar Z}.
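To make the bound (186) concrete, the sketch below evaluates E[⟨− log₂ P_Z̄(B_d(S))⟩_ǫ] for a toy binary source. The ⟨·⟩_ǫ operator is implemented under one reading of its definition in (14): the largest values, of total probability ǫ, are zeroed out (splitting the boundary atom). The source, the distortion level d = 0 (so that B_d(s) = {s}), and the choice P_Z̄ = P_S are all illustrative assumptions:

```python
import math

def eps_cutoff_mean(values_probs, eps):
    """E[<X>_eps] under one reading of the cutoff operator in (14):
    remove the largest values of X, of total probability eps (splitting
    the boundary atom), and take the mean of what survives."""
    budget = eps
    total = 0.0
    for v, p in sorted(values_probs, key=lambda t: -t[0]):
        cut = min(p, budget)       # probability mass removed from the top
        budget -= cut
        total += v * (p - cut)     # surviving mass contributes to the mean
    return total

# Sanity check against (129): for a constant X = c, E[<X>_eps] = (1 - eps) c.
assert abs(eps_cutoff_mean([(2.0, 1.0)], 0.1) - 0.9 * 2.0) < 1e-12

# Toy evaluation of (186): binary source, Hamming distortion with d = 0,
# so B_d(s) = {s} and -log2 P_Zbar(B_d(s)) = -log2 P_Zbar(s).
P_S = {0: 0.8, 1: 0.2}
P_Zbar = dict(P_S)                 # an illustrative (not necessarily optimal) P_Zbar
pairs = [(-math.log2(P_Zbar[s]), P_S[s]) for s in P_S]
bound = eps_cutoff_mean(pairs, 0.1)
# Tolerating eps = 0.1 can only decrease the bound relative to eps = 0,
# where it equals H(S).
assert bound <= eps_cutoff_mean(pairs, 0.0) + 1e-12
```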
APPENDIX D
PROOF OF LEMMA 2
The following refinement of the lossy AEP is essentially contained in [17].

Lemma 4. Under restrictions (i)–(iv), there exist constants C₁, C₂ such that, eventually, almost surely,
$$ \log_2 \frac{1}{P_{Z^{k\star}}(B_d(S^k))} \le \sum_{i=1}^{k} \jmath_S(S_i, d) + \frac{1}{2}\log_2 k - k\lambda_S\left(d - \bar d(S^k)\right) + kC_1\left(d - \bar d(S^k)\right)^2 + C_2 \qquad (187) $$
where
$$ \bar d(s^k) \triangleq \frac{1}{k} \sum_{i=1}^{k} E\left[d(s_i, Z^\star) \mid S = s_i\right] \qquad (188) $$

Proof: It follows from [17, (4.6), (5.5)] that the probability of violating (187) is O(1/k²). Since Σ_{k=1}^∞ 1/k² converges, by the Borel–Cantelli lemma (187) holds with probability 1 for all k large enough.

Noting that d̄(S^k) is a normalized sum of independent random variables with mean d, we conclude using Lemma 4 that for k large enough,
$$ E\left[\log_2 \frac{1}{P_{Z^{k\star}}(B_d(S^k))}\right] \le kR(d) + \frac{1}{2}\log_2 k + O(1) \qquad (189) $$
Lemma 2 is now immediate from (116) and (117) and the expansion for E[⟨ȷ_{S^k}(S^k, d)⟩_ǫ] in (119).
REFERENCES

[1] W. Szpankowski and S. Verdú, "Minimum expected length of fixed-to-variable lossless compression without prefix constraints: memoryless sources," IEEE Transactions on Information Theory, vol. 57, no. 7, pp. 4017–4025, 2011.
[2] S. Verdú and I. Kontoyiannis, "Optimal lossless data compression: Non-asymptotics and asymptotics," IEEE Transactions on Information Theory, vol. 60, no. 2, pp. 777–795, Feb. 2014.
[3] N. Alon and A. Orlitsky, "A lower bound on the expected length of one-to-one codes," IEEE Transactions on Information Theory, vol. 40, no. 5, pp. 1670–1672, 1994.
[4] A. D. Wyner, "An upper bound on the entropy series," Inf. Contr., vol. 20, no. 2, pp. 176–181, 1972.
[5] T. S. Han, "Weak variable-length source coding," IEEE Transactions on Information Theory, vol. 46, no. 4, pp. 1217–1226, 2000.
[6] ——, Information Spectrum Methods in Information Theory. Springer, Berlin, 2003.
[7] H. Koga and H. Yamamoto, "Asymptotic properties on codeword lengths of an optimal fixed-to-variable code for general sources," IEEE Transactions on Information Theory, vol. 51, no. 4, pp. 1546–1555, Apr. 2005.
[8] A. Kimura and T. Uyematsu, "Weak variable-length Slepian–Wolf coding with linked encoders for mixed sources," IEEE Transactions on Information Theory, vol. 50, no. 1, pp. 183–193, 2004.
[9] Y. Polyanskiy, H. V. Poor, and S. Verdú, "Feedback in the non-asymptotic regime," IEEE Transactions on Information Theory, vol. 57, no. 8, pp. 4903–4925, 2011.
[10] H. Koga, "Source coding using families of universal hash functions," IEEE Transactions on Information Theory, vol. 53, no. 9, pp. 3226–3233, Sep. 2007.
[11] V. Erokhin, "Epsilon-entropy of a discrete random variable," Theory of Probability and Applications, vol. 3, pp. 97–100, 1958.
[12] V. Strassen, "Asymptotische Abschätzungen in Shannons Informationstheorie," in Proceedings 3rd Prague Conference on Information Theory, Prague, 1962, pp. 689–723.
[13] Y. Polyanskiy, H. V. Poor, and S. Verdú, "Channel coding rate in the finite blocklength regime," IEEE Transactions on Information Theory, vol. 56, no. 5, pp. 2307–2359, May 2010.
[14] V. Kostina and S. Verdú, "Fixed-length lossy compression in the finite blocklength regime," IEEE Transactions on Information Theory, vol. 58, no. 6, pp. 3309–3338, June 2012.
[15] I. Kontoyiannis, "Pointwise redundancy in lossy data compression and universal lossy data compression," IEEE Transactions on Information Theory, vol. 46, no. 1, pp. 136–152, Jan. 2000.
[16] Z. Zhang, E. Yang, and V. Wei, "The redundancy of source coding with a fidelity criterion," IEEE Transactions on Information Theory, vol. 43, no. 1, pp. 71–91, Jan. 1997.
[17] E. Yang and Z. Zhang, "On the redundancy of lossy source coding with abstract alphabets," IEEE Transactions on Information Theory, vol. 45, no. 4, pp. 1092–1110, May 1999.
[18] S. Leung-Yan-Cheong and T. Cover, "Some equivalences between Shannon entropy and Kolmogorov complexity," IEEE Transactions on Information Theory, vol. 24, no. 3, pp. 331–338, 1978.
[19] I. Csiszár and J. Körner, Information Theory: Coding Theorems for Discrete Memoryless Systems. New York: Academic, 1981.
[20] A. A. Yushkevich, "On limit theorems connected with the concept of entropy of Markov chains," Uspekhi Matematicheskikh Nauk, vol. 8, no. 5, pp. 177–180, 1953.
[21] W. Szpankowski, "A one-to-one code and its anti-redundancy," IEEE Transactions on Information Theory, vol. 54, no. 10, pp. 4762–4766, 2008.
[22] C. E. Shannon, "Coding theorems for a discrete source with a fidelity criterion," IRE Int. Conv. Rec., vol. 7, pp. 142–163, Mar. 1959; reprinted with changes in Information and Decision Processes, R. E. Machol, Ed. New York: McGraw-Hill, 1960, pp. 93–126.
[23] J. Kieffer, "Strong converses in source coding relative to a fidelity criterion," IEEE Transactions on Information Theory, vol. 37, no. 2, pp. 257–262, Mar. 1991.
[24] I. Csiszár, "On an extremum problem of information theory," Studia Scientiarum Mathematicarum Hungarica, vol. 9, no. 1, pp. 57–71, Jan. 1974.
[25] A. Dembo and I. Kontoyiannis, "Critical behavior in lossy source coding," IEEE Transactions on Information Theory, vol. 47, no. 3, pp. 1230–1236, 2001.
[26] E. C. Posner, E. R. Rodemich, and H. Rumsey, Jr., "Epsilon entropy of stochastic processes," The Annals of Mathematical Statistics, vol. 38, no. 4, pp. 1000–1020, 1967.
[27] E. C. Posner and E. R. Rodemich, "Epsilon entropy and data compression," The Annals of Mathematical Statistics, pp. 2079–2125, 1971.
[28] V. V. Petrov, Limit Theorems of Probability Theory. Oxford Science Publications, 1995.