Rate Distortion Behavior of Sparse Sources - CiteSeerX

0 downloads 0 Views 351KB Size Report
Dec 19, 2008 - rate distortion behavior is characterized using upper bounds. Sparse spikes may ... classifying quantization (MCQ) from the companion paper. [13] and applies ..... delta function, the “pdf” of the BG spike can be written as f(x) = (1 − p)δ(x) ..... over [−a, a] (since limb→∞ 2ac = 1), with an infinite tail contributing ...
SUBMITTED TO IEEE TRANSACTIONS ON INFORMATION THEORY

1

Rate Distortion Behavior of Sparse Sources Claudio Weidmann and Martin Vetterli Submitted to IEEE Trans. Information Theory on December 19, 2008.

Index Terms—Sparse signal representations, rate distortion theory, memoryless systems, entropy, transform coding.

I. I NTRODUCTION HE success of wavelet-based coding, especially in image compression, is often attributed to the ability of wavelets to “isolate” singularities, something Fourier bases fail to do efficiently [1]. Thus, a piecewise smooth signal is mapped through the wavelet transform into a sparse set of non-zero transform coefficients, namely coefficients around discontinuities, as well as coefficients representing the general trend of the signal [2]. While this behavior is well understood in terms of nonlinear approximation power (that is, approximation by N largest terms of the wavelet transform, see [3] for a thorough treatment), the rate distortion behavior is more open. Early work on nonlinear approximation (NLA) of random functions [4] concentrated on the approximation error as a function of the number of approximation terms, neglecting the tradeoff between the rate needed to identify these terms and the rate used to quantize each term. The work by Mallat and Falzon [5] was the first to analyze the operational lowrate behavior of transform image coding, and showed the very

T

The material in this paper was presented in part at the Data Compression Conference, Snowbird UT, March 1999 and 2000, and at the IEEE International Symposium on Information Theory, Washington DC, June 2001. Part of this work stems from the Ph.D. thesis of the first author and was supported by an ETHZ/EPFL fellowship. Claudio Weidmann was with the Audiovisual Communications Laboratory, EPFL, Lausanne, Switzerland, and is now with ftw. Telecommunications Research Center Vienna, A-1220 Vienna, Austria. E-mail: [email protected]. Martin Vetterli is with the Audiovisual Communications Laboratory, EPFL, Lausanne, Switzerland and with the Department of EECS, UC Berkeley, Berkeley CA 94720.

MSE Distortion [dB]

0 −5 −10 −15 −20 −25 −30 −35 −40 0

0.5

1

1.5

2

2.5

3

3.5

4

2.5

3

3.5

4

Rate [bits]

(a) 100

Coefficients used [%]

Abstract—The rate distortion behavior of sparse memoryless sources is studied. Such sources serve as models for sparse representations and can be used for the performance analysis of “sparsifying” transforms like the wavelet transform, as well as nonlinear approximation schemes. Under the Hamming distortion criterion, R(D) is shown to be almost linear for sources emitting sparse binary vectors. For continuous random variables, the geometric mean is proposed as a sparsity measure and shown to lead to upper and lower bounds on the entropy, thereby characterizing asymptotic R(D) behavior. Three models are analyzed more closely under the mean squared error distortion measure: continuous spikes in random discrete locations, power laws matching the approximately scale-invariant decay of wavelet coefficients, and Gaussian mixtures. The latter are versatile models for sparse data, which in particular allow to bound the suitably defined coding gain of a scalar mixture compared to that of a corresponding unmixed transform coding system. Such a comparison is interesting for transforms with known coefficient decay, but unknown coefficient ordering, e.g. when the positions of highest-variance coefficients are unknown. The use of these models and results in distributed coding and compressed sensing scenarios is also discussed.

10

1

0.1

0.01 0

0.5

1

1.5

2

Rate [bits]

(b) Fig. 1. (a) Operational distortion rate points of a wavelet coder applied to the Lena image. The knee shape, leading from steep decay at low rates to the asymptotic −6 dB/bit slope, is typical for such image coders. (b) At low rates, only a small fraction of coefficients is quantized to nonzero values, all the others are not used in the reconstruction of the image.

different behavior with respect to classic linear KarhunenLo`eve transform (KLT) theory. In essence, at low rates, only a few wavelet coefficients are involved in the approximation of piecewise smooth functions, leading to a steeper decay of the distortion rate function compared to the classic exponential decay in the case of Gauss-Markov processes and the KLT. This result had been observed experimentally in low-rate image coding; see Figure 1 for an example displaying the typical steep distortion decay at low rates, which then gives way to the familiar −6 dB/bit slope at higher rates. It is probably worthwhile to contrast the KLT on jointly Gaussian processes with the wavelet transform on piecewise smooth processes. In the KLT case, the optimal strategy is waterfilling [3, Sec. 11.3], [6, Sec. 13.3.3], and the approximation process is linear (up to quantization), that is, the set of coefficients that are quantized and used for reconstruction is fixed a priori based on statistical signal properties. In the wavelet transform approach, the approximation is nonlinear, since the coefficients for reconstruction are chosen a posteriori based on the transformed signal realization. A key element of efficient compression is to “point to” the important coefficients (many data structures have been proposed just for this, for example zero trees [7]). This underlines the importance of “location” in compressing vectors with few important coefficients, see [8]

2

for a thorough analysis in the context of wavelets. The above-cited results indicate the interest to understand more fully the rate distortion behavior of sparse signal representations, in particular, to narrow down rates and distortions within constants, avoiding the loose factors in the exponent that are often present in approximation results. The approach taken in the present paper is to propose simple memoryless models for sparse sources and study their rate distortion behavior. Since unitary “sparsifying” transforms leave vector norms unchanged, for the mean squared error (MSE) distortion measure it is sufficient to study the rate distortion function of sparse sources that model the transform coefficients. This allows to understand the compression of signals, like piecewise smooth functions, that are “sparsified” by the wavelet transform (which in the case of the popular 9/7 biorthogonal wavelet is nearly unitary). We stress that although wavelet image transform coefficients serve as a practical example throughout this paper, its main theme is sparsity per se. Moreover, the analysis is not restricted to low rates (for which closed-form results are hard to obtain. . . ) as it considers also the relationship between sparsity and asymptotic compressibility. Recently, sparse sources have received renewed attention with the work on sparse sampling [9] and compressed sensing (CS) [10], [11]. In both cases, exact reconstruction is studied, and approximation is possible, but not necessarily well understood. In particular, the rate distortion behavior is mostly an open problem, with some initial results in [12]. Thus, the present paper fills a gap, giving either precise results or tight bounds on the rate distortion behavior of models that serve as benchmarks for these methods. The remainder of this paper is structured as follows. Section II presents three classes of models for sparse sources that will be studied, namely sparse binary vectors, mixed discrete/continuous “spikes” and continuous sparse sources. Section III looks at sparse binary vectors (sparsity patterns) as a model for position coding. In the deterministic case, when the number of non-zero entries is known a priori, it is possible to give closed-form R(D) expressions for Hamming distortion. Interestingly, for sparse spikes, R(D) is almost linear. In the non-deterministic case of a Bernoulli-p source, Theorem 3 states that R(D) is asymptotically linear as p → 0. Section IV considers a mixed discrete/continuous BernoulliGaussian spike source, which uses a Bernoulli source to switch a Gaussian source on or off (thus it emits strictly sparse sequences). So both position and value are important. The MSE rate distortion behavior is characterized using upper bounds. Sparse spikes may be used to model the steep distortion decay in low-rate NLA, but they fail to model the behavior at medium to high rates, for which continuous source models turn out to be more appropriate. Section V introduces the normalized geometric mean of a continuous source as a sparsity measure that leads to lower and upper bounds on the source entropy. Therefore it is a good means to characterize the asymptotic rate distortion behavior of sparse sources. It also gives a more quantifiable meaning to the notion of “compressible” signal. In compressed sensing, this denotes a signal that is not strictly sparse.

SUBMITTED TO IEEE TRANSACTIONS ON INFORMATION THEORY

Section VI introduces upper bounds on D(R) of magnitude classifying quantization (MCQ) from the companion paper [13] and applies them to a popular power-law model for approximately scale-invariant data, such as wavelet coefficients. In order to overcome some of the limitations of the Bernoulli-Gaussian spike source, Section VII considers Gaussian mixture models, providing upper and lower bounds which capture the essential D(R) characteristics (the “knee” shape) of sparse sources. Based on MCQ, a notion akin to the classic coding gain of transform coding is introduced. In the case of Gaussian transform coefficients, it is possible to bound the loss in compression gain if the coefficients are randomly mixed (that is, if one knows only their variances, but not their positions). Finally, Section VIII briefly outlines how the results can be applied to distributed coding and compressed sensing scenarios. II. M ODELS FOR S PARSE S OURCES When using sparse signal representations as building blocks for lossy source coding, the goal is to concentrate most signal energy in as few coefficients as possible. Lossy compression then proceeds by selecting a subset of coefficients that will be quantized. Nonlinear approximation (NLA) methods will generally select the largest coefficients first, other (linear) methods might select a fixed set depending on the coding rate or some other criterion. Signal energy is measured by the square norm and hence the appropriate distortion measure is the mean squared error (MSE). Our approach in this paper is to model the coefficients representing the signal as coming from a sparse source, which emits an i.i.d. sequence of random variables. We will study the rate distortion behavior of three different classes of sparse source models: 1) Sparse binary vector sources might serve as a model of a significance map or sparsity pattern, that is, the binary map used to record the positions of significant coefficients in a NLA scheme (these are the coefficients which are actually used to reconstruct the signal). The natural distortion measure for this setting is the average Hamming distance. We will analyze both sources that emit vectors of length N containing exactly K ones and Bernoulli-p (binary memoryless) sources, emitting sequences of i.i.d. binary random variables. 2) Spike sources are a generalization of sparse binary sources, where each binary 1 is associated with a continuous random variable. In particular, we will study the product of a Bernoulli-p source (emitting 0 or 1) and a memoryless Gaussian source, using the MSE distortion measure. This might serve as a crude model of (very) low-rate wavelet-based NLA coding, when only a small subset of coefficients are used to represent the signal. 3) Continuous sparse sources are memoryless sources emitting i.i.d. continuous random variables with a peaked unimodal density (the mode is assumed to be zero). Examples are power laws, Laplacian and generalized Gaussian densities with exponent smaller than

WEIDMANN AND VETTERLI: RATE DISTORTION BEHAVIOR OF SPARSE SOURCES

3

Single Spike in N = 2, 3, 4, 6, 10, 20 Positions

one. Such sparse sources can be used as a zero-th order model for wavelet coefficients in e.g. image coding [14]. In particular, we will show that very simple Gaussian mixtures are sufficient to capture the key aspects of the rate distortion behavior of sparse wavelet coefficients.

A sparse binary source models e.g. the positions of significant coefficients in a nonlinear approximation scheme. The appropriate distortion measure for lossy coding is the Hamming distance. We will first study memoryless vector sources that emit exactly K ones in a vector of length N , after which we look at a simple scalar binary memoryless (Bernoulli-p) source. In both cases, the rate distortion function becomes linear for asymptotically sparse sources, for which the average Hamming weight approaches zero. The interest of the vector model lies mainly in the fact that it is possible to obtain analytic expressions for R(D), of which there are not many examples in rate distortion theory. A lossy source encoder maps a vector X to a reconstructed ˆ The fidelity of this approximation is measured by version X. the per-letter Hamming distance ˆ = dH (X, X)

N X

[1 − δ(Xi − Xˆi )].

(1)

i=1

This is equivalent to a frequency of error criterion where both types of errors have the same cost (coding a one when there is none and vice-versa). The rate distortion function R(D) gives the minimum rate R necessary to encode the source with fidelity D. Equivalently, R(D) may also be expressed as distortion rate function D(R). Definition 1 The binary (K, N ) source is a memoryless source that emits binary vectors of length N and  Hamming N weight K, with uniform probability over the K possibilities. Since the source alphabet size is finite, the rate distortion problem is not a proper multidimensional problem and can actually be solved with the methods for discrete memoryless sources summarized in Appendix A. A. Binary Vectors of Weight 1 We first analyze the simplest case of a binary (1, N ) source X, which is equivalent to a memoryless uniform source U with alphabet U = {1, 2, . . . , N }. Using the standard basis vectors ei we can write X = eU . It can be shown (see Theorem 14 from [15] in Appendix A) that just one additional reconstruction letter is needed to achieve the rate distortion bound, and it will map to the all-zero vector 0. To see that it can only be the all-zero vector, consider the source alphabet {e1 , e2 , . . . , eN }, which consists of all vectors of Hamming weight one. Any other non-zero vector will be at Hamming distance one or more from these vectors and thus can only worsen the distortion achieved by the all-zero vector, which is exactly one. If we define Uˆ = U ∪ {0} and e0 = 0, then

0.9

0.8

0.7

Hamming Distortion

III. S PARSE B INARY S OURCES

1

0.6

0.5

0.4

0.3

0.2

0.1

0

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

R/log(N)

Fig. 2. D(R) for the binary (1, N ) source with Hamming distortion, N = 2 . . . 20 (bottom to top curve). The rate has been normalized by log N . For N → ∞, D(R) becomes a straight line, see (2).

everything fits nicely. Using u ˆ = 0 corresponds to not coding the position. We get the following distortion measure: ρ(u; u ˆ) = dH (eu , euˆ ) = 2[1 − δ(u − u ˆ)] − δ(ˆ u). Thus “giving the right answer” has zero distortion, a wrong answer two, and not answering costs one distortion unit. Proposition 1 The rate distortion function of a binary (1, N ) source for N ≥ 2 and the Hamming distortion criterion (1) is ( 2 (1 − D) log(N − 1), N < D ≤ 1,  R(D) = D 2 log N − D 2 log(N − 1) − hb 2 , 0 ≤ D ≤ N , (2) where hb (p) = −p log p − (1 − p) log(1 − p) is the binary entropy function. The proof appears in Appendix B; Figure 2 shows a set of typical D(R) functions. As N becomes large, the linear segment dominates the rate distortion characteristics. Further we observe that in the special case N = 2, the solution degrades to twice the D(R) function of a binary symmetric source.

B. Binary Vectors of Weight K  N Now we consider a source emitting one of the K binary vectors of length N and Hamming weight K, uniformly at random. We look only at the case where the number of ones is K ≤ N/2, since the other case (N/2 ≤ K ≤ N ) is complementary. Proposition 2 The rate distortion function of a binary (K, N ) source with K ≤ N/2 is composed of a linear part: R(D) = (D − K) log β0 ,

D(β0 ) < D < K

4

SUBMITTED TO IEEE TRANSACTIONS ON INFORMATION THEORY

K = 16, 12, 8, 4, 2 Spikes in N = 32 Positions

Proof: The rate distortion function of the BMS is R(D) = hb (p) − hb (D) for D ≤ p ≤ 21 [6, Thm. 13.3.1]. Therefore

1

0.9

hb (pd) R(pd) =1− hb (p) hb (p) pd log(pd) + (1 − pd) log(1 − pd) =1− , p log(p) + (1 − p) log(1 − p)

Hamming Distortion D/D max

0.8

0.7

0.6

0.5

from which

0.4

lim

0.3

p→0

R(pd) d log(pd) − d log(1 − pd) = 1 − lim p→0 hb (p) log p − log(1 − p) d/p + d2 /(1−pd) p→0 1/p + 1/(1−pd)

0.2

= 1 − lim

0.1

=1−d 0

0

0.1

0.2

0.3

0.4

0.5

R/Rmax

0.6

0.7

0.8

0.9

1

Fig. 3. D(R) for the binary (K, N ) source with Hamming distortion, K = 2 . . . 16 spikes (top to bottom curve)in N = 32 positions. Rate and distortion N are normalized to R∗ = R/ log K and D∗ = D/K, respectively.

and a part which can be expressed parametrically for 0 < β < β0 : D(β) =

K X

bd 2d,

d=1

R(β) = log

  X K K X N + bd log bd − bd log wd . K d=0

d=0

The quantities β0 , bd and wd are defined in the proof in Appendix C. Figure 3 shows that for sparse spikes (small K/N ) the linear segment again dominates the D(R) behavior. The consequence of this “almost linear” behavior for sparse spikes is the following: to build a close  to optimal encoder for intermediate N rates 0 < R < log K , we can simply multiplex between a  N rate 0 code (no spikes coded) and one with rate log K (all K spikes coded exactly). Put otherwise, if we have a bit budget to be spent in coding a sparse binary vector, we can simply go ahead and code the exact positions of the ones (the spikes) until we run out of bits. C. Sparse Binary Memoryless Sources The simplest model of a sparse binary source is a Bernoullip binary memoryless source (BMS) with p = Pr{X = 1}  1. It generalizes the above binary vector models, since blocks of N samples will contain close to pN ones on average, instead of a fixed number K. Theorem 3 Consider a Bernoulli-p source (p ≤ 12 ) with normalized Hamming distortion d = D/p. Then the normalized rate distortion function is asymptotically linear when p → 0: lim

p→0

R(pd) = 1 − d, hb (p)

0≤d≤1

Theorem 3 shows that if we normalize the rate and the distortion by their maxima, hb (p) and p, respectively, the rate distortion function becomes linear for sparse binary sources with p → 0. IV. S PIKE S OURCES The previous section studied sparse binary sources that model the position of significant coefficients. Now we also consider the values of those coefficients, by modeling them as continuous random variables. The resulting model is a discrete-time stochastic process that is zero almost all the time, except in a few positions, where spikes stick out. Distortion will be measured by the mean squared error (MSE). A simple model of a spike source can be obtained by multiplying the outputs of a binary source (emitting 0 or 1) and a memoryless continuous source. The binary source simply switches the value source on or off. Here we consider only Gaussiandistributed values, because they provide a worst-case benchmark for MSE distortion. Definition 2 (Bernoulli-Gaussian (BG) spike source) A Bernoulli-Gaussian spike source emits i.i.d. random variables that are the product of a binary random variable U with Pr{U = 1} = p and Pr{U = 0} = 1 − p and a zero-mean Gaussian random variable V with variance σv2 . Using Dirac’s delta function, the “pdf” of the BG spike can be written as f (x) = (1 − p)δ(x) + p √

2 2 1 e−x /2σv . 2πσv

(4)

BG spikes are mixed random variables that have both a discrete and a continuous component. From (4) it is clear that in general their distribution function is not absolutely continuous and therefore most results of standard rate distortion theory do not hold. The spike entropy cannot be computed with the usual integral, but only via mutual information conditioned on the discrete part [16, Ch. 2]. With this method, Rosenthal and Binia [17] derived the asymptotic rate distortion behavior of mixed random variables, as well as certain mixed random vectors. Their result coincides with the simple upper bound (5) presented below if the continuous part is Gaussian, otherwise their result is tighter. Later, Gy¨orgi et al. [18] extended these

WEIDMANN AND VETTERLI: RATE DISTORTION BEHAVIOR OF SPARSE SOURCES

0 −5

−10

−10

−15 −20 −25 −30 −35 −40

0 Bound (5) Bound (22) Numerical

−15 −20 −25 −30 −35

Bound (5) Bound (22) Numerical

−5 −10 MSE distortion [dB]

−5

MSE distortion [dB]

MSE distortion [dB]

0

5

−15 −20 −25 −30 −35

−40

−40

−45

−45

−50 0 0.2 0.4 0.6 0.8 1 1.2 Rate [bits]

−50 0 0.05 0.1 0.15 0.2 0.25 Rate [bits]

−50 0 0.05 0.1 0.15 0.2 0.25 Rate [bits]

(a)

(b)

(c)

−45

Bound (5) Bound (22) Numerical

Fig. 4. Distortion rate behavior of Bernoulli-Gaussian spikes for different values of the Bernoulli-p parameter (normalized to unit variance): (a) p = 0.11, (b) p = 0.01, (c) p = 0.005.

asymptotic results to random vectors with more general mixed distributions and to a wide class of sources with memory. A simple upper bound on D(R) of the spike source can be derived using an adaptive two-step code: √ 1. all samples with magnitudes above some  (0 <  < D) are classified as spikes and their positions encoded with a bitmap using hb (p) nats/sample; 2. the spike values are encoded with a Gaussian pσ 2 random codebook using 12 log Dv nats/spike [6, Thm. 13.3.2]. The resulting upper bound expressed as a function of rate is   2R − 2hb (p) 2 , R ≥ hb (p). (5) D(R) ≤ pσv exp − p This bound is loose at high distortions (i.e. low rates) and can be improved by coding only a fraction of the spikes. In particular, a tighter bound is obtained by varying the classification threshold  and optimizing over the resulting family of upper bounds; the result will be stated in Section VI. Figure 4 shows the bound (5) and the optimized low-rate bound (22) from Section VI, together with D(R) estimated numerically with the Blahut-Arimoto algorithm for different values of p. The asymptotic distortion decay is on the order of − p6 dB/bit, which can be much steeper than the −6 dB/bit typical of absolutely continuous random variables. This decay behavior is representative of spike sources, regardless whether the value is Gaussian or not. Comparing with Figure 1, we see that the spike D(R) behavior is very different from the one observed in actual image coders. Thus the spike source is certainly not a good general model for sparse transform coefficients. However, it explains the steep D(R) decay that can be achieved at very low rates, when only very few wavelet coefficients are used to represent the image. When the rate is higher, the spike model fails, because the abrupt change from zero to a Gaussian value distribution does not reflect the coefficient decay actually observed (i.e. the “continuous” sparsity that will be considered in the next section). There are other applications of spikes,

such as using them as a benchmark for transform coding. In the case of data-independent (linear) rate allocation, any transform of the spike process yields worse performance compared to nonlinear approaches [1]. For constant-value spikes (“1 in N ” as in Sec. III-A), any KLT basis contains the vector [1, 1, . . . , 1]T and thus always destroys sparsity [19]. V. E NTROPY OF C ONTINUOUS S PARSE S OURCES A. The Geometric Mean as a Sparsity Measure For the first two models studied, the natural measure of sparsity is the Hamming weight of a sample vector, normalized by the vector length. The case of a continuous sparse source is not as simple, since for a proper continuous random variable X we have Pr{X = 0} = 0 and therefore Hamming weight is useless, since it measures strict sparsity. In this section we will argue that the geometric mean, normalized by the standard deviation, is a useful single-parameter sparsity measure. A sequence of n positiveP real numbers, x1 , x2 , . . . , xn , n 1 has arithmetic mean A = n i=1 xi and geometric mean n Qn Gn = ( i=1 xi )1/n . The classic arithmetic-geometric mean inequality is Gn ≤ An , with equality iff all xi are equal. The geometric mean equals the side length of an n-cube with the same volume as the rectangular parallelepiped spanned by the xi . A small ratio Gn /An corresponds to a “thin” parallelepiped or a sparse sequence {xi }. Conversely, Gn /An = 1 yields a “fat” n-cube or the least sparse sequence {x1 = x2 = . . . = xn }. We will use the expected geometric mean of a block of sample magnitudes, in the limit of large block length, to measure the sparsity of a memoryless source. Definition 3 The geometric mean of a memoryless source X is G(X) = exp(E log |X|). To see that G(X) is well defined for a memoryless source X with density f (x) and is indeed the desired quantity, consider

6

SUBMITTED TO IEEE TRANSACTIONS ON INFORMATION THEORY

a block of n i.i.d. samples from such a Q source. The geometric n mean of these n samples is Gn (x) = ( i=1 xi )1/n , while its expected value is Z Y n n Y E Gn (X) = |xi |1/n f (xi ) dx i=1

=

n Z Y

i=1

|xi |1/n f (xi ) dxi = (E |X|1/n )n ,

i=1

since Fubini’s theorem can be applied to the product density. If we let the block size go to infinity, we obtain the geometric mean of the source [20, p. 139]: n  G(X) = lim E |X|1/n n→∞

1/p

= lim (E |X|p ) p→0+

= exp(E log |X|).

(6)

For a fixed source variance, if more probability mass is concentrated around zero, G(X) will become smaller and a sample vector of X will look sparser. Due to the fixed variance, the density will become more heavy-tailed at the same time. Different sparsity measures have been proposed for a variety ofPapplications: a quite common one is the quasi-norm kxkp = n ( i=1 |xi |p )1/p with 0 < p ≤ 1; see for example [19] and references therein. The obvious question is: how to choose p? If x is a sample from a memoryless source, choosing p = 1/n will yield the geometric mean as n → ∞, by equation (6). This is a strong argument in favor of the geometric mean as a sparsity measure for continuous random variables. In this respect, it is also interesting to observe that limp→0+ kxkpp is equal to the Hamming weight wH (x), which is the strictest sparsity measure in the sense that only values that are exactly zero contribute to sparsity. In the remainder of this section we will show that in combination with the variance, the geometric mean can be used to bound the source entropy and therefore characterize the asymptotic R(D) behavior of sparse sources. The latter fact follows from the asymptotic tightness of the Shannon lower bound (SLB) RSLB (D) = h(X) −

1 log(2πeD), 2

(7)

that is R(D) − RSLB (D) → 0 as D → 0, where h(X) = − E log f (X) is the differential entropy of the source; see e.g. [15, Sec. 4.3.4]. B. Lower Bounds on Differential Entropy The logarithm1 of the geometric mean, E log |X|, yields a lower bound on the entropy of continuous random variables with one- or two-sided monotone densities. In turn, this can be used to bound asymptotic R(D). We first prove a weaker bound that has the appeal of displaying the relationship with an analogous bound for discrete entropy. Then we will prove a bound which is tight for the class of monotone densities considered. 1 All

logarithms in this paper are natural.

Notice that in general the geometric mean p has to be normalized by the standard deviation, σ = Var(X), before it can be used as sparsity measure. However, in the following results this is not done, since the entropy would also have to be normalized (as in h(σ −1 X) = h(X) − log σ) and the two normalizations cancel each other. Proposition 4 Let X be a finite variance random variable with a monotone one-sided pdf f and range [x0 , ∞) or (−∞, x0 ]. Then h(X) ≥ E log |X − x0 |. Proof: Without loss of generality, consider a pdf f which is monotone decreasing on [x0 , ∞). The monotonicity implies that f is Riemann-integrable, and the finite variance ensures that the entropy integral is finite (by the Gaussian upper bound on entropy, h(X) ≤ 21 log(2πeσ 2 ), [6, Thm. 9.6.5]). WeR will approximate the integral h(X) − E log |X − x0 | = ∞ − x0 f (x) log(|x − x0 |f (x)) dx by a Riemann sum with step size ∆. Let xi = x0 + i · ∆ and pi = f (xi )∆, for i = 1, 2, . . .. By monotonicity, we have p1 ≥ p2 ≥ . . . and hence 1≥

∞ X i=1

pi ≥

n X

pi ≥ npn .

(8)

i=1

Thus we can write h(X) − E log |X − x0 | = lim − ∆→0

= lim − ∆→0

≥ lim − ∆→0

∞ X

pn log(|xn − x0 |f (xn ))

n=1 ∞ X

 pn  pn log n∆ · ∆ n=1 ∞ X

pn log(1) = 0,

n=1

where the inequality follows from taking the logarithm of (8). Remark: The inequality (8) was used by Wyner to prove an analogous bound for discrete entropy [21]. By using a different proof technique, we obtain a stronger result: Theorem 5 Let X be a finite variance random variable with a monotone one-sided pdf f and range [x0 , ∞) or (−∞, x0 ]. Then h(X) ≥ E log |X − x0 | + 1, (9) with equality iff f is a uniform density. Proof: For simplicity, we assume f to be decreasing on [0, ∞). Let B be the set of all such monotone decreasing, finite variance densities on [0, ∞). It is easy to verify that B is a convex set. Its boundary ∂B is the set of all finite variance uniform densities: ∂B = {u(a, x) : a ∈ (0, ∞)}, where ( 1/a if 0 ≤ x ≤ a, u(a, x) = 0 else.

(10)

WEIDMANN AND VETTERLI: RATE DISTORTION BEHAVIOR OF SPARSE SOURCES

Too see that (10) is indeed the boundary of B, observe first that no uniform density u(a, x) can be written as a nontrivial convex combination of two distinct monotone decreasing densities. Moreover, any f ∈ B can be written as a convex combination of elements of ∂B: Z ∞ λ(a)u(a, x) da, (11) f (x) = 0

7

The bound (15) is asymptotically attained by a “uniform spike” with finite variance σ 2 and density  b(b+a)+a2 −3σ 2  , |x| ≤ a, c = 2ab(b+a) 3σ 2 −a2 f (x) = d = 2b(b a < |x| ≤ b, 2 −a2 ) ,   0, else, as the tail width b → ∞.

0

where λ(a) = −af (a), as can be shown with some simple calculus. λ(a) is a proper distribution if f (x) has finite variance (in particular, limx→∞ xf (x) = 0) and if f 0 (x) ≤ 0, which is indeed the case for monotone decreasing f . Using the standard extensions to distributions, (11) also holds if f contains a countable number of steps, e.g. if it is piecewise constant. In fact, (11) is nothing but a disguised version 2 of R ∞the “layer cake” representation of f , namely f (x) = χ{f >t} (x) dt, where χ{f >t} (x) is the indicator function of 0 the level set {f (x) > t}. The existence of this representation follows from the monotonicity of f . Looking at (9), we see that Z ∞ h(X) − E log X = − f (x) log(xf (x)) dx (12)

Proof: We view the weakly unimodal pdf f as a mixture of two non-overlapping monotone one-sided densities, fl (x) and fr (x), with weights α and 1 − α, respectively. Without loss of generality we can assume x0 = 0. Then, h(X) − E log |X| = − Ef log[|X|f (X)] Z 0 αfl (x) log(−xαfl (x)) =− Z −∞ ∞ − (1 − α)fr (x) log(x(1 − α)fr (x)) 0

= hb (α) − α Efl log[|X|f (X)] − (1 − α) Efr log[|X|f (X)] ≥ hb (α) + 1,

0

is a concave-∩ functional of f , since h(X) is concave and E log X is linear in f . Therefore a minimum of (12) over the convex set B must necessarily lie on its boundary ∂B. We insert an arbitrary boundary element u(a, x) (0 < a < ∞) in (12) to obtain Z ∞ h(X) − E log X = − u(a, x) log(xu(a, x)) dx Z0 a x 1 =− a log a dx 0 a = log a − xa (log x − 1) 0 = 1.

(13)

Since (13) holds for any a, we conclude that it is the global minimum, thus proving (9) and one part of the “iff”. To prove the other part, it suffices to observe that h(X) − E log X is a strictly concave functional and thus will be larger than (13) in the interior B \ ∂B. Definition 4 A weakly unimodal density with mode x0 is a pdf which is monotone increasing (non-decreasing) on (−∞, x0 ] and monotone decreasing (non-increasing) on [x0 , ∞).

where the last inequality follows from Theorem 5. √ It is easily verified that “uniform spikes” exist for a < 3σ < b. The asymptotic entropy is lim h(X) = lim −2ac log c − 2(b − a)d log d = log(2a)

b→∞

b→∞

and the asymptotic logarithm of the geometric mean is " Z # Z b a lim E log |X| = lim 2c log x dx + 2d log x dx b→∞

b→∞

0

a

= lim 2ac(log a − 1) b→∞

+ 2d(a − a log a − b + b log b) = log a − 1. Hence the lower bound (15) is asymptotically attained by a random variable concentrating its probability uniformly over [−a, a] (since limb→∞ 2ac = 1), with an infinite tail contributing only to its variance. A peakier density with the same variance and entropy will have a smaller geometric mean. C. Upper Bound on Differential Entropy

Corollary 6 Let X be a finite variance random variable with weakly unimodal pdf f such that Pr{X ≤ x0 } = α, where x0 is the mode. Then h(X) ≥ E log |X − x0 | + 1 + hb (α).

(14)

For a density that is symmetric about x0 , f (−x−x0 ) = f (x− x0 ), (14) reduces to h(X) ≥ E log |X − x0 | + 1 + log 2.

(15)

2 The term “layer cake” representation stems from the picture of cutting the area between f (x) and the abscissa into thin horizontal stripes with widths corresponding to the level sets [22, Sec. 1.13].

If both the variance and the geometric mean are known, an upper bound on the entropy can be easily obtained via the maximum entropy approach. Owing to the assumptions made in this variational approach, the results in this subsection hold for random variables which have an absolutely continuous distribution function F (x) with probability density f (x) = F 0 (x). Theorem 7 The maximum entropy pdf given the constraints E X 2 = σ 2 and E log |X| = θ is   u/2 u−1 2 |x| exp − ux (16) f (x) = [Γ( u2 )]−1 2σu2 2σ 2 ,

8

SUBMITTED TO IEEE TRANSACTIONS ON INFORMATION THEORY

2

where Γ(z) is the gamma function R ∞(Euler’s integral of the second kind) defined as Γ(z) = 0 e−t tz−1 dt (Re z > 0) [23, 8.31]. The shape parameter u > 0 is obtained by solving −

1 2

!

u 2σ 2

log

= θ,

(17)

d dx

where ψ(x) = log Γ(x) [23, 8.36]. For any θ ≤ log σ there is a unique solution, since E log |X| is strictly monotone increasing in u. The resulting entropy is h(σ, θ) =

u 2



u−1 u 2 ψ( 2 )

+ log Γ( u2 ) −

1 2

log

u 2σ 2 .

(18)

Setting u = 1 yields the Gaussian density and thus the global entropy maximum given the variance constraint alone. The proof appears in Appendix D.

1

Differential entropy [nats]

E log |X| =

u 1 2 ψ( 2 )

1.5

0.5 0 −0.5 −1 −1.5 −2 −2.5 −3 −5

Corollary 8 The entropy of any random variable with probability density f satisfying E X 2 = σ 2 and E log |X| = θ is upper bounded by (18).

Upper bound Generalized Gaussian family Lower bound (symmetric unimodal pdf) Gaussian pdf Uniform pdf

−4.5

−4

−3.5

−3

−2.5

−2

θ = E log|X|

−1.5

−1

−0.5

0

Fig. 5. Differential entropy bounds for symmetric weakly unimodal densities (normalized to unit variance). The square denotes the uniform density for which the lower bound is tight.

The corollary is implied by the maximum entropy approach. Theorem 9 The maximum entropy (18) for a finite variance σ 2 has the following asymptotic behavior as the geometric mean eθ goes to zero, resp. θ → −∞: h(σ, θ) ' θ + log(−2eθ),

θ → −∞

Proof: Note that θ → −∞ corresponds to u → 0+. Let ∆ = h(σ, θ) − θ − log(−2eθ)  Γ( u2 ) u u u − 1 + log = − ψ u 2 2 2 −ψ( 2 ) + log

 u 2σ 2

. (19)

To prove limu→0+ ∆ = 0, which is slightly stronger than required, we use the functional relationships Γ(x + 1) = xΓ(x), ψ(x + 1) = ψ(x) + x1 and the truncated series expansions 2 Γ(x + 1) = 1 − γx + o(x2 ), ψ(x + 1) = −γ + π6 x + o(x2 ), both for |x| < 1 (see e.g. [23, 8.3]; γ = 0.5772 . . . is Euler’s constant). We have u ψ( u2 ) u→0+ 2

lim

= lim [−1 − γ u2 + u→0+

π2 2 24 u

+ o(u3 )] = −1,

hence limu→0+ ∆ is equal to the limit of the logarithm in (19). But lim

Γ( u2 ) + log

u→0+ −ψ( u ) 2

u 2σ 2

lim

u→0+ 2 u

+

= γ 2 u (1 − 2 u log 2σu2 + γ

+ o(u2 )) −

π2 12 u

+ o(u2 )

= 1.

This can be easily seen by extending the fraction by u2 and observing that limu→0+ u log u = 0. By putting these steps together we obtain limu→0+ ∆ = 0. Figure 5 shows the lower bound (15) and the upper bound (18) as a function of θ = E log |X| for unit-variance random variables with symmetric unimodal densities. The global maximum of the upper bound corresponds to the unit-variance Gaussian density, which has θ ≈ −0.635. As a consequence of Theorem 9, the gap between the lower and upper bounds is asymptotically equal to log(−θ). The crossing between upper and lower bounds is only a seeming contradiction, because

in fact it simply means that to the right of the crossing there exist no unimodal densities satisfying both the geometric mean and variance constraints. Also shown are the points (θ, h(X)) corresponding to the family of unit-variance generalized Gaus sian pdfs f (t) = β/(2αΓ(β −1 )) exp −(|t|/α)β with β as a parameter. It can be shown that for β → 0+ one has θ = E log |X| → −∞ and h(X) lies asymptotically halfway between upper and lower bound at distance 21 log(−θ). VI. D ISTORTION R ATE B OUNDS BASED ON M AGNITUDE C LASSIFYING Q UANTIZATION A. Two Upper Bounds This section briefly presents two upper bounds on MSE D(R) of continuous random variables which will be used as tools in the following sections, where the spike model will be generalized. Proofs and a detailed discussion can be found in the companion paper [13]. The bounds are obtained by classifying the magnitudes of the source samples using a threshold t and applying the Gaussian upper bound, D(R) ≤ σ 2 e−2R [6, Problem 13.8], to each of the two classes. They are upper bounds on the operational rate distortion function of magnitude classifying quantization (MCQ), which sends the classification as side information and uses it to switch between two codebooks. The samples with magnitude above threshold are called significant and are characterized by two incomplete moments, R −t R∞ the probability µ(t) = −∞ f (x) dx + t f (x) dx and the R −t R∞ second moment A(t) = −∞ x2 f (x) dx + t x2 f (x) dx, where A(0) = σ 2 is the source variance (we assume E X = 0 without loss of generality). From these we compute the conditional second moment of the significant samples, σ12 (t) = E[X 2 | |X| ≥ t] =

A(t) , µ(t)

as well as that of the insignificant samples, σ02 (t) = E[X 2 | |X| < t] =

σ 2 − A(t) . 1 − µ(t)

WEIDMANN AND VETTERLI: RATE DISTORTION BEHAVIOR OF SPARSE SOURCES

9

0

The solution of a simple rate allocation problem leads to the following.

Gaussian

−5

Theorem 10 [13], [24] (High-Rate Upper Bound) For all σ 2 (t) R ≥ Rmin (t) = hb (µ(t)) + 12 µ(t) log σ12 (t) , the distortion 0 rate function of a memoryless source is upper bounded by

MSE distortion [dB]

D(R) ≤ Bhr (t, R) = c(t)σ 2 e−2R ,

(20)

where  c(t) = exp 2hb (µ(t)) + (1 − µ(t)) log

σ02 (t) σ2

+ µ(t) log

σ12 (t) σ2



The best asymptotic upper bound for R → ∞ is obtained by numerically searching the t∗ ≥ 0 that minimizes c(t). Since limt→0+ c(t) = 1, the Gaussian upper bound is always a member of this family. A low-rate bound is obtained by upper bounding only the significant samples, while the other samples are quantized to zero, thus yielding a distortion floor. Theorem 11 [13], [24] (Low-Rate Upper Bound) The distortion rate function of a memoryless source is upper bounded by D(R) ≤ B(t, R),

∀ t ≥ 0,

(21)

where   b (µ(t)) B(t, R) = A(t) exp −2 R−hµ(t) + σ 2 − A(t). In the neighborhood of a fixed threshold t the tightest such bound is D(R∗ (t)) ≤ B(t, R∗ (t)),

∀ t ≥ 0,

(22)



with the rate R (t) given by h R∗ (t) = hb (µ(t)) − 12 µ(t) 2h0b (µ(t)) + γ(t)  i 0 + W−1 −γ(t)e−2hb (µ(t))−γ(t) , where γ is the reciprocal normalized second tail moment 2 µ(t) 2 t = E[X 2t| X≥t] and W−1 is the second real γ(t) = A(t) branch of Lambert’s W function, taking values on [−1, −∞). (W(x) solves W(x)eW(x) = x.) We may use (22) to trace an upper bound on D(R) by sweeping the threshold t = 0 . . . t∗ . The low-rate and highrate bounds coincide in the minimum of the latter (achieved at t∗ ), there is a smooth transition between the two bounds. For proofs, see [13]. We remark that results by Sakrison [25] and Gish-Pierce [26] imply that the operational distortion rate function δ(R) of a magnitude classifier followed by a Gaussian scalar quantizer (adapted to the class variance) will be at most a factor of πe/6 (1.53 dB) above these bounds. Actually, this gap is even smaller at low rates, since the distortion D0 (0) = σ02 is trivially achieved for the insignificant samples. The high-rate bound (20) does not apply to the spike source (4), since its distribution is not continuous. However, the low-rate bound (22) holds for any threshold t > 0, as the significant samples then have a continuous density. In the limit of arbitrarily small positive t, such that µ → p, (21) becomes the simple spike upper bound (5).

x

−10

=0.25

0.5

0.20 0.15

−15 0.10

−20

.

0.05 γ=0.6

0.7

0.8

0.9

1.0

1.1

1.2

1.3

1.4

1.5

−25 0.025

−30 0

0.2

0.4

0.6

0.8 Rate [bit]

1

1.2

1.4

1.6

Fig. 6. Beginning of high-rate region (approximate location where the knee in the D(R) curve ends) for the power-law wavelet coefficient model (23). Also shown are the bounds (20) and (22) for γ = 0.9, x0.5 = 0.05 and the Gaussian D(R) (all with unit variance).

B. Bounding the high-rate region in wavelet image compression As an example, we apply the above bounds to a powerlaw model for wavelet coefficients studied e.g. in [5], [3, ∗ Sec. 11.4]. The rate Rmin = Rmin (t∗ ) in the optimized high-rate bound (20) provides an estimate of the beginning of the high-rate region, in which distortion decays with −6 dB/bit. Together with the corresponding distortion bound ∗ ∗ = Bhr (t∗ , Rmin ) it localizes the end of the typical knee Bhr between the low-rate region with fast distortion decay and the high-rate region. The power-law model is based on the observation that when wavelet coefficients are ranked in order of decreasing magnitude, they decay approximately as x(z) ≈ Cz −γ up to about z = 0.5, where z is the normalized rank (the exponent γ is on the order of one for typical images). The rank z equals µ(t) in our setting; in fact, the threshold t can be eliminated altogether by substituting it with µ in the Rµ integral defining A, yielding A(µ) = 0 x2 (z) dz. Since z −γ is generally not square integrable, we change the model to x(z) = C(µ0 + z)−γ . Finally, the coefficient decay above z = 0.5 is almost linear (i.e. the magnitudes of the 50% smallest coefficients are almost uniformly distributed). This results in the following composite model: R µ 2 −2γ  dz, 0 ≤ µ ≤ 0.5, R0 C (µ0 + z) 0.5 2 −2γ A(µ) = C (µ0 + z) dz 0  Rµ  + 0.5 4x20.5 (1 − z)2 dz, 0.5 < µ ≤ 1. (23) The median magnitude x0.5 and the exponent γ can be estimated from a sample, the constant C is C = x0.5 (µ0 + 0.5)γ , while µ0 can be determined numerically from the condition A(µ = 1) = σ 2 . ∗ ∗ Figure 6 displays a grid of points (Rmin , Bhr ) obtained with the model (23) for a range of parameter values. Interestingly, the parameters have nearly orthogonal influences over ∗ a wide range: the exponent γ affects mainly Rmin , while the

10

SUBMITTED TO IEEE TRANSACTIONS ON INFORMATION THEORY

∗ median magnitude x0.5 affects Bhr . In terms of the source pdf, a small x0.5 implies that most of the source energy is in the pdf tail; in turn, γ controls the tail decay, which will be slower for smaller γ (“heavy tail”). Points that lie on a line with slope −6 dB/bit correspond to asymptotically equal upper bounds, i.e. to sources that can be compressed equally well at high rates. The median x0.5 can be seen as an indicator of sparsity that has a strong influence on asymptotic compressibility. The exponent γ controls how fast the asymptotic regime is reached; to have high compression at low rates, both x0.5 and γ need to be small, i.e. the source pdf must be peaked at zero and heavy-tailed at the same time. Also shown in Figure 6 are the bounds (20) and (22) for γ = 0.9, x0.5 = 0.05, which are the approximate parameters of the wavelet coefficients used to draw Figure 1. The estimated ∗ ∗ start of the high-rate region is Rmin = 0.72 bits, Bhr = −21.5 dB, matching quite well with Figure 1. Due to the roundness of the knee it is hard to visually estimate where the asymptotic decay of −6 dB/bit begins. Gaussian mixture models (treated below) may show much sharper knees. The power-law model (23) provides a valuable empirical tool for analyzing wavelet coefficients or other approximately scale-invariant data. However, it lacks the generality and versatility, as well as the theoretical apparatus, of the Gaussian mixture models that will be studied next. In particular, not every sparsifying transform will necessarily produce the approximately scale-invariant coefficients implied by the powerlaw model.

VII. G AUSSIAN M IXTURES AND C ODING G AIN The discussion on spikes in Section IV pointed out that continuous densities are more appropriate for modeling sparse transform coefficients. Gaussian mixtures are a popular approach to model and estimate unknown densities and have been used quite successfully in various applications, see e.g. [27] and references therein. In this section we will study a simple memoryless Gaussian mixture (GM) source model with pdf N X ws fs (x), (24) f (x) = s=1

mixing N zero-mean Gaussian components with variances σm,s , 2 2 1 e−x /2σm,s , fs (x) = √ 2πσm,s PN with weights ws ≥ 0 satisfying s=1 ws = 1. The spike model of Section IV may be regarded as a special case of a two-component GM, where one source has variance zero. For a general GM source X (with possibly nonzero component means), the Shannon lower bound (7) is tight for all 2 D < D∗ = mins {σm,s }, since then X may be expressed via a “backward test channel” as the sum of a GM with variances 2 {σm,s − D} and independent noise Z ∼ N (0, D) [28]; D∗ is also known as critical distortion. Thus the asymptotic D(R) behavior is determined by the GM entropy, which in general cannot be expressed in closed form. This motivates the bounds presented in the first two subsections, which are followed by

a discussion of the relationship with coding gain and some examples. A. Distortion rate bounds for Gaussian Mixtures The upper bounds introduced in Section VI are easily computed for GM models, but they do not exploit the particular model structure. A GM source may be viewed as containing a hidden discrete memoryless source S that switches between 2 |S| = N Gaussian sources N (0, σm,s ) with selection probabilities ws = P r{S = s}. A lower bound on D(R) is found by assuming that an oracle provides the hidden variable S to ˆ form a Markov chain, the source encoder. Since S → X → X we have ˆ ˆ I(X; X|S) ≤ I(X; X). Computing the lower bound Rlb (D) = ˆ with QD = {p(ˆ x|x, s) : minp(ˆx|x,s)∈QD I(X; X|S), ˆ 2 ≤ D}, is equivalent to solving the following E(X − X) standard rate allocation problem: X 2 Dlb (R) = min ws σm,s 2−2Rs (25) {Rs }

subject to X

ws Rs = R and Rs ≥ 0.

This yields the lower bound D(R) ≥ Dlb (R), which can also be seen as a special case of a conditional rate distortion function [29]. The lower bound may be turned into an upper ˆ as follows: bound by expanding I(S, X; X) ˆ = I(X; X) ˆ + I(S; X|X) ˆ ˆ I(S, X; X) = I(X; X) ˆ + I(X; X|S) ˆ ˆ = I(S; X) ≤ I(X; X|S) + H(S), using the fact that the mixing variable S is discrete. Thus we have ˆ ˆ minQD I(X; X|S) ≤ R(D) ≤ minQD I(X; X|S) + H(S). (26) Clearly, these bounds are not very tight in the case of a GM with large N = |S| and close to uniform distribution of S. Using Fano’s inequality we may see that if S can be estimated ˆ with low probability of error, then R(D) will be close from X ˆ to the upper bound. Conversely, for large H(S|X) ≤ H(S|X) it will be harder to estimate S and R(D) will be closer to the lower bound. As an example, we used the EM algorithm to estimate the parameters of a two-component GM (24) modeling the wavelet coefficients of the Lena image (transformed with the classic 9/7 biorthogonal wavelet). The parameters obtained 2 2 are w1 = 0.9141, σm,1 = 0.01207 and σm,2 = 11.51 (normalized to unit variance). Plots of the bounds (22), (20) and (26) appear in Figure 7 together with a numerical estimate of D(R) computed with the Blahut-Arimoto algorithm (the gap in (26) is H(S) = hb (w1 ) = 0.42 bits wide). Also shown in Figure 7 are the (R, D) points achieved by a simple embedded (successive refinement) scalar quantizer (see e.g. [3, Sec. 11.5]), applied to 3 · 105 pseudo-random samples. The significance maps were entropy coded, sign and refinements

WEIDMANN AND VETTERLI: RATE DISTORTION BEHAVIOR OF SPARSE SOURCES

0

−10 MSE Distortion [dB]

B. Entropy bounds for Gaussian mixtures

Gaussian upper bound GM upper/lower bounds Upper bounds (20, 22) Blahut−Arimoto alg. Embedded scalar quantizer

−5

11

The sparsity of a GM source may be measured by the geometric mean, as proposed in Section V-A, leading to entropy bounds that characterize the asymptotic R(D) behavior. The logarithm of the geometric mean of a GM with density (24) is

−15 −20 −25

θ = E log|X| =

N X

2 ws log σm,s −

1 2

log 2 − 12 γ,

(27)

s=1

−30 −35 −40 0

1 2

0.5

1

1.5

2 Rate [bits]

2.5

3

3.5

4

Fig. 7. Distortion rate bounds for two-component Gaussian mixture model of wavelet (detail) coefficients.

bits were left uncoded. It can be seen that at low rates, thresholding with simple scalar quantization performs very close to the D(R) optimum. Up to the typical knee, distortion decays faster than −6 db/bit, since mainly the sparse coefficients from the highvariance source are retained by the thresholding operation. At higher rates, the coefficients from the low-variance source also start being significant. If the model (24) is extended to N ≥ 3 Gaussian components, the knee in D(R) becomes rounder, but the basic behavior is unchanged. From these observations we can reach two conclusions: first, two-component GMs suffice to capture the essential features of image coding D(R), and second, the rate Rmin (t∗ ) in the high-rate bound (Theorem 10) is confirmed as estimate of the beginning of the high-rate compression region (see Section VI-B). The first observation is also supported by [30], which considers the joint numerical optimization of a classifier and (high-rate) uniform quantizers for each of N classes corresponding to GM components. Simulation results in [30] suggest that for typical image data N = 2 components yield a substantial improvement over a single Gaussian, while adding more components gives only minor additional gains. A general theoretical framework for classified vector quantization (CVQ) of Gaussian mixtures has been introduced by Gray in [31]. Ideally, CVQ would be applied to samples from a multivariate Gaussian mixture having distinct modes, leading to reliable classification. The high-rate bound (20) could be seen as a special case of CVQ, where two mixture components differ only in variance. However, the CVQ approach is different in spirit, since CVQ will usually be applied directly to the data, without first transforming it, while our work focuses on a transform coding setting where a (often blind) transform generates a sparse signal representation, which is then quantized. The transform coefficients are thought as coming from a sparse memoryless source, which in the above example is shown to be modeled quite well by a mixture of two univariate Gaussian densities.

where γ = 0.5772 . . . is Euler’s constant. The result follows directly from integral 4.333 in [23] and leads to a lower bound on the mixture entropy h(X) via Corollary 6. However, this can be tightened by the same approach as in Section VII-A, namely by lower bounding the GM entropy by conditioning on the hidden variable selecting the mixture components: X 2 h(X) ≥ h(X|S) = 21 ws log(2πeσm,s ). (28) This improves the lower bound (15) by the constant 12 γ + 1 π 2 log e , since also (28) is a linear function of θ. From the expansion I(X; S) = h(X) − h(X|S) = H(S) − H(S|X), we obtain the upper bounds h(X) ≤ h(X|S) + H(S) ≤ h(X|S) + log N,

(29)

with h(X|S) given in (28). Figure 8 plots the different bounds that hold for mixtures of zero-mean Gaussians in general and two-component GM in particular. Also shown is a set of points (E log |X|, h(X)) corresponding to different two-component GMs. The geometric mean is mainly affected by the ratio σm,2 /σm,1 , while the mixing weights determine H(S) and thus the gap between the lower bound (28) and the tighter upper bound in (29). For large σm,2 /σm,1 , it is easy to estimate S from X and so h(X) will be close to the upper bound. The lower bound can be asymptotically attained with w1  w2 (then H(S) → 0) for any σm,2 /σm,1 ≥ 1; this parallels the “uniform spike” attaining the lower bound in Corollary 6. Mixtures of a finite number of zero-mean Gaussian components may be considered as a special case of continuous Gaussian scale mixtures [32], which have also been proposed in the context of wavelet coefficient models [1, Sec. VIII.A]. It turns out that the maximum entropy pdf (16) can be expressed as a Gaussian scale mixture. Proposition 12 The maximum entropy pdf (16) that satisfies the constraints E X 2 = σ 2 and E log |X| = θ can be expressed as a continuous Gaussian mixture Z σ2 /u x2 1 √ f (x) = e− 2s2 g(s2 ) ds2 (30) 2πs2 0 with mixing density √ −(1+u)/2  πs2 ( u2 )u/2 1 u 2 g(s ) = 4 u σ 1−u − , σ2 s Γ( 2 )Γ( 2 ) s2

(31)

where the shape parameter 0 < u < 1 is obtained by solving (17).

12

SUBMITTED TO IEEE TRANSACTIONS ON INFORMATION THEORY

2 w1w2 −0.5

ˆ 2 = kY − Yˆ k2 . kX − Xk

−1

Also, the average variance of the transform coefficients Yi is equal to the variance of X:

−1.5 σm,1 1 and thus a better definition of coding gain loss is the ratio of the high-rate upper bound to the Shannon lower bound: ∗

∆CG(SLB) =





ΓSLB e2hb (µ ) [σ0∗ ]2(1−µ ) [σ1∗ ]2µ = . exp (2h(X) − log(2πe)) ΓMCQ

(36)

The GM differential entropy h(X) has to be computed with numerical integration methods. Figure 10 plots the coding gain ΓSLB and the coding gain loss ∆CG(SLB) for that case. The loss is remarkably low over a wide range of parameter values, which shows that the magnitude classification quantization approach is very effective for such sources. In this example, the optimal MCQ threshold t∗ was always larger than thepthreshold for the maximum likelihood classification, tML = log λ2 /(1 − λ−2 )σm,1 . This is quite expected, since

the goal of the classification is a tight distortion bound, not the optimal distinction of the two component sources. 2) Mixture versus Vector Coding Gain: The above example considered two-component Gaussian mixtures as models for sparse sources and compared different measures of coding gain. Here we will extend that model to N Gaussian components and exploit the simple relationship between their variances and the geometric mean of their mixture. This can be used to bound the coding gain of a Gaussian mixture as a function of the coding gain for the unmixed sources. Two results from Section VII-B will be useful. The logarithm of the geometric mean (lgm) of N zero-mean Gaussians, 2 log G(σ12 , σ22 , . . . , σN )=

N 1 X log σi2 , N i=1

(37)

WEIDMANN AND VETTERLI: RATE DISTORTION BEHAVIOR OF SPARSE SOURCES

differs only by a constant from the lgm of their mixture (27) with uniform weights w_i = 1/N (chosen for simplicity; the following results extend to general w_i). Letting σ² = (1/N) Σ_{i=1}^N σ_i², the vector coding gain of N-dimensional Gaussian transform coding (32) relates to the lgm (37) as

\frac{1}{N} \sum_{i=1}^{N} \log \sigma_i^2 = \log(\sigma^2 / \Gamma_{TC}).    (38)

[Fig. 11. Bounds for Gaussian mixture vs. vector coding gain: mixture coding gain Γ_SLB [dB] versus vector coding gain Γ_TC [dB], upper and lower bound.]

Now (37) can also be used to lower bound the mixture entropy h(X) through (27) and (28), which leads to an upper bound on the mixture coding gain (34). Combining this with (38) yields ΓSLB ≤ ΓTC , which simply means that mixing does not necessarily inflict a performance penalty. More interestingly, the same approach can be used with the upper bound on h(X) in Corollary 8, which then yields a lower bound on ΓSLB as a function of ΓTC . If N is known, the second bound in (29) leads to a tighter lower bound, but only for large ΓTC (the gap to the upper bound will be 20 log10 N [dB]). Figure 11 displays the upper and lower bounds for mixture vs. vector coding gain. What exactly are we comparing? On the one hand, we have the classical vector coding gain for N independent Gaussian sources. On the other hand, the coding gain for a mixture source that outputs one of these N sources uniformly at random. As an example, consider a transform that outputs N independent zero-mean Gaussian components. If we know the variance of each component, like e.g. in the KLT case, we can achieve the vector (transform) coding gain. If however only the distribution of the variances is known, then we can design a codebook for the corresponding scalar mixture source and still achieve the mixture coding gain. This is the case of transforms with “known eigenvalue distribution”, but “unknown positions”. Intuitively, wavelet transforms lie between these two extremes, since e.g. coefficient variances are correlated across scales (but this also violates the independence assumption in the definition of coding gain). In summary, the lower curve in Figure 11 bounds the maximum performance loss of a “naive” one-dimensional system compared to one with perfect side information.
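To make the comparison concrete, here is a minimal sketch (assumed values, not from the paper) that evaluates the vector coding gain behind (37)-(38), i.e. the ratio of the arithmetic to the geometric mean of the component variances, which by the bound above also caps the mixture coding gain Γ_SLB.

```python
# Minimal sketch: vector (transform) coding gain for N independent zero-mean
# Gaussian components, Gamma_TC = (arithmetic mean) / (geometric mean) of the
# variances, which is the relation behind (37)-(38).  Values are illustrative.
import numpy as np

def vector_coding_gain_db(variances):
    """10*log10 of arithmetic-over-geometric mean of the component variances."""
    v = np.asarray(variances, dtype=float)
    arith = v.mean()
    lgm = np.log(v).mean()            # logarithm of the geometric mean, cf. (37)
    return 10.0 * np.log10(arith / np.exp(lgm))

# Example: a sparse variance profile with a few large and many small components
variances = np.concatenate([np.full(4, 10.0), np.full(60, 0.01)])
gamma_tc = vector_coding_gain_db(variances)
print(f"Gamma_TC  = {gamma_tc:.2f} dB")
print(f"Gamma_SLB <= {gamma_tc:.2f} dB   # mixing cannot improve on the vector gain")
```

VIII. DISTRIBUTED CODING AND COMPRESSED SENSING

The aim of this section is to give a brief overview of the rate distortion behavior of sparse sources in distributed coding (Wyner-Ziv) settings, as well as the relationship to compressed sensing.

A. Sparse sources in Wyner-Ziv settings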

The Wyner-Ziv (WZ) problem [35], [36], [6, Sec. 14.9] considers pairs (X, Y) of dependent random variables and asks for the minimal rate R^{WZ}_{X|Y}(D) required to describe the source X within average distortion D when side information Y is available only at the decoder. We limit our discussion to absolutely continuous X, Y and quadratic distortion measure d(x̂, x) = (x̂ − x)². The WZ rate distortion function R^{WZ}_{X|Y}(D) is lower-bounded by the corresponding conditional rate distortion function R_{X|Y}(D), which is in turn lower-bounded by the conditional Shannon lower bound [29],

R^{WZ}_{X|Y}(D) \ge R_{X|Y}(D) = \inf_{\{U \in \mathbb{R}:\, E(U-X)^2 \le D\}} I(X; U \mid Y) \ge h(X|Y) - \frac{1}{2}\log 2\pi e D.    (39)

These lower bounds are asymptotically tight for the quadratic distortion measure [37]; in particular, the rate loss R^{WZ}_{X|Y}(D) − R_{X|Y}(D) is zero for all D when (X, Y) are jointly Gaussian [36] and, more generally, when X = Y + Z with Y independent from Z, where only Z needs to be Gaussian [38]. We further specialize to models where X and Y have zero mean and their difference can be modeled by an independent memoryless source, yielding the following two correlation models that are common in the WZ literature.

1) X = Y + Z, where Y, Z are independent memoryless sources: The case of Y sparse and Z ∼ N(0, σ_Z²) is trivial, since [38] implies R^{WZ}_{X|Y}(D) = R_{X|Y}(D) = \frac{1}{2}\log^+\!\frac{\sigma_Z^2}{D},

where log⁺ x = max{log x, 0}. More interesting is the case of sparse Z, since by [37] one has

R^{WZ}_{X|Y}(D) \;\doteq\; h(X|Y) - \frac{1}{2}\log 2\pi e D \;=\; h(Z) - \frac{1}{2}\log 2\pi e D,    (40)

where ≐ denotes asymptotic equality for D → 0 and the right-hand side is the SLB for R_Z(D), the MSE rate distortion function for Z (this can also be shown directly using a Gaussian forward test channel, see Fig. 4 in [36]). Thus all the presented techniques for bounding the asymptotic behavior of R_Z(D) apply in this WZ case as well.

2) Y = X + Z, where X, Z are independent memoryless sources: A Gaussian upper bound is obtained by bounding the rate R_g(D) achieved with a Gaussian forward test channel:

R^{WZ}_{X|Y}(D) \le R_g(D) \le \frac{1}{2}\log^+\!\left(\frac{\sigma_{X|Y}^2}{D}\right),    (41)

where the conditional variance \sigma_{X|Y}^2 = \frac{\sigma_X^2 \sigma_Z^2}{\sigma_X^2 + \sigma_Z^2}, with equality on both sides iff X, Z are jointly Gaussian.
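For a quick numeric feel, a minimal sketch of the Gaussian upper bound (41); the values of σ_X², σ_Z² and D are assumed for illustration.

```python
# Minimal sketch of (41): R_g(D) <= 0.5 * log+( sigma_{X|Y}^2 / D ), in nats.
# sigma_x2, sigma_z2 and D below are assumed example values.
import math

def wz_gaussian_upper_bound(sigma_x2, sigma_z2, D):
    """Upper bound (nats) on the WZ rate for Y = X + Z via a Gaussian test channel."""
    sigma_xy2 = sigma_x2 * sigma_z2 / (sigma_x2 + sigma_z2)  # conditional variance
    return max(0.0, 0.5 * math.log(sigma_xy2 / D))           # log+ = max{log, 0}

print(wz_gaussian_upper_bound(sigma_x2=1.0, sigma_z2=0.1, D=0.01))
```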


By inserting h(X|Y) = h(X) − h(X + Z) + h(Z) in (39) and observing that h(X + Z) ≥ max{h(X), h(Z)}, one sees that R^{WZ}_{X|Y}(D) is asymptotically upper bounded by min{R_X(D), R_Z(D)}. If |h(X) − h(Z)| < ½ log 2, lower bounding h(X + Z) with the entropy power inequality [6, Sec. 16.7] yields the slightly tighter asymptotical upper bound max{R_X(D), R_Z(D)} − ½ log 2. Upper bounds from the previous sections may be applied; however, if tighter bounds are desired, all three entropies h(X), h(Z), h(X + Z) need to be bounded individually, which may require sharper tools than those presented.

Better bounds exist if Z is a Gaussian mixture of the form (24), with mixture components Z_s ∼ N(0, σ_s²) and weights w_s. By assuming that both the encoder and the decoder have access to the hidden mixing random variable S, one obtains the lower bound

\inf I(X; U \mid Y) \ge \inf \sum_s w_s\, I(X; U \mid X + Z_s),

which can be evaluated using rate allocation like in Section VII-A (a version of this bound for binary S first appeared in [39]). Asymptotically for D → 0, this simplifies to

R^{WZ}_{X|Y}(D) \;\dot\ge\; \sum_s w_s \left[ h(X \mid X + Z_s) - h(X - U) \right] \ge \sum_s w_s\, h(X \mid X + Z_s) - \frac{1}{2}\log 2\pi e D,

where h(X | X + Z_s) = h(X) + h(Z_s) − h(X + Z_s). For Gaussian X ∼ N(0, σ_X²), this further reduces to

R^{WZ}_{X|Y}(D) \;\dot\ge\; \frac{1}{2} \sum_s w_s \log \frac{\sigma_{X|X+Z_s}^2}{D},    (42)

where the \sigma_{X|X+Z_s}^2 = \frac{\sigma_X^2 \sigma_s^2}{\sigma_X^2 + \sigma_s^2} are the conditional variances given a single mixture component Z_s. A slightly tighter bound may be obtained by assuming that only the decoder has access to S, using techniques from [40], which are however unlikely to yield analytic expressions even in the Gaussian case. An asymptotic upper bound for D → 0 is obtained from I(X; S|Y) = h(X|Y) − h(X|Y, S) = H(S|Y) − H(S|Y, X) ≤ H(S) (F. Bassi, personal communication) as

R^{WZ}_{X|Y}(D) \;\doteq\; h(X|Y) - \frac{1}{2}\log 2\pi e D \le \sum_s w_s\, h(X \mid X + Z_s) - \frac{1}{2}\log 2\pi e D + H(S),

which for Gaussian X ∼ N(0, σ_X²) becomes

R^{WZ}_{X|Y}(D) \;\dot\le\; \frac{1}{2} \sum_s w_s \log \frac{\sigma_{X|X+Z_s}^2}{D} + H(S).    (43)

Whether this is sharper than the Gaussian upper bound (41) needs to be checked on a case-by-case basis. If X, Z are jointly Gaussian (and thus S constant), the asymptotic bounds (42) and (43) coincide and are equal to R^{WZ}_{X|Y}(D) for all D ≤ D_max = σ²_{X|X+Z}. Clearly, (42) and (43) mirror (28) and (29) in the standard R(D) case. The same comments on tightness made in Section VII-A after (26) apply by substituting (U, Y) for X̂.
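As a numerical illustration of (42) and (43), the following sketch evaluates the two asymptotic bounds for a Gaussian X and a Gaussian-mixture Z; the weights, variances, and target distortion are assumed values.

```python
# Minimal sketch: asymptotic Wyner-Ziv bounds (42) and (43) for Gaussian X and a
# zero-mean Gaussian-mixture Z with hidden state S (weights w_s, variances sigma_s^2).
# The lower bound is (42); adding H(S) gives the upper bound (43).  Values are
# illustrative assumptions, valid only for small D.
import numpy as np

def wz_asymptotic_bounds(sigma_x2, weights, sigma_s2, D):
    """Return (lower, upper) asymptotic bounds in nats per sample."""
    w = np.asarray(weights, dtype=float)
    s2 = np.asarray(sigma_s2, dtype=float)
    cond_var = sigma_x2 * s2 / (sigma_x2 + s2)      # sigma^2_{X | X + Z_s}
    lower = 0.5 * np.sum(w * np.log(cond_var / D))  # (42)
    H_S = -np.sum(w * np.log(w))                    # entropy of the mixing variable
    return lower, lower + H_S                       # (42), (43)

lo, hi = wz_asymptotic_bounds(sigma_x2=1.0, weights=[0.95, 0.05],
                              sigma_s2=[1e-4, 1.0], D=1e-5)
print(f"(42) lower bound: {lo:.3f} nats, (43) upper bound: {hi:.3f} nats")
```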

B. Connections with Compressed Sensing

We briefly outline how lossy coding of sparse sources is related to compressed sensing (CS) [10], [11] (see also the special issue [41]). A typical example of a CS problem is the compressive representation of a signal vector x ∈ R^N of the form x = Ψs, where Ψ is an orthonormal N-by-N matrix and s ∈ R^N has at most K non-zero components (we say that x is strictly K-sparse with respect to Ψ). The problem is then to determine a sampling/compression mechanism for x without using the sparsifying basis Ψ at the encoder (e.g. for complexity reasons). CS typically involves sampling x using a fat M-by-N random measurement matrix Φ that has low coherence with Ψ. The key question concerns the number M of real-valued samples (the height of Φ) needed for the exact (lossless) reconstruction of all K-sparse signal vectors s with high probability. Compression is thus achieved in the sense of needing a number of samples M that may be much smaller than N. A distributed CS (DCS) problem might consider a signal which is known to have a sparse difference with respect to a reference signal y (side information) available only at the decoder. This can be extended to multiple correlated signals, which may be composed of sparse and non-sparse components, and have to be sampled and encoded independently [42].

In practice, the samples Φx must be quantized (say with R bits each) if they are to be sent to a remote decoder. This implies some loss in the reconstruction of s (if it succeeds at all), which will depend on the total rate MR. It is thus interesting to study an information-theoretic view of the CS problem for benchmarking purposes, by considering the rate needed for approximate (lossy) reconstruction of almost all s, i.e. the rate distortion behavior for an appropriate random model of s. To simplify the analysis, one may consider asymptotically long sequences from a sparse memoryless source, e.g., a BG spike with p = K/N for modeling strictly sparse signals, or a Gaussian mixture model for sparse spikes with background noise, under the MSE distortion measure.

In the simple case Ψ = I_N, when the signal x = s is sparse in the standard basis, the results in this paper may be directly used to benchmark the operational rate distortion behavior of practical CS systems. This also extends to distributed scenarios like the model JSM-3 in [42], which can be related to the Wyner-Ziv setting X = Y + Z mentioned above. The situation is more delicate when Ψ ≠ I_N. Then the results only apply if the quantization operation Q commutes with Ψ, which is the case if the codebook defining Q is spherically symmetric.⁶ However, the only codebook distribution that has i.i.d. marginals and is spherically symmetric for any dimension N is the Gaussian N(0, σ²I_N) [43]. Thus only the trivial Gaussian upper bound D(R) ≤ σ²2^{−2R} is guaranteed to hold, and of course also lower bounds obtained assuming Ψ = I_N remain valid. The work [12] is among the first to give sharper bounds on the D(R) behavior of quantized CS of strictly sparse sources, but it does not provide a purely information-theoretic analysis framework. More recently, [44] suggested

⁶More precisely, if C ⊂ R^N is a random N-dimensional codebook (the set of reconstruction points) defining Q, then the rotated codebook C′ = {Ψ⁻¹x : x ∈ C} should yield the same average distortion.


such a framework by considering Ψ as side information available only at the decoder. CS with incoherent measurements may be viewed as a doubly non-adaptive coding scheme that is oblivious of both the sparsifying basis Ψ and the location of nonzero samples. When Ψ = I_N, this becomes a simply non-adaptive scheme that may be implemented with a lossy block code. In [45], such a code has been constructed by combining a q-ary nested uniform scalar quantizer with a q-ary syndrome source code. The scheme works in a Wyner-Ziv setting if the nonzero values are bounded, while in the standard case without side information, one may introduce a compander to gain a little extra performance, which for a Bernoulli-Gaussian spike (Sec. IV) asymptotically becomes D(R_v) ≃ (p/12)·6π√3·σ_v²·2^{−2R_v}, where p is the probability of a spike, σ_v its variance and R_v = log₂ q the quantizer rate. The total rate is R(R_v) ≃ h_b(p) + pR_v, which is the same rate that a “nonlinear” adaptive code would need, see (5). It is interesting to remark that in a CS setting with sparsity p = K/N, this corresponds to needing approximately K “measurements” with R_v bits each, while CS methods would in general require 2K real-valued measurements in order to correctly recover the sparsity pattern.
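For orientation, a minimal sketch of this rate-distortion tradeoff for a Bernoulli-Gaussian spike, using the asymptotic expressions above; the spike probability, variance, and quantizer rates are assumptions chosen for illustration.

```python
# Minimal sketch: asymptotic operational tradeoff of the quantized scheme above for
# a Bernoulli-Gaussian spike, D(Rv) ~ (p/12)*6*pi*sqrt(3)*sigma_v^2*2^(-2*Rv)
# at total rate R(Rv) ~ h_b(p) + p*Rv.  Parameter values are illustrative.
import math

def hb(p):
    """Binary entropy in bits."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def spike_rd_point(p, sigma_v2, Rv):
    """Return (total rate in bits/sample, MSE distortion) for quantizer rate Rv."""
    D = (p / 12.0) * 6.0 * math.pi * math.sqrt(3.0) * sigma_v2 * 2.0 ** (-2 * Rv)
    R = hb(p) + p * Rv
    return R, D

for Rv in (2, 4, 6, 8):
    R, D = spike_rd_point(p=0.05, sigma_v2=1.0, Rv=Rv)
    print(f"Rv={Rv}:  R={R:.3f} bit/sample,  D={10*math.log10(D):.1f} dB")
```

IX. CONCLUSIONS

Sparsity is the key to nonlinear approximation and compressed sensing. Work in these areas is generally more concerned with the number of real-valued samples required for achieving a certain approximation error, respectively exact reconstruction, rather than with the rate distortion tradeoff that is implicit when samples are quantized. This paper studied the rate distortion behavior of sparse memoryless sources modeling that situation. We proposed the geometric mean as a sparsity measure and used it to bound asymptotic R(D) via the entropy, and to compare different types of transform coding gain, thus directly connecting the notions of sparsity and compressibility. These results apply to the MSE distortion criterion, while for Hamming distortion we showed that R(D) can be computed exactly in some cases and that it becomes almost linear for very sparse sources.

APPENDIX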


A. Rate distortion function of a discrete memoryless source (DMS)

Definition 8 (Rate distortion function of a DMS) Let X ∼ P be a discrete random variable, ρ(x, x̂) a single-letter distortion measure, Q_{X̂|X}(k|j) a conditional distribution (defining a random codebook), and P_{X,X̂}(j, k) = P(j)Q(k|j) the corresponding joint distribution. The average distortion associated with Q(k|j) is

d(Q) = \sum_{j,k} P(j)\, Q(k|j)\, \rho(j,k).    (44)

If a conditional probability assignment satisfies d(Q) ≤ D it is called D-admissible. The set of all D-admissible Q is Q_D = {Q(k|j) : d(Q) ≤ D}. The average mutual information (“description rate”) induced by Q is

I(Q) = \sum_{j,k} P(j)\, Q(k|j) \log \frac{Q(k|j)}{Q(k)},    (45)

where Q(k) = \sum_j P(j) Q(k|j). The rate distortion function R(D) is defined as

R(D) = \min_{Q \in Q_D} I(Q).

This convex optimization problem can be solved with the method of Lagrange multipliers [15], [6, Sec. 13.7]. We start with the functional

J(Q) = I(Q) + \lambda\, d(Q) + \sum_j \nu_j \sum_k Q(k|j),

where the last term comes from the constraint that Q(k|j) is a proper conditional distribution, i.e. satisfies \sum_k Q(k|j) = 1. The minimizing conditional distribution can be computed as

Q(k|j) = \frac{Q(k)\, e^{-\lambda \rho(j,k)}}{\sum_{k'} Q(k')\, e^{-\lambda \rho(j,k')}}.    (46)

The marginal Q(k) has to satisfy the following N̂ = |X̂| conditions:

\sum_j \frac{P(j)\, e^{-\lambda \rho(j,k)}}{\sum_{k'} Q(k')\, e^{-\lambda \rho(j,k')}} = 1, \quad \text{if } Q(k) > 0,    (47)

\sum_j \frac{P(j)\, e^{-\lambda \rho(j,k)}}{\sum_{k'} Q(k')\, e^{-\lambda \rho(j,k')}} \le 1, \quad \text{if } Q(k) = 0.    (48)

It can be shown that the conditions (48) are necessary and sufficient for the solutions of (47) to yield a point on the R(D) curve, either directly as in [15, Theorem 2.5.2], or via the Kuhn-Tucker conditions [6, Sec. 13.7]. The solution is further simplified through the following theorem by Berger:

Theorem 14 [15, Theorem 2.6.1] No more than N reproducing letters need be used to obtain any point on the R(D) curve that does not lie on a straight-line segment. At most, N̂ = N + 1 reproducing letters are needed for a point that lies on a straight-line segment.

B. Rate distortion of binary (1, N) sources

Proof: The following derivation relies heavily on the rate distortion results for discrete memoryless sources summarized in Appendix A. There it is shown that R(D) can be computed by solving a set of equations involving the marginal (random codebook) distribution Q(k) on the reconstruction alphabet. The symmetry of the input distribution, P(j) = 1/N (j = 1, …, N), suggests the following marginal distribution (with a slight abuse of notation):

Q = \left(q_0,\; q_1 = q_2 = \cdots = q_N = \frac{1 - q_0}{N}\right).    (49)

Let us first assume that qk > 0 holds for all k. Then the N +1 conditions (47) have to be met. We make the substitution


β = e^{−λ} and insert our Q(k) into the equation, first for k ≠ 0:

\frac{\beta^0}{q_0 \beta^1 + \frac{1-q_0}{N}\left(\beta^0 + (N-1)\beta^2\right)} + \frac{(N-1)\beta^2}{q_0 \beta^1 + \frac{1-q_0}{N}\left(\beta^0 + (N-1)\beta^2\right)} = \frac{1}{P(j)} = N,

which after some algebra becomes

q_0 \left( (N-1)\beta^2 - N\beta + 1 \right) = 0.    (50)

For k = 0 we get almost the same equation:

\frac{N \beta^1}{q_0 \beta^1 + \frac{1-q_0}{N}\left(\beta^0 + (N-1)\beta^2\right)} = \frac{1}{P(j)} = N,

which becomes

(1 - q_0)\left( (N-1)\beta^2 - N\beta + 1 \right) = 0.    (51)

The solution β = 1 corresponds to the point (0, D_max) (with D_max = 1) in the (R, D) plane, which is achieved by setting q_0 = 1. Therefore the interesting solution is β = 1/(N − 1), which when inserted into (46) yields

Q(k|j) = q_k (N-1)^{1-\rho(j,k)}.    (52)

Putting (52) into (44) we get the average distortion d(Q) = 1 − \frac{N-2}{N}(1 − q_0) and from (45) the rate I(Q) = \frac{N-2}{N}(1 − q_0)\log(N−1). Noting that these hold for q_0 > 0, we combine them to eliminate q_0 and get

D(R) = 1 - \frac{R}{\log(N-1)}, \quad \text{for } R \le \frac{N-2}{N}\log(N-1).    (53)

This proves the first part of equation (2). When R reaches its upper bound in (53), D reaches 2/N and we have q_0 = 0. At that point, equation (50) will be satisfied for all β. According to condition (48), equation (51) now becomes an inequality:
with shape parameter u > 0 obtained by solving (17). The lemma can be derived using the calculus of variations [6, Ch. 11], which leads to a maximizing density of the form

f(x) = e^{\lambda_1 - 1}\, x^{\lambda_3}\, e^{\lambda_2 x^2}.

The three constraints \int_0^\infty f(x)\,dx = 1, E X² = σ² and E log X = θ are satisfied by setting λ_3 = u − 1, λ_2 = −u/(2σ²) and e^{λ_1 − 1} = 2(−λ_2)^{u/2}/Γ(u/2). We need to show that E log |X| is monotone increasing, so that the mapping between θ and u is one-to-one. By Jensen's inequality we have E log |X| ≤ ½ log E X² = log σ. Let v = u/2 and φ(v) = 2(E log |X| − ½ log E X²) = ψ(v) − log v. Using a standard integral representation for ψ(v) [23] we obtain

\varphi(v) = -\frac{1}{2v} - 2\int_0^\infty \frac{t\, dt}{(t^2 + v^2)(e^{2\pi t} - 1)}, \qquad v > 0.    (63)

The first derivative,

\varphi'(v) = \frac{1}{2v^2} + 4\int_0^\infty \frac{v t\, dt}{(t^2 + v^2)^2 (e^{2\pi t} - 1)},    (64)

is strictly positive for v > 0, so E log |X| is indeed monotone increasing. By bounding the integral in (63) one can further show that lim_{u→∞} E log |X| = log σ. This proves the lemma.

For a random variable X with a general pdf f, the lemma implies that the pdf of its magnitude |X| must be of the form (62), or f(−x) + f(x) = g(|x|). Furthermore, the entropy cannot be maximal unless f(−x) = µg(x) and f(x) = (1 − µ)g(x) for x ≥ 0 and some 0 ≤ µ ≤ 1. Now, if we write the entropy integral as −\int_0^\infty f(-x)\log f(-x)\,dx − \int_0^\infty f(x)\log f(x)\,dx, we see immediately that this is maximal iff µ = 1/2, that is f(−x) = f(x) = g(x)/2. This proves the theorem.

ACKNOWLEDGMENTS

The authors wish to thank Emre Telatar for helpful discussions, as well as the reviewers for their valuable suggestions and comments.

REFERENCES

[1] D. L. Donoho, M. Vetterli, R. DeVore, and I. Daubechies, “Data compression and harmonic analysis,” IEEE Trans. Inform. Theory, vol. IT-44, pp. 2435–2476, Oct. 1998.
[2] M. Vetterli, “Wavelets, approximation, and compression,” IEEE Signal Processing Magazine, vol. 18, pp. 59–73, Sep. 2001.
[3] S. Mallat, A Wavelet Tour of Signal Processing. Academic Press, 1998.
[4] A. Cohen and J.-P. D’Alès, “Nonlinear approximation of random functions,” SIAM J. Appl. Math., vol. 57, no. 2, pp. 518–540, Apr. 1997.
[5] S. Mallat and F. Falzon, “Analysis of low bit rate image transform coding,” IEEE Trans. Signal Proc., vol. 46, pp. 1027–1042, Apr. 1998.
[6] T. M. Cover and J. A. Thomas, Elements of Information Theory. John Wiley & Sons, 1991.
[7] J. M. Shapiro, “Embedded image coding using zerotrees of wavelet coefficients,” IEEE Trans. Signal Proc., Special Issue on Wavelets and Signal Processing, vol. 41, no. 12, pp. 3445–3462, Dec. 1993.


[8] A. Cohen, I. Daubechies, O. Guleryuz, and M. Orchard, “On the importance of combining wavelet-based nonlinear approximation with coding strategies,” IEEE Trans. Inform. Theory, vol. 48, pp. 1895–1921, Jul. 2002. [9] M. Vetterli, P. Marziliano, and T. Blu, “Sampling signals with finite rate of innovation,” IEEE Trans. Signal Proc., vol. 50, no. 6, pp. 1417–1428, Jun. 2002. [10] E. Cand`es, J. Romberg, and T. Tao, “Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information,” IEEE Trans. Inform. Theory, vol. 52, no. 2, pp. 489–509, Feb. 2006. [11] D. L. Donoho, “Compressed sensing,” IEEE Trans. Inform. Theory, vol. 52, no. 4, pp. 1289–1306, Apr. 2006. [12] A. K. Fletcher, S. Rangan, and V. K Goyal, “On the rate-distortion performance of compressed sensing,” in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), vol. 3, Apr. 15–20 2007, pp. III–885–III–888. [13] C. Weidmann and M. Vetterli, “Oligoquantization or the quantization of the few,” 2009, to be submitted to IEEE Trans. Inform. Theory. [14] E. P. Simoncelli, “Statistical modeling of photographic images,” in Handbook of Video and Image Processing, 2nd ed., A. C. Bovik, Ed. Academic Press, 2005, pp. 431–442. [Online]. Available: http://www.cns.nyu.edu/pub/eero/simoncelli05a-preprint.pdf [15] T. Berger, Rate Distortion Theory: A Mathematical Basis for Data Compression. Prentice-Hall, 1971. [16] M. Pinsker, Information and Information Stability of Random Variables and Processes. New York: Holden-Day, 1964. [17] H. Rosenthal and J. Binia, “On the epsilon entropy of mixed random variables,” IEEE Trans. Inform. Theory, vol. IT-34, pp. 1110–1114, Sep. 1988. [18] A. Gy¨orgy, T. Linder, and K. Zeger, “On the rate-distortion function of random vectors and stationary sources with mixed distributions,” IEEE Trans. Inform. Theory, vol. IT-45, pp. 2110–2115, Sep. 1999. [19] B. B´enichou and N. Saito, “Sparsity vs. statistical independence in adaptive signal representations: A case study of the spike process,” in Beyond Wavelets, ser. Studies in Computational Mathematics, G. V. Welland, Ed. Academic Press, 2003, vol. 10, ch. 9, pp. 225–257. [Online]. Available: http://www.math.ucdavis.edu/∼saito/publications/ [20] G. Hardy, J. Littlewood, and G. P´olya, Inequalities, 2nd ed. Cambridge University Press, 1952. [21] A. D. Wyner, “An upper bound on the entropy series,” Inform. Contr., vol. 20, pp. 176–181, 1972. [22] E. H. Lieb and M. Loss, Analysis, 2nd ed. American Mathematical Society, 2001. [23] I. S. Gradshteyn and I. M. Ryzhik, Table of Integrals, Series, and Products, 5th ed. New York: Academic Press, 1994. [24] C. Weidmann, “Oligoquantization in low-rate lossy source coding,” Ph.D. dissertation, EPFL, Lausanne, Switzerland, July 2000. [25] D. J. Sakrison, “Worst sources and robust codes for difference distortion measures,” IEEE Trans. Inform. Theory, vol. IT-21, pp. 301–309, May 1975. [26] H. Gish and J. N. Pierce, “Asymptotically efficient quantizing,” IEEE Trans. Inform. Theory, vol. IT-14, pp. 676–683, Sep. 1968. [27] M. S. Crouse, R. D. Nowak, and R. G. Baraniuk, “Wavelet-based statistical signal processing using hidden Markov models,” IEEE Trans. Signal Proc., vol. 46, pp. 886–902, Apr. 1998. [28] A. M. Gerrish and P. M. Schultheiss, “Information rates of non-Gaussian processes,” IEEE Trans. Inform. Theory, vol. IT-10, pp. 265–271, Oct. 1964. [29] R. M. 
Gray, “A new class of lower bounds to information rates of stationary sources via conditional rate-distortion functions,” IEEE Trans. Inform. Theory, vol. IT-19, pp. 480–489, Jul. 1973. [30] A. Hjørungnes and J. M. Lervik, “Jointly optimal classification and uniform threshold quantization in entropy constrained subband image coding,” in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), vol. 4, 1997, pp. 3109–3112. [31] R. M. Gray, “Gauss mixture vector quantization,” in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), vol. 3, May 2001, pp. 1769–1772. [32] D. F. Andrews and C. L. Mallows, “Scale mixtures of normal distributions,” J. Royal Statistical Society, Series B, vol. 36, no. 1, pp. 99–102, 1974. [33] V. K Goyal, “Theoretical foundations of transform coding,” IEEE Signal Processing Mag., vol. 18, no. 5, pp. 9–21, Sep. 2001. [34] A. Gersho and R. M. Gray, Vector Quantization and Signal Compression. Kluwer Academic Publishers, 1992.


[35] A. D. Wyner and J. Ziv, “The rate-distortion function for source coding with side information at the receiver,” IEEE Trans. Inform. Theory, vol. IT-22, no. 1, pp. 1–10, Jan. 1976. [36] A. D. Wyner, “The rate-distortion function for source coding with side information at the decoder-II: General sources,” Inform. Contr., vol. 38, pp. 60–80, 1978. [37] R. Zamir, “The rate loss in the Wyner-Ziv problem,” IEEE Trans. Inform. Theory, vol. 42, pp. 2073–2084, Nov. 1996. [38] S. Pradhan, J. Chou, and K. Ramchandran, “Duality between source coding and channel coding and its extension to the side information case,” IEEE Trans. Inform. Theory, vol. 49, no. 5, pp. 1181–1203, May 2003. [39] F. Bassi, M. Kieffer, and C. Weidmann, “Source coding with intermittent and degraded side information at the decoder,” in Proc. ICASSP, Las Vegas, NV, USA, Mar. 30 – Apr. 4, 2008. [40] C. T. K. Ng, C. Tian, A. J. Goldsmith, and S. Shamai (Shitz), “Minimum expected distortion in Gaussian source coding with uncertain side information,” in Proc. Information Theory Workshop (ITW), Lake Tahoe, CA, USA, Sep. 2–6 2007, pp. 454–459. [41] R. G. Baraniuk, E. Cand`es, R. Nowak, and M. Vetterli (guest editors), “Sensing, sampling, and compression,” IEEE Signal Processing Mag., vol. 25, no. 2, Mar. 2008. [42] M. Wakin, M. Duarte, S. Sarvotham, D. Baron, and R. Baraniuk, “Recovery of jointly sparse signals from few random projections,” in Proc. Neural Information Processing Systems (NIPS), Dec. 2005. [43] D. Kelker, “Distribution theory of spherical distributions and a locationscale parameter generalization,” Sankhya, vol. 32, Ser. A, pp. 419–430, 1970. [44] V. K Goyal, A. K. Fletcher, and S. Rangan, “Compressive sampling and lossy compression,” IEEE Signal Processing Mag., vol. 25, no. 2, pp. 48–56, Mar. 2008. [45] C. Weidmann, F. Bassi, and M. Kieffer, “Practical distributed source coding with impulse-noise degraded side information at the decoder,” in Proc. EUSIPCO, Lausanne, Switzerland, Aug. 2008.
