EMPIRICAL QUANTIZATION FOR SPARSE SAMPLING SYSTEMS

Michael A. Lexa
The University of Edinburgh, Institute for Digital Communications
This work was funded by SPC (Spouse's Pay Check).

ABSTRACT

We propose a quantization design technique (estimator) suitable for new compressed sensing sampling systems whose ultimate goal is classification or detection. The design is based on empirical divergence maximization, an approach akin to the well-known technique of empirical risk minimization. We show that the estimator's rate of convergence to the "best in class" estimate can be as fast as n^{−1}, where n equals the number of training samples.

Index Terms— empirical estimators, Kullback-Leibler divergence, compressed sensing, quantization for classification

1. INTRODUCTION

The theory of compressed sensing (CS) clearly challenges traditional methods of converting analog waveforms into digital signals. For signals that have a sparse representation, that is, for signals that can be represented, or well approximated, by a small number of basis vectors, the theory of compressed sensing tells us that it is possible to recover these vectors from far fewer samples than previously thought. This result quickly spawned new research into radically different analog signal sampling schemes [1, 2]. Less recognized, however, is the opportunity CS affords researchers to inject new ideas into quantization: the second necessary step in analog-to-digital conversion. In this paper, we aim at the latter effort and introduce an empirical design strategy that is capable of producing quantization rules with fast rates of convergence and that explicitly shows the parameters on which these rates depend.

The empirical nature of the strategy is particularly advantageous for newly proposed CS analog sampling schemes because they often process waveforms in peculiar ways prior to sampling. Consider, for example, the random demodulator scheme proposed in [1], which multiplies the analog signal by a random waveform taking values ±1. In this case, even if one has an accurate probability model of the analog waveform, it may be difficult to propagate the model through the sampling process. Fast convergence is also advantageous for CS analog sampling schemes because their primary goal is to sample at rates as slow as possible (sub-Nyquist). Thus, strategies that converge quickly to the theoretically "best in class" rule hold a significant advantage over strategies with slower convergence.

Because the present work is centered on a performance measure specifically related to signal detection, it is best viewed as an extension of quantization for classification, a type of problem whose goal is to design quantization rules that optimize performance measures related to detection (classification) [3]. The presumption, therefore, is that the quantized samples will later be used to make a decision about the input analog waveform. Here, the measure of interest is the Kullback-Leibler (KL) divergence.

Let P and Q be two probability measures defined on [0, 1]^d and let p and q denote their respective density functions. Any quantization rule φ : R^d → {0, . . . , L−1} that operates on a random vector X (distributed according to P or Q) induces the probability mass functions (pmfs) p(φ) = (p_0(φ), . . . , p_{L−1}(φ)) and q(φ) = (q_0(φ), . . . , q_{L−1}(φ)), where p_i(φ) = P(φ(X) = i) and similarly for q_i(φ). In this context, the KL divergence is defined as

\[
D_{KL}(p(\phi), q(\phi)) = -\sum_{i=0}^{L-1} p_i(\phi) \log\!\left(\frac{q_i(\phi)}{p_i(\phi)}\right).
\]
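To make the induced pmfs and their divergence concrete, here is a small numerical sketch (ours, not from the paper): P is uniform on [0, 1], Q has density q(x) = 2x, and φ is a 4-level uniform scalar quantizer. The densities, number of levels, and cell edges are illustrative choices only.

```python
import numpy as np

# Illustrative example: P uniform on [0, 1], Q with density q(x) = 2x
# (a Beta(2, 1) law), and a 4-level uniform scalar quantizer phi.
L = 4
edges = np.linspace(0.0, 1.0, L + 1)

# Induced pmfs: p_i = P(phi(X) = i), q_i = Q(phi(X) = i), computed exactly.
p_pmf = np.diff(edges)          # P([a, b]) = b - a
q_pmf = np.diff(edges ** 2)     # Q([a, b]) = b^2 - a^2

# KL divergence of the induced pmfs: -sum_i p_i log(q_i / p_i).
kl = -np.sum(p_pmf * np.log(q_pmf / p_pmf))
print(f"D_KL(p(phi), q(phi)) = {kl:.4f} nats")   # about 0.22 nats
```

Coarsening [0, 1] into a few wide cells can only lose divergence relative to the unquantized densities (here D_KL(p, q) ≈ 0.31 nats), which is why the design goal below is to pick rules whose induced pmfs are maximally divergent.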

The KL divergence is a well-known quantity related to an optimal detector's performance [4]. Generally speaking, signals characterized by p and q become easier to distinguish (classify or detect) as their divergence increases. Thus, we are interested in constructing quantization rules φ̂_n, based solely on 2n samples (n samples from p and n from q), that induce maximally divergent pmfs. In statistical learning theory, when p and q are assumed unknown, empirical risk estimators minimize the empirical risk (empirical probability of error) by searching among a pre-specified class of candidate estimates [5]. We adopt this approach, but seek to maximize an empirical form of the KL divergence over a pre-specified class of quantization rules Φ. More concretely, we seek an estimator of the form

\[
\hat{\phi}_n = \arg\max_{\phi \in \Phi} D_n(\phi),
\]

where D_n(φ) represents an, as yet unspecified, empirical estimate of the KL divergence.

2. EMPIRICAL DIVERGENCE ESTIMATION

The form of D_n(φ) is taken from recent work by Nguyen et al. [6] and relies on rewriting the convex function −log(·) appearing in the definition of the KL divergence.

2.1. Convex conjugates

The notion of a convex conjugate is based on the observation that a curve can be described either by its graph or by the envelope of its tangent lines. More concretely, a (closed) convex function f : R → R is the pointwise supremum of the collection of all affine functions h(x) = x x* − µ* such that h ≤ f (x*, µ* ∈ R); see Figure 1. From this perspective, we can describe f by the set of slope-intercept pairs (x*, −µ*) that parameterize the envelope of f. We observe that the condition h(x) ≤ f(x) holds for every x if and only if

\[
\mu^* \ge \sup_{x \in \mathbb{R}} \{ x x^* - f(x) \}.
\]

This inequality implies that f is the pointwise supremum of a collection of affine functions h such that the set of all pairs (x*, µ*) lies within the epigraph of a function f*(x*) defined by

\[
f^*(x^*) = \sup_{x} \{ x x^* - f(x) \}.
\]

Analogously, we can describe f* as the pointwise supremum of the affine functions g(x*) = x* x − µ such that (x, µ) lies in the epigraph of f. This duality means that we can write

\[
f(x) = \sup_{x^*} \{ x^* x - f^*(x^*) \}. \qquad (1)
\]

Fig. 1. Left: The graph of the convex function f(x) = −ln(x) is shown, as well as three representative affine tangents (dashed lines) creating a portion of the envelope of f. Three slope-intercept pairs (x*, −µ*) are indicated: (−4, 2.4), (−1, 1), and (−0.5, 0.3). The graph of the convex conjugate f*(x*) = −1 − ln(−x*) is also shown; note that the slope-intercept pairs (dots) lie within the epigraph of f*. Right: The graph of f is shown again along with three functions of the form x*x + 1 + ln(−x*), whose maxima over x* for a given x equal f(x). The f(x) values computed this way correspond to the tangent points shown on the left.
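As a quick numerical sanity check of this conjugate pair (our illustration, not part of the paper), the sketch below approximates f*(x*) = sup_x {x x* − f(x)} for f(x) = −log(x) on a grid and compares it with the closed form −1 − log(−x*), then checks the duality (1) in the other direction. The grid bounds and test points are arbitrary.

```python
import numpy as np

def f(x):
    """f(x) = -log(x), defined for x > 0."""
    return -np.log(x)

def f_star(x_star):
    """Closed-form conjugate: f*(x*) = -1 - log(-x*) for x* < 0."""
    return -1.0 - np.log(-x_star)

# f*(x*) = sup_x { x x* - f(x) }, approximated on a coarse grid over x > 0.
x_grid = np.linspace(1e-4, 50.0, 500_000)
for x_star in (-4.0, -1.0, -0.5):
    numeric = np.max(x_star * x_grid - f(x_grid))
    print(f"x* = {x_star:5.1f}:  grid sup = {numeric:.4f}   closed form = {f_star(x_star):.4f}")

# Duality (1): f(x) = sup_{x* < 0} { x* x - f*(x*) }.
xs_grid = np.linspace(-50.0, -1e-4, 500_000)
for x in (0.25, 1.0, 3.0):
    numeric = np.max(xs_grid * x - f_star(xs_grid))
    print(f"x  = {x:5.2f}:  grid sup = {numeric:.4f}   f(x) = {f(x):.4f}")
```

The grid maxima agree with the closed forms to a few decimal places, which is all that is needed to motivate the substitution carried out next.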

Now, suppose γ : [0, 1]^d → {0, . . . , L−1} is an arbitrary quantization rule characterized by the partitioning sets {R_i}_{i=0}^{L−1}. Using (1), we write the divergence between the pmfs induced by γ as

\[
D_{KL}(p(\gamma), q(\gamma)) = \sum_{i=0}^{L-1} p_i(\gamma)\, f\!\left(\frac{q_i(\gamma)}{p_i(\gamma)}\right) \qquad (2a)
\]
\[
= \sum_{i=0}^{L-1} p_i(\gamma) \cdot \sup_{x^*} \left\{ \frac{q_i(\gamma)}{p_i(\gamma)}\, x^* - f^*(x^*) \right\}, \qquad (2b)
\]

where f(x) = −log(x) for x > 0 and f(x) = +∞ otherwise. Calculating the convex conjugate, one finds

\[
f^*(x^*) = \begin{cases} -1 - \log(-x^*) & \text{if } x^* < 0 \\ +\infty & \text{if } x^* \ge 0. \end{cases}
\]

Substituting this expression into (2b), we have the following expressions for the KL divergence:

\[
D_{KL}(p(\gamma), q(\gamma)) = \sum_{i=0}^{L-1} p_i(\gamma) \sup_{x_i^* \in \mathbb{R}_-} \left\{ \frac{q_i(\gamma)}{p_i(\gamma)}\, x_i^* + 1 + \log(-x_i^*) \right\}
\]
\[
= \sum_{i=0}^{L-1} p_i(\gamma) \sup_{c_{R_i} \in \mathbb{R}_+} \left\{ \log(c_{R_i}) - c_{R_i}\, \frac{q_i(\gamma)}{p_i(\gamma)} + 1 \right\}
\]
\[
= 1 + \sum_{i=0}^{L-1} \sup_{c_{R_i} \in \mathbb{R}_+} \left\{ P(R_i) \log(c_{R_i}) - c_{R_i}\, Q(R_i) \right\},
\]

where in the second step we let c_{R_i} = −x_i^*, and in the last step we used the fact that p_i(γ) = P(R_i) with R_i = {x : γ(x) = i}. The validity of the last expression is easily verified by differentiating it with respect to c_{R_i} and solving for the maximizers: the supremum in each cell is attained at c_{R_i} = P(R_i)/Q(R_i), which recovers D_KL(p(γ), q(γ)) = Σ_i P(R_i) log(P(R_i)/Q(R_i)). By defining the piecewise constant function

\[
\phi(x) = \sum_{i=0}^{L-1} c_{R_i} \mathbf{1}_{R_i}(x), \qquad c_{R_i} \in \mathbb{R}_+, \qquad (3)
\]

we can write D_KL(p(γ), q(γ)) in integral form:

\[
D_{KL}(p(\gamma), q(\gamma)) = 1 + \sup_{\phi} \left\{ \int_{[0,1]^d} \log(\phi)\, dP - \int_{[0,1]^d} \phi\, dQ \right\}, \qquad (4)
\]

where 1_{R_i} denotes the indicator function of R_i, and the supremum is taken over all functions of the form (3).

2.2. Empirical estimator

For positive constants m > 0 and M < ∞, define the candidate function class

\[
\Phi(m, M, L) = \left\{ \phi(x) = \sum_{i=0}^{L-1} c_{R_i} \mathbf{1}_{R_i}(x) : m \le c_{R_i} \le M \right\}.
\]

Φ is a set of piecewise constant functions with L levels that are bounded and positive. Here, we do not specify the partition {R_i}_{i=0}^{L−1} nor a class to which it belongs. For the moment, however, consider Φ to be a function class that generally does not contain the theoretically optimal quantization rule. Define the function D_n(φ) as the empirical counterpart of (4),

\[
D_n(\phi) = 1 + \frac{1}{n} \sum_{i=1}^{n} \log \phi(X_i^p) - \frac{1}{n} \sum_{i=1}^{n} \phi(X_i^q), \qquad (5)
\]

and consider the following empirical estimator

\[
\hat{\phi}_n = \arg\max_{\phi \in \Phi(m, M, L)} D_n(\phi), \qquad (6)
\]

where {X_i^p}_{i=1}^n and {X_i^q}_{i=1}^n in (5) are observations distributed according to p and q respectively. φ̂_n is an empirical divergence maximization estimator akin to the familiar empirical risk minimization estimators [5]. Note that D_n(φ) is not in general a KL divergence; it can in fact be negative for some φ ∈ Φ. It is a consistent estimator, however, converging to the "best in class" estimator as n → ∞ [6]. The best in class estimate φ* is that element in Φ that maximizes D(φ),

\[
\phi^* = \arg\max_{\phi \in \Phi(m, M, L)} D(\phi), \qquad (7)
\]

where

\[
D(\phi) = 1 + \int_{[0,1]^d} \log(\phi)\, dP - \int_{[0,1]^d} \phi\, dQ.
\]
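When the partition {R_i} is held fixed, the maximization in (6) separates across cells: writing D_n(φ) = 1 + Σ_i [p̂_i log c_{R_i} − q̂_i c_{R_i}], with p̂_i and q̂_i the empirical frequencies of the p- and q-samples in R_i, each level is maximized at c_{R_i} = p̂_i/q̂_i clipped to [m, M]. The sketch below (our illustration; the paper leaves the partition class unspecified) implements this special case of φ̂_n with arbitrarily chosen densities and bins.

```python
import numpy as np

def empirical_phi_hat(x_p, x_q, edges, m=0.05, M=20.0):
    """Empirical divergence maximization over a FIXED scalar partition.

    For phi = sum_i c_i 1_{R_i}, D_n(phi) = 1 + sum_i [p_hat_i log c_i - q_hat_i c_i],
    so each level can be maximized independently at c_i = p_hat_i / q_hat_i,
    clipped to the interval [m, M] that defines Phi(m, M, L).
    """
    L = len(edges) - 1
    idx_p = np.clip(np.digitize(x_p, edges[1:-1]), 0, L - 1)
    idx_q = np.clip(np.digitize(x_q, edges[1:-1]), 0, L - 1)
    p_hat = np.bincount(idx_p, minlength=L) / len(x_p)
    q_hat = np.bincount(idx_q, minlength=L) / len(x_q)
    # Unconstrained per-cell maximizer p_hat/q_hat; empty q-cells push c_i to M.
    c = np.where(q_hat > 0, p_hat / np.maximum(q_hat, 1e-12), M)
    c = np.clip(c, m, M)
    d_n = 1.0 + np.sum(p_hat * np.log(c)) - np.sum(q_hat * c)
    return c, d_n

# Usage with arbitrary illustrative data: p uniform, q with density 2x on [0, 1].
rng = np.random.default_rng(1)
n = 5_000
x_p = rng.uniform(0.0, 1.0, n)
x_q = rng.beta(2.0, 1.0, n)
edges = np.linspace(0.0, 1.0, 5)     # L = 4 cells
c_hat, d_n = empirical_phi_hat(x_p, x_q, edges)
print("levels c_hat:", np.round(c_hat, 3), "  D_n(phi_hat):", round(d_n, 4))
```

Searching over the partition itself (for example, over likelihood-ratio thresholds in the spirit of [7]) would sit on top of this closed-form inner step; that outer search is not shown here.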

Tsitsiklis [7] showed that, without any restrictions on the quantization rules' partitions, the theoretically optimal quantization rule can always be constructed by thresholding the likelihood ratio. In other words, the quantization rule that maximizes the KL divergence of the pmfs it induces can always be chosen to be a piecewise constant function defined on a partition whose boundary sets are level sets of the likelihood ratio q(x)/p(x). Let γ* denote this theoretically optimal rule.

3. ESTIMATION ERROR

We gauge the quality of φ̂_n by characterizing the decay rate of the estimation error. This error is defined as the difference D(φ*) − D(φ̂_n) and quantifies the error caused by computing φ̂_n without knowledge of p and q. Approximation error is defined as D(γ*) − D(φ*) and arises in cases where γ* ∉ Φ. Investigations of these errors typically begin with two basic inequalities that follow from the definitions of φ* and φ̂_n: D(φ*) − D(φ̂_n) ≥ 0 and D_n(φ*) − D_n(φ̂_n) ≤ 0. They imply that the estimation error is upper bounded by a difference of empirical processes,

\[
0 \le D(\phi^*) - D(\hat{\phi}_n) \le -\left[(D_n(\phi^*) - D(\phi^*)) - (D_n(\hat{\phi}_n) - D(\hat{\phi}_n))\right] = -(\nu_n(\phi^*) - \nu_n(\hat{\phi}_n))/\sqrt{n},
\]

where the second inequality results from adding and subtracting D_n(φ*) and D_n(φ̂_n), and ν_n(γ) = √n (D_n(γ) − D(γ)). By adding the approximation error to both sides of the above inequality, we have that the total error is bounded by the two component errors:

\[
0 \le \underbrace{D(\gamma^*) - D(\hat{\phi}_n)}_{\text{total error}} \le \underbrace{-(\nu_n(\phi^*) - \nu_n(\hat{\phi}_n))/\sqrt{n}}_{\text{upper bound on est. error}} + \underbrace{D(\gamma^*) - D(\phi^*)}_{\text{approx. error}}. \qquad (8)
\]
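The decomposition in (8) can be watched numerically. For a fixed partition, φ* has the closed form c*_{R_i} = P(R_i)/Q(R_i) clipped to [m, M], so D(φ*) is computable exactly and the estimation error D(φ*) − D(φ̂_n) can be averaged over repeated draws. The experiment below is purely illustrative (our choice of densities, partition, and constants, not the paper's); the observed decay is only indicative of the rates analyzed next.

```python
import numpy as np

rng = np.random.default_rng(2)

# Fixed partition of [0, 1] into L equal cells; P uniform, Q with density 2x.
L, m, M = 4, 0.05, 20.0
edges = np.linspace(0.0, 1.0, L + 1)
P_cell = np.diff(edges)          # true P(R_i)
Q_cell = np.diff(edges ** 2)     # true Q(R_i)

def D(c):
    """Population objective D(phi) = 1 + sum_i [P(R_i) log c_i - c_i Q(R_i)]."""
    return 1.0 + np.sum(P_cell * np.log(c) - c * Q_cell)

c_star = np.clip(P_cell / Q_cell, m, M)   # best-in-class phi* for this partition
D_star = D(c_star)

for n in (100, 1_000, 10_000, 100_000):
    errs = []
    for _ in range(100):                  # average over repeated sample draws
        x_p = rng.uniform(0.0, 1.0, n)
        x_q = rng.beta(2.0, 1.0, n)
        p_hat = np.bincount(np.clip(np.digitize(x_p, edges[1:-1]), 0, L - 1), minlength=L) / n
        q_hat = np.bincount(np.clip(np.digitize(x_q, edges[1:-1]), 0, L - 1), minlength=L) / n
        c_hat = np.clip(p_hat / np.maximum(q_hat, 1e-12), m, M)
        errs.append(D_star - D(c_hat))    # estimation error, always nonnegative
    print(f"n = {n:>6d}   mean estimation error ≈ {np.mean(errs):.2e}")
```

In this toy setting the partition is fixed, so the approximation error D(γ*) − D(φ*) is a constant offset and only the estimation error shrinks with n.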

We use this expression below as a basis to investigate the expected decay rate of the estimation error.

3.1. Decay Rate Result

By writing |ν_n(φ) − ν_n(φ*)|/√n as

\[
\left| \frac{1}{n}\sum_{i=1}^{n} \bigl(\log \phi(X_i^p) - \log \phi^*(X_i^p)\bigr) - E_p\bigl(\log \phi(X) - \log \phi^*(X)\bigr) + \frac{1}{n}\sum_{i=1}^{n} \bigl(\phi(X_i^q) - \phi^*(X_i^q)\bigr) - E_q\bigl(\phi(X) - \phi^*(X)\bigr) \right|,
\]

it is clear that for a given quantization rule φ, the empirical averages above converge almost surely to their respective values by the strong law of large numbers. But because φ̂_n can potentially be any element in Φ, any characterization of the convergence rate must hold uniformly over φ ∈ Φ. It is well known that uniform rates of convergence depend on the complexity of the function class from which the empirical estimators are drawn [8]. Here, we use the notion of bracketing entropy to characterize the complexity of Φ. Roughly speaking, the bracketing entropy of a function class G equals the logarithm of the minimum number of function pairs that upper and lower bound (bracket) all the members of G with respect to some norm. We denote the bracketing entropy by H_B(δ, G, L_2(P)) and say G has bracketing complexity α > 0 if H_B(δ, G, L_2(P)) ≤ A δ^{−α} for all δ > 0 and for some constant A > 0.

Condition 1. The bracketing complexity of Φ satisfies 0 < α < 2.

In addition to the complexity condition, we incorporate a second condition that Mammen and Tsybakov [9] and Tsybakov [10] introduced in related function estimation and Bayesian classification problems; it involves a key parameter in achieving rates faster than n^{−1/2}. We first state the condition and then comment on its interpretations after we present the convergence result.

Condition 2. There exist constants K > 0 and κ ≥ 1 such that for all φ ∈ Φ,

\[
D(\gamma^*) - D(\phi) \ge \|\gamma^* - \phi\|_{L_2}^{\kappa} / K. \qquad (9)
\]

it is clear that for a given quantization rule φ, the empirical averages above converge almost surely to their respective values by the strong bn can potentially be any elelaw of large numbers. But because φ ment in Φ, any characterization of the convergence rate must hold uniformly over φ ∈ Φ. It is well-known that uniform rates of convergence depend on the complexity of the function class from which the empirical estimators are drawn [8]. Here, we use the notion of bracketing entropy to characterize complexity of Φ. Roughly speaking, the bracketing entropy of a function class G equals the logarithm of the minimum number of function pairs that upper and lower bound (bracket) all the members in G with respect to some norm. We denote the bracketing entropy by HB (δ, G, L2 (P )) and say G has bracketing complexity α > 0 if HB (δ, G, L2 (P )) ≤ Aδ −α for all δ > 0 and for some constant A > 0. Condition 1. The bracketing complexity of Φ satisfies 0 < α < 2. In addition to the complexity condition, we incorporate a second condition that Mammen,Tsybakov [9] and Tsybakov [10] introduced in related function estimation and Bayesian classification problems that involves a key parameter in achieving rates faster than n−1/2 . We first state the condition and then comment on its interpretations after we present the convergence result. Condition 2. There exists constants K > 0 and κ ≥ 1 such that for all φ ∈ Φ, D(γ ∗ ) − D(φ) ≥ kγ ∗ − φkκL2 /K. (9)

The last condition is a technical condition. Condition 3. The densities p and q are uniformly bounded, that is c ≤ p(x), q(x) ≤ C for all x ∈ [0, 1]d , c > 0, C < ∞. bn , φ∗ , and γ ∗ be as defined above and suppose Main Result. Let φ Conditions 1, 2 and 3 are met for some constants A, α, K, κ, c and C. Then for κ > 1 − α/2 and for any 0 < ² < 1 we have “ ” bn ) ≤ 1 + ² D(γ ∗ ) − ED(φ 1−² – » κ − 2(κ−1)+α ∗ ∗ const(A, α, K, κ, c, C) n + D(γ ) − D(φ ) , for sufficiently large n. Depending on the value of κ and α, the decay rate of the estimation error can be as fast as n−1 and no worse than n−1/2 . In fact, if P , Q, and Φ are such that κ = 1 and if α = 1, the rate n−1 is guaranteed. Observe that if D(γ ∗ ) − D(φ) is lower bounded by a positive constant for all φ ∈ Φ, Condition 2 can be met with κ = 1. In other words, if the approximation error is nonzero, convergence rates as fast as n−1 are possible. If γ ∈ Φ, this statement no longer holds. Therefore, to achieve faster rates one must exclude elements of Φ that achieve optimal performance, that is exclude elements φ0 for which D(φ0 ) = D(γ ∗ ). In the special case where all the partitions of φ ∈ Φ coincide with that of γ ∗ , it can be shown that Condition 2 is met with κ = 2. One can consider Condition 2 either as an assumption on Φ, or an assumption on P and Q, since P and Q determine γ ∗ . Finally, note that as κ tends to infinity, the decay rate tends to n−1/2 regardless of the bracketing complexity of Φ. 4. PROOF OF MAIN RESULT This proof is a modification of a result by van de Geer [11, pp. 206207]. Supporting lemmas are stated below the main proof. Define the random variable Zn =

bn ) − νn (φ∗ )| |νn (φ bn C (α+2)/4 kφ

− φ∗ kβL2 ∨ n−(2−α)/2(2+α)

,

(10)

where x ∨ y = max (x, y) and β = 1 − α/2. We consider the bn − φ∗ kL > C −1/2 n−1/(2+α) and following two cases: (1) kφ 2 ∗ −1/2 bn − φ kL ≤ C (2) kφ n−1/(2+α) . 2 Case 1. Under this case (10) simplifies to Zn =

bn ) − νn (φ∗ )| |νn (φ . bn − φ∗ kβ C (α+2)/4 kφ

(11)

L2

For φ = φ∗ , (11) is defined to be zero. Recalling the inequality (8), we have √ bn ) ≤ −(νn (φ∗ ) − νn (φ bn ))/ n + D(γ ∗ ) − D(φ∗ ) D(γ ∗ ) − D(φ √ bn − φ∗ kβ / n ≤ Zn C (α+2)/4 kφ L2 (12) + D(γ ∗ ) − D(φ∗ ). Assumption 2 implies bn kβ ≤ K β/κ (D(γ ∗ ) − D(φ bn ))β/κ kγ ∗ − φ L2 kγ ∗ − φ∗ kβL2 ≤ K β/κ (D(γ ∗ ) − D(φ∗ ))β/κ .

(13)

Lemma 4.1 (van de Geer [11]). We have for all positive v, t, and ², and κ > β,

Hence, by the triangle inequality and (13), we have bn − φ∗ kβ ≤ kγ ∗ − φ bn kβ + kγ ∗ − φ∗ kβ kφ L2 L2 L2 bn ))β/κ ≤ K β/κ (D(γ ∗ ) − D(φ

(14)

+ K β/κ (D(γ ∗ ) − D(φ∗ ))β/κ .

bn ) is less than or equal to Using (14) in (12), shows D(γ ∗ ) − D(φ h bn ))β/κ + C (α+2)/4 C (α+2)/4 K β/κ n−1/2 Zn (D(γ ∗ ) − D(φ i K β/κ n−1/2 Zn (D(γ ∗ ) − D(φ∗ ))β/κ + D(γ ∗ ) − D(φ∗ ). We now apply Lemma 4.1 to each of the terms within the brackets to obtain h i bn ) ≤ ² (D(γ ∗ ) − D(φ bn )) + (D(γ ∗ ) − D(φ∗ )) D(γ ∗ ) − D(φ + 2C

−κβ 2(κ+β)

“K ”

β κ−β

²

κ κ−β

Zn

κ − 2(κ−β)

n





+ D(γ ) − D(φ ).

By rearranging the previous expression and dropping a factor of 1/(1+ε) < 1, we have

\[
D(\gamma^*) - D(\hat{\phi}_n) \le \frac{1+\epsilon}{1-\epsilon} \left[ 2\, C^{-\frac{\kappa\beta}{2(\kappa+\beta)}} (K/\epsilon)^{\frac{\beta}{\kappa-\beta}} Z_n^{\frac{\kappa}{\kappa-\beta}} n^{-\frac{\kappa}{2(\kappa-\beta)}} + D(\gamma^*) - D(\phi^*) \right]. \qquad (15)
\]

For any r ≥ κ/(κ−β), we have by Jensen's inequality [4] and Lemma 4.2 that

\[
E\, Z_n^{\frac{\kappa}{\kappa-\beta}} = E\left[ (Z_n^{r})^{\frac{\kappa}{r(\kappa-\beta)}} \right] \le \left( E\, Z_n^{r} \right)^{\frac{\kappa}{r(\kappa-\beta)}} \le c_2^{\frac{\kappa}{\kappa-\beta}}. \qquad (16)
\]

Taking the expectation of (15) and applying (16), we conclude

\[
D(\gamma^*) - E\,D(\hat{\phi}_n) \le \frac{1+\epsilon}{1-\epsilon} \left[ 2\, C^{-\frac{\kappa\beta}{2(\kappa+\beta)}} (K/\epsilon)^{\frac{\beta}{\kappa-\beta}} c_2^{\frac{\kappa}{\kappa-\beta}} n^{-\frac{\kappa}{2(\kappa-\beta)}} + D(\gamma^*) - D(\phi^*) \right],
\]

for ‖φ̂_n − φ*‖_{L_2} > C^{−1/2} n^{−1/(2+α)}.

Case 2. For this case,

\[
Z_n = \frac{|\nu_n(\hat{\phi}_n) - \nu_n(\phi^*)|}{n^{-(2-\alpha)/(2(2+\alpha))}}.
\]

From the fundamental inequality,

\[
D(\gamma^*) - D(\hat{\phi}_n) \le -(\nu_n(\phi^*) - \nu_n(\hat{\phi}_n))/\sqrt{n} + D(\gamma^*) - D(\phi^*)
\le Z_n\, n^{-(2-\alpha)/(2(2+\alpha))}\, n^{-1/2} + D(\gamma^*) - D(\phi^*)
= Z_n\, n^{-2/(2+\alpha)} + D(\gamma^*) - D(\phi^*). \qquad (17)
\]

Taking the expectation of (17) and applying Lemma 4.2 yields

\[
D(\gamma^*) - E\,D(\hat{\phi}_n) \le c_3\, n^{-2/(2+\alpha)} + D(\gamma^*) - D(\phi^*), \qquad (18)
\]

for ‖φ̂_n − φ*‖_{L_2} ≤ C^{−1/2} n^{−1/(2+α)}. The rate attained in (18) is a (strictly) faster rate of decay than that obtained for ‖φ̂_n − φ*‖_{L_2} > C^{−1/2} n^{−1/(2+α)} (n^{−2/(2+α)} < n^{−κ/(2(κ−β))} for κ > β, β = 1 − α/2, 0 < α < 2). Therefore, the decay of the total divergence loss is governed by the slower rate found in Case 1. □

Lemma 4.1 (van de Geer [11]). For all positive v, t, and ε, and κ > β,

\[
v\, t^{\beta/\kappa} \le \epsilon\, t + v^{\frac{\kappa}{\kappa-\beta}}\, \epsilon^{-\frac{\beta}{\kappa-\beta}}.
\]

Lemma 4.2. Let φ̂_n, φ*, and γ* be as defined above. Then under Conditions 1 and 2, we have

\[
E\left( \sup_{\phi \in \Phi,\ \|\phi - \phi^*\|_{L_2} > C^{-1/2} n^{-1/(2+\alpha)}} \frac{|\nu_n(\phi) - \nu_n(\phi^*)|}{\|\phi - \phi^*\|_{L_2}^{1-\alpha/2}} \right)^{\!r} \le c_2^{\,r},
\]

and

\[
E\left( \sup_{\phi \in \Phi,\ \|\phi - \phi^*\|_{L_2} \le C^{-1/2} n^{-1/(2+\alpha)}} \frac{|\nu_n(\phi) - \nu_n(\phi^*)|}{n^{-(2-\alpha)/(2(2+\alpha))}} \right) \le c_3,
\]

for some positive constants c_2, r, and c_3.

Proof (Sketch). The proof relies on a concentration inequality due to van de Geer [8, Lemma 5.13]. To apply it, Conditions 2 and 3 are needed. Details are to be included in a forthcoming paper.

5. REFERENCES

[1] J. Laska, S. Kirolos, M. Duarte, R. Baraniuk, and Y. Massoud, "Theory and implementation of an analog-to-information converter using random demodulation," Proc. IEEE Int. Symp. Circuits and Systems (ISCAS), pp. 1959-1962, May 2007.

[2] M. Mishali and Y. C. Eldar, "From theory to practice: Sub-Nyquist sampling of sparse wideband analog signals," [Online]. Available: arXiv:0902.4291v1 [cs.IT], 2009.

[3] H. V. Poor and J. B. Thomas, "Applications of Ali-Silvey distance measures in the design of generalized quantizers for binary decision systems," IEEE Trans. on Communications, vol. 25, no. 9, pp. 893-900, Sep. 1977.

[4] T. M. Cover and J. A. Thomas, Elements of Information Theory, Wiley, 1991.

[5] L. Devroye, L. Györfi, and G. Lugosi, A Probabilistic Theory of Pattern Recognition, Springer, New York, 1996.

[6] X. Nguyen, M. J. Wainwright, and M. I. Jordan, "Estimating divergence functionals and the likelihood ratio by convex risk minimization," Technical report, Department of Statistics, University of California, Berkeley, Sep. 2008.

[7] J. N. Tsitsiklis, "Extremal properties of likelihood-ratio quantizers," IEEE Trans. Comm., vol. 41, no. 4, pp. 550-558, Apr. 1993.

[8] S. van de Geer, Empirical Processes in M-Estimation, Cambridge University Press, 2000.

[9] E. Mammen and A. B. Tsybakov, "Smooth discrimination analysis," Ann. Statist., vol. 27, no. 6, pp. 1808-1829, 1999.

[10] A. B. Tsybakov, "Optimal aggregation of classifiers in statistical learning," Ann. Statist., vol. 32, no. 1, pp. 135-166, 2004.

[11] S. van de Geer, "Oracle inequalities and regularization," in Lectures on Empirical Processes: Theory and Statistical Applications, pp. 191-252, European Mathematical Society, 2007.
