Fast binary embeddings, and quantized compressed sensing with structured matrices

arXiv:1801.08639v1 [cs.IT] 26 Jan 2018

Thang Huynh

Rayan Saab

January 29, 2018

Abstract This paper deals with two related problems, namely distance-preserving binary embeddings and quantization for compressed sensing . First, we propose fast methods to replace points from a subset X ⊂ Rn , associated with the Euclidean metric, with points in the cube {±1}m and we associate the cube with a pseudo-metric that approximates Euclidean distance among points in X . Our methods rely on quantizing fast Johnson-Lindenstrauss embeddings based on bounded orthonormal systems and partial circulant ensembles, both of which admit fast transforms. Our quantization methods utilize noise-shaping, and include Sigma-Delta schemes and distributed noise-shaping schemes. The resulting approximation errors decay polynomially and exponentially fast in m, depending on the embedding method. This dramatically outperforms the current decay rates associated with binary embeddings and Hamming distances. Additionally, it is the first such binary embedding result that applies to fast Johnson-Lindenstrauss maps while preserving ℓ2 norms. Second, we again consider noise-shaping schemes, albeit this time to quantize compressed sensing measurements arising from bounded orthonormal ensembles and partial circulant matrices. We show that these methods yield a reconstruction error that again decays with the number of measurements (and bits), when using convex optimization for reconstruction. Specifically, for Sigma-Delta schemes, the error decays polynomially in the number of measurements, and it decays exponentially for distributed noise-shaping schemes based on beta encoding. These results are near optimal and the first of their kind dealing with bounded orthonormal systems.

1 Introduction

In signal processing and machine learning, given a linear map y = Ax, where x is an unknown signal in C^n and A ∈ C^{m×n}, one would often like to quantize or digitize y, i.e., map each entry y_i of y to a finite discrete set (alphabet), say A. On the one hand, viewing A as a measurement operator, quantization is the step that allows us to move from the analog world of continuous measurements to the digital world of bit-streams. It thereby allows computer processing of the signals, as well as their storage and transmission. On the other hand, one can view A as a linear embedding, taking some vectors x ∈ T ⊂ R^n to their image in R^m. For a vector x ∈ T, the quantized version of Ax can be viewed as a non-linear embedding of x ∈ R^n into A^m. It can therefore be used to expedite such tasks as information retrieval (say via nearest neighbor searches) provided the action of A and the quantization both admit fast computation, and provided "distance" computations in the embedding space can be done efficiently as well.

1.1 A brief introduction to compressed sensing

While in the remainder of the paper we address binary embeddings before we address quantization in the compressed sensing context, we find it more convenient here to first introduce compressed sensing. Suppose x ∈ C^n is an unknown vector which we wish to reconstruct from m linear measurements. Specifically, suppose we collect the measurements y_j = ⟨a_j, x⟩ + w_j, where each a_j, j = 1, ..., m, is a known measurement vector in C^n and w_j denotes unknown noise. When x is sparse (i.e., most of its coefficients are zero) or compressible (i.e., well approximated by sparse vectors), x can be recovered (or approximated) from the measurements, even when m ≪ n. Indeed, the study of recovery algorithms and reconstruction guarantees from such measurements is the purview of compressed sensing (CS), and one of the most popular reconstruction methods in this setting is ℓ1-minimization. Here, the vector x is approximated by x̂, the solution to

    min_{z ∈ C^n} ‖z‖_1   subject to   ‖Az − y‖_2 ≤ η.    (1.1)

Above, A is the m × n matrix with a_j as its j-th row, y = (y_j)_{j=1}^m is the vector of acquired measurements, and η is an upper bound on the norm of the noise vector w = (w_j)_{j=1}^m. Some of the earliest results on compressed sensing (e.g., [12, 25, 47, 3]) show that for a certain class of subgaussian random measurement matrices A, including those with i.i.d. standard Gaussian or ±1 Bernoulli entries, with high probability on the draw of the matrix and uniformly on x, the solution x̂ to (1.1) obeys

    ‖x − x̂‖_2 ≲ η + σ_k(x)_1 / √k,    (1.2)

provided m ≳ k log(n/k). Such guarantees are often based on the measurement matrix A satisfying the so-called Restricted Isometry Property (RIP), defined precisely in Section 3. The above results on subgaussian random matrices are mathematically important both for the techniques they introduce and as a proof of concept that compressed sensing is viable. Nevertheless, for physical reasons one may not have full design control over the measurement matrix in practice. Moreover, for practical reasons (including reconstruction speed), one may require that the measurement matrices admit fast multiplications. Such reasons motivated the study of structured random matrices in the compressed sensing context (see, e.g., [13, 66, 60, 57, 28]). Perhaps two of the most popular classes of structured random measurement matrices are those drawn from a bounded orthogonal ensemble (BOE) or from a partial circulant ensemble (PCE). Indeed, studying the interplay between structured random matrices and quantization is one focus of this paper. Herein, we are interested in the case when A is drawn from a bounded orthogonal ensemble, or from a partial circulant ensemble, both of which we now introduce.

Definition 1.1 (Bounded Orthogonal Ensemble (BOE)). Let U ∈ C^{n×n} be any matrix such that (1/√n)U is unitary and |U_ij| ≤ 1 for all entries U_ij of U. A matrix A ∈ C^{m×n} is drawn from the bounded orthogonal ensemble associated with U by picking each row a_i of the matrix A uniformly and independently from the set of all rows of U.


Examples of U include the n × n discrete Fourier matrix and the n × n Hadamard matrix. For physical reasons, BOEs arise naturally in various important applications of compressed sensing, including those involving Magnetic Resonance Imaging (MRI) (see, e.g., [32, 46, 49, 67]). An additional practical benefit of BOEs, e.g., those based on the Fourier or Hadamard transforms, is implementation speed. When viewed as linear operators, such matrices admit fast implementations with a number of additions and multiplications that scales like n log n. This is in contrast to the cost of standard matrix-vector multiplication, which scales like n^2. As such, BOEs have also appeared in various results involving fast Johnson-Lindenstrauss embeddings (e.g., [2, 1, 43]).

Definition 1.2 (Partial Circulant Ensemble (PCE)). For z ∈ C^n, the circulant matrix H_z ∈ C^{n×n} is given by its action H_z x = z ∗ x, where ∗ denotes (circular) convolution. Fix Ω ⊂ {1, 2, . . . , n} of size m arbitrarily. A matrix A is drawn from the partial circulant ensemble associated with Ω by choosing a vector σ uniformly at random on {−1, 1}^n (i.e., the entries σ_i of σ are independent ±1 Bernoulli random variables), and setting the rows of A to be the rows of H_σ indexed by Ω.

Partial circulant ensembles also appear naturally in various applications of compressed sensing, including those involving radar and wireless channel estimation, mainly due to the fact that the action of a circulant matrix on a vector is equivalent to its convolution with a row of the circulant matrix. We refer to [33, 58, 59, 27] for a discussion of these applications. As with BOEs, PCEs also admit fast transformations, as convolution is essentially diagonalized by the Fourier transform, which admits a fast implementation. As such, PCEs also admit an implementation with a number of additions and multiplications that scales with n log n as opposed to n^2. For this reason PCEs, like BOEs, feature prominently in constructions for fast Johnson-Lindenstrauss embeddings. An additional benefit in applications is that the memory required for storing a PCE scales like n (versus mn for an unstructured matrix).

Candès and Tao were the first to study structured random matrices drawn from a BOE in the compressed sensing context. In [13], they show that such matrices satisfy the restricted isometry property with high probability, provided that the number of measurements m ≳ k log^6(n). Later, many researchers (e.g., [60, 15, 10, 34]) improved the dependence of the number of measurements m on n, i.e., they improved the logarithmic factors. To our knowledge, the current best result is by Haviv and Regev [34] and requires only m ≳ k log^2(n). Similarly, many researchers have studied recovery guarantees for compressed sensing with random measurement matrices drawn from partial circulant ensembles (e.g., [33, 53, 40, 48]), eventually showing a similar dependence of m on k and n, i.e., m ≳ k log^4 n in [40].

Our own results will apply to both BOEs and PCEs (in both the compressed sensing and binary embedding contexts) with the added caveat of randomizing the signs of their rows. It is therefore useful to introduce the following construction.

Construction 1.3. An admissible distribution on m × n matrices corresponds to constructing Φ ∈ C^{m×n} as follows.
1. Draw A either from a bounded orthogonal ensemble associated with a matrix U as in Definition 1.1 or from a partial circulant ensemble associated with a set Ω of size m as in Definition 1.2. Let a_j be the j-th row of A.
2. Let ε_j be independent ±1 Bernoulli random variables which are also independent of A.
3. Let Φ be the m × n matrix whose j-th row is ε_j a_j.
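To make the sampling model concrete, the following sketch (ours, not the authors' code) draws Φ according to Construction 1.3 in NumPy/SciPy. It uses the Hadamard matrix as the BOE example (so n must be a power of 2) and a circular-shift construction for the PCE rows; all helper names are our own.

```python
# Minimal sketch of Construction 1.3 (our illustration, not the paper's code).
import numpy as np
from scipy.linalg import hadamard


def draw_boe_rows(n, m, rng):
    """Rows of A drawn i.i.d. uniformly from a bounded orthogonal system (here: Hadamard)."""
    U = hadamard(n).astype(float)         # entries +/-1; (1/sqrt(n)) U is orthogonal
    rows = rng.integers(0, n, size=m)     # sample row indices uniformly, with replacement
    return U[rows, :]


def draw_pce_rows(n, m, rng):
    """Rows of the circulant matrix H_sigma (sigma random +/-1) restricted to a set Omega."""
    sigma = rng.choice([-1.0, 1.0], size=n)
    omega = rng.choice(n, size=m, replace=False)        # an index set Omega of size m
    # row j of H_sigma realizes (sigma * x)_j, i.e., entry (j, i) = sigma[(j - i) mod n]
    return np.stack([np.roll(sigma[::-1], j + 1) for j in omega])


def admissible_phi(n, m, rng, ensemble="boe"):
    """Construction 1.3: draw A from a BOE or PCE and flip the sign of each row."""
    A = draw_boe_rows(n, m, rng) if ensemble == "boe" else draw_pce_rows(n, m, rng)
    eps = rng.choice([-1.0, 1.0], size=m)                # independent +/-1 Bernoulli signs
    return eps[:, None] * A


rng = np.random.default_rng(0)
Phi = admissible_phi(n=256, m=64, rng=rng, ensemble="boe")
print(Phi.shape)   # (64, 256)
```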

1.2 A brief introduction to quantization

Consider the linear measurements y = Ax of a signal x. Quantization is the process by which the measurement vector y is replaced by a vector of elements from a finite set, say A, known as the quantization alphabet. The finiteness of A allows its elements to be represented with finite binary strings, which in turn allows digital storage and processing. Indeed, such digital processing is necessary in the fields of binary embedding [38, 54, 55, 70] and compressed sensing (see, e.g., [20]), where careful selection of the alphabet A and the matrix A can yield faster, memory-efficient algorithms. In binary embedding one wishes to design a quantization map Q : C^m → A^m which approximately preserves the distance between two signals (see below). On the other hand, in compressed sensing one typically requires a reconstruction algorithm R : A^m → C^n such that, given the quantized measurements Q(Ax), the estimate x̂ = R(Q(Ax)) has a small reconstruction error ‖x − x̂‖_2. Various choices of quantization maps and reconstruction algorithms have been proposed in the context of binary embedding and compressed sensing. These have ranged from the most intuitive quantization approach, namely memoryless scalar quantization [38, 54, 23], to more sophisticated approaches based on noise-shaping quantization, including Σ∆ quantization [31, 41, 42] and distributed noise-shaping quantization [17, 35], as well as others [4, 37]. Indeed, the aforementioned noise-shaping quantization techniques combine computational simplicity with the ability to yield favorable error guarantees as a function of the number of measurements. We will use them to obtain our advertised results and thus provide a necessary, more detailed, overview in Section 4.

1.3 A brief introduction to binary embeddings

Low-distortion embeddings have recently played an important role in signal processing and machine learning, as they transform high-dimensional signals into low-dimensional ones while preserving geometric properties. The benefits of such embeddings are direct consequences of the dimensionality reduction, and they include the potential for reduced storage space and computational time associated with the (embedded) signals. Perhaps one of the most important embeddings is given by the Johnson-Lindenstrauss (JL) lemma [39], which shows that one can embed N points in R^n into an m = O(δ^{−2} log N)-dimensional space and simultaneously preserve pairwise Euclidean distances up to a multiplicative distortion δ. More recently, binary embeddings have also attracted a lot of attention in the signal processing and machine learning communities (e.g., [69, 56, 63, 45, 29, 71]), in part due to theoretical interest, and in part motivated by the potential benefits of further reductions in memory requirements and computational time. Roughly speaking, they are nonlinear embeddings that map high-dimensional signals into a discrete cube in a lower-dimensional space. As each signal is now represented by a short binary code, the memory needed for storing the entire data set is reduced considerably, as is the cost of computation. To be more precise, let T be a set of vectors in R^n and let {−1, 1}^m be the binary cube. Given a distance d_T on T, an α-binary embedding of T is a map f : T → {−1, 1}^m with an associated function d on {−1, 1}^m so that

    |d(f(x), f(x̃)) − d_T(x, x̃)| ≤ α,    for all x, x̃ ∈ T.

Recent works have shown that there exist such α-binary embeddings [38, 54, 55, 70]. For T a finite set of N points on the unit sphere S^{n−1}, endowed with the angular distance d_T(x, x̃) = arccos(‖x‖_2^{−1} ‖x̃‖_2^{−1} ⟨x, x̃⟩), these works consider the map

    f_A(x) = sign(Ax),    (1.3)

where A is an m × n matrix whose entries are independent standard Gaussian random variables. Then, with probability exceeding 1 − η, f_A is an α-binary embedding of T into {−1, 1}^m associated with the normalized Hamming distance d_H(q, q̃) = (1/2m) Σ_i |q_i − q̃_i|, provided that m ≳ α^{−2} log(N/η). While achieving the optimal bit complexity for Hamming distances as shown in [70], the above binary embeddings rely on A being Gaussian. In general, they suffer from the following drawbacks.

Drawback (i): To execute an embedding, one must incur a memory cost of order mn to store A, and must also incur the full computational cost of dense matrix-vector multiplication each time a point is embedded.

Drawback (ii): The embedding completely loses all magnitude information associated with the original points. That is, all points cx ∈ R^n with c > 0 embed to the same image. To address the first point above, researchers have tried to design other binary embeddings which can be implemented more efficiently. For example, such embeddings include circulant binary embedding [71, 52, 24], structured hashed projections [16], and binary embeddings based on Walsh-Hadamard matrices and partial Gaussian Toeplitz matrices with random column flips [70]. To our knowledge, none of these results addresses the second point above; they all use the sign function (1.3) (which is an instance of memoryless scalar quantization) to quantize the measurements. While simple, this quantization method cannot yield much better distance preservation when we have more measurements, i.e., when m increases, as discussed in Section 4.

Drawback (iii): At best, the approximation error associated with the embedding distance decays slowly as m, the embedding dimension (and number of bits), increases.

We resolve these issues in this paper.
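For concreteness, the following small sketch (ours, not from the paper) illustrates the baseline map (1.3): a Gaussian matrix A, one-bit quantization via the sign function, and a comparison of the normalized Hamming distance with the angular distance (scaled here by 1/π, our normalization choice, so that the two quantities are directly comparable). The toy dimensions and names are our own.

```python
# Baseline Gaussian sign embedding (1.3), compared via the normalized Hamming distance.
import numpy as np

rng = np.random.default_rng(1)
n, m, N = 128, 4096, 20

X = rng.standard_normal((N, n))
X /= np.linalg.norm(X, axis=1, keepdims=True)        # N points on the unit sphere S^{n-1}

A = rng.standard_normal((m, n))
F = np.sign(X @ A.T)                                 # binary codes f_A(x) in {-1, +1}^m

def hamming(q, qt):
    """Normalized Hamming distance d_H(q, q~) = (1/2m) * sum_i |q_i - q~_i|."""
    return np.abs(q - qt).sum() / (2 * m)

def angular(x, xt):
    """Angular distance arccos(<x, x~>), scaled by 1/pi so it lies in [0, 1]."""
    return np.arccos(np.clip(x @ xt, -1.0, 1.0)) / np.pi

errs = [abs(hamming(F[i], F[j]) - angular(X[i], X[j]))
        for i in range(N) for j in range(i + 1, N)]
print("max deviation:", max(errs))   # small for large m; the theory predicts roughly sqrt(log N / m)
```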

1.4 Notation

A vector is k-sparse if it has at most k nonzero entries, i.e., it belongs to the set Σ^n_k := {z ∈ C^n : |supp(z)| ≤ k}. Roughly speaking, a compressible vector x is one that is close to a k-sparse vector. More precisely, we can use the best k-term approximation error of x in ℓ1, denoted by σ_k(x)_1 := min_{v ∈ Σ^n_k} ‖x − v‖_1, to measure how close x is to being k-sparse. We use the notation B_2^n to refer to the unit Euclidean ball in R^n. In what follows, A ≲ B means that there is a universal constant c such that A ≤ cB, and ≳ is defined analogously. We use the notation E_g to indicate expectation taken with respect to the random variable g (i.e., conditional on all other random variables). The operator norm of a matrix A, ‖A‖_{p→q}, is defined as ‖A‖_{p→q} := max_{‖x‖_p = 1} ‖Ax‖_q, though we may simply write ‖A‖ in place of ‖A‖_{2→2}.

1.5 Roadmap

The rest of the paper is organized as follows. In Section 2 we highlight the main contributions of the paper, and in Section 3 we review some necessary technical definitions and tools from compressed sensing, embedding theory, and probability. Section 4 is dedicated to a review of quantization theory, and in particular the noise-shaping techniques that are instrumental to our results. Sections 5 and 6 are dedicated to our main theorems and proofs on binary embeddings and quantized compressed sensing. Section 7 provides a proof of a critical technical result, namely that certain matrices associated with our methods satisfy a restricted isometry property. Finally, Section 8 provides proofs of some technical lemmas that we require along the way.

Algorithm 1 Fast binary embeddings
Require: (i) a matrix Φ ∈ R^{m×n} drawn according to Construction 1.3, (ii) a stable noise-shaping quantization scheme Q (Σ∆, as in Section 4.4, or "distributed", as in Section 4.5) with alphabet A = {±1} and associated matrix Ṽ ∈ R^{p×m} as in Definition 4.3 or 4.6, (iii) a diagonal matrix D_ε ∈ R^{n×n} with random signs, (iv) any points a, b ∈ T ⊂ R^n with ‖a‖_2 ≤ 1, ‖b‖_2 ≤ 1.
Ensure: d_Ṽ(a, b) ≈ ‖a − b‖_2.
1: Compute f(a) = Q(ΦD_ε a / (6 log m)) ∈ {±1}^m and f(b) = Q(ΦD_ε b / (6 log m)) ∈ {±1}^m.
2: Return d_Ṽ(a, b) = ‖Ṽ(f(a) − f(b))‖_2.
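The sketch below (our own simplification, not the authors' implementation) walks through Algorithm 1 end to end for the distributed noise-shaping scheme of Section 4.5 with the one-bit alphabet. For brevity it uses a toy Gaussian matrix in place of Construction 1.3 (cf. the construction sketch above) and omits explicit stability checks; all names and toy dimensions are ours.

```python
# End-to-end toy run of Algorithm 1 with distributed noise shaping (our sketch).
import numpy as np

def beta_quantize(y, beta, lam):
    """Greedy recursion (4.5)-(4.6) for H = I_p (x) H_beta and the 1-bit alphabet {-1, +1}."""
    q, u = np.zeros_like(y), np.zeros_like(y)
    for s in range(len(y)):
        w = y[s] + (beta * u[s - 1] if s % lam else 0.0)   # feedback only within each block
        q[s] = 1.0 if w >= 0 else -1.0
        u[s] = w - q[s]
    return q

def embed(z, Phi, beta, lam):
    """f(z) = Q(Phi z / (6 log m)), step 1 of Algorithm 1 (z already multiplied by D_eps)."""
    m = Phi.shape[0]
    return beta_quantize(Phi @ z / (6.0 * np.log(m)), beta, lam)

def d_V(qa, qb, beta, lam, p, m):
    """d_V~(qa, qb) = ||V~ (qa - qb)||_2 with v_beta as in Definition 4.6."""
    v = beta ** (-np.arange(1, lam + 1))
    V_tilde = (6.0 * np.log(m) / (np.linalg.norm(v) * np.sqrt(p))) * np.kron(np.eye(p), v)
    return np.linalg.norm(V_tilde @ (qa - qb))

rng = np.random.default_rng(2)
n, p, lam, beta = 64, 256, 64, 10.0 / 9.0
m = p * lam
Phi = rng.standard_normal((m, n))                      # toy Gaussian stand-in for Construction 1.3
D_eps = rng.choice([-1.0, 1.0], size=n)                # diagonal of D_eps
a, b = rng.standard_normal(n), rng.standard_normal(n)
a, b = a / (2 * np.linalg.norm(a)), b / (2 * np.linalg.norm(b))   # two points in the unit ball
fa, fb = embed(D_eps * a, Phi, beta, lam), embed(D_eps * b, Phi, beta, lam)
print(d_V(fa, fb, beta, lam, p, m), np.linalg.norm(a - b))        # the two numbers should be close
```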


2 Contributions

Below, we describe our main results as they pertain to binary embeddings and quantized compressed sensing, and we summarize the main technical contribution that allows us to obtain them. To that end, let D_ε be an n × n diagonal matrix with random signs, let Φ be an m × n matrix as in Construction 1.3, and let Q be a noise-shaping quantizer, i.e., either Q_Σ∆, a Σ∆ quantizer, or Q_β, a distributed noise-shaping quantizer (as in Sections 4.4 and 4.5). Let V̂ be the associated linear operator defined in (2.2) below, and let Ṽ = 6(log m)V̂.

2.1 Contributions related to binary embeddings

Let T ⊂ R^n be a subset of the unit ball. In Algorithm 1 we construct quasi-isometric (i.e., distance preserving) embeddings between (T, ‖ · ‖_2) and the binary cube {−1, +1}^m endowed with the pseudometric

    d_Ṽ(q̃, q) := ‖Ṽ(q̃ − q)‖_2.

As they rely on Construction 1.3, our embeddings f : T → {±1}^m support fast computation, and despite their highly quantized non-linear nature, they perform as well as linear Johnson-Lindenstrauss methods up to an additive error that decays exponentially (or polynomially) in m.

When T is finite we show that, with high probability and for prescribed distortion α,

    m ≳ log(|T|) log^4 n / α^2   ⟹   |d_Ṽ(f(x), f(x̃)) − ‖x − x̃‖_2| ≤ α‖x − x̃‖_2 + cη(m),

where η(m) → 0 as m → ∞.


Algorithm 2 Fast Quantized Compressed Sensing
Require: (i) a matrix Φ ∈ C^{m×n} drawn according to Construction 1.3, (ii) a noise-shaping quantization scheme Q (as in Section 4.4 or 4.5) with associated matrix V̂ as in (2.2), quantization alphabet A, and appropriate error parameter η, (iii) any point x ∈ Σ^n_k with ‖x‖_2 ≤ 1.
Ensure: Accurate reconstruction from quantized compressed sensing measurements.
1: Quantize the compressed sensing measurements: q = Q(Φx).
2: Reconstruct an approximation

    x̂ := argmin_z ‖z‖_1   subject to   ‖V̂Φz − V̂q‖_2 ≤ η.    (2.1)

Above, η(m) decays polynomially fast in m (when Q is a Σ∆ quantizer), or exponentially fast (when Q is a distributed noise-shaping quantizer). In fact, we prove a more general version of the above result, which applies to infinite sets T. When T is arbitrary (possibly infinite) we show that, with high probability and for prescribed distortion α,

    m ≳ (log^4 n / α^2) · (ω(T)^2 / rad(T)^2)   ⟹   |d_Ṽ(f(x), f(x̃)) − ‖x − x̃‖_2| ≤ max(√α, α) rad(T) + cη(m),

where η(m) is as before and where ω(T) and rad(T) denote the Gaussian width and Euclidean radius of T, defined in Section 3. We remark that our results are even more general in the sense that we can embed into any finite set A^m, where A is the quantization alphabet used by the noise-shaping quantization scheme (see Section 4.4 for details). In particular, this means that one may choose an alphabet A that contains more than two elements and obtain improved constants in the additive part (i.e., cη(m)) of our error bounds. Thus, in effect, our results amount to fast quantized Johnson-Lindenstrauss embeddings.

2.2 Contributions related to the quantization of compressed sensing measurements arising from random structured matrices

We provide the first non-trivial quantization results, with optimal (linear) dependence on sparsity, in the setting of compressed sensing with bounded orthonormal systems. Our results also apply to partial circulant ensembles, and our approach is summarized in Algorithm 2. We show that noise-shaping quantization approaches, namely Σ∆ quantization (e.g., [31, 41, 20]) and distributed noise-shaping quantization [17], significantly outperform the naive (but commonly used) quantization approach, memoryless scalar quantization. They yield polynomial and exponential error decay, respectively, as a function of the number of measurements. In contrast, the error decay of memoryless scalar quantization can be no better than linear. Noise-shaping quantization techniques produce vectors q whose entries belong to a finite set. Equally importantly, q is related to the vector of measurements y = Ax via a noise-shaping relation of the form

    y − q = Hu,

where H is a so-called noise-shaping operator and u is a vector with bounded entries. Recalling that in compressed sensing one wishes to recover the (approximately) sparse vector x, our reconstruction approach is to solve the optimization problem

    min_z ‖z‖_1   subject to   ‖V̂Φz − V̂q‖_2 ≤ η.

Above, we choose the matrix V̂ and we set the scalar η depending on the quantization scheme. As a result we can show that

    m ≳ k log^4 n   ⟹   ‖x − x̂‖_2 ≲ η(m) + σ_k(x)_1 / √k

with high probability, for all x ∈ R^n. Above, the estimate x̂ is produced by Algorithm 2 and, as before, η(m) decays polynomially fast in m (when Q is a Σ∆ quantizer), or exponentially fast (when Q is a distributed noise-shaping quantizer).
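As an illustration of the decoding step (2.1), a minimal CVXPY sketch is given below; it is our own schematic, not the authors' code, and it assumes Φ, V̂, q, and η are supplied by the measurement and quantization pipeline described in Section 4.

```python
# Schematic of the decoder (2.1) in Algorithm 2 (our sketch).
import cvxpy as cp
import numpy as np

def decode(Phi, V_hat, q, eta):
    """Solve  min ||z||_1  subject to  ||V_hat Phi z - V_hat q||_2 <= eta."""
    n = Phi.shape[1]
    z = cp.Variable(n)
    constraints = [cp.norm(V_hat @ Phi @ z - V_hat @ q, 2) <= eta]
    prob = cp.Problem(cp.Minimize(cp.norm1(z)), constraints)
    prob.solve()
    return z.value
```

Per the guarantees of Section 6, η would be chosen on the order of λ^{−r+1/2}δ for Σ∆ quantization and on the order of β^{−λ+1}δ for distributed noise shaping.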

2.3 Technical considerations

The choice of V̂ is critical to the success of our reconstruction, as we require V̂Φ to satisfy a restricted isometry property [12, 13]; see Section 3. Once V̂Φ satisfies this property, the reconstruction step of Algorithm 2 yields an estimate x̂ with ‖x̂ − x‖_2 ≤ Cη by standard compressed sensing results (see, e.g., [28]). Thus, the main technical challenge is to construct V̂ in such a way that V̂Φ satisfies the restricted isometry property. This is non-trivial due to the dependencies implicit in the random model which generates bounded orthonormal systems and partial circulant matrices. Simultaneously, the scalar η that we choose must be small enough to yield the desired polynomial or exponential decay rates in the number of measurements m. The choice of V̂ that achieves our goals is

    V̂ = (1 / (‖v‖_2 √p)) (I_p ⊗ v),    (2.2)

where v = (v_1, . . . , v_λ) ∈ R^λ depends on the quantization scheme (v is given explicitly in Sections 4.4 and 4.5), I_p is the p × p identity matrix, and λp = m. Here, ⊗ denotes the Kronecker product. We show that, with high probability, for matrices Φ drawn according to Construction 1.3,

    m ≳ k log^4 n / α^2   ⟹   sup_{x ∈ T_k} |‖V̂Φx‖_2^2 − ‖x‖_2^2| ≤ α,

where T_k denotes the set of k-sparse signals on the unit sphere. That is, we show that V̂Φ has the restricted isometry property. This restricted isometry property is also crucial for our results on binary embeddings.
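A two-line NumPy illustration (ours) of how (2.2) is formed via the Kronecker product; the example vector v corresponds to a second-order Σ∆-type choice with λ̃ = 2 (see Example 4.4 below).

```python
# Forming V_hat = (1 / (||v||_2 sqrt(p))) (I_p kron v), as in (2.2)  (our illustration).
import numpy as np

def condensation(v, p):
    v = np.asarray(v, dtype=float)
    return np.kron(np.eye(p), v) / (np.linalg.norm(v) * np.sqrt(p))

V_hat = condensation([1.0, 2.0, 1.0], p=4)   # lambda = 3, so m = 12
print(V_hat.shape)                           # (4, 12): a p x m block-diagonal row operator
```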

3 Definitions and Tools

3.1 The restricted isometry property and its implications

We now recall the restricted isometry property of matrices along with some of its implications to compressed sensing, as well as theorems showing that BOEs and PCEs satisfy it with high probability if m & k polylog n.


Definition 3.1 (The restricted isometry property (RIP)). We say that a matrix A ∈ C^{m×n} satisfies the (δ_k, k)-RIP if for every x ∈ Σ^n_k we have

    (1 − δ_k)‖x‖_2^2 ≤ ‖Ax‖_2^2 ≤ (1 + δ_k)‖x‖_2^2.

It is by now well known that, if a matrix A satisfies the (δ_k, k)-RIP with an appropriately small constant δ_k, then ℓ1-minimization is robust. We present one such result as a representative.

Theorem 3.2 (See Theorem 6.12 in [28]). Suppose that the matrix A ∈ C^{m×n} satisfies the (δ_{2k}, 2k)-RIP with δ_{2k}(A) < 4/√41. Then for any x ∈ C^n and b ∈ C^m with ‖Ax − b‖_2 ≤ η, a solution x^# of

    min_{z ∈ C^n} ‖z‖_1   subject to   ‖Az − b‖_2 ≤ η

approximates the vector x with

    ‖x − x^#‖_2 ≤ Cη + (D/√k) σ_k(x)_1,

where the constants C, D > 0 depend only on δ_{2k}.

Indeed, several ensembles of random matrices satisfy the RIP, and in this paper we are particularly interested in bounded orthonormal ensembles and partial circulant matrices, for which we recall the following results.

Theorem 3.3 (See, e.g., [60, 28]). Let A ∈ C^{m×n} be drawn from the bounded orthogonal ensemble. Fix α ∈ (0, 1). Then there exists a universal constant C > 0 such that

    E sup_{x ∈ T_k} |(1/m)‖Ax‖_2^2 − ‖x‖_2^2| ≤ α,    (3.1)

provided

    m / log(9m) ≥ C k log^2(4k) log(8n) / α^2.

Theorem 3.4 (See, e.g., [40]). Let A ∈ R^{m×n} be drawn from the partial circulant ensemble associated with an index set Ω of size m. Fix α ∈ (0, 1). Then there exists a universal constant C > 0 such that

    E sup_{x ∈ T_k} |(1/m)‖Ax‖_2^2 − ‖x‖_2^2| ≤ α,    (3.2)

provided

    m ≥ C k log^2(k) log^2(n) / α^2.

3.2 Embedding tools

We begin with some useful definitions of quasi-isometric embeddings and of Gaussian widths of sets.

Definition 3.5 (see, e.g., [11]). A function f : X → Y is called a quasi-isometry between metric spaces (X, d_X) and (Y, d_Y) if there exist C > 0 and D ≥ 0 such that

    (1/C) d_X(x, x̃) − D ≤ d_Y(f(x), f(x̃)) ≤ C d_X(x, x̃) + D

for all x, x̃ ∈ X, and E > 0 such that d_Y(y, f(x)) < E for all y ∈ Y.

Remark 1. We will relax the above definition by only requiring that d_Y be a pseudo-metric: we require d_Y(x, y) = d_Y(y, x) and we require d_Y to respect the triangle inequality, but we allow d_Y(x, y) = 0 for x ≠ y.

Definition 3.6. Let C = {−1, 1}^m be the discrete cube in R^m and let V ∈ R^{p×m} with p ≤ m. We define a pseudo-metric d_V on C by d_V(q, q̃) = ‖V(q − q̃)‖_2.

Definition 3.7. For a set T ⊂ R^n, the Gaussian mean width ω(T) is defined by

    ω(T) := E[sup_{v ∈ T} ⟨v, g⟩],

where g is a standard Gaussian random vector in R^n. Moreover,

    rad(T) = sup_{v ∈ T} ‖v‖_2

is the maximum Euclidean norm of a vector in T.

Krahmer and Ward [43] showed that matrices satisfying the RIP can be used to create Johnson-Lindenstrauss, i.e., near-isometric, embeddings of finite sets.

Theorem 3.8 ([43]). Fix ν > 0 and α ∈ (0, 1). Consider a finite set T ⊂ R^n. Set k ≥ 40(log(40|T|) + ν), and suppose that Ψ ∈ R^{p×n} satisfies the (α/4, k)-RIP. Let D_ε ∈ R^{n×n} be a random sign diagonal matrix. Then, with probability exceeding 1 − e^{−ν}, the matrix ΨD_ε satisfies

    |‖ΨD_ε x‖_2^2 − ‖x‖_2^2| ≤ α‖x‖_2^2

simultaneously for all x ∈ T.

We will need the following "multiresolution" version of the restricted isometry property, used by Oymak et al. in [51] to prove embedding results for arbitrary, i.e., not necessarily finite, sets.

Definition 3.9 (Multiresolution RIP [51]). Let L = ⌈log_2 n⌉. Given δ > 0 and a number s ≥ 1, for ℓ = 0, 1, . . . , L, let (δ_ℓ, s_ℓ) = (2^{ℓ/2}δ, 2^ℓ s) be a sequence of distortion and sparsity levels. We say a matrix A ∈ R^{m×n} satisfies the Multiresolution Restricted Isometry Property (MRIP) with distortion δ > 0 at sparsity s if, for all ℓ ∈ {1, 2, . . . , L}, the (δ_ℓ, s_ℓ)-RIP holds.

Theorem 3.10 ([51]). Let T ⊂ R^n and suppose the matrix Ψ ∈ R^{m×n} obeys the Multiresolution RIP with sparsity and distortion levels

    s = 150(1 + η)   and   α̃ = α · rad(T) / (C max{rad(T), ω(T)}),

with C > 0 an absolute constant. Then, for a diagonal matrix D_ε with an i.i.d. random sign pattern on the diagonal, the matrix A = ΨD_ε obeys

    sup_{x ∈ T} |‖Ax‖_2^2 − ‖x‖_2^2| ≤ max{α, α^2}(rad(T))^2,

with probability at least 1 − e^{−η}.


3.3 Tools from probability

We start with a definition of the γ2 functional, followed by a result of Krahmer et al. [40] that will be very useful in proving our technical results.

Definition 3.11 ([65], also see [40]). For a metric space (T, d), an admissible sequence of T is a collection of subsets of T, {T_r : r ≥ 0}, such that |T_0| = 1 and |T_r| ≤ 2^{2^r} for every r ≥ 1. For β ≥ 1, the γ_β functional is given by the following infimum, taken over all admissible sequences of T:

    γ_β(T, d) = inf sup_{t ∈ T} Σ_{r=0}^∞ 2^{r/β} d(t, T_r).

Theorem 3.12 (See, e.g., [40]). Let M ⊂ C^{p×n} be a symmetric set of matrices, i.e., M = −M. Let ε be a ±1 Bernoulli vector of length n. Then

    E sup_{M ∈ M} |‖Mε‖_2^2 − E‖Mε‖_2^2| ≲ d_F(M) γ_2(M, ‖ · ‖_{2→2}) + γ_2(M, ‖ · ‖_{2→2})^2,

where d_F(M) = sup_{M ∈ M} ‖M‖_F and γ_2(·) is as in Definition 3.11.

Next we present a version of Bernstein's inequality that we will use in proving that our matrix V̂Φ satisfies an appropriate RIP with high probability.

Theorem 3.13 (Theorem 8.42 in [28]). Let F be a countable set of functions F : C^n → R. Let Y_1, . . . , Y_M be independent random vectors in C^n such that EF(Y_ℓ) = 0 and F(Y_ℓ) ≤ K almost surely for all ℓ ∈ [M] and all F ∈ F, with some constant K > 0. Introduce

    Z = sup_{F ∈ F} Σ_{ℓ=1}^M F(Y_ℓ).

Let σ_ℓ^2 > 0 be such that E[F(Y_ℓ)^2] ≤ σ_ℓ^2 for all F ∈ F and ℓ ∈ [M]. Then, for all t > 0,

    P(Z ≥ EZ + t) ≤ exp( −(t^2/2) / (σ^2 + 2K EZ + tK/3) ),

where σ^2 = Σ_{ℓ=1}^M σ_ℓ^2.

4 Quantization: Background and preliminaries

As previously mentioned, various choices of quantization maps and reconstruction algorithms have been proposed in the context of binary embedding and compressed sensing. Below, we first review the lower bounds associated with optimal (albeit highly impractical) quantization and decoding, which no scheme can outperform. We then review the basics of memoryless scalar quantization as well as noise shaping quantization, as we will need these for our results.

4.1 Optimal error bounds via vector quantization

For a class X of signals (e.g., X = Σ^n_k), the performance of a practical data acquisition scheme consisting of three stages (i.e., measurement, quantization, and reconstruction) can be measured against the optimal error bounds. Specifically, the performance of any such scheme can be measured by the tradeoff between the number of bits (rate) and the reconstruction error (distortion), and we now describe how to examine the optimal rate-distortion relationship. Given a desired worst-case approximation error ε, measured in the ℓ2 norm, we can cover the class X with balls of radius ε associated with the Euclidean norm ‖ · ‖_2. The smallest number of such ε-balls is called the covering number of X and denoted by N(X, ‖ · ‖_2, ε). An optimal scheme consists of encoding each signal x in the class X using R := log_2 N(X, ‖ · ‖_2, ε) bits by mapping x to the center of an ε-ball in which it lies. A simple volume argument yields the optimal lower bound on the approximation error in terms of the rate (see, e.g., [8, 9]) which, when X = Σ^n_k, satisfies

    ε(R) ≳ (n/k) 2^{−R/k}.

On the other hand, the distortion of a three-stage scheme comprised of measurements using a matrix A, quantization via the map Q, and reconstruction using an algorithm R is defined by

    D(A, Q, R) := sup_{x ∈ X} ‖x − R(Q(Ax))‖_2.

Thus, trivially,

    D(A, Q, R) ≳ (n/k) 2^{−R/k},

and one seeks practical¹ schemes that approach this lower bound. In the case of Gaussian (and subgaussian) matrices A, there exist schemes [5, 17, 61] that approach the optimal error bounds, albeit with various tradeoffs. Among these, the results in [5] apply only to Gaussian random matrices A, while those of [17] and [61] rely on noise-shaping quantization techniques and apply to a more general class of subgaussians. In the quantization direction, our work extends the noise-shaping results of [17] and [61] to the case of structured random matrices which are selected according to Construction 1.3. Before describing the necessary technical details pertaining to noise shaping, we quickly describe the most basic (but highly suboptimal) approach to quantization, i.e., memoryless scalar quantization, as the most widely assumed approaches to binary embeddings and quantization of compressed sensing measurements use it.

¹ Note that the optimal scheme described herein is impractical, as it requires direct measurements of x, which may not be available, and it requires finding the ball to which x belongs. As the number of such balls scales exponentially with the ambient dimension, this task becomes prohibitively expensive.

4.2 Memoryless Scalar Quantization

The most intuitive approach to quantizing linear measurements y = Ax is so-called memoryless scalar quantization (MSQ), in which the quantization map Q_MSQ is defined by

    q_i := (Q_MSQ(y))_i := argmin_{r ∈ A} |y_i − r|.

In other words, we round each measurement y_i to the nearest element in the quantization alphabet A. For instance, in the case of binary embedding, i.e., A = {−1, 1}, (Q_MSQ(y))_i = sign(y_i). Here sign(t) = 1 if t ≥ 0 and −1 otherwise. If the quantization alphabet A = δZ for some step size δ > 0 is to be used, we can bound ‖y − q‖_2 ≤ (δ/2)√m. Thus, in the setting of binary embedding, MSQ defines a quasi-isometric embedding

    (1 − α)‖v − w‖_2 − δ ≤ (1/√m)‖Q(Av) − Q(Aw)‖_2 ≤ (1 + α)‖v − w‖_2 + δ    (4.1)

with high probability if A is a Gaussian random matrix [36]. On the other hand, if we consider the quantization problem in compressed sensing, the robust recovery result (1.2) guarantees that

    ‖x − x̂_MSQ‖_2 ≲ δ    (4.2)

when x is a k-sparse vector. In spite of its simplicity, MSQ is not optimal when one oversamples, i.e., when one has more measurements than needed. In particular, the error bounds (4.1) and (4.2) do not improve as the number of measurements m increases. This is due to the fact that MSQ ignores correlations between measurements, as it acts component-wise. While one can in principle still get a reduction in the error by using a finer quantization alphabet (i.e., a smaller δ), this is often not feasible in practice because the quantization alphabet is fixed once the hardware is built, or not desirable as one may prefer a simple, e.g., 1-bit, embedding. In order to address this problem, noise-shaping quantization methods, such as Sigma-Delta (Σ∆) modulation and alternative decoding methods (see, e.g., [21, 30, 6, 31, 41, 17, 26, 68, 27, 20, 9, 35]), have been proposed in the settings of quantization of bandlimited functions, finite frame expansions, and compressed sensing measurements. However, these methods have not been studied in the framework of binary embedding problems. As stated before, the main contributions of our work are to extend noise-shaping approaches to embedding problems and to compressed sensing with structured random matrices.
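For reference, here is a small sketch (ours) of the MSQ map for a symmetric multilevel alphabet with step size δ, and for the one-bit case A = {−1, 1} used in (1.3); the function names are our own.

```python
# Memoryless scalar quantization onto {(-2L+1)delta, ..., -delta, delta, ..., (2L-1)delta}.
import numpy as np

def msq(y, delta, L):
    """Round each entry of y to the nearest element of the 2L-level symmetric alphabet."""
    levels = delta * np.arange(-2 * L + 1, 2 * L, 2)    # odd multiples of delta
    idx = np.argmin(np.abs(y[:, None] - levels[None, :]), axis=1)
    return levels[idx]

def msq_1bit(y):
    """The sign quantizer of (1.3): the special case A = {-1, 1}."""
    return np.where(y >= 0, 1.0, -1.0)

y = np.array([0.07, -0.41, 1.30])
print(msq(y, delta=0.5, L=2))   # [ 0.5 -0.5  1.5]
```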

4.3 Noise-shaping quantization methods

Noise-shaping quantizers, for example Σ∆ modulation, were first proposed for analog-to-digital conversion of bandlimited functions (see, e.g., [21, 30, 64]). Their success is essentially due to the fact that they push the quantization error toward the nullspace of their associated reconstruction operators. These methods have been successfully extended to the frameworks of finite frames (see, e.g., the survey [20] and [6, 7, 44, 31, 41, 42]) and compressed sensing (see, e.g., [31, 41, 17, 26, 35, 68, 27, 20, 9]). In fact, the approaches based on Σ∆ quantization [62] and beta encoders [17] achieve near-optimal bounds for sub-Gaussian measurements. To explain these methods, consider a real quantization alphabet A_{L,δ} consisting of 2L symmetric levels of spacing 2δ, i.e.,

    A_{L,δ} := {aδ | a = −2L + 1, . . . , −1, 1, . . . , 2L − 1},    (4.3)

and let A := A_{L,δ} + iA_{L,δ} be a complex quantization alphabet.

Remark 2. We mention the complex case here for the sake of completeness, and remark that all our results apply in the complex setting; after this discussion, however, we restrict our attention to real-valued matrices and vectors.

A noise-shaping quantizer Q : C^m → A^m, associated with the so-called noise transfer operator H, is defined such that for each y ∈ C^m the resulting quantization q := Q(y) satisfies the noise-shaping relation

    y − q = Hu,    (4.4)

where u ∈ C^m and ‖u‖_∞ ≤ C for some constant C independent of m. Here, H is an m × m lower triangular Toeplitz matrix with unit diagonal. Noise-shaping quantizers do not exist unconditionally because of the requirement that ‖u‖_∞ be uniformly bounded in m. However,

under certain suitable assumptions on H and A, they exist and can be simply implemented via recursive algorithms [19]. For example, the following lemma provides conditions under which the schemes are stable.

Lemma 4.1. Let A := A_{L,δ} + iA_{L,δ}. Assume that H = I − H̃, where H̃ is strictly lower triangular, and that µ ≥ 0 is such that

    ‖H̃‖_{∞→∞} + µ/δ ≤ 2L.

Suppose that |y_j|_* := max{|ℜ(y_j)|, |ℑ(y_j)|} ≤ µ for all j ∈ [m]. For each s ≥ 1, let

    w_s := y_s + Σ_{j=1}^{s−1} H̃_{s,s−j} u_{s−j},   q_s := (Q(y))_s = argmin_{r ∈ A} |w_s − r|_*,    (4.5)

and

    u_s := w_s − q_s.    (4.6)

Then the resulting q satisfies the noise-shaping relation (4.4) with ‖u‖_∞ ≤ √2 δ. In the case of strictly real y, the √2 factor can be replaced by 1.

The proof of the above is simply by induction and can be found, for example, in [19]. Our algorithms are inspired by the approach to reconstruction from noise-shaping quantized measurements used in [17] and [61] (see also [20]). We first introduce a so-called condensation operator V : C^m → C^p, for some p ∈ [m], defined by

    V := I_p ⊗ v,    (4.7)

where v is a row vector in R^λ, so that V is block diagonal with p copies of v on its diagonal. Here, for simplicity, we suppose that λ = m/p is an integer. In this approach, to embed or recover the signal x, we first apply V to the noise-shaping relation (4.4) and obtain

    VΦx − Vq = VHu.    (4.8)

Our goal is to design a p × m matrix V and an m × m matrix H such that the norm ‖VHu‖_2 is exponentially (or polynomially, depending on the quantization scheme) small in m and our quantization algorithm is stable. This in turn helps us achieve exponentially and polynomially small errors in the settings of binary embedding and compressed sensing. We now present two candidates for the pair (V, H), based on Σ∆ quantization and distributed noise shaping.
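The recursion (4.5)-(4.6) is easy to implement; the sketch below (ours, real-valued only) runs it for an arbitrary strictly lower triangular H̃ and verifies the noise-shaping relation y − q = (I − H̃)u numerically, using the first-order Σ∆ choice H̃ = I − D as the test case. Stability (Lemma 4.1's condition on ‖H̃‖_{∞→∞}, µ, δ, L) is assumed rather than checked.

```python
# Greedy noise-shaping recursion (4.5)-(4.6) and a numerical check of (4.4)  (our sketch).
import numpy as np

def noise_shaping_quantize(y, H_tilde, levels):
    """Return q and u with y - q = (I - H_tilde) u."""
    m = len(y)
    q, u = np.zeros(m), np.zeros(m)
    for s in range(m):
        w = y[s] + H_tilde[s, :s] @ u[:s]             # w_s = y_s + sum_{j<s} H~_{s,j} u_j
        q[s] = levels[np.argmin(np.abs(levels - w))]  # nearest element of the alphabet
        u[s] = w - q[s]
    return q, u

rng = np.random.default_rng(3)
m = 24
H_tilde = np.zeros((m, m))
H_tilde[np.arange(1, m), np.arange(m - 1)] = 1.0      # H~ = I - D, i.e., first-order Sigma-Delta
y = 0.3 * rng.uniform(-1.0, 1.0, size=m)              # ||y||_inf <= 0.3 keeps the scheme stable
q, u = noise_shaping_quantize(y, H_tilde, levels=np.array([-1.0, 1.0]))
print(np.allclose(y - q, (np.eye(m) - H_tilde) @ u))  # True: the relation (4.4) holds exactly
```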

4.4 Sigma-Delta quantization

The standard r-th order Σ∆ quantizer with input y computes a uniformly bounded solution u to the difference equation

    y − q = D^r u.    (4.9)

This can be achieved recursively by choosing q := Q_Σ∆(y) and u as in the expressions (4.5) and (4.6), respectively, for the matrix H = D^r. Here the matrix D is the m × m first-order difference matrix with entries given by

    D_ij = 1 if i = j,  −1 if i = j + 1,  and 0 otherwise.
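The following quick check (ours) builds D and H = D^r numerically; it also evaluates ‖I − D^r‖_{∞→∞} = 2^r − 1, the quantity that the stability condition of Lemma 4.1 must accommodate when H = D^r.

```python
# The first-order difference matrix D and the r-th order operator H = D^r  (our check).
import numpy as np

def difference_matrix(m):
    D = np.eye(m)
    D[np.arange(1, m), np.arange(m - 1)] = -1.0
    return D

m, r = 12, 3
D = difference_matrix(m)
H = np.linalg.matrix_power(D, r)
H_tilde = np.eye(m) - H
print(np.max(np.abs(H_tilde).sum(axis=1)))   # ||I - D^r||_{inf->inf} = 2**r - 1 = 7
```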

Constructing stable Σ∆ schemes of arbitrary order r, i.e., schemes for which ‖y‖_∞ ≤ µ ⟹ ‖u‖_∞ ≤ c(r) with a constant c(r) that is independent of the dimensions (but dependent on the order), is non-trivial for a fixed quantization alphabet. Nevertheless, there exist several constructions (see [21, 30, 22]) with exactly this boundedness property, albeit with different constants c(r). For our results herein, when we assume a fixed alphabet, we use the stable Σ∆ schemes of [22]. The relevant result for us is the following proposition from [41], summarizing the results of [22]².

Proposition 4.2 ([22], see [41] Proposition 1). There exists a universal constant C > 0 such that for any alphabet A = {±(2ℓ − 1)δ : 1 ≤ ℓ ≤ L, ℓ ∈ Z}, there is a stable Σ∆ scheme such that

    ‖y‖_∞ ≤ µ   ⟹   ‖u‖_∞ ≤ Cδ ( (π^2 / (cosh^{−1}(2L − µ/δ))^2) · (e/π) )^r · r^r.    (4.10)

In particular, this guarantees that even with the 1-bit alphabet, i.e., A = {±1} and δ = 1, as assumed in binary embeddings, stability can be guaranteed as long as ‖y‖_∞ ≤ µ < 1, with a corresponding bound

    ‖y‖_∞ ≤ µ < 1   ⟹   ‖u‖_∞ ≤ C · c(µ)^r · r^r.    (4.11)

Previous works (e.g., [31, 62]) using Σ∆ schemes in compressed sensing achieve an error bound that is polynomially small in m. However, these approaches assume Gaussian or subgaussian random matrices, and the proof techniques are not easily extended to the case of structured random matrices. This is in part due to the role that D^{−r}, applied to Φ, plays in the associated proofs; one essentially requires D^{−r}Φ to have a restricted isometry property. Due to the dependencies among the rows of Φ in the case of BOEs and PCEs, such a property is difficult to prove. Nevertheless, in [27], the authors were the first to study PCEs and prove error bounds that decay polynomially in m. The approach proposed in [27] uses a different sampling scheme to generate the PCE, a different reconstruction method (whose complexity scales polynomially in m, versus p in our case), and requires a different proof technique. To circumvent the above issue, and generate the first results for BOEs, we will need the following condensation operator. Let

    λ := m/p =: rλ̃ − r + 1

for some integer λ̃.

Definition 4.3 (Σ∆ Condensation Operator). Let v be a row vector in R^λ whose entry v_j is the j-th coefficient of the polynomial (1 + z + . . . + z^{λ̃−1})^r. Define the Σ∆ condensation operator V_Σ∆ : C^m → C^p by

    V_Σ∆ = I_p ⊗ v,    (4.12)

² In the relevant proposition of [41], there is a typo (in equation (12)): an r^r factor is missing. This can be seen by comparing their equations (11) and (12). We remedy this in the version presented here.


and define its normalized versions

    Ṽ_Σ∆ = (6 log m / (‖v‖_2 √p)) V_Σ∆   and   V̂_Σ∆ = (1 / (‖v‖_2 √p)) V_Σ∆.

Example 4.4. In the case r = 1, we have v = (1, . . . , 1) ∈ R^λ, while when r = 2, v = (1, 2, . . . , λ̃ − 1, λ̃, λ̃ − 1, . . . , 2, 1) ∈ R^λ.

Lemma 4.5. For a stable r-th order Σ∆ quantization scheme we have

    ‖Ṽ D^r‖_{∞→2} ≤ 6(8r)^{r+1} λ^{−r+1/2} log m.

The proof of Lemma 4.5 is given in Section 8.
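The vector v of Definition 4.3 is just a repeated polynomial convolution; the sketch below (ours) computes it and reproduces Example 4.4.

```python
# The Sigma-Delta condensation vector v: coefficients of (1 + z + ... + z^{lam_tilde-1})^r.
import numpy as np

def sigma_delta_v(lam_tilde, r):
    ones = np.ones(lam_tilde)
    v = np.array([1.0])
    for _ in range(r):
        v = np.convolve(v, ones)
    return v                                  # length r*(lam_tilde - 1) + 1 = lambda

print(sigma_delta_v(4, 1))   # [1. 1. 1. 1.]               (Example 4.4, r = 1)
print(sigma_delta_v(4, 2))   # [1. 2. 3. 4. 3. 2. 1.]      (Example 4.4, r = 2)
```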

4.5 Distributed noise shaping

In [18], Chou and Güntürk achieved an exponentially small error bound in the quantization of unstructured random (e.g., Gaussian) finite frame expansions by using a so-called distributed noise-shaping approach. Our work in this paper extends this approach to handle the practically important cases of compressed sensing and binary embeddings, both with structured random matrices drawn from a BOE or a PCE. To that end, we now review distributed noise shaping. Fixing β > 1, let H_β be the λ × λ noise transfer operator given by

    (H_β)_ij = 1 if i = j,  −β if i = j + 1,  and 0 otherwise.

Moreover, let H : C^m → C^m be the block-diagonal distributed noise-shaping transfer operator given by

    H = I_p ⊗ H_β.    (4.13)

In analogy with Definition 4.3, we now introduce the distributed noise-shaping condensation operator.

Definition 4.6 (Distributed Noise-Shaping Condensation Operator). Define the row vector

    v_β := [β^{−1}  β^{−2}  . . .  β^{−λ}]

and define the distributed noise-shaping condensation operator V_β : C^m → C^p by

    V_β = I_p ⊗ v_β.    (4.14)

Define its normalized versions

    Ṽ_β = (6 log m / (‖v_β‖_2 √p)) V_β   and   V̂_β = (1 / (‖v_β‖_2 √p)) V_β.

Notice that in this setting V_β H = I_p ⊗ (v_β H_β), and consequently

    ‖V_β H‖_{∞→∞} = β^{−λ}   ⟹   ‖Ṽ_β H‖_{∞→2} ≤ (β^{−λ} · 6 log m) / ‖v_β‖_2 ≤ 6β^{−λ+1} log m,    (4.15)

since v_β H_β = [0  0  . . .  β^{−λ}] and ‖v_β‖_2 ≥ β^{−1}.
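The identities used in (4.15) are easy to verify numerically; the short sketch below (ours) checks that v_β H_β = [0, ..., 0, β^{−λ}] and that ‖V_β H‖_{∞→∞} = β^{−λ}.

```python
# Numerical check of the identities behind (4.15)  (our sketch).
import numpy as np

beta, lam, p = 1.1, 6, 3
v = beta ** (-np.arange(1, lam + 1))                      # v_beta of Definition 4.6
H_beta = np.eye(lam)
H_beta[np.arange(1, lam), np.arange(lam - 1)] = -beta     # -beta on the first subdiagonal

print(v @ H_beta)                                         # ~ [0, 0, 0, 0, 0, beta**-lam]
V = np.kron(np.eye(p), v)                                 # V_beta = I_p (x) v_beta
H = np.kron(np.eye(p), H_beta)                            # H = I_p (x) H_beta as in (4.13)
print(np.max(np.abs(V @ H).sum(axis=1)), beta ** (-lam))  # both equal beta^{-lambda}
```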

5 Main results on binary embeddings

We are now ready to present our main results on binary (and more general) embeddings, showing that Algorithm 1 yields quasi-isometric embeddings that preserve Euclidean distances. We start with the case of finite sets, and we then present an analysis of our methods for infinite sets.

Theorem 5.1 (Binary embeddings of finite sets). Consider a finite set T ⊂ B_2^n. Fix α ∈ (0, 1) and ρ ∈ (0, 1). Let Φ ∈ R^{m×n}, D_ε ∈ R^{n×n}, Ṽ ∈ R^{p×m} be as in Algorithm 1 and suppose λ = m/p is an integer. Define Φ̃ = Φ / (6 log m) and let Q : R^m → {±1}^m be the stable quantization scheme used in Algorithm 1, corresponding either to the r-th order Σ∆ quantization or to the distributed noise-shaping quantization with β ∈ (1, 10/9]. Define the embedding map f : T → {±1}^m by f = Q ∘ Φ̃ ∘ D_ε. Assume that k ≥ 40(log(8|T|) + ν). Then there exist constants C_1 > 0 and C_2 > 0, depending on the quantization, such that if

    m ≥ p ≥ C_1 k log^4 n / (α^2 ρ^2),

the map f satisfies

    |d_Ṽ(f(x), f(x̃)) − ‖x − x̃‖_2| ≤ α‖x − x̃‖_2 + C_2 η(λ)   for all x, x̃ ∈ T,

with probability exceeding 1 − ρ − e^{−ν} − 2m^{−1}. Here, η(λ) = λ^{−r+1/2} log m if we use the Σ∆ quantization scheme, and η(λ) = β^{−λ+1} log m if we use the distributed noise-shaping scheme.

Proof. The proof follows from Theorem 3.8 and our main technical result, Theorem 7.1, respectively showing that RIP matrices with randomized column signs provide Johnson-Lindenstrauss embeddings, and that the matrices ṼΦ̃ satisfy the appropriate RIP. The proof also requires bounds on ‖ṼH‖_{∞→2} to obtain the advertised decay rates. Indeed, we have

    |d_Ṽ(f(x), f(x̃)) − ‖x − x̃‖_2| ≤ |d_Ṽ(f(x), f(x̃)) − ‖ṼΦ̃D_ε(x − x̃)‖_2| + |‖ṼΦ̃D_ε(x − x̃)‖_2 − ‖x − x̃‖_2|,    (5.1)

so we must bound the two summands on the right to obtain our result, and we start with the second term.

By (7.1) in Theorem 7.1, with αρ in place of α, and Markov’s inequality, for any ρ ∈ (0, 1), e satisfies (α, k)-RIP with probability exceeding 1 − ρ, provided that p ≥ C1 k log4 n/(ρ2 α2 ) for Ve Φ some constant C1 > 0. Hence by Theorem 3.8, with probability at least 1 − ρ − e−ν , the matrix e ǫ satisfies Ve ΦD kVe ΦD e ǫ (x − x ˜k22 , ˜)k22 − kx − x ˜k22 ≤ αkx − x

which yields

kVe ΦD e ǫ (x − x ˜)k2 − kx − x ˜k2 ≤ αkx − x ˜k2

(5.2)

simultaneously for all x, x ˜ ∈ T . To bound the first summand in (5.1), note that in the case of Σ∆ quantization, for any z ∈ T , (4.9) implies that e ǫ zk2 = kVe D r uz k2 ≤ kVe D r k∞→2 kuz k∞ ≤ C1 c2 (µ)r r 2r λ−r+1/2 log m. kVe f (z) − Ve ΦD

To obtain the last inequality above, use Lemma 4.5 to bound kVe D r k∞→2 , and note that e ǫ zk∞ ≤ 8/9 with probability exceeding 1 − 2m−1 by Lemma 8.2, which allows us to inkΦD voke the stability bound (4.11) to control kuz k∞ . Consequently by the triangle inequality we have kVe (f (x) − f (˜ x))k2 − kVe ΦDǫ (x − x ˜)k2 ≤ 2C1 c2 (µ)r r 2r λ−r+1/2 log m, (5.3)

which yields the desired result for Σ∆ quantization when combined with (5.1), (5.2). The exact same technique yields the advertised bound for distributed noise shaping, albeit we now use (4.15) in place of Lemma 4.5. Remark 3. (The condition on β) Note that the condition β ∈ (1, 10/9] is just a byproduct of the e that we chose, and the resulting bound on kΦD e ǫ k2→∞ . Different convenient normalization of Φ normalizations would have allowed pushing β closer to 2.

Remark 4. (Root-exponential error decay) In the case of Σ∆ quantization, one can optimize the bound (5.3) by selecting r as a function of λ. Choosing the Σ∆ scheme of order r ∗ , where r ∗ minimizes the function c2 (µ)r r 2r λ−r+1/2√(see, e.g., [30, 41] for a similar detailed calculation) yields a quantization error decay of exp(−c λ) for some constant c so that √ −c λ d e (f (x), f (˜ ≤ αkx − x x )) − kx − x ˜ k ˜ k + Ce log m, 2 2 V

∀x, x ˜∈T.

Remark 5. (Optimizing the dimension p) The parameter α in Theorem 5.1 behaves essentially ˜k2 ≤ 2 for points in the unit ball. So, one can consider choosing p to improve as √1p , and kx − x √ the bounds C( √1p +e−c m/p log m) and C( √1p +β −m/p+1 log m) in the case of Σ∆ and distributed noise-shaping quantization respectively. Slightly sub-optimal but nevertheless good choices of p are then pΣ∆ ≈ m/ log2 (m) and pdist. ≈ m/ log(m), where the ≈ notation hides constants. q

log m √ m and In turn this yields a quantized Johnson-Lindenstrauss error decay of the form log m m for the two schemes, respectively. In other words, quantizing the Johnson-Lindenstrauss Lemma only costs us a logarithmic factor in the embedding dimension.

Remark 6. (Embedding into Am ) It is easy to see from the proof that one could replace the cube {±1}m by the set Am generated by the 2L level alphabet in (4.3) provided (2L − 1)δ ≥ 1. In turn, this simply reduces the constant C2 in Theorem 5.1 by a factor of L. 18

e in the proof Remark 7. (Unstructured random matrices) Note that the only requirement on Φ e satisfies an appropriate restricted isometry property. It is not hard to see that matriis that Ve Φ e ces Φ with independent entries drawn from appropriately normalized Gaussian or subgaussian distributions yield such a property. Hence a version of our results holds for such matrices, and we leave the details to the interested reader. We now present our result on embedding arbitrary subsets of the unit ball, and comment that the contents of Remarks 5, 6, and 7 apply to the theorem below as well, with minor modification. Theorem 5.2 (Binary embeddings of general sets). Consider the same setup as Theorem 5.1 albeit now with any set T ⊂ B2n . Let γ = kvk1 /kvk2 . Suppose that n o ω 2 (T −T ) max 1, (rad(T −T ))2 . (5.4) m ≥ p ≥ C1 γ 2 (1 + ν)2 log4 n 2 α e ◦ Dǫ Then, with probability at least 1 − e−ν − 2m−1 , the map f : T → {−1, 1}m given by f = Q ◦ Φ ensures that √ d e (f (x), f (˜ x)) − dℓ2 (x, x ˜) ≤ max{ α, α} rad(T − T ) + C2 η(λ), for all x, x ˜∈T. V

Here η := λ−r+1/2 log m if we use a stable Σ∆ quantization scheme; and η := β −λ+1 log m if we use the distributed noise-shaping scheme. Proof. To start, as in [51], let L = ⌈log2 n⌉. Fix k and α, then by Theorem 7.1 we have that for any level ℓ ∈ {0, 1, . . . , L} ! ee 2 P sup kV Φxk2 − kxk22 ≥ 2ℓ/2 α ≤ e−ν x∈T2ℓ k

if p & k(log4 n + γ 2 ν)/α2 . Notice that by Theorem 7.1, coupled with a union bound applied to e satisfies the Multiresolution L levels and a change of variable ν 7→ ν + log L, we have that Ve Φ RIP with sparsity k and distortion α > 0 with probability exceeding 1 − e−ν whenever m≥p&

k[log4 n + γ 2 (ν + log(L + 1))] . α2

Since log(L + 1) ≤ log4 n and log4 n + ν ≤ (1 + ν) log4 n, we can achieve (5.5) by setting m ≥ p & γ 2 (1 + ν)

k log4 n . α2

e in place of H) with Consequently, we may now apply Theorem 3.10 (with Ve Φ k = 150(1 + ν) and α ˜=

to conclude that

α   ω(T −T ) C max 1, rad(T −T )

e 2 2 2 2 sup kVe ΦD ǫ xk2 − kxk2 ≤ max{α, α }(rad(T − T )) ,

x∈T −T


(5.5)

e ǫ zk∞ ≤ 8/9 for all z ∈ T with probability of both happening and Lemma 8.2 to conclude that kΦD −ν simultaneously exceeding 1 − e − 2m−1 . Indeed, this happens whenever o n ω 2 (T −T ) max 1, (rad(T −T ))2 . m ≥ p ≥ C1 γ 2 (1 + ν)2 log4 n 2 α Conditioning on the events happening simultaneously, we have that for all x ∈ T − T 2 kVe ΦD kVe ΦD max{α, α2 }(rad(T − T ))2 e ǫ xk2 e ǫ xk22 − 1 ≤ − 1 . (5.6) ≤ kxk2 kxk22 kxk22

This yields

√ ee V ΦD xk − kxk k ǫ 2 2 ≤ max{ α, α} rad(T − T ) for all x ∈ T − T .

(5.7)

Then for any w and v in T , applying the quantization state equations (4.4), we have (as in the proof of Theorem 5.1) and proceeding as in the proof of that theorem, kVe Q(ΦD e ǫ w) − Ve Q(ΦD e ǫ v)k2 −kw − vk2 e ǫ v) − (Ve Huw − Ve Huv )k2 − kw − vk2 e ǫ w − Ve ΦD = k(Ve ΦD e e ≤ kVe ΦD ǫ (w − v)k2 − kw − vk2 + V H(uw − uv ) √ ≤ max{ α, α} rad(T − T ) + C2 η. Remark 8. In distributed noise-shaping quantization, we can bound γ 2 ≤ 10 9 ,

so we can set the condition (5.4) as m ≥ p ≥ Cβ C1

(1 + ν)2 log4 (n) max

2β β−1 n

=: Cβ if 1 < β ≤ o ω 2 (T −T ) −2 1, (rad(T −T ))2 α .

2 In Σ∆ quantization,n we can bound o γ ≤ λ = m/p. Hence, the condition (5.4) becomes m ≥ p ≥ √ ω(T −T ) C1 m log2 (n) max 1, (rad(T −T )) α−1 . Note that this restriction on dimensions in the Σ∆ case, √ m i.e., m . p ≤ m is not problematic as one should use p ≈ log(m) 2 as described in Remark 5, and this choice satisfies the restriction.

6 Main results on compressed sensing with BOE and PCE: Quantizers, decoders, and error bounds

Herein, we present our results on the compressed sensing acquisition and reconstruction paradigm outlined in Algorithm 2. First, we cover the case when the acquired BOE or PCE measurements Φx are quantized using a Σ∆ scheme, and we show polynomial decay of the quantization error. Second, we present an analogous result with distributed noise-shaping quantization, and we show exponential decay of the quantization error as a function of the number of measurements.

Theorem 6.1 (Recovery results for Σ∆ quantization). Let k and p be in {1, . . . , m} such that the sampling ratio λ := m/p is an integer. Let Φ ∈ C^{m×n} and V̂ ∈ R^{p×m} be as in Algorithm 2. Denote by Q^r_Σ∆ a stable³ r-th order scheme with the alphabet A. For ν > 0, there exists an absolute constant c such that whenever

    m ≥ p ≥ c max{k log^4 n, √(kmν)}

the following holds with probability at least 1 − e^{−ν} on the draw of Φ: for x ∈ R^n such that ‖Φx‖_∞ ≤ µ < 1 and q := Q^r_Σ∆(Φx), the solution x̂ to (2.1) satisfies

    ‖x̂ − x‖_2 ≤ C_1 (m/p)^{−r+1/2} δ + C_2 σ_k(x)_1 / √k,

where C_1 and C_2 are explicit constants given in the proof.

³ E.g., Q^r_Σ∆ can be either as in Proposition 4.2 or as in Lemma 4.1, with 2^r + 1/δ ≤ 2L and H = D^r in the latter case.

Proof. The proof essentially boils down to finding an upper bound for ‖V̂D^r u‖_2, which we do by controlling ‖V̂D^r‖_{∞→2} and ‖u‖_∞. Indeed, by Lemma 4.5,

    ‖V̂D^r‖_{∞→2} ≤ (8r)^{r+1} λ^{−r+1/2}.

Moreover, since ‖Φx‖_∞ ≤ µ < 1, the quantization scheme is stable, i.e., ‖u‖_∞ ≤ cδ, where c may depend on r. The noise-shaping relation (4.9) implies that

    ‖V̂Φx − V̂q‖_2 = ‖V̂D^r u‖_2 ≤ (8r)^{r+1} λ^{−r+1/2} ‖u‖_∞ ≤ C_1(r) λ^{−r+1/2} δ,

where in the case of noise-shaping schemes as in Lemma 4.1 we have C_1(r) = (8r)^{r+1}. On the other hand, in the case of stable coarse schemes we have C_1(r) = (8r)^{r+1} C(r), with C(r) being the r-dependent constant on the right hand side of the bound in Proposition 4.2. We will now apply Theorem 7.1 with γ^2 := ‖v‖_1^2/‖v‖_2^2 ≤ m/p, as is the case for Σ∆ quantization. Thus we deduce that if p ≥ c max{k log^4 n / α^2, √(kmν) / α}, then V̂Φ satisfies the (α, k)-RIP with probability at least 1 − e^{−ν}. Therefore, robustness of the convex program (2.1) (see Theorem 3.2) implies that the solution x̂ to (2.1) satisfies

    ‖x̂ − x‖_2 ≤ C_1 (m/p)^{−r+1/2} δ + C_2 σ_k(x)_1 / √k.

Theorem 6.2 (Recovery bounds for distributed noise-shaping quantization). Fix β ∈ (1, 2L − µ/δ) and consider the same setup as Theorem 6.1, albeit now with the distributed noise-shaping quantization. There exist positive constants c, C, and D such that whenever

    m ≥ p ≥ ck(log^4(n) + ν)

the following holds with probability at least 1 − e^{−ν} on the draw of Φ: for x ∈ R^n such that ‖Φx‖_∞ ≤ µ < 1, with q := Q_β(Φx) its resulting distributed noise-shaping quantization as in Lemma 4.1, the solution x̂ to (2.1) satisfies

    ‖x − x̂‖_2 ≤ C β^{−m/p + 1} δ + D σ_k(x)_1 / √k.


Proof. The proof is identical to that of Theorem 6.1. The noise-shaping relation (4.4) implies that V̂Φx − V̂q = V̂Hu, and Lemma 4.1 yields ‖u‖_∞ ≤ δ. Hence, using the first equality in (4.15), we have

    ‖V̂Φx − V̂q‖_2 = ‖V̂Hu‖_2 ≤ √p ‖V̂Hu‖_∞ ≤ √p ‖V̂H‖_{∞→∞} ‖u‖_∞ ≤ δβ^{−λ+1}.

By Theorem 7.1 with γ^2 = ‖v‖_1^2/‖v‖_2^2 ≤ 2β/(β − 1) and p ≥ c k(log^4(n) + ν)/α^2, the matrix V̂Φ = VΦ/(‖v‖_2√p) satisfies the (α, k)-RIP with probability 1 − e^{−ν}. This completes the proof by using the robustness of the convex program (2.1) as before.

Remark 9. (Noise robustness) We note that by modifying the reconstruction technique slightly, non-quantization noise can be handled in a robust way, in an analogous manner to the method in [61]. We leave the details to the interested reader.

7 The Restricted Isometry Property of V̂Φ

Our main technical theorem shows that a scaled version of V Φ satisfies the restricted isometry property. Our proof of Theorem 7.1 below is based on the technique of [50], which in-turn relies heavily on [60]. The rough architecture of the proof is as follows. We first show that the desired RIP property holds in expectation, then we leverage the expectation result and a generalized Bernstein bound to obtain the RIP property with high probability. The proof that the RIP holds in expectation in turn comprises several steps. A triangle inequality shows that the expected value of the RIP constant is bounded by the sum of the expected RIP constant of a BOS or PCE matrix and the expected supremum of a chaos process. To bound the chaos process, we use Theorem 3.12 which requires controlling a Dudley integral and hence certain covering numbers. Here again, we use a technique adapted from the works of Nelson et al. [50], and Rudelson and Vershynin [60]. Theorem 7.1. Fix α ∈ (0, 1) and ν > 0. Let k and p be in {1, . . . , m} with λ := m/p an integer. Let 1 V := Ip ⊗ v and Vb := √ V kvk2 p for a row vector v ∈ Rλ , and let Φ be as in Construction 1.3. Then there exist universal constants C1 and C2 that may depend on the distribution of Φ but not the dimensions, such that if m ≥ p ≥ C1 we have

k log4 n , α2

E sup kVb Φxk22 − kxk22 ≤ α;

(7.1)

x∈Tk

and if m ≥ p ≥ C2 k max



log4 n γ 2 ν , 2 α2 α



or

we have 22

m ≥ p ≥ C2

k(log4 n + γ 2 ν) , α2

P( sup kVb Φxk22 − kxk22 ≤ α) ≥ 1 − e−ν . x∈Tk

Here, γ := kvk1 /kvk2 .

Proof. We begin with the bound on the expectation.

Step (I): Let $S_\ell = \{(\ell-1)\lambda+1,\dots,\ell\lambda\}$ for $\ell = 1,\dots,p$. Let $\Lambda$ be a diagonal matrix having the random signs $\epsilon_j$ from Step 2 of Construction 1.3 as its diagonal entries, and let $a_j$ be the $j$th row of $A$. Then
\[ V\Phi x = V\Lambda Ax = V\Lambda\begin{pmatrix}\langle a_1,x\rangle\\ \vdots\\ \langle a_m,x\rangle\end{pmatrix} = V\begin{pmatrix}\epsilon_1\langle a_1,x\rangle\\ \vdots\\ \epsilon_m\langle a_m,x\rangle\end{pmatrix} = \begin{pmatrix}\sum_{j\in S_1}v_j\epsilon_j\langle a_j,x\rangle\\ \vdots\\ \sum_{j\in S_p}v_{j-(p-1)\lambda}\epsilon_j\langle a_j,x\rangle\end{pmatrix}, \]
and the expected squared modulus, with respect to the random vector $\epsilon = (\epsilon_j)_{j=1}^m$, of an entry of $V\Phi x$ is
\[ E_\epsilon\Big|\sum_{j\in S_\ell}v_{j-(\ell-1)\lambda}\epsilon_j\langle a_j,x\rangle\Big|^2 = E_\epsilon\sum_{j,j'\in S_\ell}v_{j-(\ell-1)\lambda}v_{j'-(\ell-1)\lambda}\epsilon_j\epsilon_{j'}\langle a_j,x\rangle\overline{\langle a_{j'},x\rangle} = \sum_{j\in S_\ell}v_{j-(\ell-1)\lambda}^2|\langle a_j,x\rangle|^2. \]
Then, defining $A_{\Omega_j}$ as the restriction of $A$ to its rows indexed by $\Omega_j = \{j, j+\lambda,\dots,j+(p-1)\lambda\} \subset \{1,\dots,m\}$, for $j = 1,\dots,\lambda$ we have
\[ E_\epsilon\|V\Phi x\|_2^2 = \sum_{\ell=1}^p\sum_{j\in S_\ell}v_{j-(\ell-1)\lambda}^2|\langle a_j,x\rangle|^2 = \sum_{j=1}^\lambda v_j^2\sum_{j'\in\Omega_j}|\langle a_{j'},x\rangle|^2 = \sum_{j=1}^\lambda v_j^2\|A_{\Omega_j}x\|_2^2. \]
By the triangle inequality,
\[ \Big|\frac{1}{\|v\|_2^2p}\|V\Phi x\|_2^2 - \|x\|_2^2\Big| \le \Big|\frac{1}{\|v\|_2^2p}\|V\Phi x\|_2^2 - \frac{1}{\|v\|_2^2p}\sum_{j=1}^\lambda v_j^2\|A_{\Omega_j}x\|_2^2\Big| + \Big|\frac{1}{\|v\|_2^2}\sum_{j=1}^\lambda v_j^2\Big(\frac1p\|A_{\Omega_j}x\|_2^2 - \|x\|_2^2\Big)\Big| \]
\[ \le \frac{1}{\|v\|_2^2p}\Big|\|V\Phi x\|_2^2 - \sum_{j=1}^\lambda v_j^2\|A_{\Omega_j}x\|_2^2\Big| + \frac{1}{\|v\|_2^2}\sum_{j=1}^\lambda v_j^2\Big|\frac1p\|A_{\Omega_j}x\|_2^2 - \|x\|_2^2\Big|, \]
which implies that for
\[ T_k := \{x \in \Sigma_k^n : \|x\|_2 = 1\} \]
we have
\[ E_{A,\epsilon}\sup_{x\in T_k}\Big|\frac{1}{\|v\|_2^2p}\|V\Phi x\|_2^2 - \|x\|_2^2\Big| \le E_{A,\epsilon}\sup_{x\in T_k}\frac{1}{\|v\|_2^2p}\Big|\|V\Phi x\|_2^2 - E_\epsilon\|V\Phi x\|_2^2\Big| + \frac{1}{\|v\|_2^2}\sum_{j=1}^\lambda v_j^2\,E_A\sup_{x\in T_k}\Big|\frac1p\|A_{\Omega_j}x\|_2^2 - \|x\|_2^2\Big|. \]
The second summand above is the expected value of the RIP constant of matrices $A_{\Omega_j}$ drawn from the BOE or PCE, and this quantity is bounded by $\alpha_A$ provided $p \gtrsim k\log^4 n/\alpha_A^2$, by Theorems 3.3 and 3.4. Therefore,
\[ E_{A,\epsilon}\sup_{x\in T_k}\Big|\frac{1}{\|v\|_2^2p}\|V\Phi x\|_2^2 - \|x\|_2^2\Big| \le \frac{1}{\|v\|_2^2p}\,E_{A,\epsilon}\sup_{x\in T_k}\Big|\|V\Phi x\|_2^2 - E_\epsilon\|V\Phi x\|_2^2\Big| + \alpha_A. \]
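The exchange of sums used in Step (I), namely $E_\epsilon\|V\Phi x\|_2^2 = \sum_{j=1}^\lambda v_j^2\|A_{\Omega_j}x\|_2^2$, can be checked exactly on a small example by enumerating all sign patterns; the random $A$ below is an assumption standing in for a BOE/PCE draw.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(2)
p, lam, n = 3, 2, 5
m = p * lam
A = rng.standard_normal((m, n))        # stand-in for a BOE/PCE draw (assumption)
v = rng.standard_normal(lam)
V = np.kron(np.eye(p), v)
x = rng.standard_normal(n)

# Exact expectation over all 2^m sign patterns epsilon.
lhs = np.mean([np.linalg.norm(V @ (np.array(e) * (A @ x))) ** 2
               for e in product([-1.0, 1.0], repeat=m)])

# Right-hand side: sum_j v_j^2 ||A_{Omega_j} x||_2^2, where Omega_j picks every lam-th row.
rhs = sum(v[j] ** 2 * np.linalg.norm(A[j::lam, :] @ x) ** 2 for j in range(lam))

print(lhs, rhs)   # the two numbers agree up to floating-point error
```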

Here, $\alpha_A$ is the RIP constant in either (3.1) or (3.2).

Step (II): Now, we need to bound $E_{A,\epsilon}\sup_{x\in T_k}\big|\|V\Phi x\|_2^2 - E_\epsilon\|V\Phi x\|_2^2\big|$. To control this quantity, we will use Theorem 3.12, which requires a little bit of setting up. To that end, let
\[ A_j^v = \begin{pmatrix} v_1 a_{(j-1)\lambda+1}\\ \vdots\\ v_\lambda a_{j\lambda}\end{pmatrix}, \]
and observe that
\[ V\Phi x = \begin{pmatrix}(A_1^vx)^* & & & \\ & (A_2^vx)^* & & \\ & & \ddots & \\ & & & (A_p^vx)^*\end{pmatrix}\begin{pmatrix}\epsilon_1\\ \epsilon_2\\ \vdots\\ \epsilon_m\end{pmatrix} =: A_x^v\,\epsilon, \qquad (7.2, 7.3) \]
where the $\ell$-th diagonal block $(A_\ell^vx)^* = \big(v_1\langle a_{(\ell-1)\lambda+1},x\rangle,\dots,v_\lambda\langle a_{\ell\lambda},x\rangle\big)$ is a row of length $\lambda$, so that $A_x^v$ is a $p\times m$ matrix whose $\ell$-th row is supported on $S_\ell$. Let $A^v$ be the set of matrices given by $A^v := \{A_x^v : x \in T_k\}$. Then, conditioning on $A$, by Theorem 3.12,
\[ E_\epsilon\sup_{x\in T_k}\big|\|V\Phi x\|_2^2 - E_\epsilon\|V\Phi x\|_2^2\big| = E_\epsilon\sup_{A_x^v\in A^v}\big|\|A_x^v\epsilon\|_2^2 - E_\epsilon\|A_x^v\epsilon\|_2^2\big| \lesssim d_F(A^v)\,\gamma_2(A^v,\|\cdot\|_{2\to2}) + \gamma_2(A^v,\|\cdot\|_{2\to2})^2. \qquad (7.4) \]
Our next step will be taking expectations over $A$, for which we will obtain an upper bound on the $\gamma_2$ functional and a bound on the expectation of $d_F(A^v)$. We start with the latter term, for which we observe that $\|A_x^v\|_F^2 = \sum_{j=1}^\lambda v_j^2\|A_{\Omega_j}x\|_2^2$.

Hence, taking expectations over $A$, we have
\[ E[d_F(A^v)] = E\sup_{x\in T_k}\sqrt{\sum_{j=1}^\lambda v_j^2\|A_{\Omega_j}x\|_2^2} \le \sqrt{\sum_{j=1}^\lambda v_j^2\,E\sup_{x\in T_k}\|A_{\Omega_j}x\|_2^2} \qquad (7.5, 7.6) \]
\[ \le \sqrt{\sum_{j=1}^\lambda v_j^2\Big(p\,E\sup_{x\in T_k}\Big|\frac1p\|A_{\Omega_j}x\|_2^2 - \|x\|_2^2\Big| + p\,E\sup_{x\in T_k}\|x\|_2^2\Big)} \le \sqrt{2p}\,\|v\|_2, \qquad (7.7, 7.8) \]

where the third inequality is due to the fact that the expected value of the RIP constant of the matrices $A_{\Omega_j}$ is bounded by 1. Now, to control the $\gamma_2$ term, we note that since
\[ \|A_x^v - A_z^v\|_{2\to2} = \max_{\|w\|_2=1}\|(A_x^v - A_z^v)w\|_2 = \max_{\|w\|_2=1}\sqrt{\sum_{j=1}^p\big|\big\langle (w_i)_{i=1+(j-1)\lambda}^{j\lambda},\,A_j^v(x-z)\big\rangle\big|^2} = \max_j\|A_j^v(x-z)\|_2, \]
we have $\|A_x^v - A_z^v\|_{2\to2} = \max_j\|A_j^v(x-z)\|_2 = \|x-z\|_X$, where we have defined the seminorm
\[ \|z\|_X := \max_{j\le p}\|A_j^vz\|_2. \qquad (7.9) \]
Hence,
\[ \gamma_2(A^v,\|\cdot\|_{2\to2}) = \gamma_2(T_k,\|\cdot\|_X). \qquad (7.10) \]
Then,
\[ E_{A,\epsilon}\sup_{x\in T_k}\big|\|V\Phi x\|_2^2 - E_\epsilon\|V\Phi x\|_2^2\big| \le E_A\big[d_F(A^v)\,\gamma_2(T_k,\|\cdot\|_X)\big] + E_A\,\gamma_2(T_k,\|\cdot\|_X)^2. \qquad (7.11) \]
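The two structural facts used above, namely $\|A_x^v\|_F^2 = \sum_j v_j^2\|A_{\Omega_j}x\|_2^2$ and $\|A_x^v - A_z^v\|_{2\to2} = \max_j\|A_j^v(x-z)\|_2$, both follow from the disjoint row supports of $A_x^v$. The sketch below builds $A_x^v$ explicitly for a small random example (the Gaussian $A$ is an illustrative assumption) and verifies both identities.

```python
import numpy as np

rng = np.random.default_rng(3)
p, lam, n = 4, 3, 7
A = rng.standard_normal((p * lam, n))   # stand-in for a BOE/PCE draw (assumption)
v = rng.standard_normal(lam)

def A_v_blocks(A, v, p, lam):
    """The blocks A_1^v, ..., A_p^v with A_j^v = diag(v) times the rows of block j."""
    return [np.diag(v) @ A[j * lam:(j + 1) * lam, :] for j in range(p)]

def A_x_matrix(blocks, x, p, lam):
    """The p x m matrix A_x^v: row j carries (A_j^v x)^T in the columns of block j."""
    M = np.zeros((p, p * lam))
    for j, B in enumerate(blocks):
        M[j, j * lam:(j + 1) * lam] = B @ x
    return M

blocks = A_v_blocks(A, v, p, lam)
x, z = rng.standard_normal(n), rng.standard_normal(n)
Ax, Az = A_x_matrix(blocks, x, p, lam), A_x_matrix(blocks, z, p, lam)

frob_sq = np.linalg.norm(Ax, 'fro') ** 2
rhs = sum(v[i] ** 2 * np.linalg.norm(A[i::lam, :] @ x) ** 2 for i in range(lam))
op = np.linalg.norm(Ax - Az, 2)                       # spectral norm
max_block = max(np.linalg.norm(B @ (x - z)) for B in blocks)
print(frob_sq, rhs)       # equal up to rounding
print(op, max_block)      # equal up to rounding
```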

To bound $\gamma_2(T_k,\|\cdot\|_X)$, we use Dudley's inequality (e.g., [65]),
\[ \gamma_2(T_k,\|\cdot\|_X) \lesssim \int_0^{Q}\sqrt{\log N(T_k,\|\cdot\|_X,u)}\,du, \qquad Q := \sup_{x\in T_k}\|x\|_X, \]
coupled with an upper bound on each of $Q$ and the integrand. To start, observe that
\[ Q^2 \le \sup_{x\in T_k}\max_j\|A_j^vx\|_2^2 \le \|v\|_2^2\sup_{x\in T_k}\|x\|_1^2 \le \|v\|_2^2\,k. \]
Here we use the fact that each entry of the matrix $A$ has magnitude at most 1. Next, by Lemma 8.3 the integrand satisfies
\[ \sqrt{\log N(T_k,\|\cdot\|_X,u)} \lesssim \sqrt{k\log\Big(\frac nk\Big(1+\frac{2\|v\|_2\sqrt k}{u}\Big)\Big)} \qquad (7.12) \]
and
\[ \sqrt{\log N(T_k,\|\cdot\|_X,u)} \lesssim \frac{\|v\|_2\sqrt{k\log p\,\log(2n+1)}}{u}. \qquad (7.13) \]

We are now ready to bound
\[ \gamma_2(T_k,\|\cdot\|_X) \lesssim \int_0^Q\sqrt{\log N(T_k,\|\cdot\|_X,u)}\,du = \|v\|_2\sqrt k\int_0^{Q/(\|v\|_2\sqrt k)}\sqrt{\log N(T_k,\|\cdot\|_X,t\|v\|_2\sqrt k)}\,dt. \]
Taking $s = \frac{1}{\sqrt k}$, we obtain
\[ \gamma_2(T_k,\|\cdot\|_X) \lesssim \|v\|_2\sqrt k\Big[\int_0^s\sqrt{k\Big(\log\frac nk + \log\Big(1+\frac2t\Big)\Big)}\,dt + \int_s^1\frac{\sqrt{\log p\,\log(2n+1)}}{t}\,dt\Big] \]
\[ \lesssim \|v\|_2\sqrt k\Big[\sqrt k\Big(s\sqrt{\log\frac nk} + \int_0^s\sqrt{\log\Big(1+\frac2t\Big)}\,dt\Big) - \sqrt{\log p\,\log(2n+1)}\,\log s\Big] \]
\[ \lesssim \|v\|_2\sqrt k\Big[\sqrt k\Big(s\sqrt{\log\frac nk} + s\sqrt{\log\big(e(1+2s^{-1})\big)}\Big) - \sqrt{\log p\,\log(2n+1)}\,\log s\Big], \]
and so
\[ \gamma_2(T_k,\|\cdot\|_X) \lesssim \|v\|_2\sqrt{k\log p\,\log(2n+1)}\,\log k. \qquad (7.14) \]
Finally, substituting (7.14) and (7.8) into a scaled version of (7.11), we obtain
\[ E\sup_{x\in T_k}\frac{1}{\|v\|_2^2p}\Big|\|V\Phi x\|_2^2 - E\|V\Phi x\|_2^2\Big| \lesssim \frac{1}{\|v\|_2^2p}\big(\sqrt{2p}\,\|v\|_2\big)\big(\|v\|_2\sqrt{k\log p\,\log(2n+1)}\,\log k\big) + \frac{1}{\|v\|_2^2p}\,\|v\|_2^2\,k\log p\,\log(2n+1)\,\log^2k \]
\[ \lesssim \frac{\sqrt{k\log p\,\log(2n+1)}\,\log k}{\sqrt p} + \Big(\frac{\sqrt{k\log p\,\log(2n+1)}\,\log k}{\sqrt p}\Big)^2 \lesssim \frac{\sqrt{k\log p\,\log(2n+1)}\,\log k}{\sqrt p} \lesssim \sqrt{\frac{k\log^4n}{p}} \le \alpha_1, \]
provided that $n \gtrsim p \gtrsim k\log^4n/\alpha_1^2$. Therefore,
\[ E_AE_\epsilon\sup_{x\in T_k}\Big|\frac{1}{\|v\|_2^2p}\|V\Phi x\|_2^2 - \|x\|_2^2\Big| \le \alpha_1 + \alpha_A = \alpha, \qquad (7.15) \]

provided that $p \ge C_1k\log^4n/\alpha^2$.

Step (III): The probability estimate. We will use Theorem 3.13 and follow the technique that was used in [28] (Chapter 12) to prove a similar result for BOEs (i.e., without the condensation operator $\hat V$). To that end, let $X_\ell^*$ be the $\ell$-th row of the matrix $V\Phi/\|v\|_2 = \sqrt p\,\hat V\Phi$, and define $Q_{k,n} = \bigcup_{S\subset[n],|S|\le k}Q_{S,n}$ with $Q_{S,n} = \{(z,w) : \|z\|_2 = \|w\|_2 = 1,\ \mathrm{supp}(z),\mathrm{supp}(w)\subset S\}$. The restricted isometry constant $\delta_k$ satisfies
\[ p\,\delta_k = \sup_{S\subset[n],|S|=k}\Big\|\sum_{\ell=1}^p\big((X_\ell)_S(X_\ell)_S^* - (I_n)_S\big)\Big\|_{2\to2} = \sup_{(z,w)\in Q_{k,n}}\Big\langle\sum_{\ell=1}^p(X_\ell X_\ell^* - I_n)z,\,w\Big\rangle = \sup_{(z,w)\in Q_{k,n}}\sum_{\ell=1}^p\big\langle(X_\ell X_\ell^* - I_n)z,\,w\big\rangle. \]
Let $Q_{k,n}^*$ be a dense countable subset of $Q_{k,n}$. Then
\[ p\,\delta_k = \sup_{(z,w)\in Q_{k,n}^*}\sum_{\ell=1}^pf_{z,w}(X_\ell), \]
where $f_{z,w}(X) = \langle(XX^* - I_n)z,w\rangle$. To apply Theorem 3.13, we first check the boundedness of $f_{z,w}$ for $(z,w)\in Q_{k,n}$. That is,
\[ |f_{z,w}(X)| \le |\langle(XX^* - I_n)z,w\rangle| \le \|X_SX_S^* - (I_n)_S\|_{2\to2} \le \|X_SX_S^* - (I_n)_S\|_{1\to1}. \]

Recall that any row of $V\Phi/\|v\|_2$ is of the form $\frac{1}{\|v\|_2}\sum_{\ell=1}^\lambda v_\ell\phi_\ell^*$, where $\phi_\ell^*$ is a row of $\Phi$. Since $\|\phi_\ell\|_\infty \le 1$, with $\delta_{s,s'}$ denoting the Kronecker delta function, we now have
\[ |f_{z,w}(X)| \le \|X_SX_S^* - (I_n)_S\|_{1\to1} = \max_{s\in S}\sum_{s'\in S}\Big|\Big(\frac{1}{\|v\|_2}\sum_{\ell=1}^\lambda v_\ell\phi_\ell(s)\Big)\Big(\frac{1}{\|v\|_2}\sum_{\ell=1}^\lambda v_\ell\phi_\ell^*(s')\Big) - \delta_{s,s'}\Big| \le \Big(\frac{\|v\|_1}{\|v\|_2}\sup_\ell\|\phi_\ell\|_\infty\Big)^2k + 1 \le 2\gamma^2k, \]
where $\gamma = \|v\|_1/\|v\|_2$. We also observe that

\[ E|f_{z,w}(X)|^2 \le E|\langle(XX^* - I_n)z,w\rangle|^2 \le E\|(X_SX_S^* - I_n)z\|_2^2 = E\,\|X_S\|_2^2\,|\langle X,z\rangle|^2 - 2E|\langle X,z\rangle|^2 + 1. \]
We bound $\|X_S\|_2^2$ and $E|\langle X,z\rangle|^2$ as follows:
\[ \|X_S\|_2^2 = \sum_{j\in S}|X_S(j)|^2 = \sum_{j\in S}\Big|\frac{1}{\|v\|_2}\sum_{\ell=1}^\lambda v_\ell\phi_\ell(j)\Big|^2 \le \frac{\|v\|_1^2}{\|v\|_2^2}\sum_{j\in S}\sup_\ell\|\phi_\ell\|_\infty^2 \le \gamma^2k, \]
and
\[ E|\langle X,z\rangle|^2 = \frac{1}{\|v\|_2^2}E\Big|\sum_{\ell=1}^\lambda v_\ell\langle\phi_\ell,z\rangle\Big|^2 = \frac{1}{\|v\|_2^2}\sum_{\ell,\ell'}v_\ell v_{\ell'}\,E\,\langle\phi_\ell,z\rangle\overline{\langle\phi_{\ell'},z\rangle} = \frac{1}{\|v\|_2^2}\sum_\ell v_\ell^2\,E|\langle\phi_\ell,z\rangle|^2 = \frac{1}{\|v\|_2^2}\sum_\ell v_\ell^2\|z\|_2^2 = 1. \]

Therefore, $E|f_{z,w}(X)|^2 \le \gamma^2k - 1 < \gamma^2k$. Applying Theorem 3.13 with $t = \alpha_2p$, we obtain
\[ P(\delta_k \ge \alpha + \alpha_2) \le P(p\delta_k \ge E\,p\delta_k + \alpha_2p) \le \exp\Big(-\frac{(\alpha_2p)^2/2}{p\gamma^2k + 4\gamma^2kp\,E\delta_k + 2\alpha_2p\gamma^2k/3}\Big), \]
where $E\delta_k \le C\sqrt{\frac{k\log^4n}{p}} \le \alpha < 1$ by the first part of the theorem. Hence,
\[ P(\delta_k \ge \alpha + \alpha_2) \le \exp\Big(-\frac{1}{2k\gamma^2}\,\frac{\alpha_2^2p}{1 + 4\alpha + 2\alpha_2/3}\Big) \le \exp\Big(-c(\alpha)\,\frac{\alpha_2^2p}{2k\gamma^2}\Big), \]
where $c(\alpha) := (1 + 4\alpha + 2/3)^{-1} \ge (1 + 4 + 2/3)^{-1} = 3/17$. The above term is less than $e^{-\nu}$ provided that
\[ p \ge \frac{34\,\gamma^2k\nu}{3\,\alpha_2^2}. \]
In conclusion, by taking $\alpha_2 = \alpha$ and rescaling, we have that $\delta_k \le \alpha$ with probability exceeding $1 - e^{-\nu}$ provided that
\[ p \ge C_2\max\Big\{\frac{k\log^4n}{\alpha^2},\ \frac{\gamma^2k\nu}{\alpha^2}\Big\} \quad\text{or}\quad p \ge C_2\,\frac{k(\log^4n + \gamma^2\nu)}{\alpha^2} \]
for some absolute constant $C_2$.
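As a sanity check on the resulting sample-complexity condition, the snippet below evaluates the tail bound $\exp(-c(\alpha)\alpha_2^2p/(2k\gamma^2))$ at the threshold $p = 34\gamma^2k\nu/(3\alpha_2^2)$ and confirms that it equals $e^{-\nu}$; the parameter values are illustrative assumptions.

```python
import numpy as np

def tail_bound(p, k, gamma, alpha2, c_alpha=3.0 / 17.0):
    """exp(-c(alpha) * alpha2^2 * p / (2 k gamma^2)), the bound derived above."""
    return np.exp(-c_alpha * alpha2 ** 2 * p / (2 * k * gamma ** 2))

k, gamma, alpha2, nu = 10, 4.0, 0.25, 3.0          # illustrative values (assumptions)
p_min = 34 * gamma ** 2 * k * nu / (3 * alpha2 ** 2)
print(f"p needed for tail <= e^-nu: p >= {p_min:.0f}")
print(f"tail bound at that p:       {tail_bound(p_min, k, gamma, alpha2):.3e}  (e^-nu = {np.exp(-nu):.3e})")
```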

8 Technical Lemmas and Proofs

8.1 Proof of Lemma 4.5

We first need the following lemma.

Lemma 8.1. For any integer $t \ge 0$,
\[ \sum_{\ell=1}^\lambda v_\ell(\Delta^ru)_{t+\ell} = (-1)^r\sum_{\ell=1}^\lambda(\Delta^rv)_{\ell+r}\,u_{t+\ell} + \sum_{j=1}^r(-1)^j(\Delta^{j-1}v)_j(\Delta^{r-j}u)_t. \qquad (8.1) \]
Here $\Delta$ denotes the first-order difference operator. That is, for any sequence $(a_j)_{j\in\mathbb Z}$, $(\Delta a)_j = a_j - a_{j-1}$.

Proof. The proof is by induction. First, let us check the case $r = 1$:
\[ \sum_{\ell=1}^\lambda v_\ell(\Delta u)_{t+\ell} = \sum_{\ell=1}^\lambda v_\ell u_{t+\ell} - \sum_{\ell=1}^\lambda v_\ell u_{t+\ell-1} = \sum_{\ell=1}^\lambda v_\ell u_{t+\ell} - \sum_{\ell=1}^\lambda v_{\ell+1}u_{t+\ell} + v_{\lambda+1}u_{t+\lambda} - v_1u_t = (-1)\sum_{\ell=1}^\lambda(\Delta v)_{\ell+1}u_{t+\ell} + (-1)v_1u_t. \]

Here, we used the fact that $v_{\lambda+j} = 0$ for $j \ge 1$. Now suppose that (8.1) is true for $r$. Then
\[ \sum_{\ell=1}^\lambda v_\ell(\Delta^{r+1}u)_{t+\ell} = \sum_{\ell=1}^\lambda v_\ell(\Delta^r(\Delta u))_{t+\ell} = (-1)^r\sum_{\ell=1}^\lambda(\Delta^rv)_{\ell+r}(\Delta u)_{t+\ell} + \sum_{j=1}^r(-1)^j(\Delta^{j-1}v)_j(\Delta^{r-j+1}u)_t \]
\[ = (-1)^r\Big[\sum_{\ell=1}^\lambda(\Delta^rv)_{\ell+r}u_{t+\ell} - \sum_{\ell=1}^\lambda(\Delta^rv)_{\ell+r}u_{t+\ell-1}\Big] + \sum_{j=1}^r(-1)^j(\Delta^{j-1}v)_j(\Delta^{r-j+1}u)_t \]
\[ = (-1)^r\Big[\sum_{\ell=1}^\lambda(\Delta^rv)_{\ell+r}u_{t+\ell} - \sum_{\ell=1}^\lambda(\Delta^rv)_{\ell+r+1}u_{t+\ell} - (\Delta^rv)_{r+1}u_t\Big] + \sum_{j=1}^r(-1)^j(\Delta^{j-1}v)_j(\Delta^{r-j+1}u)_t \]
\[ = (-1)^{r+1}\sum_{\ell=1}^\lambda(\Delta^{r+1}v)_{\ell+r+1}u_{t+\ell} + (-1)^{r+1}(\Delta^rv)_{r+1}u_t + \sum_{j=1}^r(-1)^j(\Delta^{j-1}v)_j(\Delta^{r+1-j}u)_t, \]
which is (8.1) with $r$ replaced by $r+1$. In the fourth equality, we used the fact that $(\Delta^rv)_{\lambda+r+1} = \sum_{j=0}^r(-1)^j\binom rjv_{\lambda+r+1-j} = 0$.
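Identity (8.1) is a finite summation-by-parts formula, so it can also be checked directly; the sketch below verifies it for a random sequence $u$ and a random $v$ supported on $\{1,\dots,\lambda\}$ (both choices are illustrative assumptions).

```python
import numpy as np
from math import comb

rng = np.random.default_rng(4)
lam, r, t = 5, 3, 7

# u: a generic sequence (random values on a large enough window, assumption);
# v: supported on indices 1..lam, zero elsewhere, as in the lemma.
u_vals = {i: rng.standard_normal() for i in range(-5, 30)}
v_vals = {i: rng.standard_normal() for i in range(1, lam + 1)}
u = lambda i: u_vals.get(i, 0.0)
v = lambda i: v_vals.get(i, 0.0)

def diff(f, order, i):
    """(Delta^order f)_i = sum_j (-1)^j C(order, j) f_{i-j}."""
    return sum((-1) ** j * comb(order, j) * f(i - j) for j in range(order + 1))

lhs = sum(v(l) * diff(u, r, t + l) for l in range(1, lam + 1))
rhs = ((-1) ** r * sum(diff(v, r, l + r) * u(t + l) for l in range(1, lam + 1))
       + sum((-1) ** j * diff(v, j - 1, j) * diff(u, r - j, t) for j in range(1, r + 1)))
print(lhs, rhs)   # both sides agree up to floating-point error
```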

We are now ready to prove Lemma 4.5. Let $V_j$ be the $j$th row of the matrix $V$; then by Lemma 8.1,
\[ V_jD^ru = \sum_{\ell=1}^\lambda v_\ell(\Delta^ru)_{(j-1)\lambda+\ell} = \underbrace{(-1)^r\sum_{\ell=1}^\lambda(\Delta^rv)_{\ell+r}\,u_{t+\ell}}_{(I)} + \underbrace{\sum_{j=1}^r(-1)^j(\Delta^{j-1}v)_j(\Delta^{r-j}u)_t}_{(II)}, \]
where $t = (j-1)\lambda$. To bound (I) and (II), we first define the Fourier series of a sequence $z := (z_j)_{j\in\mathbb Z}$ in $\ell^2(\mathbb Z)$ as
\[ \hat z(\xi) := \sum_{j\in\mathbb Z}z_je^{-ij\xi}. \]

Let $h := (h_j)_{j\in\mathbb Z}$, where $h_j = (-1)^j\binom rj$ for $j = 0,\dots,r$, and $h_j = 0$ otherwise. We observe that
\[ |(I)| \le \|u\|_\infty\sum_{\ell=1}^\lambda|(\Delta^rv)_{\ell+r}| = \|u\|_\infty\sum_{\ell=1}^\lambda|(h*v)_{\ell+r}| \le \|u\|_\infty\|h*v\|_1, \]
where $(h*v)_\ell = \sum_{j\in\mathbb Z}h_jv_{\ell-j}$ denotes the convolution of $h$ and $v$. Since
\[ \hat h(\xi) = \sum_{j\in\mathbb Z}h_je^{-ij\xi} = \sum_{j=0}^r(-1)^j\binom rj(e^{-i\xi})^j = (1-e^{-i\xi})^r \]
and
\[ \hat v(\xi) = \sum_{j\in\mathbb Z}v_je^{-ij\xi} = \big(1 + e^{-i\xi} + \dots + e^{-i\xi(\lambda-1)}\big)^r, \]
we obtain
\[ \widehat{h*v}(\xi) = \hat h(\xi)\hat v(\xi) = (1-e^{-i\xi\lambda})^r = \sum_{j=0}^r(-1)^j\binom rje^{-i\xi\lambda j}, \]
which implies that the non-zero entries of $h*v$ are $(h*v)_{\lambda j} = (-1)^j\binom rj$. Thus, $\|h*v\|_1 = 2^r$ and
\[ |(I)| \le 2^r\|u\|_\infty. \]
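The convolution identity behind the bound on (I), namely that $h*v$ is supported on multiples of $\lambda$ with entries $(-1)^j\binom rj$ and hence $\|h*v\|_1 = 2^r$, is easy to verify; the sketch below takes $v$ to be the coefficient sequence of $(1+x+\dots+x^{\lambda-1})^r$, as in the Fourier computation above (stated here as an explicit assumption).

```python
import numpy as np
from math import comb

lam, r = 4, 3
h = np.array([(-1) ** j * comb(r, j) for j in range(r + 1)], dtype=float)

# v = coefficients of (1 + x + ... + x^{lam-1})^r, i.e. the r-fold convolution of ones(lam).
v = np.ones(lam)
for _ in range(r - 1):
    v = np.convolve(v, np.ones(lam))

hv = np.convolve(h, v)                       # coefficients of (1 - x^lam)^r
expected = np.zeros_like(hv)
expected[::lam] = [(-1) ** j * comb(r, j) for j in range(r + 1)]
print(np.allclose(hv, expected))             # True: support only on multiples of lam
print(np.abs(hv).sum(), 2 ** r)              # ||h*v||_1 equals 2^r
```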

To bound (II), we observe that $|(\Delta^{j-1}v)_j| \le 2^{j-1}|v_j|$ and $|(\Delta^{r-j}u)_t| \le 2^{r-j}\|u\|_\infty$; thus
\[ |(II)| \le r2^{r-1}\max_{1\le j\le r}|v_j|\,\|u\|_\infty \le r2^{r-1}2^{2r-1}\|u\|_\infty, \]
where the last inequality is due to the fact that, by the definition of $v$, the $v_j$ are multinomial coefficients satisfying $\max_{1\le j\le r}|v_j| \le \frac12\binom{2r}{r} \le \frac12\,4^r$. Combining the bounds on (I) and (II), we get
\[ |V_jD^ru| \le \big(2^r + r2^{3r-2}\big)\|u\|_\infty \le r2^{3r-1}\|u\|_\infty. \]
This yields
\[ \|\tilde VD^r\|_{\infty\to2} \le \frac{r2^{3r-1}}{\|v\|_2}\,6\log m \le (8r)^{r+1}\lambda^{-r+1/2}\,6\log m. \qquad (8.2) \]
Here, the first and second inequalities in (8.2) are due to the facts that
\[ \|\tilde VD^r\|_{\infty\to2} \le \sqrt p\,\|\tilde VD^r\|_{\infty\to\infty} = \frac{6\log m}{\|v\|_2}\,\|VD^r\|_{\infty\to\infty} \]
and $\|v\|_2 \ge \lambda^{r-1/2}r^{-r}$, respectively.

8.2 Upper bound on $\|\Phi D_\epsilon\|_{2\to\infty}$

Lemma 8.2. Let $\Phi \in \mathbb R^{m\times n}$ and $D_\epsilon \in \mathbb R^{n\times n}$ be as in Algorithm 1. Then for any $c \ge 3$,
\[ P_\epsilon\big(\|\Phi D_\epsilon\|_{2\to\infty} \ge c\log m\big) \le 2m^{-(3c/8-1)}. \]

Proof. Note that for any vector $x \in B_2^n$, the entries of the vector $\Phi D_\epsilon x$ are of the form $\langle\varphi, D_\epsilon x\rangle = \sum_{i=1}^n\varphi_i\epsilon_ix_i$, where $\varphi$ is an arbitrary row of $\Phi$. Moreover, note that due to Construction 1.3, $|\varphi_ix_i| \le 1$, hence $\epsilon_i\varphi_ix_i$ is a random variable in $[-1,1]$. We may then use Bernstein's inequality to bound
\[ P\Big(\Big|\sum_{i=1}^n\varphi_i\epsilon_ix_i\Big| > t\Big) \le 2\exp\Big(-\frac{t^2/2}{t/3 + \sum_{i=1}^nE(\varphi_i\epsilon_ix_i)^2}\Big), \]
where
\[ \sum_{i=1}^nE(\varphi_i\epsilon_ix_i)^2 = \sum_{i=1}^n\varphi_i^2x_i^2 \le \max_i\varphi_i^2\sum_{i=1}^nx_i^2 \le 1. \]
So now
\[ P\Big(\Big|\sum_{i=1}^n\varphi_i\epsilon_ix_i\Big| > t\Big) \le 2\exp\Big(-\frac{t^2}{2t/3 + 2}\Big) \le 2e^{-3t/8}, \]
where the last inequality assumes $t \ge 1$. As this is true for every row of the matrix and any $x \in B_2^n$, we now have, by a union bound over all the rows,
\[ P(\|\Phi D_\epsilon\|_{2\to\infty} > t) \le 2me^{-3t/8}. \]
Choosing $t = c\log m$ yields $P(\|\Phi D_\epsilon\|_{2\to\infty} > c\log m) \le 2e^{(1-3c/8)\log m} = 2m^{-(3c/8-1)}$.
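The only purely numerical step in the proof is the simplification $t^2/(2t/3+2) \ge 3t/8$ for $t \ge 1$, which produces the constant $3/8$ in the exponent; the following trivial check confirms it on a grid of values.

```python
import numpy as np

t = np.linspace(1.0, 100.0, 10000)
lhs = t ** 2 / (2 * t / 3 + 2)     # exponent in the Bernstein bound
rhs = 3 * t / 8                    # exponent in the simplified bound
print(bool(np.all(lhs >= rhs)))    # True, so 2*exp(-lhs) <= 2*exp(-3t/8) for t >= 1
```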

8.3 Covering number bound

Herein we present a lemma estimating the covering number of the set of unit-norm sparse vectors $T_k$ using $\|\cdot\|_X$ balls. The proof of this lemma follows the techniques of [60] and [50], which are in turn based on an approach by Carl [14]; see also [28]. We present the proof here for completeness, for the reader's convenience, and to illustrate the role that $v$ plays.

Lemma 8.3. Let $T_k := \{x \in \Sigma_k^n : \|x\|_2 = 1\}$, and let $\|\cdot\|_X$ be as in (7.9). Then for any $u > 0$ the following inequalities hold:
\[ \sqrt{\log N(T_k,\|\cdot\|_X,u)} \lesssim \sqrt{k\log\Big(\frac nk\Big(1+\frac{2\|v\|_2\sqrt k}{u}\Big)\Big)} \qquad (8.3) \]
and
\[ \sqrt{\log N(T_k,\|\cdot\|_X,u)} \lesssim \frac{\|v\|_2\sqrt{k\log p\,\log(2n+1)}}{u}. \qquad (8.4) \]

Proof. We show (8.3) first. Since $\|x\|_X \le Q := \sup_{x\in T_k}\|x\|_X = \|v\|_2\sqrt k$, we have
\[ N(T_k,\|\cdot\|_X,u) \le N(T_k,\|\cdot\|_2,u/Q) \lesssim \binom nk\Big(1+\frac{2Q}{u}\Big)^k \]
by a standard volume argument and union bound (see, e.g., [60]).

For the second bound (8.4), we will compute the covering number of $T_k/\sqrt k$. To that end, we use Maurey's empirical method as in [60] (see also [28]). Denote by $e_j$ the $j$th standard basis vector, and for any $x \in T_k/\sqrt k \subset B_1^n = \{x \in \mathbb R^n : \|x\|_1 \le 1\}$ let $Z$ be a random vector with
\[ Z = \begin{cases} e_j\,\mathrm{sign}(x_j), & \text{with probability } |x_j|,\quad j = 1,\dots,n,\\ 0, & \text{with probability } 1 - \|x\|_1,\end{cases} \]
and let $Z_1,\dots,Z_s$ be independent copies of $Z$.

So $EZ_i = x$. We will show that $E\big\|x - \frac1s\sum_{i=1}^sZ_i\big\|_X \lesssim u$, which implies that there is a vector of the form $\frac1s\sum_{i=1}^sZ_i$ within distance at most $u$ from $x$. Indeed, by symmetrization (see, e.g., [60]),
\[ E\Big\|x - \frac1s\sum_{i=1}^sZ_i\Big\|_X \le \frac2s\,E\Big\|\sum_{i=1}^s\epsilon_iZ_i\Big\|_X. \]
Now, conditioning on the $Z_i$'s, i.e., considering only the expectation with respect to $\epsilon$, Gaussian comparison yields
\[ \frac1s\,E_\epsilon\Big\|\sum_{i=1}^s\epsilon_iZ_i\Big\|_X \lesssim \frac1s\,E_g\Big\|\sum_{i=1}^sg_iZ_i\Big\|_X = \frac1s\,E_g\max_j\|A_j^vZg\|_2, \]

where the $g_i$ are independent standard Gaussians and $Z$ is an $n\times s$ matrix whose columns are the $Z_i$ (see, e.g., [28, Ch. 12] for the details of such arguments). Since $\|A_j^vZg\|_2$ is a Lipschitz function of $g$ with Lipschitz constant $\|A_j^vZ\|_{2\to2}$, we have (see, e.g., [28, Ch. 8])
\[ P_g\big(\|A_j^vZg\|_2 - E_g\|A_j^vZg\|_2 \ge t\|A_j^vZ\|_{2\to2}\big) \le e^{-ct^2}, \]
so the random variable $\|A_j^vZg\|_2 - E_g\|A_j^vZg\|_2$ is subgaussian, and therefore
\[ E_g\max_{j\le p}\big|\|A_j^vZg\|_2 - E_g\|A_j^vZg\|_2\big| \lesssim \sqrt{\log p}\,\max_j\|A_j^vZ\|_{2\to2} \]
by [28] (Prop. 7.24 and Prop. 7.29). This yields the second inequality below, while the first is a simple consequence of the triangle inequality:
\[ E_g\max_j\|A_j^vZg\|_2 \le \max_jE_g\|A_j^vZg\|_2 + E_g\max_j\big|\|A_j^vZg\|_2 - E_g\|A_j^vZg\|_2\big| \lesssim \max_jE_g\|A_j^vZg\|_2 + \sqrt{\log p}\,\max_j\|A_j^vZ\|_{2\to2}. \]
Taking expectations with respect to $Z$ as well, this implies that
\[ G(x) := E\max_j\|A_j^vZg\|_2 \lesssim E_Z\max_jE_g\|A_j^vZg\|_2 + \sqrt{\log p}\,E_Z\max_j\|A_j^vZ\|_{2\to2}. \]


We control the two summands separately. First, note that
\[ E_Z\max_jE_g\|A_j^vZg\|_2 \le E_Z\max_j\sqrt{E_g\|A_j^vZg\|_2^2} = E_Z\max_j\|A_j^vZ\|_F = E_Z\max_j\sqrt{\sum_{i=1}^s\sum_{\ell=1}^\lambda v_\ell^2\,|\langle a_{(j-1)\lambda+\ell},Z_i\rangle|^2} \]
\[ \le E_Z\max_j\sqrt{\sum_{i=1}^s\sum_{\ell=1}^\lambda v_\ell^2\,\|a_{(j-1)\lambda+\ell}\|_\infty^2\|Z_i\|_1^2} \le E_Z\sqrt{\sum_{i=1}^s\sum_{\ell=1}^\lambda v_\ell^2\,\|Z_i\|_1^2} \le \|v\|_2\sqrt s. \]
Meanwhile, we also have $E_Z\max_j\|A_j^vZ\|_{2\to2} \le E_Z\max_j\|A_j^vZ\|_F \le \|v\|_2\sqrt s$, and consequently
\[ G(x) \lesssim \|v\|_2\sqrt{s\log p}. \]
Hence, for $s = \frac{\|v\|_2^2\log p}{u^2}$,
\[ E\Big\|x - \frac1s\sum_{i=1}^sZ_i\Big\|_X \lesssim \frac{\|v\|_2\sqrt{\log p}}{\sqrt s} = u. \]
As stated in the beginning of the proof, this implies that for any $x \in T_k/\sqrt k$ there exists a vector of the form $\frac1s\sum_{i=1}^sZ_i$ such that $\big\|x - \frac1s\sum_{i=1}^sZ_i\big\|_X \lesssim u$. As each $Z_i \in \mathbb R^n$ can take $2n+1$ values, we need at most $(2n+1)^s$ such vectors to cover $T_k/\sqrt k$. We then obtain the desired bound, namely, $N(T_k,\|\cdot\|_X,u) \lesssim (2n+1)^{O(\|v\|_2^2k\log p/u^2)}$. We refer the astute reader to the proof of [28] (Prop. 12.37) for some standard technical details needed to deal with the possibility that $\frac1s\sum_{i=1}^sZ_i$ may not belong to $T_k$.
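Maurey's empirical method is easy to see in action. The sketch below draws the $Z_i$ for a fixed $x \in T_k/\sqrt k$ and tracks how the error of the empirical mean decays like $1/\sqrt s$; the $\ell_2$ norm is used as a simple stand-in for $\|\cdot\|_X$ (which would require the $A_j^v$), so this illustrates the sampling argument rather than the exact bound.

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 100, 8
x = np.zeros(n)
S = rng.choice(n, size=k, replace=False)
x[S] = rng.standard_normal(k)
x /= np.linalg.norm(x)          # unit-norm k-sparse vector (x in T_k)
x /= np.sqrt(k)                 # rescale: now ||x||_1 <= 1, i.e. x in T_k / sqrt(k)

# Maurey sampling: Z = e_j*sign(x_j) with probability |x_j|, Z = 0 with probability 1 - ||x||_1.
probs = np.append(np.abs(x), 1.0 - np.abs(x).sum())
signs = np.append(np.sign(x), 0.0)

def empirical_mean_error(s):
    idx = rng.choice(n + 1, size=s, p=probs)      # index n encodes Z = 0
    Zbar = np.zeros(n)
    for i in idx:
        if i < n:
            Zbar[i] += signs[i]
    return np.linalg.norm(x - Zbar / s)           # l2 stand-in for ||.||_X

for s in [10, 100, 1000, 10000]:
    errs = [empirical_mean_error(s) for _ in range(20)]
    print(f"s={s:6d}:  mean error = {np.mean(errs):.4f}   (compare 1/sqrt(s) = {1/np.sqrt(s):.4f})")
```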

References

[1] N. Ailon and B. Chazelle. The fast Johnson–Lindenstrauss transform and approximate nearest neighbors. SIAM Journal on Computing, 39(1):302–322, 2009.
[2] N. Ailon and E. Liberty. An almost optimal unrestricted fast Johnson–Lindenstrauss transform. ACM Transactions on Algorithms (TALG), 9(3):21, 2013.
[3] R. Baraniuk, M. Davenport, R. DeVore, and M. Wakin. A simple proof of the restricted isometry property for random matrices. Constr. Approx., 28(3):253–263, 2008.
[4] R. G. Baraniuk, S. Foucart, D. Needell, Y. Plan, and M. Wootters. Exponential decay of reconstruction error from binary measurements of sparse signals. IEEE Transactions on Information Theory, 63(6):3368–3385, 2017.
[5] R. G. Baraniuk, S. Foucart, D. Needell, Y. Plan, and M. Wootters. Exponential decay of reconstruction error from binary measurements of sparse signals. IEEE Transactions on Information Theory, 63(6):3368–3385, 2017.
[6] J. J. Benedetto, A. M. Powell, and O. Yilmaz. Sigma-delta quantization and finite frames. IEEE Transactions on Information Theory, 52(5):1990–2005, 2006.

[7] J. Blum, M. Lammers, A. M. Powell, and O. Yılmaz. Sobolev duals in frame theory and sigma-delta quantization. J. Fourier Anal. Appl., 16(3):365–381, 2010.
[8] P. Boufounos and R. Baraniuk. Quantization of sparse representations. In Data Compression Conference, 2007. DCC '07, pages 378–378, March 2007.
[9] P. T. Boufounos, L. Jacques, F. Krahmer, and R. Saab. Quantization and compressive sensing. In Compressed sensing and its applications, Appl. Numer. Harmon. Anal., pages 193–237. Birkhäuser/Springer, Cham, 2015.
[10] J. Bourgain. An improved estimate in the restricted isometry problem. In Geometric aspects of functional analysis, volume 2116 of Lecture Notes in Math., pages 65–70. Springer, Cham, 2014.
[11] M. Bridson, T. Gowers, J. Barrow-Green, and I. Leader. Geometric and combinatorial group theory. The Princeton companion to mathematics, T. Gowers, J. Barrow-Green, and I. Leader, Eds, page 10, 2008.
[12] E. J. Candès, J. K. Romberg, and T. Tao. Stable signal recovery from incomplete and inaccurate measurements. Comm. Pure Appl. Math., 59(8):1207–1223, 2006.
[13] E. J. Candès and T. Tao. Near-optimal signal recovery from random projections: universal encoding strategies? IEEE Trans. Inform. Theory, 52(12):5406–5425, 2006.
[14] B. Carl. Inequalities of Bernstein-Jackson-type and the degree of compactness of operators in Banach spaces. Ann. Inst. Fourier (Grenoble), 35(3):79–118, 1985.
[15] M. Cheraghchi, V. Guruswami, and A. Velingker. Restricted isometry of Fourier matrices and list decodability of random linear codes. SIAM J. Comput., 42(5):1888–1914, 2013.
[16] A. Choromanska, K. Choromanski, M. Bojarski, T. Jebara, S. Kumar, and Y. LeCun. Binary embeddings with structured hashed projections. In International Conference on Machine Learning, pages 344–353, 2016.
[17] E. Chou. Beta-Duals of Frames and Applications to Problems in Quantization. ProQuest LLC, Ann Arbor, MI, 2013. Thesis (Ph.D.)–New York University.
[18] E. Chou and C. S. Güntürk. Distributed noise-shaping quantization: I. Beta duals of finite frames and near-optimal quantization of random measurements. Constr. Approx., 44(1):1–22, 2016.
[19] E. Chou and C. S. Güntürk. Distributed Noise-Shaping Quantization: II. Classical Frames, pages 179–198. Springer International Publishing, Cham, 2017.
[20] E. Chou, C. S. Güntürk, F. Krahmer, R. Saab, and O. Yılmaz. Noise-shaping quantization methods for frame-based and compressive sampling systems. In Sampling theory, a renaissance, Appl. Numer. Harmon. Anal., pages 157–184. Birkhäuser/Springer, Cham, 2015.
[21] I. Daubechies and R. DeVore. Approximating a bandlimited function using very coarsely quantized data: a family of stable sigma-delta modulators of arbitrary order. Ann. of Math. (2), 158(2):679–710, 2003.

[22] P. Deift, F. Krahmer, and C. S. Güntürk. An optimal family of exponentially accurate one-bit sigma-delta quantization schemes. Communications on Pure and Applied Mathematics, 64(7):883–919, 2011.
[23] S. Dirksen, H. C. Jung, and H. Rauhut. One-bit compressed sensing with partial Gaussian circulant matrices. arXiv preprint arXiv:1710.03287, 2017.
[24] S. Dirksen and A. Stollenwerk. Fast binary embeddings with Gaussian circulant matrices: improved bounds. Discrete Comput. Geom., to appear, arXiv:1608.06498, 2016.
[25] D. L. Donoho. Compressed sensing. IEEE Trans. Inform. Theory, 52(4):1289–1306, 2006.
[26] J. Feng and F. Krahmer. An RIP approach to sigma-delta quantization for compressed sensing. IEEE Signal Process. Lett., 21(11):1351–1355, 2014.
[27] J.-M. Feng, F. Krahmer, and R. Saab. Quantized compressed sensing for partial random circulant matrices. arXiv preprint arXiv:1702.04711, 2017.
[28] S. Foucart and H. Rauhut. A mathematical introduction to compressive sensing. Applied and Numerical Harmonic Analysis. Birkhäuser/Springer, New York, 2013.
[29] Y. Gong, S. Lazebnik, A. Gordo, and F. Perronnin. Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(12):2916–2929, 2013.
[30] C. S. Güntürk. One-bit sigma-delta quantization with exponential accuracy. Comm. Pure Appl. Math., 56(11):1608–1630, 2003.
[31] C. S. Güntürk, M. Lammers, A. M. Powell, R. Saab, and O. Yılmaz. Sobolev duals for random frames and Σ∆ quantization of compressed sensing measurements. Found. Comput. Math., 13(1):1–36, 2013.
[32] J. P. Haldar, D. Hernando, and Z.-P. Liang. Compressed-sensing MRI with random encoding. IEEE Transactions on Medical Imaging, 30(4):893–903, 2011.
[33] J. Haupt, W. U. Bajwa, G. Raz, and R. Nowak. Toeplitz compressed sensing matrices with applications to sparse channel estimation. IEEE Trans. Inform. Theory, 56(11):5862–5875, 2010.
[34] I. Haviv and O. Regev. The restricted isometry property of subsampled Fourier matrices. In Geometric Aspects of Functional Analysis, pages 163–179. Springer, 2017.
[35] T. Huynh. Accurate Quantization in Redundant Systems: From Frames to Compressive Sampling and Phase Retrieval. PhD thesis, New York University, 2016.
[36] L. Jacques. Small width, low distortions: quantized random embeddings of low-complexity sets. IEEE Transactions on Information Theory, 63(9):5477–5495, 2017.
[37] L. Jacques and V. Cambareri. Time for dithering: fast and quantized random embeddings via the restricted isometry property. Information and Inference: A Journal of the IMA, 2017.


[38] L. Jacques, J. N. Laska, P. T. Boufounos, and R. G. Baraniuk. Robust 1-bit compressive sensing via binary stable embeddings of sparse vectors. IEEE Trans. Inform. Theory, 59(4):2082–2102, 2013.
[39] W. B. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. In Conference in modern analysis and probability (New Haven, Conn., 1982), volume 26 of Contemp. Math., pages 189–206. Amer. Math. Soc., Providence, RI, 1984.
[40] F. Krahmer, S. Mendelson, and H. Rauhut. Suprema of chaos processes and the restricted isometry property. Comm. Pure Appl. Math., 67(11):1877–1904, 2014.
[41] F. Krahmer, R. Saab, and R. Ward. Root-exponential accuracy for coarse quantization of finite frame expansions. IEEE Trans. Inform. Theory, 58(2):1069–1079, 2012.
[42] F. Krahmer, R. Saab, and O. Yılmaz. Sigma-Delta quantization of sub-Gaussian frame expansions and its application to compressed sensing. Inf. Inference, 3(1):40–58, 2014.
[43] F. Krahmer and R. Ward. New and improved Johnson–Lindenstrauss embeddings via the restricted isometry property. SIAM Journal on Mathematical Analysis, 43(3):1269–1281, 2011.
[44] M. Lammers, A. M. Powell, and O. Yılmaz. Alternative dual frames for digital-to-analog conversion in sigma-delta quantization. Adv. Comput. Math., 32(1):73–102, 2010.
[45] W. Liu, J. Wang, S. Kumar, and S.-F. Chang. Hashing with graphs. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1–8, 2011.
[46] M. Lustig, D. Donoho, and J. M. Pauly. Sparse MRI: the application of compressed sensing for rapid MR imaging. Magnetic Resonance in Medicine, 58(6):1182–1195, 2007.
[47] S. Mendelson, A. Pajor, and N. Tomczak-Jaegermann. Uniform uncertainty principle for Bernoulli and subgaussian ensembles. Constr. Approx., 28(3):277–289, 2008.
[48] S. Mendelson, H. Rauhut, and R. Ward. Improved bounds for sparse recovery from subsampled random convolutions. arXiv preprint arXiv:1610.04983, 2016.
[49] M. Murphy, M. Alley, J. Demmel, K. Keutzer, S. Vasanawala, and M. Lustig. Fast ℓ1-SPIRiT compressed sensing parallel imaging MRI: Scalable parallel implementation and clinically feasible runtime. IEEE Transactions on Medical Imaging, 31(6):1250–1262, 2012.
[50] J. Nelson, E. Price, and M. Wootters. New constructions of RIP matrices with fast multiplication and fewer rows. In Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1515–1528. ACM, New York, 2014.
[51] S. Oymak, B. Recht, and M. Soltanolkotabi. Isometric sketching of any set via the restricted isometry property. Information and Inference, to appear, 2018.
[52] S. Oymak, C. Thrampoulidis, and B. Hassibi. Near-optimal sample complexity bounds for circulant binary embedding. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on, pages 6359–6363. IEEE, 2017.


[53] G. E. Pfander, H. Rauhut, and J. A. Tropp. The restricted isometry property for time-frequency structured random matrices. Probab. Theory Related Fields, 156(3-4):707–737, 2013.
[54] Y. Plan and R. Vershynin. Robust 1-bit compressed sensing and sparse logistic regression: a convex programming approach. IEEE Trans. Inform. Theory, 59(1):482–494, 2013.
[55] Y. Plan and R. Vershynin. Dimension reduction by random hyperplane tessellations. Discrete Comput. Geom., 51(2):438–461, 2014.
[56] M. Raginsky and S. Lazebnik. Locality-sensitive binary codes from shift-invariant kernels. In Advances in Neural Information Processing Systems, pages 1509–1517, 2009.
[57] H. Rauhut. Stability results for random sampling of sparse trigonometric polynomials. IEEE Transactions on Information Theory, 54(12):5661–5670, Dec 2008.
[58] H. Rauhut, J. Romberg, and J. A. Tropp. Restricted isometries for partial random circulant matrices. Applied and Computational Harmonic Analysis, 32(2):242–254, 2012.
[59] J. Romberg. Compressive sensing by random convolution. SIAM Journal on Imaging Sciences, 2(4):1098–1128, 2009.
[60] M. Rudelson and R. Vershynin. On sparse reconstruction from Fourier and Gaussian measurements. Comm. Pure Appl. Math., 61(8):1025–1045, 2008.
[61] R. Saab, R. Wang, and Ö. Yılmaz. From compressed sensing to compressed bit-streams: practical encoders, tractable decoders. IEEE Transactions on Information Theory, 2017.
[62] R. Saab, R. Wang, and O. Yılmaz. Quantization of compressive samples with stable and robust recovery. Applied and Computational Harmonic Analysis, 44(1):123–143, 2018.
[63] R. Salakhutdinov and G. Hinton. Semantic hashing. International Journal of Approximate Reasoning, 50(7):969–978, 2009.
[64] R. Schreier and G. C. Temes. Understanding delta-sigma data converters. Wiley, New York, NY, 2005.
[65] M. Talagrand. The generic chaining: upper and lower bounds of stochastic processes. Springer Science & Business Media, 2006.
[66] J. A. Tropp, M. B. Wakin, M. F. Duarte, D. Baron, and R. G. Baraniuk. Random filters for compressive sampling and reconstruction. In Acoustics, Speech and Signal Processing, 2006. ICASSP 2006 Proceedings. 2006 IEEE International Conference on, volume 3, pages III–III. IEEE, 2006.
[67] S. S. Vasanawala, M. T. Alley, B. A. Hargreaves, R. A. Barth, J. M. Pauly, and M. Lustig. Improved pediatric MR imaging with compressed sensing. Radiology, 256(2):607–616, 2010.
[68] R. Wang. Sigma delta quantization with harmonic frames and partial Fourier ensembles. Journal of Fourier Analysis and Applications, Dec 2017.
[69] Y. Weiss, A. Torralba, and R. Fergus. Spectral hashing. In Advances in Neural Information Processing Systems, pages 1753–1760, 2009.

[70] X. Yi, C. Caramanis, and E. Price. Binary embedding: Fundamental limits and fast algorithm. In F. Bach and D. Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 2162–2170, Lille, France, 07–09 Jul 2015. PMLR.
[71] F. X. Yu, A. Bhaskara, S. Kumar, Y. Gong, and S.-F. Chang. On binary embedding using circulant matrices. arXiv preprint arXiv:1511.06480, 2015.

