ON THE CONDITIONING OF RANDOM SUBDICTIONARIES

JOEL A. TROPP

Abstract. An important problem in the theory of sparse approximation is to identify well-conditioned subsets of vectors from a general dictionary. In most cases, current results do not apply unless the number of vectors is smaller than the square root of the ambient dimension, so these bounds are too weak for many applications. This paper shatters the square-root bottleneck by focusing on random subdictionaries instead of arbitrary subdictionaries. It provides explicit bounds on the extreme singular values of random subdictionaries that hold with overwhelming probability. The results are phrased in terms of the coherence and spectral norm of the dictionary, which capture information about its global geometry. The proofs rely on standard tools from the area of Banach space probability. As an application, the paper shows that the conditioning of a subdictionary is the major obstacle to the uniqueness of sparse representations and the success of $\ell_1$ minimization techniques for signal recovery. Indeed, if a fixed subdictionary is well conditioned and its cardinality is slightly smaller than the ambient dimension, then a random signal formed from this subdictionary almost surely has no other representation that is equally sparse. Moreover, with overwhelming probability, the maximally sparse representation can be identified via $\ell_1$ minimization. Note that the results in this paper are not directly comparable with recent work on subdictionaries of random dictionaries.
1. Introduction

To motivate the results of this article, we begin with sparse representation, the problem of finding a sparse solution to an underdetermined system of linear equations. Fix a dictionary $\Phi$, which is a $d \times N$ complex matrix whose columns have unit $\ell_2$ norm. Suppose we are given a signal $s$ that is formed as a linear combination of $m$ columns from $\Phi$:
$$s = \Phi c \quad\text{where}\quad \|c\|_0 = m.$$
(The $\ell_0$ quasi-norm $\|\cdot\|_0$ counts the number of nonzero components in its argument.) The sparse representation problem asks us to determine the vector $c$ of coefficients, given only the dictionary $\Phi$ and the observed signal $s$. In particular, we must locate the nonzero components of $c$ to determine which columns participate in the signal. Observe that the active columns must form a linearly independent set before this problem is even well posed. One of the central challenges in sparse representation, therefore, is to identify the linearly independent collections of columns from a dictionary. In fact, many applications demand stronger information on the conditioning of column submatrices in the form of bounds for their extreme singular values. Ideally, these results should depend only on global properties of the dictionary matrix and on the number of columns extracted.

Date: August 2006. Revised: 28 August 2007.
Key words and phrases. Banach space probability, Basis Pursuit, coherence, convex optimization, dictionary, random matrix, regularization, sparsity, uncertainty principle, underdetermined linear system.
JAT is with Applied & Computational Mathematics, MC 217-50, California Institute of Technology, 1200 E. California Blvd., Pasadena, CA 91125-5000. E-mail: [email protected]. This work was supported in part by NSF DMS 0503299 and by DARPA/ONR N660010612011.
The coherence of a dictionary encapsulates information about its global geometry, and it can be used to study the conditioning of its column submatrices. The coherence is defined by
$$\mu \overset{\text{def}}{=} \max_{j \neq k} |\langle \varphi_j, \varphi_k \rangle|$$
where $\varphi_k$ denotes the $k$th column of $\Phi$. In words, $\mu$ bounds the cosine of the acute angle between pairs of columns from the dictionary. The following easy result connects the coherence with the extreme singular values of a column submatrix from the dictionary. Here and elsewhere, $\|\cdot\|$ denotes the spectral norm (i.e., the norm on linear maps from $\ell_2$ to $\ell_2$), and the letter $I$ represents a conformal identity matrix.

Proposition 1. Let $\Phi$ be a dictionary with coherence $\mu$, and let $A$ be an arbitrary $m$-column submatrix of $\Phi$. Then
$$\|A^*A - I\| \leq (m-1)\mu.$$
In particular, every collection of $m$ columns is linearly independent when $(m-1)\mu < 1$.

To interpret this theorem, one should note that
$$\|A^*A - I\| = \max\left\{\sigma_{\max}^2(A) - 1,\; 1 - \sigma_{\min}^2(A)\right\},$$
where $\sigma_{\max}^2(A)$ and $\sigma_{\min}^2(A)$ indicate the extreme eigenvalues of $A^*A$.
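To make these quantities concrete, here is a minimal numerical sketch (our own illustration, not from the paper) that computes the coherence of a dictionary and checks the bound of Proposition 1 on one random column submatrix. The Gaussian dictionary is a hypothetical test case chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, m = 64, 128, 8

# A hypothetical dictionary: Gaussian columns normalized to unit l2 norm.
Phi = rng.standard_normal((d, N))
Phi /= np.linalg.norm(Phi, axis=0)

# Coherence: largest off-diagonal Gram entry in magnitude.
G = Phi.T @ Phi
mu = np.max(np.abs(G - np.eye(N)))

# Check ||A*A - I|| <= (m - 1) * mu for one m-column submatrix.
cols = rng.choice(N, size=m, replace=False)
A = Phi[:, cols]
lhs = np.linalg.norm(A.T @ A - np.eye(m), 2)
print(lhs, (m - 1) * mu)  # lhs should not exceed (m - 1) * mu
```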
Proof sketch. Apply Gershgorin's Theorem to the Gram matrix $A^*A$. See [DE03, Thm. 5] for details.

This result is quite attractive, but its utility is limited by the fact that the coherence cannot be very small [SH03, Thm. 2.3]:
$$\mu \geq \sqrt{\frac{N-d}{d(N-1)}}. \tag{1.1}$$
When $N \geq 2d$, it follows that $\mu \geq (2d-1)^{-1/2}$. In this common parameter regime, Proposition 1 only yields information on the smallest singular value when $m = O(\sqrt{d})$. It would be valuable to understand larger collections of columns, but deterministic methods cannot provide a much better assessment than Proposition 1. Instead of asking about arbitrary sets of columns, therefore, we study random collections of columns from the dictionary. This change of focus allows us to shatter the square-root bottleneck.

1.1. Contributions. This paper provides explicit bounds for the extreme singular values of random collections of columns from a general dictionary. The results are phrased in terms of the global geometry of the dictionary, and they are vastly superior to the simple bounds of Proposition 1. The proofs involve standard methods from the field of Banach space probability, such as the systematic use of symmetrization and (noncommutative) Khintchine inequalities. Although the approach is not new, this paper provides satisfying answers to fundamental questions, and we believe that both the results and the methods will be valuable to the computational harmonic analysis community.

1.2. Outline. A brief outline is standard at this point. Section 2 offers a snapshot of the most important theorems, and Section 3 compares our findings with earlier work from the fields of sparse approximation and Banach space theory. The key technical tools are introduced without proof in Section 4. Sections 5 and 6 develop the main results upon these foundations. Sections 7 and 8 discuss applications of these results to problems in sparse representation. Finally, Sections 9 and 10 contain complete proofs of the technical theorems, which is valuable because the functional analysis literature tends to be indifferent about the precise values of numerical constants.
2. Background and Major Results

This section presents some background material and sets the notation for the rest of the work. Then it states four representative theorems that summarize the major contributions of the paper.

2.1. Background and notation. For each $p \geq 1$, we write $\|\cdot\|_p$ for the usual $\ell_p$ vector norm, while the symbol $\|\cdot\|_{p,q}$ denotes the norm on linear maps from $\ell_p$ to $\ell_q$. In this work, two norms are especially important:
• The quantity $\|A\|_{1,2}$ is the maximum $\ell_2$ norm of a column of $A$.
• The spectral norm $\|A\|_{2,2}$ returns the largest singular value of $A$. It is always written as $\|\cdot\|$ without any subscripts.
For a natural number $N$, abbreviate the set $\{1, 2, \ldots, N\}$ by the symbol $\llbracket N \rrbracket$. Given a subset $\Omega$ of $\llbracket N \rrbracket$, we write $\mathbb{C}^\Omega$ for the set of functions mapping $\Omega$ to $\mathbb{C}$, equipped with the usual addition and scalar multiplication to form a linear space.

The symbol $\mathbb{P}\{\cdot\}$ denotes the probability operator, which returns the probability of a given event. The letter $\mathbb{E}$ indicates the expectation operator. We use a special notation for conditional expectation. For example, if $Z$ is a random variable, then $\mathbb{E}_Z$ denotes integration with respect to $Z$, holding other variables fixed. The (positive homogeneous) $q$th moment of a random variable is computed via the expression $(\mathbb{E}|\cdot|^q)^{1/q}$. This expression defines the $L^q$ norm on the space of complex-valued random variables. In particular, it verifies a triangle inequality:
$$(\mathbb{E}|Y + Z|^q)^{1/q} \leq (\mathbb{E}|Y|^q)^{1/q} + (\mathbb{E}|Z|^q)^{1/q}.$$
Throughout this work, the word random, without additional qualification, always means "uniformly random over the specified set." For example, the phrase "a random $m$-column submatrix" means that each $m$-column submatrix is drawn with equal probability. We follow the analysts' convention that upright Roman letters (c, C, D, ...) denote absolute constants that may change at each appearance. Finally, be aware that all logarithms in this paper have base e.

2.2. The dictionary. The term dictionary refers to a $d \times N$ complex matrix $\Phi$ whose columns have unit $\ell_2$ norm. Without loss of generality, we may assume that the columns of $\Phi$ span $\mathbb{C}^d$, which implies that $N \geq d$. A column submatrix of $\Phi$ is referred to as a subdictionary. If $T$ is a subset of $\llbracket N \rrbracket$, we sometimes write $\Phi_T$ for the subdictionary consisting of the columns in $T$.

After the coherence, the spectral norm $\|\Phi\|$ is the most important geometric quantity associated with the dictionary. It measures how much the dictionary matrix can dilate a unit-norm coefficient vector, so it reflects how much the columns of $\Phi$ are "spread out." Using Hölder's inequality, one may develop a lower bound on the spectral norm of the dictionary:
$$\|\Phi\|^2 = \|\Phi^*\Phi\| \geq d^{-1}\operatorname{trace}(\Phi^*\Phi) = \frac{N}{d}. \tag{2.1}$$
When equality holds in this relation, the dictionary is called a unit-norm tight frame. Equivalently, the rows of $\Phi$ are mutually orthogonal vectors with equal norms.
2.3. Pairs of orthonormal bases. One frequently encounters dictionaries that are formed as a concatenation of two orthonormal bases: $\Phi = [\Phi_1 \ \Phi_2]$, where $\Phi_1$ and $\Phi_2$ are both $d \times d$ unitary matrices. The paradigmatic example is the Fourier–Dirac dictionary, in which $\Phi_1$ is the discrete Fourier transform matrix and $\Phi_2$ is the identity matrix. This dictionary is connected with discrete uncertainty principles [DS89, DH01].
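As a quick illustration (our own sketch, not part of the paper), the following code builds the Fourier–Dirac dictionary and confirms numerically that its coherence equals $d^{-1/2}$.

```python
import numpy as np

d = 16
F = np.fft.fft(np.eye(d)) / np.sqrt(d)  # unitary DFT matrix: unit-norm columns
Phi = np.hstack([F, np.eye(d)])         # Fourier-Dirac dictionary, d x 2d

G = Phi.conj().T @ Phi
mu = np.max(np.abs(G - np.eye(2 * d)))
print(mu, 1 / np.sqrt(d))  # both values should equal d**(-1/2)
```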
The coherence of a pair of orthonormal bases is calculated as
$$\mu = \max_{j,k}\left|\left\langle \varphi_j^{(1)}, \varphi_k^{(2)} \right\rangle\right| \quad\text{for } j, k = 1, 2, \ldots, d.$$
In this setting, the coherence satisfies a more stringent bound:
$$\mu \geq d^{-1/2}. \tag{2.2}$$
Note that the Fourier–Dirac dictionary has coherence $\mu = d^{-1/2}$. For a pair of orthonormal bases, we can strengthen Proposition 1 substantially [DH01, DE03].

Proposition 2. Let $\Phi$ be a pair of orthonormal bases. Suppose the subdictionary $A$ contains $|\Omega|$ columns from the first basis and $|T|$ columns from the second basis. Then
$$\|A^*A - I\| \leq \mu\sqrt{|\Omega|\,|T|}.$$
In particular, $A$ has linearly independent columns whenever $|\Omega|\,|T| < \mu^{-2}$.

Proof sketch. Let $B$ denote the off-diagonal block of $A^*A$. Observe that $\|A^*A - I\| = \|B\|$, and apply the estimate $\|B\|^2 \leq \|B\|_{1,1}\,\|B\|_{\infty,\infty}$.

2.4. Results for incoherent dictionaries. This subsection presents two central results about the conditioning of random subdictionaries drawn from an incoherent dictionary. The proofs appear in Section 6. The first theorem focuses on the special case of a unit-norm tight frame with typical coherence. These hypotheses enable us to offer a transparent statement that illuminates the pessimism of the predictions in Proposition 1.

Theorem A. Let $\Phi$ be a unit-norm tight frame for $\mathbb{C}^d$ with at least $2d$ columns. Suppose that $X$ is a random $m$-column subdictionary. Then
$$\mathbb{E}\,\|X^*X - I\| \leq C\sqrt{\mu^2 m \log(m+1)},$$
provided that the right-hand side is less than one.

It is instructive to compare Theorem A with Proposition 1. Consider the important case where $\mu = d^{-1/2}$. The theorem predicts that most $m$-column subdictionaries are linearly independent when $m = \mathrm{const} \cdot d/\log d$. In sharp contrast, the proposition allows that all $m$-column subdictionaries are linearly independent only when $m \leq \sqrt{d}$. The price we pay for the better bound is a weakening from certainty about the conditioning to near certainty. Our arguments do not yield a good estimate for the constant; the proof delivers a value of about 12. As a consequence, the new result does not quantitatively improve on the older result until the dimension $d$ is quite large.

Theorem A is straightforward, but it does not take full advantage of the information available. In particular, our methods yield strong estimates on the probability that a random subdictionary is well conditioned. Here is a more detailed (i.e., complicated) result.

Theorem B. Let $\Phi$ be a dictionary, and let $X$ be a random $m$-column subdictionary. The condition
$$\sqrt{\mu^2 m \log(m+1) \cdot s} + \frac{m}{N}\,\|\Phi\|^2 \leq c\delta \quad\text{with } s \geq 1$$
implies that $\mathbb{P}\{\|X^*X - I\| \geq \delta\} \leq m^{-s}$.

In words, the number $\delta$ controls the range of the extreme singular values of a random subdictionary. The probability that a subdictionary is ill conditioned drops exponentially fast as the parameter $s$ increases. This result also exposes the participation of the spectral norm, which was hidden in Theorem A because of the hypotheses.
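Before comparing these theorems further, a Monte Carlo sketch (our own experiment under stated assumptions, not from the paper) illustrates the phenomenon behind Theorems A and B: a harmonic frame, built from rows of a DFT matrix, is a convenient unit-norm tight frame for testing.

```python
import numpy as np

rng = np.random.default_rng(1)
d, N, m, trials = 64, 128, 32, 200

# Harmonic frame: d rows of the N x N DFT, rescaled so columns have unit norm.
k = np.arange(N)
rows = rng.choice(N, size=d, replace=False)
Phi = np.exp(-2j * np.pi * np.outer(rows, k) / N) / np.sqrt(d)

vals = []
for _ in range(trials):
    cols = rng.choice(N, size=m, replace=False)
    X = Phi[:, cols]
    vals.append(np.linalg.norm(X.conj().T @ X - np.eye(m), 2))
print(np.mean(vals))  # typically far below the (m-1)*mu Gershgorin bound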
To appreciate why Theorem B improves on Proposition 1, just compare the term $\mu^2 m \log m$ here against the term $\mu m$ in the older result. On account of (2.1), we also have the bound
$$\frac{m}{N}\,\|\Phi\|^2 \geq \frac{m}{d}.$$
This contribution ensures that—no matter how small the coherence—the estimate of the smallest singular value is vacuous when the size of the subdictionary is close to the ambient dimension. Of course, this condition is natural because we cannot construct a linearly independent collection of more than $d$ vectors in $d$ dimensions.

2.5. Results for pairs of orthonormal bases. The special structure of a pair of orthonormal bases allows us to prove theorems that are essentially different, and it also yields improved constants. This section describes two major results that are established in Section 5. The first theorem estimates the expected spread of the extreme singular values of a partially random subdictionary.

Theorem C. Let $\Phi$ be a pair of orthonormal bases, and suppose that $X$ contains $|\Omega|$ arbitrary columns from the first basis and $|T|$ random columns from the second basis. Then
$$\mathbb{E}\,\|X^*X - I\| \leq C\sqrt{\mu^2|\Omega|\log(m+1)} + \sqrt{\frac{|T|}{d}}$$
where $m = \min\{|\Omega|, |T|\}$.

This result should be compared with Proposition 2. Consider the case where $\mu = d^{-1/2}$. Theorem C allows the subdictionary to contain $|\Omega| = \mathrm{const} \cdot d/\log d$ arbitrary columns from the first basis and $|T| = \mathrm{const} \cdot d$ random columns from the second basis. In sharp contrast, the proposition requires that the subdictionary contain no more than $\sqrt{d}$ columns from at least one basis.

We would like to emphasize several curious features of Theorem C. First, the coherence and the logarithm only limit the size of the arbitrary basis; the number of randomly chosen columns can be proportional to the ambient dimension. Second, the logarithm only involves the cardinality of the smaller set of columns. Finally, note that it is possible to obtain strong probability estimates, even though this point is not immediately apparent from the statement of the theorem.

Here is another result for pairs of orthonormal bases that provides a probability estimate and is directly comparable with earlier work. In the next section, we weigh this theorem against results of Candès and Romberg [CR06].

Theorem D. Assume $d \geq 3$. Let $\Phi$ be a pair of orthonormal bases, and let $X$ consist of $|\Omega|$ arbitrary columns from the first basis and $|T|$ random columns from the second basis. If
$$|\Omega| + |T| \leq \frac{c\mu^{-2}}{s\log d} \quad\text{with } s \geq 1,$$
then $\mathbb{P}\{\|X^*X - I\| \geq 0.5\} \leq d^{-s}$. The constant $c$ is no smaller than 0.004212.

2.6. Applications and Extensions. This paper also develops applications of these theorems in the field of sparse representation. Consider a subdictionary that is well conditioned and has cardinality slightly smaller than the ambient dimension. We show that a random signal formed from this subdictionary almost surely has no other representation that is equally sparse. Moreover, with overwhelming probability, the maximally sparse representation can be identified by $\ell_1$ minimization. Detailed statements of these results require some background, so we refer the reader to Sections 7 and 8.

It is also important to be aware that the work in this paper is tailored to the case where the dictionary is "uniformly coherent." That is, many of the inner products between columns are close to the coherence bound. When the dictionary has a different type of correlation structure, the
theory here may give poor results. In particular, it is possible to develop much stronger estimates for localized dictionaries—those whose Gram matrix is concentrated near the diagonal. These conclusions can be reached with similar methods, but we have chosen not to present them for several reasons. The theorems are more complicated and less compelling; the proofs are substantially longer; and the constants are awful. If there is sufficient interest, we may pursue this line of inquiry in a future paper.

3. Related Work

This section describes two strands of related work: one from the field of sparse approximation and the second from the field of Banach space theory. The former provides us our motivations, while the latter equips us with the technical tools to solve the problems.

3.1. Results from sparse approximation. The earliest work on the behavior of subdictionaries can be traced to the paper of Donoho and Stark [DS89] about applications of uncertainty principles in signal processing. Their results include a proof of Proposition 2 for the Fourier–Dirac dictionary. The works [DH01, EB02, GN03] broadened this investigation to cover other pairs of orthonormal bases, resulting in Proposition 1 and the general form of Proposition 2.

Early work of Fuchs [Fuc04] hinted that certain results on sparse representation could be strengthened substantially by shifting to a probabilistic viewpoint. The first explicit results in this direction were established by Candès, Romberg, and Tao. In particular, they developed a probabilistic uncertainty principle for the Fourier–Dirac dictionary that outstrips Proposition 2 [CRT06]. The following result, which is representative, appears in [CR06].

Proposition 3 (Candès–Romberg–Tao). Assume that $d \geq 512$. Suppose that $X$ contains $|\Omega|$ arbitrary columns from the Fourier basis and $|T|$ random columns from the Dirac basis. For each $s \geq 1$,
$$|\Omega| + |T| \leq \frac{cd}{\sqrt{(s+1)\log d}}$$
implies that $\mathbb{P}\{\|X^*X - I\| \geq 0.5\} \leq Cd^{-s}\log d$. The constant $c$ is no smaller than 0.2660.

This theorem states that almost all collections of columns from the Fourier–Dirac dictionary are well conditioned, provided that the total number of columns is smaller than the dimension $d$ by a factor of $\sqrt{\log d}$. In contrast, Theorem D requires that we back away from the dimension by a factor of $\log d$. The present methods are not perspicuous enough to reproduce this result.

Candès and Romberg also state a result for general pairs of orthonormal bases [CR06].

Proposition 4 (Candès–Romberg). Suppose that $\Phi$ is a pair of orthonormal bases whose coherence satisfies
$$\mu \leq \frac{1}{\sqrt{2(s+1)\log d}}.$$
Assume that each column of the first basis appears in $X$ independently with probability $d^{-1}|\Omega|$ and each column of the second basis appears independently with probability $d^{-1}|T|$. If
$$|\Omega| + |T| \leq \frac{c\mu^{-2}}{((s+1)\log d)^{5/2}},$$
then $\mathbb{P}\{\|X^*X - I\| \geq 0.5\} \leq Cd^{-s}\log d$.
Theorem D improves on this proposition in several ways. Our result removes the coherence restriction; it reduces the power on the logarithm; it improves the probability decay; and it allows us to choose an arbitrary set of columns from one basis.

We believe that there are no direct precedents in the sparse approximation literature for the results on general dictionaries, Theorems A and B. The results of Candès–Romberg–Tao actually come from a different strand of research about the properties of random dictionaries. Over the last two years, this area has seen impressive contributions from a variety of researchers. A partial list of important works includes [CT06, RV06, BDDW07, CDD06, DT06]. The results in this paper come from a different tradition that focuses on sparse approximation with respect to a fixed, deterministic dictionary. Structured dictionaries appear in many applications, which is why our results are valuable.

3.2. Results from Banach space geometry. The Banach space literature does not directly address the problem we are studying, but Rudelson has completely solved the dual problem [Rud99] by building on earlier work of Bourgain [Bou99]. One version of the result follows.

Proposition 5 (Rudelson). Let $\Phi$ be a unit-norm tight frame. Let $X$ be a random collection of $m$ columns from $\Phi$. Then
$$\mathbb{E}\left\|\frac{1}{m}XX^* - \frac{1}{d}I\right\| \leq C\sqrt{\frac{d\log d}{m}},$$
provided that the right-hand side does not exceed one.

Since $\Phi$ is a tight frame, its normalized inertia matrix $N^{-1}\Phi\Phi^* = d^{-1}I$. This proposition states that we can approximate the normalized inertia matrix by drawing $m = \mathrm{const} \cdot d\log d$ random columns and computing a sample average. The result is qualitatively optimal, and it completes a line of inquiry initiated by Kannan et al. [KLS97]. For recent progress on a closely related problem, see [Aub07]. The present work relies strongly on Rudelson's techniques.

It is also interesting to view our work in the light of the Restricted Invertibility Principle of Bourgain and Tzafriri. They prove that every matrix whose columns are not too small and whose spectral norm is not too large must contain a large column submatrix that is well conditioned [BT87].

Proposition 6 (Restricted Invertibility Principle). Let $\Phi$ be a dictionary. There exists a collection $T$ of columns with cardinality
$$|T| \geq \frac{cN}{\|\Phi\|^2}$$
whose smallest singular value is bounded away from zero:
$$\|\Phi_T x\|_2 \geq c\,\|x\|_2 \quad\text{for all } x.$$

This remarkable theorem has applications in harmonic analysis and Banach space geometry. Other authors have extended this work; see [Ver05] and its bibliography for recent progress.

A result with such weak hypotheses can only guarantee the existence of a well-conditioned subdictionary. Our goal is to understand when well-conditioned subdictionaries are prevalent, and this investigation requires additional hypotheses. To verify this point, consider a dictionary $\Phi$ that consists of two copies of the same orthonormal basis for $\mathbb{C}^d$. It is easy to find orthonormal subdictionaries, but a random subdictionary with expected cardinality $m$ usually contains duplicated columns when $2m \geq \sqrt{d}$. One may explain this phenomenon by noting that the coherence of the dictionary is heinous ($\mu = 1$), even though the spectral norm is small ($\|\Phi\| = \sqrt{2}$).

After the present paper was complete, we learned of another paper by Bourgain and Tzafriri [BT91] that bears directly on our work. Here is a simplified version of their result.
Proposition 7 (Bourgain–Tzafriri). Let $A$ be an $n \times n$ matrix whose norm $\|A\| \leq 1$ and whose entries satisfy
$$|a_{jk}| \leq \frac{1}{\log^2 n}.$$
From $A$, draw a random principal submatrix $B$ with dimensions $cn \times cn$. Then
$$\mathbb{P}\{\|B\| \geq 0.5\} \leq n^{-c}.$$

By applying this proposition to the hollow Gram matrix of a dictionary, it is possible to obtain estimates for the conditioning of random subdictionaries that are much stronger than the ones presented here. Nevertheless, we feel that our results remain valuable. The primary advantage of our approach is its versatility and accessibility. In fact, we have used the methods of this paper to develop a modern proof of Proposition 7 [Tro06b]. A secondary advantage is that our techniques yield explicit and somewhat reasonable constants.

4. The Toolbox

Our investigation centers around expressions of the form $\|RAR^*\|$ where $A$ is a fixed matrix and $R$ is a restriction to a randomly chosen set of coordinates. We must understand the habitat and behavior of this animal. What is its expectation? What is the probability that it achieves an unusually large value? These questions are answered with methods from the field of probability in Banach spaces [LT91]. The technologies are standard, although most of the results have not appeared in the precise form given here. Major sources for this work include [LP86, BT91, LT91, Rud99, Buc01, RV07]; more detailed citations appear throughout the paper.

4.1. Restriction maps. Before we continue, let us introduce a little more notation. Let $\Omega$ be a subset of $\llbracket N \rrbracket$. We may define a restriction map
$$R_\Omega : \mathbb{C}^N \to \mathbb{C}^\Omega \quad\text{via the rule}\quad (R_\Omega f)(\omega) = f(\omega) \quad\text{for } \omega \in \Omega.$$
That is, the map $R_\Omega$ limits a vector in $\mathbb{C}^N$ to the components in $\Omega$. The adjoint is an extension map with the action
$$(R_\Omega^* f)(k) = \begin{cases} f(k) & \text{when } k \in \Omega \\ 0 & \text{when } k \in \llbracket N \rrbracket \setminus \Omega. \end{cases}$$
In words, the map $R_\Omega^*$ extends a vector from $\mathbb{C}^\Omega$ to $\mathbb{C}^N$ by padding it with zeros. We often drop the subscript from the restriction if it is obvious or irrelevant. In this work, the term "restriction" always refers to a coordinate restriction.
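In matrix language, $R_\Omega$ is the identity with the rows outside $\Omega$ deleted, so restriction and extension reduce to index slicing. The sketch below (our own illustration; the names `restrict` and `extend` are hypothetical) implements both maps and checks the adjoint relation $\langle R_\Omega f, g\rangle = \langle f, R_\Omega^* g\rangle$.

```python
import numpy as np

rng = np.random.default_rng(2)
N, Omega = 10, [1, 4, 7]

def restrict(f, Omega):
    """R_Omega: keep only the coordinates listed in Omega."""
    return f[Omega]

def extend(g, Omega, N):
    """R_Omega^*: place g on the coordinates in Omega, zero elsewhere."""
    f = np.zeros(N, dtype=g.dtype)
    f[Omega] = g
    return f

f = rng.standard_normal(N)
g = rng.standard_normal(len(Omega))
# Adjoint check: <R f, g> equals <f, R* g>.
print(np.dot(restrict(f, Omega), g), np.dot(f, extend(g, Omega, N)))
```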
4.2. Key Technical Results. Our most important theorem is adapted from work of Rudelson and Vershynin [RV07], who build essentially on earlier work of Rudelson [Rud99]. This theorem gives information about the spectral norm of a matrix that has been restricted to a random collection of columns.

Theorem 8 (Spectral Norm of a Random Compression). Suppose that $A$ is a matrix with $N$ columns, and let $R$ be a restriction to $m$ coordinates, chosen at random from $\llbracket N \rrbracket$. Fix $q \geq 1$. For each $p \geq \max\{2, 2\log(\operatorname{rank} AR^*), q/2\}$, it holds that
$$(\mathbb{E}\,\|AR^*\|^q)^{1/q} \leq 3\sqrt{p}\,\|A\|_{1,2} + \sqrt{\frac{m}{N}}\,\|A\|.$$
This theorem tells us that the spectral norm of a random submatrix of $A$ carries its share of the spectral norm of the entire matrix, plus an additional component that depends on the size of the columns of $A$. As we will see, the $\sqrt{p}$ can be converted into information about tail probabilities. Section 9 contains a complete proof of Theorem 8.
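For intuition, this small experiment (our own sketch, assuming a Gaussian test matrix) compares an empirical estimate of $(\mathbb{E}\|AR^*\|^q)^{1/q}$ against the two terms of Theorem 8.

```python
import numpy as np

rng = np.random.default_rng(3)
N, n_rows, m, q, trials = 200, 50, 20, 2, 500

A = rng.standard_normal((n_rows, N)) / np.sqrt(n_rows)
norms = []
for _ in range(trials):
    cols = rng.choice(N, size=m, replace=False)
    norms.append(np.linalg.norm(A[:, cols], 2) ** q)
moment = np.mean(norms) ** (1 / q)

p = max(2, 2 * np.log(min(n_rows, m)), q / 2)  # rank(AR*) <= min(n_rows, m)
bound = 3 * np.sqrt(p) * np.max(np.linalg.norm(A, axis=0)) \
        + np.sqrt(m / N) * np.linalg.norm(A, 2)
print(moment, bound)  # the empirical moment should sit below the bound
```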
The next major theorem is a decoupling result. It is challenging to work directly with expressions like $\|RAR^*\|$ because the random restriction appears twice. Decoupling allows us to move to an expression that involves (conditionally) independent restrictions. This alteration permits us to apply tools such as Theorem 8.

Theorem 9 (Decoupling). Let $A$ be a $2N \times 2N$ Hermitian matrix with a zero diagonal, and let $R$ be a restriction to $m$ random coordinates. For each $q \geq 1$, there exists a partition of $\llbracket 2N \rrbracket$ into two blocks $T_1$ and $T_2$ with $N$ elements each so that
$$(\mathbb{E}\,\|RAR^*\|^q)^{1/q} < 2\max_{m_1 + m_2 = m}\,(\mathbb{E}\,\|R_1 A_{T_1 \times T_2} R_2^*\|^q)^{1/q}$$
where
• the maximum occurs over integers $m_1, m_2 \in [0, N]$,
• the symbol $A_{T_1 \times T_2}$ denotes the submatrix of $A$ indexed by $T_1 \times T_2$, and
• the maps $R_i$ are independent restrictions to $m_i$ random coordinates from $T_i$ for $i = 1, 2$.
When $A$ has odd order $(2N+1) \times (2N+1)$, an analogous result holds for a partition into blocks of size $N$ and $N+1$.

The proof of this theorem is the subject of Section 10. The argument is a variation on a classical technique [LT91, Sec. 4.4].

Finally, we need to understand the relationship between the moments of a random variable and its tail behavior. In particular, a random variable whose $L^q$ norm is proportional to $\sqrt{q}$ exhibits subgaussian decay. The following result is essentially Lemma 4.10 from [LT91].

Proposition 10 (Subgaussian Tail Bound). Let $Z$ be a nonnegative random variable whose moments satisfy
$$[\mathbb{E}(Z^q)]^{1/q} \leq \alpha\sqrt{q} + \beta \quad\text{for all } q \geq Q.$$
For all $u \geq \sqrt{Q}$, one has the tail bound
$$\mathbb{P}\left\{Z \geq e^{1/4}(\alpha u + \beta)\right\} \leq e^{-u^2/4}.$$

Proof. Let $\kappa$ be a nonnegative number. Combining Markov's inequality with the moment bound, we have
$$\mathbb{P}\{Z \geq e^\kappa(\alpha u + \beta)\} \leq \frac{\mathbb{E}\,Z^q}{(e^\kappa(\alpha u + \beta))^q} \leq \left(\frac{\alpha\sqrt{q} + \beta}{e^\kappa(\alpha u + \beta)}\right)^q.$$
Choose $q = u^2$ to make the bracket equal $e^{-\kappa}$. Then select $\kappa = 0.25$.

5. Pairs of Orthonormal Bases

We begin with results for subdictionaries drawn from a pair of orthonormal bases. The special structure of these dictionaries allows us to simplify the arguments substantially.

Fix the ambient dimension $d$. Suppose that the dictionary $\Phi = [\Phi_1 \ \Phi_2]$, where the blocks $\Phi_1$ and $\Phi_2$ are unitary matrices of dimension $d$. In this case, the number of columns $N = 2d$. The Gram matrix of the dictionary has the form
$$\Phi^*\Phi = \begin{bmatrix} I & F \\ F^* & I \end{bmatrix}$$
where $F = \Phi_1^*\Phi_2$ is a unitary matrix of dimension $d$. It is easy to check that the spectral norm of the dictionary always satisfies $\|\Phi\| = \sqrt{2}$. The coherence $\mu$ of the dictionary can be calculated as
$$\mu = \max_{j,k}|f_{jk}| = \max_{j,k}\left|\left\langle \varphi_j^{(1)}, \varphi_k^{(2)} \right\rangle\right| \quad\text{for } j, k = 1, 2, \ldots, d$$
where $\varphi_j^{(i)}$ is the $j$th column from basis $i$. Recall from (2.2) that $\mu \geq d^{-1/2}$.
5.1. Random Subdictionaries. Now, we describe our model for choosing a random subdictionary. Suppose that $\Omega$ indexes a fixed set of columns from the first basis, and let us draw a random set $T$ of columns from the second basis. Define $m = \min\{|\Omega|, |T|\}$. These choices lead to the subdictionary
$$X = [\Phi_1 R_\Omega^* \ \ \Phi_2 R_T^*].$$
Its Gram matrix has the form
$$X^*X = \begin{bmatrix} I & R_\Omega F R_T^* \\ (R_\Omega F R_T^*)^* & I \end{bmatrix}.$$
Using a standard identity, we find that
$$\|X^*X - I\| = \|R_\Omega F R_T^*\|. \tag{5.1}$$
We must estimate the moments of this random variable. This calculation is an easy consequence of Theorem 8. It is clear that $\operatorname{rank}(R_\Omega F R_T^*) \leq m$. Therefore, given $q \geq 1$, we may select $p = \max\{2, 2\log(m+1), q/2\}$. Invoking the theorem to draw $|T|$ random columns from $R_\Omega F$, we find that
$$(\mathbb{E}_T\,\|R_\Omega F R_T^*\|^q)^{1/q} \leq 3\max\left\{\sqrt{2}, \sqrt{2\log(m+1)}, \sqrt{q/2}\right\}\|R_\Omega F\|_{1,2} + \sqrt{\frac{|T|}{d}}\,\|R_\Omega F\|.$$
Since the entries of $F$ have magnitude no greater than $\mu$ and $R_\Omega F$ has $|\Omega|$ rows,
$$\|R_\Omega F\|_{1,2} \leq \mu\sqrt{|\Omega|}.$$
It also holds that $\|R_\Omega F\| \leq \|F\| = 1$. Recalling (5.1), we obtain
$$(\mathbb{E}\,\|X^*X - I\|^q)^{1/q} \leq \sqrt{18\mu^2|\Omega|}\,\max\left\{1, \sqrt{\log(m+1)}, \sqrt{q/4}\right\} + \sqrt{\frac{|T|}{d}}. \tag{5.2}$$
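The identity (5.1) is easy to confirm numerically. The sketch below (our own illustration) builds a random pair of orthonormal bases, forms the subdictionary $X$, and compares $\|X^*X - I\|$ with $\|R_\Omega F R_T^*\|$.

```python
import numpy as np

rng = np.random.default_rng(4)
d, nO, nT = 32, 6, 10

# Two random orthonormal bases via QR factorizations.
Q1, _ = np.linalg.qr(rng.standard_normal((d, d)))
Q2, _ = np.linalg.qr(rng.standard_normal((d, d)))

Omega = rng.choice(d, size=nO, replace=False)
T = rng.choice(d, size=nT, replace=False)
X = np.hstack([Q1[:, Omega], Q2[:, T]])

F = Q1.T @ Q2                 # F = Phi_1^* Phi_2
block = F[np.ix_(Omega, T)]   # the compressed block R_Omega F R_T^*
lhs = np.linalg.norm(X.T @ X - np.eye(nO + nT), 2)
print(lhs, np.linalg.norm(block, 2))  # the two norms agree
```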
5.2. Results. With the moments at hand, we can immediately prove several different results. The statement of Theorem C is furnished by the choice $q = 1$ in (5.2).

Next, we prove a more detailed result. Let us assume $m \geq 2$ so that $\log(m+1) > 1$. (In case $m = 1$, the theorem is vacuous.) Select $q \geq 4\log(m+1)$ in (5.2) to obtain
$$(\mathbb{E}\,\|X^*X - I\|^q)^{1/q} \leq \sqrt{4.5\mu^2|\Omega|}\,\sqrt{q} + \sqrt{\frac{|T|}{d}}. \tag{5.3}$$
Choose $s \geq 1$ and muster Proposition 10 with the parameter $u = \sqrt{4s\log(m+1)}$ to reach the following theorem.

Theorem 11. Let $\Phi$ be a pair of orthonormal bases. Let $X$ consist of $|\Omega|$ arbitrary columns from the first basis and $|T|$ random columns from the second basis. Suppose that
$$\sqrt{18\mu^2|\Omega|\log(m+1) \cdot s} + \sqrt{\frac{|T|}{d}} \leq e^{-1/4}\delta$$
where $m = \min\{|\Omega|, |T|\}$ and $s \geq 1$. Then $\mathbb{P}\{\|X^*X - I\| \geq \delta\} \leq m^{-s}$.

Finally, let us explain how to adduce Theorem D, which is really just a simplified version of the last result. Assume that $d \geq 3$. Continuing from (5.3), recall that $\mu \geq d^{-1/2}$ to advance to
$$(\mathbb{E}\,\|X^*X - I\|^q)^{1/q} \leq \left(\sqrt{4.5\mu^2|\Omega|} + \sqrt{\mu^2|T|}\right)\sqrt{q} \quad\text{for } q \geq 4\log d.$$
The square-root function is concave, whence
$$(\mathbb{E}\,\|X^*X - I\|^q)^{1/q} \leq \sqrt{9\mu^2(|\Omega| + |T|)}\,\sqrt{q}.$$
Fix $s \geq 1$. Calling on Proposition 10 with parameter $u = \sqrt{4s\log d}$, we find that the condition
$$\sqrt{36\mu^2(|\Omega| + |T|)\log d \cdot s} \leq 0.5\,e^{-0.25}$$
ensures that the random variable exceeds 0.5 with probability less than $d^{-s}$. Perform some algebraic manipulations to isolate $|\Omega| + |T|$:
$$|\Omega| + |T| \leq \frac{c\mu^{-2}}{s\log d}$$
where $c \geq 0.004212$.

6. Incoherent Dictionaries

In this section, we develop results for subdictionaries of a general dictionary. To that end, suppose that $\Phi$ is a $d \times N$ matrix with unit-norm columns and coherence $\mu$.

6.1. Moment bounds. Let $X$ be an $m$-column submatrix of $\Phi$, drawn at random. In this section, we calculate for $q \geq 1$ that
$$(\mathbb{E}\,\|X^*X - I\|^q)^{1/q} \leq \sqrt{144\mu^2 m}\,\max\left\{1, \sqrt{\log(m/2+1)}, \sqrt{q/4}\right\} + \frac{2m}{N}\,\|\Phi\|^2. \tag{6.1}$$
In the sequel, we assume that $N$, the number of columns in the dictionary, is even; the proof in the odd case is essentially the same.

Define the hollow Gram matrix of the dictionary: $H = \Phi^*\Phi - I$. This matrix has a zero diagonal since the columns of the dictionary have unit norm. Let $R$ be a random restriction onto $m$ coordinates from $\llbracket N \rrbracket$. With this new notation, the matrix of interest may be viewed as a compression of the hollow Gram matrix: $X^*X - I = RHR^*$.

To begin our calculation, we invoke the decoupling result, Theorem 9, to see that
$$\left(\mathbb{E}\,\|X^*X - I\|^q\right)^{1/q} \leq 2\max_{m_1 + m_2 = m}\left(\mathbb{E}\,\|R_1\widehat{H}R_2^*\|^q\right)^{1/q} \tag{6.2}$$
where $\widehat{H}$ is a submatrix of $H$ of dimension $N/2 \times N/2$ and the matrices $R_i$ are independent restrictions to $m_i$ random coordinates for each $i = 1, 2$. In the sequel, the symbol $\mathbb{E}_i$ will denote expectation with respect to $R_i$, holding the other random restriction fixed. We may express
$$\left(\mathbb{E}\,\|R_1\widehat{H}R_2^*\|^q\right)^{1/q} = \left(\mathbb{E}_1\,\mathbb{E}_2\,\|R_1\widehat{H}R_2^*\|^q\right)^{1/q}.$$
It is evident that $\operatorname{rank}(R_1\widehat{H}R_2^*) \leq m/2$ since one of the numbers $m_1$ or $m_2$ is less than or equal to $m/2$. Therefore, we may choose
$$p = \max\{2, 2\log(m/2+1), q/2\}. \tag{6.3}$$
Apply the random compression bound, Theorem 8, to the inner expectation to select $m_2$ of the $N/2$ columns from $R_1\widehat{H}$. This step results in
$$\left(\mathbb{E}_2\,\|R_1\widehat{H}R_2^*\|^q\right)^{1/q} \leq 3\sqrt{p}\,\|R_1\widehat{H}\|_{1,2} + \sqrt{\frac{m_2}{N/2}}\,\|R_1\widehat{H}\|.$$
Observe that $R_1\widehat{H}$ is a submatrix of $H$ with $m_1$ rows. Therefore, none of its columns has $\ell_2$ norm greater than $\mu\sqrt{m_1}$. Combining these bounds,
$$\left(\mathbb{E}\,\|R_1\widehat{H}R_2^*\|^q\right)^{1/q} \leq \left(\mathbb{E}_1\left(3\mu\sqrt{m_1}\,\sqrt{p} + \sqrt{\frac{m_2}{N/2}}\,\|R_1\widehat{H}\|\right)^q\right)^{1/q}.$$
Apply the triangle inequality and the homogeneity of the $L^q$ norm to reach
$$\left(\mathbb{E}\,\|R_1\widehat{H}R_2^*\|^q\right)^{1/q} \leq 3\mu\sqrt{m_1}\,\sqrt{p} + \sqrt{\frac{m_2}{N/2}}\left(\mathbb{E}_1\,\|R_1\widehat{H}\|^q\right)^{1/q}. \tag{6.4}$$
Next, we examine the remaining expectation. The spectral norm is invariant under conjugate transposition, so we apply the compression theorem to $\widehat{H}^*R_1^*$ to select $m_1$ of the $N/2$ columns. With the same choice of $p$ as in (6.3),
$$\left(\mathbb{E}_1\,\|R_1\widehat{H}\|^q\right)^{1/q} \leq 3\sqrt{p}\,\|\widehat{H}^*\|_{1,2} + \sqrt{\frac{m_1}{N/2}}\,\|\widehat{H}^*\|.$$
We may bound the two norms above in terms of properties of the dictionary. The entries of the matrix $\widehat{H}$ are inner products between distinct columns of $\Phi$, so the coherence controls the $(1,2)$ norm:
$$\|\widehat{H}^*\|_{1,2} \leq \mu\sqrt{N/2}.$$
The spectral norm of the dictionary controls the spectral norm of $\widehat{H}$:
$$\|\widehat{H}^*\| \leq \|H^*\| = \|\Phi^*\Phi - I\| = \max\{1, \|\Phi\|^2 - 1\} \leq \|\Phi\|^2.$$
Introduce the last three estimates into (6.4) to discover that
$$\left(\mathbb{E}\,\|R_1\widehat{H}R_2^*\|^q\right)^{1/q} \leq 3\mu\left(\sqrt{m_1} + \sqrt{m_2}\right)\sqrt{p} + \frac{\sqrt{4m_1 m_2}}{N}\,\|\Phi\|^2. \tag{6.5}$$
Let us maximize this inequality over parameters $m_1 + m_2 = m$, subject to $0 \leq m_i \leq N/2$. First, notice that $\sqrt{4m_1 m_2} \leq m$. The square-root function is concave, so
$$\sqrt{m_1} + \sqrt{m_2} \leq 2\sqrt{(m_1 + m_2)/2} = \sqrt{2m}.$$
Introducing the last two bounds into (6.5), we reach
$$\left(\mathbb{E}\,\|R_1\widehat{H}R_2^*\|^q\right)^{1/q} \leq \sqrt{18\mu^2 m}\,\sqrt{p} + \frac{m}{N}\,\|\Phi\|^2.$$
Finally, recall the value of $p$ from (6.3) and substitute the most recent bound into (6.2) to reach the announced inequality (6.1).

6.2. Results. First, we show how to reach Theorem A of Section 2. This result frames the hypotheses that $N \geq 2d$ and that $\Phi$ is a unit-norm tight frame. For simplicity, assume that $m \geq 6$ so that $\log(m/2+1) > 1$. The moment bound (6.1) with $q = 1$ implies that
$$\mathbb{E}\,\|X^*X - I\| \leq \sqrt{144\mu^2 m\log(m/2+1)} + \frac{2m}{N}\,\|\Phi\|^2.$$
Since $\Phi$ is a tight frame, $\|\Phi\|^2 = N/d$. Thus
$$\mathbb{E}\,\|X^*X - I\| \leq \sqrt{144\mu^2 m\log(m/2+1)} + \frac{2m}{d}.$$
The assumption that $N \geq 2d$ implies that the coherence $\mu > (2d)^{-1/2}$ on account of inequality (1.1). When the first term is less than one, it follows that
$$\frac{2m}{d} < 4\mu^2 m \leq 4\sqrt{\mu^2 m}.$$
Therefore, we can absorb the first term into the second term by adjusting the constant. Increasing $\log(m/2+1)$ to $\log(m+1)$ completes the argument. When $m < 6$, the same result clearly holds with a (potentially) larger constant.
Our major result contains more detailed information about the role of the dictionary's spectral norm and the probability decay. In case $q \geq 4\log(m/2+1) \geq 4$, the moment bound (6.1) becomes
$$(\mathbb{E}\,\|X^*X - I\|^q)^{1/q} \leq \sqrt{36\mu^2 m}\,\sqrt{q} + \frac{2m}{N}\,\|\Phi\|^2.$$
Fix $s \geq 1$, and invoke Proposition 10 with parameter $u = \sqrt{4s\log(m/2+1)}$.

Theorem 12. Let $\Phi$ be a dictionary, and let $X$ be a random $m$-column subdictionary where $m \geq 4$. Suppose that
$$\sqrt{144\mu^2 m\log(m/2+1) \cdot s} + \frac{2m}{N}\,\|\Phi\|^2 \leq e^{-1/4}\delta \quad\text{with } s \geq 1.$$
Then $\mathbb{P}\{\|X^*X - I\| \geq \delta\} \leq (m/2)^{-s}$.

Theorem B of Section 2 follows from a similar argument, along with some algebraic simplifications.

7. Application: Sparse Representation

Our previous results have several interesting applications. The first one concerns uniqueness of sparse representations, the problem that was raised in the introduction. We draw a random sparse signal according to the following model and ask about the probability that this signal has another representation that is equally sparse.

Model (M0) for a random signal $s = \Phi z$:
• The dictionary $\Phi$ has coherence $\mu$.
• The subdictionary $\Phi_T$ has least singular value $\sigma_{\min}(\Phi_T) \geq 2^{-1/2}$, and its cardinality satisfies $\mu\sqrt{|T|} < 2^{-1/2}$.
• The coefficient vector $z$ is supported on $T$, and $R_T z$ is continuously distributed.
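As a concrete illustration (our own sketch, not from the paper), the following code draws a signal in the spirit of Model (M0): it picks a random support, checks the singular-value and coherence conditions, and forms the signal with continuously distributed coefficients. The dictionary and parameters are assumptions chosen so that both checks typically succeed.

```python
import numpy as np

rng = np.random.default_rng(5)
d, N, m = 256, 512, 4

Phi = rng.standard_normal((d, N))
Phi /= np.linalg.norm(Phi, axis=0)
mu = np.max(np.abs(Phi.T @ Phi - np.eye(N)))

T = rng.choice(N, size=m, replace=False)
smin = np.linalg.svd(Phi[:, T], compute_uv=False)[-1]
# Model (M0) conditions; both typically print True for these sizes.
print(smin >= 2 ** -0.5, mu * np.sqrt(m) < 2 ** -0.5)

z = np.zeros(N)
z[T] = rng.standard_normal(m)   # continuously distributed coefficients on T
s = Phi @ z                     # the random signal
```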
Observe that our previous results (such as Theorems 11 and 12) allow us to determine the probability that a randomly chosen subdictionary fits the requirements of this model.

Theorem 13. Suppose that $s = \Phi z$ is a random signal drawn from Model (M0). Then $z$ is almost surely the unique vector that satisfies the constraints
$$\Phi c = s \quad\text{and}\quad \|c\|_0 \leq m.$$

This theorem is closely related to results of Candès and Romberg about the uniqueness of signals that are sparse with respect to a pair of orthonormal bases [CR06, Thms. 4.1, 5.2]. The extension to other types of dictionaries is new.

Proof. We make the abbreviation $m = |T|$. Since $\Phi_T$ has full rank, the signal $s = \Phi z$ has a continuous distribution on the range of $\Phi_T$. Let $\Omega$ be an arbitrary set of $m$ indices or fewer. We claim that
$$\operatorname{range}(\Phi_\Omega) = \operatorname{range}(\Phi_T) \implies \Omega = T.$$
In particular,
$$\Omega \neq T \implies \dim\left(\operatorname{range}(\Phi_\Omega) \cap \operatorname{range}(\Phi_T)\right) < m.$$
To see how this fact implies the result, consider the set of signals in $\operatorname{range}(\Phi_T)$ that can be represented using a different set of $m$ columns. This set can be written as a finite union of subspaces with dimension strictly less than $m$. Therefore, it has zero volume with respect to any nonatomic
measure. In other words, it is almost surely the case that the signal has no other representation as a linear combination of $m$ columns from $\Phi$. The theorem follows.

Let us turn to the claim. We may assume that $|\Omega| = m$, or else the range of $\Phi_\Omega$ cannot exhaust the range of $\Phi_T$. Since the dimensions match, it suffices to find one vector in $\operatorname{range}(\Phi_\Omega)$ that does not lie in $\operatorname{range}(\Phi_T)$. We check that, for any $\omega \in \Omega \setminus T$,
$$\|P_T\varphi_\omega\|_2 < 1,$$
where $P_T$ is the orthogonal projector onto $\operatorname{range}(\Phi_T)$. Since $\Omega$ is distinct from $T$, it follows that $\Omega \setminus T$ is nonempty, which supplies us with the coveted vector.

Observe that the projector can be written as $P_T = (\Phi_T^\dagger)^*\Phi_T^*$ where the dagger $\dagger$ indicates the Moore–Penrose pseudoinverse. The usual norm estimate implies that
$$\|P_T\varphi_\omega\|_2 \leq \|\Phi_T^\dagger\|\,\|\Phi_T^*\varphi_\omega\|_2.$$
Since $\sigma_{\min}(\Phi_T) \geq 2^{-1/2}$, the first term is no greater than $\sqrt{2}$. Since $\omega \notin T$, each entry of the vector $\Phi_T^*\varphi_\omega$ is bounded in magnitude by the coherence $\mu$. Therefore, $\|\Phi_T^*\varphi_\omega\|_2 \leq \mu\sqrt{m}$. We conclude that
$$\|P_T\varphi_\omega\|_2 \leq \sqrt{2}\cdot\mu\sqrt{m} < 1$$
owing to the assumptions in Model (M0).
8. Application: Sparse Recovery

In this section, we consider an algorithmic method for identifying representations of a sparse signal. Suppose that $\Phi$ is a dictionary. Let $s$ be a signal that has a sparse representation with respect to the dictionary:
$$s = \Phi c_{\mathrm{opt}} \quad\text{where}\quad \|c_{\mathrm{opt}}\|_0 \leq m.$$
In general, it is NP-hard to identify the coefficient vector $c_{\mathrm{opt}}$ [Nat95, DMA97], so various heuristic methods have been developed. One approach, especially popular with mathematicians, is to solve
$$\min \|c\|_1 \quad\text{subject to}\quad \Phi c = s. \tag{P1}$$
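In passing, for real dictionaries (P1) can be recast as a linear program by splitting $c$ into positive and negative parts. The sketch below (a minimal illustration under that real-valued assumption, not the paper's own code) solves it with SciPy; in practice one would reach for a dedicated solver.

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(Phi, s):
    """Solve min ||c||_1 s.t. Phi c = s via LP, writing c = u - v with u, v >= 0."""
    d, N = Phi.shape
    cost = np.ones(2 * N)              # objective: sum(u) + sum(v) = ||c||_1
    A_eq = np.hstack([Phi, -Phi])      # constraint: Phi (u - v) = s
    res = linprog(cost, A_eq=A_eq, b_eq=s, bounds=(0, None))
    return res.x[:N] - res.x[N:]

rng = np.random.default_rng(6)
d, N, m = 64, 128, 5
Phi = rng.standard_normal((d, N))
Phi /= np.linalg.norm(Phi, axis=0)
c_opt = np.zeros(N)
c_opt[rng.choice(N, m, replace=False)] = rng.standard_normal(m)
c_hat = basis_pursuit(Phi, Phi @ c_opt)
print(np.max(np.abs(c_hat - c_opt)))  # near zero when recovery succeeds
```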
The idea is that the $\ell_1$ norm is a convex relaxation of the $\ell_0$ quasi-norm, so one hopes that the solutions will coincide. The problem (P1) can be cast as a second-order cone program, which means that the optimization can be completed in polynomial time. The paper [CDS99] was the first to advocate this heuristic for sparse recovery. Some other references include [DH01, Tro06a, CR06].

Before we present the signal model, a little background is necessary. Define the signum function
$$\operatorname{sgn}(re^{i\theta}) \overset{\text{def}}{=} \begin{cases} e^{i\theta} & \text{when } r > 0 \\ 0 & \text{when } r = 0. \end{cases}$$
Extend this function to vectors by applying it to each component. A Steinhaus random variable is a complex random variable that is distributed uniformly on the unit circle $\{w \in \mathbb{C} : |w| = 1\}$. A Steinhaus sequence is a (countable) collection of independent Steinhaus random variables.

We draw a random sparse signal according to the following model and ask about the probability that this representation can be identified by solving (P1).
Model (M1) for a random signal $s = \Phi z$:
• The dictionary $\Phi$ has coherence $\mu$.
• The subdictionary $\Phi_T$ has least singular value $\sigma_{\min}(\Phi_T) \geq 2^{-1/2}$, and its cardinality satisfies $8\mu^2|T| \leq 1/\log(N/\delta)$.
• The coefficient vector $z$ is supported on $T$, and $\operatorname{sgn}(R_T z)$ forms a Steinhaus sequence.
In words, the magnitudes of the nonzero entries of the coefficient vector are completely arbitrary, but the phases must be independent and uniformly distributed on the circle. If in addition the distribution of the nonzero coefficients is continuous, then the signal also satisfies the requirements of Model (M0).

Theorem 14. Suppose that $s = \Phi z$ is a random signal drawn according to Model (M1). Then $z$ is the unique solution to (P1), except with probability $2\delta$.

If the signal satisfies the requirements of both Model (M0) and (M1), then Theorem 13 shows that the sparsest representation of the random signal is almost surely unique and Theorem 14 shows that (P1) identifies this maximally sparse representation with overwhelming probability.

The proof of Theorem 14 is an application of a result due to Fuchs [Fuc04] and the present author [Tro05] combined with a basic large deviation inequality. Candès and Romberg have established an analogous result for pairs of orthonormal bases [CR06]. The extension to other types of dictionaries is novel.

Proposition 15 (Fuchs, Tropp). Suppose that $s = \Phi c_{\mathrm{opt}}$, and write $T = \operatorname{supp}(c_{\mathrm{opt}})$. If
$$\left|\left\langle \Phi_T^\dagger\varphi_k,\, R_T\operatorname{sgn}(c_{\mathrm{opt}})\right\rangle\right| < 1 \quad\text{for all } k \notin T,$$
then $c_{\mathrm{opt}}$ is the unique solution to (P1). As before, the dagger $\dagger$ denotes the Moore–Penrose pseudoinverse.

Proposition 16 (Complex Bernstein). Let $a$ be a complex vector, and let $\varepsilon$ be a Steinhaus sequence. For all $u \geq 0$ and all $\kappa \in (0, 1)$,
$$\mathbb{P}\left\{\left|\sum\nolimits_j \varepsilon_j a_j\right| \geq u\,\|a\|_2\right\} \leq \frac{e^{-\kappa u^2}}{1 - \kappa}.$$
In particular, we may take the right-hand side to be $2e^{-u^2/2}$.

Proof sketch. Let $Z = \sum_j \varepsilon_j a_j$. The Khintchine inequality for Steinhaus sequences yields a sharp estimate for the even moments of this random variable [PS95, Sec. 9.3]:
$$\mathbb{E}\,|Z|^{2p} \leq p!\,\|a\|_2^{2p}.$$
Apply the Laplace transform method to $Z^2$ to obtain the tail bound.
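A quick simulation (our own check, not from the paper) makes the tail bound of Proposition 16 tangible for a random coefficient vector.

```python
import numpy as np

rng = np.random.default_rng(7)
n, trials, u = 50, 20000, 2.0

a = rng.standard_normal(n) + 1j * rng.standard_normal(n)
norm_a = np.linalg.norm(a)

# Steinhaus sequences: independent points uniform on the unit circle.
eps = np.exp(2j * np.pi * rng.random((trials, n)))
Z = np.abs(eps @ a)
print(np.mean(Z >= u * norm_a), 2 * np.exp(-u**2 / 2))  # empirical tail vs bound
```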
Proof of Theorem 14. We abbreviate $m = |T|$. To ensure that the unique solution to (P1) equals $z$, Proposition 15 asks us to verify that
$$\left|\left\langle \Phi_T^\dagger\varphi_k,\, R_T(\operatorname{sgn} z)\right\rangle\right| < 1 \quad\text{for all } k \notin T.$$
First, let us estimate the norm of the left-hand member of the inner product. Expanding the pseudoinverse, we obtain
$$\Phi_T^\dagger\varphi_k = (\Phi_T^*\Phi_T)^{-1}\Phi_T^*\varphi_k.$$
Taking norms and applying the familiar operator norm bound,
$$\|\Phi_T^\dagger\varphi_k\|_2 \leq \|(\Phi_T^*\Phi_T)^{-1}\|\,\|\Phi_T^*\varphi_k\|_2.$$
Since $\sigma_{\min}(\Phi_T) \geq 2^{-1/2}$, the spectral norm of the inverse matrix does not exceed two. Since $k \notin T$, each entry of the vector $\Phi_T^*\varphi_k$ is bounded in magnitude by the coherence parameter $\mu$, so $\|\Phi_T^*\varphi_k\|_2 \leq \mu\sqrt{m}$. Therefore,
$$\|\Phi_T^\dagger\varphi_k\|_2 \leq 2\mu\sqrt{m} \quad\text{for all } k \notin T.$$
The inner product can be rewritten as
$$\left\langle \Phi_T^\dagger\varphi_k,\, R_T(\operatorname{sgn} z)\right\rangle = \sum_{j \in T}(\Phi_T^\dagger\varphi_k)_j\operatorname{sgn} z_j.$$
Since $\{\operatorname{sgn} z_j : j \in T\}$ is a Steinhaus sequence, the complex Bernstein inequality results in
$$\mathbb{P}\left\{\left|\left\langle \Phi_T^\dagger\varphi_k,\, R_T(\operatorname{sgn} z)\right\rangle\right| \geq 1\right\} \leq 2e^{-1/(8\mu^2 m)}.$$
Invoking the union bound over at most $N$ choices of $\varphi_k$, we reach
$$\mathbb{P}\left\{\max_{k \notin T}\left|\left\langle \Phi_T^\dagger\varphi_k,\, R_T(\operatorname{sgn} z)\right\rangle\right| \geq 1\right\} \leq 2Ne^{-1/(8\mu^2 m)}.$$
Owing to the choice of $m$ in Model (M1), the right-hand side does not exceed $2\delta$.
9. The Spectral Norm of a Random Compression

The goal of this section is to study the moments of the random variable $\|AR_\Omega^*\|$ where the matrix $A$ is arbitrary and $R_\Omega$ is a restriction to $m$ random coordinates. If we treat the columns of $A$ as separate points in a linear space of matrices, the restriction map effectively creates the sum of a random subset of these columns:
$$AR_\Omega^* = \sum_{k \in \Omega} a_k e_k^*$$
where $e_k$ is the $k$th standard basis vector. To estimate the spectral norm of this sum, we first introduce additional randomness of a type that is easier to understand. Then we can study the sum by conditioning on the subset $\Omega$ and applying methods that are adapted for the simple random variables.

9.1. Rademacher Sums. The simplest nontrivial random variable is the Rademacher random variable, which takes the values $\pm 1$ with equal probability. It is traditionally denoted by the letter $\varepsilon$. A Rademacher sequence is a (countable) collection of independent Rademacher random variables. Let $\{x_1, x_2, \ldots, x_N\}$ be a sequence of points in a normed linear space $\mathbb{X}$. The associated Rademacher sum is the random linear combination
$$\sum_{k=1}^N \varepsilon_k x_k$$
where $\{\varepsilon_k\}$ is a Rademacher sequence. As we will see, Rademacher sums play a central role in our investigation because there are powerful techniques for estimating their norms.

9.2. Symmetrization of Random Subset Sums. A fundamental method for studying the sum of independent random variables in a normed space is to compare the sum against a related Rademacher sum. This process is known as symmetrization, and the key result is Lemma 6.3 of [LT91]. This lemma does not apply to random subset sums, but it is possible to establish an analogous bound. In words, a random subset sum carries its share of the norm plus an additional component that is dominated by a Rademacher sum.
Lemma 17. Let $\{x_1, x_2, \ldots, x_N\}$ be a sequence of points in a normed linear space $\mathbb{X}$, and let $\Omega$ be a random subset of $\llbracket N \rrbracket$ with cardinality $m$. For each $q \geq 1$,
$$\left(\mathbb{E}_\Omega\left\|\sum_{k \in \Omega} x_k\right\|_{\mathbb{X}}^q\right)^{1/q} \leq 2\left(\mathbb{E}_\Omega\,\mathbb{E}_\varepsilon\left\|\sum_{k \in \Omega}\varepsilon_k x_k\right\|_{\mathbb{X}}^q\right)^{1/q} + \frac{m}{N}\left\|\sum_{k=1}^N x_k\right\|_{\mathbb{X}}$$
where $\varepsilon$ is a Rademacher sequence independent from $\Omega$.

This result may be known, but we do not have a reference. (See the proof of Theorem 3.4 in [RV06], which mistakenly cites Lemma 6.3 of [LT91].) Before we begin, let us make another definition. A balanced Rademacher sequence of length $2K$ is a random vector $\xi \in \{\pm 1\}^{2K}$ whose entries sum to zero, with all such choices equally likely.

Proof. We prove the result for $q = 1$, since the other cases are substantially identical. Let $\delta \in \{0, 1\}^N$ be a random vector with exactly $m$ entries equal to one. The variables $\{\delta_k\}$ will select points to participate in the sum. That is,
$$E \overset{\text{def}}{=} \mathbb{E}_\Omega\left\|\sum_{k \in \Omega} x_k\right\|_{\mathbb{X}} = \mathbb{E}_\delta\left\|\sum_{k=1}^N \delta_k x_k\right\|_{\mathbb{X}}.$$
Define $\bar{\delta} = m/N$ to be the (common) expectation of the entries of $\delta$. Center the selectors by subtracting their mean and applying the triangle inequality:
$$E \leq \mathbb{E}_\delta\left\|\sum_{k=1}^N (\delta_k - \bar{\delta})x_k\right\|_{\mathbb{X}} + \bar{\delta}\left\|\sum_{k=1}^N x_k\right\|_{\mathbb{X}}.$$
(When $q > 1$, we must apply the triangle inequality a second time to draw the second summand out of the $L^q$ norm.) Let the random vector $\delta'$ be an independent copy of $\delta$. Then,
$$E \leq \mathbb{E}_\delta\left\|\sum_{k=1}^N \left(\delta_k - \mathbb{E}_{\delta'}\delta_k'\right)x_k\right\|_{\mathbb{X}} + \bar{\delta}\left\|\sum_{k=1}^N x_k\right\|_{\mathbb{X}} \leq \mathbb{E}_{\delta,\delta'}\left\|\sum_{k=1}^N (\delta_k - \delta_k')x_k\right\|_{\mathbb{X}} + \bar{\delta}\left\|\sum_{k=1}^N x_k\right\|_{\mathbb{X}},$$
where the second relation follows from Jensen's inequality. The rest of the proof focuses on the quantity
$$F \overset{\text{def}}{=} \mathbb{E}_{\delta,\delta'}\left\|\sum_{k=1}^N (\delta_k - \delta_k')x_k\right\|_{\mathbb{X}}.$$
Consider the set of places where exactly one of the two selectors is active:
$$T \overset{\text{def}}{=} \{k \in \llbracket N \rrbracket : \delta_k + \delta_k' = 1\}.$$
Note the following properties of this set:
(1) The cardinality of $T$ is an even number in the range $[0, 2m]$.
(2) Conditional on $|T|$, the set $T$ is a uniformly random subset of $\llbracket N \rrbracket$.
(3) Exactly $\frac{1}{2}|T|$ of the components $k$ listed in $T$ have $\delta_k = 1$. These locations form a uniformly random subset of $T$.
If $\xi \in \{\pm 1\}^T$ is a balanced Rademacher sequence of length $|T|$, independent from everything, then our expression for $F$ becomes
$$F = \mathbb{E}_T\,\mathbb{E}_\xi\left\|\sum_{k \in T}\xi_k x_k\right\|_{\mathbb{X}}.$$
We estimate the balanced Rademacher sum using Lemma 18 of the sequel, which furnishes
$$F \leq 2\,\mathbb{E}_T\,\mathbb{E}_U\,\mathbb{E}_\varepsilon\left\|\sum_{k \in U}\varepsilon_k x_k\right\|_{\mathbb{X}}$$
where $U$ is a uniformly random subset of $T$ with $\frac{1}{2}|T|$ entries and $\varepsilon$ is a standard Rademacher sequence, independent from everything. The distribution induced on $U$ has two notable properties.
(1) The cardinality $|U|$ is an integer random variable ranging over $[0, m]$.
(2) Conditional on $|U|$, the set $U$ is a uniformly random subset of $\llbracket N \rrbracket$.
Forgetting about the subset $T$, we may write
$$F \leq 2\,\mathbb{E}_{|U|}\,\mathbb{E}_U\,\mathbb{E}_\varepsilon\left\|\sum_{k \in U}\varepsilon_k x_k\right\|_{\mathbb{X}} \leq 2\max_{|U| = 0, 1, \ldots, m}\,\mathbb{E}_U\,\mathbb{E}_\varepsilon\left\|\sum_{k \in U}\varepsilon_k x_k\right\|_{\mathbb{X}}.$$
Using Jensen's inequality, one easily checks that Rademacher series are monotonic in the sense that
$$\mathbb{E}_\varepsilon\left\|\sum_{k=1}^K \varepsilon_k y_k\right\|_{\mathbb{X}} \leq \mathbb{E}_\varepsilon\left\|\sum_{k=1}^{K+1}\varepsilon_k y_k\right\|_{\mathbb{X}} \quad\text{for each sequence } \{y_k\} \subset \mathbb{X}.$$
It follows that the maximum is attained when $|U| = m$. We conclude that
$$F \leq 2\,\mathbb{E}_\Omega\,\mathbb{E}_\varepsilon\left\|\sum_{k \in \Omega}\varepsilon_k x_k\right\|_{\mathbb{X}},$$
where $\Omega$ is a uniformly random subset of $\llbracket N \rrbracket$ with cardinality $m$.
The next lemma shows that a balanced Rademacher sum is dominated by a certain standard Rademacher sum.

Lemma 18. Let $\{x_1, x_2, \ldots, x_{2K}\}$ be a sequence of points in a normed linear space $\mathbb{X}$, and suppose that $\xi$ is a balanced Rademacher sequence of length $2K$. For each $q \geq 1$,
$$\left(\mathbb{E}_\xi\left\|\sum_{k=1}^{2K}\xi_k x_k\right\|_{\mathbb{X}}^q\right)^{1/q} \leq 2\left(\mathbb{E}_U\,\mathbb{E}_\varepsilon\left\|\sum_{k \in U}\varepsilon_k x_k\right\|_{\mathbb{X}}^q\right)^{1/q}$$
where $U$ is a random set of $K$ numbers from $\llbracket 2K \rrbracket$ and $\varepsilon$ is a standard Rademacher sequence, independent from $U$.

Proof. We establish the result for $q = 1$ since the proof is the same for the other cases. The basic strategy is to pair positive and negative entries of the balanced Rademacher sequence at random and then to turn off one element of each pair, also at random.

Let $\pi$ be a random permutation of $\llbracket 2K \rrbracket$. For each $\pi$, define $\pi_+(k) = \pi(k)$ and $\pi_-(k) = \pi(K+k)$ for each $k = 1, 2, \ldots, K$. Then
$$E \overset{\text{def}}{=} \mathbb{E}_\xi\left\|\sum_{k=1}^{2K}\xi_k x_k\right\|_{\mathbb{X}} = \mathbb{E}_\pi\left\|\sum_{k=1}^K\left(x_{\pi_+(k)} - x_{\pi_-(k)}\right)\right\|_{\mathbb{X}}.$$
Independent from $\pi$, draw a random vector $\delta$ from $\{0, 1\}^K$, and observe that $\mathbb{E}\,\delta_k = \frac{1}{2}$ for each component $k$. Using Jensen's inequality and independence,
$$E = 2\,\mathbb{E}_\pi\left\|\mathbb{E}_\delta\sum_{k=1}^K\left(\delta_k x_{\pi_+(k)} - (1 - \delta_k)x_{\pi_-(k)}\right)\right\|_{\mathbb{X}} \leq 2\,\mathbb{E}_\delta\,\mathbb{E}_\pi\left\|\sum_{k=1}^K\left(\delta_k x_{\pi_+(k)} - (1 - \delta_k)x_{\pi_-(k)}\right)\right\|_{\mathbb{X}} = 2\,\mathbb{E}_\varepsilon\,\mathbb{E}_\pi\left\|\sum_{k=1}^K\varepsilon_k x_{\pi_{\varepsilon_k}(k)}\right\|_{\mathbb{X}}$$
where $\varepsilon$ is a standard Rademacher sequence, independent from everything. For each fixed $\varepsilon$, the sequence $\{\pi_{\varepsilon_1}(1), \pi_{\varepsilon_2}(2), \ldots, \pi_{\varepsilon_K}(K)\}$ is a uniformly random sequence of $K$ distinct numbers from $\llbracket 2K \rrbracket$ independent from $\varepsilon$, so we can write
$$E \leq 2\,\mathbb{E}_\varepsilon\,\mathbb{E}_\pi\left\|\sum_{k=1}^K\varepsilon_k x_{\pi(k)}\right\|_{\mathbb{X}}.$$
Interchange the order of the expectations and invert the permutation:
$$E \leq 2\,\mathbb{E}_\pi\,\mathbb{E}_\varepsilon\left\|\sum_{k \in U}\varepsilon_{\pi^{-1}(k)}x_k\right\|_{\mathbb{X}}$$
where $U = \pi^{-1}(\llbracket K \rrbracket)$ is a uniformly random set of $K$ numbers from $\llbracket 2K \rrbracket$. Rademacher sequences are exchangeable, so we may remove the inverse permutation to complete the proof.

9.3. Schatten Norms. Next, we present another piece of background. To each matrix $A$, one may associate the vector $\sigma(A)$ of singular values. Given a parameter $1 \leq p \leq \infty$, the Schatten $p$-norm is defined as
$$\|A\|_{S_p} \overset{\text{def}}{=} \|\sigma(A)\|_p,$$
where $\|\cdot\|_p$ is the usual $\ell_p$ vector norm. Note that the choices $p = 2, \infty$ lead respectively to the Frobenius and spectral norms. The Schatten norms inherit many properties from the $\ell_p$ norms. In particular,
• When $q \leq p$, it holds that $\|A\|_{S_p} \leq \|A\|_{S_q}$.
• When $1 \leq p$, it holds that $\|A\|_{S_p} \leq \operatorname{rank}(A)^{1/p}\,\|A\|$.
Moreover, the Schatten spaces form a scale, so we can interpolate between them. The books [TJ89, Sim05] contain a detailed study of the Schatten classes.

9.4. Khintchine Inequalities. The scalar Khintchine inequality, which dates to 1923, gives detailed information about the moments of a Rademacher sum of real numbers. More recently, it was discovered that an analogous inequality holds for Rademacher sums of matrices. This astonishing fact is due to Lust-Picquard [LP86], but the version here depends on an argument of Buchholz [Buc01] that extends the original method of Khintchine [PS95, Sec. 2].

Proposition 19 (Noncommutative Khintchine Inequality). Let $\{A_k\}$ be a finite sequence of matrices of the same dimension, and let $\gamma$ be a sequence of independent, standard Gaussian variables. For each even number $p \geq 2$,
$$\left(\mathbb{E}\left\|\sum_k \gamma_k A_k\right\|_{S_p}^p\right)^{1/p} \leq \mathrm{C}_p\max\left\{\left\|\left(\sum_k A_k A_k^*\right)^{1/2}\right\|_{S_p},\; \left\|\left(\sum_k A_k^* A_k\right)^{1/2}\right\|_{S_p}\right\}, \tag{9.1}$$
where the optimal constant
$$\mathrm{C}_p = \left(\frac{p!}{2^{p/2}(p/2)!}\right)^{1/p}.$$
This result is a direct application of Theorem 5 from [Buc01] with the expectation playing the role of the linear functional $\psi$. It is elementary to check that Gaussian variables have "mixed moments defined by pairings" with the pairing function $\mathbf{m}_n$ identically equal to one. Note that the optimal value of the constant matches the scalar case [Haa82].

The following corollary extends Proposition 19 to other values of $p$, although it is unable to locate the best constant.

Corollary 20. When $2 \leq p < \infty$, the Khintchine inequality (9.1) holds with a constant that satisfies
$$\mathrm{C}_p \leq 2^{0.25}e^{-1/2}\sqrt{p}.$$

Proof sketch. Choose $n \in \mathbb{N}$ and $\theta \in (0, 1)$. To obtain the Khintchine inequality for $S_{2n+2\theta}$, invoke the Riesz–Thorin theorem [Zyg02, Chap. XII] to interpolate between the inequalities for $S_{2n}$ and $S_{2n+2}$. The constant satisfies
$$\mathrm{C}_{2n+2\theta} \leq \mathrm{C}_{2n}^{1-\theta}\,\mathrm{C}_{2n+2}^{\theta}.$$
A routine estimate (using, for example, Stirling's approximation) shows that
$$\mathrm{C}_{2n} \leq 2^{1/4n}e^{-1/2}\sqrt{2n}.$$
Introducing the latter bound into the former yields the conclusion.

It is an easy consequence of Jensen's inequality that the noncommutative Khintchine inequality also holds for Rademacher sums of matrices.

Corollary 21. Let $\{A_k\}$ be a finite sequence of matrices of the same dimension, and let $\varepsilon$ be a Rademacher sequence. For each $p \geq 2$,
$$\left(\mathbb{E}\left\|\sum_k \varepsilon_k A_k\right\|_{S_p}^p\right)^{1/p} \leq \mathrm{D}_p\max\left\{\left\|\left(\sum_k A_k A_k^*\right)^{1/2}\right\|_{S_p},\; \left\|\left(\sum_k A_k^* A_k\right)^{1/2}\right\|_{S_p}\right\}, \tag{9.2}$$
where the constant
$$\mathrm{D}_p \leq \sqrt{\frac{\pi}{2}}\,\mathrm{C}_p \leq 2^{-0.25}\sqrt{\frac{\pi}{e}}\,\sqrt{p}.$$
The optimal value of the Rademacher constant $\mathrm{D}_p$ is not available except in the simplest case $p = 2$. The Central Limit Theorem implies that $\mathrm{D}_p \geq \mathrm{C}_p$. It seems possible that the Rademacher and Gaussian constants are identical, just as they are in the scalar Khintchine inequality, but the question remains open.

9.5. Rudelson's Lemma. The noncommutative Khintchine inequality does not apply directly to the spectral norm, but it can be used to obtain information about the spectral norm of a Rademacher sum. The following estimate, due to Rudelson [Rud99], shows how to accomplish this.

Lemma 22 (Rudelson). Suppose that $a_1, a_2, \ldots, a_m$ are the columns of a matrix $A$, and fix $q > 0$. For any $p \geq \max\{2, 2\log(\operatorname{rank} A), q\}$, it holds that
$$\left(\mathbb{E}\left\|\sum_{k=1}^m \varepsilon_k a_k a_k^*\right\|^q\right)^{1/q} < 1.5\sqrt{p}\,\|A\|_{1,2}\,\|A\|,$$
where $\varepsilon$ is a Rademacher sequence.

Proof. Choose a value of $p$ subject to the limitations described above. To begin the estimate, we bound the spectral norm by the Schatten $p$-norm and use Hölder's inequality to move from the $L^q$ norm to the $L^p$ norm:
$$E \overset{\text{def}}{=} \left(\mathbb{E}\left\|\sum_{k=1}^m \varepsilon_k a_k a_k^*\right\|^q\right)^{1/q} \leq \left(\mathbb{E}\left\|\sum_{k=1}^m \varepsilon_k a_k a_k^*\right\|_{S_p}^q\right)^{1/q} \leq \left(\mathbb{E}\left\|\sum_{k=1}^m \varepsilon_k a_k a_k^*\right\|_{S_p}^p\right)^{1/p}.$$
Apply the noncommutative Khintchine inequality to obtain
$$E \leq \mathrm{D}_p\left\|\left(\sum_{k=1}^m \|a_k\|_2^2\, a_k a_k^*\right)^{1/2}\right\|_{S_p}.$$
The rank of the matrix inside the norm does not exceed $r = \operatorname{rank} A$. Thus, we may bound the Schatten $p$-norm by the spectral norm if we pay a factor of $r^{1/p}$, which does not exceed $\sqrt{e}$ by our choice of $p$. Afterward, we draw the square root out from the norm to reach
$$E \leq \mathrm{D}_p\sqrt{e}\left\|\sum_{k=1}^m \|a_k\|_2^2\, a_k a_k^*\right\|^{1/2}.$$
The summands are all positive semidefinite, so the spectral norm of the sum increases monotonically with each scalar coefficient. Therefore, we may replace each coefficient by $\max_k \|a_k\|_2^2$ and use the
homogeneity of the norm to obtain
$$E \leq \mathrm{D}_p\sqrt{e}\cdot\max_k\|a_k\|_2\cdot\left\|\sum_{k=1}^m a_k a_k^*\right\|^{1/2}.$$
The maximum can be rewritten as $\|A\|_{1,2}$, and the spectral norm can be expressed as
$$\left\|\sum_{k=1}^m a_k a_k^*\right\|^{1/2} = \|AA^*\|^{1/2} = \|A\|.$$
Recall that $\mathrm{D}_p \leq 2^{-0.25}\sqrt{\pi/e}\,\sqrt{p}$, then calculate the leading constant numerically to complete the proof.

9.6. Random Compressions. We are now prepared to prove the main theorem. This argument is essentially due to Rudelson [Rud99]. Some related work appears in [MP06, RV07]. The only significant change in the proof is the application of the new symmetrization result, Lemma 17, in place of Lemma 6.3 of [LT91]. We have also made some minor modifications to reduce the size of the constants.

Theorem 23 (Spectral Norm of a Random Compression). Suppose that $A$ is a matrix with $N$ columns, and let $R$ be a random restriction to $m$ coordinates, chosen at random from $\llbracket N \rrbracket$. Fix $q \geq 2$. For each $p \geq \max\{2, 2\log(\operatorname{rank} AR^*), q/2\}$, it holds that
$$(\mathbb{E}\,\|AR^*\|^q)^{1/q} \leq 3\sqrt{p}\left(\mathbb{E}\,\|AR^*\|_{1,2}^q\right)^{1/q} + \sqrt{\frac{m}{N}}\,\|A\|.$$

Proof. Let us begin with an overview of the proof. First, we express the random compression as a subset sum. Then, we symmetrize the subset sum and apply Rudelson's lemma to obtain an upper bound involving the value we are trying to estimate. Finally, we solve an algebraic relation to obtain an explicit estimate.

The object of interest is the quantity
$$E \overset{\text{def}}{=} (\mathbb{E}\,\|AR^*\|^q)^{1/q}.$$
First, observe that
$$E^2 = \left(\mathbb{E}\,\|AR^*RA^*\|^{q/2}\right)^{2/q} = \left(\mathbb{E}\left\|\sum_{k \in \Omega}a_k a_k^*\right\|^{q/2}\right)^{2/q}$$
where $\Omega$ is the random set of coordinates selected by the restriction $R$. Since $q/2 \geq 1$, the symmetrization result, Lemma 17, shows that
$$E^2 \leq 2\left(\mathbb{E}_\Omega\,\mathbb{E}_\varepsilon\left\|\sum_{k \in \Omega}\varepsilon_k a_k a_k^*\right\|^{q/2}\right)^{2/q} + \frac{m}{N}\left\|\sum_{k=1}^N a_k a_k^*\right\| = 2\left(\mathbb{E}_\Omega\,\mathbb{E}_\varepsilon\left\|\sum_{k \in \Omega}\varepsilon_k a_k a_k^*\right\|^{q/2}\right)^{2/q} + \frac{m}{N}\,\|A\|^2.$$
Choose $p$ subject to the limitations described. To estimate the parenthesis, we invoke Rudelson's lemma, conditional on the choice of $\Omega$. The matrix in the statement of the lemma is $AR^*$, resulting in
$$E^2 \leq 3\sqrt{p}\left(\mathbb{E}\left(\|AR^*\|_{1,2}\,\|AR^*\|\right)^{q/2}\right)^{2/q} + \frac{m}{N}\,\|A\|^2.$$
Invoke the Cauchy–Schwarz inequality to find that
$$E^2 \leq 3\sqrt{p}\left(\mathbb{E}\,\|AR^*\|_{1,2}^q\right)^{1/q}\left(\mathbb{E}\,\|AR^*\|^q\right)^{1/q} + \frac{m}{N}\,\|A\|^2.$$
Observe that we have obtained a copy of $E$ on the right-hand side of the relation.
This inequality takes the form $E^2 \leq bE + c$. One obtains an upper bound on $E$ by choosing the larger root of the quadratic and invoking the subadditivity of the square root:
$$E \leq \frac{b + \sqrt{b^2 + 4c}}{2} \leq b + \sqrt{c}.$$
We discover that
$$E \leq 3\sqrt{p}\left(\mathbb{E}\,\|AR^*\|_{1,2}^q\right)^{1/q} + \sqrt{\frac{m}{N}}\,\|A\|.$$
This is the conclusion of the theorem.

There is some interesting information in the statement of Theorem 23 about the precise role of the column norms. Adapting an observation of Rudelson and Vershynin [RV07, Lemma 4.1], we see that
$$\mathbb{E}\,\|AR^*\|_{1,2}^2 \leq 2\max_{|K| = N/m}\frac{m}{N}\sum_{k \in K}\|a_k\|_2^2.$$
More or less, if we draw $m$ columns at random from a matrix with $N$ columns, we expect the maximum column norm to be controlled by the average of the largest $N/m$ column norms. Therefore, if the columns of $A$ have wildly disparate $\ell_2$ norms, it is valuable to take this fact into account. For clarity of exposition, we use a simpler result that ignores this information.

Corollary 24. Suppose that $A$ is a matrix with $N$ columns, and let $R$ be a restriction to $m$ coordinates, chosen at random from $\llbracket N \rrbracket$. Fix $q > 0$. For each $p \geq \max\{2, 2\log(\operatorname{rank} AR^*), q/2\}$, it holds that
$$(\mathbb{E}\,\|AR^*\|^q)^{1/q} \leq 3\sqrt{p}\,\|A\|_{1,2} + \sqrt{\frac{m}{N}}\,\|A\|.$$

Proof. For any coordinate restriction $R$, it is true that $\|AR^*\|_{1,2} \leq \|A\|_{1,2}$, which yields the result for $q \geq 2$. The result for $0 < q < 2$ follows from Hölder's inequality.

10. Decoupling under the Spectral Norm

Decoupling is another important method from probability theory that allows one to replace dependent random variables by independent random variables. The following theorem offers a new twist on a classical result from harmonic analysis. See [BT87, Prop. 1.9] and [LT91, Sec. 4.4].

Theorem 25 (Decoupling in the Spectral Norm). Let $A$ be a $2N \times 2N$ Hermitian matrix with a zero diagonal, and let $R$ be a restriction to $m$ random coordinates. Fix $q \geq 1$. There exists a partition of $\llbracket 2N \rrbracket$ into two blocks $T_1$ and $T_2$ with $N$ elements each so that
$$(\mathbb{E}\,\|RAR^*\|^q)^{1/q} < 2\max_{m_1 + m_2 = m}\,(\mathbb{E}\,\|R_1 A_{T_1 \times T_2}R_2^*\|^q)^{1/q}$$
where
• the maximum occurs over integers $m_1, m_2 \in [0, N]$,
• the symbol $A_{T_1 \times T_2}$ denotes the submatrix of $A$ indexed by $T_1 \times T_2$, and
• the matrices $R_i$ are independent restrictions to $m_i$ random coordinates from $T_i$ for $i = 1, 2$.
When $A$ has odd order $(2N+1) \times (2N+1)$, an analogous result holds for a partition into blocks of size $N$ and $N+1$.

This theorem offers several important advantages over the classical one. First, the classical result only applies when the restriction selects each coordinate independent from the others, whereas the new result allows us to select exactly $m$ coordinates. In consequence, the conclusions are slightly weaker: we must pass to an unknown submatrix and the restrictions are only conditionally independent. Second, we have exploited properties of the spectral norm to reduce the constant by an order of magnitude.
Proof. We establish the result for the case where the matrix has even order; the other case is essentially the same. For notational simplicity, we take $q = 1$; the other cases are similar.

Define the matrices $B_{jk} = a_{jk}e_j e_k^*$, and let $\delta \in \{0, 1\}^{2N}$ be a random vector with exactly $m$ components equal to one. Then one can express
$$RAR^* = \sum_{j \neq k}\delta_j\delta_k B_{jk}.$$
Our goal is to bound the expectation
$$E \overset{\text{def}}{=} \mathbb{E}_\delta\left\|\sum_{j \neq k}\delta_j\delta_k B_{jk}\right\|.$$
Let $\eta \in \{0, 1\}^{2N}$ be a random vector with $N$ components equal to one. When $j \neq k$, it is elementary that
$$\mathbb{E}_\eta\left[\eta_j(1 - \eta_k) + (1 - \eta_j)\eta_k\right] = \frac{N}{2N - 1}.$$
We insinuate this quantity into the norm, use Jensen's inequality to draw out the expectation, and exchange the indices in the second copy of the sum:
$$E = \frac{2N-1}{N}\,\mathbb{E}_\delta\left\|\mathbb{E}_\eta\sum_{j \neq k}\left[\eta_j(1 - \eta_k) + \eta_k(1 - \eta_j)\right]\delta_j\delta_k B_{jk}\right\|$$
$$< 2\,\mathbb{E}_\delta\,\mathbb{E}_\eta\left\|\sum_{j \neq k}\eta_j\delta_j\cdot(1 - \eta_k)\delta_k B_{jk} + \sum_{j \neq k}(1 - \eta_j)\delta_j\cdot\eta_k\delta_k B_{jk}\right\|$$
$$= 2\,\mathbb{E}_\eta\,\mathbb{E}_\delta\left\|\sum_{j \neq k}\eta_j\delta_j\cdot(1 - \eta_k)\delta_k(B_{jk} + B_{kj})\right\|.$$
There must exist some vector $\eta^\star$ for which the inner expectation is no smaller than its average over $\eta$. Define $T_1 = \{j : \eta_j^\star = 1\}$ and $T_2 = \{k : \eta_k^\star = 0\}$. Note that these sets partition $\llbracket 2N \rrbracket$, and each contains $N$ elements. It holds that
$$E < 2\,\mathbb{E}_\delta\left\|\sum_{j \in T_1,\, k \in T_2}\delta_j\delta_k(B_{jk} + B_{kj})\right\|.$$
Since $B_{kj} = B_{jk}^*$, the matrix inside the norm is Hermitian. Since $T_1$ and $T_2$ are disjoint, the matrix is also block counter-diagonal.

To complete the argument, we must repackage the matrix as a submatrix of $A$. Let the random variable $m_1$ denote the number of selectors that are active on the set $T_1$, and write $m_2 = m - m_1$. Observe that the sequences $\{\delta_j : j \in T_1\}$ and $\{\delta_k : k \in T_2\}$ are conditionally independent, given the numbers $m_1$ and $m_2$. Moreover, under this conditioning, all choices of $m_i$ entries from $T_i$ are equally likely for $i = 1, 2$. If we write $R_i$ for a random restriction to $m_i$ coordinates from $T_i$, then we have
$$E < 2\,\mathbb{E}_{m_i}\,\mathbb{E}_{R_i}\,\|R_1 A_{T_1 \times T_2}R_2^* + R_2 A_{T_2 \times T_1}R_1^*\| = 2\,\mathbb{E}_{m_i}\,\mathbb{E}_{R_i}\,\|R_1 A_{T_1 \times T_2}R_2^*\|,$$
where the equality holds on account of the identity
$$\left\|\begin{bmatrix} 0 & B \\ B^* & 0 \end{bmatrix}\right\| = \|B\|.$$
The expectation with respect to the random variable $m_1$ is certainly less than the maximum over all choices $m_1 + m_2 = m$ with the caveat $0 \leq m_i \leq N$. Hence,
$$E < 2\max_{m_1 + m_2 = m}\,\mathbb{E}_{R_i}\,\|R_1 A_{T_1 \times T_2}R_2^*\|,$$
which is the announced conclusion.