
Robust Identifiability in Sparse Dictionary Learning

Charles J. Garfinkle, Christopher J. Hillar

arXiv:1606.06997v1 [stat.ML] 22 Jun 2016

Abstract—Sparse coding or sparse dictionary learning methods have exposed underlying sparse structure in many kinds of natural data. Here, we generalize previous results guaranteeing when the learned dictionary and sparse codes are unique up to inherent permutation and scaling ambiguities [1]. We show that these solutions are robust to the addition of measurement noise provided the data samples are sufficiently diverse. Central to our proofs is a useful lemma in combinatorial matrix theory which allows us to derive bounds on the number of samples necessary to guarantee uniqueness. We also provide probabilistic extensions of our robust identifiability theorem and an extension to the case where only an upper bound on the number of dictionary elements is known a priori. Our results help to inform the interpretation of sparse structure learned from data: whenever the conditions of one of our robust identifiability theorems are met, any sparsity-constrained algorithm that reconstructs the data well enough recovers the original dictionary and sparse codes up to an error commensurate with the noise. We discuss applications of this result to smoothed analysis, communication theory, and applied physics and engineering.

Index Terms—Bilinear inverse problem, matrix factorization, identifiability, dictionary learning, sparse coding, compressed sensing, combinatorial linear algebra, blind source separation, sparse component analysis

I. INTRODUCTION

A number of modern approaches to signal processing model datasets Y = {y1, . . . , yN} ⊂ Rn as lying near a collection of low-dimensional subspaces spanned by the columns {A1, . . . , Am} of a dictionary matrix A ∈ Rn×m. Specifically, data vectors y ∈ Y are modeled as arising from a linear measurement:

  y = Aa + n,    (1)

in which the underlying sparse code a ∈ Rm has at most k nonzero entries (it is k-sparse) and the noise vector n has bounded energy, |n|2 ≤ ε. The noise accounts for uncertainty due both to measurement inaccuracy and to the degree to which the sparse linear model fails to describe the true nature of the data. In this note, we study the uniqueness of solutions to the associated inverse problem:

  Find B ∈ Rn×m and k-sparse b1, . . . , bN ∈ Rm such that |yi − Bbi|2 ≤ ε for i = 1, . . . , N.    (2)
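To make the measurement model (1) and the inverse problem (2) concrete, here is a minimal numpy sketch that generates synthetic data of this form. The dimensions, noise level, and the uniform distribution of the nonzero entries are illustrative choices of this sketch, not requirements of the model.

  import numpy as np

  rng = np.random.default_rng(0)
  n, m, k, N = 20, 30, 3, 200        # illustrative sizes
  eps = 0.01                          # noise bound from (1)

  A = rng.standard_normal((n, m))     # dictionary with columns A_1, ..., A_m

  def random_k_sparse(m, k, rng):
      """A k-sparse code: k nonzero entries on a random support."""
      a = np.zeros(m)
      support = rng.choice(m, size=k, replace=False)
      a[support] = rng.uniform(0.0, 1.0, size=k)
      return a

  X = np.column_stack([random_k_sparse(m, k, rng) for _ in range(N)])

  # noise vectors scaled so that |n_i|_2 <= eps, as assumed in (1)
  noise = rng.standard_normal((n, N))
  noise *= eps / np.linalg.norm(noise, axis=0)
  Y = A @ X + noise                   # the observed dataset y_1, ..., y_N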

The term ‘uniqueness’ in this context requires clarification, however, as even in the noiseless case (ε = 0) there are many solutions to (2). Let Y and X be matrices with columns yi and ai, respectively, so that Y = AX for X with k-sparse columns. Then we also have Y = BX′ for X′ with k-sparse columns whenever B = AD−1P−1 and X′ = PDX, with P ∈ Rm×m a permutation matrix and D ∈ Rm×m an invertible diagonal matrix. Any particular solution to (2) thus generates many other solutions through the mappings PD, which together constitute an ambiguity transform group [2] associated with this particular bilinear inverse problem. The question of when every solution Y = BX′ must necessarily be part of this equivalence class (i.e., X′ = PDX and BPD = A for some P, D) was addressed in the theoretical works [3], [4], [1] and determines whether A and X are identifiable from Y up to this inherent ambiguity. It remains to determine the extent to which A and X can be determined up to these permutation-scaling ambiguities in the presence of measurement noise (ε > 0). To address this, we consider the following robust identifiability criterion.

Definition 1. A dataset Y = {y1, . . . , yN} ⊂ Rn has a k-sparse representation in Rm if for some dictionary A ∈ Rn×m and k-sparse a1, . . . , aN ∈ Rm we have yi = Aai for i = 1, . . . , N. We say this representation is robustly identifiable if for every δ1, δ2 ≥ 0 there exists a nonnegative ε = ε(δ1, δ2) (with ε > 0 when δ1, δ2 > 0) such that if B ∈ Rn×m and k-sparse b1, . . . , bN ∈ Rm satisfy:

  |yi − Bbi|2 ≤ ε,   for i = 1, . . . , N,    (3)

then there necessarily exist a permutation matrix P ∈ Rm×m and an invertible diagonal matrix D ∈ Rm×m with

  |Aj − (BPD)j|2 ≤ δ1 and |ai − D−1P⊤bi|1 ≤ δ2,   for j = 1, . . . , m; i = 1, . . . , N.    (4)

Problem 1. Let Y = {y1, . . . , yN} ⊂ Rn be generated as yi = Aai for some matrix A ∈ Rn×m and k-sparse ai ∈ Rm. When does Y have a robustly identifiable k-sparse representation in Rm?
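The permutation-scaling ambiguity described above is easy to verify numerically. The sketch below is a toy illustration of the equivalence class in Definition 1: it builds (B, X′) = (AD−1P−1, PDX) from an arbitrary permutation and invertible diagonal matrix and checks that the product is unchanged. All sizes are illustrative.

  import numpy as np

  rng = np.random.default_rng(1)
  n, m, k, N = 20, 30, 3, 50
  A = rng.standard_normal((n, m))
  X = np.zeros((m, N))
  for i in range(N):                              # k-sparse columns
      X[rng.choice(m, k, replace=False), i] = rng.uniform(size=k)

  perm = rng.permutation(m)
  P = np.eye(m)[:, perm]                          # permutation matrix
  D = np.diag(rng.uniform(0.5, 2.0, size=m))      # invertible diagonal

  B = A @ np.linalg.inv(D) @ P.T                  # B = A D^{-1} P^{-1}
  X_prime = P @ D @ X                             # X' = P D X, columns still k-sparse

  assert np.allclose(A @ X, B @ X_prime)          # same data, different factorization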

The research of Garfinkle and Hillar was conducted while at the Redwood Center for Theoretical Neuroscience, Berkeley, CA, USA. Hillar was supported, in part, by National Science Foundation grants IIS-1219212 and IIS-1219199.


It is easy to see how a solution to Problem 1 directly informs the interpretation of any solution to (2). Suppose Y is a sequence of measurements noisily generated as in (1), where the underlying dictionary A and sparse codes ai are such that the sequence of noiseless measurements Aai has a robustly identifiable k-sparse representation in Rm. Fix δ1, δ2 ≥ 0 and let ε(δ1, δ2) be as in Def. 1. Since by (1) we have |yi − Aai|2 ≤ ε for all i, if ε ≤ (1/2)ε(δ1, δ2) then, by the triangle inequality, any solution to (2) necessarily satisfies (4) for some P and D. That is, the solution to (2) is uniquely determined up to the inherent permutation-scaling ambiguity and an error commensurate with the measurement noise.

The main finding of this work is that so long as the matrix A ∈ Rn×m is injective on the set of k-sparse vectors:

  As1 = As2 =⇒ s1 = s2,   for all k-sparse s1, s2 ∈ Rm,    (5)

a requirement known in the sparse recovery literature as the spark condition, then the k-sparse representation of Y in Problem 1 is robustly identifiable provided the data are sufficiently diverse. Note that this condition is necessary given only that the ai are k-sparse. Before stating this result more precisely, we explain how the spark condition relates to the lower bound [5] of A, which is the largest number α such that |Ax|2 ≥ α|x|2 for all x. A straightforward compactness argument shows that every injective linear map has a nonzero lower bound; hence, if A satisfies the spark condition then every submatrix formed from 2k of its columns has a nonzero lower bound. We therefore define the following domain-restricted lower bound of a matrix A:

  Lk(A) := max{α : |As|2 ≥ α|s|2 for all k-sparse s ∈ Rm}.    (6)
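Since the constraint in (6) decouples over supports, Lk(A) equals the smallest least-singular-value among the n × k column submatrices of A, and the spark condition (5) is equivalent to L2k(A) > 0. The brute-force sketch below computes it for small instances (the enumeration is exponential in k); it is an illustration, not a practical algorithm, and the matrix sizes are arbitrary.

  from itertools import combinations
  import numpy as np

  def lower_bound_Lk(A, k):
      """L_k(A) as in (6): minimum over size-k supports S of the least
      singular value of the submatrix A_S (brute force)."""
      m = A.shape[1]
      return min(np.linalg.svd(A[:, S], compute_uv=False)[-1]
                 for S in map(list, combinations(range(m), k)))

  rng = np.random.default_rng(2)
  A = rng.standard_normal((8, 12))
  k = 2
  print(lower_bound_Lk(A, k))          # L_k(A)
  print(lower_bound_Lk(A, 2 * k) > 0)  # True iff A satisfies the spark condition (5)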

Clearly, Lk(A) ≥ Lk′(A) whenever k < k′, and L2k(A) > 0 for any A satisfying (5). Recall that a cyclic order on [m] := {1, . . . , m} is an arrangement of [m] in a circular necklace, and that an interval is a contiguous sequence of such elements. The following is our main result.

Theorem 1. Fix positive integers n, m, k < m, and a cyclic order on [m]. If k-sparse a1, . . . , aN ∈ Rm are such that for every interval of length k in the cyclic order there are (k − 1)(m choose k) + 1 vectors ai in general linear position supported on that interval, and A ∈ Rn×m satisfies spark condition (5), then Y = {Aa1, . . . , AaN} has a robustly identifiable k-sparse representation. Specifically, there exists a constant C > 0 for which the following holds for all ε < (L2(A)/√2)C−1. If any matrix B ∈ Rn×m and k-sparse b1, . . . , bN ∈ Rm are such that |Aai − Bbi|2 ≤ ε for all i ∈ [N], then

  |Aj − (BPD)j|2 ≤ Cε,   for all j ∈ [m]    (7)

for some permutation matrix P ∈ Rm×m and invertible diagonal matrix D ∈ Rm×m. If, moreover, we have ε < ε0, where ε0 = (L2k(A)/√(2k))C−1, then B also satisfies the spark condition and

  |ai − D−1P−1bi|1 ≤ (C−1 + |ai|1) ε/(ε0 − ε),   for all i ∈ [N].    (8)

An important consequence of this result is that for sufficiently small reconstruction error, the original dictionary and k-sparse vectors are determined up to a commensurate error and the permutation-scaling ambiguity. In terms of Def. 1, for fixed δ1, δ2 ≥ 0 we have that (3) implies (4) for any ε ∈ [0, ε0) satisfying:

  ε ≤ min( δ1C−1,  δ2ε0 / (δ2 + C−1 + maxi∈[N] |ai|1) ).    (9)

The constant C is defined in terms of A and the ai in (14), below.

Corollary 1. Given positive integers n, m, and k < m, there exist N = m(k − 1)(m choose k) + m vectors a1, . . . , aN ∈ Rm with the following property: every matrix A ∈ Rn×m satisfying spark condition (5) generates Y = {Aa1, . . . , AaN} with a robustly identifiable k-sparse representation.

Our proof of Theorem 1 is a delicate refinement of the arguments in [1] to handle measurement error. Here, we also reduce the number of samples theoretically required by the guarantee in [1] to N = m(k − 1)(m choose k) + m, and we provide guarantees for the case when only an upper bound on the number of columns in A is known (see Theorem 2 below). Given Theorem 1, it is straightforward to provide probabilistic extensions (Theorems 3 and 4) by drawing on a key result from the recently emergent field of compressed sensing (CS) [6], [7], [8]. Briefly, the goal of CS is to recover a signal x ∈ Rm that is sparse enough in some known basis (i.e., s = Ψx is k-sparse for some invertible Ψ) via a stable and efficient reconstruction process after it has been linearly subsampled as y = Φx by a known compression matrix Φ ∈ Rn×m. If the generation matrix A = ΦΨ−1 satisfies the spark condition (5), then s is identifiable given y and the signal x can then be reconstructed as Ψ−1s. The key result we refer to is that a random compression matrix Φ (many ensembles of random matrices work, e.g. [9, §4]) yields an A satisfying the spark condition with probability one (or with "high probability" for discrete variables) provided the dimension n of y satisfies:

  n ≥ γ k log(m/k),    (10)


where γ is a positive constant dependent on the distribution from which the entries of Φ are drawn. But what if we don't know Φ? Our theorems demonstrate that the matrix A and sparse codes si can still be estimated up to noise and an inherent permutation-scaling ambiguity by examining a sufficiently diverse set of samples y1, . . . , yN. We state here the main implications of our probabilistic results. To keep our exposition simple, our statements are based upon the following elementary construction of random sparse vectors (although many ensembles will suffice, e.g. [9, §4]).

Definition 2 (Random k-Sparse Vectors). Given the support set for its k nonzero entries, a random draw of a is the k-sparse vector with support entries chosen uniformly from the interval [0, 1] ⊂ R, independently. When a support set is not specified, a random draw is a choice of one support set uniformly from all (m choose k) of them, followed by a random draw on that support.

Corollary 2. Suppose m, n, and k satisfy inequality (10). With probability one, a random n × m generation matrix A satisfies (5). Fixing such an A, we have with probability one that a dataset {Aa1, . . . , AaN} generated from a random draw of N = m(k − 1)(m choose k) + m k-sparse vectors ai, consisting of (k − 1)(m choose k) + 1 samples supported on each interval of length k in some cyclic ordering of [m], has a robustly identifiable k-sparse representation in Rm.

Corollary 3. Suppose m, n, and k obey inequality (10). If N = m(k − 1)(m choose k) + m randomly drawn vectors a1, . . . , aN ∈ Rm are such that for every interval of length k in some cyclic ordering of [m] there are (k − 1)(m choose k) + 1 vectors ai supported on that interval then, with probability one, almost every matrix A ∈ Rn×m gives a robustly identifiable Y = {Aa1, . . . , AaN}.

The organization of the rest of this paper is as follows. In Section II we list additional definitions and key lemmas, including our main tool from combinatorial matrix theory (Lemma 1). We derive Theorem 1 in Section III. In Section IV we state extensions of this result to the case where we have only an upper bound on the dimensionality of the sparse codes (or, equivalently, the number of columns in the generating dictionary) and briefly describe their proofs, which are themselves contained in Appendix B. In Section V we state and prove Theorems 3 and 4, our probabilistic versions of Theorem 1. The final section is a discussion, and Appendix A contains the proof of Lemma 1.

II. PRELIMINARIES

In this section, we briefly review standard definitions and outline our main tools, which include general notions of angle (Def. 3) and distance (Def. 5) between vector subspaces as well as a robust uniqueness result in combinatorial matrix theory (Lemma 1). Let ([m] choose k) be the collection of subsets of [m] := {1, . . . , m} of cardinality k, and let Span{v1, . . . , vℓ} for real vectors v1, . . . , vℓ be the vector space consisting of their R-linear span. For a matrix M, the spectral norm is denoted ||M||2. Also, given S ⊆ [m] and M ∈ Rn×m with columns {M1, . . . , Mm}, we define MS to be the submatrix with columns {Mj : j ∈ S} and also set Span{MS} := Span{Mj : j ∈ S}. Between any pair of subspaces in Euclidean space one can define the following generalized "angle":

Definition 3. The Friedrichs angle θF = θF(U, V) ∈ [0, π/2] between subspaces U, V ⊆ Rn is defined in terms of its cosine:

  cos θF := max{ ⟨u, v⟩ : u ∈ U ∩ (U ∩ V)⊥ ∩ B,  v ∈ V ∩ (U ∩ V)⊥ ∩ B },    (11)

where B = {x : |x|2 ≤ 1} is the unit ℓ2-ball in Rn [10].
For example, for n = 3 and k = 1 this is simply the angle between vectors, and for k = 2 it is the angle between the normal vectors of two planes. In higher dimensions, the Friedrichs angle is one out of a set of principal (or canonical or Jordan) angles between subspaces which are invariant to orthogonal transformations. These angles are all zero if and only if one subspace is a subset of the other; otherwise, the Friedrichs angle is the smallest nonzero such angle. The next definition we need is based on a quantity derived in [10] to describe the convergence of the alternating projections algorithm for projecting a point onto the intersection of a set of subspaces. We use it to bound the distance between a point and the intersection of a set of subspaces given an upper bound on the distance from that point to each individual subspace. The minimization over permutations in (13) below is done only to remove the dependence of the convergence result on the order in which the subspaces are input to the alternating projections algorithm.

Definition 4. Fix A ∈ Rn×m and k < m. Setting φ1(A) := 1, define for k ≥ 2:

  φk(A) := min_{S1,...,Sk ∈ ([m] choose k)} [ 1 − ξ(Span{AS1}, . . . , Span{ASk}) ],    (12)

where for any set V = {V1, . . . , Vk} of subspaces of Rm,

  ξ(V) := min_{σ∈Sk} [ 1 − ∏_{i=1}^{k−1} sin²θF( Vσ(i), Vσ(i+1) ∩ · · · ∩ Vσ(k) ) ]^{1/2},    (13)

and Sk is the set of permutations (i.e., bijections) on k elements.
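The quantities in (11)–(13) reduce to principal-angle computations: the cosines of the principal angles between two subspaces are the singular values of QU⊤QV for orthonormal bases QU, QV, and the Friedrichs angle is the smallest principal angle remaining after the dim(U ∩ V) angles equal to zero are discarded. The sketch below is a numerical illustration for a pair of subspaces, for which, by (13), ξ({U, V}) is simply cos θF(U, V); the tolerance and the value returned when one span contains the other are conventions of this sketch, not of the paper.

  import numpy as np

  def friedrichs_cosine(U, V, tol=1e-10):
      """cos(theta_F) between span(U) and span(V): the largest principal-angle
      cosine strictly below 1 (the cosines equal to 1 belong to the intersection).
      Assumes U and V have full column rank; returns 0.0 if one span is
      contained in the other (a convention of this sketch)."""
      Qu, _ = np.linalg.qr(U)
      Qv, _ = np.linalg.qr(V)
      cosines = np.linalg.svd(Qu.T @ Qv, compute_uv=False)   # descending order
      cosines = cosines[cosines < 1.0 - tol]
      return float(cosines[0]) if cosines.size else 0.0

  # For two subspaces, xi({U, V}) in (13) equals cos(theta_F(U, V)):
  rng = np.random.default_rng(3)
  U, V = rng.standard_normal((6, 2)), rng.standard_normal((6, 3))
  xi_pair = friedrichs_cosine(U, V)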


We are now in a position to state explicitly the constant C referred to in Theorem 1. Letting T be the set of supports on which the ai are supported (intervals of length k in some cyclic ordering of [m]), X be the m × N matrix with columns ai, and I(S) := {i : supp(ai) = S}, we have:

  C = √(k³) maxj∈[m] |Aj|2 / ( φk(A) minS∈T Lk(AXI(S)) ).    (14)

Remark 1. We can be sure that C is well-defined provided minS∈T Lk(AXI(S)) > 0, since φk(A) = 0 only when Span{AS1} ⊇ Span{AS2} ∩ · · · ∩ Span{ASk} for some S1, . . . , Sk ∈ ([m] choose k), which would violate the spark condition on A.

Definition 5. Let U, V be subspaces of Rm and let d(u, V) := inf{|u − v|2 : v ∈ V} = |u − ΠV u|2, where ΠV is the orthogonal projection operator onto the subspace V. The gap metric Θ on subspaces of Rm is [11]:

  Θ(U, V) := max( sup_{u∈U, |u|2=1} d(u, V),  sup_{v∈V, |v|2=1} d(v, U) ).    (15)
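For subspaces of equal dimension, the gap (15) can also be computed from an SVD (cf. Remark 2, which follows): it is the sine of the largest principal angle, and it equals one whenever the dimensions differ. A minimal sketch under the assumption that the spanning matrices have full column rank:

  import numpy as np

  def gap_metric(U, V):
      """Theta(span U, span V) from (15).  The gap is 1 when the dimensions
      differ; otherwise it is the sine of the largest principal angle
      (Remark 2).  Assumes U and V have full column rank."""
      Qu, _ = np.linalg.qr(U)
      Qv, _ = np.linalg.qr(V)
      if Qu.shape[1] != Qv.shape[1]:
          return 1.0
      c_min = np.linalg.svd(Qu.T @ Qv, compute_uv=False).min()
      return float(np.sqrt(max(0.0, 1.0 - min(1.0, c_min) ** 2)))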

Remark 2. We note that Θ(U, V) is in fact equal to the sine of the largest Jordan angle between U and V. We now state our main result from combinatorial matrix theory, generalizing [1, Lemma 1] to the noisy case.

Lemma 1 (Main Lemma). Fix positive integers n, m, k < m, and let T be the set of intervals of length k in some cyclic ordering of [m]. Let A, B ∈ Rn×m and suppose that A satisfies the spark condition (5) with maximum column ℓ2-norm ρ. If there exists a map π : T → ([m] choose k) and some δ < L2(A)/√2 such that

  Θ(Span{AS}, Span{Bπ(S)}) ≤ (φk(A)/(ρk)) δ,   for all S ∈ T,    (16)

then there exist a permutation matrix P ∈ Rm×m and an invertible diagonal matrix D ∈ Rm×m with

  |Aj − (BPD)j|2 ≤ δ,   for all j ∈ [m].    (17)

We defer the (technical) proof of this lemma to Appendix A. In words, the result says that the vectors forming the columns of A are nearly identical (up to symmetry) to those forming the columns of B, provided that for a special set T of m subspaces, each spanned by k columns of A, there exist k columns of B which span a nearby subspace with respect to the gap metric. In our proof of robust identifiability, we will also use the following useful facts about the distance d from Def. 5. The first,

  dim(W) = dim(V) =⇒ sup_{v∈V, |v|2=1} d(v, W) = sup_{w∈W, |w|2=1} d(w, V),    (18)

can be found in [12, Lemma 3.3]. The second is:

Lemma 2. If U, V are subspaces of Rm, then

  d(u, V) < |u|2 for all u ∈ U \ {0}  =⇒  dim(U) ≤ dim(V).    (19)

Proof: We prove the contrapositive. If dim(U) > dim(V), then a dimension argument (dim U + dim V⊥ > m) gives a nonzero u ∈ U ∩ V⊥. In particular, we have |u − v|2² = |u|2² + |v|2² ≥ |u|2² for all v ∈ V, and thus d(u, V) ≥ |u|2.

III. DETERMINISTIC IDENTIFIABILITY

Let e1, . . . , em be the standard basis vectors in Rm. Before proving Theorem 1 in full generality, consider the case k = 1. Fix some A ∈ Rn×m satisfying spark condition (5) and suppose we have N = m 1-sparse vectors aj = cj ej for cj ∈ R \ {0}, j ∈ [m]. By (14) we have:

  C = √(k³) maxj∈[m] |Aj|2 / (φ1(A) mini∈[m] L1(ciAi)) = √(k³) maxj∈[m] |Aj|2 / mini∈[m] |ciAi|2 ≥ maxi∈[m] 1/|ci|.    (20)

Suppose that for some B ∈ Rn×m and 1-sparse bi ∈ Rm we have |Aai − Bbi|2 ≤ ε < (L2(A)/√2)C−1 for all i ∈ [m]. Since the bi are 1-sparse, there must exist c′1, . . . , c′m ∈ R and some map π : [m] → [m] such that

  |ciAi − c′iBπ(i)|2 ≤ ε for all i ∈ [m].    (21)

Note that c′i ≠ 0 for all i, since otherwise (by the definition of L2(A)) we would reach the contradiction |ciAi|2 < mini∈[m] |ciAi|2. We will now show that π is necessarily injective (and thus defines a permutation). Suppose that π(i) = π(j) = ℓ for some i ≠ j and ℓ ∈ [m]. Then |ciAi − c′iBℓ|2 ≤ ε and |cjAj − c′jBℓ|2 ≤ ε. Using the triangle inequality followed by the definition of L2, we have:

  (|c′i| + |c′j|)ε ≥ |c′j||ciAi − c′iBℓ|2 + |c′i||c′jBℓ − cjAj|2
                 ≥ |c′jciAi − c′icjAj|2
                 ≥ L2(A)|c′jciei − c′icjej|2
                 ≥ (L2(A)/√2)|c′jciei − c′icjej|1
                 ≥ (L2(A)/√2)(|c′j| + |c′i|) minℓ∈[m] |cℓ|,

which contradicts (20) and our upper bound on ε. Hence, π is injective. Letting P = (eπ(1) · · · eπ(m)) and D = diag(c′1/c1, . . . , c′m/cm), we see that (21) becomes

  |(A − BPD)i|2 = |Ai − (c′i/ci)Bπ(i)|2 ≤ ε/|ci| ≤ Cε for all i ∈ [m].    (22)
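As an aside, the k = 1 argument above suggests a simple numerical diagnostic for the quantity bounded in (7): greedily pair each column of A with its most correlated column of a candidate dictionary B, rescale by least squares, and report the worst column residual. This is only an illustrative sketch of the error metric, not an algorithm analyzed in this paper, and the greedy matching need not find the optimal permutation.

  import numpy as np

  def dictionary_error(A, B):
      """max_j |A_j - (B P D)_j|_2 for a greedily chosen permutation P and a
      least-squares diagonal D.  Diagnostic sketch only."""
      m = A.shape[1]
      corr = np.abs(A.T @ B)
      corr /= np.linalg.norm(A, axis=0)[:, None] * np.linalg.norm(B, axis=0)[None, :]
      perm = np.full(m, -1)
      used = set()
      for j in np.argsort(-corr.max(axis=1)):       # most confident columns first
          perm[j] = next(c for c in np.argsort(-corr[j]) if c not in used)
          used.add(perm[j])
      Bp = B[:, perm]
      scales = np.sum(A * Bp, axis=0) / np.sum(Bp * Bp, axis=0)   # least-squares D
      return np.linalg.norm(A - Bp * scales, axis=0).max()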

Remark 3. From a bound of the form (7) we can, in general, bound |ai − D−1P−1bi|1 as well. Specifically, we will show that (8) always follows from (7) when ε < ε0, with ε0 = (L2k(A)/√(2k))C−1. Note that for all 2k-sparse x ∈ Rm we have, by the triangle inequality: |(A − BPD)x|2 ≤ Cε|x|1 ≤ Cε√(2k)|x|2. Thus,

  |BPDx|2 ≥ | |Ax|2 − |(A − BPD)x|2 | ≥ (L2k(A) − √(2k)Cε)|x|2,

where for the last inequality it was admissible to drop the absolute value since ε < ε0. Hence, L2k(BPD) ≥ L2k(A) − Cε√(2k) > 0 and it follows that:

  |D−1P−1bi − ai|1 ≤ √(2k)|ai − D−1P−1bi|2
                  ≤ (√(2k)/L2k(BPD)) |BPD(ai − D−1P−1bi)|2
                  ≤ (√(2k)/L2k(BPD)) (|Bbi − Aai|2 + |(A − BPD)ai|2)
                  ≤ (√(2k)/L2k(BPD)) ε(1 + C|ai|1)
                  ≤ ε√(2k)(1 + C|ai|1) / (L2k(A) − Cε√(2k))
                  = (C−1 + |ai|1) ε/(ε0 − ε).

Remark 4. We demonstrate with the following counterexample that, for C as defined in (14), the condition ε < (L2(A)/√2)C−1 is necessary to guarantee in general that (7) follows from the remaining assumptions of Theorem 1. Consider the dataset ai = ei for i = 1, . . . , m and let A be the identity matrix in Rm×m. Then L2(A) = 1 (we have |Ax|2 = |x|2 for all x ∈ Rm) and C = 1; hence (L2(A)/√2)C−1 = 1/√2. Consider the alternate dictionary B = (0, ½(e1 + e2), e3, . . . , em) and sparse codes bi = e2 for i = 1, 2 and bi = ei for i = 3, . . . , m. Then |Aai − Bbi|2 = 1/√2 for i = 1, 2 (and 0 otherwise). If there were permutation and invertible diagonal matrices P ∈ Rm×m and D ∈ Rm×m such that |(A − BPD)ei|2 ≤ Cε for all i ∈ [m], then we would reach the contradiction 1 = |P−1e1|2 = |(A − BPD)P−1e1|2 ≤ 1/√2.

It remains to show that (7), with C given in (14), follows from ε < (L2(A)/√2)C−1 for k > 1. Our main tool for the proof is Lemma 1.

Proof of Theorem 1 and Corollary 1: Let T be the set of intervals of length k in the given cyclic order of [m]. From above, we may assume that k > 1. The first step is to produce a set of N = m(k − 1)(m choose k) + m vectors in Rk in general linear position (i.e., any k of them are linearly independent). Specifically, let γ1, . . . , γN be any distinct numbers. Then the columns of the k × N matrix V with entries Vij = γj^i (i = 1, . . . , k; j = 1, . . . , N) are in general linear position (since the γj are distinct, any k × k "Vandermonde" sub-determinant is nonzero). Next, form the k-sparse vectors a1, . . . , aN ∈ Rm with supports S ∈ T (partitioning the ai evenly among these supports so that each support contains (k − 1)(m choose k) + 1 vectors ai) by setting the nonzero values of vector ai to be those contained in the ith column of V. Fix A ∈ Rn×m satisfying (5). We claim that {Aa1, . . . , AaN} has a robustly identifiable k-sparse representation in Rm. Suppose that for some B ∈ Rn×m there exist k-sparse bi ∈ Rm such that |Aai − Bbi|2 ≤ ε for all i ∈ [N]. Since there are (k − 1)(m choose k) + 1 vectors ai with a given support S ∈ T, the pigeon-hole principle implies that there exist some S′ ∈ ([m] choose k) and some set of indices J(S) of cardinality k such that all ai and bi with i ∈ J(S) have supports S and S′, respectively. Let X and X′ be the m × N matrices with columns ai and bi, respectively. It follows from the general linear position of the ai and the linear independence of every k columns of A that the columns of the n × k matrix AXJ(S) are linearly independent, i.e. L(AXJ(S)) > 0, and they therefore form a basis for Span{AS}. Fixing z ∈ Span{AS}, there then exists a unique c = (c1, . . . , ck) ∈ Rk such that z = AXJ(S)c. Letting z′ = BX′J(S)c, which lies in Span{BS′}, we have:

  |z − z′|2 = |Σ_{j=1}^{k} cj(AXJ(S) − BX′J(S))ej|2 ≤ ε Σ_{j=1}^{k} |cj| ≤ ε√k|c|2 ≤ (ε√k/L(AXJ(S)))|AXJ(S)c|2 = (ε√k/L(AXJ(S)))|z|2.

Hence,

  sup{ d(z, Span{BS′}) : z ∈ Span{AS}, |z|2 = 1 } ≤ ε√k/L(AXJ(S)).    (23)

We now show that (7) follows if ε < (L2(A)/√2)C−1, with C as defined in (14). In this case we can bound the RHS of (23) as follows. Letting ρ = maxj∈[m] |Aj|2 and I(S) = {i : supp(ai) = S}, we have:

  ε√k/L(AXJ(S)) < φk(A)L2(A) minS∈T Lk(AXI(S)) / (√2 ρk L(AXJ(S))) ≤ φk(A)L2(A)/(√2 ρk).    (24)

Since L2(A) ≤ ρ√2 and φk(A) ≤ 1, we have that the RHS of (23) is strictly less than one. It follows by Lemma 2 that dim(Span{BS′}) ≥ dim(Span{AS}) = k (since every k columns of A are linearly independent). Since |S′| = k, we have dim(Span{BS′}) ≤ k; hence, dim(Span{BS′}) = dim(Span{AS}). Recalling (18), we see that the association S ↦ S′ thus defines a map π : T → ([m] choose k) satisfying

  Θ(Span{AS}, Span{Bπ(S)}) ≤ ε√k/L(AXJ(S))   for all S ∈ T.    (25)

From (24) and (25) we see that the inequality Θ(Span{AS}, Span{Bπ(S)}) ≤ (φk(A)/(ρk))δ is satisfied for any δ ≥ (ρk/φk(A)) ε√k/L(AXJ(S)) (see Remark 1 for why we can be sure φk(A) ≠ 0). We therefore satisfy (16) for δ < L2(A)/√2 by setting

  δ = ε√(k³) maxj∈[m] |Aj|2 / ( φk(A) minS∈T Lk(AXI(S)) ) = Cε.    (26)

It follows by Lemma 1 that there exist a permutation matrix P ∈ Rm×m and an invertible diagonal matrix D ∈ Rm×m such that |Aj − (BPD)j|2 ≤ Cε for all j ∈ [m]. The proof of how (8) follows from this result is contained in Remark 3.

IV. UNKNOWN REPRESENTATION DIMENSION

In this section we state a version of Theorem 1 and Lemma 1 assuming that B also satisfies the spark condition (in addition to A). With this additional assumption, we can address the issue of recovering A ∈ Rn×m and the ai ∈ Rm when only an upper bound m′ on the number m of columns in A is known.

Theorem 2. Fix positive integers n, m, m′, and k with k < m ≤ m′, and fix a cyclic order on [m]. If k-sparse a1, . . . , aN ∈ Rm are such that for every interval of length k in the cyclic order there are (k − 1)(m′ choose k) + 1 vectors ai in general linear position supported on that interval, and A ∈ Rn×m and B ∈ Rn×m′ both satisfy spark condition (5), then there exists a constant C̃ > 0 for which the following holds for all ε < (L2(A)/√2)C̃−1. If there exist k-sparse b1, . . . , bN ∈ Rm′ such that |Aai − Bbi|2 ≤ ε for all i ∈ [N], then:

  |Aj − (BPD)j|2 ≤ C̃ε   for all j ∈ [m]    (27)

for some partial permutation matrix P ∈ Rm′×m′ (there is at most one nonzero entry in each row and column, and these nonzero entries are all one) and diagonal matrix D ∈ Rm′×m (Dij = 0 whenever i ≠ j). In other words, the columns of the learned dictionary B contain (up to noise, and after appropriate scaling) the columns of the original dictionary A. One can then show, by manipulations similar to those contained in Remark 3, that the coefficients in each ai form (up to noise) a scaled subset of the coefficients in bi. The constant C̃ is given by:

  C̃ = √(k³) maxj∈[m] |Aj|2 / ( min(φk(A), φk(B)) minS∈T Lk(AXI(S)) ),    (28)


where X is the m × N matrix with columns given by the ai and I(S) = {i : supp(ai) = S}. The proof of Theorem 2 is very similar to the proof of Theorem 1, the difference being that we now establish a map π : T → ([m′] choose k) satisfying the requirements of Lemma 3, which we state next, by pigeonholing (k − 1)(m′ choose k) + 1 vectors with respect to the holes ([m′] choose k). This ensures that we can establish a one-to-one correspondence between subspaces spanned by columns of B and nearby subspaces spanned by columns of A, despite the fact that m < m′. By requiring B to also satisfy the spark condition, we remove the dependency of Lemma 1 on Lemma 7 (which requires that m = m′), resulting in Lemma 3.

Lemma 3 (Main Lemma for m < m′). Fix positive integers n, m, m′, and k with k < m < m′, and let T be the set of intervals of length k in some cyclic ordering of [m]. Let A ∈ Rn×m and B ∈ Rn×m′ both satisfy spark condition (5), with A having maximum column ℓ2-norm ρ. If there exists a map π : T → ([m′] choose k) and some δ < L2(A)/√2 such that

  Θ(Span{AS}, Span{Bπ(S)}) ≤ (δ/(ρk)) min(φk(A), φk(B)),   for all S ∈ T,    (29)

then there exist a partial permutation matrix P ∈ Rm′×m′ and a diagonal matrix D ∈ Rm′×m such that

  |Aj − (BPD)j|2 ≤ δ,   for all j ∈ [m].    (30)

We defer the proof of this lemma to Appendix B.

V. PROBABILISTIC IDENTIFIABILITY

We next give precise statements of our probabilistic versions of Theorem 1, which apply to random sparse vectors as defined in Definition 2. Our brief proofs rely largely on the following lemma, the proof of which can be found in [1, Lemma 3]:

Lemma 4. Fix positive integers n, m, k < m, and a matrix M ∈ Rn×m satisfying (5). With probability one, Ma1, . . . , Mak are linearly independent whenever the ai are random k-sparse vectors.

Theorem 3. Fix positive integers n, m, k < m, and A ∈ Rn×m satisfying (5). If a set of N = m(k − 1)(m choose k) + m randomly drawn vectors a1, . . . , aN ∈ Rm are such that for each interval of length k in some cyclic order on [m] there are (k − 1)(m choose k) + 1 vectors ai supported on that interval, then {Aa1, . . . , AaN} has a robustly identifiable k-sparse representation with probability one.

Proof: By Lemma 4 (setting M to be the identity matrix), with probability one the ai are in general linear position. Apply Theorem 1.

Note that an algebraic set is a solution to a finite set of polynomial equations.

Theorem 4. Fix positive integers k < m and n. If N = m(k − 1)(m choose k) + m randomly drawn vectors a1, . . . , aN ∈ Rm are such that for every interval of length k in some cyclic ordering of [m] there are (k − 1)(m choose k) + 1 vectors ai supported on that interval, then with probability one the following holds. There is an algebraic set Z ⊂ Rn×m of Lebesgue measure zero with the following property: if A ∉ Z then {Aa1, . . . , AaN} has a robustly identifiable k-sparse representation in Rm.

Proof: By Lemma 4 (setting M to be the identity matrix), with probability one the ai are in general linear position. By the same arguments made in the proof of Theorem 3 in [1], the set of matrices A that fail to satisfy (5) forms an algebraic set of measure zero. Apply Theorem 1.

We note that our results in the deterministic case (Theorem 1) were derived for worst-case noise, which (depending on the noise distribution) may have very small probability of occurrence. In particular, for fixed k, the larger the ambient dimension of y, the smaller the probability that randomly generated noise points in a direction which conflates signals generated by distinct k-dimensional subspaces spanned by the columns of A. For a given distribution the "effective" noise may therefore be much smaller, with the original dictionary and sparse codes being identifiable up to a commensurate error with high probability.

VI. DISCUSSION

In this note, we generalize the known uniqueness of solutions to (2) in the exact case (ε = 0) to the more realistic case of deterministic noise (ε > 0). Somewhat surprisingly, as long as the standard assumptions from compressed sensing are met, almost every dictionary and any sufficiently large collection of N sparse codes are uniquely determined up to the error in sampling and the problem's inherent symmetries (uniform relabeling and scaling). To convince the reader of the general usefulness of this result, we elaborate briefly on four diverse application areas.

Sparse Component Analysis (SCA). In the SCA framework, it is assumed that the linear model in (1) describes some real physical process, and the goal is to infer the true underlying dictionary A and sparse codes ai from the measurements yi. Often in such analyses it is implicitly assumed that sparse coding or dictionary learning algorithms have exposed this underlying sparse structure in the data (e.g. the feature-tuned field potentials in [13]) as opposed to some artifactual degenerate solution.


Our results suggest that, given enough data samples, the uniqueness of this decomposition is indeed the norm rather than the exception. Regarding this, it would be useful to determine for general k the best possible dependence of ε on δ1, δ2 (see Def. 1) as well as the minimum possible sample size N. We encourage the sparse coding community to extend our results and find a tight dependency among all the parameters, both for the sake of theory and for practical applications.

Smoothed Analysis. The main idea in smoothed analysis [14] is that certain algorithms having bad worst-case behavior are, nonetheless, efficient if certain (typically Lebesgue measure zero) pathological input sets are avoided. Our results imply that if there is an efficient "smoothed" algorithm for solving (2), then only for a measure-zero set of inputs will this algorithm fail to determine the unique original solution; this is a focus of future work. We note that avoiding "bad" (NP-hard) sets of inputs is necessary for dictionary learning [15], [16].

Communication Theory. In [17] and [18] it was posited that sparse features of natural data could be passed through a communication bottleneck in the brain using random projections and decoded by unsupervised learning of sparse codes. A necessary condition for this theory to work is that there is a unique solution to the SCA problem. This was verified in the case of data sampled without noise in [1]. The present work extends this theory to the more realistic case of sampling noise.

Applied Physics and Engineering. As a final application domain, we consider engineering applications. Several groups have found ways to utilize compressed sensing for signal processing tasks, such as digital image compression [19] (the "single-pixel camera") and, more recently, the design of an ultrafast camera [20] capable of capturing one hundred billion frames per second. Given such effective uses of classical CS, it is only a matter of time before these systems utilize sparse coding algorithms to code and process data. In this case, guarantees such as the ones offered by our main theorems allow any such device to be compared to any other (having different initial parameters and data samples) as long as the data originate from the same system.

APPENDIX A
COMBINATORIAL MATRIX THEORY

In this section, we prove Lemma 1, which is the main ingredient in our proof of Theorem 1. For readers willing to assume a priori that the spark condition holds for B as well as for A, a shorter proof of this case (Lemma 3 from Section IV) is provided in Appendix B. This additional assumption simplifies the argument and allows us to extend robust identifiability conditions to the case where only an upper bound on the number of columns m in A is known. We now prove some auxiliary lemmas before deriving Lemma 1. Given a collection of sets T, we let ∩T denote their intersection.

Lemma 5. Let M ∈ Rn×m. If every 2k columns of M are linearly independent, then for any T ⊆ ∪_{ℓ≤k} ([m] choose ℓ),

  Span{M∩T} = ∩_{S∈T} Span{MS}.    (31)

Proof: By induction, it is enough to prove the lemma when |T| = 2. The proof now follows directly from the assumption.

Lemma 6. Fix k ≥ 2. Let V = {V1, . . . , Vk} be subspaces of Rm and let V = ∩V. For every x ∈ Rm, we have

  |x − ΠV x|2 ≤ (1/(1 − ξ(V))) Σ_{i=1}^{k} |x − ΠVi x|2,    (32)

provided ξ(V) ≠ 1, where the expression for ξ is given in Def. 4.

Proof: Fix x ∈ Rm and k ≥ 2. The proof consists of two parts. First, we shall show that

  |x − ΠV x|2 ≤ Σ_{ℓ=1}^{k} |x − ΠVℓ x|2 + |ΠVk ΠVk−1 · · · ΠV1 x − ΠV x|2.    (33)

For each ℓ ∈ {2, . . . , k + 1} (when ℓ = k + 1, the product ΠVk · · · ΠVℓ is set to I), we have by the triangle inequality and the fact that ||ΠVℓ||2 ≤ 1 (as the ΠVℓ are projections):

  |ΠVk · · · ΠVℓ x − ΠV x| ≤ |ΠVk · · · ΠVℓ−1 x − ΠV x| + |x − ΠVℓ−1 x|.    (34)

Summing these inequalities over ℓ gives (33). Next, we show how the result (32) follows from (33) using the following result of [10, Theorem 9.33]:

  |ΠVk ΠVk−1 · · · ΠV1 x − ΠV x|2 ≤ z|x|2   for all x ∈ Rm,    (35)

where z = [1 − ∏_{ℓ=1}^{k−1}(1 − zℓ²)]^{1/2} and zℓ = cos θF(Vℓ, ∩_{s=ℓ+1}^{k} Vs). To see this, note that

  |ΠVk ΠVk−1 · · · ΠV1(x − ΠV x) − ΠV(x − ΠV x)|2 = |ΠVk ΠVk−1 · · · ΠV1 x − ΠV x|2,    (36)


since ΠVℓ ΠV = ΠV for all ℓ = 1, . . . , k and ΠV² = ΠV. Therefore, by (35) and (36), it follows that

  |ΠVk ΠVk−1 · · · ΠV1 x − ΠV x|2 = |ΠVk ΠVk−1 · · · ΠV1(x − ΠV x) − ΠV(x − ΠV x)|2 ≤ z|x − ΠV x|2.

Combining this last inequality with (33) and rearranging, we arrive at

  |x − ΠV x|2 ≤ (1/(1 − z)) Σ_{i=1}^{k} |x − ΠVi x|2.    (37)

Finally, since the ordering of the subspaces is arbitrary, we can replace z in (37) with ξ(V) to obtain (32).

Lemma 7. Fix positive integers k < m, and let S1, . . . , Sm be the set of contiguous length-k intervals in some cyclic order of [m]. Suppose there exists a map π : T → ([m] choose k) such that

  |∩_{i∈J} π(Si)| ≤ |∩_{i∈J} Si|   for all J ∈ ([m] choose k).    (38)

Then, |π(Sj1) ∩ · · · ∩ π(Sjk)| = 1 for j1, . . . , jk consecutive modulo m.

Proof: Consider the set Qm = {(r, t) : r ∈ π(St), t ∈ [m]}, which has mk elements. By the pigeon-hole principle, there is some q ∈ [m] and J ∈ ([m] choose k) such that (q, j) ∈ Qm for all j ∈ J. In particular, we have q ∈ ∩_{j∈J} π(Sj), so that from (38) there must be some p ∈ [m] with p ∈ ∩_{j∈J} Sj. Since |J| = k, this is only possible if the elements of J = {j1, . . . , jk} are consecutive modulo m, in which case |∩_{j∈J} Sj| = 1. Hence |∩_{j∈J} π(Sj)| = 1 as well. We next consider whether any other t ∉ J is such that q ∈ π(St). Suppose there were such a t; then we would have q ∈ π(St) ∩ π(Sj1) ∩ · · · ∩ π(Sjk), and (38) would imply that the intersection of every k-element subset of {St} ∪ {Sj : j ∈ J} is nonempty. This would only be possible if {t} ∪ J = [m], in which case the result trivially holds, since then q ∈ π(Sj) for all j ∈ [m]. Suppose now there exists no such t; then, letting Qm−1 ⊂ Qm be the set of elements of Qm not having q as a first coordinate, we have |Qm−1| = (m − 1)k. By iterating the above arguments we arrive at a partitioning of Qm into sets Ri = Qi \ Qi−1 for i = 1, . . . , m, each having a unique element of [m] as a first coordinate common to all k elements while having second coordinates which form a consecutive set modulo m. In fact, every set of k consecutive integers modulo m is the set of second coordinates of some Ri. This must be the case because for every consecutive set J we have |∩_{j∈J} Sj| = 1, whereas if J were the set of second coordinates for two distinct sets Ri we would have |∩_{j∈J} π(Sj)| ≥ 2, which violates (38).

Proof of Lemma 1 (Main Lemma): We assume k ≥ 2, since the case k = 1 was proven at the beginning of Section III. Let S1, . . . , Sm be the set of contiguous length-k intervals in some cyclic ordering of [m]. We begin by proving that dim(Span{Bπ(Si)}) = k for all i ∈ [m]. Fix i ∈ [m] and note that by (16) we have, for all unit vectors u ∈ Span{ASi}, that d(u, Span{Bπ(Si)}) ≤ (φk(A)/(ρk))δ for δ < L2(A)/√2. By definition of L2(A) we have for all 2-sparse x ∈ Rm:

  L2(A) ≤ |Ax|2/|x|2 ≤ ρ|x|1/|x|2 ≤ ρ√2.    (39)

Hence δ < ρ. Since φk ≤ 1, we have d(u, Span{Bπ(Si)}) < 1 and it follows by Lemma 2 that dim(Span{Bπ(Si)}) ≥ dim(Span{ASi}) = k. Since |π(Si)| = k, we in fact have dim(Span{Bπ(Si)}) = k. We will now show that

  |∩_{i∈J} π(Si)| ≤ |∩_{i∈J} Si|   for all J ∈ ([m] choose k).    (40)

Fix J ∈ ([m] choose k). By (16) we have, for all unit vectors u ∈ ∩_{i∈J} Span{Bπ(Si)}, that d(u, Span{ASj}) ≤ (φk(A)/(ρk))δ for all j ∈ J, where δ < L2(A)/√2. It follows by Lemma 6 that

  d(u, ∩_{j∈J} Span{ASj}) ≤ (δ/ρ) · φk(A)/(1 − ξ({Span{ASi} : i ∈ J})) ≤ δ/ρ,

where the second inequality follows immediately from the definition of φk(A). Now, since Span{B∩_{i∈J} π(Si)} ⊆ ∩_{i∈J} Span{Bπ(Si)} and (by Lemma 5) ∩_{i∈J} Span{ASi} = Span{A∩_{i∈J} Si}, we have

  d(u, Span{A∩_{i∈J} Si}) ≤ δ/ρ   for all unit vectors u ∈ Span{B∩_{i∈J} π(Si)}.    (41)

We therefore have by Lemma 2 (since δ/ρ < 1) that dim(Span{B∩_{i∈J} π(Si)}) ≤ dim(Span{A∩_{i∈J} Si}), and (40) follows by the linear independence of the columns of ASi and Bπ(Si) for all i ∈ [m]. Suppose now that J = {i − k + 1, . . . , i} so that ∩_{j∈J} Sj = {i}. By (40) we have that ∩_{j∈J} π(Sj) is either empty or contains a single element. Lemma 7 ensures that the latter case is the only possibility. Thus, the association i ↦ ∩_{j∈J} π(Sj) defines a map


π̂ : [m] → [m]. Recalling (18), it follows from (41) that for all unit vectors u ∈ Span{Ai} we have d(u, Span{Bπ̂(i)}) ≤ δ/ρ also. Since i is arbitrary, it follows that for every canonical basis vector ei ∈ Rm, letting ci = |Aei|2−1 and ε = δ/ρ, there exists some c′i ∈ R such that |ciAei − c′iBeπ̂(i)|2 ≤ ε, where ε < (L2(A)/√2) minj∈[m] cj. This is exactly the supposition in (21) and the result follows from the subsequent arguments of Section III.

Remark 5. In general, there may exist combinations of fewer supports with intersection {i}; e.g., if m ≥ 2k − 1 then Si−(k−1) ∩ Si = {i}. For brevity, we have considered a construction that is valid for any k < m.

APPENDIX B
UNKNOWN REPRESENTATION DIMENSION

In this section we prove Lemma 3 from Section IV by first considering the case when k = 1 and then showing how the case k > 1 reduces to this. The proof is largely identical to that of Lemma 1; hence, we gloss over some identical manipulations.

Proof of Lemma 3: Let k = 1 and fix some A ∈ Rn×m satisfying spark condition (5). Suppose we have N = m 1-sparse vectors ai = ciei (i = 1, . . . , m) for c1, . . . , cm ∈ R \ {0}. By (28) we have:

  C̃ ≥ maxi∈[m] 1/|ci|.    (42)

Suppose that for some B ∈ Rn×m′ and 1-sparse bi ∈ Rm′ we have |Aai − Bbi|2 ≤ ε < (L2(A)/√2)C̃−1 for all i ∈ [m]. Since the bi are 1-sparse, there must exist c′1, . . . , c′m ∈ R and some map π : [m] → [m′] such that

  |ciAei − c′iBeπ(i)|2 ≤ ε for all i ∈ [m].    (43)

Note that c′i ≠ 0 for all i, since otherwise (by the definition of L2(A)) we would reach the contradiction |ciAei|2 < mini∈[m] |ciAei|2. We will now show that π is necessarily injective. Suppose that π(i) = π(j) = ℓ for some i ≠ j and ℓ ∈ [m′]. Then |ciAei − c′iBeℓ|2 ≤ ε and |cjAej − c′jBeℓ|2 ≤ ε. Using the triangle inequality followed by the definition of L2(A), we have:

  (|c′i| + |c′j|)ε ≥ (L2(A)/√2)(|c′j| + |c′i|) minℓ∈[m] |cℓ|,

which contradicts (42) and our upper bound on ε. Hence, π is injective. Letting P = (eπ(1) · · · eπ(m), 0, . . . , 0) ∈ Rm′×m′ and D = diag(c′1/c1, . . . , c′m/cm) ∈ Rm′×m, we see that (43) becomes

  |(A − BPD)ei|2 ≤ C̃ε for all i ∈ [m].    (44)

Now suppose k ≥ 2 and let S1, . . . , Sm be the set of contiguous length-k intervals in some cyclic ordering of [m]. Fix i ∈ [m] and let J = {i − k + 1, . . . , i} so that ∩_{j∈J} Sj = {i}. Let ρ = maxj∈[m] |Aj|2. Assumption (29) implies that for all unit vectors u ∈ ∩_{j∈J} Span{Bπ(Sj)} we have d(u, Span{ASj}) ≤ (φk(A)/(ρk))δ for all j ∈ J, with δ < L2(A)/√2. By Lemma 6 we have that:

  d(u, ∩_{j∈J} Span{ASj}) ≤ (δ/ρ) · φk(A)/(1 − ξ({Span{ASj} : j ∈ J})) ≤ δ/ρ,    (45)

where the second inequality is by definition of φk(A). Now, since Span{B∩_{i∈J} π(Si)} ⊆ ∩_{i∈J} Span{Bπ(Si)} and (by Lemma 5) ∩_{j∈J} Span{ASj} = Span{Ai}, we have

  d(u, Span{Ai}) ≤ δ/ρ   for all u ∈ Span{B∩_{i∈J} π(Si)}.    (46)

We therefore have by Lemma 2, setting V = Span{B∩_{i∈J} π(Si)} and W = Span{Ai}, that dim(V) ≤ dim(W). The same arguments can be used to prove (46) with the roles of A and B reversed. Hence, dim(V) = dim(W) = 1 and it follows that |∩_{i∈J} π(Si)| = 1, since the columns of B∩_{i∈J} π(Si) are linearly independent by the spark condition. Thus, the association i ↦ ∩_{i∈J} π(Si) defines a map π̂ : [m] → [m′] such that (recalling (18)) d(u, Span{Bπ̂(i)}) ≤ δ/ρ for all unit vectors u ∈ Span{Ai} and all i ∈ [m]. Since i is arbitrary, it follows that for every canonical basis vector ei ∈ Rm, letting ci = |Aei|2−1 and ε = δ/ρ, there exists some c′i ∈ R such that |ciAei − c′iBeπ̂(i)|2 ≤ ε, where ε < (L2(A)/√2) mini∈[m] |ci|. This is exactly the supposition in (43) and the proof is completed by the subsequent arguments of the k = 1 case.

ACKNOWLEDGMENT

We thank Fritz Sommer for turning our attention to the dictionary learning problem, Darren Rhea for sharing his early explorations of this problem, and Ian Morris for posting identity (18) with a reference to his proof of it on the internet (www.stackexchange.com).


REFERENCES

[1] C. Hillar and F. Sommer, "When can dictionary learning uniquely recover sparse data from subsamples?" IEEE Transactions on Information Theory, vol. 61, no. 11, pp. 6290–6297, 2015.
[2] Y. Li, K. Lee, and Y. Bresler, "A unified framework for identifiability analysis in bilinear inverse problems with applications to subspace and sparsity models," CoRR, vol. abs/1501.06120, 2015. [Online]. Available: http://arxiv.org/abs/1501.06120
[3] P. Georgiev, F. Theis, and A. Cichocki, "Sparse component analysis and blind source separation of underdetermined mixtures," IEEE Transactions on Neural Networks, vol. 16, no. 4, pp. 992–996, 2005.
[4] M. Aharon, M. Elad, and A. M. Bruckstein, "On the uniqueness of overcomplete dictionaries, and a practical way to retrieve them," Linear Algebra and its Applications, vol. 416, no. 1, pp. 48–67, 2006.
[5] J. F. Grcar, "A matrix lower bound," Linear Algebra and its Applications, vol. 433, no. 1, pp. 203–220, 2010.
[6] E. J. Candes and T. Tao, "Near-optimal signal recovery from random projections: Universal encoding strategies?" IEEE Transactions on Information Theory, vol. 52, no. 12, pp. 5406–5425, 2006.
[7] D. Donoho, "Compressed sensing," IEEE Transactions on Information Theory, vol. 52, no. 4, pp. 1289–1306, 2006.
[8] E. J. Candes, J. K. Romberg, and T. Tao, "Stable signal recovery from incomplete and inaccurate measurements," Communications on Pure and Applied Mathematics, vol. 59, no. 8, pp. 1207–1223, 2006.
[9] R. Baraniuk, M. Davenport, R. DeVore, and M. Wakin, "A simple proof of the restricted isometry property for random matrices," Constructive Approximation, vol. 28, no. 3, pp. 253–263, 2008.
[10] F. R. Deutsch, Best Approximation in Inner Product Spaces. Springer Science & Business Media, 2012.
[11] N. I. Akhiezer and I. M. Glazman, Theory of Linear Operators in Hilbert Space. Courier Corporation, 2013.
[12] I. D. Morris, "A rapidly-converging lower bound for the joint spectral radius via multiplicative ergodic theory," Advances in Mathematics, vol. 225, no. 6, pp. 3425–3445, 2010.
[13] G. Agarwal, I. H. Stevenson, A. Berényi, K. Mizuseki, G. Buzsáki, and F. T. Sommer, "Spatially distributed local fields in the hippocampus encode rat position," Science, vol. 344, no. 6184, pp. 626–630, 2014.
[14] D. A. Spielman and S.-H. Teng, "Smoothed analysis of algorithms: Why the simplex algorithm usually takes polynomial time," Journal of the ACM, vol. 51, no. 3, pp. 385–463, 2004.
[15] M. Razaviyayn, H.-W. Tseng, and Z.-Q. Luo, "Computational intractability of dictionary learning for sparse representation," arXiv preprint arXiv:1511.01776, 2015.
[16] A. M. Tillmann, "On the computational intractability of exact and approximate dictionary learning," IEEE Signal Processing Letters, vol. 22, no. 1, pp. 45–49, 2015.
[17] W. K. Coulter, C. J. Hillar, G. Isely, and F. T. Sommer, "Adaptive compressed sensing – a new class of self-organizing coding models for neuroscience," in Acoustics, Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on. IEEE, 2010, pp. 5494–5497.
[18] G. Isely, C. Hillar, and F. Sommer, "Deciphering subsampled data: adaptive compressive sampling as a principle of brain communication," in Advances in Neural Information Processing Systems, 2010, pp. 910–918.
[19] M. F. Duarte, M. A. Davenport, D. Takhar, J. N. Laska, T. Sun, K. F. Kelly, and R. G. Baraniuk, "Single-pixel imaging via compressive sampling," IEEE Signal Processing Magazine, vol. 25, no. 2, pp. 83–91, March 2008.
[20] L. Gao, J. Liang, C. Li, and L. V. Wang, "Single-shot compressed ultrafast photography at one hundred billion frames per second," Nature, vol. 516, no. 7529, pp. 74–77, 2014.

Charles J. Garfinkle completed a B.S. in Physics and Chemistry at McGill University. He is currently a Ph.D. candidate in Neuroscience at UC Berkeley.

Christopher J. Hillar completed a B.S. in Mathematics and a B.S. in Computer Science at Yale University. Supported by an NSF Graduate Research Fellowship, he received his Ph.D. in Mathematics from the University of California (UC), Berkeley in 2005. From 2005-2008, he was a Visiting Assistant Professor and NSF Postdoctoral Fellow at Texas A&M University. From 2008-2010, he was an NSF Mathematical Sciences Research Institutes Postdoctoral Fellow at the Mathematical Sciences Research Institute (MSRI) in Berkeley, CA. In 2010, he joined the Redwood Center for Theoretical Neuroscience at UC Berkeley, and in 2011, he became a research specialist in the Tecott mouse behavioral neuroscience lab at UC San Francisco.
