On the Sample Complexity of Predictive Sparse Coding

Nishant A. Mehta∗ and Alexander G. Gray

arXiv:1202.4050v1 [cs.LG] 18 Feb 2012

School of Computer Science, College of Computing, Georgia Institute of Technology, Atlanta, Georgia

∗To whom correspondence should be addressed. Email: [email protected]

February 21, 2012

Abstract

Predictive sparse coding algorithms recently have demonstrated impressive performance on a variety of supervised tasks, but they lack a learning theoretic analysis. We establish the first generalization bounds for predictive sparse coding. In the overcomplete dictionary learning setting, where the dictionary size k exceeds the dimensionality d of the data, we present an estimation error bound that is roughly O(√(dk/m) + √s/(µm)). In the infinite-dimensional setting, we show a dimension-free bound that is roughly O(k√s/(µ√m)). The quantity µ is a measure of the incoherence of the dictionary and s is the sparsity level. Both bounds are data-dependent, explicitly taking into account certain incoherence properties of the learned dictionary and the sparsity level of the codes learned on actual data.

Keywords: Statistical learning theory, luckiness, data-dependent complexity, dictionary learning, sparse coding, LASSO

1 Introduction

Learning architectures such as the support vector machine and other linear predictors enjoy strong theoretical properties and many empirical successes (Steinwart and Christmann, 2008; Kakade et al., 2009; Schölkopf and Smola, 2002), but a learning-theoretic study of many more complex learning architectures is lacking. Predictive methods based on sparse coding recently have emerged which simultaneously learn a data representation via a nonlinear encoding scheme and an estimator linear in that representation (Bradley and Bagnell, 2009; Mairal et al., 2010, 2009). A sparse coding representation z ∈ R^k of a data point x ∈ R^d is learned by representing x as a sparse linear combination of the k atoms D_j ∈ R^d of a dictionary D = (D_1, ..., D_k) ∈ R^{d×k}. In the coding x ≈ Σ_{j=1}^k z_j D_j, all but a few z_j are zero. Predictive sparse coding methods such as Mairal et al. (2010)'s task-driven dictionary learning recently have achieved state-of-the-art results on many tasks, including the MNIST digits task. Whereas standard sparse coding minimizes an unsupervised, reconstructive ℓ2 loss, predictive sparse coding seeks to minimize a supervised loss by optimizing both a dictionary D and a predictor that takes encodings to D as input. It has been observed empirically that sparse coding can provide good abstraction by finding higher-level representations which are useful in predictive tasks (Yu et al., 2009). Intuitively, the power of prediction-driven dictionaries is that they pack more atoms into parts of the representational space where the prediction task is more difficult.

Despite the empirical successes of predictive sparse coding methods, it is unknown how well these methods generalize in a theoretical sense. In this work, we develop what are, to our knowledge, the first generalization error bounds for predictive sparse coding algorithms; in particular, we focus on ℓ1-regularized sparse coding. Maurer and Pontil (2008) and Vainsencher et al. (2011) previously established generalization bounds for the classical, reconstructive sparse coding setting. There, the objective is to learn a representation of data by coding the data to a dictionary of atoms such that the data can be reconstructed with low error using only the coded data and the dictionary. Extending their analysis to the predictive setting introduces certain difficulties, the most salient being a seemingly unavoidable demand that the learned encoder be stable with respect to dictionary perturbations. Our analysis therefore is intimately tied to encoder stability properties. The sparse encoder's stability is characterized by properties central to sparse inference:

• The sparsity level s – the number of non-zeros in an encoding of a point.
• The s-incoherence µ_s – the square of the minimum singular value among all s-column subdictionaries of D.
• The coding margin – a measure of a coordinate's sign stability.

Since the sparsity level and the coding margin depend on the particular training sample, our learning bounds will depend on the maximum sparsity level and the minimum coding margin observed on the training sample. The need for encoder stability may be the price one pays for absconding from the (quite stable) prison of ℓ2-regularization. As in the luckiness frameworks of Shawe-Taylor et al. (1998) and Herbrich and Williamson (2002), our learning bounds depend on a posteriori-observable properties related to the training sample and the learned hypothesis; hence we will need to guarantee that these properties hold with high probability over a second, ghost sample.

We provide learning bounds for two core scenarios in sparse coding: the overcomplete setting of k ≥ d and the infinite-dimensional setting where d ≫ k or d is even infinite. Both bounds hold provided that m is larger than 3 times the inverse of the permissible radius of perturbation, a function linear in the coding margin, the sparsity level, the inverse of the s-incoherence, and the ℓ1-regularization parameter λ. Our contributions are:

1. A forward stability result for the LASSO which holds under mild conditions. This result implies conditions for support preservation under dictionary perturbations.
2. In the overcomplete setting, a learning bound on the estimation error for predictive sparse coding that is essentially of order √(dk/m) + √s/(µ_s m) (Corollary 1).
3. In the infinite-dimensional setting, a learning bound on the estimation error for predictive sparse coding that is independent of the dimension of the data; the bound is essentially of order k√s/(µ_{2s}√m) (Corollary 2).

Whereas in the overcomplete setting the bound depends on the observed sparsity level and the coding margin on the training sample, in the infinite-dimensional setting the bound depends on these quantities as measured on both the training sample and a second, unlabeled sample not used for training. Both learning bounds contain a factor of k, and the optimal order of k is unknown, in both the predictive setting and the reconstructive setting.

In addition to providing guarantees in the case of simultaneous dictionary and linear estimator learning (true predictive sparse coding), the presented bounds are quite general in that they also apply to the heuristic, two-stage algorithm where one is given a labeled training sample, learns a dictionary reconstructively (without using the labels) to fix the sparse codes, and then learns a linear estimator using the same training sample together with the labels. A sketch of this two-stage pipeline is given below.
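The following is a minimal sketch of the two-stage heuristic just described, not the algorithm analyzed in this paper; it assumes scikit-learn's DictionaryLearning and LogisticRegression, with n_components playing the role of k and alpha standing in for the ℓ1 parameter λ only up to scikit-learn's internal scaling.

```python
# Two-stage heuristic: (1) learn D reconstructively (labels unused),
# (2) fit a linear estimator on the resulting sparse codes.
import numpy as np
from sklearn.decomposition import DictionaryLearning
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))                                   # m = 200 points in R^d, d = 30
X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0)   # keep points in the unit ball
y = (X[:, 0] > 0).astype(int)                                    # toy binary labels

k, lam = 50, 0.1                                                 # dictionary size and l1 parameter
dl = DictionaryLearning(n_components=k, alpha=lam,
                        transform_algorithm="lasso_cd", transform_alpha=lam,
                        max_iter=20, random_state=0)
Z = dl.fit_transform(X)                                          # sparse codes, one row per point
clf = LogisticRegression().fit(Z, y)                             # linear estimator on the codes
print("train accuracy:", clf.score(Z, y))
```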
The next section introduces the reconstructive and predictive forms of sparse coding. Section 3 sets up notation and presents two results: the Sparse Coding Stability Theorem (Theorem 1) and a useful symmetrization lemma. The overcomplete and infinite-dimensional settings are covered in Sections 4 and 5 respectively. In Section 6 we compare the results to recent learning bounds for unsupervised sparse coding. To allow the paper to flow smoothly, we provide proof sketches in the main paper and leave detailed proofs to the appendix.


2 Sparse coding, reconstructive and predictive

Suppose points x_1, ..., x_m have been drawn from a probability measure P_X over B_{R^d}, the unit ball in R^d. The sparse coding problem is to represent each point x_i as a sparse linear combination of k basis vectors, or atoms, D_1, ..., D_k. The atoms form the columns of a dictionary D ∈ D, where D := (B_{R^d})^k is the space of dictionaries and D_i = (D_{i1}, ..., D_{id})^T. We will use this definition of D from here on. Throughout this work, rB_{R^d} denotes the origin-centered ball in R^d scaled to radius r. It will be useful to frame sparse coding in terms of an encoder ϕ_D:

ϕ_D(x) := argmin_{z ∈ R^k} ‖x − Dz‖_2^2 + ξ(z),    (1)

where ξ(·) is a sparsity-inducing regularizer, or, to be colorfully concise, a sparsifier.

Reconstructive sparse coding. The reconstructive sparse coding objective is

min_{D ∈ D} E_{x ∼ P_X} ‖x − Dϕ_D(x)‖_2^2 + ξ(ϕ_D(x)),

where P_X is a probability measure on B_{R^d}. Generalization bounds for the empirical risk minimization (ERM) variant of this objective have been established: Maurer and Pontil (2008) showed an O(k/√m) bound that is independent of the dimension d; this is useful when d ≫ k, as in general Hilbert spaces. Vainsencher et al. (2011) handled the overcomplete setting, producing a bound of order O(√(dk/m)) as well as fast rates of O(dk/m).

Predictive sparse coding. As in Mairal et al. (2010), the predictive setup minimizes a supervised loss with respect to a representation and an estimator linear in the representation. Define a space of linear hypotheses W := rB_{R^k}, for r > 0. Given a probability measure P on B_{R^d} × Y, with Y ⊂ R in regression and Y = {−1, 1} in binary classification, the predictive sparse coding objective is

min_{D ∈ D, w ∈ W} E_{(x,y) ∼ P} φ(y, ⟨w, ϕ_D(x)⟩) + Θ(w),    (2)

where Θ(·) is a regularizer, often taken to be proportional to the squared ℓ2-norm. Other formulations exist, some of which wield a separate dictionary for each class (Mairal et al., 2008); we do not consider such formulations here, for multiple reasons. First, they do not naturally extend to regression problems. Also, when the number of classes is large, both the computational and sample complexities of learning a dictionary for each class become prohibitive.

Encoder stability. The choice of the sparsifier ξ appears pivotal from both an empirical and a theoretical perspective. Bradley and Bagnell (2009) used a differentiable approximate sparsifier based on the Kullback-Leibler divergence. The sparsifier is approximate because true sparsity does not result, although it encourages low ℓ1 norms; nevertheless, true sparsity is a necessity in certain applications and also aids in some theoretical arguments. The most popular sparsifier is ξ(·) = λ‖·‖_1. Notably, ‖·‖_1 is the tightest convex lower bound of the ℓ0 "norm" |{i : x_i ≠ 0}|. From a stability perspective, the ℓ1 regularizer regrettably is not well-behaved in general. Indeed, from the lack of strict convexity it is evident that each x need not have a unique image under ϕ_D. It also is unclear how to analyze the class of mappings ϕ_D, parameterized by D, if the map changes drastically under small perturbations to D. Hence, we will begin by establishing sufficient conditions under which ϕ_D is stable under perturbations to D.

In this work, we analyze the ERM variant of (2) with Θ(w) = (1/r)‖w‖_2^2:

min_{D ∈ D, w ∈ W} (1/m) Σ_{i=1}^m φ(y_i, ⟨w, ϕ_D(x_i)⟩) + (1/r)‖w‖_2^2.    (3)

Since this objective is not convex, we will present bounds on the estimation error that hold uniformly over certain random subclasses of hypotheses.
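As a concrete illustration of the objects just defined, the following hedged sketch computes the ℓ1 encoder ϕ_D from (1) and evaluates the empirical objective (3) for a candidate (D, w). It assumes scikit-learn's Lasso, whose alpha parameter matches λ only after the rescaling noted in the comments, and the bounded clipped-hinge loss is merely one example choice of φ.

```python
# Sketch of the l1 encoder phi_D (eq. (1) with xi = lam*||.||_1) and objective (3).
# sklearn's Lasso minimizes (1/(2*d))*||x - D z||^2 + alpha*||z||_1,
# so alpha = lam / (2*d) reproduces (1) up to an overall constant.
import numpy as np
from sklearn.linear_model import Lasso

def encode(D, x, lam):
    d = D.shape[0]
    lasso = Lasso(alpha=lam / (2 * d), fit_intercept=False, max_iter=10000)
    lasso.fit(D, x)
    return lasso.coef_                                  # phi_D(x) in R^k

def empirical_objective(D, w, X, y, lam, r, loss):
    """Objective (3): average loss of <w, phi_D(x_i)> plus (1/r)*||w||^2."""
    codes = np.array([encode(D, x, lam) for x in X])
    preds = codes @ w
    return np.mean([loss(yi, pi) for yi, pi in zip(y, preds)]) + np.dot(w, w) / r

# Toy usage with a bounded, 1-Lipschitz loss (clipped hinge).
rng = np.random.default_rng(0)
d, k, m = 20, 40, 50
D = rng.normal(size=(d, k)); D /= np.linalg.norm(D, axis=0)      # atoms on the unit sphere
X = rng.normal(size=(m, d)); X /= np.linalg.norm(X, axis=1, keepdims=True)
y = rng.choice([-1, 1], size=m)
w = rng.normal(size=k)
hinge = lambda yi, p: min(max(0.0, 1.0 - yi * p), 1.0)           # values in [0, 1]
print(empirical_objective(D, w, X, y, lam=0.1, r=1.0, loss=hinge))
```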

3 Definitions, notation, and foundational results

Some definitions and notation will be useful. Let z be an iid random sample of m points, where each z_i = (x_i, y_i), x_i ∈ B_{R^d} and y_i ∈ Y. Let the space of targets, Y, be a bounded subset of R (regression) or {−1, 1} (binary classification). Let x′′ be a second, unlabeled iid sample of m points x′′_1, ..., x′′_m; this sample will be used only in the infinite-dimensional setting. Also, let z′ be an iid ghost sample of m points with the same distribution as z.

A predictive sparse coding hypothesis f is fully specified by a choice (D, w) ∈ D × W, yielding f(x) = ⟨w, ϕ_D(x)⟩. We often identify f using the notation f = (D, w). The function class F is the set of such hypotheses. When provided a training sample z, the hypothesis returned by the learner will be referred to as f̂. Note that f̂ is random, but f̂ becomes a fixed function upon conditioning on z. Throughout this work, φ : Y × R → [0, b] will be a bounded loss function (b > 0) that is L-Lipschitz in its second argument. For f ∈ F, define the loss-composed function f_φ : Y × R^d → R_+ by f_φ(y, x) := φ(y, f(x)). Let φ ∘ F be the class of such functions induced by the choice of F and φ. We use the notation

P f = E_{(x,y) ∼ P} f(x),    P f_φ = E_{(x,y) ∼ P} φ(y, f(x))

for the expectation operator P, and the notation

P_z f = (1/m) Σ_{i=1}^m f(x_i),    P_z f_φ = (1/m) Σ_{i=1}^m φ(y_i, f(x_i))

for the empirical measure P_z associated with the sample z.

Define lasso(λ, D, x) as the optimization problem lasso(λ, D, x) ≡ min_{α ∈ R^k} ‖x − Dα‖_2^2 + λ‖α‖_1; call the argmin ϕ_D(x) = ((ϕ_D(x))_1, ..., (ϕ_D(x))_k)^T (λ is fixed and hence withheld from the notation). Let R_+ be the set of positive reals and [n] := {1, ..., n} for n ∈ N. For s ∈ [k], any d ≥ s, and D ∈ D, define µ_s(D) as the minimum squared singular value, taken across all matrices formed by choosing s distinct columns of D. We often use the overloaded definition µ_s(f) := µ_s(D) where f = (D, w). The index set A is the set of non-zero indices of ϕ_D(x); with some abuse of notation, A always is defined in terms of the most recently used dictionary D and data point x. Index sets induce coordinate projections in the following way: if z ∈ R^k, then z_A = (z_{A_1}, ..., z_{A_{|A|}}). For t ∈ R^k, define supp(t) := {i ∈ [k] : t_i ≠ 0}. Finally, an epsilon-cover refers to the concept of an ε-cover but not a specific cover; all epsilon-covers of spaces of dictionaries use the metric induced by the operator norm ‖·‖_2.
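The quantities µ_s(D) and supp(·) defined above are directly computable for small problems; the sketch below is one hedged implementation (the exact µ_s enumerates all s-column subdictionaries, so it is only practical for small k and s).

```python
# Sketch: the s-incoherence mu_s(D), i.e. the minimum squared singular value over
# all s-column subdictionaries, and the support of a code.  Exact computation
# enumerates all C(k, s) subsets of columns.
from itertools import combinations
import numpy as np

def mu_s(D, s):
    k = D.shape[1]
    return min(np.linalg.svd(D[:, list(cols)], compute_uv=False)[-1] ** 2
               for cols in combinations(range(k), s))

def support(z, tol=1e-10):
    return np.flatnonzero(np.abs(z) > tol)      # index set A of non-zero coordinates

rng = np.random.default_rng(0)
D = rng.normal(size=(20, 8)); D /= np.linalg.norm(D, axis=0)
print("mu_2(D) =", mu_s(D, 2))
```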

3.1 Sparse coding stability

We begin with a fundamental forward stability result for the LASSO; we are not aware of any similar result in the literature.

Theorem 1 (Sparse Coding Stability). Let λ > 0, x ∈ B_{R^d}, and D, D̃ ∈ (B_{R^d})^k with µ_s(D), µ_s(D̃) ≥ µ for µ > 0. If there are τ > 0 and s ∈ [k] such that:

(i)  ‖ϕ_D(x)‖_0 ≤ s,    (4)
(ii)  |(ϕ_D(x))_j| > τ for all j ∈ A,    (5)
(iii)  |⟨D_j, x − Dϕ_D(x)⟩| < λ − τ for all j ∉ A;    (6)

and if

‖D − D̃‖_2 ≤ ε ≤ τµ / ( (s + µ)/λ + √s + µ ),    (7)

then

supp(ϕ_D(x)) = supp(ϕ_{D̃}(x))

and

‖ϕ_D(x) − ϕ_{D̃}(x)‖_2 ≤ (ε/µ)( √s/λ + 1 ).

Condition (6) means that atoms not used in the coding ϕ_D(x) cannot have too high an absolute correlation with the residual x − Dϕ_D(x). Note that the right-most term of (7) is the permissible radius of perturbation (PRP). In short, the theorem says that if lasso(λ, D, x) admits a stable sparse solution, then a small perturbation to the dictionary will not change the support of the solution, and the perturbation to the solution will be bounded by a constant factor times the size of the perturbation, where the constant depends on the condition number of a restricted problem, the amount of ℓ1-regularization, and the sparsity level.

Proof sketch: Our strategy is to show that there is a unique solution to the perturbed problem, defined in terms of the optimality conditions of the LASSO (see conditions L1 and L2 of Asif and Romberg (2010)), and that this solution has the same support as the solution to the original problem. As a result, the perturbed solution's proximity to the original solution is governed in part by the condition number µ of a linear system in s variables.
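The hypotheses (4)-(6) and the support-preservation conclusion of Theorem 1 can be checked numerically. The following hedged sketch does so for a random instance, reusing the hypothetical encode and support helpers introduced in the earlier sketches (an assumption, not part of the paper) and applying a perturbation intended to fall well inside the permissible radius for typical draws.

```python
# Sketch: verify conditions (4)-(6) for lasso(lam, D, x), perturb D by a small E,
# and check that supp(phi_D(x)) = supp(phi_{D~}(x)), as Theorem 1 predicts.
# Reuses encode() and support() from the earlier sketches.
import numpy as np

def stability_report(D, x, lam, tau, s):
    z = encode(D, x, lam)
    A = support(z)
    resid = x - D @ z
    corr = np.abs(D.T @ resid)
    cond_i = len(A) <= s                                    # (4): sparsity level
    cond_ii = np.all(np.abs(z[A]) > tau)                    # (5): coordinates clear of zero
    off = np.setdiff1d(np.arange(D.shape[1]), A)
    cond_iii = np.all(corr[off] < lam - tau)                # (6): unused atoms weakly correlated
    return z, A, (cond_i and cond_ii and cond_iii)

rng = np.random.default_rng(1)
d, k, lam, tau, s = 20, 30, 0.2, 0.01, 6
D = rng.normal(size=(d, k)); D /= np.linalg.norm(D, axis=0)
x = rng.normal(size=d); x /= np.linalg.norm(x)

z, A, stable = stability_report(D, x, lam, tau, s)
E = rng.normal(size=D.shape)
E *= 1e-4 / np.linalg.norm(E, 2)                            # small spectral-norm perturbation
z_tilde = encode(D + E, x, lam)
print("stable:", stable, "| same support:", np.array_equal(A, support(z_tilde)))
```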

3.2 Symmetrization by ghost sample for random subclasses

The next result is essentially due to Mendelson and Philips (2004); it applies symmetrization by a ghost sample to random subclasses. Our main departure is that we allow the random subclass to depend on a second, unlabeled sample x′′. The lemma will be applied to the overcomplete setting in the simpler form without x′′ and to the infinite-dimensional setting in its full form.

Lemma 1 (Symmetrization by Ghost Sample). Let F(z, x′′) ⊂ F be a random subclass which can depend on both a labeled sample z and an unlabeled sample x′′. If m ≥ b²/t², then

Pr_{z x′′}{ ∃f ∈ F(z, x′′) : P f_φ − P_z f_φ ≥ t } ≤ 2 Pr_{z z′ x′′}{ ∃f ∈ F(z, x′′) : P_{z′} f_φ − P_z f_φ ≥ t/2 }.

4 Overcomplete setting

Classically, the overcomplete setting is the modus operandi in sparse coding. Since k ≥ d in this setting, an ideal learning bound has minimal dependence on k. We derive learning bounds with a square-root dependence on k, below which the possibility of further improvement is open even in the reconstructive setting (to our knowledge, the only works to attack the reconstructive setting are Maurer and Pontil (2008) and Vainsencher et al. (2011)). At a high level, our strategy for the overcomplete-case learning bound is to construct an epsilon-cover over a subclass of the space of functions F := {f = (D, w) : D ∈ D, w ∈ W} and to show that the metric entropy of this subclass is of order dk. The main difficulty is that an epsilon-cover over D need not approximate F to any degree, unless one has a notion of encoder stability (which we describe in this section). Our analysis effectively will be concerned only with a training sample and a ghost sample; hence, in the style of the luckiness framework of Shawe-Taylor et al. (1998), should we observe that the sufficient conditions for encoder stability hold on the training sample, it is enough to guarantee that most points in a ghost sample also satisfy these conditions (possibly at a weaker level).

4.1 Useful conditions and subclasses

We first establish two important conditions. Letting f = (D, w) ∈ D × W, define

A_s(f, x) := { max_{x_i ∈ x} ‖ϕ_D(x_i)‖_0 ≤ s }   and   C_τ(f, x) := { max_{x_i ∈ x} max_{j ∈ [k]} ψ_j(D, x_i) < λ − τ },

where ψ_j(D, x) := |⟨D_j, x − Dϕ_D(x)⟩| − |(ϕ_D(x))_j|. The first condition is critical because the learning bound will exploit the sparsity level over the training sample; the second condition can be motivated as follows. Fix some x ∈ B_{R^d} and assume C_τ(f, x) holds. If j ∈ A, the LASSO optimality condition L1 of Asif and Romberg (2010) implies that |⟨D_j, x − Dϕ_D(x)⟩| = λ; since ψ_j(D, x) < λ − τ, it follows that condition (5) holds. If j ∉ A, then since (ϕ_D(x))_j = 0 the inequality ψ_j(D, x) < λ − τ is precisely condition (6). Hence, ψ_j(D, x) < λ − τ is the key encoder-stability inequality which will need to be enforced for each coordinate. Since the form of ψ_j is independent of j, all coordinates can be treated identically, independent of their membership in A.

For notational compactness, define Υ_{s,τ}(f, x) := A_s(f, x) ∧ C_τ(f, x) and

Υ_{η,s,τ}(f, x) := { ∄ x̃ ⊂ x : |x̃| > η ∧ ∀x ∈ x̃ ( ¬A_s(f, x) ∨ ¬C_τ(f, x) ) },

where ¬E denotes the negation of a boolean expression E.

Define the level-α coding margin τ_α(D, x) := max{τ′ > 0 : C_{ατ′}(D, x)} for α > 0 and the coding margin τ(D, x) := τ_1(D, x); clearly τ_α(D, x) = α^{−1} τ(D, x). For convenience we often make the first argument f rather than D, using the overloaded definition τ(f, x) := τ(D, x) where f = (D, w). Our bounds will require a crucial PRP-based condition that depends on both the learned dictionary and the training sample:

τ(D, x) ≥ ι(λ, µ, ε, s)   for   ι(λ, µ, ε, s) := 3ε( √s(λ + √s)/(λµ) + 1/λ + 1 ).

For brevity we will refer to ι with its parameters implicit; the dependence on ε, s, λ, and µ will be a non-issue because we first develop bounds with all these quantities fixed a priori. Finally, we will use the subclasses D_µ := {D ∈ D : µ_s(D) ≥ µ} and F_µ := {f = (D, w) ∈ F : D ∈ D_µ}, where µ > 0 is fixed.
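The sample-dependent quantities of this subsection are simple to compute once an encoder is available. The sketch below (again reusing the hypothetical encode and support helpers from the earlier sketches) evaluates ψ_j, the coding margin τ(D, x), and the conditions A_s and C_τ.

```python
# Sketch: psi_j(D, x), the coding margin tau(D, x) = lam - max_{i,j} psi_j(D, x_i),
# and the sample-level conditions A_s and C_tau.  Reuses encode() and support().
import numpy as np

def psi(D, x, lam):
    z = encode(D, x, lam)
    return np.abs(D.T @ (x - D @ z)) - np.abs(z)      # one value per atom j

def coding_margin(D, X, lam):
    worst = max(psi(D, x, lam).max() for x in X)
    return lam - worst                                 # largest tau with C_tau(D, X) true

def A_s(D, X, lam, s):
    return all(len(support(encode(D, x, lam))) <= s for x in X)

def C_tau(D, X, lam, tau):
    return coding_margin(D, X, lam) > tau
```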

4.2 Learning bound

The following proposition is a specialization of Lemma 1 with x′′ taken to be the empty set and the random subclass defined as F(z, x′′) := {f ∈ F_µ : Υ_{s,ι}(f, x)}.

Proposition 1. If m ≥ b²/t², then

Pr_z{ ∃f ∈ F_µ : Υ_{s,ι}(f, x) ∧ (P f_φ − P_z f_φ > t) } ≤ 2 Pr_{z z′}{ ∃f ∈ F_µ : Υ_{s,ι}(f, x) ∧ (P_{z′} f_φ − P_z f_φ > t/2) }.

We refer to the event inside the probability on the right-hand side as J.

Define Z as the event that there exists a hypothesis with stable codings on the original sample but more than η(m, d, k, ε, δ) points with unstable codings on the ghost sample:

Z := { z z′ : ∃f ∈ F_µ, Υ_{s,ι}(f, x) ∧ ( ∃x̃ ⊂ x′ : |x̃| > η(m, d, k, ε, δ) ∧ ∀x ∈ x̃, ¬Υ_{s,τ_3(f,x)}(f, x) ) }.

We will show that Pr(J) is small by using the fact that

Pr(J) = Pr(J ∩ Z̄) + Pr(J ∩ Z) ≤ Pr(J ∩ Z̄) + Pr(Z).

So far, our strategy is similar to the beginning of Shawe-Taylor et al.'s proof of the main luckiness-framework learning bound (see Shawe-Taylor et al., 1998, Theorem 5.22). We show that Pr(Z) and Pr(J ∩ Z̄) are small in turn and then present the main learning bound. The next lemma shadows Shawe-Taylor et al. (1998)'s notion of probable smoothness and vanquishes the key difficulty in bounding Pr(Z).


Lemma 2 (Good Ghost). Fix µ, λ > 0 and s ∈ [k]. Let x ∼ P^m and x′ ∼ P^m be iid samples of m points each. Choose any D ∈ D_µ for which A_s(D, x) holds. With probability at least 1 − δ, at least m − η(m, d, k, ε, δ) points x̃ ⊂ x′ satisfy A_s(D, x̃) and C_{τ_3(D,x)}(D, x̃), for

η(m, d, k, ε, δ) := 2dk log( 96s/(λµτ(D, x)) ) + log(2m + 1) + 2 log(2/δ).

Proof sketch: The core idea of the proof is to show that if D satisfies encoder stability at level s and coding margin τ, then there exists a representative D′ in an ε-cover of D_µ, composed of balls with radius linear in τ, which satisfies encoder stability at level s and a coding margin reduced by a constant factor. Next, a standard permutation argument (Vapnik and Chervonenkis, 1968, Proof of Theorem 2), along with a VC dimension argument to entertain all choices of the coding margin τ, guarantees with high probability that the representative D′ also satisfies the same stability properties on most of the ghost sample. Finally, we once again jump from D′ to D to guarantee that, on this good part of the ghost sample where D′ satisfies s-sparsity and a slightly reduced coding margin, D also satisfies s-sparsity and a further reduced coding margin.

It remains to bound Pr(J ∩ Z̄). We use the shorthand η = η(m, d, k, ε, δ) for conciseness.

Lemma 3 (Large Deviation on Good Ghost). Let ϖ := t/2 − (2Lβ + bη/m) and β := ε( (r/µ)(√s/λ + 1) + 1/λ ). Then

Pr(J ∩ Z̄) ≤ ( 8(r/2)^{1/(d+1)} / ε )^{(d+1)k} exp(−mϖ²/(2b²)).

Equivalently, the difference between the loss on z and the loss on z′ is greater than ϖ + 2Lβ + bη/m with probability at most ( 8(r/2)^{1/(d+1)} / ε )^{(d+1)k} exp(−mϖ²/(2b²)).

Proof sketch: From the definition of J ∩ Z̄, the entire training sample and all but η points of the ghost sample are "good" (i.e., have stable codings). Thus, there exists a representative f′ in a product of ε-covers of D and W which closely approximates (with error at most 2Lβ, from the Sparse Coding Stability Theorem (Theorem 1)) the function evaluations at all good points and suffers error at most bη/m on the bad points. Hence, for there to exist a large deviation t′ between the training and ghost samples, there would need to be a deviation of at least t′ − (2Lβ + bη/m) for at least one hypothesis in the product of ε-covers. The result follows from Hoeffding's inequality and the union bound.

A learning bound is within reach. Let ε = 1/m; then, after some elementary manipulations:

Theorem 2. If m > (3/τ)( √s(λ + √s)/(λµ) + 1/λ + 1 ), then for all f ∈ F such that µ_s(f) ≥ µ, A_s(f, x), and C_ι(f, x), with probability at least 1 − δ over z:

(P − P_z)f_φ ≤ 2b √( 2( (d+1)k log(8m) + k log(r/2) + log(4/δ) ) / m )
 + (4L/m)( (r/µ)(√s/λ + 1) + 1/λ )
 + (2b/m)( 2dk log( 96s/(λµτ(f, x)) ) + log(2m + 1) + 2 log(8/δ) ).

Now, we construct a bound for each choice of s and µ. To each choice of s ∈ [k] assign prior probability 1/k. To each choice of i ∈ N ∪ {0} with 2^{−i} ≤ µ assign prior probability proportional to (i + 1)^{−2}. For a given choice of s ∈ [k] and 2^{−i} ≤ µ we use δ(s, i) := (6/π²) (1/(i+1)²) (1/k) δ. Note that Σ_{s=1}^k Σ_{i=0}^∞ δ(s, i) = δ since Σ_{i=1}^∞ 1/i² = π²/6.

Define ν(µ) := min{1, 2^{⌊log₂ µ⌋}} for µ > 0. The following corollary is now proved.


Corollary 1 (Overcomplete Learning Bound). For any s ∈ [k], for all f ∈ F such that A_s(f, x), C_ι(f, x), and m > (3/τ(f, x))( √s(λ + √s)/(λ ν(µ_s(f))) + 1/λ + 1 ), with probability at least 1 − δ over z:

(P − P_z)f_φ ≤ 2b √( 2( (d+1)k log(8m) + k log(r/2) + log( 2π² (log₂(2/ν(µ_s(f))))² k / δ ) ) / m )
 + (4L/m)( (r/ν(µ_s(f)))(√s/λ + 1) + 1/λ )
 + (2b/m)( 2dk log( 96s/(λ ν(µ_s(f)) τ(f, x)) ) + log(2m + 1) + 2 log( 4π² (log₂(2/ν(µ_s(f))))² k / δ ) ).

5 Infinite-dimensional setting

Often in learning problems, we first map the data implicitly to a space of very high or even infinite dimension and use kernels for efficient computations. In these cases, where d ≫ k or d is infinite, it is unacceptable for any learning bound to exhibit dependence on d. Unfortunately, the strategy of the previous section breaks down in the infinite-dimensional setting because the straightforward construction of an epsilon-cover over the space of dictionaries has cardinality that depends on d. Even worse, epsilon-covers actually were used both to approximate the function class F in ‖·‖_∞ norm and to guarantee that most points of the ghost sample are good provided that all points of the training sample were good (the Good Ghost Lemma (Lemma 2)). These issues can be overcome by requiring an additional, unlabeled sample (a device often justified in supervised learning problems because unlabeled data may be inexpensive and yet quite helpful) and by switching to more sophisticated techniques based on conditional Rademacher and Gaussian averages.

After learning a hypothesis f̂ from a predictive sparse coding algorithm, the sparsity level and coding margin are measured on a second, unlabeled sample x′′ of m points (the cardinality matches the size of the training sample z purely for simplicity). Since this sample is independent of the choice of f̂, it is possible to guarantee that all but a very small fraction (η/m, for η = 2 log(2/δ)) of the points of a ghost sample z′ are good with probability 1 − δ. In the likely case of this good event, and for a fixed sample, we then consider all possible choices of a set of η bad indices in the ghost sample; each of the (m choose η) cases corresponds to a subclass of functions. We then approximate each subclass by a special ε-cover that is a disjoint union of a finite number of special subclasses; for each of these smaller subclasses, we bound the conditional Rademacher average by exploiting a sparsity property.

5.1 Symmetrization and decomposition

Let µ∗ ∈ R²₊ and define F_{µ∗} := {f ∈ F : (µ_s(f) ≥ µ∗_s) ∧ (µ_{2s}(f) ≥ µ∗_{2s})}. The next result is immediate from Lemma 1 with F(z, x′′) := {f̂} ∩ {f ∈ F_{µ∗} : Υ_{s,τ}(f, x) ∧ Υ_{s,τ}(f, x′′)}.

Proposition 2. If m ≥ b²/t², then

Pr_{z x′′}{ f̂ ∈ F_{µ∗} ∧ Υ_{s,τ}(f̂, x) ∧ Υ_{s,τ}(f̂, x′′) ∧ (P f̂_φ − P_z f̂_φ ≥ t) }    (8)
 ≤ 2 Pr_{z z′ x′′}{ f̂ ∈ F_{µ∗} ∧ Υ_{s,τ}(f̂, x) ∧ Υ_{s,τ}(f̂, x′′) ∧ (P_{z′} f̂_φ − P_z f̂_φ ≥ t/2) }.


Now, observe that the probability of interest can be split into the probability of a large deviation happening under a "good" event and the probability of a "bad" event occurring:

Pr_{z z′ x′′}{ f̂ ∈ F_{µ∗} ∧ Υ_{s,τ}(f̂, x) ∧ Υ_{s,τ}(f̂, x′′) ∧ (P_{z′} f̂_φ − P_z f̂_φ ≥ t/2) }
 = Pr_{z z′ x′′}{ f̂ ∈ F_{µ∗} ∧ Υ_{s,τ}(f̂, x) ∧ Υ_{s,τ}(f̂, x′′) ∧ Υ_{η,s,τ}(f̂, x′) ∧ (P_{z′} f̂_φ − P_z f̂_φ ≥ t/2) }
 + Pr_{z z′ x′′}{ f̂ ∈ F_{µ∗} ∧ Υ_{s,τ}(f̂, x) ∧ Υ_{s,τ}(f̂, x′′) ∧ ¬Υ_{η,s,τ}(f̂, x′) ∧ (P_{z′} f̂_φ − P_z f̂_φ ≥ t/2) }
 ≤ Pr_{z z′}{ ∃f ∈ F_{µ∗} : Υ_{s,τ}(f, x) ∧ Υ_{η,s,τ}(f, x′) ∧ (P_{z′} f_φ − P_z f_φ ≥ t/2) }
 + Pr_{x′ x′′}{ f̂ ∈ F_{µ∗} ∧ Υ_{s,τ}(f̂, x′′) ∧ ¬Υ_{η,s,τ}(f̂, x′) }.

We treat the first probability in the next subsection. To bound the second probability, note that for each choice of x, f̂ is a fixed function. Hence, it is sufficient to select η such that, for any fixed function f ∈ F:

Pr_{x′ x′′}{ Υ_{s,τ}(f, x′′) ∧ ¬Υ_{η,s,τ}(f, x′) } ≤ δ.

Lemma 4 (Unlikely Bad Ghost). Let f ∈ F be fixed. If η = 2 log(2/δ), then

Pr_{x′ x′′}{ Υ_{s,τ}(f, x′′) ∧ ¬Υ_{η,s,τ}(f, x′) } ≤ δ.

Proof sketch: The proof uses the same standard permutation argument as in the proof of the Good Ghost Lemma (Lemma 2).

5.2 Rademacher bound in the case of the good event

We now bound the probability of a large deviation in the (likely) case of the good event. For readability, let F_{µ∗,Υ}(x) be shorthand for {f ∈ F_{µ∗} : Υ_{s,τ}(f, x)}. Likewise, let F_{µ∗,Υ_η}(x) be shorthand for {f ∈ F_{µ∗} : Υ_{η,s,τ}(f, x)}. Let σ_1, ..., σ_m be independent Rademacher random variables distributed uniformly on {−1, 1}.

Lemma 5 (Symmetrization by Random Signs).

Pr_{z z′}{ ∃f ∈ F_{µ∗} : Υ_{s,τ}(f, x) ∧ Υ_{η,s,τ}(f, x′) ∧ (P_{z′} f_φ − P_z f_φ ≥ t/2) }
 ≤ Pr_{z,σ}{ sup_{f ∈ F_{µ∗,Υ}(x)} (1/m) Σ_{i=1}^m σ_i φ(y_i, f(x_i)) ≥ t/4 } + Pr_{z,σ}{ sup_{f ∈ F_{µ∗,Υ_η}(x)} (1/m) Σ_{i=1}^m σ_i φ(y_i, f(x_i)) ≥ t/4 }.

Proof sketch: The proof is a standard application of symmetrization by random signs.

Note that F_{µ∗,Υ}(x) is just F_{µ∗,Υ_0}(x), so we need only bound the second term of the last line above for arbitrary η ∈ [m]. Before bounding this quantity, which for fixed x is a conditional Rademacher average (see Appendix D.1 for the fundamentals of Rademacher and Gaussian averages), we need to establish a few results on the Gaussian average of a related function class.

First, note that any D ∈ D can be decomposed into a product of two matrices as D = US, where U ∈ U ⊂ R^{d×k} satisfies the isometry property U^T U = I and S ∈ S := (B_{R^k})^k (Maurer and Pontil, 2008). Fix S, a linear hypothesis w ∈ W, and a sample x of m points. The subclass of interest will be those functions corresponding to U ∈ U such that ‖ϕ_{US}(x)‖_0 ≤ s. It turns out that the Gaussian average of this subclass is well-behaved.
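The factorization D = US used here can be produced explicitly. The following is a hedged sketch of one valid construction (take U to be an orthonormal basis whose span contains the columns of D, which requires d ≥ k); it is not necessarily the construction used by Maurer and Pontil (2008).

```python
# Sketch: factor D = U S with U^T U = I (d x k left isometry) and columns of S
# in the unit ball.  If col(D) is contained in col(U), then U U^T D = D and
# ||S_j|| = ||D_j|| <= 1.  Assumes d >= k.
import numpy as np

def us_factorization(D):
    U, _ = np.linalg.qr(D)      # reduced QR: U has orthonormal columns spanning col(D)
    S = U.T @ D
    return U, S

rng = np.random.default_rng(0)
D = rng.normal(size=(50, 10)); D /= np.linalg.norm(D, axis=0)
U, S = us_factorization(D)
print(np.allclose(U.T @ U, np.eye(10)), np.allclose(U @ S, D))
```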


Theorem 3 (Gaussian Average for Fixed S and w). Let S ∈ S, s ∈ [k], and x be a fixed m-sample. Define a particular subclass of U as U_x := {U ∈ U : A_s(US, x)}. Then

E_γ sup_{U ∈ U_x} (2/m) Σ_{i=1}^m γ_i ⟨w, ϕ_{US}(x_i)⟩ ≤ 4rk√(2s) / ( µ_{2s}(S) √m ).    (9)

The proof of this result uses the following lemma, which shows how the difference between the feature maps ϕ_{US} and ϕ_{U′S} is controlled by the difference between U and U′. Define the s-restricted 2-norm of S as ‖S‖_{2,s} := sup{ ‖St‖_2 : t ∈ R^k, ‖t‖ = 1, |supp(t)| ≤ s }.

Lemma 6 (Difference Bound for Isometries). Let x ∈ B_{R^d}. If ‖ϕ_{US}(x)‖_0 ≤ s and ‖ϕ_{U′S}(x)‖_0 ≤ s, then

‖ϕ_{US}(x) − ϕ_{U′S}(x)‖ ≤ ( 2‖S‖_{2,2s} / µ_{2s}(S) ) ‖(U′^T − U^T)x‖_2.

Proof sketch: The proof uses a perturbation analysis of solutions to linearly constrained positive definite quadratic programs (Daniel, 1973), exploiting the sparsity of the optimal solutions so that the bound depends only on ‖S‖_{2,2s} and µ_{2s}(S) rather than ‖S‖_2 and µ_k(S).

Proof sketch (of Theorem 3): The proof controls the Gaussian average of the function class indexed by U_x, which is nonlinear in U, by the Gaussian average of a second function class that is linear in U. Lemma 6 is critical in establishing a link between the variances of the Gaussian processes induced by the two function classes, after which Slepian's Lemma (Lemma 9) relates the corresponding Gaussian averages. By exploiting the isometry property of U_x, the Gaussian average of the second function class admits a dimension-free bound.

The palette for the pivotal art of this section is established.

Theorem 4 (Rademacher Average of Mostly Good Random Subclasses).

Pr_{z,σ}{ sup_{f ∈ F_{µ∗,Υ_η}(x)} (1/m) Σ_{i=1}^m σ_i φ(y_i, f(x_i)) ≥ t/4 } ≤ (m choose η) ( 8(r/2)^{1/(k+1)} / ε )^{(k+1)k} exp(−m t_3²/(2b²)),

for

t_3 := t/4 − Lε( (r/µ∗_s)(√s/λ + 1) + 1/λ ) − 2L√π r k √s / ( µ∗_{2s} √m ) − 2bη/m.

Proof sketch: We decompose the space of dictionaries D via U and S and split this function class into N = (m choose η) subclasses, each of which has the property that all functions in the class share a common set of m − η good indices. We then approximate each subclass by a product of ε-covers of W and (a suitably incoherent subset of) S. The error from this approximation is low due to the Sparse Coding Stability Theorem (Theorem 1). Fixing a subclass and a representative from each of the two ε-covers induces a smaller subclass, and there is an index set of at least m − η points that are sparsely coded for all encoders in this smaller subclass. Theorem 3 then essentially implies a bound on the Rademacher complexity of each smaller subclass. The union bound over [N] and the two ε-covers finishes the result.

For the case of η = 0, define

t_2 := t/4 − Lε( (r/µ∗_s)(√s/λ + 1) + 1/λ ) − 2L√π r k √s / ( µ∗_{2s} √m ).

Since F_{µ∗,Υ}(x) is equivalent to F_{µ∗,Υ_0}(x), Lemma 5 and Theorem 4 imply that

Pr_{z z′}{ ∃f ∈ F_{µ∗} : Υ_{s,τ}(f, x) ∧ Υ_{η,s,τ}(f, x′) ∧ (P_{z′} f_φ − P_z f_φ ≥ t/2) }
 ≤ ( 8(r/2)^{1/(k+1)} / ε )^{(k+1)k} exp(−mt_2²/(2b²)) + (m choose η) ( 8(r/2)^{1/(k+1)} / ε )^{(k+1)k} exp(−mt_3²/(2b²))
 ≤ 2 (m choose η) ( 8(r/2)^{1/(k+1)} / ε )^{(k+1)k} exp(−mt_3²/(2b²)).

Finally, the full probability (8) can be upper bounded (using η = 2 log(2/δ)) as:

Pr_{z x′′}{ f̂ ∈ F_{µ∗} ∧ Υ_{s,τ}(f̂, x) ∧ Υ_{s,τ}(f̂, x′′) ∧ (P f̂_φ − P_z f̂_φ ≥ t) }
 ≤ 4 (m choose 2 log(2/δ)) ( 8(r/2)^{1/(k+1)} / ε )^{(k+1)k} exp(−mt_3²/(2b²)) + 2δ.

After some elementary manipulations and choosing ε = 1/m, we nearly have the final learning bound. Let

ι′(λ, µ, m, s) := (3/m)( √s(λ + √s)/(λµ) + 1/λ + 1 ).

Theorem 5. Let µ∗_s, µ∗_{2s} > 0, s ∈ [k], and τ ≥ ι′(λ, µ∗_s, m, s). Suppose an algorithm is trained on a labeled sample z of m points and learns hypothesis f̂. Let x′′ be a second, unlabeled sample of m points. Suppose that µ_{2s}(f̂) ≥ µ∗_{2s}, µ_s(f̂) ≥ µ∗_s, A_s(f̂, x), A_s(f̂, x′′), C_τ(f̂, x), and C_τ(f̂, x′′) all hold. Then with probability at least 1 − δ over z and x′′:

(P − P_z)f̂_φ ≤ 2L√π r k √s / ( µ∗_{2s} √m ) + b √( 2( (k+1)k log(8m) + k log(r/2) + (2 log m + 3) log(16/δ) ) / m )
 + (L/m)( (r/µ∗_s)(√s/λ + 1) + 1/λ ) + (4b/m) log(16/δ).

A bound adaptive to the sparsity level and coding margin on z and x′′ is now immediate. Recall that ν(µ) = min{1, 2^{⌊log₂ µ⌋}} for µ > 0.

Corollary 2 (Infinite-Dimensional Learning Bound). Suppose an algorithm is trained on a labeled sample z of m points and learns hypothesis f̂. Let x′′ be a second, unlabeled sample of m points. Choose

s = max{ s′ ∈ [k] : A_{s′}(f̂, x) ∧ A_{s′}(f̂, x′′) }   and   τ = min{ τ′ ∈ R_+ : C_{τ′}(f̂, x) ∧ C_{τ′}(f̂, x′′) }.

Let µ̂_s := ν(µ_s(f̂)) and µ̂_{2s} := ν(µ_{2s}(f̂)). If τ ≥ ι′(λ, µ_s(f̂), m, s) and µ_{2s}(f̂) > 0, then with probability at least 1 − δ over z and x′′:

(P − P_z)f̂_φ ≤ 2L√π r k √s / ( µ̂_{2s} √m ) + b √( 2( (k² + k) log(8m) + k log(r/2) + (2 log m + 3) log( 44(log₂(µ̂_s) log₂(µ̂_{2s}))² k / δ ) ) / m )
 + (L/m)( (r/µ̂_s)(√s/λ + 1) + 1/λ ) + (4b/m) log( 44(log₂(µ̂_s) log₂(µ̂_{2s}))² k / δ ).

6 Discussion

We have shown the first generalization error bounds for predictive sparse coding. The results highlight the central role of the stability of the sparse encoder. The presented bounds are data-dependent and exploit properties relating to the training sample and the learned hypothesis. The learning bound for the overcomplete setting exhibits square-root dependence on both the ambient dimension d and the size of the dictionary k. In the infinite-dimensional setting, a dimension-free learning bound has linear dependence on k, square-root dependence on s, and inverse dependence on the 2s-incoherence.

Maurer and Pontil (2008) previously showed the following generalization bound for unsupervised (ℓ1-regularized) sparse coding:

Pr_x{ sup_{D ∈ D} P f_D − P_x f_D ≥ (k/√m)( 20/λ + (1/2)√(log(16m/λ²)) ) + √( log(1/δ)/(2m) ) } ≤ δ,


where f_D(x) := min_{z ∈ R^k} ‖x − Dz‖_2^2 + λ‖z‖_1. Comparing their result to Corollary 2 and neglecting regularization parameters, the dimension-free bound in the predictive case is larger by a factor of √s. It is unclear whether the √s term is avoidable in the predictive setting; at least from our analysis, it appears that the √s factor is the price of stability.

The data-dependent stability conditions under which our bounds hold may seem restrictive. In the case where the coding margin over the training sample is not uniformly large, or the sparsity level over the training sample is not uniformly small, bounds based on unlikely large deviations from the expectations of these quantities can be established. We did not pursue such bounds here because, in predictive sparse coding, training samples empirically do exhibit a low sparsity level and high s-incoherence. Also, the analysis becomes considerably more involved, although we anticipate the resulting learning bounds to be of similar order.

If one entertains a mixture of ℓ1 and ℓ2 norm regularization as in the elastic net (Zou and Hastie, 2005), fall-back guarantees are possible in both scenarios. A considerably simpler, data-independent analysis is possible in the overcomplete setting, with a final bound that essentially just trades µ_s(D) for the ℓ2-norm regularization parameter λ_2. In the infinite-dimensional setting, a simpler non-data-dependent analysis using our approach would only attain a bound of the larger order k^{3/2}/(λ_2 √m). In conclusion, the presented data-dependent bounds provide motivation for an algorithm to prefer dictionaries whose small subdictionaries are well-conditioned and to additionally encourage a large coding margin on the training sample.
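For concreteness, the elastic-net fall-back discussed above corresponds to replacing the ℓ1 encoder by a mixed ℓ1/ℓ2 encoder. The sketch below does this with scikit-learn's ElasticNet; the mapping from (λ, λ_2) to alpha and l1_ratio is our own rescaling to scikit-learn's parameterization, not something specified in the paper.

```python
# Sketch: elastic-net encoder, i.e. xi(z) = lam1*||z||_1 + lam2*||z||_2^2 in (1).
# sklearn's ElasticNet minimizes
#   (1/(2n))||x - D z||^2 + alpha*l1_ratio*||z||_1 + 0.5*alpha*(1 - l1_ratio)*||z||_2^2,
# so we solve for alpha and l1_ratio that reproduce the target penalties.
import numpy as np
from sklearn.linear_model import ElasticNet

def encode_elastic_net(D, x, lam1, lam2):
    n = D.shape[0]
    a = lam1 / (2 * n)          # target value of alpha * l1_ratio
    b = lam2 / n                # target value of alpha * (1 - l1_ratio)
    alpha, l1_ratio = a + b, a / (a + b)
    enet = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, fit_intercept=False, max_iter=10000)
    enet.fit(D, x)
    return enet.coef_
```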

References

M. Salman Asif and Justin Romberg. On the LASSO and Dantzig selector equivalence. In 44th Annual Conference on Information Sciences and Systems (CISS), pages 1-6. IEEE, 2010.

David M. Bradley and J. Andrew Bagnell. Differentiable sparse coding. Advances in Neural Information Processing Systems, 21:113-120, 2009.

Felipe Cucker and Steve Smale. On the mathematical foundations of learning. Bulletin of the American Mathematical Society, 39:1-49, 2002.

James W. Daniel. Stability of the solution of definite quadratic programs. Mathematical Programming, 5(1):41-53, 1973.

R. Herbrich and R. C. Williamson. Algorithmic luckiness. Journal of Machine Learning Research, 3:175-212, 2002.

Sham M. Kakade, Karthik Sridharan, and Ambuj Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Advances in Neural Information Processing Systems 21, pages 793-800. MIT Press, 2009.

Michel Ledoux and Michel Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Springer, 1991.

Julien Mairal, Francis Bach, Jean Ponce, Guillermo Sapiro, and Andrew Zisserman. Discriminative learned dictionaries for local image analysis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1-8. IEEE, 2008.

Julien Mairal, Francis Bach, Jean Ponce, Guillermo Sapiro, and Andrew Zisserman. Supervised dictionary learning. In Advances in Neural Information Processing Systems 21, pages 1033-1040. MIT Press, 2009.

Julien Mairal, Francis Bach, and Jean Ponce. Task-driven dictionary learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2010.

Andreas Maurer and Massimiliano Pontil. Generalization bounds for K-dimensional coding schemes in Hilbert spaces. In Algorithmic Learning Theory, pages 79-91. Springer, 2008.

Colin McDiarmid. On the method of bounded differences. Surveys in Combinatorics, 141(1):148-188, 1989.

Ron Meir and Tong Zhang. Generalization error bounds for Bayesian mixture algorithms. Journal of Machine Learning Research, 4:839-860, 2003.

Shahar Mendelson and Petra Philips. On the importance of small coordinate projections. Journal of Machine Learning Research, 5:219-238, 2004.

Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.

John Shawe-Taylor, Peter L. Bartlett, Robert C. Williamson, and Martin Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5):1926-1940, 1998.

David Slepian. The one-sided barrier problem for Gaussian noise. Bell System Technical Journal, 41(2):463-501, 1962.

Ingo Steinwart and Andreas Christmann. Support Vector Machines. Springer, 2008.

Daniel Vainsencher, Shie Mannor, and Alfred M. Bruckstein. The sample complexity of dictionary learning. In Conference on Learning Theory, 2011.

Vladimir Vapnik and Alexey Chervonenkis. Uniform convergence of frequencies of occurrence of events to their probabilities. Doklady Akademii Nauk SSSR, 181:915-918, 1968.

M. Vidyasagar. Learning and Generalization with Applications to Neural Networks. Springer, London, 2002.

Kai Yu, Tong Zhang, and Yihong Gong. Nonlinear learning using local coordinate coding. In Advances in Neural Information Processing Systems 22, pages 2223-2231. MIT Press, 2009.

Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301-320, 2005.

A Proof of Sparse Coding Stability Theorem

Proof (of Theorem 1): Let α be equal to (ϕ_D(x))_A, and define the scaled sign vector ζ := λ sign(α). Our strategy will be to show that, for some Δ ∈ R^s, the optimal perturbed solution ϕ_{D̃}(x) satisfies (ϕ_{D̃}(x))_A = α + Δ and (ϕ_{D̃}(x))_{A^c} = 0, where A^c := [k] \ A. From the optimality conditions for the LASSO (see, e.g., optimality conditions L1 and L2 of Asif and Romberg (2010)), it is sufficient to find Δ such that

⟨D̃_j, x − D̃_A(α + Δ)⟩ = ζ_j   if j ∈ A,
|⟨D̃_j, x − D̃_A(α + Δ)⟩| < λ   otherwise.

We proceed by setting up the linear system and characterizing the solution vector Δ:

D̃_A^T( x − D̃_A(α + Δ) ) = ζ   ⟹   Δ = (D̃_A^T D̃_A)^{−1}( D̃_A^T(x − D̃_A α) − ζ ).

Since D̃ = D + E for ‖E‖_2 ≤ ε,

D̃_A^T(x − D̃_A α) = (D_A + E_A)^T( x − (D_A + E_A)α )
 = D_A^T(x − D_A α) − D_A^T E_A α + E_A^T( x − (D_A + E_A)α )
 = ζ − D_A^T E_A α + E_A^T( x − (D_A + E_A)α ),

and so the solution for Δ can be reformulated as

Δ = (D̃_A^T D̃_A)^{−1}( −D_A^T E_A α + E_A^T( x − (D_A + E_A)α ) ).

Now,

‖Δ‖_2 ≤ ‖(D̃_A^T D̃_A)^{−1}‖_2 ( ‖D_A^T E_A α‖_2 + ‖E_A^T( x − (D_A + E_A)α )‖_2 )
 ≤ (1/µ)( ε√s/λ + ε )
 = (ε/µ)( √s/λ + 1 ).

For y ∈ R^s, let y_{k×1} be the extension to R^k satisfying (y_{k×1})_A = y and (y_{k×1})_{A^c} = 0. For (α + Δ)_{k×1} to be optimal for lasso(λ, D̃, x), (α + Δ)_{k×1} must satisfy the two optimality conditions, and Δ must be small enough that sign consistency holds between α and α + Δ (i.e., sign(α_j) = sign(α_j + Δ_j) for all j ∈ [s]).

We first check the optimality conditions. The first optimality condition is equivalent to

⟨D̃_j, x − D̃_A(α + Δ)⟩ = ζ_j   for j ∈ A;

this condition is satisfied by construction. The second optimality condition is equivalent to

|⟨D̃_j, x − D̃_A(α + Δ)⟩| < λ   for j ∉ A.

But for j ∉ A,

|⟨D̃_j, x − D̃_A(α + Δ)⟩| = |⟨D_j + E_j, x − (D_A + E_A)(α + Δ)⟩|
 = |⟨D_j, x − D_A α⟩ − ⟨D_j, D_A Δ⟩ + ⟨E_j, x − (D_A + E_A)(α + Δ)⟩ − ⟨D_j, E_A(α + Δ)⟩|
 < (λ − τ) + (√s ε/µ)( √s/λ + 1 ) + ε + ε/λ.

Sign consistency holds provided ‖Δ‖_∞ ≤ ‖Δ‖_2 ≤ (ε/µ)(√s/λ + 1) ≤ τ, since each |α_j| > τ by condition (5). All the above constraints are satisfied if τ satisfies

ε( √s(λ + √s)/(λµ) + 1/λ + 1 ) ≤ τ,

which is equivalent to ε being at most the permissible radius of perturbation in (7).

B Proof of Symmetrization by Ghost Sample Lemma

Proof (of Lemma 1): Replace F(σ_n) from the notation of Mendelson and Philips (2004) with F(z, x′′). A modified one-sided version of (Mendelson and Philips, 2004, Lemma 2.2) that uses the more favorable Chebyshev-Cantelli inequality implies that, for every t > 0:

( 1 − 4 sup_{f∈F} Var(φ∘f) / ( 4 sup_{f∈F} Var(φ∘f) + mt² ) ) Pr_{z x′′}{ ∃f ∈ F(z, x′′) : P f_φ − P_z f_φ ≥ t }
 ≤ Pr_{z z′ x′′}{ ∃f ∈ F(z, x′′) : P_{z′} f_φ − P_z f_φ ≥ t/2 }.

Furthermore, because we assume a bounded loss function with losses in [0, b], it follows that sup_{f∈F} Var(φ∘f) ≤ b²/4. The lemma follows since the left-hand factor above is at least 1/2 whenever m ≥ b²/t².

C Proofs for overcomplete setting

Proof (of the Good Ghost Lemma (Lemma 2)): By the assumptions of the lemma, we consider D satisfying µ_s(D) ≥ µ and A_s(D, x). We want to guarantee with high probability that for D all but η points of the ghost sample (1) are coded at sparsity level s, (2) have encodings whose non-zero-valued coordinates have absolute value greater than τ_3(D, x), and (3) have encodings whose zero-valued coordinates have absolute correlation of the corresponding atom with the residual (i.e., |⟨D_i, x_i − Dϕ_D(x_i)⟩|) less than λ − τ_3(D, x). The latter two inequality conditions are the raisons d'être of ψ. Consider the class of threshold functions F_D^{stab} := {f_{D,τ}^{stab} : τ ∈ R_+} defined by

f_{D,τ}^{stab}(x) := 1 if max_{j∈[k]} ψ_j(D, x) < λ − τ, and 0 otherwise.

Since the VC dimension of one-dimensional threshold functions is 1, it follows that VC(F_D^{stab}) = 1.

Consider a minimum-cardinality proper ( τ_3(D, x)µ / ((s + µ)/λ + √s + µ) )-cover D_0 of D_µ. Let D′ be a candidate element of D_0 satisfying ‖D − D′‖_2 ≤ τ_3(D, x)µ / ((s + µ)/λ + √s + µ). Then the Sparse Coding Stability Theorem (Theorem 1) implies A_s(D′, x) and C_{τ_{3/2}(D,x)}(D′, x).

First, we ensure that most points from the ghost sample satisfy max_{j∈[k]} ψ_j(D′, ·) < λ − τ_{3/2}(D, x). By using the VC dimension of F_{D′}^{stab} and the standard permutation argument of (Vapnik and Chervonenkis, 1968, Proof of Theorem 2), it follows that for a single, fixed element of D_0, with probability at least 1 − δ at most log(2m + 1) + log(1/δ) points from a ghost sample will violate the inequality. Hence, by the bound on proper covering numbers provided by Proposition 3 (see Appendix E), we can guarantee for all candidate members D′ ∈ D_0 that with probability at least 1 − δ/2 at most

dk log( 8( (s + µ)/λ + √s + µ ) / (τ_3(D, x)µ) ) + log(2m + 1) + log(2/δ) ≤ dk log( 96s/(λµτ(D, x)) ) + log(2m + 1) + log(2/δ)

points from the ghost sample violate the inequality (in the above, we use the generally loose bound µ ≤ s and assume without loss of generality that λ ≤ 1; if λ > 1, then all sparse codes equal the zero vector).


It also is necessary to guarantee with high probability that most points of the ghost sample are coded sparsely. Since s is fixed, for a single element of D_0 we need only bound the probability that the single function

f_{D,s}^{spar}(x) := 1 if ‖ϕ_D(x)‖_0 ≤ s, and 0 otherwise

takes the value 1 for all points in the training sample but takes the value 0 for some number of points in the ghost sample. Indeed, a standard permutation argument shows that log(2/δ) or more points of the ghost sample will violate the sparsity condition with probability at most δ/2. Thus, for arbitrary D′ ∈ D_0 satisfying the conditions of the lemma, with probability at least 1 − δ at most

η(m, d, k, ε, δ) := 2dk log( 96s/(λµτ(D, x)) ) + log(2m + 1) + 2 log(2/δ)

points from the ghost sample violate A_s(D′, x′) or C_{τ_{3/2}(D,x)}(D′, x′).

Now, consider the at least m − η(m, d, k, ε, δ) points in the ghost sample that satisfy both A_s(D′, ·) and C_{τ_{3/2}(D,x)}(D′, ·). Since ‖D′ − D‖_2 ≤ τ_3(D, x)µ / ((s + µ)/λ + √s + µ), the Sparse Coding Stability Theorem (Theorem 1) implies that these points satisfy A_s(D, ·) and C_{τ_3(D,x)}(D, ·).

Proof (of Lemma 3): First, note that J ∩ Z̄ is a subset of

R := { z z′ : ∃f ∈ F_µ, Υ_{s,ι}(f, x) ∧ ( ∄x̃ ⊂ x′ : |x̃| > η ∧ ∀x ∈ x̃, ¬Υ_{s,τ_3(f,x)}(f, x) ) ∧ (P_{z′} f_φ − P_z f_φ > t/2) }.

To see this, let bad ghost be shorthand for the condition that there are more than η points in the ghost sample that violate Υ_{s,τ_3(f,x)}(f, ·) (i.e., that are "bad"), let good ghost be true if and only if bad ghost is false, and let P be shorthand for the condition P_{z′} f_φ − P_z f_φ > t/2. Suppose that z z′ ∈ J ∩ Z̄. Then ∃f ∈ F_µ such that Υ_{s,ι}(f, x) and P, and furthermore, ∄f ∈ F_µ such that Υ_{s,ι}(f, x) and bad ghost. Consider all f ∈ F_µ such that Υ_{s,ι}(f, x) and P. Suppose any one of these satisfies bad ghost. Then ∃f ∈ F_µ such that Υ_{s,ι}(f, x) and bad ghost, which implies that z z′ ∉ J ∩ Z̄. Hence, if z z′ ∈ J ∩ Z̄ then ∃f ∈ F_µ such that Υ_{s,ι}(f, x), P, and good ghost.

Now, choose an f = (D, w) ∈ F_µ satisfying Υ_{s,ι}(f, x) and good ghost. If ‖D − D′‖_2 ≤ ε and ‖w − w′‖_2 ≤ ε, then for all but η of the points in the ghost sample (and for all points of the original sample) we are guaranteed that

|⟨w, ϕ_D(x_i)⟩ − ⟨w′, ϕ_{D′}(x_i)⟩| ≤ |⟨w − w′, ϕ_D(x_i)⟩| + |⟨w′, ϕ_D(x_i) − ϕ_{D′}(x_i)⟩|
 ≤ ε/λ + r(ε/µ)(√s/λ + 1)    (Sparse Coding Stability Theorem (Theorem 1))
 = ε( (r/µ)(√s/λ + 1) + 1/λ ) =: β.

For the rest of the ghost sample, there is a coarse guarantee that ‖ϕ_D(x_i) − ϕ_{D′}(x_i)‖_2 ≤ 2/λ. Hence, on the original sample

(1/m) Σ_{i=1}^m |φ(y_i, ⟨w, ϕ_D(x_i)⟩) − φ(y_i, ⟨w′, ϕ_{D′}(x_i)⟩)| ≤ Lβ,

where the loss φ is L-Lipschitz. On the ghost sample,

(1/m) Σ_{i=1}^m |φ(y′_i, ⟨w, ϕ_D(x′_i)⟩) − φ(y′_i, ⟨w′, ϕ_{D′}(x′_i)⟩)|
 ≤ (L/m) Σ_{i good} |⟨w, ϕ_D(x′_i)⟩ − ⟨w′, ϕ_{D′}(x′_i)⟩| + (1/m) Σ_{i bad} |φ(y′_i, ⟨w, ϕ_D(x′_i)⟩) − φ(y′_i, ⟨w′, ϕ_{D′}(x′_i)⟩)|
 ≤ Lβ + (η/m) min{ L(ε + 2r)/λ, b }
 ≤ Lβ + bη/m.

The subclass of interest is

F̃(x, x′) := { f ∈ F_µ : Υ_{s,ι}(f, x) ∧ ( ∄x̃ ⊂ x′ : |x̃| > η ∧ ∀x ∈ x̃, ¬Υ_{s,τ_3(f,x)}(f, x) ) }.

It is sufficient to consider the ε-neighborhood of F̃(x, x′), defined such that if f = (D, w) ∈ F̃(x, x′), then all f′ = (D′, w′) satisfying ‖D − D′‖_2 ≤ ε and ‖w − w′‖_2 ≤ ε are in the ε-neighborhood.

Let F_ε = D_ε × W_ε, where D_ε is a minimum-cardinality proper ε-cover of D_µ and W_ε is a minimum-cardinality ε-cover of W. Note that the ε-neighborhood will contain at least one element of F_ε. Hence, it is sufficient to prove the large deviation bound for all of F_ε and to then consider the maximum difference between an element of F̃(x, x′) and its closest representative in F_ε (which of course is O(ε)).

Now, for each f = (D, w) ∈ F_µ satisfying Υ_{s,ι}(f, x) and good ghost, there is a D′ ∈ D_ε such that ‖D − D′‖_2 ≤ ε and a w′ ∈ W_ε satisfying ‖w − w′‖_2 ≤ ε; hence, the difference between the losses of f and f′ on the double sample will be at most 2Lβ + bη/m. Let ν be the absolute deviation between the loss of f on the original sample and its loss on the ghost sample. Then the absolute deviation between the loss of f′ = (D′, w′) on the original sample and the loss of f′ on the ghost sample is at least ν − (2Lβ + bη/m). Hence, if ν > t/2, then the absolute deviation between the loss of f′ on the original sample and the loss of f′ on the ghost sample must be at least t/2 − (2Lβ + bη/m).

Let f_{D′,w′} be the hypothesis corresponding to (D′, w′). It therefore is sufficient to bound the probability

Pr_{z z′}{ ∃f′ = (D′, w′) ∈ D_ε × W_ε : P_{z′} f′_φ − P_z f′_φ > t/2 − (2Lβ + bη/m) },

since R implies the above event.

Now, let ϖ := t/2 − (2Lβ + bη/m). We first handle the case of a fixed f′ = (D′, w′) ∈ D_ε × W_ε. Applying Hoeffding's inequality to the random variable φ(y, f′(x_i)) − φ(y′, f′(x′_i)) (with range in [−b, b]) yields

Pr_{z z′}{ P_{z′} f′_φ − P_z f′_φ > ϖ } ≤ exp(−mϖ²/(2b²)).

Applying Proposition 4 to extend this bound over all of D_ε × W_ε via the union bound yields

Pr_{z z′}{ ∃f′ = (D′, w′) ∈ D_ε × W_ε : P_{z′} f′_φ − P_z f′_φ > ϖ } ≤ ( 8(r/2)^{1/(d+1)} / ε )^{(d+1)k} exp(−mϖ²/(2b²)).

Thus,

Pr(J ∩ Z̄) ≤ ( 8(r/2)^{1/(d+1)} / ε )^{(d+1)k} exp(−mϖ²/(2b²)).

Proof (of Theorem 2): Proposition 1 and Lemmas 2 and 3 imply that Prz {∃f ∈ Fµ Υs ,ι (f , x) ∧ (P fφ − Pz fφ > t )} ! (d +1)k  8(r /2)1/(d +1) 2 2 ≤2 exp(−m$ /(2b )) + δ ε Equivalently,     b η(m, d , k , ε, δ) Prz ∃f ∈ Fµ Υs ,ι (f , x) ∧ P fφ − Pz fφ > 2 $ + 2Lβ + m !   1/(d +1) (d +1)k 8(r /2) ≤2 exp(−m$2 /(2b 2 )) + δ ε Now, expand β and η and replace δ with δ/4: ( Prz ∃f ∈ Fµ Υs ,ι (f , x) ∧ P fφ − Pz fφ !)  √ 96s 8 b (2dk log λµτ( r s 1 f ,x) + log(2m + 1) + 2 log δ ) >2 $ + 2Lε ( + 1) + + µ λ λ m (d +1)k  δ 8(r /2)1/(d +1) exp(−m$2 /(2b 2 )) + . ≤2 ε 2   ( d +1) k 1/(d+1) = 8(r /2)ε exp(−m$2 /(2b 2 )), yielding 

Choose (

δ 4

Prz ∃f ∈ Fµ Υs ,ι (f , x)   √  r s 1 ∧ P fφ − Pz fφ > 2 $ + 2Lε ( + 1) + µ λ λ

b (2dk log +2  ≤4·

+ log(2m + 1) + 2(d + 1)k log

ε 8(r /2)1/(d +1)

+

m $2 b2

+ 2 log 2)

)

m

 1/(d +1) (d +1)k

8(r /2) ε

96s λµτ(f ,x)

exp(−m$2 /(2b 2 )).

which is equivalent to ( Prz ∃f ∈ Fµ Υs ,ι (f , x)   √  r s 1 ∧ P fφ − Pz fφ > 2 $ + 2Lε ( + 1) + µ λ λ

b (2dk log +2  ≤4·

 1/(d +1) (d +1)k

8(r /2) ε

96s λµτ(f ,x)

− 2(d + 1)k log

8 ε

+ 2k log

m exp(−m$2 /(2b 2 )).


2 r

+ log(2m + 1) +

m $2 b2

+ 2 log 2)

)

Let δ (a new variable, not related to the previous incarnation of δ) be equal to the upper bound, and solve for $, yielding: s 2((d + 1)k log ε8 + k log 2r + log δ4 ) $=b m and hence ( Prz ∃f ∈ Fµ Υs ,ι (f , x) s

+ k log 2r + log δ4 ) m  √  b (2dk log r s 1 + 2 2Lε ( + 1) + + µ λ λ

∧ P fφ − Pz fφ > 2b

2((d + 1)k log

8 ε

96s λµτ(f ,x)

+ log(2m + 1) + 2 log δ8 )

!)

m

≤ δ. If we set ε =

1 m,

then provided that m >

1 3τ



√ s λ+ s µ

 +1 :

( Prz ∃f ∈ Fµ Υs ,ι (f , x) s

2((d + 1)k log(8m) + k log 2r + log δ4 ) m  √   ) 1 2b 96s 8 4L r s ( + 1) + + 2dk log + log(2m + 1) + 2 log + m µ λ λ m λµτ(f , x) δ

∧ P fφ − Pz fφ > 2b

≤ δ.

D Infinite-dimensional setting

D.1 Rademacher and Gaussian averages and related results

Let σ_1, ..., σ_m be independent Rademacher random variables distributed uniformly on {−1, 1}, and let γ_1, ..., γ_m be independent Gaussian random variables distributed as N(0, 1). Given a sample of m points x, define the conditional Rademacher and Gaussian averages of a function class F as

R_{m|x}(F) = (2/m) E_σ sup_{f∈F} Σ_{i=1}^m σ_i f(x_i)   and   G_{m|x}(F) = (2/m) E_γ sup_{f∈F} Σ_{i=1}^m γ_i f(x_i),

respectively. From Meir and Zhang (2003, Theorem 7), the loss-composed conditional Rademacher average of a function class F is bounded by the scaled conditional Rademacher average:

Lemma 7 (Rademacher Loss Comparison Lemma). For every function class F, sample of m points x, and φ that is L-Lipschitz continuous in its second argument:

R_{m|z}(φ ∘ F) ≤ L R_{m|x}(F).
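For a finite collection of hypotheses evaluated on a fixed sample, the conditional Rademacher average defined above can be estimated by Monte Carlo. The following hedged sketch does exactly that; the restriction to a finite class is purely for illustration, since the sup over an infinite class would require its own optimization.

```python
# Sketch: Monte Carlo estimate of the conditional Rademacher average
#   R_{m|x}(F) = (2/m) E_sigma sup_{f in F} sum_i sigma_i f(x_i)
# for a finite collection F of hypotheses evaluated on a fixed sample x.
import numpy as np

def rademacher_average(F_values, n_draws=2000, seed=0):
    """F_values: array of shape (num_functions, m) with entries f(x_i)."""
    rng = np.random.default_rng(seed)
    m = F_values.shape[1]
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, m))
    sups = np.max(F_values @ sigma.T, axis=0)     # sup over f for each sigma draw
    return 2.0 * sups.mean() / m

rng = np.random.default_rng(0)
F_values = rng.normal(size=(25, 100))             # 25 hypotheses on m = 100 points
print(rademacher_average(F_values))
```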


Additionally, from Ledoux and Talagrand (1991, a brief argument following Lemma 4.5), the conditional Rademacher average of a function class F is bounded, up to a constant, by the conditional Gaussian average of F:

Lemma 8 (Rademacher-Gaussian Average Comparison Lemma). For every function class F and sample of m points x:

R_{m|x}(F) ≤ √(π/2) G_{m|x}(F).

The next relation is due to Slepian (1962).

Lemma 9 (Slepian's Lemma). Let Ω and Γ be mean zero, separable Gaussian processes indexed by a set T such that E(Ω_{t_1} − Ω_{t_2})² ≤ E(Γ_{t_1} − Γ_{t_2})² for all t_1, t_2 ∈ T. Then E sup_{t∈T} Ω_t ≤ E sup_{t∈T} Γ_t.

Slepian's Lemma essentially says that if the variance of the increments of one Gaussian process is bounded by that of another, then, in expectation, the supremum of the first is bounded by the supremum of the second. We also will make use of McDiarmid's inequality (McDiarmid, 1989).

Theorem 6 (McDiarmid's Inequality). Let X_1, ..., X_m be random variables drawn iid according to a probability measure µ over a space X. Suppose that a function f : X^m → R satisfies

sup_{x_1,...,x_m,x′_i ∈ X} |f(x_1, ..., x_m) − f(x_1, ..., x_{i−1}, x′_i, x_{i+1}, ..., x_m)| ≤ c_i

for each i ∈ [m]. Then

Pr_{X_1,...,X_m}{ f(X_1, ..., X_m) − E f(X_1, ..., X_m) ≥ t } ≤ exp( −2t² / Σ_{i=1}^m c_i² ).

D.2 Proofs

Proof (of Lemma 4): By definition, Pr_{x′ x′′}{ Υ_{s,τ}(f, x′′) ∧ ¬Υ_{η,s,τ}(f, x′) } is equal to

Pr_{x′ x′′}{ A_s(f, x′′) ∧ C_τ(f, x′′) ∧ ( ∃x̃ ⊂ x′ : |x̃| > η ∧ ∀x ∈ x̃, ¬A_s(f, x) ∨ ¬C_τ(f, x) ) }.

From the permutation argument, if no point in x′′ violates A_s(f, ·), then the probability that more than log(2/δ) points of x′ violate A_s(f, ·) is at most δ/2; in addition, if no point of x′′ violates C_τ(f, ·), then the probability that more than log(2/δ) points of x′ violate C_τ(f, ·) is at most δ/2. Thus, the probability that more than η = 2 log(2/δ) points of x′ violate either A_s(f, ·) or C_τ(f, ·) is at most δ.

Proof (of Lemma 5): From the definitions of Υ_{s,τ} and Υ_{η,s,τ}:

Pr_{z z′}{ ∃f ∈ F_{µ∗} : Υ_{s,τ}(f, x) ∧ Υ_{η,s,τ}(f, x′) ∧ (P_{z′} f_φ − P_z f_φ ≥ t/2) }
 = Pr_{z z′}{ sup_{f ∈ F_{µ∗,Υ}(x) ∩ F_{µ∗,Υ_η}(x′)} (1/m) Σ_{i=1}^m ( φ(y′_i, f(x′_i)) − φ(y_i, f(x_i)) ) ≥ t/2 }.


Now, a routine application of symmetrization by random signs yields

Pr_{z z′}{ sup_{f ∈ F_{µ∗,Υ}(x) ∩ F_{µ∗,Υ_η}(x′)} (1/m) Σ_{i=1}^m ( φ(y′_i, f(x′_i)) − φ(y_i, f(x_i)) ) ≥ t/2 }
 = Pr_{z z′,σ}{ sup_{f ∈ F_{µ∗,Υ}(x) ∩ F_{µ∗,Υ_η}(x′)} (1/m) Σ_{i=1}^m σ_i ( φ(y′_i, f(x′_i)) − φ(y_i, f(x_i)) ) ≥ t/2 }
 ≤ Pr_{z z′,σ}{ ( sup_{f ∈ F_{µ∗,Υ}(x) ∩ F_{µ∗,Υ_η}(x′)} (1/m) Σ_{i=1}^m σ_i φ(y′_i, f(x′_i)) ≥ t/4 ) ∨ ( sup_{f ∈ F_{µ∗,Υ}(x) ∩ F_{µ∗,Υ_η}(x′)} (1/m) Σ_{i=1}^m σ_i φ(y_i, f(x_i)) ≥ t/4 ) }
 ≤ Pr_{z,σ}{ sup_{f ∈ F_{µ∗,Υ}(x)} (1/m) Σ_{i=1}^m σ_i φ(y_i, f(x_i)) ≥ t/4 } + Pr_{z,σ}{ sup_{f ∈ F_{µ∗,Υ_η}(x)} (1/m) Σ_{i=1}^m σ_i φ(y_i, f(x_i)) ≥ t/4 }.

Proof (of Lemma 6): By definition, $\varphi_{US}(x) = \arg\min_{z \in \mathbb{R}^k} \|x - USz\|_2^2 + \lambda \|z\|_1$. First, note the equivalences
\[ \arg\min_{z \in \mathbb{R}^k} \|x - USz\|_2 = \arg\min_{z \in \mathbb{R}^k} \|U^T x - U^T USz\|_2 = \arg\min_{z \in \mathbb{R}^k} \|U^T x - Sz\|_2, \]
where the first equality follows because the component of $x$ lying in $\mathrm{Ker}(U^T)$ (i.e., in the orthogonal complement of the range space of $U$) is orthogonal to $USz$ for every $z$; hence, it is sufficient to approximate the projection of $x$ onto the range space of $U$. Thus, $\varphi_{US}(x) = \arg\min_{z \in \mathbb{R}^k} \|U^T x - Sz\|_2^2 + \lambda \|z\|_1$. Our approach is to use the well-known reformulation as a quadratic program with linear constraints:
\[
\begin{aligned}
\underset{\bar{z} \in \mathbb{R}^{3k}}{\text{minimize}} \quad & Q_U(\bar{z}) := \bar{z}^T \begin{pmatrix} S^T S & 0_{k \times 2k} \\ 0_{2k \times k} & 0_{2k \times 2k} \end{pmatrix} \bar{z} - \bar{z}^T \begin{pmatrix} 2 S^T U^T \\ 0_{2k \times d} \end{pmatrix} x + \lambda \, (0_k^T \;\; 1_{2k}^T) \, \bar{z} \\
\text{subject to} \quad & z^+ \ge 0_k, \quad z^- \ge 0_k, \quad z - z^+ + z^- = 0_k,
\end{aligned}
\]
where $\bar{z} := (z^T \;\; (z^+)^T \;\; (z^-)^T)^T$.

For optimal solutions $\bar{z}_* := \begin{pmatrix} z_* \\ z_*^+ \\ z_*^- \end{pmatrix}$ and $\bar{t}_* := \begin{pmatrix} t_* \\ t_*^+ \\ t_*^- \end{pmatrix}$ of $Q_U$ and $Q_{U'}$ respectively, from Daniel (1973), we have
\[ (\bar{u} - \bar{z}_*)^T \nabla Q_U(\bar{z}_*) \ge 0 \tag{10} \]
\[ (\bar{u} - \bar{t}_*)^T \nabla Q_{U'}(\bar{t}_*) \ge 0 \tag{11} \]
for all $\bar{u} \in \mathbb{R}^{3k}$. Setting $\bar{u}$ to $\bar{t}_*$ in (10) and $\bar{u}$ to $\bar{z}_*$ in (11) and adding (10) and (11) yields $(\bar{t}_* - \bar{z}_*)^T (\nabla Q_U(\bar{z}_*) - \nabla Q_{U'}(\bar{t}_*)) \ge 0$, which is equivalent to
\[ (\bar{t}_* - \bar{z}_*)^T \left( \nabla Q_{U'}(\bar{t}_*) - \nabla Q_{U'}(\bar{z}_*) \right) \le (\bar{t}_* - \bar{z}_*)^T \left( \nabla Q_U(\bar{z}_*) - \nabla Q_{U'}(\bar{z}_*) \right). \tag{12} \]
Here,
\[ \nabla Q_U(\bar{z}) = \begin{pmatrix} S^T S & 0_{k \times 2k} \\ 0_{2k \times k} & 0_{2k \times 2k} \end{pmatrix} \bar{z} - \begin{pmatrix} 2 S^T U^T \\ 0_{2k \times d} \end{pmatrix} x + \lambda \begin{pmatrix} 0_k \\ 1_{2k} \end{pmatrix}. \]
After plugging in the expansions of $\nabla Q_U$ and $\nabla Q_{U'}$ and incurring cancellations from the zeros, (12) becomes
\[ (t_* - z_*)^T \left( S^T S t_* - 2 S^T U'^T x - S^T S z_* + 2 S^T U'^T x \right) \le (t_* - z_*)^T \left( S^T S z_* - 2 S^T U^T x - S^T S z_* + 2 S^T U'^T x \right), \]
and hence
\[ (t_* - z_*)^T S^T S (t_* - z_*) \le 2 (t_* - z_*)^T S^T (U'^T - U^T) x. \]
If $t_*$ and $z_*$ both have sparsity level at most $s$, then wherever we typically would consider the operator norm $\|S\|_2 := \sup_{\|t\|=1} \|St\|_2$, we instead need only consider the $2s$-restricted operator norm $\|S\|_{2,2s}$. Note that $(t_* - z_*)^T S^T S (t_* - z_*) \ge \mu_{2s}(S) \|t_* - z_*\|_2^2$, which implies that
\[ \mu_{2s}(S) \|t_* - z_*\|_2^2 \le 2 \|t_* - z_*\|_2 \, \|S\|_{2,2s} \, \|(U'^T - U^T) x\|_2 \quad \Longrightarrow \quad \|t_* - z_*\|_2 \le \frac{2 \|S\|_{2,2s}}{\mu_{2s}(S)} \, \|(U'^T - U^T) x\|_2. \]
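As an informal numerical check of this stability bound (our own sketch; nothing here is code from the paper), the snippet below solves the two $\ell_1$-regularized problems with a plain ISTA loop for dictionaries $US$ and $U'S$ and compares $\|t_* - z_*\|_2$ against the right-hand side. For simplicity it substitutes the unrestricted operator norm $\|S\|_2$ and the full squared minimum singular value of $S$ for the $2s$-restricted quantities, which can only loosen the bound.

```python
import numpy as np

# Sketch: check  ||t* - z*||_2 <= (2 ||S||_2 / sigma_min(S)^2) * ||(U' - U)^T x||_2
# for the lasso encoder  phi_{US}(x) = argmin_z ||x - U S z||_2^2 + lam * ||z||_1,
# solved here with a basic ISTA loop (illustrative, not the paper's solver).

def ista(D, x, lam, n_iters=5000):
    """Minimize ||x - D z||_2^2 + lam * ||z||_1 by iterative soft-thresholding."""
    step = 1.0 / (2.0 * np.linalg.norm(D, 2) ** 2)   # 1 / Lipschitz constant of the smooth part
    z = np.zeros(D.shape[1])
    for _ in range(n_iters):
        w = z - step * (2.0 * D.T @ (D @ z - x))                   # gradient step on ||x - D z||^2
        z = np.sign(w) * np.maximum(np.abs(w) - step * lam, 0.0)   # soft-threshold (prox of lam*||.||_1)
    return z

rng = np.random.default_rng(1)
d, k, lam = 20, 10, 0.1

S, _ = np.linalg.qr(rng.normal(size=(k, k)))                   # unit-norm (here orthonormal) columns
U, _ = np.linalg.qr(rng.normal(size=(d, k)))                   # left isometry: U^T U = I_k
Uprime, _ = np.linalg.qr(U + 0.01 * rng.normal(size=(d, k)))   # perturbed left isometry
x = rng.normal(size=d); x /= np.linalg.norm(x)

z_star = ista(U @ S, x, lam)
t_star = ista(Uprime @ S, x, lam)

lhs = np.linalg.norm(t_star - z_star)
sigma_min = np.linalg.svd(S, compute_uv=False)[-1]
rhs = 2 * np.linalg.norm(S, 2) / sigma_min**2 * np.linalg.norm((Uprime - U).T @ x)
print(f"||t* - z*|| = {lhs:.4f}  <=  bound = {rhs:.4f}")
```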

Proof (of Theorem 3): Define a Gaussian process $\Omega$, indexed by $U$, by $\Omega_U := \sum_{i=1}^m \gamma_i \langle w, \varphi_{US}(x_i) \rangle$. Our goal is to apply Slepian's Lemma to bound the expectation of the supremum of $\Omega$, which depends on $\varphi_{US}$, by the expectation of the supremum of a Gaussian process $\Gamma$ which depends only on $U$:
\[
\begin{aligned}
\mathbb{E}_\gamma (\Omega_U - \Omega_{U'})^2
&= \mathbb{E}_\gamma \left( \sum_{i=1}^m \gamma_i \langle w, \varphi_{US}(x_i) \rangle - \sum_{i=1}^m \gamma_i \langle w, \varphi_{U'S}(x_i) \rangle \right)^2 \\
&= \sum_{i=1}^m \left( \langle w, \varphi_{US}(x_i) - \varphi_{U'S}(x_i) \rangle \right)^2 \\
&\le r^2 \sum_{i=1}^m \| \varphi_{US}(x_i) - \varphi_{U'S}(x_i) \|^2.
\end{aligned}
\tag{13}
\]

Applying the result from Lemma 6, we have
\[
\begin{aligned}
\mathbb{E}_\gamma (\Omega_U - \Omega_{U'})^2
&\le r^2 \sum_{i=1}^m \| \varphi_{US}(x_i) - \varphi_{U'S}(x_i) \|^2 \\
&\le \left( \frac{2 r \|S\|_{2,2s}}{\mu_{2s}(S)} \right)^2 \sum_{i=1}^m \left\| (U'^T - U^T) x_i \right\|^2 \\
&= \left( \frac{2 r \|S\|_{2,2s}}{\mu_{2s}(S)} \right)^2 \sum_{i=1}^m \sum_{j=1}^k \left( \langle U' e_j, x_i \rangle - \langle U e_j, x_i \rangle \right)^2 \\
&= \left( \frac{2 r \|S\|_{2,2s}}{\mu_{2s}(S)} \right)^2 \mathbb{E}_\gamma \left( \sum_{i=1}^m \sum_{j=1}^k \gamma_{ij} \langle U' e_j, x_i \rangle - \sum_{i=1}^m \sum_{j=1}^k \gamma_{ij} \langle U e_j, x_i \rangle \right)^2 \\
&= \mathbb{E}_\gamma (\Gamma_U - \Gamma_{U'})^2
\end{aligned}
\]
for
\[ \Gamma_U := \frac{2 r \|S\|_{2,2s}}{\mu_{2s}(S)} \sum_{i=1}^m \sum_{j=1}^k \gamma_{ij} \langle U e_j, x_i \rangle. \]

By Slepian's Lemma, $\mathbb{E}_\gamma \sup_U \Omega_U \le \mathbb{E}_\gamma \sup_U \Gamma_U$. It remains to bound $\mathbb{E}_\gamma \sup_U \Gamma_U$:
\[
\begin{aligned}
\frac{\mu_{2s}(S)}{2 r \|S\|_{2,2s}} \, \mathbb{E}_\gamma \sup_U \Gamma_U
&= \mathbb{E}_\gamma \sup_U \sum_{i=1}^m \sum_{j=1}^k \gamma_{ij} \langle U e_j, x_i \rangle \\
&= \mathbb{E}_\gamma \sup_U \sum_{j=1}^k \left\langle U e_j, \sum_{i=1}^m \gamma_{ij} x_i \right\rangle \\
&\le \mathbb{E}_\gamma \sup_U \sum_{j=1}^k \| U e_j \| \left\| \sum_{i=1}^m \gamma_{ij} x_i \right\| \\
&= k \, \mathbb{E}_\gamma \left\| \sum_{i=1}^m \gamma_{i1} x_i \right\|
\qquad \text{(since $\|U e_j\| = 1$ and the $k$ inner sums are identically distributed)} \\
&\le k \sqrt{ \mathbb{E}_\gamma \left\| \sum_{i=1}^m \gamma_{i1} x_i \right\|^2 }
= k \sqrt{ \mathbb{E}_\gamma \left\langle \sum_{i=1}^m \gamma_{i1} x_i, \sum_{i=1}^m \gamma_{i1} x_i \right\rangle }
= k \sqrt{ \sum_{i=1}^m \| x_i \|^2 }
\le k \sqrt{m}.
\end{aligned}
\]
Hence,
\[
\mathbb{E}_\gamma \sup_{U \in \mathcal{U}} \frac{2}{m} \sum_{i=1}^m \gamma_i \langle w, \varphi_{US}(x_i) \rangle
\le \frac{4 r \|S\|_{2,2s} \, k}{\mu_{2s}(S) \sqrt{m}}
\le \frac{4 r k \sqrt{2s}}{\mu_{2s}(S) \sqrt{m}},
\]
where we used the fact that $\|S\|_{2,2s} \le \sqrt{2s}$ (see Lemma 10 in Appendix D.2 for a proof).
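The only probabilistic ingredients in the last chain are Jensen's inequality and the independence of the $\gamma_{i1}$; the following Monte Carlo sketch (ours, with arbitrary illustrative data) confirms the step $\mathbb{E}_\gamma \| \sum_i \gamma_{i1} x_i \| \le \sqrt{\sum_i \|x_i\|^2} \le \sqrt{m}$ for points in the unit ball.

```python
import numpy as np

# Check of  E_gamma || sum_i gamma_i x_i ||  <=  sqrt(sum_i ||x_i||^2)  <=  sqrt(m)
# for iid standard Gaussian gamma_i and points x_i in the unit ball (illustrative data).
rng = np.random.default_rng(2)
m, d, trials = 50, 30, 5000

X = rng.normal(size=(m, d))
X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0)   # force ||x_i|| <= 1

gammas = rng.normal(size=(trials, m))
lhs = np.mean(np.linalg.norm(gammas @ X, axis=1))                # Monte Carlo estimate
rhs = np.sqrt(np.sum(np.linalg.norm(X, axis=1) ** 2))

print(f"E||sum gamma_i x_i|| ≈ {lhs:.3f}  <=  {rhs:.3f}  <=  sqrt(m) = {np.sqrt(m):.3f}")
```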


Proof (of Theorem 4): In the course of the proof, we will use the following factorization of the space of dictionaries $\mathcal{D}$ that was introduced by Maurer and Pontil (2008). Each dictionary $D \in \mathcal{D}$ will be factorized as $D = US$, where $S \in \mathcal{S} := (B_{\mathbb{R}^k})^k$ and
\[ U \in \mathcal{U} := \{ U' \in \mathbb{R}^{d \times k} : \|U' x\|_2 = \|x\|_2 \text{ for } x \in \mathbb{R}^k \}. \]
$\mathcal{U}$ is the set of semi-orthogonal matrices in $\mathbb{R}^{d \times k}$; these matrices are left isometries (i.e. they satisfy $U^T U = I$). Let $\mathcal{S}_\varepsilon$ be a minimum-cardinality proper $\varepsilon$-cover (in operator norm) of $\{S \in \mathcal{S} : \mu_s(S) \ge \mu^*_s, \, \mu_{2s}(S) \ge \mu^*_{2s}\}$, the set of suitably incoherent elements of $\mathcal{S}$.

Now, onward to the proof. The goal is to bound
\[ \Pr_{\mathbf{x},\sigma} \left\{ \sup_{f \in \mathcal{F}_{\mu^*,\Upsilon_\eta}(\mathbf{x})} \frac{1}{m} \sum_{i=1}^m \sigma_i \phi(y_i, f(x_i)) \ge \frac{t}{4} \right\}. \]
We say an index $i$ is good if and only if $\Upsilon_{s,\tau}(f, x_i)$. An index is bad if and only if it is not good. Consider a fixed sample $\mathbf{z}$ and the occurrence of a set of $m - \eta$ good indices; each of the remaining indices can be either good or bad. There are $N := \binom{m}{\eta}$ ways to choose this set of indices. Partition $\mathcal{F}_{\mu^*,\Upsilon_\eta}(\mathbf{x})$ into $N$ subclasses via $\mathcal{F}_{\mu^*,\Upsilon_\eta}(\mathbf{x}) = \bigcup_{j \in [N]} \mathcal{F}^j_{\mu^*,\Upsilon_\eta}(\mathbf{x})$ where, for all functions in a given subclass, a specific set of $m - \eta$ indices is guaranteed to be good. More formally, we can choose distinct good index sets $\Gamma_1, \ldots, \Gamma_N$, each of size $m - \eta$, such that for each $j$:
\[ \Gamma_j \subset \bigcap_{f \in \mathcal{F}^j_{\mu^*,\Upsilon_\eta}(\mathbf{x})} \{ i \in [m] : \Upsilon_{s,\tau}(f, x_i) \}. \]

Clearly,
\[ \sup_{f \in \mathcal{F}_{\mu^*,\Upsilon_\eta}(\mathbf{x})} \sum_{i=1}^m \sigma_i \phi(y_i, f(x_i)) = \max_{j \in [N]} \; \sup_{f \in \mathcal{F}^j_{\mu^*,\Upsilon_\eta}(\mathbf{x})} \sum_{i=1}^m \sigma_i \phi(y_i, f(x_i)). \]

For each $j \in [N]$, define the $\varepsilon$-neighborhood of $\mathcal{F}^j_{\mu^*,\Upsilon_\eta}(\mathbf{x})$ as
\[ \bar{\mathcal{F}}^j_{\mu^*,\Upsilon_\eta}(\mathbf{x}) := \left\{ f = (US', w') : \|S - S'\| \le \varepsilon, \; \|w - w'\| \le \varepsilon, \; S \in \mathcal{S}, \; w \in \mathcal{W}, \; (US, w) \in \mathcal{F}^j_{\mu^*,\Upsilon_\eta}(\mathbf{x}) \right\}. \]
Also, let $\mathcal{W}_\varepsilon$ be a minimum-cardinality $\varepsilon$-cover of $\mathcal{W}$ and define a particular $\varepsilon$-cover of $\mathcal{F}$:
\[ \mathcal{F}_\varepsilon := \left\{ f = (US', w') \in \mathcal{F} : U \in \mathcal{U}, \; S' \in \mathcal{S}_\varepsilon, \; w' \in \mathcal{W}_\varepsilon \right\}. \]
Take the intersection $\bar{\mathcal{F}}^j_{\mu^*,\Upsilon_\eta}(\mathbf{x}) \cap \mathcal{F}_\varepsilon$, a disjoint union of subclasses equal to
\[ \bigcup_{S' \in \mathcal{S}_\varepsilon, \, w' \in \mathcal{W}_\varepsilon} \mathcal{F}^{j,S',w'}_{\mu^*,\Upsilon_\eta}(\mathbf{x}) \]
for
\[ \mathcal{F}^{j,S',w'}_{\mu^*,\Upsilon_\eta}(\mathbf{x}) := \bar{\mathcal{F}}^j_{\mu^*,\Upsilon_\eta}(\mathbf{x}) \cap \left\{ f \in \mathcal{F} : f = (US', w'), \; U \in \mathcal{U} \right\}. \]
Now, for each $j \in [N]$ and arbitrary $\sigma \in \{-1, 1\}^m$, we compare
\[ \sup_{f \in \mathcal{F}^j_{\mu^*,\Upsilon_\eta}(\mathbf{x})} \frac{1}{m} \sum_{i=1}^m \sigma_i \phi(y_i, f(x_i)) \qquad \text{and} \qquad \max_{S' \in \mathcal{S}_\varepsilon, \, w' \in \mathcal{W}_\varepsilon} \; \sup_{f \in \mathcal{F}^{j,S',w'}_{\mu^*,\Upsilon_\eta}(\mathbf{x})} \frac{1}{m} \sum_{i=1}^m \sigma_i \phi(y_i, f(x_i)). \]

Without loss of generality, choose $j = 1$ and take $\Gamma_1 = [m - \eta]$. If $f \in \mathcal{F}^1_{\mu^*,\Upsilon_\eta}(\mathbf{x})$, it follows that there exists $f' \in \bigcup_{S' \in \mathcal{S}_\varepsilon, \, w' \in \mathcal{W}_\varepsilon} \mathcal{F}^{1,S',w'}_{\mu^*,\Upsilon_\eta}(\mathbf{x})$ such that
\[
\begin{aligned}
&\frac{1}{m} \sum_{i=1}^m \sigma_i \left| \phi(y_i, \langle w, \varphi_D(x_i) \rangle) - \phi(y_i, \langle w', \varphi_{D'}(x_i) \rangle) \right| \\
&\quad \le \frac{L}{m} \left( \sum_{i=1}^{m-\eta} \sigma_i \left| \langle w, \varphi_D(x_i) \rangle - \langle w', \varphi_{D'}(x_i) \rangle \right| \right)
+ \frac{1}{m} \sum_{i=m-\eta+1}^{m} \sigma_i \left| \phi(y_i, \langle w, \varphi_D(x_i) \rangle) - \phi(y_i, \langle w', \varphi_{D'}(x_i) \rangle) \right| \\
&\quad \le L \varepsilon \left( \frac{\sqrt{s}}{\mu^*_s} \left( \frac{r}{\lambda} + 1 \right) + \frac{1}{\lambda} \right) + \frac{b \eta}{m},
\end{aligned}
\]
where the last line is due to the Sparse Coding Stability Theorem (Theorem 1). Therefore, for any $\sigma \in \{-1, 1\}^m$ it holds that
\[
\sup_{f \in \mathcal{F}_{\mu^*,\Upsilon_\eta}(\mathbf{x})} \frac{1}{m} \sum_{i=1}^m \sigma_i \phi(y_i, f(x_i))
\le \max_{j \in [N]} \, \max_{S' \in \mathcal{S}_\varepsilon, \, w' \in \mathcal{W}_\varepsilon} \, \sup_{f \in \mathcal{F}^{j,S',w'}_{\mu^*,\Upsilon_\eta}(\mathbf{x})} \frac{1}{m} \sum_{i=1}^m \sigma_i \phi(y_i, f(x_i))
+ L \varepsilon \left( \frac{\sqrt{s}}{\mu^*_s} \left( \frac{r}{\lambda} + 1 \right) + \frac{1}{\lambda} \right) + \frac{b \eta}{m}.
\]

Hence, for $\mathbf{z}$ fixed,
\[
\begin{aligned}
&\Pr_\sigma \left\{ \sup_{f \in \mathcal{F}_{\mu^*,\Upsilon_\eta}(\mathbf{x})} \frac{1}{m} \sum_{i=1}^m \sigma_i \phi(y_i, f(x_i)) \ge \frac{t}{4} \right\} \\
&\quad \le \Pr_\sigma \left\{ \max_{j \in [N]} \, \max_{S' \in \mathcal{S}_\varepsilon, \, w' \in \mathcal{W}_\varepsilon} \, \sup_{f \in \mathcal{F}^{j,S',w'}_{\mu^*,\Upsilon_\eta}(\mathbf{x})} \frac{1}{m} \sum_{i=1}^m \sigma_i \phi(y_i, f(x_i)) \ge \frac{t}{4} - L \varepsilon \left( \frac{\sqrt{s}}{\mu^*_s} \left( \frac{r}{\lambda} + 1 \right) + \frac{1}{\lambda} \right) - \frac{b \eta}{m} \right\} \\
&\quad \le N \left( \frac{8 (r/2)^{1/(k+1)}}{\varepsilon} \right)^{(k+1)k} \max_{j \in [N]} \, \max_{S' \in \mathcal{S}_\varepsilon, \, w' \in \mathcal{W}_\varepsilon} \Pr_\sigma \left\{ \sup_{f \in \mathcal{F}^{j,S',w'}_{\mu^*,\Upsilon_\eta}(\mathbf{x})} \frac{1}{m} \sum_{i=1}^m \sigma_i \phi(y_i, f(x_i)) \ge \frac{t}{4} - L \varepsilon \left( \frac{\sqrt{s}}{\mu^*_s} \left( \frac{r}{\lambda} + 1 \right) + \frac{1}{\lambda} \right) - \frac{b \eta}{m} \right\}.
\end{aligned}
\]

Now, from McDiarmid's inequality, for any fixed $j \in [N]$, $S' \in \mathcal{S}_\varepsilon$ and $w' \in \mathcal{W}_\varepsilon$:
\[
\Pr_\sigma \left\{ \sup_{f \in \mathcal{F}^{j,S',w'}_{\mu^*,\Upsilon_\eta}(\mathbf{x})} \frac{1}{m} \sum_{i=1}^m \sigma_i \phi(y_i, f(x_i)) > \mathbb{E}_\sigma \sup_{f \in \mathcal{F}^{j,S',w'}_{\mu^*,\Upsilon_\eta}(\mathbf{x})} \frac{1}{m} \sum_{i=1}^m \sigma_i \phi(y_i, f(x_i)) + t_1 \right\} \le \exp\!\left( -m t_1^2 / (2 b^2) \right).
\]

Now, we bound
\[ \mathbb{E}_\sigma \sup_{f \in \mathcal{F}^{j,S',w'}_{\mu^*,\Upsilon_\eta}(\mathbf{x})} \frac{1}{m} \sum_{i=1}^m \sigma_i \phi(y_i, f(x_i)). \]


Without loss of generality, again take $j = 1$ and $\Gamma_1 = [m - \eta]$. Then
\[
\begin{aligned}
\mathbb{E}_\sigma &\sup_{f \in \mathcal{F}^{j,S',w'}_{\mu^*,\Upsilon_\eta}(\mathbf{x})} \frac{1}{m} \sum_{i=1}^m \sigma_i \phi(y_i, f(x_i)) \\
&= \mathbb{E}_\sigma \sup_{f \in \mathcal{F}^{j,S',w'}_{\mu^*,\Upsilon_\eta}(\mathbf{x})} \left\{ \frac{1}{m} \sum_{i=1}^{m-\eta} \sigma_i \phi(y_i, f(x_i)) + \frac{1}{m} \sum_{i=m-\eta+1}^{m} \sigma_i \phi(y_i, f(x_i)) \right\} \\
&\le \mathbb{E}_{\sigma_1, \ldots, \sigma_{m-\eta}} \sup_{f \in \mathcal{F}^{j,S',w'}_{\mu^*,\Upsilon_\eta}(\mathbf{x})} \left\{ \frac{1}{m} \sum_{i=1}^{m-\eta} \sigma_i \phi(y_i, f(x_i)) \right\}
+ \mathbb{E}_{\sigma_{m-\eta+1}, \ldots, \sigma_m} \sup_{f \in \mathcal{F}^{j,S',w'}_{\mu^*,\Upsilon_\eta}(\mathbf{x})} \left\{ \frac{1}{m} \sum_{i=m-\eta+1}^{m} \sigma_i \phi(y_i, f(x_i)) \right\} \\
&\le \mathbb{E}_{\sigma_1, \ldots, \sigma_{m-\eta}} \sup_{f \in \mathcal{F}^{j,S',w'}_{\mu^*,\Upsilon_\eta}(\mathbf{x})} \left\{ \frac{1}{m} \sum_{i=1}^{m-\eta} \sigma_i \phi(y_i, f(x_i)) \right\} + \frac{b \eta}{m}.
\end{aligned}
\]

Now, Theorem 3, the Rademacher Loss Comparison Lemma (Lemma 7), and the Rademacher-Gaussian Average Comparison Lemma (Lemma 8) imply that
\[
\mathbb{E}_\sigma \sup_{U \in \mathcal{U} : A_s(US, \mathbf{x})} \frac{1}{m} \sum_{i=1}^{m-\eta} \sigma_i \phi\big( y_i, \langle w, \varphi_{US}(x_i) \rangle \big)
\le \frac{\sqrt{m - \eta}}{m} \cdot \frac{2 L \sqrt{\pi} \, r k \sqrt{s}}{\mu_{2s}(S)}
\le \frac{2 L \sqrt{\pi} \, r k \sqrt{s}}{\mu_{2s}(S) \sqrt{m}},
\]
and hence
\[
\Pr_\sigma \left\{ \sup_{f \in \mathcal{F}^{j,S',w'}_{\mu^*,\Upsilon_\eta}(\mathbf{x})} \frac{1}{m} \sum_{i=1}^m \sigma_i \phi(y_i, f(x_i)) > \frac{2 L \sqrt{\pi} \, r k \sqrt{s}}{\mu^*_{2s} \sqrt{m}} + \frac{b \eta}{m} + t_1 \right\} \le \exp\!\left( -m t_1^2 / (2 b^2) \right).
\]

Combining this bound with the fact that the bound is independent of the draw of $\mathbf{z}$ and applying Proposition 4 (with $d = k$) to extend the bound over all choices of $j$, $S'$, and $w'$ results in
\[
\Pr_{\mathbf{z},\sigma} \left\{ \sup_{f \in \mathcal{F}_{\mu^*,\Upsilon_\eta}(\mathbf{x})} \frac{1}{m} \sum_{i=1}^m \sigma_i \phi(y_i, f(x_i)) \ge \frac{t}{4} \right\}
\le \binom{m}{\eta} \left( \frac{8 (r/2)^{1/(k+1)}}{\varepsilon} \right)^{(k+1)k} \exp\!\left( -m t_3^2 / (2 b^2) \right),
\]
for
\[
t_3 := \frac{t}{4} - L \varepsilon \left( \frac{\sqrt{s}}{\mu^*_s} \left( \frac{r}{\lambda} + 1 \right) + \frac{1}{\lambda} \right) - \frac{2 b \eta}{m} - \frac{2 L \sqrt{\pi} \, r k \sqrt{s}}{\mu^*_{2s} \sqrt{m}}.
\]


Lemma 10. If $S \in (B_{\mathbb{R}^k})^k$, then $\|S\|_{2,2s} \le \sqrt{2s}$.

Proof: Define $S_\Lambda$ as the submatrix of $S$ that selects the columns indexed by $\Lambda$. Similarly, for $t \in \mathbb{R}^k$ define the coordinate projection $t_\Lambda$ of $t$. Then
\[
\begin{aligned}
\sup_{\{t : \|t\| = 1, \, |\operatorname{supp}(t)| \le 2s\}} \|S t\|_2
&= \max_{\{\Lambda \subset [k] : |\Lambda| \le 2s\}} \; \sup_{\{t : \|t\| = 1, \, \operatorname{supp}(t) \subset \Lambda\}} \|S_\Lambda t_\Lambda\|_2 \\
&= \max_{\{\Lambda \subset [k] : |\Lambda| \le 2s\}} \; \sup_{\{t : \|t\| = 1, \, \operatorname{supp}(t) \subset \Lambda\}} \left\| \sum_{\omega \in \Lambda} t_\omega S_\omega \right\|_2 \\
&\le \max_{\{\Lambda \subset [k] : |\Lambda| \le 2s\}} \; \sup_{\{t : \|t\| = 1, \, \operatorname{supp}(t) \subset \Lambda\}} \sum_{\omega \in \Lambda} |t_\omega| \, \|S_\omega\|_2 \\
&\le \max_{\{\Lambda \subset [k] : |\Lambda| \le 2s\}} \; \sup_{\{t : \|t\| = 1, \, \operatorname{supp}(t) \subset \Lambda\}} \sum_{\omega \in \Lambda} |t_\omega| \\
&= \max_{\{\Lambda \subset [k] : |\Lambda| \le 2s\}} \; \sup_{\{t : \|t\| = 1, \, \operatorname{supp}(t) \subset \Lambda\}} \|t_\Lambda\|_1 \\
&\le \sqrt{2s} \, \|t_\Lambda\|_2 = \sqrt{2s}.
\end{aligned}
\]
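A brute-force numerical check of Lemma 10 (our own illustrative sketch, not from the paper) enumerates every support of size $2s$ and verifies that the largest restricted operator norm stays below $\sqrt{2s}$:

```python
import numpy as np
from itertools import combinations

# Check ||S||_{2,2s} <= sqrt(2s) for a matrix S whose columns lie in the unit ball:
# the 2s-restricted operator norm equals the largest operator norm over all
# submatrices formed by 2s columns, so we can enumerate the supports directly.
rng = np.random.default_rng(3)
k, s = 8, 2

S = rng.normal(size=(k, k))
S /= np.maximum(np.linalg.norm(S, axis=0, keepdims=True), 1.0)   # force ||S_j||_2 <= 1

restricted_norm = max(
    np.linalg.norm(S[:, list(cols)], 2)          # largest singular value of the submatrix
    for cols in combinations(range(k), 2 * s)
)
print(f"||S||_(2,2s) = {restricted_norm:.3f}  <=  sqrt(2s) = {np.sqrt(2 * s):.3f}")
```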

Covering numbers

Cucker and Smale (2002, Chapter I, Proposition 5) state that, for a Banach space $E$ of dimension $d$, the $\varepsilon$-covering numbers of the radius-$r$ ball of $E$ are bounded as $\mathcal{N}(r B_E, \varepsilon) \le (4r/\varepsilon)^d$. For spaces of dictionaries obeying some deterministic property, such as $\mathcal{D}_\mu = \{D \in \mathcal{D} : \mu_s(D) \ge \mu\}$, one must be careful to use a proper $\varepsilon$-cover so that the representative elements of the cover also obey the desired property: a proper cover is more restricted than a cover in that a proper cover must be a subset of the set being covered (rather than simply being a subset of the ambient Banach space). That is, if $A$ is a proper cover of a subset $T$ of a Banach space $E$, then $A \subset T$; for a cover, we need only $A \subset E$. The following bound relates proper covering numbers to covering numbers (a simple proof can be found in (Vidyasagar, 2002, Lemma 2.1)): if $E$ is a Banach space and $T \subset E$ is a bounded subset, then $\mathcal{N}_{\mathrm{proper}}(E, \varepsilon, T) \le \mathcal{N}(E, \varepsilon/2, T)$.

Let $d, k \in \mathbb{N}$. Define $\mathcal{E}_\mu := \{E \in (B_{\mathbb{R}^d})^k : \mu_s(E) \ge \mu\}$ and $\mathcal{W} := r B_{\mathbb{R}^k}$. The following bounds derive directly from the above.

Proposition 3. The proper $\varepsilon$-covering number of $\mathcal{E}_\mu$ is bounded by
\[ \left( \frac{8}{\varepsilon} \right)^{dk}. \]

Proposition 4. The product of the proper $\varepsilon$-covering number of $\mathcal{E}_\mu$ and the $\varepsilon$-covering number of $\mathcal{W}$ is bounded by
\[ \left( \frac{8 (r/2)^{1/(d+1)}}{\varepsilon} \right)^{(d+1)k}. \]
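To see where the constant in Proposition 4 comes from (our own arithmetic, under the assumption that the $\varepsilon$-cover of $\mathcal{W}$ is obtained from the Cucker-Smale bound with dimension $k$ and radius $r$), multiply the bound of Proposition 3 by $(4r/\varepsilon)^k$:
\[
\left( \frac{8}{\varepsilon} \right)^{dk} \left( \frac{4r}{\varepsilon} \right)^{k}
= \frac{8^{dk} \cdot 8^{k} \cdot (r/2)^{k}}{\varepsilon^{(d+1)k}}
= \left( \frac{8 (r/2)^{1/(d+1)}}{\varepsilon} \right)^{(d+1)k}.
\]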
