Deep Convolutional Neural Networks on Cartoon Functions
Philipp Grohs∗, Thomas Wiatowski†, and Helmut Bölcskei†
arXiv:1605.00031v1 [cs.LG] 29 Apr 2016
∗Dept. Math., ETH Zurich, Switzerland, and Dept. Math., University of Vienna, Austria
†Dept. IT & EE, ETH Zurich, Switzerland
∗[email protected], †{withomas, boelcskei}@nari.ee.ethz.ch
Abstract—Wiatowski and Bölcskei, 2015, proved that deformation stability and vertical translation invariance of deep convolutional neural network-based feature extractors are guaranteed by the network structure per se rather than the specific convolution kernels and non-linearities. While the translation invariance result applies to square-integrable functions, the deformation stability bound holds for band-limited functions only. Many signals of practical relevance (such as natural images) exhibit, however, sharp and curved discontinuities and are hence not band-limited. The main contribution of this paper is a deformation stability result that takes these structural properties into account. Specifically, we establish deformation stability bounds for the class of cartoon functions introduced by Donoho, 2001.
I. INTRODUCTION

Feature extractors based on so-called deep convolutional neural networks have been applied with tremendous success in a wide range of practical signal classification tasks [1]–[3]. These networks are composed of multiple layers, each of which computes convolutional transforms, followed by the application of non-linearities and pooling operations. The mathematical analysis of feature extractors generated by deep convolutional neural networks was initiated in a seminal paper by Mallat [4]. Specifically, Mallat analyzes so-called scattering networks, where signals are propagated through layers that compute semi-discrete wavelet transforms (i.e., convolutional transforms with pre-specified filters obtained from a mother wavelet through scaling operations), followed by modulus non-linearities. It was shown in [4] that the resulting wavelet-modulus feature extractor is horizontally translation-invariant [5] and deformation-stable, with the stability result applying to a function space that depends on the underlying mother wavelet. Recently, Wiatowski and Bölcskei [5] extended Mallat's theory to incorporate convolutional transforms with filters that are (i) pre-specified and potentially structured such as Weyl-Heisenberg (Gabor) functions [6], wavelets [7], curvelets [8], shearlets [9], and ridgelets [10], (ii) pre-specified and unstructured such as random filters [11], and (iii) learned in a supervised [12] or unsupervised [13] fashion. Furthermore, the networks in [5] may employ general Lipschitz-continuous non-linearities (e.g., rectified linear units, shifted logistic sigmoids, hyperbolic tangents, and the modulus function) and pooling through sub-sampling. The essence of the results in [5] is that vertical translation invariance and deformation stability are induced by the network structure per se rather than the
specific choice of filters and non-linearities. While the vertical translation invariance result in [5] is general in the sense of applying to the function space L2 (Rd ), the deformation stability result in [5] pertains to square-integrable band-limited functions. Moreover, the corresponding deformation stability bound depends linearly on the bandwidth. Many signals of practical relevance (such as natural images) can be modeled as square-integrable functions that are, however, not band-limited or have large bandwidth. Large bandwidths render the deformation stability bound in [5] void as a consequence of its linear dependence on bandwidth. Contributions. The question considered in this paper is whether taking structural properties of natural images into account can lead to stronger deformation stability bounds. We show that the answer is in the affirmative by analyzing the class of cartoon functions introduced in [14]. Cartoon functions satisfy mild decay properties and are piecewise continuously differentiable apart from curved discontinuities along Lipschitz-continuous hypersurfaces. Moreover, they provide a good model for natural images such as those in the MNIST [15], Caltech-256 [16], and CIFAR-100 [17] datasets as well as for images of geometric objects of different shapes, sizes, and colors [18], [19]. The proof of our main result is based on the decoupling technique introduced in [5]. The essence of decoupling is that contractivity of the feature extractor combined with deformation stability of the signal class under consideration—under smoothness conditions on the deformation—establishes deformation stability for the feature extractor. Our main technical contribution here is to prove deformation stability for the class of cartoon functions. Moreover, we show that the decay rate of the resulting deformation stability bound is best possible. 
The results we obtain further underpin the observation made in [5] of deformation stability and vertical translation invariance being induced by the network structure per se.

Notation. We refer the reader to [5, Sec. 1] for the general notation employed in this paper. In addition, we will need the following notation. For x ∈ R^d, we set ⟨x⟩ := (1 + |x|²)^{1/2}. The Minkowski sum of sets A, B ⊆ R^d is (A + B) := {a + b | a ∈ A, b ∈ B}. A Lipschitz domain D is a set D ⊆ R^d whose boundary ∂D is "sufficiently regular" to be thought of as locally being the graph of a Lipschitz-continuous function; for a formal definition see [20, Def. 1.40]. The indicator function of a set B ⊆ R^d is defined as 1_B(x) := 1,
[Figure 1: tree of feature-extraction operations, from the root U[e]f = f through U[λ_1]f, U[λ_1, λ_2]f, and U[λ_1, λ_2, λ_3]f, with the outputs f ∗ χ_0, U[λ_1]f ∗ χ_1, and U[λ_1, λ_2]f ∗ χ_2 generated at the respective layers.]
for x ∈ B, and 1_B(x) := 0, for x ∈ R^d\B. For a measurable set B ⊆ R^d, we let vol^d(B) := ∫_{R^d} 1_B(x) dx = ∫_B 1 dx.

Fig. 1: Network architecture underlying the feature extractor (2). The index λ_n^{(k)} corresponds to the k-th atom g_{λ_n^{(k)}} of the collection Ψ_n associated with the n-th network layer. The function χ_n is the output-generating atom of the n-th layer.

II. DEEP CONVOLUTIONAL NEURAL NETWORK-BASED FEATURE EXTRACTORS
We set the stage by briefly reviewing the deep convolutional feature extraction network presented in [5], the basis of which is a sequence of triplets Ω := ((Ψ_n, M_n, R_n))_{n∈N} referred to as module-sequence. The triplet (Ψ_n, M_n, R_n)—associated with the n-th network layer—consists of (i) a collection Ψ_n := {g_{λ_n}}_{λ_n∈Λ_n} of so-called atoms g_{λ_n} ∈ L¹(R^d) ∩ L²(R^d), indexed by a countable set Λ_n and satisfying the Bessel condition ∑_{λ_n∈Λ_n} ‖f ∗ g_{λ_n}‖₂² ≤ B_n‖f‖₂², for all f ∈ L²(R^d), for some B_n > 0, (ii) an operator M_n : L²(R^d) → L²(R^d) satisfying the Lipschitz property ‖M_n f − M_n h‖₂ ≤ L_n‖f − h‖₂, for all f, h ∈ L²(R^d), and M_n f = 0 for f = 0, and (iii) a sub-sampling factor R_n ≥ 1. Associated with (Ψ_n, M_n, R_n), we define the operator

U_n[λ_n]f := R_n^{d/2} M_n(f ∗ g_{λ_n})(R_n ·),   (1)

and extend it to paths on index sets q = (λ_1, λ_2, . . . , λ_n) ∈ Λ_1 × Λ_2 × · · · × Λ_n := Λ_1^n, n ∈ N, according to

U[q]f = U[(λ_1, λ_2, . . . , λ_n)]f := U_n[λ_n] · · · U_2[λ_2]U_1[λ_1]f,

where for the empty path e := ∅ we set Λ_1^0 := {e} and U[e]f := f, for f ∈ L²(R^d).

Remark 1. The Bessel condition on the atoms g_{λ_n} is equivalent to ∑_{λ_n∈Λ_n} |ĝ_{λ_n}(ω)|² ≤ B_n, for a.e. ω ∈ R^d (see [5, Prop. 2]), and is hence easily satisfied even by learned filters [5, Remark 2]. An overview of collections Ψ_n = {g_{λ_n}}_{λ_n∈Λ_n} of structured atoms g_{λ_n} (such as, e.g., Weyl-Heisenberg (Gabor) functions, wavelets, curvelets, shearlets, and ridgelets) and non-linearities M_n widely used in the deep learning literature (e.g., hyperbolic tangent, shifted logistic sigmoid, rectified linear unit, and modulus function) is provided in [5, App. B–D].
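As a sanity check on the Bessel condition in Remark 1, the following Python sketch (not from the paper; the filter-bank design, signal length, and all variable names are our own illustrative choices) builds a discrete 1-D band-pass bank whose transfer functions tile the frequency axis, so that ∑_λ |ĝ_λ(ω)|² = 1 for every ω and the discrete analogue of ∑_λ ‖f ∗ g_λ‖₂² ≤ B_n‖f‖₂² holds with B_n = 1 and equality (a Parseval frame):

```python
import numpy as np

# Discrete 1-D analogue of the Bessel condition: an ideal band-pass bank
# whose transfer functions g_hat_lambda partition the N FFT bins, so that
# sum_lambda |g_hat_lambda(omega)|^2 = 1 for every frequency bin.
N = 256                      # signal length (our choice)
num_bands = 8                # number of atoms g_lambda (our choice)
rng = np.random.default_rng(0)
f = rng.standard_normal(N)   # arbitrary test signal

# Assign each of the N frequency bins to one of num_bands disjoint bands.
band_of_bin = (np.arange(N) * num_bands) // N
g_hat = np.stack([(band_of_bin == k).astype(float) for k in range(num_bands)])

# Circular convolutions f * g_lambda, computed in the frequency domain.
f_hat = np.fft.fft(f)
outputs = np.fft.ifft(g_hat * f_hat)          # shape (num_bands, N)

energy_in = np.sum(np.abs(f) ** 2)            # ||f||_2^2
energy_out = np.sum(np.abs(outputs) ** 2)     # sum_lambda ||f * g_lambda||_2^2

# sum_lambda |g_hat_lambda|^2 == 1 everywhere => Bessel bound with B_n = 1,
# met with equality (Parseval frame).
assert np.allclose(np.sum(g_hat ** 2, axis=0), 1.0)
assert np.isclose(energy_out, energy_in)
```

By Parseval's identity for the DFT, the energy of the filter-bank outputs equals ‖f‖₂² exactly whenever the squared transfer functions sum to one, which is the discrete counterpart of the frequency-domain characterization of the Bessel condition in [5, Prop. 2].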
For every n ∈ N, we designate one of the atoms in Ψ_n = {g_{λ_n}}_{λ_n∈Λ_n} as the output-generating atom χ_{n−1} := g_{λ_n^∗}, λ_n^∗ ∈ Λ_n, of the (n−1)-th layer. The atoms {g_{λ_n}}_{λ_n∈Λ_n\{λ_n^∗}} ∪ {χ_{n−1}} are thus used across two consecutive layers in the sense of χ_{n−1} = g_{λ_n^∗} generating the output in the (n−1)-th layer, and the remaining atoms {g_{λ_n}}_{λ_n∈Λ_n\{λ_n^∗}} propagating signals to the n-th layer according to (1), see Fig. 1. From now on, with slight abuse of notation, we write Λ_n for Λ_n\{λ_n^∗} as well. The extracted features Φ_Ω(f) of a signal f ∈ L²(R^d) are defined as [5, Def. 3] Φ_Ω(f) :=
⋃_{n=0}^{∞} {(U[q]f) ∗ χ_n}_{q∈Λ_1^n},   (2)
where (U[q]f) ∗ χ_n, q ∈ Λ_1^n, is a feature generated in the n-th layer of the network, see Fig. 1. It is shown in [5, Thm. 2] that for all f ∈ L²(R^d) the feature extractor Φ_Ω is vertically translation-invariant in the sense of the layer depth n determining the extent to which the features (U[q]f) ∗ χ_n, q ∈ Λ_1^n, are translation-invariant. Furthermore, under the condition

max_{n∈N} max{B_n, B_n L_n²} ≤ 1,   (3)
referred to as weak admissibility condition in [5, Def. 4] and satisfied by a wide variety of module-sequences Ω (see [5, Sec. 3]), the following result is established in [5, Thm. 1]: The feature extractor Φ_Ω is stable on the space of R-band-limited functions L²_R(R^d) w.r.t. deformations (F_τ f)(x) := f(x − τ(x)), i.e., there exists a universal constant C > 0 (that does not depend on Ω) such that for all f ∈ L²_R(R^d) and all (possibly non-linear) τ ∈ C¹(R^d, R^d) with ‖Dτ‖_∞ ≤ 1/(2d), it holds that

|||Φ_Ω(F_τ f) − Φ_Ω(f)||| ≤ C R ‖τ‖_∞ ‖f‖₂.   (4)

Here, the feature space norm is defined as |||Φ_Ω(f)|||² := ∑_{n=0}^{∞} ∑_{q∈Λ_1^n} ‖(U[q]f) ∗ χ_n‖₂².
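The feature extractor (2) and the feature space norm can be sketched in a few lines. The toy construction below is our own, not the paper's: a two-layer, 1-D instance with a Parseval band-pass bank, the modulus non-linearity, no sub-sampling (R_n = 1), and the low-pass atom playing the role of the output-generating atom χ_n. It numerically verifies the contractivity property |||Φ_Ω(f) − Φ_Ω(h)||| ≤ ‖f − h‖₂ invoked later in Section IV:

```python
import numpy as np

# Toy 1-D sketch of the feature extractor Phi_Omega in (2): each layer
# convolves with a Parseval band-pass bank, applies the modulus
# non-linearity, and uses no sub-sampling (R_n = 1).  The low-pass atom
# serves as the output-generating atom chi_n.  All sizes are our choices.
N, num_bands, depth = 256, 8, 2
band_of_bin = (np.arange(N) * num_bands) // N
g_hat = np.stack([(band_of_bin == k).astype(float) for k in range(num_bands)])
chi_hat, prop_hat = g_hat[0], g_hat[1:]   # output atom vs. propagating atoms

def conv(x, h_hat):
    """Circular convolution via the FFT."""
    return np.fft.ifft(np.fft.fft(x) * h_hat)

def features(f):
    """Collect the features (U[q]f) * chi_n over all paths q up to depth."""
    feats, layer = [conv(f, chi_hat)], [f]          # n = 0 output: f * chi_0
    for _ in range(depth):
        nxt = [np.abs(conv(x, h)) for x in layer for h in prop_hat]
        feats += [conv(x, chi_hat) for x in nxt]    # outputs of this layer
        layer = nxt
    return feats

def feature_dist(phi_f, phi_h):
    """||| Phi(f) - Phi(h) |||: root of the summed squared l^2 distances."""
    return np.sqrt(sum(np.sum(np.abs(a - b) ** 2) for a, b in zip(phi_f, phi_h)))

rng = np.random.default_rng(1)
f, h = rng.standard_normal(N), rng.standard_normal(N)
lhs = feature_dist(features(f), features(h))
rhs = np.sqrt(np.sum((f - h) ** 2))
assert lhs <= rhs + 1e-9          # contractivity, cf. [5, Prop. 4]
```

Contractivity holds here because the modulus is 1-Lipschitz and the bank is a Parseval frame (B_n = L_n = 1), so the weak admissibility condition (3) is satisfied; the distance budget ‖f − h‖₂ is split across the output and propagating branches at every node.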
Fig. 2: Left: A natural image (image credit: [21]) is typically governed by areas of little variation, with the individual areas separated by edges that can be modeled as curved singularities. Right: An image of a handwritten digit.

For practical classification tasks, we can think of the deformation F_τ as follows. Let f be a representative of a certain signal class, e.g., f is an image of the handwritten digit "8" (see Fig. 2, right). Then, {F_τ f | ‖Dτ‖_∞ < 1/(2d)} is a collection of images of the handwritten digit "8", where each F_τ f may be generated, e.g., based on a different handwriting style. The bound ‖Dτ‖_∞ < 1/(2d) on the Jacobian matrix of τ imposes a quantitative limit on the amount of deformation tolerated, rendering the bound (4) implicitly dependent on Dτ. The stability bound (4) now guarantees that the features corresponding to the images in the set {F_τ f | ‖Dτ‖_∞ < 1/(2d)} do not differ too much.

III. CARTOON FUNCTIONS

The bound in (4) applies to the space of square-integrable R-band-limited functions. Many signals of practical significance (e.g., natural images) are, however, not band-limited (due to the presence of sharp and possibly curved edges, see Fig. 2) or exhibit large bandwidths. In the latter case, the deformation stability bound (4) becomes void as it depends linearly on R. The goal of this paper is to take structural properties of natural images into account by considering the class of cartoon functions introduced in [14]. These functions satisfy mild decay properties and are piecewise continuously differentiable apart from curved discontinuities along Lipschitz-continuous hypersurfaces. Cartoon functions provide a good model for natural images (see Fig. 2, left) such as those in the Caltech-256 [16] and CIFAR-100 [17] data sets, for images of handwritten digits [15] (see Fig. 2, right), and for images of geometric objects of different shapes, sizes, and colors [18], [19]. We proceed to the formal definition of cartoon functions.

Definition 1. The function f : R^d → C is referred to as a cartoon function if it can be written as f = f_1 + 1_B f_2, where B ⊆ R^d is a compact Lipschitz domain with boundary of finite length, i.e., vol^{d−1}(∂B) < ∞, and f_i ∈ L²(R^d) ∩ C¹(R^d, C), i = 1, 2, satisfies the decay condition

|∇f_i(x)| ≤ C⟨x⟩^{−d}, i = 1, 2,   (5)

for some C > 0 (that does not depend on f_1, f_2). Furthermore, we denote by

C^K_CART := {f_1 + 1_B f_2 | f_i ∈ L²(R^d) ∩ C¹(R^d, C), i = 1, 2, |∇f_i(x)| ≤ K⟨x⟩^{−d}, vol^{d−1}(∂B) ≤ K, ‖f_2‖_∞ ≤ K}

the class of cartoon functions of maximal size K > 0.
We chose the term size to indicate the length vol^{d−1}(∂B) of the boundary ∂B of the Lipschitz domain B. Furthermore, C^K_CART ⊆ L²(R^d), for all K > 0; this simply follows from the triangle inequality according to ‖f_1 + 1_B f_2‖₂ ≤ ‖f_1‖₂ + ‖1_B f_2‖₂ ≤ ‖f_1‖₂ + ‖f_2‖₂ < ∞, where in the last step we used f_1, f_2 ∈ L²(R^d). Finally, we note that our main results—presented in the next section—can easily be generalized to finite linear combinations of cartoon functions, but this is not done here for simplicity of exposition.

IV. MAIN RESULTS

We start by reviewing the decoupling technique introduced in [5] to prove deformation stability bounds for band-limited functions. The proof of the deformation stability bound (4) for band-limited functions in [5] is based on two key ingredients. The first one is a contractivity property of Φ_Ω (see [5, Prop. 4]), namely |||Φ_Ω(f) − Φ_Ω(h)||| ≤ ‖f − h‖₂, for all f, h ∈ L²(R^d). Contractivity guarantees that pairwise distances of input signals do not increase through feature extraction. The second ingredient is an upper bound on the deformation error ‖f − F_τ f‖₂ (see [5, Prop. 5]), specific to the signal class considered in [5], namely band-limited functions. Recognizing that the combination of these two ingredients yields a simple proof of deformation stability is interesting as it shows that whenever a signal class exhibits inherent stability w.r.t. deformations of the form (F_τ f)(x) = f(x − τ(x)), we automatically obtain deformation stability for the feature extractor Φ_Ω. The present paper employs this decoupling technique and establishes deformation stability for the class of cartoon functions by deriving an upper bound on the deformation error ‖f − F_τ f‖₂ for f ∈ C^K_CART.

Proposition 1. For every K > 0 there exists a constant C_K > 0 such that for all f ∈ C^K_CART and all (possibly non-linear) τ : R^d → R^d with ‖τ‖_∞ < 1/2, it holds that

‖f − F_τ f‖₂ ≤ C_K ‖τ‖_∞^{1/2}.   (6)
Proof: See Appendix A.

The Lipschitz exponent α = 1/2 on the right-hand side (RHS) of (6) determines the decay rate of the deformation error ‖f − F_τ f‖₂ as ‖τ‖_∞ → 0. Clearly, a larger α > 0 results in the deformation error decaying faster as the deformation becomes smaller. The following simple example shows that the Lipschitz exponent α = 1/2 in (6) is best possible, i.e., it cannot be larger. Consider d = 1 and τ_s(x) = s, for a fixed s satisfying 0 < s < 1/2; the corresponding deformation F_{τ_s} amounts to a simple translation by s with ‖τ_s‖_∞ = s < 1/2. Let f = 1_{[−1,1]}. Then f ∈ C^K_CART for some K > 0 and ‖f − F_{τ_s} f‖₂ = √(2s) = √2 ‖τ_s‖_∞^{1/2}.
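The sharpness example above is easy to verify numerically. The following sketch (grid spacing and shift values are our own choices) discretizes f = 1_{[−1,1]} and checks ‖f − F_{τ_s} f‖₂ = √(2s) for several shifts s:

```python
import numpy as np

# Numerical check of the sharpness example: for f = 1_{[-1,1]} and the
# translation tau_s(x) = s, one has ||f - F_{tau_s} f||_2 = sqrt(2 s),
# i.e., exactly Lipschitz exponent alpha = 1/2 in ||tau_s||_inf = s.
dx = 1e-3                                  # grid spacing (our choice)
x = np.arange(-2, 2, dx)
f = ((x >= -1) & (x <= 1)).astype(float)   # f = 1_{[-1,1]}

for k in (50, 100, 200):                   # shifts s = k * dx < 1/2
    s = k * dx
    f_shift = np.roll(f, k)                # (F_{tau_s} f)(x) = f(x - s)
    err = np.sqrt(np.sum((f - f_shift) ** 2) * dx)
    assert np.isclose(err, np.sqrt(2 * s))
```

Because the shifts are exact multiples of the grid spacing, the discrete L² error reproduces the continuum value √(2s) up to floating-point precision, so halving s shrinks the error only by a factor of 1/√2.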
Remark 2. It is interesting to note that in order to obtain bounds of the form ‖f − F_τ f‖₂ ≤ C‖τ‖_∞^α, for f ∈ C ⊆ L²(R^d), for some C > 0 (that does not depend on f, τ) and some α > 0, we need to impose non-trivial constraints on the set C ⊆ L²(R^d). Indeed, consider, again, d = 1 and τ_s(x) = s, for small s > 0. Let f_s ∈ L²(R^d) be a function that has its energy ‖f_s‖₂ = 1 concentrated in a small interval according to supp(f_s) ⊆ [−s/2, s/2]. Then, f_s and F_{τ_s} f_s have disjoint support sets and hence ‖f_s − F_{τ_s} f_s‖₂ = √2, which does not decay with ‖τ_s‖_∞^α = s^α for any α > 0. More generally, the amount of deformation induced by a given function τ depends strongly on the signal (class) it is applied to. Concretely, the deformation F_τ with τ(x) = e^{−x²}, x ∈ R, will lead to a small bump around the origin only when applied to a low-pass function, whereas the function f_s above will experience a significant deformation.

We are now ready to state our main result.

Theorem 1. Let Ω = ((Ψ_n, M_n, R_n))_{n∈N} be a module-sequence satisfying the weak admissibility condition (3). For every size K > 0, the feature extractor Φ_Ω is stable on the space of cartoon functions C^K_CART w.r.t. deformations (F_τ f)(x) = f(x − τ(x)), i.e., for every K > 0 there exists a constant C_K > 0 (that does not depend on Ω) such that for all f ∈ C^K_CART, and all (possibly non-linear) τ ∈ C¹(R^d, R^d) with ‖τ‖_∞ < 1/2 and ‖Dτ‖_∞ ≤ 1/(2d), it holds that

|||Φ_Ω(F_τ f) − Φ_Ω(f)||| ≤ C_K ‖τ‖_∞^{1/2}.   (7)
Proof: Applying the contractivity property |||Φ_Ω(g) − Φ_Ω(h)||| ≤ ‖g − h‖₂ with g = F_τ f and h = f, and using (6) yields (7) upon invoking the same arguments as in [5, Eq. 58] and [5, Lemma 2] to conclude that f ∈ L²(R^d) implies F_τ f ∈ L²(R^d) thanks to ‖Dτ‖_∞ ≤ 1/(2d).

The strength of the deformation stability result in Theorem 1 derives from the fact that the only condition we need to impose on the underlying module-sequence Ω is weak admissibility according to (3), which, as argued in [5, Sec. 3], can easily be met by normalizing the elements in Ψ_n, for all n ∈ N, appropriately. We emphasize that this normalization does not have an impact on the constant C_K in (7), which is shown in Appendix A to be independent of Ω. The dependence of C_K on K does, however, reflect the intuition that the deformation stability bound should depend on the description complexity of the signal class. For band-limited signals, this dependence is exhibited by the RHS in (4) being linear in the bandwidth R. Finally, we note that the vertical translation invariance result [5, Thm. 2] applies to all f ∈ L²(R^d) and, thanks to C^K_CART ⊆ L²(R^d) for all K > 0, carries over to cartoon functions.

Remark 3. We note that thanks to the decoupling technique underlying our arguments, the deformation stability bounds (4) and (7) are very general in the sense of applying to every contractive (linear or non-linear) mapping Φ. Specifically, the identity mapping Φ(f) = f also leads to deformation stability on the class of cartoon functions (and the class of band-limited
functions). This is interesting as it was recently demonstrated that employing the identity mapping as a so-called shortcut connection in a subset of layers of a very deep convolutional neural network yields state-of-the-art classification performance on the ImageNet dataset [22]. Our deformation stability result is hence general in the sense of applying to a broad class of network architectures used in practice.

For functions that do not exhibit discontinuities along Lipschitz-continuous hypersurfaces, but otherwise satisfy the decay condition (5), we can improve the decay rate of the deformation error from α = 1/2 to α = 1.

Corollary 1. Let Ω = ((Ψ_n, M_n, R_n))_{n∈N} be a module-sequence satisfying the weak admissibility condition (3). For every size K > 0, the feature extractor Φ_Ω is stable on the space H_K := {f ∈ L²(R^d) ∩ C¹(R^d, C) | |∇f(x)| ≤ K⟨x⟩^{−d}} w.r.t. deformations (F_τ f)(x) = f(x − τ(x)), i.e., for every K > 0 there exists a constant C_K > 0 (that does not depend on Ω) such that for all f ∈ H_K, and all (possibly non-linear) τ ∈ C¹(R^d, R^d) with ‖τ‖_∞ < 1/2 and ‖Dτ‖_∞ ≤ 1/(2d), it holds that

|||Φ_Ω(F_τ f) − Φ_Ω(f)||| ≤ C_K ‖τ‖_∞.

Proof: The proof follows that of Theorem 1 apart from employing (12) instead of (6).

APPENDIX A
PROOF OF PROPOSITION 1

The proof of (6) is based on judiciously combining deformation stability bounds for the components f_1, f_2 in (f_1 + 1_B f_2) ∈ C^K_CART and for the indicator function 1_B. The first bound, stated in Lemma 1 below, reads

‖f − F_τ f‖₂ ≤ CD‖τ‖_∞,   (8)
and applies to functions f satisfying the decay condition (11), with the constant D > 0 not depending on f and τ (see (14)). The bound in (8) needs the assumption ‖τ‖_∞ < 1/2. The second bound, stated in Lemma 2 below, is

‖1_B − F_τ 1_B‖₂ ≤ (2 vol^{d−1}(∂B))^{1/2} ‖τ‖_∞^{1/2}.   (9)
We now show how (8) and (9) can be combined to establish (6). For f = (f_1 + 1_B f_2) ∈ C^K_CART, we have

‖f − F_τ f‖₂ ≤ ‖f_1 − F_τ f_1‖₂ + ‖1_B(f_2 − F_τ f_2)‖₂ + ‖(1_B − F_τ 1_B)(F_τ f_2)‖₂   (10)
≤ ‖f_1 − F_τ f_1‖₂ + ‖f_2 − F_τ f_2‖₂ + ‖1_B − F_τ 1_B‖₂ ‖F_τ f_2‖_∞,

where in (10) we used F_τ(1_B f_2)(x) = (1_B f_2)(x − τ(x)) = 1_B(x − τ(x)) f_2(x − τ(x)) = (F_τ 1_B)(x)(F_τ f_2)(x). With the upper bounds (8) and (9), invoking properties of the class of cartoon functions C^K_CART (namely, (i) vol^{d−1}(∂B) ≤ K, (ii) f_1, f_2 satisfy (11) and thus (8) with C = K, and (iii) ‖F_τ f_2‖_∞ = sup_{x∈R^d} |f_2(x − τ(x))| ≤ sup_{y∈R^d} |f_2(y)| = ‖f_2‖_∞ ≤ K), this yields

‖f − F_τ f‖₂ ≤ 2KD‖τ‖_∞ + √2 K^{3/2} ‖τ‖_∞^{1/2} ≤ 2 max{2KD, √2 K^{3/2}} ‖τ‖_∞^{1/2} = C_K ‖τ‖_∞^{1/2}, with C_K := 2 max{2KD, √2 K^{3/2}},
which completes the proof of (6). It remains to show (8) and (9).

Lemma 1. Let f ∈ L²(R^d) ∩ C¹(R^d, C) be such that

|∇f(x)| ≤ C⟨x⟩^{−d},   (11)

for some constant C > 0, and let ‖τ‖_∞ < 1/2. Then,

‖f − F_τ f‖₂ ≤ CD‖τ‖_∞,   (12)

for a constant D > 0 that does not depend on f and τ.

Proof: We first upper-bound the integrand in ‖f − F_τ f‖₂² = ∫_{R^d} |f(x) − f(x − τ(x))|² dx. Owing to the mean value theorem [23, Thm. 3.7.5], we have

|f(x) − f(x − τ(x))| ≤ ‖τ‖_∞ sup_{y∈B_{‖τ‖_∞}(x)} |∇f(y)| ≤ C‖τ‖_∞ sup_{y∈B_{‖τ‖_∞}(x)} ⟨y⟩^{−d} =: h(x),

where the last inequality follows by assumption. The idea is now to split the integral ∫_{R^d} |h(x)|² dx into integrals over the sets B₁(0) and R^d\B₁(0). For x ∈ B₁(0), the monotonicity of the function x ↦ ⟨x⟩^{−d} implies h(x) ≤ C‖τ‖_∞⟨0⟩^{−d} = C‖τ‖_∞, and for x ∈ R^d\B₁(0), we have (1 − ‖τ‖_∞) ≤ (1 − ‖τ‖_∞/|x|), which together with the monotonicity of x ↦ ⟨x⟩^{−d} yields h(x) ≤ C‖τ‖_∞⟨(1 − ‖τ‖_∞/|x|)x⟩^{−d} ≤ C‖τ‖_∞⟨(1 − ‖τ‖_∞)x⟩^{−d}. Putting things together, we hence get

‖f − F_τ f‖₂² ≤ C²‖τ‖_∞² (vol^d(B₁(0)) + 2^d ∫_{R^d} ⟨u⟩^{−2d} du)   (13)
= C²‖τ‖_∞² (vol^d(B₁(0)) + 2^d ‖⟨·⟩^{−d}‖₂²) =: C²‖τ‖_∞² D²,   (14)

where in (13) we used the change of variables u = (1 − ‖τ‖_∞)x, together with

du/dx = (1 − ‖τ‖_∞)^d ≥ 2^{−d}.   (15)

The inequality in (15) follows from ‖τ‖_∞ < 1/2, which is by assumption. Since ‖⟨·⟩^{−d}‖₂ < ∞, for d ∈ N (see, e.g., [24, Sec. 1]), and, obviously, vol^d(B₁(0)) < ∞, it follows that D² < ∞, which completes the proof.

We continue with a deformation stability result for indicator functions 1_B.

Lemma 2. Let B ⊆ R^d be a compact Lipschitz domain with boundary of finite length, i.e., vol^{d−1}(∂B) < ∞. Then,

‖1_B − F_τ 1_B‖₂ ≤ (2 vol^{d−1}(∂B))^{1/2} ‖τ‖_∞^{1/2}.

Proof: In order to upper-bound ‖1_B − F_τ 1_B‖₂² = ∫_{R^d} |1_B(x) − 1_B(x − τ(x))|² dx, we first note that the integrand h(x) := |1_B(x) − 1_B(x − τ(x))|² satisfies h(x) = 1, for x ∈ S, where S := {x ∈ R^d | x ∈ B and x − τ(x) ∉ B} ∪ {x ∈ R^d | x ∉ B and x − τ(x) ∈ B}, and h(x) = 0, for x ∈ R^d\S. Since S ⊆ (∂B + B_{‖τ‖_∞}(0)), where (∂B + B_{‖τ‖_∞}(0)) is a tubular neighborhood of width ‖τ‖_∞ around the boundary ∂B of B, we have ‖1_B − F_τ 1_B‖₂² = ∫_{R^d} |h(x)|² dx = ∫_S 1 dx ≤ ∫_{∂B+B_{‖τ‖_∞}(0)} 1 dx ≤ 2 vol^{d−1}(∂B)‖τ‖_∞, which completes the proof.
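The tubular-neighborhood bound of Lemma 2 can also be checked numerically. The sketch below (disk radius, shift, and grid resolution are our own illustrative choices) takes B to be a disk in R², applies a constant translation τ(x) = (s, 0), and confirms that the area of the symmetric difference, i.e., ‖1_B − F_τ 1_B‖₂², stays below 2 vol¹(∂B)‖τ‖_∞:

```python
import numpy as np

# Grid-based check of Lemma 2 for B a disk of radius r in R^2 and a
# constant translation tau(x) = (s, 0): the symmetric-difference area
# ||1_B - F_tau 1_B||_2^2 is bounded by 2 * vol^1(dB) * ||tau||_inf
# = 2 * (2 pi r) * s.  Radius, shift, and resolution are our choices.
r, s, dx = 1.0, 0.05, 5e-3
ax = np.arange(-1.5, 1.5, dx)
X, Y = np.meshgrid(ax, ax)

inside = X ** 2 + Y ** 2 <= r ** 2                 # 1_B on the grid
inside_shifted = (X - s) ** 2 + Y ** 2 <= r ** 2   # 1_B(x - tau(x))

sym_diff_area = np.sum(inside ^ inside_shifted) * dx ** 2
bound = 2 * (2 * np.pi * r) * s                    # 2 vol^1(dB) ||tau||_inf

# For small s the true area is about 4*r*s = 0.2, well below the
# tubular-neighborhood bound of about 0.628.
assert sym_diff_area <= bound
assert sym_diff_area > 0
```

The slack between the measured area (≈ 4rs for small s) and the bound 4πrs reflects that the tubular neighborhood of ∂B overestimates the set S on which the two indicators actually differ.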
REFERENCES

[1] Y. Bengio, A. Courville, and P. Vincent, "Representation learning: A review and new perspectives," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 8, pp. 1798–1828, 2013.
[2] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," in Proc. of the IEEE, 1998, pp. 2278–2324.
[3] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, pp. 436–444, 2015.
[4] S. Mallat, "Group invariant scattering," Comm. Pure Appl. Math., vol. 65, no. 10, pp. 1331–1398, 2012.
[5] T. Wiatowski and H. Bölcskei, "A mathematical theory of deep convolutional neural networks for feature extraction," arXiv:1512.06293, 2015.
[6] K. Gröchenig, Foundations of time-frequency analysis. Birkhäuser, 2001.
[7] I. Daubechies, Ten lectures on wavelets. Society for Industrial and Applied Mathematics, 1992.
[8] E. J. Candès and D. L. Donoho, "Continuous curvelet transform: II. Discretization and frames," Appl. Comput. Harmon. Anal., vol. 19, no. 2, pp. 198–222, 2005.
[9] G. Kutyniok and D. Labate, Eds., Shearlets: Multiscale analysis for multivariate data. Birkhäuser, 2012.
[10] E. J. Candès, "Ridgelets: Theory and applications," Ph.D. dissertation, Stanford University, 1998.
[11] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun, "What is the best multi-stage architecture for object recognition?" in Proc. of IEEE International Conference on Computer Vision (ICCV), 2009, pp. 2146–2153.
[12] F. J. Huang and Y. LeCun, "Large-scale learning with SVM and convolutional nets for generic object categorization," in Proc. of IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2006, pp. 284–291.
[13] M. A. Ranzato, F. J. Huang, Y. Boureau, and Y. LeCun, "Unsupervised learning of invariant feature hierarchies with applications to object recognition," in Proc. of IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2007, pp. 1–8.
[14] D. L. Donoho, "Sparse components of images and optimal atomic decompositions," Constructive Approximation, vol. 17, no. 3, pp. 353–382, 2001.
[15] Y. LeCun and C. Cortes, "The MNIST database of handwritten digits," http://yann.lecun.com/exdb/mnist/, 1998.
[16] G. Griffin, A. Holub, and P. Perona, "Caltech-256 object category dataset," http://authors.library.caltech.edu/7694/, 2007.
[17] A. Krizhevsky, "Learning multiple layers of features from tiny images," Master's thesis, University of Toronto, 2009.
[18] "The baby AI school dataset," http://www.iro.umontreal.ca/%7Elisa/twiki/bin/view.cgi/Public/BabyAISchool, 2007.
[19] "The rectangles dataset," http://www.iro.umontreal.ca/~lisa/twiki/bin/view.cgi/Public/RectanglesData, 2007.
[20] B. Dacorogna, Introduction to the calculus of variations. Imperial College Press, 2004.
[21] G. Kutyniok and D. Labate, "Introduction to shearlets," in Shearlets: Multiscale analysis for multivariate data, G. Kutyniok and D. Labate, Eds. Birkhäuser, 2012, pp. 1–38.
[22] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," arXiv:1512.03385, 2015.
[23] M. Comenetz, Calculus: The elements. World Scientific, 2002.
[24] L. Grafakos, Classical Fourier analysis, 2nd ed. Springer, 2008.