Sparse Components of Images and Optimal Atomic Decompositions

David L. Donoho
Statistics Dept., Stanford University

December, 1998; Revised April, 2000

Abstract

Recently, Field, Lewicki, Olshausen, and Sejnowski have reported efforts to identify the "Sparse Components" of image data. Their empirical findings indicate that such components have elongated shapes and assume a wide range of positions, orientations, and scales. To date, Sparse Components Analysis (SCA) has only been conducted on databases of small (e.g. 16-by-16) image patches, and there seems limited prospect of dramatically increased resolving power. In this article, we apply mathematical analysis to a specific formalization of SCA using synthetic image models, hoping to gain insight into what might emerge from a higher-resolution SCA based on n by n image patches for large n but constant field of view. In our formalization, we study a class of objects F in a functional space; they are to be represented by linear combinations of atoms from an overcomplete dictionary, and sparsity is measured by the $\ell^p$ norm of the coefficients in the linear combination. We focus on the class F = Star^α of black-and-white images with the black region consisting of a star-shaped set with α-smooth boundary. We aim to find an optimal dictionary, one achieving the optimal sparsity in an atomic decomposition uniformly over members of the class Star^α. We show that there is a well-defined optimal sparsity of representation of members of Star^α; there are decompositions with finite $\ell^p$ norm for p > 2/(α + 1) but not for p < 2/(α + 1). We show that the optimal degree of sparsity is nearly attained using atomic decompositions based on the wedgelet dictionary. Wedgelets provide a system of representation by elements in a dyadically-organized collection, at all scales, locations, orientations, and positions. The atoms of our atomic decomposition contain both coarse-scale dyadic 'blobs', which are simply wedgelets from our dictionary, and fine-scale 'needles', which are differences of pairs of wedgelets. The fine-scale atoms used in the adaptive atomic decomposition are highly anisotropic and occupy a range of positions, scales, and locations. This agrees qualitatively with the visual appearance of empirically-determined sparse components of natural images. The set has certain definite scaling properties; for example, the number of atoms of length l scales as 1/l, and, when the object has α-smooth boundaries, the number of atoms with anisotropy ≈ A scales as ≈ A^{α-1}.

Key Words and Phrases. Sparse Components Analysis. Independent Components Analysis. Synthetic Image Models. Computational Harmonic Analysis. Edgelets. Wedgelets.

Acknowledgements. This work was written while the author was a guest of the Institut Henri Poincaré during the Trimestre Image, Autumn 1998. The author would like to thank the organizers and staff of this Trimestre for the opportunity to visit – particularly Yves Meyer. This research was supported by National Science Foundation grants DMS 98–72890 (KDI) and DMS 95–05151; and by AFOSR MURI-95-P49620-96-1-0028. The author would like to thank Yacov Hel-Or for introducing him to this research area, and Bruno Olshausen and Pamela Reinagel for enabling him to attend the 1997 Workshop on Natural Scene Statistics, which gave the author a chance to meet and speak with some of the principals in this area of research.

1 Introduction

This article was stimulated by recent interactions between computational neuroscience, visual physiology, and statistical analysis. Our goal is to show that recent innovations in harmonic analysis may contribute to the discussion.

1.1 Stimulating Computer Experiments

In an article in Nature [32], Olshausen and Field reported the analysis of a database built from natural images by sampling 16 by 16 image patches from the images. Such image patches may be viewed as arrays of numbers, the p-th patch taking the form $X^p = (X^p(i_1, i_2) : 1 \le i_1, i_2 \le n)$, where here n is the image patch extent: in Olshausen and Field's case, n = 16. The database $X = \{X^p : p = 1, \dots, P\}$ of such image patches was subjected to statistical modelling: Olshausen and Field tried to model each patch $X^p$ as a linear combination of some underlying basis elements $\{\phi_\mu\}$ according to
$$X^p \approx \sum_\mu \theta^p_\mu \phi_\mu.$$

Here the key point is that the basis was not fixed in advance but was to be learned from the data. Olshausen and Field proposed that a basis could be learned by searching for a basis that optimized the sparsity of the coefficients $(\theta^p_\mu)$. The specific objective they proposed was to minimize, over bases Φ for the space of n by n arrays, the functional
$$S(\Phi) = \sum_{p=1}^{P} \min_{\theta^p} \Big\{ \|X^p - \sum_\mu \theta^p_\mu \phi_\mu\|_2^2 + \lambda \sum_\mu \log(1 + (\theta^p_\mu)^2) \Big\} \qquad (1.1)$$

subject to the appropriate scale normalization of Φ. The functional S prefers bases which allow sparse representations. Subject to a fixed budget of coefficient 'energy' $\sum_\mu (\theta^p_\mu)^2$, the form $\sum_\mu \log(1+(\theta^p_\mu)^2)$ is smallest if all but one of the $\theta^p_\mu$ can be made 0, as can be seen from concavity of log(1 + x) on x > 0. Of course, because of the concavity, true global optimization of S(Φ) is a doubtful project. However, adapting simple "hill-climbing" ideas to improving an initial guess by following a descent direction provides a way to discover interesting local optima. The result published by Olshausen and Field shows a collection of basis elements obtained by local optimization of S(Φ). The elements are narrow edge-like features exhibiting a range of orientations and locations. See Figure 1. This result can be taken to mark the beginning of a body of interesting work pursued by several groups, each one seeking to locally optimize a different objective [1, 19, 20, 22, 21, 24, 29]. Much of this work diverged from the original goal of optimizing sparsity of the recovered components and, following the lead of Bell and Sejnowski [1], emphasized the goal of finding independent components via independent components analysis (ICA). By now, many different ICA-style algorithms have been tried on image-patch data, with the typical conclusion that ICA-style analysis produces basis elements displaying in many cases quite pronounced orientational preferences, and ranging over a variety of orientations and positions and scales. For a review of some of this work, compatible with the viewpoint of this article, see [14].
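To make the concavity argument concrete, the following small Python sketch (not from the original paper; the array values are purely illustrative) compares the penalty $\sum_\mu \log(1+\theta_\mu^2)$ for a concentrated versus a spread coefficient vector with the same energy.

```python
import numpy as np

def sparsity_penalty(theta, lam=1.0):
    """Penalty term of (1.1): lam * sum(log(1 + theta^2))."""
    return lam * np.sum(np.log1p(theta ** 2))

# Two coefficient vectors with identical 'energy' (sum of squares = 16):
concentrated = np.array([4.0, 0.0, 0.0, 0.0])   # all energy on one atom
spread       = np.array([2.0, 2.0, 2.0, 2.0])   # energy spread evenly

# Concavity of log(1 + x) on x > 0 makes the concentrated vector cheaper.
print(sparsity_penalty(concentrated))  # ~2.83
print(sparsity_penalty(spread))        # ~6.44
```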

1.2 Sparsity and Overcompleteness

While the ICA viewpoint on this 'basis from images' phenomenon was being developed extensively, Olshausen and Field continued to elaborate their original ideas on sparsity. In [33] they proposed that the human visual system is unlikely to use a basis, and that the mathematical concept of overcomplete representation – in which the analysis system contains many more analyzing elements than would seem minimally necessary – is a much more plausible structure to compare with the human visual system. They proposed that one should seek representations from a dictionary Φ which is overcomplete but where, in the analysis of any individual image, only a relatively few dictionary elements would need to be used. They continued to pursue the solution of optimization problem (1.1), only now the functional Φ was not restricted to be a basis, but instead could be an m-fold overcomplete set, and was to be learned from the data. The idea of overcomplete-but-sparse component modelling has recently been carried further by Lewicki and Sejnowski [30], who proposed the use of this system outside of image modelling, for example in audio signal analysis, and by Lewicki and Olshausen [29] who did a careful analysis of the resulting dictionary elements in an image analysis setting.

Figure 1: Sparse Components of Naturally-Occurring Image Data. Result of Olshausen and Field [32]

1.3 Stimulating Questions

The venture into sparsity and overcompleteness seems, to the author, a compelling initiative, particularly when compared with other proposals, such as finding orthogonal bases of independent components. Regarding sparsity:

• The goal of sparse representation seems intrinsically important and fundamental. The human visual system is thought to do a tremendous job in achieving sparse representation of image data, taking $10^7$ bits/sec at the periphery of the visual pathway and winnowing it down to about 50 bits/sec deep inside. Details of how this happens remain a mystery, and a particularly interesting one, in light of the fact that human-engineered data compression systems have worse performance by many orders of magnitude.

• Sparsity seems more plausible as a modelling assumption than the goal of independent components representation. We know from considerable experience that sparsity can be achieved, but we don't know that independence can be achieved.

  – Independence is a probabilistic assumption, but there is no 'natural' probability on images, and no mathematically-tractable model for image composition and formation which has wide acceptance. Hence, the ICA concept invokes as a goal a concept of unknown applicability.

  – In fact, it is even clear that independent components, in the strict sense, cannot exist, because images are composed by occlusion. The occlusion of one component of an image by another creates intrinsically dependent components; certain features are not visible precisely because other components are visible.

Regarding overcompleteness:

• Overcompleteness seems a very natural area to explore on biological grounds. There seems no reason to suppose that the notion of basis is helpful in constructing analyzing elements for the visual system. A basis arises from the problem of finding a maximal linearly independent set or minimal complete set; such structure is mathematically satisfying but perhaps biologically artificial. Daugman [7] has argued that, in fact, orthogonality and even minimal completeness would be undesirable and unattainable in a visual system. For example, it would be hard to construct a set of 'orthogonal receptive fields', and once constructed, it would be difficult to maintain orthogonality across time as the system aged, and the death of some neurons in a minimally complete system would cause blindness to specific stimuli. [7] also cited evidence showing that in fact non-orthogonality is ubiquitous and ample overcompleteness is typical.

• Overcompleteness seems a promising method to achieve sparse representation. Simultaneous with the developments of [33, 29, 30], there was developing in the signal processing and computational harmonic analysis literature [6, 31, 5] a series of proposals for representation using overcomplete systems, together with a series of examples [12, 13] showing that adaptive representation in overcomplete systems could achieve substantially better sparsity than representation in a single basis.


On the other hand, the SCA problem presents challenges as well.

• The problem of 'sparse components analysis' has so far been defined largely operationally [32, 33, 30, 29]. The operational definition has been based on stochastic models which assume an overcomplete representation by independent components. Hence the current model of SCA lacks, in our view, a clear conceptual foundation distinct from the ICA foundation.

• As operationalized, SCA demands global optimization of a highly non-convex objective. Existing heuristic procedures for SCA, based on local optimization, are computationally intensive, and have been used mainly on small-scale problems (e.g. image patches of size 16 by 16), while it would instead be most instructive to be able to use them on large databases of large-scale structures.

• In particular, to the extent that SCA might be revealing something fundamental about the problems it is being applied to, it would be interesting to say more about the structure of the whole collection of sparse components – i.e. to find a method of SCA which gives interpretable results.

1.4 This Paper

The goal of the present paper is to explore some of the issues raised by the notion of SCA, to propose a mathematical problem related to SCA, and to use applied mathematical analysis to approximately solve this problem in a specific model of edge-dominated images. Our claim is as follows. It seems unlikely that any general effective procedure for SCA will be found any time soon, at least for problem sizes capable of giving detailed high resolution information, because (1) SCA requires global non-convex optimization; and (2) Even the local optimization component of the SCA task is computationally intensive. Hence, we have decided to pursue theoretical analysis of model problems in which we can obtain approximate answers analytically. Owing to some recent work of the author [13], it is possible to study a mathematical formalization of the problem of SCA for images containing curved edges, and to obtain information about the (approximate) sparse components of such images. As we shall see, these are in some sense narrow needle-like structures assuming a range of lengths, anisotropies, orientations, and positions, and so agreeing in a crude sense with the visual appearance of empirically-determined sparse components. In this paper we develop, in Sections 2-6, a general framework for a mathematical model of SCA, and in Sections 7-10, we solve this model approximately for the case of images containing edges which are curved but for which the curvature obeys regularity conditions. Section 11 discusses the interpretation of these results.

2 Sparsity and $\ell^p$ Norm

Suppose we are given a vector $x = (x_j)$ and we wish to measure its sparsity. An obvious measure is simply the number of nonzero terms, $\|x\|_0 \equiv \#\{j : x_j \neq 0\}$, but that is highly nonrobust against small perturbations of the zero elements. Consider instead the $\ell^p$ norm:
$$\|x\|_p = \Big(\sum_j |x_j|^p\Big)^{1/p}.$$
The interesting range here is 0 < p ≤ 1. The smaller we choose p, the more we are putting a premium on sparsity. In fact,
$$\lim_{p \to 0} \|x\|_p^p = \|x\|_0,$$
so as p → 0, we are measuring simply the number of nonzero components. The condition $\|x\|_p \le C$ is a sparsity condition: it can hold only if the number of $|x_j|$'s exceeding 1/m is bounded by $C^p \cdot m^p$ for each m. So instead of saying that x has only N nonzero entries, say, we can say that the N-th largest entry is not bigger than $CN^{-1/p}$. We sometimes speak of the $\ell^p$ "norm" and even the $\ell^0$ "norm", although these functionals lack the convexity properties of true norms. Although $\ell^p$ norms lack the triangle inequality, they are not entirely pathological; they obey the p-triangle inequality:
$$\|x + y\|_p^p \le \|x\|_p^p + \|y\|_p^p. \qquad (2.1)$$
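As a quick illustration (a minimal sketch, not from the paper; the example vector is arbitrary), the following Python lines show how $\|x\|_p^p$ approaches the count of nonzero entries as p shrinks.

```python
import numpy as np

def lp_norm_p(x, p):
    """Return ||x||_p^p = sum(|x_j|^p) over the nonzero entries."""
    x = np.asarray(x, dtype=float)
    nz = np.abs(x[x != 0])
    return np.sum(nz ** p)

x = np.array([3.0, 0.5, 0.0, 0.0, 0.01, 0.0])
for p in (1.0, 0.5, 0.1, 0.01):
    print(p, lp_norm_p(x, p))   # tends to 3, the number of nonzero entries
print(np.count_nonzero(x))      # ||x||_0 = 3
```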

An important fact about this measure of sparsity is its relationship to penalized optimization. Suppose we are given a vector $\theta = (\theta_i)$ and consider the optimization problem
$$(P_\lambda) \qquad \min_t \; \|\theta - t\|_2^2 + \lambda \cdot \|t\|_0,$$
where the optimization is over all vectors t. A solution of this problem is obtained by thresholding, letting $t_i = \theta_i$ where $|\theta_i| \ge \sqrt{\lambda}$ and $t_i = 0$ where $|\theta_i| < \sqrt{\lambda}$. The value of the optimization problem is
$$S_\lambda(\theta) = \sum_i \min(|\theta_i|^2, \lambda).$$
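A minimal Python sketch of this hard-thresholding solution and of the value $S_\lambda(\theta)$ (an illustration assuming nothing beyond the formulas just given):

```python
import numpy as np

def solve_P_lambda(theta, lam):
    """Hard-threshold solution of min_t ||theta - t||_2^2 + lam * ||t||_0."""
    theta = np.asarray(theta, dtype=float)
    return np.where(np.abs(theta) >= np.sqrt(lam), theta, 0.0)

def S_lambda(theta, lam):
    """Value of the problem: sum_i min(theta_i^2, lam)."""
    theta = np.asarray(theta, dtype=float)
    return np.sum(np.minimum(theta ** 2, lam))

theta = np.array([2.0, -0.3, 0.05, 1.1])
lam = 0.25
t = solve_P_lambda(theta, lam)
# Objective at the thresholded t equals S_lambda(theta, lam).
assert np.isclose(np.sum((theta - t) ** 2) + lam * np.count_nonzero(t),
                  S_lambda(theta, lam))
```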

This expression measures precisely the tradeoff between sparsity of approximation and the goodness of fit of a sparse model. Now the $\ell^p$ norm of θ can be used to obtain insight about the behavior of the sparsity functional as a function of λ. Indeed, if we define the weak $\ell^p$ norm by
$$\|\theta\|_{w\ell^p} = \max_\lambda \; \lambda \cdot \#\{i : |\theta_i| > \lambda\}^{1/p},$$
then we see immediately that for a universal constant $C_p$,
$$S_\lambda(\theta) \le C_p \cdot \|\theta\|_{w\ell^p}^{p}\, \lambda^{1-p/2}.$$
In fact, there is a kind of equivalence between the information in $S_\lambda$ and the information in the $\ell^p$ norm. Up to multiplication by implicit constants depending only on p, we have
$$\Big(\sup_{\lambda>0} \lambda^{p/2-1} S_\lambda(\theta)\Big)^{1/p} \asymp \|\theta\|_{w\ell^p},$$
and
$$\Big(\int_{\lambda>0} \lambda^{p/2-1} S_\lambda(\theta)\,\frac{d\lambda}{\lambda}\Big)^{1/p} \asymp \|\theta\|_{\ell^p};$$

so the $\ell^p$ and weak-$\ell^p$ 'norms' contain within them summaries of the size of $S_\lambda$, summarized across all values of λ. Compare [3, 9]. These remarks are interesting in comparison with the sparsity objective of Olshausen and Field, (1.1). In effect, their functional S is similar to the one considered here. The concave function $\sum_i \log(1+\theta_i^2)$ is used in place of the concave function $\sum_i 1_{\{|\theta_i|>0\}}$. From a conceptual point of view, the difference between the two objectives is minor; however, the approach based on $\sum_i 1_{\{|\theta_i|>0\}}$ has the advantage of fitting more directly within an existing stream of research [10, 11].

3 Proposed Formalization of SCA

We can now give a specific context and definition for sparse components analysis. It has the following ingredients.


• A class of mathematically-defined objects F. We suppose we are given a mathematically-specified class of objects f(t) defined on a compact domain T and forming a compact subset in $L^2(T)$. For concreteness, and for reference to the image analysis setting, we take for our domain T the unit square $[0,1]^2$. We discuss below a specific class F of "black-and-white" images in which each $f = 1_B(t)$ is the indicator of a star-shaped set B with smooth boundary.

• Overcomplete Dictionaries. We consider dictionaries Φ = {ϕ} of representing elements, or atoms. We intend that these be overcomplete – i.e. consist of 'more than a basis' – so that many distinct representations of the same object are possible using the dictionary, although mere bases are allowable as well. Some natural examples of dictionaries of the type we intend include:

  – Orthogonal bases, such as the Fourier or Wavelets bases;

  – Finite Unions of Orthogonal Bases, such as Wavelets ∪ Sinusoids;

  – Libraries of Orthogonal Bases, such as the Cosine packets or Wavelet Packets systems of Coifman and Meyer (1989), and the libraries of anisotropic Haar bases discussed in [12]; and

  – General Non-Orthogonal Dictionaries, such as collections of multiscale Gabor functions [31].

  However, a feature of our approach is that any conceivable dictionary is allowed, and we seek a dictionary which is approximately optimal among all dictionaries.

• Adaptive Expansions in overcomplete dictionaries. Working in a given dictionary we consider adaptively selected decompositions
$$f = \sum_i \theta_i \phi_i,$$
where $\Phi_f = \{\phi_i\}$ is a countable normalized set in $L^2(T)$ – elements obeying $\|\phi_i\|_2 = 1$ – derived by selecting a sequence of elements from the dictionary Φ.

• Polynomial Constraint on Search Depth. In fact, we limit attention to dictionaries $\Phi_f = \{\phi_1, \phi_2, \dots\}$, constructed by subselection from countable dictionaries according to a Polynomial Depth Search constraint. We assume that the i-th term in the expansion is obtained according to a selection rule σ(i, f), which is allowed to depend adaptively on f, and that the selection function must obey σ(i, f) ≤ π(i) for a fixed polynomial π(i). The significance of this condition – adapted from [15] – will be discussed further below. However, we allow something slightly broader than simple selection: we allow $\phi_i$ to be an arbitrary linear combination of the first i terms ϕ selected from the dictionary Φ. For example, $\phi_i$ can be obtained by orthonormalizing $\varphi_i$ with respect to the first i − 1 selected terms $(\varphi_j : 1 \le j < i)$.

• A minimax problem. Performance will be measured by the $\ell^p$ quasi-norm of the expansion coefficients in a worst-case sense, as measured by
$$\min_\Phi \; \min_{\sigma(n,f) \le \pi(n)} \; \max_F \; \|\theta(f)\|_p. \qquad (3.1)$$

As the $\ell^p$ norm measures the sparsity of the coefficients, a dictionary Φ achieving the minimum is a kind of optimal overcomplete set of elements for synthesizing objects from class F. A fundamental property of problem (3.1) is that there is, in certain settings, a well-defined optimal degree of sparsity of representation of a functional class, that is, a $p_0 = p_0(F)$ with the property that the optimal value of (3.1) is finite for every $p > p_0$ and infinite for every $p < p_0$. In this paper, it is this qualitative property that we will examine, rather than the precise constant involved in the optimal solution of (3.1).

4 Relation to the Original Problem

We now briefly review the links of this problem to the original problem posed by Olshausen and Field. We have made three noticeable changes.

[Ch1] Change of functional. We replace the objective for optimization $\|f - \sum_i c_i\phi_i\|_2^2 + \lambda \sum_i \log(1+c_i^2)$ by $\|\theta\|_p$. The connection between these two objectives has been discussed in an earlier section.

[Ch2] Change from Database X to functional class F. The original problem required the analysis of a database of image patches sampled from naturally-occurring data. Instead, we study a mathematically-defined class of objects F. The point of this replacement is, of course, that we expect to gain a certain mathematical structure which allows us to say something about the continuum situation.

[Ch3] Change from an average to a worst case. The original problem optimized a sum across images in the database or, implicitly, an average across images. We propose to optimize the worst case in our model functional class. Worst-case reasoning is often (rightly) ridiculed as non-robust, but since here the class F is a mathematical model we define, this problem doesn't really present itself.

It might seem that [Ch1] is a rather innocuous change and [Ch2]-[Ch3] are rather serious. In our opinion the truth is just the reverse: [Ch2]-[Ch3] are minor matters, while [Ch1] has rather far-reaching implications on the structure of a solution. It seems to impose on the overcomplete set a sort of self-similar multiscale structure, whereas optimization at a single λ need not.

5 Star-Shaped Images

We now elaborate on the class of star-shaped objects mentioned earlier. Our notation and exposition are taken from [13, 15]. Related models were developed some time ago in the mathematical statistics literature by [27, 28]. A star-shaped set $B \subset [0,1]^2$ has an origin $b_0 \in [0,1]^2$ from which every point of B is 'visible': the line segment $\{(1-t)b_0 + tb : t \in [0,1]\} \subset B$ whenever $b \in B$.

We define Star-Set^α(C), a class of star-shaped sets with α-smooth boundaries, by imposing regularity on the boundaries using a kind of polar coordinate system. Let $\rho(\theta) : [0, 2\pi) \to [0,1]$ be a radius function and $b_0 = (x_{1,0}, x_{2,0})$ be an origin with respect to which the set of interest is star-shaped. Define $\Delta_1(x) = x_1 - x_{1,0}$ and $\Delta_2(x) = x_2 - x_{2,0}$; then define functions $\theta(x_1, x_2)$ and $r(x_1, x_2)$ by
$$\theta = \arctan(-\Delta_2/\Delta_1); \qquad r = ((\Delta_1)^2 + (\Delta_2)^2)^{1/2}.$$
For a star-shaped set, we have $(x_1, x_2) \in B$ iff $0 \le r \le \rho(\theta)$. In particular, the boundary ∂B is given by the curve
$$\beta(\theta) = (\rho(\theta)\cos(\theta) + x_{1,0},\; \rho(\theta)\sin(\theta) + x_{2,0}). \qquad (5.1)$$
Figure 2 gives a graphical indication of some of the objects just described. The class Star-Set^α(C) of interest to us can now be defined by
$$\text{Star-Set}^\alpha(C) = \Big\{B : B \subset [\tfrac{1}{10}, \tfrac{9}{10}]^2,\;\; \tfrac{1}{10} \le \rho(\theta) \le \tfrac{1}{2}\;\; \theta \in [0, 2\pi),\;\; \rho \in \text{H\"older}^\alpha(C)\Big\}.$$
Here the condition $\rho \in \text{H\"older}^\alpha(C)$ means that ρ is continuously differentiable and
$$|\rho'(\theta) - \rho'(\theta')| \le C \cdot |\theta - \theta'|^{\alpha-1}, \qquad \theta, \theta' \in [0, 2\pi).$$

Notes 1: Some star-shaped sets have more than one possible choice of origin $b_0$; different choices lead to different radius functions ρ; we demand only that some valid choice of $b_0$ lead to a ρ obeying the above conditions.

Figure 2: Typical Star-Shaped Set, and associated notation.

2: We consider only the range 1 < α ≤ 2; α ≤ 1 is excluded.

The actual objects of interest are the indicators of sets in Star-Set^α, so we introduce the functional class
$$\text{Star}^\alpha(C) = \{f = 1_B : B \in \text{Star-Set}^\alpha(C)\}. \qquad (5.2)$$

Theorem 1 For the class F = Star^α(C) we have an optimal degree of sparsity $p_0(\text{Star}^\alpha(C)) = 2/(\alpha+1)$. That is, for p > 2/(α + 1) the value of
$$\min_\Phi \; \min_{\sigma(n,f)\le\pi(n)} \; \max_{\text{Star}^\alpha(C)} \; \|\theta(f)\|_p$$
is finite, while for p < 2/(α + 1) the value is infinite.

We prove this over the next several sections.
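To make the class concrete, here is a minimal Python sketch (illustrative only; the particular radius function and grid size are arbitrary choices, not from the paper) that rasterizes the indicator $1_B$ of a star-shaped set from a smooth radius function ρ(θ).

```python
import numpy as np

def star_indicator(rho, n=256, origin=(0.5, 0.5)):
    """Rasterize f = 1_B on an n-by-n grid over [0,1]^2, where
    B = {(x1,x2): r <= rho(theta)} in polar coordinates about `origin`."""
    xs = (np.arange(n) + 0.5) / n
    x1, x2 = np.meshgrid(xs, xs, indexing="ij")
    d1, d2 = x1 - origin[0], x2 - origin[1]
    r = np.hypot(d1, d2)
    theta = np.mod(np.arctan2(d2, d1), 2 * np.pi)
    return (r <= rho(theta)).astype(float)

# A smooth radius function staying within [1/10, 1/2], as the class requires.
rho = lambda th: 0.3 + 0.1 * np.cos(3 * th)
f = star_indicator(rho)
print(f.shape, f.mean())   # fraction of 'black' pixels
```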

6 Polynomial Depth Search in Dictionaries

We next elaborate on the definition of constrained search in a dictionary described earlier. We first indicate the reasons for such a restriction. When we allow the use of a non-orthogonal dictionary without any restriction, we allow truly pathological dictionaries, for example, dictionaries consisting of countable dense subsets of $L^2(T)$. Such dictionaries have the property of universal sparse approximation, i.e. the property that for every element f of $L^2(T)$, there is an element in the dictionary Φ at arbitrarily small distance from such an f. Unconstrained selection from such a dictionary can lead to arbitrarily sparse approximation to every f, i.e. an orthogonal sequence $\hat\phi_i$ achieving $\|\theta\|_p \le (1-\varepsilon)^{-1/p}\|f\|_2$ for any given p > 0. Indeed, we simply let $\|\varphi_i - f\|_2 \le \varepsilon^i \|f\|_2$, i = 1, 2, ..., and obtain $\hat\phi_i$ by the Gram-Schmidt procedure. Such a representation is uninformative and has no computationally sensible interpretation. So the concept of dictionary leads to trivial schemes of approximation without some restriction.

The Polynomial Depth Search restriction requires that, for a fixed polynomial π(n) independent of f, the first n terms in the approximation must come from one of the first π(n) terms in the dictionary. This eliminates the possibility of universal sparse approximation by countable dense dictionaries for infinite-dimensional functional classes. Indeed, to guarantee arbitrarily sparse approximation by the selected first term in such a countable dense dictionary, one would have to allow search to an unbounded depth (e.g. to find $\|\phi_1 - f\|_2 \le \varepsilon$); one could not place, in advance of seeing a particular f, any upper bound on the selected i.

When applied in the context of traditional orthogonal series expansions, the polynomial depth search restriction is very light. As the $\ell^2$-norm is rearrangement invariant, the condition simply says that the tail sums $\sum_{i>n} |\langle f, \phi_i\rangle|^2$ decay at some algebraic rate $O(n^{-\eta})$ for some η > 0. Almost any kind of traditional smoothness is enough to imply this. Consider, for example, functions of a single variable which have bounded variation on the unit interval [0, 1]: TV(f) ≤ C. It is easy to see that, because such functions have jumps, large wavelet coefficients may occur either at coarse scales or in clusters near jumps. As a result, the $2^j$ largest wavelet coefficients need not occur among the first $2^j$ coefficients, i.e. the first j scales: some of these large coefficients may appear at finer scales, in fact as fine as scale $\frac{3}{2}j$. One can easily construct examples of piecewise constant functions with a single jump where, if one considers the wavelets ordered in the standard lexicographic fashion (position index varies fastest), it may be necessary to look among the first $O(2^{\frac{3}{2}j})$ coefficients to find the $2^j$ biggest ones. However, in a certain maximal sense, it is not necessary to look deeper than this. If we let $\pi(n) = n^2$, and we let σ(n, f) be the n-th largest coefficient among the first π(n), then we will obtain an adaptive representation obeying a polynomial depth search constraint, as well as the embedding $\|\theta\|_p \le C_p$, for all p > 2/3, which is optimal.

The polynomial depth search constraint can also be motivated by appeal to the concept of sparse coding [17, 18]. We allow sparse decomposition, where most atoms in the original dictionary are not used in any given decomposition; but we allow only rules in which the sparsity (propensity to use elements near position n) scales like a power law in n rather than like a decaying exponential in n. When the propensity scales like a power law, the number of bits required to code a selected position is at most comparable to the number of bits required to approximately represent the selected coefficients. When the propensity grows faster, the number of bits required to code a selected position can dominate the cost of the representation.
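A minimal Python sketch of selection under the polynomial depth-search constraint with π(n) = n² (an illustration of the rule just described, assuming an ordered coefficient sequence is already available; nothing here is specific to the paper's dictionary):

```python
import numpy as np

def depth_constrained_selection(coeffs, n_terms):
    """Select n_terms coefficients; the n-th selected index must lie
    among the first pi(n) = n**2 positions of the ordered dictionary."""
    selected = []
    for n in range(1, n_terms + 1):
        depth = n ** 2                       # polynomial search depth pi(n)
        window = np.abs(coeffs[:depth]).copy()
        window[selected] = -np.inf           # do not reselect earlier picks
        selected.append(int(np.argmax(window)))
    return selected

coeffs = np.random.default_rng(0).standard_normal(400) / np.arange(1, 401)
print(depth_constrained_selection(coeffs, 10))
```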

7 Lower Bounds

The polynomial depth search restriction enables us to bound the performance of dictionary decompositions in $\ell^p$ norm.

Definition 1 We say that a function class F contains an embedded orthogonal hypercube of dimension m and side δ if there exist $f_0 \in F$, and orthogonal functions $\psi_{i,m,\delta}$, i = 1, ..., m with $\|\psi_{i,m,\delta}\|_{L^2} = \delta$, such that the collection of hypercube vertices
$$H(m; f_0, (\psi_i)) = \Big\{ h = f_0 + \sum_{i=1}^m \xi_i\,\psi_{i,m,\delta},\;\; \xi_i \in \{0,1\} \Big\} \qquad (7.1)$$
embed in F:
$$H(m; f_0, (\psi_i)) \subset F. \qquad (7.2)$$
(We emphasize that H just consists of its vertices.) Our lower bounds are obtained by showing that a functional class contains sufficiently high-dimensional hypercubes of sufficiently high sidelength. By sufficiently we mean 'sufficient to resemble an $\ell^p$ ball'. Recall that a standard $\ell^p$ ball of radius C contains hypercubes of dimension m and sidelength $C \cdot m^{-1/p}$.

Definition 2 We say that a function class F contains a copy of $\ell^p_0$ if F contains embedded orthogonal hypercubes of dimension m(δ) and side δ, and if, for some sequence $\delta_k \to 0$, and some constant C > 0,
$$m(\delta_k) \ge C \cdot \delta_k^{-p}, \qquad k = k_0, k_0+1, \dots$$


We note that an $\ell^p$-ball contains a copy of $\ell^\tau_0$, for every τ ≤ p, and that in fact every Lorentz ball $\ell^{p,q}$ with q > 0 contains a copy of $\ell^\tau_0$, for every τ ≤ p.

Theorem 2 Suppose that F contains a copy of $\ell^p_0$. Then for every τ < p and every method of atomic decomposition based on polynomial depth search from a dictionary Φ,
$$\min_{\sigma(n,f)\le\pi} \; \max_{f\in F} \; \|\theta(f)\|_\tau = +\infty. \qquad (7.3)$$

Proof. Without loss of generality, we consider only adaptive orthonormal decompositions, i.e. decompositions where $\|\phi_i\|_2 = 1$, $\langle\phi_i, \phi_j\rangle = 0$, $i \ne j$. Suppose we measure the n-term approximation error to a sequence $\theta = (\theta_i)$ using the $\ell^2$ norm. Formally, this can be written as
$$e_n(\theta) = \inf\Big\{ \Big(\sum_{i\notin I} |\theta_i|^2\Big)^{1/2} : \#I \le n \Big\}.$$
Now, suppose the contrary of (7.3), so that
$$\min_{\sigma(n,f)\le\pi(n)} \; \max_{f\in F} \; \|\theta(f)\|_\tau \le C,$$
for some C fixed and finite. Then it would follow that, for every object f in F, we can find by polynomial depth search an adaptive atomic decomposition $f = \sum_i \theta_i\phi_i$ obeying
$$e_n(\theta) \le C \cdot n^{-(1/\tau - 1/2)}. \qquad (7.4)$$

We will be able to show that (7.4) fails. In fact, for any adaptive decomposition method based on a specific dictionary and specific polynomial depth search limit π(·), there is a constant C'', and specific choices f ∈ F and n so that
$$e_n(\theta(f)) \ge C'' (n\log(n))^{-(1/p - 1/2)}, \qquad (7.5)$$

for arbitrarily large n. It follows that, for any τ < p, (7.4) cannot hold for all n and all f ∈ F.

At the center of our argument is a fact from Rate-Distortion Theory [2]. Suppose we have a fair coin tossing sequence $X_i$, i = 1, ..., m, where the random variables $X_i$ are mutually independent and equally likely to be heads and tails. Obviously, this sequence can be represented exactly using m bits. We are interested in representing this sequence with fewer bits while tolerating a certain degree of distortion. We concoct a coder-decoder combination, where a coder Encode() is a mapping from strings of m bits into strings of R bits, R < m, and a decoder Decode() is a mapping from strings of R bits into m bits. Overall, we have $\hat X = \text{Decode}(\text{Encode}(X))$. Our measure of distortion will be simply the number of bit-level errors; we define this by
$$\text{Dist}(X, \hat X) = \sum_{i=1}^m 1_{\{X_i \ne \hat X_i\}}.$$

Hence, if we reproduce the sequence with D errors out of m symbols, the distortion is D. According to Rate-Distortion theory, there is an absolute limit on how well we can do by this scheme. We have average distortion bounded by
$$\text{Ave}\{\text{Dist}(X, \text{Decode}(\text{Encode}(X)))\} \ge D_m(R), \qquad (7.6)$$
where the average is over realizations of X, and $D_m(R)$ is the so-called (m-letter) distortion-rate function; see [2]. Results on pages 46-47 of [2] imply that for each constant ρ < 1/2 there is $D_1(\rho) > 0$ so that, for each m ≥ 1,
$$R \le \rho \cdot m \;\Longrightarrow\; D_m(R) \ge D_1(\rho) \cdot m. \qquad (7.7)$$
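For the fair-coin source with Hamming distortion, the distortion-rate function is classical: the per-letter distortion D(R) is the inverse of $R(D) = 1 - H_2(D)$, where $H_2$ is the binary entropy. A small Python sketch (an illustration of (7.7) using this standard formula, not a computation taken from [2]) evaluates the single-letter lower bound $D_1(\rho)$:

```python
import numpy as np

def binary_entropy(d):
    d = np.clip(d, 1e-12, 1 - 1e-12)
    return -d * np.log2(d) - (1 - d) * np.log2(1 - d)

def D1(rho):
    """Single-letter distortion-rate value for a fair coin: the D in [0, 1/2]
    solving 1 - H_2(D) = rho, found here by bisection."""
    lo, hi = 0.0, 0.5
    for _ in range(60):
        mid = (lo + hi) / 2
        if 1 - binary_entropy(mid) > rho:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

for rho in (0.1, 0.25, 0.4):
    print(rho, D1(rho))   # strictly positive for rho < 1, in particular rho < 1/2
```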

Here $D_1(\rho)$ is called the single-letter distortion-rate function; its inverse is graphed on Page 48 of [2]. In words, (7.6)-(7.7) says that to get a meaningful fraction of the symbols reconstructed correctly, it is necessary for the coder to retain a substantial fraction of the m bits. (This of course agrees with common sense.)

We now use this fundamental bound in the context of embedded hypercubes. Suppose there is in F an embedded hypercube of side δ and dimension m = m(δ). Suppose we select a vertex of the hypercube at random – with all vertices equally likely. Then this is the same as selecting a binary sequence ξ of length m, via the identification
$$h = f_0 + \sum_i \xi_i\,\psi_{i,m}.$$
Moreover, the elements of the random binary sequence ξ can be viewed as a fair coin tossing sequence – just the same as the $X_i$ referred to above.

Now suppose we have a method of representing functions f ∈ F approximately by bit strings of length R. More properly, suppose we have a procedure for processing a function f, obtaining such an encoded bit string $e \in \{0,1\}^R$, and another procedure reconstructing an approximation $\tilde f$. We will give an example of such a procedure below, but for the moment we leave the discussion at a general level.

Now suppose we use such a combined procedure on an h that arises as the vertex of a hypercube H. We get a reconstruction $\tilde h$. In general, this need not be also a vertex of that hypercube; let $\hat h$ be the $L^2$-closest vertex of the hypercube to $\tilde h$. Now $\hat h$ can be identified with a bit string $\hat\xi$, via
$$\hat h = f_0 + \sum_i \hat\xi_i\,\psi_{i,m}.$$
Now notice that the chain of mappings $\xi \mapsto h \mapsto e \mapsto \hat h \mapsto \hat\xi$ can be identified precisely as of the form $\hat\xi = \text{Decode}(\text{Encode}(\xi))$. Moreover, we have the isometry
$$\|h - \hat h\|_2^2 = \delta^2 \cdot \sum_{i=1}^m (\xi_i - \hat\xi_i)^2 = \delta^2 \cdot \sum_{i=1}^m 1_{\{\xi_i \ne \hat\xi_i\}} = \delta^2 \cdot \text{Dist}(\xi, \hat\xi).$$

From the fundamental lower bound (7.6) of Rate-Distortion Theory, and the inequality $\text{Max}_H \ge \text{Ave}_H$, we now have
$$\text{Max}\{\|h - \hat h\|_2^2 : h \in H\} \ge \delta^2 \cdot D_m(R). \qquad (7.8)$$
Here $D_m(\cdot)$ is as described in (7.7).

Now consider the following coding procedure. We obtain an adaptive orthogonal decomposition by searching a dictionary according to polynomial depth search with search limit polynomial π(·). We take the first n terms to drive our approximation, we encode the indices of those terms as bit strings, and we encode the coefficients from those n terms by rounding them to multiples of $\varepsilon = n^{-2/p}$. Our bit string e encoding the approximation has information about where in the dictionary the n terms came from ($n\log\pi(n)$ bits) and information about the coefficients themselves ($n\log(2A/\varepsilon)$ bits, where A is the largest $L^2$ norm of an element in F). Hence, we are encoding with at most
$$R(n) = n(C_1 + C_2\log(n)) \qquad (7.9)$$
bits. To decode, we simply reconstruct the rounded values of the coefficients and then synthesize using the selected dictionary elements.

This coding/decoding procedure, when applied to a vertex h of the hypercube, gives us the reconstruction $\tilde h$, say. As before, we pass to the $L^2$-closest vertex $\hat h$ of the hypercube. Supposing the hypercube H is of dimension m and side δ, then if n obeys R(n) ≤ ρm, we must suffer a distortion of
$$\text{Max}\{\|h - \hat h\|_2^2 : h \in H\} \ge \delta^2 \cdot D_m(R) = \delta^2 \cdot D_1(\rho) \cdot m. \qquad (7.10)$$
Now, by definition of $\hat h$, it is closer to $\tilde h$ than is h, so we have
$$\|\tilde h - h\|_2 \ge \|\tilde h - \hat h\|_2,$$
and by the triangle inequality,
$$\|\hat h - h\|_2 \le \|\hat h - \tilde h\|_2 + \|\tilde h - h\|_2,$$
so
$$\|\hat h - h\|_2 \le 2 \cdot \|\tilde h - h\|_2. \qquad (7.11)$$

Using our orthogonal coordinates, we write
$$h = \sum_{i=1}^\infty \theta_i\phi_i, \qquad \tilde h = \sum_{i=1}^\infty \tilde\theta_i\phi_i, \qquad \hat h = \sum_{i=1}^\infty \hat\theta_i\phi_i,$$
where the $\phi_i$ are orthonormal. By the isometry property, (7.11) becomes
$$\|\hat\theta - \theta\|_2 \le 2\|\tilde\theta - \theta\|_2. \qquad (7.12)$$

Now $\tilde\theta$ makes two kinds of errors: one in the tail (dropping terms) and one in the initial segment (rounding coefficients). Define the sequence $\bar\theta$ by
$$\bar\theta_i = \begin{cases} \theta_i & i \le n \\ 0 & i > n \end{cases}.$$
Taking the two types of errors into account, $\|\tilde\theta - \theta\|_2^2 \le \|\bar\theta - \theta\|_2^2 + n\varepsilon^2$. We conclude from this and (7.12) that
$$\|\bar\theta - \theta\|_2^2 \ge \frac{1}{2}\|\hat\theta - \theta\|_2^2 - n\varepsilon^2.$$
Note that $e_n(\theta) = \|\bar\theta - \theta\|_2$. Applying the isometry $\|\hat h - h\| = \|\hat\theta - \theta\|$ and (7.10), and supposing that R(n) ≤ ρm, we have that for some h ∈ H, the corresponding n-term error obeys
$$e_n(\theta(h))^2 \ge \frac{1}{2}\,\delta^2 \cdot D_1(\rho) \cdot m - n\varepsilon^2. \qquad (7.13)$$

We now deploy the inequality (7.13) in the context of a specific choice of n, m and δ. By hypothesis, F contains a copy of $\ell^p_0$. Hence we can find a sequence of hypercubes with sidelength $\delta_k \to 0$ as $k \to \infty$ and dimension $m_k = m(\delta_k) \ge C\delta_k^{-p}$. We fix ρ < 1/2. Now picking $n_k$ so that $2C_2 \cdot n_k\log(n_k) \le \rho m_k$, with $C_2$ as in (7.9), we obey the required inequality $R(n_k) \le \rho m_k$ for all large k. Under these assumptions, the first term in (7.13) can be controlled using
$$\delta_k^2 \cdot m_k \ge C' m_k^{-(2-p)/p} \ge C'' (n_k\log(n_k))^{-(2-p)/p}.$$
Also, we remark that the 'rounding error term' in (7.13) obeys
$$-m_k\varepsilon_k^2 = O(m_k^{-(4/p-1)}) = o(m_k^{-(2/p-1)}).$$

Evidently, the dominant term on the right side of (7.13) is the first one. Hence for each sufficiently large k there is a hypercube vertex $h_k \in F$ and an $n_k$ such that the $n_k$-term approximation error obeys
$$e_{n_k}(\theta(h_k)) \ge C''' \cdot (n_k\log(n_k))^{-(1/p - 1/2)}.$$
This demonstrates (7.5), and (7.3) follows. ♦

This lower bound shows that no dictionary can provide better than a certain sparsity. It is closely related to work which tries to show that no orthonormal basis provides better than a certain sparsity. Such relevant work includes lower bounds developed using other techniques by Kashin [25], by Kashin and Temlyakov [26], and by Donoho [12]. The closest related paper is [11], which also relies on Rate-Distortion theory.

8 Hypercube Embeddings in Class Star-Set^α

We now consider hypercube embeddings in the function class model Star-Set^α. We quote the following result from [15].

Theorem 3 The class Star-Set^α contains a copy of $\ell^p_0$ for p = 2/(α + 1).

This implies immediately half of Theorem 1: that the optimal degree of sparsity obeys $p_0 \ge 2/(\alpha+1)$. For the convenience of the reader, we repeat the argument given in [15].

Let ϕ be a smooth function of compact support ⊂ [0, 2π]. For scalars A and m(A, δ) to be determined, set
$$\varphi_{i,m} = A \cdot m^{-\alpha}\cdot\varphi(mt - i), \qquad i = 0, \dots, m-1. \qquad (8.1)$$
Note that $\|\varphi_{i,m}\|_{\dot\Lambda^\alpha} = A\cdot\|\varphi\|_{\dot\Lambda^\alpha}$ and that $\|\varphi_{i,m}\|_{L^1} = A\cdot\|\varphi\|_{L^1}\cdot m^{-\alpha-1}$. Fix an origin at (1/2, 1/2) and define polar coordinates (r, θ) relative to this choice of origin. Set $r_0 = 1/4$ and set $f_0 = 1_{\{r\le r_0\}}$ and set
$$\psi_{i,m} = 1_{\{r \le \varphi_{i,m} + r_0\}} - f_0, \qquad i = 1, \dots, m.$$

Then the collection of radius functions
$$r_\xi = r_0 + \sum_{i=1}^m \xi_i\,\varphi_{i,m}, \qquad \xi_i \in \{0,1\}, \qquad (8.2)$$
corresponds to the collection of images
$$f_\xi = f_0 + \sum_{i=1}^m \xi_i\,\psi_{i,m}, \qquad \xi_i \in \{0,1\}.$$
Figure 3 illustrates the construction. The $\psi_{i,m}$ are 'bulges' and each $f_\xi$ is the indicator of the circle with a collection of additional bulges 'glued on'. Note that the $\psi_{i,m}$ are disjointly supported, and so mutually orthogonal, with $\|\psi_{i,m}\|_2^2 = \|\varphi_{i,m}\|_{L^1} = A\|\varphi\|_{L^1}m^{-\alpha-1}$. On the other hand, if r(θ) is one of the radius functions (8.2), then $\|r\|_{\dot\Lambda^\alpha} \le \|\varphi_{i,m}\|_{\dot\Lambda^\alpha} = A\|\varphi\|_{\dot\Lambda^\alpha}$. Consequently, whenever $A \le C/\|\varphi\|_{\dot\Lambda^\alpha}$ we have the hypercube embedding
$$H(m; f_0, \{\psi_{i,m}\}) \subset \text{Star}^\alpha(C).$$


Figure 3: The Hypercube Construction. Here f0 is the indicator of the circular region; around this region are m 'bulges'; each bulge is disjoint from the other bulges and so the corresponding indicators $\psi_{i,m}$ are orthogonal in $L^2[0,1]^2$. Each combination of the circular region together with a specific collection of bulges makes a specific image which belongs to Star^α(C); the $2^m$ such images can be viewed as a complete m-dimensional hypercube.

The sidelength $\Delta = \|\psi_{i,m}\|_{L^2}$ of this hypercube satisfies, whenever $A \le C/\|\varphi\|_{\dot\Lambda^\alpha}$,
$$\Delta^2 = A\cdot\|\varphi\|_{L^1}\cdot m^{-\alpha-1} \le (C/\|\varphi\|_{\Lambda^\alpha})\cdot\|\varphi\|_{L^1}\cdot m^{-\alpha-1}.$$
Hence putting
$$m(\delta) = \Big\lfloor \Big(\frac{\delta^2}{C}\,\frac{\|\varphi\|_{\Lambda^\alpha}}{\|\varphi\|_{L^1}}\Big)^{\frac{-1}{\alpha+1}} \Big\rfloor, \qquad A(\delta, C) = \delta^2 m^{\alpha+1}/\|\varphi\|_{L^1},$$
we have $A \le C/\|\varphi\|_{\dot\Lambda^\alpha}$ and Δ = δ. Hence H is a hypercube of sidelength δ and dimension m(δ) embedding in Star^α(C). The dimension of the hypercube obeys
$$m(\delta) \ge K_\alpha\cdot C^{1/(\alpha+1)}\cdot\delta^{-2/(\alpha+1)}, \qquad \text{for all } 0 < \delta < \delta_0,$$
where $\delta_0$ is the solution of
$$2 = \Big(\frac{\delta_0^2}{C}\,\frac{\|\varphi\|_{\Lambda^\alpha}}{\|\varphi\|_{L^1}}\Big)^{\frac{-1}{\alpha+1}}$$
and
$$K_\alpha = \frac{1}{2}\Big(\frac{\|\varphi\|_{\Lambda^\alpha}}{\|\varphi\|_{L^1}}\Big)^{\frac{-1}{\alpha+1}}.$$

9 Continuum Wedgelets Dictionary

We now turn to establishing the remaining half of Theorem 1, i.e. that the optimal degree of sparsity $p_0 \le 2/(\alpha+1)$. To do so, we must exhibit a dictionary within which we can obtain the desired degree of sparsity of representation. In this section, we show how to do that while ignoring the polynomial depth search constraint; in the following section we show how to impose it.

Figure 4: Some Continuum Wedgelets

Definition 3 The Continuum Wedgelet Dictionary for $[0,1]^2$ is the collection of all indicators of all dyadic squares in $[0,1]^2$ and of all wedges formed by splitting a dyadic square in two along a line connecting a boundary point on one side of the square to a boundary point not on the same side of the square.

This continuum 'dictionary' is not a collection of discrete elements, so it cannot be ordered in a list and the concept of polynomial depth search is not applicable. Later we will face the issue of discretization.
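A minimal Python sketch of a continuum wedge as just defined (illustrative only: it rasterizes the indicator of the wedge cut from a dyadic square by the line through two prescribed boundary points; the sampling resolution is an arbitrary choice):

```python
import numpy as np

def wedge_indicator(j, k1, k2, p, q, n=512):
    """Indicator of the wedge of the dyadic square S(j,k1,k2) = [k1,k1+1]x[k2,k2+1] / 2^j
    lying on one side of the line through boundary points p and q of S."""
    side = 2.0 ** (-j)
    x0, y0 = k1 * side, k2 * side
    xs = (np.arange(n) + 0.5) / n
    X, Y = np.meshgrid(xs, xs, indexing="ij")
    in_square = (X >= x0) & (X < x0 + side) & (Y >= y0) & (Y < y0 + side)
    # The sign of the cross product decides which side of the line p->q a point is on.
    (px, py), (qx, qy) = p, q
    half_plane = (qx - px) * (Y - py) - (qy - py) * (X - px) >= 0
    return (in_square & half_plane).astype(float)

# Wedge of the square [0, 1/2]^2 cut by the segment from (0, 0.25) to (0.3, 0.5).
w = wedge_indicator(j=1, k1=0, k2=0, p=(0.0, 0.25), q=(0.3, 0.5))
print(w.sum() / w.size)   # approximate area of the wedge
```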

9.1 Lipschitz Graph Decomposition of Boundary Curves

We now develop a decomposition of elements in Star^α using this dictionary. The approach derives from [13]. Note that for objects $f \in \text{Star}^\alpha(C)$, the boundary ∂B of B = supp(f) can be parametrized by a unit-speed curve β(s) with continuous tangent. This tangent is in fact uniformly continuous, uniformly across members of the class; the tangent changes with s according to a modulus of continuity valid uniformly over all $f \in \text{Star}^\alpha(C)$:
$$\|\dot\beta(s+h) - \dot\beta(s)\| \le C_\alpha\cdot C\cdot|h|^{\alpha-1}. \qquad (9.1)$$

Such a curve can be decomposed into a finite series of Lipschitz Graphs – regions of s within which the curve takes the form of $x_2 = f(x_1)$ or $x_1 = f(x_2)$ (with controlled slope in either case). The significance and utility of this decomposition are known from extensive research in harmonic analysis by David and Semmes [8]. See also [23]. We can build such a decomposition according to a simple 'region growing' principle. This works as follows. Start with a point $s_1$ where $\dot\beta(s)$ is purely horizontal; in an interval $I_1$ about s, the angle that $\dot\beta(s)$ makes with the vertical exceeds π/5 in absolute value. On this interval, the image of β, $\beta[I_1]$, say, coincides with a Lipschitz function of $x_1$, with slope bounded by λ = cos(π/5)/sin(π/5). Starting at the right-hand endpoint of this interval, $s_2$, say, we can identify an interval $I_2$ throughout which the angle that $\dot\beta(s)$ makes with the horizontal exceeds π/5 in absolute value; on that interval the image $\beta[I_2]$ coincides with a Lipschitz function of $x_2$, with slope bounded by λ. Continuing in this fashion, we obtain a finite list of intervals $I_i$ and corresponding Lipschitz graphs. See Figure 5.
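A minimal Python sketch of this region-growing split (illustrative only: the boundary is sampled at discrete parameter values and the π/5 angle test is applied to finite-difference tangents; none of the specifics below come from the paper):

```python
import numpy as np

def split_into_lipschitz_graphs(beta, angle=np.pi / 5):
    """Split a closed sampled curve beta (N x 2 array) into runs of indices on
    which the tangent stays > `angle` away from vertical ('graph over x1') or
    away from horizontal ('graph over x2')."""
    tangent = np.diff(beta, axis=0, append=beta[:1])
    ang = np.arctan2(tangent[:, 1], tangent[:, 0])     # tangent direction
    # True where the curve is locally a graph x2 = f(x1): away from vertical.
    graph_over_x1 = np.abs(np.abs(ang) - np.pi / 2) > angle
    runs, start = [], 0
    for i in range(1, len(beta)):
        if graph_over_x1[i] != graph_over_x1[start]:
            runs.append((start, i, "x2=f(x1)" if graph_over_x1[start] else "x1=f(x2)"))
            start = i
    runs.append((start, len(beta), "x2=f(x1)" if graph_over_x1[start] else "x1=f(x2)"))
    return runs

s = np.linspace(0, 2 * np.pi, 400, endpoint=False)
rho = 0.3 + 0.05 * np.cos(4 * s)
beta = np.stack([0.5 + rho * np.cos(s), 0.5 + rho * np.sin(s)], axis=1)
for run in split_into_lipschitz_graphs(beta):
    print(run)
```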



Figure 5: A Decomposition into Lipschitz Graphs. The result of the region-growing procedure is illustrated, with starting point $s_0$ indicated, along with critical points $s_i$ at which β makes an angle of size π/5 with either horizontal or vertical. Also indicated are the arcs $\beta[I_i]$ which are Lipschitz graphs $x_2 = f_i(x_1)$ or $x_1 = f_i(x_2)$.

Each such Lipschitz Graph can, at all sufficiently fine scales j, be approximated by linear interpolation at grid points $2^{-j}$. By this we mean that at all sufficiently fine scales, the grid points will be finely enough spaced so as to not 'miss the interval' entirely, and so reasonable approximation is possible. Here 'sufficiently fine' is determined as follows. A given Lipschitz graph, say one of the form $x_2 = f_i(x_1)$, will agree with the boundary curve over a segment $x \in I_i$. 'Sufficiently fine' means that $2^{-j} < |I_i|/4$, and this condition guarantees that at least three dyadic grid points $x_{j,k} = k/2^j$ fall in the interval $I_i$, and so the linear approximant ('spline') has at least three 'knots'.

Lemma 4 There is a critical scale $\mu_0 = \mu_0(\alpha, C)$ with the property that scales $j \ge \mu_0$ are 'sufficiently fine' for all Lipschitz Graphs arising from all boundary curves β associated with Star^α(C).

Proof. Indeed, this is the same thing as saying that any Lipschitz Graph constructed by region growing has a length $|I_i| > 4\cdot 2^{-\mu_0}$. Now a component of a Lipschitz graph produced by region growing contains a point s at which the angle of $\dot\beta$ with one of the axes is not bigger than π/5 and also a point s' where the angle is not smaller than π/2 − π/5. Hence, during the interval, the direction of the tangent must change by at least
$$\|\dot\beta(s) - \dot\beta(s')\| > \sqrt{(1 - \cos(\pi/10))^2 + \sin(\pi/10)^2} = \delta_0,$$
say. From the modulus of continuity (9.1), this requires that $\delta_0 < C_\alpha\cdot C\cdot|s - s'|^{\alpha-1}$. Hence we have
$$(\delta_0/(C_\alpha\cdot C))^{1/(\alpha-1)} < |s - s'|.$$
We may pick $2^{-\mu_0}$ to be smaller than one-fourth the left-hand side, and we are guaranteed the desired property. ♦

Let then, for $j > \mu_0$, $\beta_j$ denote the piecewise linear boundary curve obtained by joining together the piecewise linear Lipschitz graphs obtained by linear interpolation. We remark that this joining can be performed in such a way that it guarantees that the resulting $\beta_j$ is a union of Lipschitz graphs with |slope| < λ and such that each Lipschitz graph has at least 3 defining knots. The sequence of curves $\beta_j$ converges to β, and differs from β by at most $C\cdot 2^{-j\alpha}$ in Hausdorff distance.

Figure 6: The coarse-scale atoms in Γ0 are 'blobs' associated with squares intersecting the interior of the set B. These are either full dyadic squares or mutilated squares (wedgelets).

9.2 Atomic Decomposition in Wedgelets

We develop our atomic decomposition of the image based on the properties of these curves $\{\beta_j : j \ge \mu_0\}$.

First, we construct a layer $\Gamma_{0,f}$ of 'coarse scale' atoms. Take each dyadic square S at level $\mu_0$; say that it is 'interior to' $\beta_{\mu_0}$ if that curve does not intersect the interior of S, and if S is topologically inside the region bounded by $\beta_{\mu_0}$. For each such interior S, let $S \in \Gamma_{0,f}$. Say that S intersects $\beta_{\mu_0}$ if that curve intersects the interior of S. For each such intersecting S, the curve $\beta_{\mu_0}$ intersects with S in a line segment (edgel) and bounds a wedge-shaped subregion of S which is 'interior' to $B_{\mu_0}$. Let w be this wedge-shaped region, and let $w \in \Gamma_{0,f}$. Squares which are not interior and not intersecting will be omitted from $\Gamma_{0,f}$. Figure 6 gives an illustration.

Second, we construct a layer $\Gamma_{1,f}$ of 'fine scale' atoms, as follows. Compare the curve $\beta_j$ with $\beta_{j+1}$; the two will agree at 50% of their knots; in between, they will differ according to triangular-shaped regions; see Figure 7. For each $j > \mu_0$, enumerate the triangular-shaped regions as $\Delta_{j,l}$, $1 \le l \le L_j$. Each such region has an associated sign $\sigma_{j,l}$, which is +1 if $\Delta_{j,l} \subset B_{j+1}\setminus B_j$ and −1 if $\Delta_{j,l} \subset B_j\setminus B_{j+1}$. These regions are not themselves in the wedgelets dictionary; but each such region is built from at most N wedgelets, for a fixed N derivable from properties of the region-growing algorithm. Indeed, the intersection of a triangular region and a dyadic square at level j + 1 is bounded by at most two sides of the triangle and two or three sides of the dyadic square; for reasons that emerge below, at fine scales these regions typically have very extreme aspect ratios. We will call such a region a needle. Each needle can be represented by a single wedgelet, or else by the difference of two wedgelets; the two wedgelets combine destructively to create the needle:
$$\nu_{j,l,k} = w^+_{j,l,k} \setminus w^-_{j,l,k}.$$
See Figure 8.


Figure 7: Triangular Difference Regions. Segments of two Lipschitz graphs at adjacent scales are indicated. The graphs agree at dyadic points corresponding to the coarser grid and differ at the dyadic points corresponding to the finer grid. The symmetric difference between the regions that the two curves bound will be a series of triangular regions.

Each Lipschitz graph has slope ≤ λ by construction, and so each triangular region under study intersects at most 2(λ + 2) dyadic squares at level j + 1. Let then $\Delta_{j,l} = \cup_{k=1}^{K_{j,l}} \nu_{j,l,k}$ be the decomposition in needles. Associate the sign $\sigma_{j,l}$ with each needle, via $\sigma_\nu = \sigma_{j,l}$. Define then
$$\Gamma_{1,f} = \{\nu_{j,l,k} : j > \mu_0,\; l = 1,\dots,L_j,\; k = 1,\dots,K_{j,l}\}.$$
Collect together now the fine-scale and coarse-scale atoms, as $\Gamma_f = \Gamma_{0,f} \cup \Gamma_{1,f}$, and set, for each,
$$\phi_\gamma = 1_\gamma(x_1, x_2)/\sqrt{\text{Area}(\gamma)},$$
and, if $\gamma \in \Gamma_{0,f}$,
$$\theta_\gamma = \sqrt{\text{Area}(\gamma)},$$
while, if $\gamma \in \Gamma_{1,f}$,
$$\theta_\gamma = \sigma_\gamma\cdot\sqrt{\text{Area}(\gamma)}.$$
The formal sum
$$f = \sum_\gamma \theta_\gamma\phi_\gamma \qquad (9.2)$$
can be shown to make sense in $L^2[0,1]^2$. Indeed, the triangle inequality gives
$$\Big\|\sum_\gamma \theta_\gamma\phi_\gamma\Big\|_2 \le \sum_\gamma |\theta_\gamma|\,\|\phi_\gamma\|_2 = \sum_\gamma |\theta_\gamma| < \infty.$$
The finiteness of the sum on the extreme right follows from our next result, Theorem 5 below, by taking p = 1. It follows that (9.2) is an absolutely convergent sum. Finally, note that each atom $\phi_\gamma$ is a linear combination of the indicators of at most two wedgelets.

9.3 Sparsity Analysis

Theorem 5 The above procedure gives an adaptive atomic decomposition of f in the continuum wedgelet dictionary
$$f = \sum_{\gamma\in\Gamma_f} \theta_\gamma\phi_\gamma$$
which achieves
$$\sup_{f\in\text{Star}^\alpha} \|\theta\|_p < \infty$$
for p > 2/(α + 1).


Figure 8: Decomposition of triangular difference region into Wedgelets. The triangular difference region is built from a number of 'needles' ν, where each needle is built from either a single wedgelet or a pair of wedgelets. The figure illustrates construction of a needle from a pair of wedgelets, the one in gray entering 'constructively' (by unions) and the one in black entering 'destructively' (by differences).

A comparison of this result with the lower bounds suggests that $p_0 = 2/(\alpha+1)$ – a result which we develop later below.

The proof of this theorem rests on an analysis of objects in class Star^α(C) made by the author in [13]. We now need two key observations, proved there:

• There are at most $K_1 2^j + K_2$ dyadic squares at level j that interact with the boundary curve.

• The area of any midpoint deflection region $\Delta_{j,l}$ associated with such an interacting square is at most $C\cdot 2^{-j}\cdot 2^{-j\alpha}$.

It follows that if we stratify the sum of p-th powers of coefficients by scale we have
$$\sum_\gamma |\theta_\gamma|^p \le \sum_{\Gamma_0}\big(\sqrt{\text{Area}(\gamma)}\big)^p + \sum_{j=\mu_0+1}^\infty \sum_{\gamma\in\Gamma_{1,j}} \big(\sqrt{\text{Area}(\Delta_{j,l})}\big)^p \le 4^{\mu_0}\cdot(2^{-\mu_0})^p + \sum_{j=\mu_0+1}^\infty (K_1 2^j + K_2)\big(\sqrt{C_1\cdot 2^{-j}\cdot 2^{-j\alpha}}\big)^p \le K_1' + K_2'\cdot\sum_{j=\mu_0+1}^\infty 2^j\cdot 2^{-j(1+\alpha)p/2}.$$
Consequently, for p > 2/(1 + α),
$$\sup_{\text{Star}^\alpha} \sum_\gamma |\theta_\gamma|^p < K_{\alpha,p}(C_\alpha),$$
as was to be proved. ♦
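As a quick sanity check on the exponent (a minimal numeric sketch, not from the paper), the geometric-type sum $\sum_j 2^j\cdot 2^{-j(1+\alpha)p/2}$ converges exactly when $(1+\alpha)p/2 > 1$, i.e. when $p > 2/(1+\alpha)$:

```python
import numpy as np

def tail_sum(alpha, p, j_max=60, mu0=2):
    js = np.arange(mu0 + 1, j_max)
    return np.sum(2.0 ** (js * (1 - (1 + alpha) * p / 2)))

alpha = 2.0
p_crit = 2 / (1 + alpha)               # = 2/3 for alpha = 2
print(tail_sum(alpha, p_crit + 0.1))   # converges: partial sums stabilize
print(tail_sum(alpha, p_crit - 0.1))   # diverges: keeps growing with j_max
print(tail_sum(alpha, p_crit - 0.1, j_max=120))
```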


10 Discretized Wedgelets Dictionary

The above atomic decomposition in wedgelets can be modified slightly so as to operate in a countable dictionary under the constraint of polynomial depth search. Compare [13].

We construct now $\mathcal{E}^j$, the collection of all dyadic edgelets at level j (and scale $2^{-j}$), as follows. Let $S(j, k_1, k_2)$ be any dyadic square of side $2^{-j}$ contained in $[0,1]^2$ and mark out, in a clockwise traverse of the boundary starting from the northwest corner, a series of equispaced vertices at spacing $2^{-5j/2}$. Let $V(j, k_1, k_2)$ be the collection of vertices so obtained. Then take all pairs of such vertices on distinct sides of $S(j, k_1, k_2)$; the line segment e connecting these vertices is an edgelet. Let $E(j, k_1, k_2)$ denote the collection of edgelets associated with square $S(j, k_1, k_2)$ and let $\mathcal{E}^j$ be the collection of all edgelets made from all dyadic squares at level j: $\mathcal{E}^j = \cup_{k_1,k_2} E(j, k_1, k_2)$; also let $V(j) = \cup_{k_1,k_2} V(j, k_1, k_2)$. There are $N_j = \#\mathcal{E}^j \le 4^j\cdot 2^{5j}$ edgelets at level j; note that $N_j$ is polynomially growing in $2^j$. Now an edgelet e at level j divides the square in which it lives into two pieces with common edge e; these pieces are wedges at level j. We can define now $\mathcal{W}^j$, the set of wedgelets at level j, to be the collection of all $L^2$-normalized indicators of unit squares at level j, and of all $L^2$-normalized indicators of wedges at level j.

We now claim that:

[1] An atomic decomposition can be performed using just the wedgelets from $(\mathcal{W}^j : j \ge \mu_0)$;

[2] The decomposition will achieve the same degree of sparsity as the one based on continuum wedgelets; and

[3] The decomposition can be made in a polynomial depth-search constrained fashion.

To see [1], simply replace the curves $\beta_j$ driving the earlier decomposition by curves $\tilde\beta_j$ with vertices taken from the set V(j). Then use exactly the same decompositions as before.

To see [2], note that small adaptations of the preceding arguments go through as before, to give the same types of quantitative estimates. The objects $\tilde\Delta_{j,l}$, obtained by symmetric differences of pieces of the corresponding sets $\tilde B_j$ and $\tilde B_{j-1}$, are no longer perfect triangles; they are only approximate triangles. But the decomposition into 'needle'-like terms, the number and size properties of the terms, and the anisotropy of the terms all remain the same, so the basic quantitative estimates go through as before. See Figure 9.

To see [3], note that the polynomial depth search constraint is obeyed because all terms in the expansion out through a position which is $O(4^{\mu_0} + 2^j)$ are taken from $\mathcal{W}^{\mu_0}, \dots, \mathcal{W}^j$, and hence from the first $O(4^j\cdot 2^{5j})$ terms. In short, the N-th term in the expansion is taken from among the first $CN^7$ terms of the dictionary.

We have completed the proof of Theorem 1.

11 Discussion

11.1 Interpretation

Suppose we now regard the terms $\phi_\gamma$ as the 'sparse components' identified by our 'procedure'. What interpretation can we make?

11.1.1 Organization

At coarse scales γ ∈ Γ0, we have already labelled these terms 'blobs' (indicators of squares or amorphous wedges), while we have called the terms at finer scales 'needles' (indicators of differences of wedges).

The fine scale terms are basically pieces of narrow triangles, with length ≈ $2^{-j}$ and width ≈ $2^{-j\alpha}$, α > 1; at fine scales they have an overwhelming tendency to be very elongated; this justifies the appellation 'needle'. In effect, at fine scales, the visual appearance of these needles is exactly as 'thickened edgelets'. We always, here and in [13], use the term edgelet to refer to a member of the discrete family of segments $\mathcal{E}^j$; these are infinitely narrow. The atoms used in the decomposition are indicators of regions that are in the limit of increasing scale infinitely narrow; the needles are regions trapped between two nearly parallel edgelets, where the distance between the edgelets ≈ $2^{-j\alpha}$ is small compared to the length ≈ $2^{-j}$. Thus, we can think of these needles as obeying the same dyadic organization as the edgelet system, occurring at all scales, locations, and orientations, and, at fine scales, with negligible thickness.

Figure 9: An approximately-triangular difference region. Here the region bounded by $\tilde\beta_j$ and $\tilde\beta_{j+1}$ is composed of approximately-triangular regions. In comparison with the earlier situation, now the two curves need not agree precisely at dyadic points at the coarser scale; however they cannot differ by more than $2^{-5j/2}$, which is negligible.

11.1.2 Scaling

The size distribution of the terms in an atomic decomposition is as follows. In a decomposition, there are at most $4^{\mu_0}$ 'blobs', where $\mu_0 = \mu_0(\alpha, C)$ depends on the class of objects being analyzed, but not on the specific object in that class. Then, there are at most $O(2^j)$ acute triangles with length of size ≈ $2^{-j}$ and width of size ≈ $2^{-j\alpha}$, for α > 1. Hence the aspect ratio length/width behaves as ≈ $2^{j(\alpha-1)}$. The system therefore exhibits the following scaling laws:

[S1] There are O(L) terms of length ≈ 1/L, for large L.

[S2] There are O(A) terms in the expansion at aspect ratio ≈ $A^{\alpha-1}$, for large A.

These laws concern the 'active units' in a specific decomposition. Such a decomposition is taken from a massively overcomplete representing system of wedgelets, in which at level $j > \mu_0$ there are $O(2^{(4+\eta)j})$ elements (η > 0) potentially available for use. (In the above we have chosen η = 1/2 but any η > 0 will do.) Now only $O(2^j)$ of these elements are in use in any specific atomic decomposition. The 'activity rate' at level j, i.e. the fraction of units at level j which are in use, is therefore of order $O(2^{-(3+\eta)j})$. This gives another scaling law:

[S3] A fraction $o(L^{-3})$ of the units of size L are active in any given decomposition.
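A tiny Python sketch of the bookkeeping behind [S1]-[S2] (purely illustrative, using the per-level counts quoted above with arbitrary constants):

```python
# Per level j: about 2**j needle terms, each of length ~2**-j and width ~2**-(alpha*j),
# hence aspect ratio ~2**(j*(alpha-1)).
alpha, mu0 = 2.0, 2
for j in range(mu0 + 1, mu0 + 6):
    n_terms = 2 ** j                 # [S1]: O(L) terms of length 1/L, with L = 2**j
    length = 2.0 ** (-j)
    width = 2.0 ** (-alpha * j)
    aspect = length / width          # [S2]: aspect ratio ~ A**(alpha-1) with A = 2**j
    print(j, n_terms, length, width, aspect)
```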


If we view the decomposition as being performed in the wedgelet dictionary $\mathcal{W}^{\mu_0}\cup\mathcal{W}^{\mu_0+1}\cup\cdots$, with all available terms present in the decomposition, but unused terms having zero coefficients, this sum is increasingly sparse at increasingly fine scales.

11.1.3 What Features are being Measured?

What do the coefficients in the atomic decomposition measure? Essentially, they measure the local curvature of the boundary ∂B. The size of a triangular difference region ∆_{j,l} driving the atomic decomposition is determined by the local curvature of the boundary ∂B in the neighborhood of a dyadic square. Indeed, if the boundary were perfectly straight within a certain square, the triangular difference region would vanish. More precisely, if we have a local Lipschitz parametrization x_2 = f_i(x_1), then the 'width' of the triangle ∆_{j,l} is given by the 'midpoint deflection'

$$\delta_{j,l} = f_i(x_{j+1,2k+1}) - \bigl(f_i(x_{j,k}) + f_i(x_{j,k+1})\bigr)/2$$

for appropriate k, where x_{j,k} = k/2^j are dyadic evaluation points. The area of the triangle obeys the inequality

$$\mathrm{Area}(\Delta_{j,l}) \le |\delta_{j,l}| \cdot (1 + \lambda) \cdot 2^{-j}.$$

If f_i is twice differentiable, then δ_{j,l} ≈ f_i''(x_{j+1,2k+1}) · 2^{-2j}, and so

$$\mathrm{Area}(\Delta_{j,l}) \le (1 + \lambda) \cdot |f_i''(x_{j+1,2k+1})| \cdot 2^{-3j}.$$

The decomposition of ∆_{j,l} into needles leads to a series of individual pieces with areas obeying the same relations, so for the subordinate γ's

$$\mathrm{Area}(\gamma) \le (1 + \lambda) \cdot |f_i''(x_{j+1,2k+1})| \cdot 2^{-3j},$$

and the corresponding θ_γ(f) obeys

$$|\theta_\gamma(f)| \le \mathrm{Const} \cdot |f_i''(x_{j+1,2k+1})|^{1/2} \cdot 2^{-3j/2}.$$

The coefficients at fine levels are therefore, in a sense, controlled by the square root of the boundary curvature.
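As a quick numerical check of the midpoint-deflection relation, the following hypothetical sketch evaluates δ_{j,l} for the sample boundary parametrization f(x) = x²/2 (so f'' ≡ 1) and compares it with the exact quadratic-case value |f''| · 2^{-2j}/8; the constant 1/8 is absorbed into the '≈' above, and the boundary and dyadic indices are arbitrary choices.

```python
# Hypothetical numerical check of the midpoint-deflection / curvature relation.
# The boundary parametrization f(x) = x**2 / 2 is illustrative only (f'' == 1).

def midpoint_deflection(f, j, k):
    """delta_{j,l}: value of f at the midpoint minus the chord average,
    over the dyadic interval [k/2^j, (k+1)/2^j]."""
    a, b = k / 2**j, (k + 1) / 2**j
    m = (a + b) / 2.0
    return f(m) - (f(a) + f(b)) / 2.0

f = lambda x: x**2 / 2.0          # sample smooth boundary, second derivative 1
for j in range(3, 9):
    delta = abs(midpoint_deflection(f, j, k=1))
    predicted = 1.0 * 2.0**(-2 * j) / 8.0   # |f''| * 2^(-2j) / 8 for a quadratic
    print(f"j={j}: |delta|={delta:.3e}, predicted={predicted:.3e}")
```

The agreement illustrates why the fine-scale coefficients shrink like 2^{-3j/2} times the square root of the local curvature.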

11.1.4 Comparisons

Now we make a few remarks comparing the decomposition developed here with ongoing work in vision. Such comparisons are necessarily hazardous.

Field [17] and Foldiak [18] have emphasized the potential role that 'sparse coding' could play in the design of the visual cortex. Field, in particular, has pointed to the sparsity of wavelet expansions as providing inspiring examples. In this paper we have another example of a system which uses sparse codes. It achieves considerably sparser representation of this image class than a wavelet scheme, and the terms are explicitly related to the underlying curvature of the edges.

What are the interrelationships among the active units? Obviously there is a certain hereditary relationship: a unit at level j > µ0 can only be active if its parent is also active. If a unit is active, only one other unit in the same dyadic square can be active, and this must be a unit which corresponds to a non-intersecting edgelet. So there is considerable 'inter-unit inhibition' at work, and it is organized along structural lines, based on inheritance across scales and on location within a scale.

Field [16] has also suggested that, in view of the self-similarity of image statistics, an optimal image representation might also be self-similar, and mentioned wavelet-like representations as possible models for this. In this paper we have another example of a system that has a self-similar structure, in the sense that at each scale 2^{-j} the underlying structure is one of 'thickened edgelets'.

11.2 Nonuniqueness

The nagging question facing this whole project is: how unique is the dictionary reported here? It would obviously lay a greater claim on our attention if we could prove that any dictionary with the optimal sparsity property would have similar features, for example a similar geometric arrangement (needle-like features at all scales, locations, and orientations). Obviously, since we are dealing only with a kind of 'almost optimality', there is no uniqueness.

Another way to put it is the following. Suppose we had another L^2-normalized dictionary Ξ, and that the elements of this dictionary allowed sparse synthesis of the blobs and wedges we have used here:

$$\phi_\gamma = \sum_i a_{\gamma,i}\, \xi_{\gamma,i}$$

for each dictionary item φ_γ of the form we have used in the above constructions, where

$$\Bigl(\sum_i |a_{\gamma,i}|^\tau\Bigr)^{1/\tau} \le C_\tau(\Phi, \Xi),$$

with C_τ not depending on the blob or wedge in question, and where τ < 2/3. Then obviously we have

$$f = \sum_\gamma \theta_\gamma \phi_\gamma = \sum_\gamma \theta_\gamma \sum_i a_{\gamma,i}\, \xi_{\gamma,i} = \sum_j b_j\, \xi_{\gamma_j, i_j},$$

where (b_j) is an enumeration of the products θ_γ a_{γ,i}. Furthermore, if τ < p then

$$\Bigl(\sum_i |a_{\gamma,i}|^p\Bigr)^{1/p} \le \Bigl(\sum_i |a_{\gamma,i}|^\tau\Bigr)^{1/\tau}.$$

Then

$$\sum_j |b_j|^p = \sum_\gamma \sum_i |\theta_\gamma|^p |a_{\gamma,i}|^p = \sum_\gamma |\theta_\gamma|^p \sum_i |a_{\gamma,i}|^p \le C_\tau(\Phi, \Xi)^p \cdot \|\theta(f)\|_p^p.$$
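As a sanity check on this bookkeeping, here is a small hypothetical numerical experiment (random coefficients, arbitrary sizes) verifying that when each atom's re-expansion coefficients have ℓ^τ quasi-norm at most C_τ with τ < p, the products b_j = θ_γ a_{γ,i} indeed satisfy Σ_j |b_j|^p ≤ C_τ^p ‖θ(f)‖_p^p.

```python
# Numerical sanity check (illustrative only) of the composition bound derived above:
#   sum_j |b_j|^p  <=  C_tau^p * ||theta(f)||_p^p,
# where b_j enumerates the products theta_gamma * a_{gamma,i}, and each atom's
# re-expansion coefficients (a_{gamma,i})_i have l^tau quasi-norm at most C_tau, tau < p.
import numpy as np

rng = np.random.default_rng(0)
tau, p, C_tau = 0.5, 0.8, 1.0        # tau < 2/3 < p, as in the argument above
n_atoms, n_per_atom = 200, 5         # hypothetical sizes, purely for illustration

theta = rng.laplace(size=n_atoms)                      # coefficients theta_gamma
a = rng.uniform(size=(n_atoms, n_per_atom))            # raw re-expansion coefficients
# rescale each row so that its l^tau quasi-norm equals exactly C_tau
a *= C_tau / (np.abs(a) ** tau).sum(axis=1, keepdims=True) ** (1 / tau)

b = (theta[:, None] * a).ravel()                       # products theta_gamma * a_{gamma,i}
lhs = (np.abs(b) ** p).sum()
rhs = C_tau ** p * (np.abs(theta) ** p).sum()
print(f"sum |b_j|^p = {lhs:.4f}  <=  C^p ||theta||_p^p = {rhs:.4f}: {lhs <= rhs}")
```

The inequality holds deterministically, since for each atom (Σ_i |a_{γ,i}|^p)^{1/p} ≤ (Σ_i |a_{γ,i}|^τ)^{1/τ} ≤ C_τ.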

In short, any system which allows τ-sparse synthesis of wedges, with τ < 2/3, will in principle provide equally sparse atomic decompositions of items in Star^α for 1 < α ≤ 2. This shows that there is considerable freedom in constructing almost-optimal dictionaries of synthesizing elements. In some sense we have illustrated this point several times already, in reexpressing the triangular regions underlying a specific function's decomposition into a bounded number of needles, and then in reexpressing each needle as a difference of at most two wedgelets.

The fundamental question is whether the freedom to sparsely reexpress the atoms in an almost-optimal dictionary means that the underlying geometry of the expansion can thereby be deformed into something other than the general "blobs and needles" form. There are definite limits to the possibility of such deformation. Certainly, we cannot represent wedges by ℓ^p-summable decompositions in wavelets, in sinusoids, or in multiscale Gabor functions. Those systems are unable to provide almost-optimal decompositions in our sense. So the freedom does not allow us to deform the "blobs and needles" system arbitrarily into any of the classical bases. The classical bases are not able to give efficient atomic decompositions of objects in Star^α; for example, the 'best p' for a representation by wavelets is p = 1. If θ(f, {Wavelets}) denotes an adaptively selected coefficient sequence in some nice wavelet basis, then

$$\max_{\mathrm{Star}^\alpha(C)} \ \min_{\sigma(n,f) \le \pi(n)} \|\theta(f, \{\mathrm{Wavelets}\})\|_p = +\infty \quad \text{for } p < 1$$

[13]. Hence wavelets exhibit worse sparsity than the optimal p_0 = 2/(α + 1) when α > 1. Similar statements can be made for sinusoids and for multiscale Gabor functions.

Perhaps the structure we have identified is truly, at some sufficiently high level of abstraction, an invariant feature of all nice atomic decompositions of objects having curved edges. Perhaps the scaling laws mentioned in Section 11.1.2 are also truly characteristic of nice atomic decompositions of objects having curved edges. But we do not claim to have established any such characteristic properties here, or to have any concrete ideas about how to investigate such issues at the moment.

References

[1] Bell, A.J. and Sejnowski, T.J. (1995) An information-maximization approach to blind separation and blind deconvolution. Neural Computation 7, 1129-1159.

[2] Berger, T. (1971) Rate-Distortion Theory: A Mathematical Basis for Data Compression. Englewood Cliffs, NJ: Prentice-Hall.

[3] Bergh, J. and Löfström, J. (1976) Interpolation Spaces: An Introduction. Grundlehren der Mathematischen Wissenschaften, No. 223. Springer-Verlag, Berlin-New York.

[4] David, G. and Semmes, S. (1993) Analysis of and on Uniformly Rectifiable Sets. Math Surveys and Monographs 38, Providence: AMS.

[5] Chen, S., Donoho, D.L., and Saunders, M.A. (1999) Atomic decomposition by Basis Pursuit. SIAM J. Sci. Comp. 20(1), 33-61.

[6] Coifman, R.R. and Wickerhauser, M.V. (1992) Entropy-based algorithms for best-basis selection. IEEE Trans. Info. Theory 38, 713-718.

[7] Daugman, J.G. (1990) Self-similar oriented wavelet pyramids: conjectures about neural non-orthogonality. Image Pyramids and Neural Networks, Proc. ECVP-90 Symposium.

[8] David, G. and Semmes, S. (1993) Analysis of and on Uniformly Rectifiable Sets. Mathematical Surveys and Monographs, 38, American Math. Soc.

[9] DeVore, R.A. and Lorentz, G.G. (1993) Constructive Approximation. Grundlehren der Mathematischen Wissenschaften, 303. Springer-Verlag, Berlin.

[10] Donoho, D.L. (1993) Unconditional bases are optimal bases for data compression and statistical estimation. Applied and Computational Harmonic Analysis 1, 100-115.

[11] Donoho, D.L. (1996) Unconditional bases and bit-level compression. Applied and Computational Harmonic Analysis 3, 388-392.

[12] Donoho, D.L. (1997) CART and Best-Ortho-Basis: a connection. Ann. Statist. 25, 1870-1911.

[13] Donoho, D.L. (1999) Wedgelets: nearly-minimax estimation of edges. Ann. Statist. 27, 859-897.

[14] Donoho, D.L. (1998) Independent Components Analysis and Computational Harmonic Analysis. Technical Report, Department of Statistics, Stanford University.

[15] Donoho, D.L. and Johnstone, I.M. (1995) Empirical Atomic Decomposition. Unpublished manuscript.

[16] Field, D.J. (1993) Scale-invariance and self-similar 'wavelet' transforms: an analysis of natural scenes and mammalian visual systems. In Wavelets, Fractals and Fourier Transforms, M. Farge, J. Hunt, and J.C. Vassilicos, eds. Oxford University Press.

[17] Field, D.J. (1995) Visual coding, redundancy, and "feature detection". In The Handbook of Brain Theory and Neural Networks, M.A. Arbib, ed. MIT Press, pp. 1012-1016.


[18] Foldiak, P. (1995) Sparse coding in the primate cortex. In The Handbook of Brain Theory and Neural Networks, M.A. Arbib, ed. MIT Press, pp. 895-899.

[19] Fyfe, C. and Baddeley, R. (1995) Finding compact and sparse distributed representations of visual images. Network 6, 333-344.

[20] Harpur, G.H. and Prager, R.W. (1996) Development of low entropy coding in a recurrent network. Network 7, 277-284.

[21] van Hateren, J.H. and Ruderman, D.L. (1998) Independent component analysis of natural image sequences yields spatiotemporal filters similar to simple cells in primary visual cortex. Proc. R. Soc. Lond. B 265.

[22] van Hateren, J.H. and van der Schaaf, A. (1998) Independent component filters of natural images compared with simple cells in the primary visual cortex. Proc. R. Soc. Lond. B 265, 359-366.

[23] Jones, P.W. (1990) Rectifiable sets and the Travelling Salesman Problem. Inventiones Mathematicae 102, 1-15.

[24] Karhunen, J., Hyvarinen, A., Vigario, R., Hurri, J. and Oja, E. (1997) Applications of neural blind separation to signal and image processing. Proc. ICASSP '97, pp. 131-134.

[25] Kashin, B.S. (1985) On approximation properties of complete orthonormal systems. Trudy Mat. Inst. Steklov 172 (in Russian); Proc. Steklov Inst. Math. 1987, 207-211.

[26] Kashin, B.S. and Temlyakov, V.N. (1994) On best m-term approximations and the entropy of sets in the space L1. Mat. Zametki 56, 57-86 (in Russian); Mathematical Notes 56, 1137-1157 (in English).

[27] Khas'minskii, R.Z. and Lebedev, V.S. (1990) On the properties of parametric estimators for areas of a discontinuous image. Problems of Control and Information Theory, pp. 375-385.

[28] Korostelev, A.P. and Tsybakov, A.B. (1993) Minimax Theory of Image Reconstruction. Lecture Notes in Statistics, Vol. 82. Springer-Verlag, New York.

[29] Lewicki, M. and Olshausen, B. (1997) Inferring sparse, overcomplete image codes using an efficient coding framework. Proc. NIPS*97, pp. 815-821.

[30] Lewicki, M. and Sejnowski, T. (1998) Learning overcomplete representations. To appear, Neural Computation.

[31] Mallat, S. and Zhang, Z. (1993) Matching pursuits with time-frequency dictionaries. IEEE Trans. Signal Proc. 41, 3397-3415.

[32] Olshausen, B.A. and Field, D.J. (1996) Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature 381, 607-609.

[33] Olshausen, B.A. and Field, D.J. (1997) Sparse coding with an overcomplete basis set: a strategy employed by V1? Vision Research 37, 3311-3325.

[34] Wickerhauser, M.V. (1993) Adapted Wavelet Analysis: Theory and Algorithms. A.K. Peters.

