Learning Optimal Wavelets from Overcomplete Representations

Hamid Eghbalnia and Amir Assadi
Department of Mathematics, University of Wisconsin-Madison

ABSTRACT
Efficient and robust representation of signals has been the focus of a number of areas of research. Wavelets represent one such representation scheme that enjoys desirable qualities such as time-frequency localization. Once the mother wavelet has been selected, other wavelets can be generated as translated and dilated versions of the mother wavelet in the 1D case. In the 2D case, the tensor product of two 1D wavelets is the most commonly used transform. Overcomplete representation of wavelets has proved to be of great advantage, both in sparse coding of complex scenes and in multimedia data compression. On the other hand, overcompleteness raises a number of technical difficulties for robust computation and for systematic generalization of constructions beyond their original application domains. In the following, we concentrate on wavelet decomposition of images; in particular, the computations are carried out for images of faces. With more effort, video files could be handled by adapting our present techniques. In this paper we propose a novel and quite general geometric method to generate overcomplete families that are parameterized by well-behaved manifolds, in fact Lie groups. The main philosophical motivation for our construction is a statistical adaptation of Felix Klein's Erlanger Program: that is, knowledge about the symmetry groups of objects in a family D (called the data set, such as a collection of multimedia data or images) carries many of the key statistical-geometric features of D and, in a statistical sense, the distribution of such features for members of D. We use function theory and the geometric properties of the parameter space to extract sparse representations aiming at optimal use of the local geometric features of members of D.
The better the statistical distribution of such geometric features, the more efficient and robust the wavelet representation of members of the data set becomes. Briefly, the theory begins by generating the overcomplete family from the local and global transformations afforded by the data sets, e.g. images. In the second stage, we formulate an "action integral" on the suitable function space associated with the moduli space of wavelets. Finally, variational formulas that minimize "the action" provide the sparse representation of data in the form of special points on the moduli space. We illustrate this method by studying the problem of sparse representation of images of faces in a collection D. In the specific application to computing sparse representations of facial images, we develop a specific learning machine to take advantage of the rather rich statistical regularity of frontal views of face images. Rather than starting with a mother wavelet and searching for an optimal representation of a single image, we start with a set of images and ask for a mother wavelet that gives the optimal representation for this class of images. We choose the Support Vector Machine paradigm as the learning paradigm. Then, our problem is one of estimating an optimal kernel in the Reproducing Kernel Hilbert Space that satisfies the admissibility conditions of wavelets. We show simulation results using a database of human faces and a set of wavelet kernels. We discuss extensions of this approach using a local-to-global technique for eigenfunction expansion.

Keywords: Overcomplete representation, Learning theory, Variational methods, Moduli space, Support Vector regression, Sparse representation.

1. INTRODUCTION
Our research is motivated by perceptual geometry, an emerging field of interdisciplinary research whose objectives focus on the study of geometry from the perspective of visual perception and, in turn, the application of such geometric findings to the ecological study of vision [1]. Perceptual geometry attempts to answer fundamental questions in the perception of form and the representation of space through a synthesis of cognitive and biological theories of visual perception with geometric theories of the physical world. In the context of perceptual geometry, our previous work [refs] has focused on a basic mathematical model for the Gestalt of surfaces, that is, the simplest phase of global perception of surfaces in the environment, as opposed to visual perception in laboratory and psychophysical experiments under controlled parameters. In this paper we further this area through a modeling and computational study of the perceptual space of faces. The study of objects and their corresponding perceived surfaces in perceptual geometry begins with acknowledging that, unlike the objects of mathematical theories, surfaces in the natural world are not ideal objects. The percept of a surface is formed

through a cascade of neuronal signals, beginning with the rays of light reaching the observer's eyes. The sequence of neuronal processing in the brain is carried out in a parallel, distributed fashion. The neurons in the circuitry employed by the visual process vary according to the circumstances under which the observer "sees" or "looks at" the object, the condition of light, and the emergence of other stimuli, such as auditory, tactile and so on. The state of alertness, motivation, attention, and affective factors influence the outcome of visual perception. The detection of salient features depends on both internal and external circumstances as well. The dynamic features of vision add to the complexity of modeling visual perception. Without dynamics, there is no vision! Nonetheless, surfaces in the world around us are perceived to have a piecewise smooth "global" geometric form that is quite distinct from their small-scale "local" geometric properties. Faces, as objects in the natural environment and as surfaces in the perceptual sense, avail themselves of these same properties. In this context, one can ask: "How do we associate a simple (sparse) descriptor to faces such that their perceptual geometry is well represented?" In this paper we begin with a computational model of the signal and its processing system. The starting point for the computational model is a set of signals belonging to the same "conceptual" class. Since our interest is in the study of the human visual system, the conceptual class of interest is, loosely, the cognitive class of human faces. This starting point, therefore, is an assumption regarding earlier classification by the system. The computational model begins with an overcomplete representation for an ensemble of signals. In an overcomplete basis, the number of basis vectors exceeds the dimensionality of the underlying (input) vector space.
Overcomplete representations have been advocated because they exhibit greater robustness in the presence of noise and can be used to obtain sparse representations for the data in the input space. Overcomplete codes have also been proposed as a model of some of the response properties of neurons in primary visual cortex. In an overcomplete basis, an input cannot be uniquely represented as a combination of basis vectors. The majority of previous work on overcomplete bases has focused on finding the best representation of a signal using a fixed overcomplete basis (or dictionary). Since a unique solution may not exist, and in practical problems noise is often a factor, solutions to overcomplete representations have to rely on some form of regularization. Regularization functionals are a form of Lagrangian for learning, and we exploit this analogy in our formulation. Furthermore, in cases where the overcomplete representation has a natural parameterization (for example, wavelets parameterized by the affine group), this parameterization has the potential of being useful in parameterizing aspects of the signal ensemble. This will lead our computational model to the consideration of moduli spaces. The remainder of this paper proceeds as follows. We review the necessary background for the formulation of the problem by reviewing learning methods in the context of regularization methods. Then we briefly consider wavelets in the group-theoretic setting. The next section motivates moduli spaces and formulates the problem. We then present some early results using a database of faces and conclude the paper.

2. OVERCOMPLETE REPRESENTATIONS
It is often the case that, given a specific space such as a Hilbert space H (for example, L²(R)), one is interested in building a basis for the space with special properties. For example, in the case of wavelets [7,8,9], one may look for basis functions φ_n which are as smooth as possible as well as localizable. It may turn out that the sequence {φ_n} gives rise to a non-orthogonal expansion. For example, every f ∈ H has a non-unique representation f = Σ_n c_n φ_n (that is, such coefficients can be found) such that the norm equivalence property

A ‖f‖ ≤ ( Σ_n |c_n|² )^{1/2} ≤ B ‖f‖,   0 < A ≤ B

is satisfied. In general, this does not present a problem, and there exists a canonical construction (frame operators) for finding the coefficients we are looking for. However, given a set { f_i ∈ H }_{i∈I}, the question arises as to whether there is a representation for a set of signals that satisfies specific desirable properties.
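In finite dimensions the frame bounds A and B are simply the extreme eigenvalues of the frame operator S = Σ_n φ_n φ_nᵀ, and the canonical dual frame S⁻¹φ_n produces one valid set of expansion coefficients. A minimal sketch in Python; the three-vector "Mercedes-Benz" frame in R² is our own illustrative choice, not an example from the text:

```python
import numpy as np

# A finite "overcomplete basis" for R^2: three unit vectors at 120 degrees
# (the so-called Mercedes-Benz frame). Rows are the frame vectors phi_n.
phi = np.array([[0.0, 1.0],
                [-np.sqrt(3) / 2, -0.5],
                [ np.sqrt(3) / 2, -0.5]])

# Frame operator S = sum_n phi_n phi_n^T. Its smallest/largest eigenvalues
# are the frame bounds A, B in  A ||f||^2 <= sum_n |<f, phi_n>|^2 <= B ||f||^2.
S = phi.T @ phi
A, B = np.linalg.eigvalsh(S)[[0, -1]]
print(A, B)   # this frame is tight: A = B = 3/2

# Canonical dual frame: dual_n = S^{-1} phi_n. The coefficients
# c_n = <f, dual_n> give one (non-unique) expansion f = sum_n c_n phi_n.
dual = phi @ np.linalg.inv(S)
f = np.array([0.7, -1.3])
c = dual @ f
f_rec = phi.T @ c                 # sum_n c_n phi_n
print(np.allclose(f_rec, f))      # True: perfect reconstruction
```

Any vector in the null-space pattern of the coefficient map can be added to c without changing f_rec, which is the non-uniqueness discussed above.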

As a first step in addressing the proposed problem, in this section we present a method of generating a basis for H which relies on group representations. This approach offers advantages, two of which are relevant to this paper. First, it unifies the treatment of wavelets (and of the short-time Fourier, or Gabor, transform) and shows how wavelet methods can be generalized for representation on a manifold M. Second, it clarifies the role of the group action and in turn offers a natural parameterizing space for the object under study through the wavelet representation. We proceed mostly by example in our presentation and refer the interested reader to [12] or [10,11]. Let G be a locally compact topological group with a left Haar measure (the same treatment applies for the right Haar measure). Suppose U is an irreducible unitary representation of the group G on L²(R). The irreducibility is required since then the set {U(x)g}, for a nontrivial fixed g and x ranging over G, is dense in L²(R) (which, along with square integrability, makes an integral representation possible). One can then obtain the Gabor transform from the representation of the Weyl-Heisenberg group as follows. Define the representation on L²(R) as:

W(t, a, b) f(x) = t exp(2πib(x − a)) f(x − a)

Noting that this group is the matrix group below, it is easy to check the group operations:

    ( 1  t  b )
    ( 0  1  a )
    ( 0  0  1 )

The wavelet transform can be obtained via the representation of the affine group ax + b:

W(a, b) f(x) = exp(−a/2) f(exp(−a) x − b)

The group operation

(a₁, b₁) · (a₂, b₂) = (a₁ + a₂, exp(−a₂) b₁ + b₂)

can be easily verified.
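The group law can be checked numerically as well as by hand. The sketch below assumes the normalization W(a, b)f(x) = exp(−a/2) f(exp(−a)x − b), which is the convention consistent with the group operation stated above; the Gaussian test function is an arbitrary choice:

```python
import numpy as np

# Representation of the affine (ax + b) group on functions of one variable:
# W(a, b) f(x) = exp(-a/2) f(exp(-a) x - b).
def W(a, b, f):
    return lambda x: np.exp(-a / 2) * f(np.exp(-a) * x - b)

def compose(p1, p2):
    # Group law: (a1, b1) . (a2, b2) = (a1 + a2, exp(-a2) b1 + b2)
    (a1, b1), (a2, b2) = p1, p2
    return (a1 + a2, np.exp(-a2) * b1 + b2)

f = lambda x: np.exp(-x**2)             # arbitrary test signal
p1, p2 = (0.4, -1.1), (-0.7, 2.3)

x = np.linspace(-5, 5, 1001)
lhs = W(*p1, W(*p2, f))(x)              # W(p1) applied after W(p2)
rhs = W(*compose(p1, p2), f)(x)
print(np.allclose(lhs, rhs))            # True: W is a group homomorphism
```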

This construction can be extended to 2D signals in a general setting. Consider a finite-energy signal f defined on the plane. Unitary group operations that can be applied to the signal are elements of the similarity group of the plane, namely translations, dilations, and rotations, defined as:

T f(x, y) = f(x − a, y − b),   (a, b) ∈ R²
S f(x, y) = (1/s) f(x/s, y/s),   s ∈ R⁺
R f(x, y) = f(M(x, y)),   M ∈ SO(2)

which act on the signal f as follows:

T·S·R f(x, y) = (1/s) f((1/s) M(x − a, y − b))

This group is often written as SIM(2) = R² ⋊ (SO(2) × R⁺) (⋊ stands for semidirect product), and the left Haar measure is

dg = (ds/s³) dθ dx dy

with θ the angle of rotation in SO(2). If we write U = T·S·R, then one can show that U is (up to equivalence) the unique irreducible unitary representation of SIM(2) in L²(R²) which is also square integrable. It should be clear that the representation U simply generalizes the representation of the affine group which is at the heart of the 1D wavelet transform. And, just as the Gabor transform is the canonical coherent state associated to the Weyl-Heisenberg group, the last observation allows one to interpret wavelets as the coherent states associated to the representation U (that is, the elements of the orbit of the mother wavelet under the action of the group). The extension of this approach to higher dimensions is clear.
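As a quick sanity check on unitarity, one can verify numerically that the 1/s prefactor in the dilation operator S exactly compensates the Jacobian of the coordinate change, so the discretized L²(R²) norm is preserved; the Gaussian test signal below is an arbitrary choice:

```python
import numpy as np

# Check that S f(x, y) = (1/s) f(x/s, y/s) preserves the L^2(R^2) norm.
x = np.linspace(-15, 15, 1001)
dx = x[1] - x[0]
X, Y = np.meshgrid(x, x)

f = np.exp(-(X**2 + Y**2) / 2)            # finite-energy test signal
s = 1.7
Sf = (1 / s) * np.exp(-((X / s)**2 + (Y / s)**2) / 2)

norm_f  = np.sum(np.abs(f)**2)  * dx * dx  # Riemann-sum approximation
norm_Sf = np.sum(np.abs(Sf)**2) * dx * dx
print(np.isclose(norm_f, norm_Sf))         # True, up to discretization error
```

The same check applies to T and R, whose Jacobians are 1, so the full operator T·S·R is norm-preserving as claimed.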

3. LAGRANGIANS AND LEARNING
In classical mechanics the Lagrangian L is a function of the positions and velocities of particles. The most familiar form of the Lagrangian, for a particle of mass m moving in a potential V(r), is

L = (1/2) m ṙ² − V(r)

An important function related to the Lagrangian L is the action S, the integral of the Lagrangian:

S[r(t)] = ∫_{t₁}^{t₂} L(r(t), ṙ(t)) dt

The significance of the action comes from the fact that the trajectory followed by the system during its motion is the one which makes S stationary (or minimal). This is known as the principle of least action. Viewed as a principle, this is a restatement of a physical principle in the context of a general formulation, one which offers a principled way of selecting a path among all possible paths a particle could follow. In this regard, the regularization framework in learning theory is also a principle of least action, offering a principled way of selecting a solution in a class of possible solutions. To make the connection clear we proceed as follows. Let f_α be a signal belonging to the set F = { f_α : α ∈ A }, where A is a set of abstract parameters over which a probability distribution P(α) is defined. For example, the f_α could be the set of daughter wavelets, and P(α) models the probability distribution of images of that type in the space of all images. We are interested in finding compact or sparse representations for the signals of a certain class, representations which explicitly take advantage of the statistical structure of the signals. Images can always be represented by arrays of gray-level values, but obviously this does not provide a good representation. It could turn out that in a certain class of images certain pixels always have the same value, or are highly correlated with their nearest neighbors. In this case, using the whole array of gray levels to represent a signal of that class is not advisable. One approach to representing a signal consists in expanding the signal over a set of basis functions, and then using the coefficients of the expansion as the representation (typical examples are standard Fourier and wavelet series). The disadvantage of these schemes is that they use a fixed set of basis functions, and do not exploit the statistical properties of the signal. Another alternative is Principal Component Analysis (PCA), which provides the best orthogonal basis for the reconstruction of the signal. The problem with PCA is that it provides a compact representation only if there is a linear structure underlying the class of signals (that is, if signals can be represented by linear combinations of a few other prototypical signals). In our earlier research, we have suggested alternatives to PCA using local-to-global methods from differential topology and geometry [13]. Such nonlinear generalizations are still not adequate for our desired sparse representations. As in Fourier or wavelet expansions, we start with a fixed set of basis functions, which we call a dictionary:

Φ = {φ_i(x)}_{i=1}^N, where N is possibly very large. The basis functions φ_i will be called features. The basis functions φ_i are not orthogonal; if they were orthogonal, the N×N matrix M_ij = ⟨φ_i, φ_j⟩ would be diagonal. The idea is therefore to select, by analyzing a set of signals, which basis functions are useful for the representation. As in PCA, we want to use information about the statistics of the signal to find a good representation. However, unlike PCA, we do not wish to construct the basis functions from statistical information, but rather to select, from a possibly large set, which basis functions are useful for the representation of the signal.
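The role of the Gram matrix M and of its inverse (which defines the dual basis used below) can be illustrated with a small discretized dictionary. The sketch assumes a linearly independent, though non-orthogonal, set so that M is invertible; the random basis functions are purely illustrative:

```python
import numpy as np

# A small non-orthogonal dictionary of N discretized "basis functions".
# The dual basis dual_i = sum_j (M^{-1})_ij phi_j, with M_ij = <phi_i, phi_j>,
# is biorthogonal to the original: <dual_i, phi_j> = delta_ij.
rng = np.random.default_rng(0)
N, d = 5, 50
Phi = rng.standard_normal((N, d))        # rows: basis functions phi_i
M = Phi @ Phi.T                          # Gram matrix M_ij = <phi_i, phi_j>
Dual = np.linalg.inv(M) @ Phi            # rows: dual functions

print(np.allclose(Dual @ Phi.T, np.eye(N)))   # True: biorthogonality

# Best L^2 approximation of a signal f in span(Phi):
f = rng.standard_normal(d)
coeffs = Dual @ f                        # <f, dual_i>
f_hat = coeffs @ Phi                     # sum_i <f, dual_i> phi_i
# f_hat equals the orthogonal projection of f onto span{phi_i}.
P = Phi.T @ np.linalg.inv(M) @ Phi       # projection matrix onto the span
print(np.allclose(f_hat, P @ f))         # True
```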

We formulate here a simple version of the problem, motivated by the following practical problem: we wish to represent images of human faces. We have empirical reasons to believe that over-sampled Haar wavelets would provide a good dictionary. If we represent an image with over-sampled Haar wavelets we could end up with more than a thousand coefficients. We want to discover which set of coefficients can be ignored (deleted) from the reconstruction formula without making, on average, a large error. One way to formulate the problem above is the following. Given the dictionary Φ, we know that the best approximation to a function f_α, in the L² norm, is given by

f_α(x) = Σ_{i=1}^N ⟨f_α, φ̃_i⟩ φ_i(x)    (1)

where Φ̃ = {φ̃_i(x)}_{i=1}^N is the dual basis, defined as follows:

φ̃_i(x) = Σ_{j=1}^N (M⁻¹)_ij φ_j(x)    (2)

where M is the matrix defined above. We propose to find the set of relevant basis functions by solving the following problem:

min_ξ H[ξ] ≡ E[ ‖ f_α − Σ_{i=1}^N ⟨f_α, φ̃_i⟩ φ_i(x) ξ_i ‖² ] + λ Σ_{i=1}^N ξ_i    (3)

where the ξ_i are binary random variables with values in {0, 1}, E[·] denotes the expectation (integral) with respect to P(α), and λ is a positive parameter. The variables ξ_i have the role of "switches", which select the subset of basis functions that we want to use for the reconstruction. In this setting, the analogy to the Lagrangian formulation is clear. This setting in the context of regularization theory has been clearly elaborated in [14]. The features are those basis functions for which ξ_i = 1. The first term in equation (3) measures the reconstruction error over a set of features and the second term "counts" the number of features. The rationale for this variational problem is that we are looking for a set of features which has at the same time a low cardinality and a small average reconstruction error, where the average is taken over the set of signals F with probability distribution P(α). The parameter λ controls the number of features that we want, and therefore the sparsity of the representation. If the basis functions φ_i were orthogonal to each other, we could simply choose as features the basis functions for which the coefficients ⟨f_α, φ̃_i⟩ are, on average, "large". However, since the basis functions φ_i are not orthogonal and possibly redundant, the choice of the features is not trivial. Expanding the square and disregarding constant terms in (3) we have:

H[ξ] = −2 Σ_{i=1}^N ξ_i E[⟨f_α, φ_i⟩⟨f_α, φ̃_i⟩] + Σ_{i,j=1}^N ξ_i ξ_j E[⟨f_α, φ̃_i⟩⟨f_α, φ̃_j⟩] M_ij + λ Σ_{i=1}^N ξ_i    (4)

Let us now define the correlation function of the random signals f_α as:

G(x, y) = E[ f_α(x) f_α(y) ]    (5)

We now rewrite some of the terms in the functional (4):

h_i ≡ E[⟨f_α, φ_i⟩⟨f_α, φ̃_i⟩] = E[ ∫ dx dy f_α(x) φ_i(x) f_α(y) φ̃_i(y) ]
    = ∫ dx dy E[f_α(x) f_α(y)] φ_i(x) φ̃_i(y) = ∫ dx dy G(x, y) φ_i(x) φ̃_i(y)    (6)

and, with a similar computation:

H_ij ≡ E[⟨f_α, φ̃_i⟩⟨f_α, φ̃_j⟩] M_ij = M_ij ∫ dx dy G(x, y) φ̃_i(x) φ̃_j(y)    (7)

The functional can therefore be written as:

H[ξ] = ξ · Hξ + ξ · (λ1 − 2h)    (8)

Since the variables ξ_i are binary, the minimization of the functional (8) is an Integer Programming (IP) problem, which is well known to be difficult. We have at least four options to solve the problem.
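One option, feasible only for very small N, is exhaustive search over the 2^N binary vectors. The sketch below builds h and H empirically from a sample of random signals (our own stand-in for the ensemble f_α) and minimizes the quadratic form of equation (8) by brute force:

```python
import itertools
import numpy as np

# Toy instance of the feature-selection functional (8):
#   H[xi] = xi . H xi + xi . (lambda * 1 - 2 h),
# minimized over binary xi by exhaustive search (2^N candidates).
rng = np.random.default_rng(1)
N, d, n_signals = 8, 30, 200

Phi = rng.standard_normal((N, d))             # dictionary, rows phi_i
M = Phi @ Phi.T                               # Gram matrix
Dual = np.linalg.inv(M) @ Phi                 # dual basis
F = rng.standard_normal((n_signals, d))       # sample of signals f_alpha

C = F @ Dual.T                                # coefficients <f, dual_i>
h = np.mean((F @ Phi.T) * C, axis=0)          # h_i = E[<f,phi_i><f,dual_i>]
H = np.mean(C[:, :, None] * C[:, None, :], axis=0) * M   # H_ij, eq. (7)

lam = 0.5
best_val, best_xi = np.inf, None
for bits in itertools.product([0, 1], repeat=N):
    xi = np.array(bits)
    val = xi @ H @ xi + xi @ (lam * np.ones(N) - 2 * h)
    if val < best_val:
        best_val, best_xi = val, xi
print(best_xi, best_val)
```

Since ξ = 0 always achieves H[ξ] = 0, the minimum found is never positive; raising λ drives more switches to zero, i.e. a sparser feature set.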

4. MODULI SPACE OF SIGNALS
Many mathematical objects naturally occur in families that are described by certain parameters called moduli (if we first divide out an appropriate group of automorphisms). Loosely speaking, a moduli space can be thought of as a "useful" parameterization of another space. Often an initial goal for a mathematical theory of some new kind of object is a "classification theorem"; this is, roughly speaking, discovering the space of moduli. The next step is usually a detailed investigation of the moduli space to see how various properties of the object depend on the moduli, and to see which values of the moduli give rise to objects with special properties. By way of example, consider the following: an ellipse can be described by five parameters, say the coefficients of its implicit equation. If we divide out by the rigid motions of the plane, then two moduli suffice, namely the lengths of the two semi-axes. Notice that when the two semi-axes are equal, the ellipse becomes a circle, which has a bigger symmetry group than the generic ellipse. Here is another example: a compact Riemann surface X of genus g. To 'build' such a surface X, one can start with a compact two-dimensional manifold (where g is just the number of 'holes'), and then define a holomorphic structure on it. It turns out that the resulting object has two descriptions. The first description is as a complex algebraic curve: the set of simultaneous zeroes of finitely many polynomial equations with complex coefficients. The second description is as the quotient of the complex upper half-plane by a discrete group of isometries of the hyperbolic metric. Deformation of the discrete subgroup, or equivalently a change of equations, causes corresponding changes in the Riemann surface. The space that parametrizes these deformations is a geometric object of great mathematical significance, the 'moduli space of curves', Mg.
Similarly, the moduli space of pseudo-spherical surfaces can be identified with the space of solutions of the Sine-Gordon equation. The latter contains certain n-parameter families (the pure n-soliton solutions) that correspond to particularly interesting surfaces. (In particular, the 1-solitons form the well-known Dini family that contains the pseudosphere.) In a number of situations, even when the moduli space is not finite dimensional, it will contain special curves or points with interesting properties. For example, minimal surfaces come in one-parameter families, the so-called associate families. In this case, all members of a family are isometric (though not ambient isometric). One can use the associate family parameter as a transformation parameter; this has been used extensively in computer graphics to produce stunningly beautiful morphs.

5. SIMULATIONS
The setup for our simulation is as follows. We start with a set of faces and wish to study how we may parameterize a sparse representation for the concept of "faceness" using our overcomplete representation. For our overcomplete representation we use the Haar wavelet system. In the case of images, this wavelet system is parameterized by a dyadic set of parameters which are powers of 2 (this is not required for our computation). This system defines an orthogonal set and is not overcomplete. However, for the purposes of our simulation it is sufficient, in the sense that it demonstrates features that motivate a complete study. We then use this system to compute an expansion of each face in the Haar wavelet basis. The Fourier transform of the Haar wavelet coefficients is then used for our analysis. To demonstrate the validity of the computational scheme we have outlined, we compute the principal components of the Haar wavelet coefficients in the Fourier basis and reconstruct the first principal component. Figure 1 presents these results.

The database of faces.

Figure 1. The first principal component, presented as a surface with intensities plotted as height.
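The pipeline just described (Haar wavelet expansion, Fourier transform of the coefficients, then PCA) can be sketched as follows; random arrays stand in for the face database, and the Haar transform is a textbook orthonormal implementation rather than the authors' code:

```python
import numpy as np

def haar2d(img):
    """Multi-level orthonormal 2D Haar transform of a square image
    whose side length is a power of 2 (separable averaging/differencing)."""
    out = img.astype(float)
    n = out.shape[0]
    while n > 1:
        for axis in (0, 1):
            a = np.take(out[:n, :n], range(0, n, 2), axis=axis)  # even samples
            b = np.take(out[:n, :n], range(1, n, 2), axis=axis)  # odd samples
            lo, hi = (a + b) / np.sqrt(2), (a - b) / np.sqrt(2)
            out[:n, :n] = np.concatenate([lo, hi], axis=axis)
        n //= 2          # recurse on the low-pass (top-left) block
    return out

rng = np.random.default_rng(2)
images = rng.random((40, 32, 32))      # 40 synthetic 32x32 "faces"

# Haar coefficients, then magnitude of their 2D Fourier transform.
feats = np.array([np.abs(np.fft.fft2(haar2d(im))).ravel() for im in images])

# PCA via SVD of the centered feature matrix; keep the first component.
feats -= feats.mean(axis=0)
_, _, Vt = np.linalg.svd(feats, full_matrices=False)
pc1 = Vt[0].reshape(32, 32)            # first principal component
print(pc1.shape)                       # (32, 32)
```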

To study the moduli space we used the overcomplete coefficients and studied the sets of coefficients according to scale. Specifically, we performed a PCA at each scale and constructed Haar wavelet coefficients from each scale. The Haar wavelet coefficients were grouped according to their eigenvalues (highest to lowest). The reconstructed Haar wavelet coefficients were then used to represent the signal. Figure 2 presents a summary of these results.

Figure 2. Top, from left to right: the first, second, and third principal components. Bottom: the Haar coefficients corresponding to the first, second, and third principal components, respectively.

6. CONCLUSIONS
Our research has a long-term objective: to investigate the cognitive processes that underlie our perception of geometric forms. The time is ripe to address this question in the realm of cognitive neuroscience. While our proposed computational model serves to investigate and support cognitive theories, it also reveals the potential for a refined approach to cognitive and biological models of visual information processing channels. One can generally agree that the increasingly rapid pace of advances in our understanding of the biology of the brain, together with advances in computational power, will open new ways to investigate information processing in the brain. Thus, we are optimistic that in the foreseeable future we will have the scientific tools to understand the neuronal substrates of the low-level computations of visual, tactile, motor and auditory processes that contribute to our perception of symmetry and regularity in structure.

REFERENCES
1. Assadi, A., Palmer, S., and Eghbalnia, H., Eds., "Learning Gestalt of Surfaces in Natural Scenes," Proceedings of the IEEE Int. Conf. on Neural Networks in Signal Processing, 1998.
2. Corballis, M. C. and Roldan, C. E., "On the perception of symmetrical and repeated patterns," Percept. Psychophys. 16: 136-142, 1974.
3. Dakin, S. C. and Watt, R. J., "Detection of bilateral symmetry using spatial filters," in Human Symmetry Perception, C. W. Tyler, Ed., VSP BV, Utrecht, the Netherlands, pp. 187-207, 1996.
4. Davies, E. R., Machine Vision: Theory, Algorithms, Practicalities, Academic Press, San Diego, 1997.
5. Haralick, R. M. and Shapiro, L. G., Computer and Robot Vision, Vol. 1, 1988.
6. Hubel, D., Eye, Brain, and Vision, W. H. Freeman and Company, 1995.
7. Vaidyanathan, P., Multirate Systems and Filter Banks, Prentice-Hall, 1993.
8. Vetterli, M. and Herley, C., "Wavelets and Filter Banks: Theory and Design," IEEE Transactions on Signal Processing, 40(9), September 1992.
9. Wickerhauser, M., Adapted Wavelet Analysis from Theory to Software, A. K. Peters, 1994.
10. Coifman, R., Meyer, Y., and Wickerhauser, M., "Wavelet Analysis and Signal Processing," preprint, Yale University, New Haven, Connecticut, 1991.
11. Wickerhauser, M., "Acoustic Signal Compression with Wavelet Packets," in Wavelets: A Tutorial in Theory and Applications, Academic Press, 1992.
12. Heil, C. and Walnut, D., "Continuous and Discrete Wavelet Transforms," SIAM Review, 31(4), pp. 628-666, 1989.
13. Assadi, A. and Eghbalnia, H., "Nonlinear Methods for Clustering and Reduction of Dimensionality," in Proceedings of IJCNN'99, International Joint Conference on Neural Networks, 1999.
14. Girosi, F., "An Equivalence between Sparse Approximation and Support Vector Machines," Neural Computation, 10(6), 1455-1480, 1998.