Matching pursuits with time-frequency dictionaries

IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 41, NO. 12, DECEMBER 1993


Matching Pursuits With Time-Frequency Dictionaries Stephane G. Mallat, Member, IEEE, and Zhifeng Zhang

Abstract-We introduce an algorithm, called matching pursuit, that decomposes any signal into a linear expansion of waveforms that are selected from a redundant dictionary of functions. These waveforms are chosen in order to best match the signal structures. Matching pursuits are general procedures to compute adaptive signal representations. With a dictionary of Gabor functions, a matching pursuit defines an adaptive time-frequency transform. We derive a signal energy distribution in the time-frequency plane, which does not include interference terms, unlike Wigner and Cohen class distributions. A matching pursuit isolates the signal structures that are coherent with respect to a given dictionary. An application to pattern extraction from noisy signals is described. We compare a matching pursuit decomposition with a signal expansion over an optimized wavepacket orthonormal basis, selected with the algorithm of Coifman and Wickerhauser.

I. INTRODUCTION

We can express a wide range of ideas and at the same time easily communicate subtle differences between close concepts, because natural languages have large vocabularies that include words with close meanings. For information processing, low level signal representations must also provide explicit information on very different properties, while giving simple cues to differentiate close patterns. The numerical parameters should offer compact characterizations of the elements we are looking for. The wide scope of patterns embedded in complex signals and the precision of their characterization also motivate decompositions over large and redundant dictionaries of waveforms. Linear expansions in a single basis, whether it is a Fourier, wavelet, or any other basis, are not flexible enough. A Fourier basis provides a poor representation of functions well localized in time, and wavelet bases are not well adapted to represent functions whose Fourier transforms have a narrow high frequency support. In both cases, it is difficult to detect and identify the signal patterns from their expansion coefficients, because the information is diluted across the whole basis. Similar examples can be found for any type of basis. Such decompositions are similar to a text written with a small vocabulary. Although this vocabulary might be sufficient to express all

Manuscript received August 28, 1992; revised May 5, 1993. The Guest Editor coordinating the review of this paper and approving it for publication was Dr. Ahmed Tewfik. This work was supported in part by the Air Force Office of Scientific Research under Grant F49620-1-0102, in part by the Office of Naval Research under Grant N00014-91-J-1967, and in part by the Alfred Sloan Foundation. The authors are with the Department of Computer Science, Courant Institute of Mathematical Sciences, New York University, New York, NY 10012. IEEE Log Number 9212192.

ideas, it requires the use of circumlocutions that replace unavailable words by full sentences. Flexible decompositions are particularly important for representing signal components whose localizations in time and frequency vary widely. The signal must be expanded into waveforms whose time-frequency properties are adapted to its local structures. Such waveforms are called time-frequency atoms. For example, impulses need to be decomposed over functions well concentrated in time, while spectral lines are better represented by waveforms that have a narrow frequency support. When the signal includes both of these elements, the time-frequency atoms must be adapted accordingly. One must therefore introduce a procedure that chooses, among all the time-frequency atoms of a large dictionary, the waveforms that are best adapted to decompose the signal structures. Section II briefly reviews the properties of time-frequency atoms and their relations to window Fourier transforms and wavelet transforms. We introduce an algorithm called matching pursuit that decomposes any signal into a linear expansion of waveforms that belong to a redundant dictionary of functions. These waveforms are selected in order to best match the signal structures. Although a matching pursuit is nonlinear, like an orthogonal expansion it maintains an energy conservation, which guarantees its convergence. It is closely related to the projection pursuit strategies developed by Friedman and Stuetzle [7] for statistical parameter estimation. The general algorithm in the Hilbert space framework is explained in Section III and the finite dimensional case is further studied in Section IV. The application of matching pursuits to adaptive time-frequency decomposition is described in Section V. The signal is decomposed into waveforms selected among a dictionary of time-frequency atoms that are the dilations, translations, and modulations of a single window function.
We derive a time-frequency energy distribution by adding the Wigner distributions of the selected time-frequency atoms. Contrary to the Wigner distribution or Cohen's class distributions, this energy distribution does not include interference terms and thus provides a clear picture in the time-frequency plane. Qian and Chen [14] have independently developed a similar algorithm to expand signals over time-frequency atoms. A fast implementation of the matching pursuit for dictionaries of Gabor time-frequency atoms is described in Section VI, with numerical examples. A matching pursuit decomposition provides an interpretation of the signal structures. If a structure does not

1053-587X/93$03.00 © 1993 IEEE


correlate well with any particular dictionary element, it is subdecomposed into several elements and its information is diluted. Section VII formally defines coherent signal structures with respect to a given dictionary, and explains how to detect them. An application to the extraction of patterns from noisy signals is described. A matching pursuit is a greedy algorithm that chooses at each iteration a waveform that is best adapted to approximate part of the signal. Section VIII compares this locally adaptive method to the algorithm of Coifman and Wickerhauser [4], which selects, among all bases of a wavepacket family, the basis that is best adapted to the global signal properties. Numerical results show that the global optimization does not perform well for highly nonstationary signals, as opposed to the greedy approach of a matching pursuit. On the other hand, the best basis algorithm is efficient to represent simpler signals that have stationary properties.

Notations

The space L²(R) is the Hilbert space of complex valued functions such that

||f||² = ∫_{-∞}^{+∞} |f(t)|² dt < +∞.  (1)

The inner product of (f, g) ∈ L²(R)² is defined by

⟨f, g⟩ = ∫_{-∞}^{+∞} f(t) ḡ(t) dt  (2)

where ḡ(t) is the complex conjugate of g(t). The Fourier transform of f(t) ∈ L²(R) is written f̂(ω) and defined by

f̂(ω) = ∫_{-∞}^{+∞} f(t) e^{-iωt} dt.  (3)

II. TIME-FREQUENCY ATOMIC DECOMPOSITIONS

Decompositions of signals over families of functions that are well localized both in time and frequency have found many applications in signal processing and harmonic analysis. Such functions are called time-frequency atoms. Depending upon the choice of time-frequency atoms, the decomposition might have very different properties. Window Fourier transforms and wavelet transforms are examples of time-frequency signal decompositions that have been studied thoroughly [2], [5], [13], [15]. To extract information from complex signals, it is often necessary to adapt the time-frequency decomposition to the particular signal structures. This section discusses the adaptivity requirements. A general family of time-frequency atoms can be generated by scaling, translating and modulating a single window function g(t) ∈ L²(R). We suppose that g(t) is real, continuously differentiable and O(1/(t² + 1)). We also impose that ||g|| = 1, that the integral of g(t) is nonzero, and that g(0) ≠ 0. For any scale s > 0, frequency modulation ξ and translation u, we denote γ = (s, u, ξ) and define

g_γ(t) = (1/√s) g((t - u)/s) e^{iξt}.  (4)

The index γ is an element of the set Γ = R⁺ × R². The factor 1/√s normalizes to 1 the norm of g_γ(t). If g(t) is even, which is generally the case, g_γ(t) is centered at the abscissa u. Its energy is mostly concentrated in a neighborhood of u, whose size is proportional to s. Let ĝ(ω) be the Fourier transform of g(t). Equation (4) yields

ĝ_γ(ω) = √s ĝ(s(ω - ξ)) e^{-i(ω-ξ)u}.  (5)

Since |ĝ(ω)| is even, |ĝ_γ(ω)| is centered at the frequency ω = ξ. Its energy is concentrated in a neighborhood of ξ, whose size is proportional to 1/s. The family D = (g_γ(t))_{γ∈Γ} is extremely redundant, and its properties have been studied by Torresani [17]. To represent efficiently any function f(t), we must select an appropriate countable subset of atoms (g_{γ_n}(t))_{n∈N}, with γ_n = (s_n, u_n, ξ_n), so that f(t) can be written

f(t) = Σ_{n=-∞}^{+∞} a_n g_{γ_n}(t).  (6)

Depending upon the choice of the atoms g_{γ_n}(t), the expansion coefficients a_n give explicit information on certain types of properties of f(t). Window Fourier transforms and wavelet transforms correspond to different families of time-frequency atoms, that are frames or bases of L²(R).
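As a concrete illustration of the atom definition (4), not taken from the paper, an atom with a Gaussian window can be sampled and its normalization and time localization checked numerically; the window, grid, and parameter values below are arbitrary choices:

```python
import numpy as np

def gabor_atom(t, s, u, xi):
    """Complex time-frequency atom g_gamma(t) = s^(-1/2) g((t-u)/s) e^{i xi t},
    here with the unit-norm Gaussian window g(t) = 2^(1/4) exp(-pi t^2)."""
    g = 2 ** 0.25 * np.exp(-np.pi * ((t - u) / s) ** 2)
    return g / np.sqrt(s) * np.exp(1j * xi * t)

# Sample a dilated, translated, modulated atom on a fine grid.
t = np.linspace(-20, 20, 40001)
dt = t[1] - t[0]
atom = gabor_atom(t, s=2.0, u=3.0, xi=5.0)

# The factor 1/sqrt(s) preserves the unit L2 norm under dilation.
norm = np.sqrt(np.sum(np.abs(atom) ** 2) * dt)
print(round(float(norm), 4))    # 1.0

# The energy is concentrated around the translation parameter u.
center = np.sum(t * np.abs(atom) ** 2 * dt)
print(round(float(center), 4))  # 3.0
```

The modulation e^{iξt} shifts the atom in frequency without affecting |g_γ(t)|, which is why the time center stays at u.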

In a window Fourier transform, all the atoms g_{γ_n} have a constant scale s_n = s_0 and are thus mainly localized over an interval whose size is proportional to s_0. If the main signal structures are localized over a time scale of the order of s_0, the expansion coefficients a_n give important insights on their localization and frequency content. However, a window Fourier transform is not well adapted to describe structures that are much smaller or much larger than s_0. To analyze components of varying sizes, it is necessary to use time-frequency atoms of different scales. In contrast to the window Fourier transform, the wavelet transform decomposes signals over time-frequency atoms of varying scales, called wavelets. A wavelet family (g_{γ_n}(t))_{n∈N} is built by relating the frequency parameter ξ_n to the scale s_n with ξ_n = ξ_0/s_n, where ξ_0 is a constant. The resulting family is composed of dilations and translations of a single function, multiplied by a complex phase parameter. The expansion coefficients a_n of functions over wavelet families characterize the scaling behavior of signal structures. This is important for the analysis of fractals and singular behaviors. However, expansion coefficients in a wavelet frame do not provide precise estimates of the frequency content of waveforms whose Fourier transform is well localized, especially at high frequencies. This is due to the restriction on the frequency parameter ξ_n, which remains inversely proportional to the scale s_n. For signals f(t) that include scaling and highly oscillatory structures, one cannot define a priori the appropriate constraints on the scale and modulation parameters of the time-frequency atoms g_{γ_n}(t) used in the expansion (6). We need to select the elements of the dictionary D = (g_γ(t))_{γ∈Γ} adaptively, depending upon the local properties of f(t).

III. MATCHING PURSUIT IN HILBERT SPACES

The general issue behind adaptive time-frequency decompositions is to find procedures that expand functions over a set of waveforms selected appropriately among a large and redundant dictionary. We describe a general algorithm, called matching pursuit, that performs such an adaptive decomposition. Let H be a Hilbert space. We define a dictionary as a family D = (g_γ)_{γ∈Γ} of vectors in H such that ||g_γ|| = 1. Let V be the closed linear span of the dictionary vectors. Finite linear expansions of vectors in D are dense in the space V. We say that the dictionary is complete if and only if V = H. For the dictionary of time-frequency atoms described in Section II, H = L²(R), and each vector g_γ is an atom defined by (4). Finite linear expansions of time-frequency atoms are dense in L²(R) [17], hence this dictionary is complete. Let f ∈ H. We want to compute a linear expansion of f over a set of vectors selected from D, in order to best match its inner structures. This is done by successive approximations of f with orthogonal projections on elements of D. Let g_{γ_0} ∈ D. The vector f can be decomposed into

f = ⟨f, g_{γ_0}⟩ g_{γ_0} + Rf  (7)

where Rf is the residual vector after approximating f in the direction of g_{γ_0}. Clearly g_{γ_0} is orthogonal to Rf, hence

||f||² = |⟨f, g_{γ_0}⟩|² + ||Rf||².  (8)
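The successive approximation defined by (7) and (8) can be sketched in a finite-dimensional setting. The dictionary, signal, and iteration count below are arbitrary; this is a plain illustration of the greedy recursion, not the paper's implementation:

```python
import numpy as np

def matching_pursuit(f, D, n_iter):
    """Greedy matching pursuit of f over dictionary D (columns = unit-norm atoms).
    At each step the residue is projected on the best-matching atom, as in (7),
    so the energy splits orthogonally as in (8)."""
    residue = f.astype(float).copy()
    coeffs, indices = [], []
    for _ in range(n_iter):
        inner = D.T @ residue                   # <R^n f, g_gamma> for every atom
        k = int(np.argmax(np.abs(inner)))       # best-matching atom
        coeffs.append(inner[k])
        indices.append(k)
        residue = residue - inner[k] * D[:, k]  # new residue, orthogonal to g_k
    return np.array(coeffs), np.array(indices), residue

rng = np.random.default_rng(0)
N = 64
# A redundant dictionary: union of the canonical basis and random unit vectors.
D = np.hstack([np.eye(N), rng.standard_normal((N, 3 * N))])
D /= np.linalg.norm(D, axis=0)

f = rng.standard_normal(N)
coeffs, idx, residue = matching_pursuit(f, D, n_iter=100)

# Energy conservation: ||f||^2 = sum |<R^n f, g_n>|^2 + ||R^p f||^2.
lhs = np.linalg.norm(f) ** 2
rhs = np.sum(coeffs ** 2) + np.linalg.norm(residue) ** 2
print(np.isclose(lhs, rhs))  # True
```

Because each step removes the orthogonal projection on the selected atom, the residue norm never increases, which is the energy conservation invoked throughout the paper.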

is quantized by approximating it with the closest scalar stored in the codebook. Vector quantization algorithms can be extended with a multistage strategy [9]. After quantizing a given vector, the remaining error is quantized once more, and the process continues iteratively. A matching pursuit is similar to a multistage shape-gain vector quantizer. However, as opposed to this vector quantizer, a matching pursuit does not quantize the inner products ⟨R^n f, g_{γ_n}⟩. For information processing applications, matching pursuits use very redundant dictionaries of infinite size, whereas vector quantizers are based on finite dictionaries that are best adapted to data compression. Another major difference is that vector quantizations are performed in spaces of low dimension, generally smaller than 16. For example, image quantizers are based on blocks of less than 4 by 4 pixels. On the contrary, a matching pursuit is performed in a signal space H whose dimension N is equal to the total number of signal samples, which is typically several thousand. The underlying mathematical and algorithmic issues are thus quite different.

IV. MATCHING PURSUIT IN FINITE SPACES

When the signal space H has a finite dimension N, the matching pursuit has specific properties that are studied in this section. The dictionary D may have an infinite number of elements and we suppose that it is complete. We describe an efficient implementation of matching pursuit algorithms and prove that the norm of the residues decays exponentially. When the dictionary is very redundant, the search for the vectors that best match the signal residues can mostly be limited to a subdictionary D_α = (g_γ)_{γ∈Γ_α} ⊂ D. We suppose that Γ_α is a finite index set included in Γ such that for any f ∈ H

sup_{γ∈Γ_α} |⟨f, g_γ⟩| ≥ α sup_{γ∈Γ} |⟨f, g_γ⟩|.  (30)

Depending upon α and the dictionary redundancy, the set Γ_α can be much smaller than Γ. The matching pursuit is initialized by computing the inner products (⟨f, g_γ⟩)_{γ∈Γ_α}, and continues by induction as follows. Suppose that we have already computed (⟨R^n f, g_γ⟩)_{γ∈Γ_α}, for n ≥ 0. We search in D_α for an element g_{γ̃_n} such that

|⟨R^n f, g_{γ̃_n}⟩| = sup_{γ∈Γ_α} |⟨R^n f, g_γ⟩|.  (31)

To find a dictionary element that matches the residue R^n f even better than g_{γ̃_n}, we then search with a Newton method for an index γ_n in a neighborhood of γ̃_n in Γ where |⟨R^n f, g_γ⟩| reaches a local maximum. Clearly

|⟨R^n f, g_{γ_n}⟩| ≥ |⟨R^n f, g_{γ̃_n}⟩| ≥ α sup_{γ∈Γ} |⟨R^n f, g_γ⟩|.  (32)

Let us observe that the choice function mentioned in Section III is defined indirectly by this double search strategy. Once the vector g_{γ_n} is selected, we compute the inner product of the new residue R^{n+1} f with any g_γ ∈ D_α, with an updating formula derived from (12)

⟨R^{n+1} f, g_γ⟩ = ⟨R^n f, g_γ⟩ - ⟨R^n f, g_{γ_n}⟩ ⟨g_{γ_n}, g_γ⟩.  (33)

Since we previously stored ⟨R^n f, g_γ⟩ and ⟨R^n f, g_{γ_n}⟩, this update only requires computing ⟨g_{γ_n}, g_γ⟩. Dictionaries are generally built so that this inner product is recovered with a small number of operations. The number of times we subdecompose the residues of a given signal f depends upon the desired precision ε. The number of iterations is the minimum p such that

||R^p f|| ≤ ε ||f||.  (34)

The energy conservation (17) proves that this equation is equivalent to

||f||² - Σ_{n=0}^{p-1} |⟨R^n f, g_{γ_n}⟩|² ≤ ε² ||f||².  (35)
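The update rule (33) and the stopping test (35) can be combined in a short sketch. The Gram matrix, dictionary, and tolerance below are arbitrary, and the table of inner products stands in for the paper's fast inner-product computations; this is an illustration, not the published algorithm:

```python
import numpy as np

def mp_with_update(f, D, eps):
    """Matching pursuit that keeps the table of inner products <R^n f, g_gamma>
    and refreshes it with the update rule (33), testing the stopping rule (35)
    without ever forming the residue explicitly."""
    G = D.T @ D                 # Gram matrix <g_gamma', g_gamma>
    c = D.T @ f                 # current table <R^n f, g_gamma>
    energy = f @ f              # ||f||^2
    captured = 0.0              # sum of |<R^n f, g_n>|^2 so far
    chosen = []
    while energy - captured > eps ** 2 * energy:   # stopping rule (35)
        k = int(np.argmax(np.abs(c)))
        a = c[k]
        chosen.append((k, a))
        captured += a ** 2
        c = c - a * G[:, k]     # update (33): subtract a <g_k, g_gamma>
        if len(chosen) >= D.shape[0]:              # safety cap at N atoms
            break
    return chosen

rng = np.random.default_rng(1)
N = 32
D = rng.standard_normal((N, 4 * N))
D /= np.linalg.norm(D, axis=0)
f = rng.standard_normal(N)

chosen = mp_with_update(f, D, eps=0.1)
captured = sum(a ** 2 for _, a in chosen)
# On exit, at least (1 - eps^2) of the signal energy has been captured.
print(captured >= (1 - 0.1 ** 2) * (f @ f) or len(chosen) == N)  # True
```

The point of (33) is that each iteration costs one column of the Gram matrix instead of a fresh correlation of the residue with every atom.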

Since we do not compute the residue R^n f, at each iteration we test the validity of (35) to stop the decomposition. The number of iterations p depends upon the decay rate of ||R^n f||. It can vary widely depending upon the signals but is much smaller than N in most applications. The energy of the residual error can be decreased with the back-projection algorithm described in Section III. Many types of dictionaries do not contain any subfamily of less than N + 1 vectors that are linearly dependent. In this case, once the matching pursuit has selected N different vectors, these vectors span the whole signal space H. Hence, after back-projection there is no residual error left and f is recovered as a linear expansion of the N selected vectors. However, this basis of H might be badly conditioned, which slows down the convergence of the back-projection algorithm. The decay of ||R^n f|| depends upon the correlation between the residues and the dictionary elements. Let us define the correlation ratio of a function f ∈ H with respect to D as

λ(f) = sup_{γ∈Γ} |⟨f, g_γ⟩| / ||f||.  (36)
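The correlation ratio (36) is easy to evaluate numerically. The sketch below, with an arbitrary choice of dictionary, also checks the statement made for an orthogonal basis, where the ratio is bounded below by 1/√N and the bound is attained by a vector with equal coefficients:

```python
import numpy as np

def correlation_ratio(f, D):
    """lambda(f) = sup_gamma |<f, g_gamma>| / ||f||, as in (36);
    columns of D are the unit-norm dictionary vectors."""
    return np.max(np.abs(D.T @ f)) / np.linalg.norm(f)

rng = np.random.default_rng(2)
N = 16
D = np.eye(N)                      # an orthonormal basis as dictionary

# Worst case: the "flat" vector whose coefficients all have equal magnitude.
flat = np.ones(N) / np.sqrt(N)
print(np.isclose(correlation_ratio(flat, D), 1 / np.sqrt(N)))  # True

# For every f, max |c_i|^2 >= (sum c_i^2)/N, hence lambda(f) >= 1/sqrt(N).
for _ in range(100):
    f = rng.standard_normal(N)
    assert correlation_ratio(f, D) >= 1 / np.sqrt(N) - 1e-12
```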

The following lemma guarantees that for any f ∈ H, λ(f) is larger than a strictly positive constant.

Lemma 1: Let D be a complete dictionary in a finite dimensional space H,

λ_min = inf_{f∈H} λ(f) > 0.  (37)

The proof of this lemma is in Appendix B. The value of λ_min is the cosine of the maximum possible angle between a direction of H and the closest direction of a dictionary vector. If D is an orthogonal basis, one can prove that λ_min = 1/√N. The next lemma guarantees that ||R^n f|| decays exponentially in a finite dimensional space, with a rate proportional to α² λ_min².

Lemma 2: Let f ∈ H. For any m > 0

||R^m f|| ≤ ||f|| (1 - α² λ_min²)^{m/2}.  (38)

Proof: The matching pursuit chooses a vector g_{γ_n} that satisfies

|⟨R^n f, g_{γ_n}⟩| ≥ α sup_{γ∈Γ} |⟨R^n f, g_γ⟩| = α λ(R^n f) ||R^n f||.  (39)

Since ||R^{n+1} f||² = ||R^n f||² - |⟨R^n f, g_{γ_n}⟩|²,

||R^{n+1} f|| ≤ ||R^n f|| (1 - α² λ²(R^n f))^{1/2}  (40)

and hence, for any m > 0,

||R^m f|| ≤ ||f|| ∏_{n=0}^{m-1} (1 - α² λ²(R^n f))^{1/2}.  (41)

Since λ(R^n f) ≥ λ_min for every n, this yields (38).

The lower the correlation ratios of a particular signal f and its residues, the slower the decay of their norms. If the signal f is the sum of a few high energy components that belong to the dictionary, the correlation ratios of f and its residues are high, so their norms decrease quickly. These high energy components can be viewed as "coherent structures" with respect to the dictionary. If the residues of f have low correlation ratios, their norms decay slowly and f must be expanded over many dictionary vectors in order to be well approximated. This means that the information of f is diluted across the dictionary. The extraction of coherent signal structures is further studied in Section VII.

V. MATCHING PURSUIT WITH TIME-FREQUENCY DICTIONARIES

For dictionaries of time-frequency atoms, a matching pursuit yields an adaptive time-frequency transform. It decomposes any function f(t) ∈ L²(R) into a sum of complex time-frequency atoms that best match its residues. This section studies the properties of this particular matching pursuit decomposition. We derive a new type of time-frequency energy distribution by summing the Wigner distribution of each time-frequency atom. Since a time-frequency atom dictionary is complete, Theorem 1 proves that a matching pursuit decomposes any function f ∈ L²(R) into

f = Σ_{n=0}^{+∞} ⟨R^n f, g_{γ_n}⟩ g_{γ_n}.  (42)

These atoms are chosen to best match the residues of f. The matching pursuit algorithm depends upon a choice function that selects at each iteration a vector g_{γ_n} among all vectors that satisfy (11). Appendix C proves that we can define choice functions for which the matching pursuit is covariant by dilation, translation and modulation. Suppose that f¹(t) is derived from f⁰(t) by the dilation, translation and modulation

f¹(t) = (1/√a) f⁰((t - c)/a) e^{ibt}.  (43)

Let us denote (g_{γ_n^0})_{n∈N} and (g_{γ_n^1})_{n∈N}, with γ_n^0 = (s_n^0, u_n^0, ξ_n^0) and γ_n^1 = (s_n^1, u_n^1, ξ_n^1), the families of time-frequency atoms selected to decompose respectively f⁰(t) and f¹(t). Appendix C proves that there exists a class of choice functions such that

R^n f¹(t) = (1/√a) R^n f⁰((t - c)/a) e^{ibt}  (44)

if and only if for all n ≥ 0

s_n^0 = s_n^1/a,  u_n^0 = (u_n^1 - c)/a,  ξ_n^0 = a(ξ_n^1 - b)  (45)

and

⟨R^n f¹, g_{γ_n^1}⟩ = e^{i(b-ξ_n^1)c} ⟨R^n f⁰, g_{γ_n^0}⟩.  (46)

The translation, modulation, and dilation of a function appear as simple modifications of the selected atom indexes. The covariance through dilation, translation and modulation is important to perform a signal analysis that takes into account any of these transformations.

From the decomposition of any f(t) within a time-frequency dictionary we derive a new time-frequency energy distribution, by adding the Wigner distribution of each selected atom. Let us recall that the cross Wigner distribution of two functions f(t) and h(t) is defined by

W[f, h](t, ω) = (1/2π) ∫_{-∞}^{+∞} f(t + τ/2) h̄(t - τ/2) e^{-iωτ} dτ.  (47)

The Wigner distribution of f(t) is Wf(t, ω) = W[f, f](t, ω). Since the Wigner distribution is quadratic, we derive from the atomic decomposition (42) of f(t) that

Wf(t, ω) = Σ_{n=0}^{+∞} |⟨R^n f, g_{γ_n}⟩|² Wg_{γ_n}(t, ω) + Σ_{n=0}^{+∞} Σ_{m=0, m≠n}^{+∞} ⟨R^n f, g_{γ_n}⟩ ⟨R^m f, g_{γ_m}⟩* W[g_{γ_n}, g_{γ_m}](t, ω).  (48)
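The interference produced by the quadratic cross terms can be seen in a small numerical sketch. The discretization below is a simplified periodized analog of (47), with an arbitrary test signal made of two separated Gaussian bumps; discrete Wigner conventions vary, so this is an illustration only:

```python
import numpy as np

def discrete_wigner(f):
    """Simplified periodized discrete analog of the Wigner distribution (47):
    W[n, k] = sum_m f[(n+m) mod N] * conj(f[(n-m) mod N]) * e^{-i 2 pi k m / N}."""
    N = len(f)
    idx = np.arange(N)
    W = np.empty((N, N), dtype=complex)
    for n in idx:
        r = f[(n + idx) % N] * np.conj(f[(n - idx) % N])
        W[n] = np.fft.fft(r)        # FFT over the lag variable m
    return W

# Two Gaussian atoms separated in time: the quadratic transform creates an
# interference term midway between them, although the signal is almost zero there.
N = 256
t = np.arange(N)
atom = lambda c: np.exp(-((t - c) / 8.0) ** 2)
f = atom(64) + atom(192)

W = discrete_wigner(f)
print(np.allclose(W.imag, 0, atol=1e-8))   # True: W is real valued

midpoint = np.abs(W[128]).max()            # cross term at t = 128
print(midpoint > 100 * abs(f[128]))        # True: energy appears where f vanishes
```

This is exactly the double sum of (48) at work: the cross Wigner distribution of the two atoms deposits spurious energy between their supports, which motivates keeping only the first sum.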

The double sum corresponds to the cross terms of the Wigner distribution. It regroups the terms that one usually tries to remove in order to obtain a clear picture of the energy distribution of f(t) in the time-frequency plane. We thus only keep the first sum and define

Ef(t, ω) = Σ_{n=0}^{+∞} |⟨R^n f, g_{γ_n}⟩|² Wg_{γ_n}(t, ω).  (49)

A similar decomposition algorithm over time-frequency atoms was derived independently by Qian and Chen [14], in order to define this energy distribution in the time-frequency plane. From the well-known dilation and translation properties of the Wigner distribution and the expression (4) of a time-frequency atom, we derive that for γ = (s, u, ξ)

Wg_γ(t, ω) = Wg((t - u)/s, s(ω - ξ)).  (50)

By inserting (50) in this expression, we obtain

Ef(t, ω) = Σ_{n=0}^{+∞} |⟨R^n f, g_{γ_n}⟩|² Wg((t - u_n)/s_n, s_n(ω - ξ_n)).  (51)

The Wigner distribution also satisfies

∫_{-∞}^{+∞} ∫_{-∞}^{+∞} Wg(t, ω) dt dω = ||g||² = 1  (52)

so the energy conservation (22) implies

∫_{-∞}^{+∞} ∫_{-∞}^{+∞} Ef(t, ω) dt dω = ||f||².  (53)

We can thus interpret Ef(t, ω) as an energy density of f in the time-frequency plane (t, ω). Unlike the Wigner and the Cohen class distributions, it does not include cross terms. It also remains positive if Wg(t, ω) is positive, which is the case when g(t) is Gaussian. On the other hand, the energy density Ef(t, ω) does not satisfy marginal properties, as opposed to certain Cohen class distributions [1]. The importance of these marginal properties for signal processing is however not clear.

When the signal f(t) is real, to get a decomposition with real expansion coefficients, one must use dictionaries of real time-frequency atoms. For any γ = (s, u, ξ), with ξ ≠ 0, and any phase φ ∈ [0, 2π[, we define

g_{(γ,φ)}(t) = (K_{(γ,φ)}/√s) g((t - u)/s) cos(ξt + φ).  (54)

The constant K_{(γ,φ)} is adjusted so that ||g_{(γ,φ)}|| = 1. The phase φ that was hidden in the complex numbers now appears explicitly as a parameter of the real atoms. The dictionary of real time-frequency atoms is defined by D = (g_{(γ,φ)})_{(γ,φ)∈Γ×[0,2π[}, with Γ = R⁺ × R². The matching pursuit performed with this dictionary decomposes any real signal f(t) into

f = Σ_{n=0}^{+∞} ⟨R^n f, g_{(γ_n,φ_n)}⟩ g_{(γ_n,φ_n)}  (55)

where the indexes (γ_n, φ_n) = (s_n, u_n, ξ_n, φ_n) are chosen to best match the residues of f. For any γ = (s, u, ξ), real atoms are related to complex atoms by

g_{(γ,φ)} = (K_{(γ,φ)}/2) (e^{iφ} g_γ + e^{-iφ} g_{γ̄})  (56)

where γ̄ = (s, u, -ξ). However, one can show that the real matching pursuit decomposition (55) is not equivalent to the complex decomposition (42), because the two vectors g_γ(t) and g_{γ̄}(t) are not orthogonal. The time-frequency energy distribution of a real function f(t) is derived from its matching pursuit decomposition, by summing the Wigner distributions of the underlying complex atoms

Ef(t, ω) = Σ_{n=0}^{+∞} |⟨R^n f, g_{(γ_n,φ_n)}⟩|² (1/2) (Wg_{γ_n}(t, ω) + Wg_{γ̄_n}(t, ω))  (57)

= Σ_{n=0}^{+∞} |⟨R^n f, g_{(γ_n,φ_n)}⟩|² (1/2) (Wg((t - u_n)/s_n, s_n(ω - ξ_n)) + Wg((t - u_n)/s_n, s_n(ω + ξ_n))).  (58)

This distribution also satisfies the energy density property (53).

In signal processing applications of time-frequency matching pursuits, we process directly the discrete parameters (γ_n, φ_n) = (s_n, u_n, ξ_n, φ_n) and ⟨R^n f, g_{γ_n}⟩ of the selected atoms, rather than the energy density Ef(t, ω). Indeed, these parameters carry all the necessary information and are much easier to manipulate than the two-dimensional map Ef(t, ω). This energy distribution is

rather used for the visualization of the structure book information. If g(t) is the Gaussian window

g(t) = 2^{1/4} e^{-πt²}  (59)

then

Wg(t, ω) = (1/π) e^{-2πt²} e^{-ω²/(2π)}.  (60)

The time-frequency atoms g_γ(t) are then called Gabor functions. The time-frequency energy distribution Ef(t, ω) is a sum of Gaussian blobs whose locations and variances along the time and frequency axes depend upon the parameters (s_n, u_n, ξ_n). As explained in Section IV, to implement efficiently a matching pursuit, we must avoid computing the inner products of the signal residues with all the dictionary vectors. The following theorem guarantees that, if we discretize appropriately the Gabor dictionary, one can obtain a subdictionary that satisfies the property (30).

Theorem 2: Let Δu and Δξ be respectively a time and a frequency discretization interval that satisfy Δξ Δu < 2π, and let a > 1 be an elementary dilation factor. Let Γ_α be the discrete subset of Γ = R⁺ × R² of all indexes γ = (a^j, p a^j Δu, k a^{-j} Δξ), for (j, p, k) ∈ Z³. There exists a constant α > 0 such that for all f ∈ L²(R)

sup_{γ∈Γ_α} |⟨f, g_γ⟩| ≥ α sup_{γ∈Γ} |⟨f, g_γ⟩|.  (62)

Appendix D gives a proof of this theorem. The fast numerical implementation of a matching pursuit in a Gabor dictionary is based on this theorem.

VI. DISCRETE MATCHING PURSUIT IN GABOR DICTIONARIES

We explain the discrete implementation of a matching pursuit for a dictionary of Gabor time-frequency atoms. Numerical examples are shown at the end of this section. We suppose that our signal is real and has N samples. The space H is the set of infinite discrete signals of period N. Due to the limitations of the sampling rate and the signal size, the scale s can only vary between 1 and N. The window function g(t) is the normalized Gaussian given by (59). To obtain a discrete and periodic signal, at any scale s, the window function is uniformly sampled and periodized over N points

g_s(n) = (K_s/√s) Σ_{p=-∞}^{+∞} g((n - pN)/s).  (63)

The constant K_s normalizes the discrete norm of g_s. For any integer 0 ≤ p < N and 0 ≤ k < N, we denote γ = (s, p, 2πk/N) and define the discrete Gabor atom

g_γ(n) = g_s(n - p) e^{i(2πk/N)n}.  (64)

The discrete complex Gabor dictionary is the set of all such atoms for s ∈ ]1, N[ and p, k integers between 0 and N. To this dictionary of atoms, we add the canonical basis of discrete Diracs and the discrete Fourier basis of complex exponentials. For γ = (1, p, 0), g_γ(n) is a discrete Dirac centered at p. For γ = (N, 0, 2πk/N), g_γ(n) = (1/√N) e^{i(2πk/N)n}.

Similarly to (56), for any γ = (s, p, 2πk/N) and φ ∈ [0, 2π[, real discrete time-frequency atoms are given by

g_{(γ,φ)}(n) = K_{(γ,φ)} g_s(n - p) cos((2πk/N)n + φ)  (65)

with K_{(γ,φ)} such that ||g_{(γ,φ)}|| = 1. Appendix E describes an efficient implementation of a matching pursuit with this real discrete Gabor dictionary and gives information to obtain a copy of a matching pursuit software. The implementation follows the general algorithm described in Section IV. We compute the inner products of the signal residues with the complex Gabor atoms (64) and recover the phase φ from the complex coefficients. As suggested by Theorem 2 and the implementation algorithm of Section IV, we only compute the inner products of the signal residues with a subset D_α = (g_γ)_{γ∈Γ_α} of the complex Gabor dictionary. The index set Γ_α is composed of all γ = (a^j, p a^j Δu, k a^{-j} Δξ), with a = 2, Δu = 1/2, Δξ = π, 0 < j < log₂ N, 0 ≤ p < N 2^{-j+1} and 0 ≤ k < 2^{j+1}. We also add the discrete Dirac and Fourier bases to D_α. The number of vectors in D_α is O(N log₂ N). The implementation of the matching pursuit iterations is further described in Appendix E. The search over D_α finds the approximate scale, time and frequency localization of the main signal structures. These values are then refined with a Newton search strategy to recover the time-frequency parameters that best match the signal components. Each iteration requires O(N log N) operations and as much CPU time as a fast Fourier transform subroutine applied to a signal of N samples.

Fig. 1(a) is a signal f of 512 samples that is built by adding chirps, truncated sinusoidal waves and waveforms of different time-frequency localizations. No Gabor function has been used to construct this signal. Fig. 1(b) shows the time-frequency energy distribution Ef(t, ω). Since Ef(t, ω) = Ef(t, -ω), we only display its values for ω ≥ 0. Each Gabor time-frequency atom selected by the matching pursuit is an elongated Gaussian blob in the time-frequency plane.
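A toy version of a discrete Gabor matching pursuit can be sketched as follows. The coarse ad hoc grid of scales, positions, and frequencies, the test signal, and the brute-force correlation are all arbitrary simplifications; this naive O(N²)-per-iteration loop is not the paper's O(N log N) algorithm, and it omits the Newton refinement:

```python
import numpy as np

def gabor_dict(N, scales):
    """Small discrete Gabor dictionary of periodized Gaussian windows shifted to
    position p and modulated by e^{i 2 pi k n / N}, in the spirit of (63)-(64)."""
    n = np.arange(N)
    m = (n + N // 2) % N - N // 2            # lags centered around 0
    atoms = []
    for s in scales:
        g = np.exp(-np.pi * (m / s) ** 2)    # one period of the window
        for p in range(0, N, max(1, s // 2)):
            win = np.roll(g, p)
            for k in range(0, N, max(1, N // (4 * s))):
                atom = win * np.exp(2j * np.pi * k * n / N)
                atoms.append(atom / np.linalg.norm(atom))
    return np.array(atoms)

N = 128
n = np.arange(N)
# Test signal: a pure tone plus a short transient.
f = np.cos(2 * np.pi * 10 * n / N) + 2.0 * np.exp(-((n - 40) / 3.0) ** 2)

D = gabor_dict(N, scales=[4, 16, 64])
residue = f.astype(complex)
norms = [np.linalg.norm(residue)]
for _ in range(40):
    inner = D.conj() @ residue               # <R^n f, g_gamma> for all atoms
    k = int(np.argmax(np.abs(inner)))
    residue = residue - inner[k] * D[k]
    norms.append(np.linalg.norm(residue))

# The residue norm decreases monotonically (energy conservation at each step).
print(all(b <= a + 1e-9 for a, b in zip(norms, norms[1:])))  # True
```

Plotting log₁₀ of `norms` reproduces qualitatively the decay curves of Fig. 2: fast decay while coherent structures (the tone and the transient) are extracted, slower decay afterwards.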
We clearly see two chirps appearing that cross each other, with a localized time-frequency waveform at the top of their crossing point. We can also detect closely spaced Diracs, and truncated sinusoidal waves having close frequencies. Several isolated localized time-frequency components also appear in this energy distribution. The curve (a) in Fig. 2 gives the decay of log₁₀ ||R^n f||/||f|| as a function of the number of iterations n. For n ≤ 130, ||R^n f|| has a relatively faster decay. These iterations correspond to the coherent signal structures, as shown in Section VII. For n ≥ 130, the decay rate is almost constant. This confirms the exponential decay proved by Lemma 2. For any n ≥ 0, the back-projection algorithm described in Section III recovers a



Fig. 1. (a) Signal of 512 samples built by adding chirps, truncated sinusoidal waves and waveforms of different time-frequency localizations. (b) Time-frequency energy distribution Ef(t, ω) of the signal shown in (a). The horizontal axis is time. The vertical axis is frequency. The highest frequencies are on the top. The darkness of this time-frequency image increases with the value Ef(t, ω).

Fig. 2. The curve (a) gives the decay of log₁₀ ||R^n f||/||f||, as a function of the number of iterations n, for the signal f of Fig. 1(a). The curve (b) gives the decay of log₁₀ ||P_{W_n} f||/||f||.

Fig. 3. (a) Speech recording of the word "greasy," sampled at 16 kHz. (b) Time-frequency energy distribution of the speech recording shown in (a). We see the low-frequency component of the "g," the quick burst transition to the "ea," and the harmonics of the "ea." The "s" has an energy spread over high frequencies.

better approximation of f from the n atoms selected from the dictionary. The reconstruction error is then the orthogonal projection of f on the space W_n that is orthogonal to the n vectors selected by the matching pursuit. The back-projection requires much less computation than the matching pursuit. The curve (b) in Fig. 2 gives the decay of log₁₀ ||P_{W_n} f||/||f||. For n ≤ 130, ||R^n f|| ≈ ||P_{W_n} f||. It means that the matching pursuit computes a close approximation of the orthogonal projection of f on the n vectors selected from the dictionary. For n = 300, ||R^n f|| = 1.5 ||P_{W_n} f||. For n = N = 512, log₁₀ ||P_{W_n} f||/||f|| drops to -∞ because P_{W_n} f = 0. This indicates that the N vectors selected by the matching pursuit are linearly independent and thus define a basis of the signal space H. The relative gain of the back-projection is important when the number of iterations is of the order of the dimension of the signal space. For almost all signals f, the decays of ||R^n f|| and ||P_{W_n} f|| have the same qualitative behavior as in Fig. 2.
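The effect of back-projection can be sketched by computing the orthogonal projection on the span of the selected atoms directly with a least-squares solve. This stands in for the paper's iterative back-projection algorithm of Section III, which is not reproduced here; the dictionary, signal, and iteration count are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 64
D = rng.standard_normal((N, 5 * N))
D /= np.linalg.norm(D, axis=0)
f = rng.standard_normal(N)

# Plain matching pursuit: select 20 greedy atoms.
residue = f.copy()
selected = []
for _ in range(20):
    inner = D.T @ residue
    k = int(np.argmax(np.abs(inner)))
    selected.append(k)
    residue = residue - inner[k] * D[:, k]

# Back-projection sketch: replace the greedy coefficients by the orthogonal
# projection of f on the span of the selected atoms, via least squares.
A = D[:, selected]
coeffs, *_ = np.linalg.lstsq(A, f, rcond=None)
backproj_residue = f - A @ coeffs

# The orthogonal projection can only lower the reconstruction error.
print(np.linalg.norm(backproj_residue) <= np.linalg.norm(residue) + 1e-9)  # True
```

The greedy approximation already lies in the span of the selected atoms, so the exact projection error is never larger; the gap between the two residues mirrors the gap between curves (a) and (b) in Fig. 2.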

Fig. 3(a) is the graph of a speech recording corresponding to the word "greasy," sampled at 16 kHz. From the time-frequency energy displayed in Fig. 3(b), we can see the low-frequency component of the "g" and the quick burst transition to the "ea." The "ea" has many harmonics that are lined up, but we can also see localized high-frequency impulses that correspond to the pitch. The "s" component has a time-frequency energy spread over a high-frequency interval. Most of the signal energy is characterized by a few time-frequency atoms. For n = 250 atoms, ||R^n f|| / ||f|| = 0.169, although the signal has 5782 samples, and the sound recovered from these atoms is of excellent quality.

Fig. 4(a) shows a signal obtained by adding a Gaussian white noise to the speech recording given in Fig. 3(a), with a signal-to-noise ratio of 1.5 dB. Fig. 4(b) is the time-frequency energy distribution of this noisy signal. The white noise generates time-frequency atoms spread across the whole time-frequency plane, but we can still distinguish the time-frequency structures of the original signal because their energy is better concentrated in this plane.

MALLAT AND ZHANG: MATCHING PURSUITS WITH TIME-FREQUENCY DICTIONARIES

Fig. 4. (a) Signal obtained by adding a Gaussian white noise to the speech recording shown in Fig. 3(a). The signal-to-noise ratio is 1.5 dB. (b) Time-frequency energy distribution of the noisy speech signal. The energy distribution of the white noise is spread across the whole time-frequency plane.

VII. NOISE AND COHERENT STRUCTURES

Generally, the notion of noise versus signal information is ill-defined. Even though a signal component might carry a lot of information, it is often considered as noise if we cannot make sense out of it. In a crowd of people speaking a language we do not understand, surrounding conversations are generally perceived as a noise background. However, our attention will be attracted by a remote conversation in a language we know. In this case, what is important is not the information content but whether this information is in a coherent format with respect to our system of interpretation. A matching pursuit decomposition in a given dictionary defines a system of interpretation for signals. We study the notion of coherence and describe an algorithm that isolates signal structures that are coherent with respect to a given dictionary. Coherent signal components have a strong correlation

with some dictionary vectors. The more coherent a signal, the larger the correlation ratios of the signal residues

λ(R^n f) = sup_{γ∈Γ} |⟨R^n f, g_γ⟩| / ||R^n f||.   (66)

The matching pursuit selects vectors g_{γ_n} that almost best correlate with the signal residues. Let us denote

λ̃(R^n f) = |⟨R^n f, g_{γ_n}⟩| / ||R^n f||.   (67)

Equation (11) implies that

λ̃(R^n f) ≤ λ(R^n f) ≤ (1/α) λ̃(R^n f).   (68)

For any h ∈ H, the choice procedure implies that λ(h) = λ(h / ||h||). Hence, λ(h) only depends upon the position of h / ||h|| on the unit sphere of the signal space H. Let W be a discrete Gaussian white noise. For any n ≥ 0, the average value of λ(R^n f), measured with a uniform probability distribution over the unit sphere, is equal to the


expected value E(λ(R^n W)). Indeed, after normalization, the realizations of a discrete Gaussian white noise have a uniform probability distribution over the unit sphere of H. We define the coherent structures of a signal f as the first m vectors (g_{γ_n})_{0≤n<m} that have a higher than average correlation with R^n f. In other words, f has m coherent structures if and only if, for 0 ≤ n < m,

λ(R^n f) > E(λ(R^n W))   (69)

and

λ(R^m f) ≤ E(λ(R^m W)).   (70)
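The stopping rule (69)-(70) can be simulated under the stated assumptions: E(λ(R^n W)) is estimated by averaging the correlation ratios of pursuits run on Gaussian white noise realizations, which after normalization are uniformly distributed on the unit sphere. A toy NumPy sketch with a random dictionary (all names are illustrative, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
N, P, n_max = 64, 256, 30

D = rng.standard_normal((N, P))
D /= np.linalg.norm(D, axis=0)

def correlation_ratios(signal, n):
    """Run n matching-pursuit iterations and return
    lambda(R^k f) = sup_g |<R^k f, g>| / ||R^k f|| for k = 0..n-1."""
    r = np.asarray(signal, dtype=float).copy()
    ratios = []
    for _ in range(n):
        corr = D.T @ r
        k = int(np.argmax(np.abs(corr)))
        ratios.append(abs(corr[k]) / np.linalg.norm(r))
        r -= corr[k] * D[:, k]
    return np.array(ratios)

# Monte Carlo estimate of E(lambda(R^n W)) over white-noise realizations.
noise_ratios = np.mean(
    [correlation_ratios(rng.standard_normal(N), n_max) for _ in range(50)],
    axis=0)

# A coherent signal: three dictionary atoms plus a little noise.
f = D[:, :3] @ np.array([5.0, 3.0, 2.0]) + 0.1 * rng.standard_normal(N)
f_ratios = correlation_ratios(f, n_max)

# Number of coherent structures: first n where (69) fails
# (np.argmax returns 0 if it never fails within n_max iterations).
m = int(np.argmax(f_ratios <= noise_ratios))
```

The first few ratios of f sit well above the noise curve and fall to the noise level once the coherent part of the signal has been extracted, which is the behavior of curves (a)-(c) in Fig. 5.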

Equation (13) proves that λ̃(R^n f) is related to the decay of ||R^n f|| by

λ̃(R^n f)² = 1 − ||R^{n+1} f||² / ||R^n f||².   (71)

Fig. 5. The curve (c) is a plot of E(λ(R^n W)) as a function of n, for a discrete Gaussian white noise of 5762 samples. The curves (a) and (b) give the values of λ(R^n f) for the speech signal in Fig. 3(a) and the noisy speech signal in Fig. 4(a).
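Relation (71) follows from the energy conservation of each pursuit step: the selected atom has unit norm and the residue update is an orthogonal projection. A quick numerical check with a toy random dictionary (with α = 1, λ̃ coincides with the best correlation ratio):

```python
import numpy as np

rng = np.random.default_rng(4)
N, P = 32, 128

D = rng.standard_normal((N, P))
D /= np.linalg.norm(D, axis=0)
f = rng.standard_normal(N)

r = f.copy()
for _ in range(10):
    corr = D.T @ r
    k = int(np.argmax(np.abs(corr)))
    lam = abs(corr[k]) / np.linalg.norm(r)   # correlation ratio of R^n f
    r_next = r - corr[k] * D[:, k]           # next residue R^{n+1} f
    # Relation (71): the ratio controls the energy decay of the residues.
    assert np.isclose(lam ** 2,
                      1 - np.linalg.norm(r_next) ** 2 / np.linalg.norm(r) ** 2)
    r = r_next
```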

One can verify that for a Gabor dictionary, the signal shown in Fig. 1(a) has m = 130 coherent structures, which correspond to the iterations where ||R^n f|| has a relatively faster decay in Fig. 2.

For all the dictionaries that we studied numerically, we have observed that when n increases, E(λ(R^n W)) converges quickly to a constant E(λ(R^∞ W)). In fact, the process R^n W seems to converge to a process R^∞ W that we call dictionary noise, whose properties are now being studied. The realizations of a dictionary noise have an energy that is uniformly spread across the whole dictionary. For a Gabor dictionary this process is a stationary white noise that is not Gaussian. The curve (c) in Fig. 5 gives the value of E(λ(R^n W)) as a function of n, for a discrete Gaussian white noise of 5762 samples, decomposed in a Gabor dictionary. The limit is E(λ(R^∞ W)) ≈ 0.0506.

The curve (a) in Fig. 5 gives the value of λ(R^n f) as a function of n for the speech recording f shown in Fig. 3(a). The number of coherent structures is the abscissa of the first intersection between curves (a) and (c), which is located at n = 698. We have observed numerically that after removing the coherent structures from a signal f, the residue R^m f behaves like a realization of the dictionary noise R^∞ W. This property remains to be studied more precisely. The curve (b) in Fig. 5 gives the value of λ(R^n f) for the noisy speech signal in Fig. 4(a). The noise has destroyed the low-energy coherent structures and only 76 coherent structures remain at an SNR of 1.5 dB. Fig. 6(a) is the time-frequency energy distribution of these m = 76 coherent structures. Fig. 6(b) is the signal reconstructed from these time-frequency atoms. The SNR of the reconstructed signal is 6.8 dB. The white noise has been removed and this signal has a good auditory quality because the main time-frequency structures of the original speech signal have been retained.

Fig. 6. (a) Time-frequency energy distribution of the m = 76 coherent structures of the noisy speech signal shown in Fig. 4. (b) Signal reconstructed from the 76 coherent structures shown in (a). The white noise has been removed.

VIII. WAVEPACKET DICTIONARY

A wavepacket dictionary is a family of orthonormal bases composed of vectors that are well localized both in time and frequency. It is computed with a quadrature mirror filter bank algorithm [15]. Through our numerical experiments with wavepacket dictionaries, we intend to compare matching pursuit decompositions with the best basis algorithm of Coifman and Wickerhauser [4], which selects an "optimal" orthonormal basis within the wavepacket dictionary. This highlights the respective advantages of procedures that globally adapt the signal representation versus the greedy strategy of a matching pursuit, which locally optimizes the decomposition.

For signals of N samples, each vector g_γ of a wavepacket dictionary is indexed by γ = (j, p, k), with 0 ≤ j ≤ log₂(N), 0 ≤ p ≤ 2^{−j} N, 0 ≤ k ≤ 2^j. Such a vector has time-frequency localization properties similar to those of a discrete window function dilated by 2^j, centered at 2^j (p + 1/2), and modulated by a sinusoidal wave of frequency 2π 2^{−j} (k + 1/2). The wavepacket dictionary D = (g_γ)_{γ∈Γ} includes N(1 + log₂(N)) vectors. For any discrete signal f(n) of N samples, the inner products (⟨f, g_γ⟩)_{γ∈Γ} are computed with a filter bank algorithm based on quadrature mirror filters, which requires O(N log₂(N)) operations [4]. The implementation of the matching pursuit decomposition follows the general outline of the algorithm described in Section IV. In this case, we set the optimality factor α to 1 and search over the whole dictionary D because it is not too large. To compute the inner product updating formula (33), we calculate the inner products of wavepacket vectors from the coefficients of the quadrature mirror filters [4]. Each matching pursuit iteration requires O(N log₂(N)) operations.

Fig. 7(a) shows the structure book (⟨R^n f, g_{γ_n}⟩, γ_n)_{n∈N} of the signal in Fig. 1(a), with the display conventions of Coifman and Wickerhauser [4]. The wavepacket dictionary is built with the Daubechies 6 quadrature mirror filters [5]. The horizontal and vertical axes of Fig. 7 are respectively the time and frequency axes. Each vector g_{γ_n}, for γ_n = (j_n, p_n, k_n), is represented by a rectangle which is centered at the time 2^{j_n} (p_n + 1/2) and at the frequency 2π 2^{−j_n} (k_n + 1/2). This rectangle has a width of 2^{j_n} along time and 2^{−j_n} π along frequencies. It gives an approximate idea of the localization in time and frequency of the atom g_{γ_n}, but in reality g_{γ_n} is much more spread in time and frequency than the zone indicated by its rectangle.

Fig. 7. (a) Time-frequency display of the wavepacket structure book of the signal shown in Fig. 1(a). Each rectangle roughly represents the location and time-frequency spread of a selected wavepacket function. (b) Time-frequency display of the signal in Fig. 1(a) decomposed in the best wavepacket orthonormal basis.

Wavepacket functions are not as well localized in time and frequency as Gabor functions. When the scale 2^j increases, these atoms have a complicated time-frequency localization studied by Coifman, Meyer and Wickerhauser [3]. The time-frequency image obtained with this wavepacket dictionary is similar to the energy distribution in Fig. 1(b), obtained with a Gabor dictionary. Some signal features do not appear as clearly because wavepackets are not as well localized in time and frequency as Gabor functions. Moreover, wavepacket functions do not include a phase parameter and thus cannot match signal components as well as Gabor functions. We must also mention that the Gabor dictionary includes Gabor functions translated in time and frequency over a much finer grid than wavepackets, so that the different time-frequency signal features can be located more precisely. Although the Gabor dictionary is much larger than the wavepacket dictionary, the matching pursuit does not require many more calculations because we limit most of the computations to a subdictionary D_α that is approximately of the same size as the wavepacket dictionary.

By combining the vectors of a wavepacket dictionary, Coifman and Wickerhauser [4] proved that we can build 2^N different orthonormal bases. They have introduced an algorithm that finds the orthonormal basis (g_{γ_n})_{1≤n≤N} of D which minimizes the entropy

− Σ_{n=1}^{N} |⟨f, g_{γ_n}⟩|² log₂(|⟨f, g_{γ_n}⟩|²).   (72)

The choice of this "optimal" orthonormal basis is thus obtained through a global minimization over all the signal components. Fig. 7(b) displays the structure book (⟨f, g_{γ_n}⟩, γ_n)_{n∈N} that is obtained by decomposing the signal of Fig. 1(a) in the optimal wavepacket orthonormal basis. One can hardly distinguish many of the signal components, including the two chirps. The entropy optimization creates a competition between the signal components that are in the same frequency range but have different time-frequency signatures. Since the signal is not stationary, the global entropy minimization is driven by the transients of highest energy. It leads to a choice of orthonormal basis that is well adapted to represent the corresponding transients, but not to represent other signal structures that have different time-frequency behaviors. For highly nonstationary signals, the entropy minimization produces a mismatch between the "best" orthonormal basis and many local signal components. On the contrary, a matching pursuit is a greedy algorithm that locally optimizes the choice of the wavepacket function for each signal residue. It can thus adapt itself to varying structures. On the other hand, this greedy strategy requires more computations than the best basis decomposition algorithm, whose total complexity is O(N log N). The best basis algorithm is thus better suited to represent simpler signals that have stationary properties. The global optimization is then valid locally and yields good results.
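The best basis search can be sketched with Haar quadrature mirror filters in place of the Daubechies 6 filters used in the paper. The bottom-up comparison of a node's entropy (72) with the best cost of its two children is the core of the Coifman-Wickerhauser algorithm; all function names here are ours, and this is a minimal sketch rather than the paper's implementation:

```python
import numpy as np

def haar_split(x):
    """One quadrature-mirror split (Haar filters): approximation and
    detail coefficients, each at half the length."""
    even, odd = x[0::2], x[1::2]
    return (even + odd) / np.sqrt(2), (even - odd) / np.sqrt(2)

def packet_table(x, max_level):
    """Full wavepacket table: table[j] holds the 2**j coefficient
    blocks at depth j (N coefficients per level)."""
    table = [[np.asarray(x, dtype=float)]]
    for j in range(max_level):
        nxt = []
        for block in table[j]:
            lo, hi = haar_split(block)
            nxt.extend([lo, hi])
        table.append(nxt)
    return table

def entropy(c):
    """Entropy cost (72) of a block, -sum p log2 p with p = |c|^2
    (signal assumed unit norm; 0 log 0 := 0)."""
    p = np.abs(c) ** 2
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def best_basis(table):
    """Coifman-Wickerhauser search: bottom-up, keep a node's own block
    if its entropy is below the best cost of its two children."""
    depth = len(table) - 1
    cost = [entropy(b) for b in table[depth]]
    chosen = [[b] for b in table[depth]]
    for j in range(depth - 1, -1, -1):
        new_cost, new_chosen = [], []
        for b, block in enumerate(table[j]):
            child = cost[2 * b] + cost[2 * b + 1]
            own = entropy(block)
            if own <= child:
                new_cost.append(own)
                new_chosen.append([block])
            else:
                new_cost.append(child)
                new_chosen.append(chosen[2 * b] + chosen[2 * b + 1])
        cost, chosen = new_cost, new_chosen
    return chosen[0], cost[0]

N = 64
f = np.sin(2 * np.pi * 8 * np.arange(N) / N)   # a pure tone
f /= np.linalg.norm(f)

table = packet_table(f, max_level=4)
basis_blocks, best_cost = best_basis(table)

# The selected blocks form one orthonormal basis: exactly N coefficients
# that conserve the energy of the unit-norm signal.
coeffs = np.concatenate(basis_blocks)
assert coeffs.size == N
assert np.isclose(np.sum(coeffs ** 2), 1.0)
```

A matching pursuit over the same table would instead pick, at every iteration, the single packet coefficient best correlated with the current residue, without requiring the selected blocks to form one orthonormal basis — which is exactly the local-versus-global trade-off discussed above.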


IX. CONCLUSION

Matching pursuits provide extremely flexible signal representations, since the choice of dictionaries is not limited. We showed that time-frequency dictionaries yield adaptive decompositions where signal structures are represented by atoms that match their time-frequency signature. The properties of the signal components are explicitly given by the scale, frequency, time, and phase indexes of the selected atoms. This representation is therefore well adapted to information processing.

Compact signal coding is another important domain of application of matching pursuits. For a given class of signals, if we can adapt the dictionary to minimize the storage for a given approximation precision, we are guaranteed to obtain better results than decompositions on orthonormal bases. Indeed, an orthonormal decomposition is a particular case of matching pursuit where the dictionary is the orthonormal basis. For dictionaries that are not orthonormal bases, we must code not only the inner products of the structure book but also the indexes of the selected vectors. This requires quantizing the inner product values and using a dictionary of finite size. The matching pursuit decomposition is then equivalent to a multistage shape-gain vector quantization in a very high dimensional space.

For information processing or compact signal coding, it is important to have strategies to adapt the dictionary to the class of signals that is decomposed. Time-frequency dictionaries include vectors that are spread between the Fourier and Dirac bases. They are regularly distributed on the unit sphere of the signal space and are thus well adapted to decompose signals over which we have little prior information. When enough prior information is available, one can adapt the dictionary to the probability distribution of the signal class within the signal space H. Learning a dictionary is equivalent to finding the important inner structures of the signals that are decomposed.
Classical algorithms to optimize codebooks, such as LBG [9], do not converge to satisfying solutions in such high-dimensional vector spaces. Finding strategies to optimize dictionaries in high dimensions is an open problem that shares similar features with learning problems in neural networks.

APPENDIX A

PROOF OF THEOREM

This appendix is a translation, in the matching pursuit context, of Jones's proof [11] of the convergence of projection pursuit regressions.

Lemma 3: Let h_n = ⟨R^n f, g_{γ_n}⟩ g_{γ_n}. For any n ≥ 0 and m ≥ 0,

|⟨h_m, R^n f⟩| ≤ (1/α) ||h_m|| ||h_n||.   (73)

Proof: Since h_m = ⟨R^m f, g_{γ_m}⟩ g_{γ_m},

|⟨h_m, R^n f⟩| = |⟨R^m f, g_{γ_m}⟩| |⟨g_{γ_m}, R^n f⟩| = ||h_m|| |⟨g_{γ_m}, R^n f⟩|.   (74)

Equation (11) implies

|⟨g_{γ_m}, R^n f⟩| ≤ (1/α) |⟨R^n f, g_{γ_n}⟩| = (1/α) ||h_n||.   (75) □

Lemma 4: If (s_n)_{n∈N} is a positive sequence such that Σ_{n=0}^{+∞} s_n² < +∞, then

lim inf_{n→+∞} s_n Σ_{k=0}^{n} s_k = 0.   (76)

Proof: For any ε > 0, we choose n such that Σ_{k=n+1}^{+∞} s_k² ≤ ε/2. Since lim_{k→+∞} s_k = 0, we can choose k large enough such that s_k Σ_{j=0}^{n} s_j ≤ ε/2. Let s_J be the minimum element for indexes between n + 1 and k. Then

s_J Σ_{k=0}^{J} s_k = s_J Σ_{k=0}^{n} s_k + s_J Σ_{k=n+1}^{J} s_k ≤ ε/2 + Σ_{k=n+1}^{J} s_k² ≤ ε.   (77) □

Hence (R^n f)_{n∈N} is a Cauchy sequence; let R^∞ f = lim_{n→+∞} R^n f, so that, letting p → +∞,

f = Σ_{n=0}^{+∞} ⟨R^n f, g_{γ_n}⟩ g_{γ_n} + R^∞ f.

We know that lim_{n→+∞} |⟨R^n f, g_{γ_n}⟩| = 0. This implies that R^∞ f ∈ W.

PROOF OF LEMMA

The energy conservation equation (13) proves that the sequence (||R^n f||)_{n∈N} is monotonically decreasing and thus converges to some value R_∞. Let ε > 0; there exists K > 0 such that for all m > K, ||R^m f||² ≤ R_∞² + ε². Let p > 0. We want to estimate ||R^m f − R^{m+p} f|| for m > K. Equation (17) proves that lim_{n→+∞} |⟨R^n f, g_{γ_n}⟩| = 0.

Since f has a norm 1, the inner products of f with the elements of D cannot all be zero, since D is complete and thus includes at least a basis of H. This contradicts our assumption, which finishes the proof. □

We can decompose

||R^m f − R^{m+p} f|| ≤ ||R^m f − R^q f|| + ||R^{m+p} f − R^q f||.   (82)

Equation (80) for N = m and M = q implies

||R^m f − R^q f||² ≤ ε² + (4/α) ε².   (83)

APPENDIX C

DILATION, TRANSLATION, AND MODULATION COVARIANCE

We say that a subset A of Γ is admissible and associated to f ∈ L²(R) if

A = {β ∈ Γ : |⟨f, g_β⟩| ≥ α sup_{γ∈Γ} |⟨f, g_γ⟩|}.

A choice function C is said to be covariant if and only if, for any admissible set A, C(A) = (s₀, u₀, ξ₀) implies that

C(A_{(a,b,c)}) = (s₀/a, (u₀ − c)/a, a(ξ₀ − b)).

Equation (95) proves that the set A_{(a,b,c)} defined in (92) also satisfies

A_{(a,b,c)} = {β ∈ Γ : |⟨f₁, g_β⟩| ≥ α sup_{γ∈Γ} |⟨f₁, g_γ⟩|}.   (97)

The covariance of the choice function implies that if C(A) = γ₀⁰ = (s₀, u₀, ξ₀), then C(A_{(a,b,c)}) = γ₀¹ = (s₀/a, (u₀ − c)/a, a(ξ₀ − b)). We can thus derive that

Rf₁(t) = (1/√a) Rf₀((t − c)/a) e^{ibt}.   (98)

Similarly, we can prove by induction that for any n,

R^n f₁(t) = (1/√a) R^n f₀((t − c)/a) e^{ibt},   (99)

and if γ_n⁰ = (s_n, u_n, ξ_n), then γ_n¹ = (s_n/a, (u_n − c)/a, a(ξ_n − b)) and

⟨R^n f₁, g_{γ_n¹}⟩ = e^{ic(b − ξ_n)} ⟨R^n f₀, g_{γ_n⁰}⟩.   (100)

Conversely, if the residues of f₀(t) and f₁(t) satisfy these equalities, (42) proves that (94) is satisfied. Hence, a matching pursuit based on covariant choice functions is covariant by dilation, translation, and modulation.
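A discrete analogue of this covariance is easy to check numerically for translation: if the dictionary is closed under circular time shifts and the choice function simply maximizes the correlation, translating the signal translates every residue. This is a toy sketch with illustrative names, using circular shifts in place of continuous translation; ties in the maximization, which could break covariance, have probability zero for a random signal.

```python
import numpy as np

rng = np.random.default_rng(3)
N, n_iter, shift = 32, 5, 7

# Dictionary closed under circular time shifts: every rotation of a few
# random unit-norm mother atoms.
mothers = rng.standard_normal((3, N))
atoms = np.array([np.roll(m, s) for m in mothers for s in range(N)])
atoms /= np.linalg.norm(atoms, axis=1, keepdims=True)

def residue(signal, n):
    """Residue after n matching-pursuit iterations (alpha = 1)."""
    r = np.asarray(signal, dtype=float).copy()
    for _ in range(n):
        corr = atoms @ r
        k = int(np.argmax(np.abs(corr)))
        r -= corr[k] * atoms[k]
    return r

f0 = rng.standard_normal(N)
f1 = np.roll(f0, shift)            # translated signal

# Covariance by translation: the residues are translated the same way.
assert np.allclose(np.roll(residue(f0, n_iter), shift), residue(f1, n_iter))
```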

lim sup |⟨f, g_γ⟩| = 0. □

PROOF OF THEOREM

We can then derive that there exists u₁ that is the supremum of all u such that (s₁, u, ξ) ∈ A. Since A is closed, there exists ξ such that (s₁, u₁, ξ) ∈ A. Since

lim_{|ω|→+∞} |ĝ(ω)| = 0

and ĝ(ω) is absolutely integrable, we can prove that for γ = (s₁, u₁, ξ),

lim_{ξ→+∞} |⟨f, g_γ⟩| = 0.

We can finally derive that there exists ξ₁ that is the supremum of all ξ such that (s₁, u₁, ξ) ∈ A. Since A is closed, (s₁, u₁, ξ₁) ∈ A. This finishes the proof of the lemma. □

Let us prove the covariance of a matching pursuit based on covariant choice functions.

We denote g_{j,p,k}(t) = g_γ(t) for γ = (a^j, p a^j Δu, k a^{−j} Δξ) ∈ Γ_α. Since Δu Δξ < 2π, Daubechies [6] proved that for the Gaussian window g(t) specified by (59), (g_{0,p,k}(t))_{(p,k)∈Z²} is a frame of L²(R). The dual frame is given by (g̃_{0,p,k}(t))_{(p,k)∈Z²}, where g̃(t) ∈ L²(R) and

g̃_{j,p,k}(t) = (1/√(a^j)) g̃((t − p a^j Δu)/a^j) e^{i k a^{−j} Δξ t}.   (101)

The dual window g̃(t) has an exponential decay and its Fourier transform also has an exponential decay [5]. For any f ∈ L²(R),

f(t) = Σ_{p=−∞}^{+∞} Σ_{k=−∞}^{+∞} ⟨f, g_{0,p,k}⟩ g̃_{0,p,k}(t).

With a change of variable, one can derive that for any j ∈ Z,

f(t) = Σ_{p=−∞}^{+∞} Σ_{k=−∞}^{+∞} ⟨f, g_{j,p,k}⟩ g̃_{j,p,k}(t).   (103)

APPENDIX D

Let γ₀ = (s, u, ξ) ∈ Γ and let j ∈ Z be such that a^{j−1/2} ≤ s ≤ a^{j+1/2}, so that g_{j,p,k} = g_γ for γ = (a^j, p a^j Δu, k a^{−j} Δξ). There then exists C₁ > 0 bounding the terms ∫ |g(t + a^{−j} u − p Δu)| dt and |ĝ_{j,p,k}(ω)| |ĝ_{γ₀}(ω)|. Writing Real(z) for the real part of the complex number z, for any residue R^n f we have |⟨R^n f, g_γ⟩| = K_{(γ,φ)} |Real(e^{−iφ}