IMAGE REPRESENTATION WITH GABOR WAVELETS AND ITS APPLICATIONS
Rafael Navarro Instituto de Optica "Daza de Valdés" (CSIC). Serrano 121. 28006 Madrid. Spain Phone #: (341) 561 6800; FAX: (341) 564 5557 E-mail:
[email protected] Antonio Tabernero Facultad de Informática. Universidad Politécnica de Madrid. Boadilla del Monte. 28660 Madrid. Spain. Phone #: (341) 336 6935; FAX: (341) 336 6942 E-mail:
[email protected] Gabriel Cristóbal Instituto de Optica "Daza de Valdés" (CSIC). Serrano 121. 28006 Madrid. Spain Phone #: (341) 561 6800; FAX: (341) 564 5557 E-mail:
[email protected]
(July, 12, 1995)
To be published in: ADVANCES IN IMAGING AND ELECTRON PHYSICS Edited by Peter W. Hawkes Academic Press Inc. Orlando, FL
Index
I. Introduction
II. Joint space-frequency representations and wavelets.
   A. Joint representations. Wigner distribution, spectrogram and block transforms.
   B. Wavelets.
   C. Multiresolution pyramids.
   D. Vision oriented models.
III. Gabor Schemes of Representation.
   A. Exact Gabor expansion for a continuous function.
   B. Gabor expansion of discrete signals.
   C. Quasicomplete Gabor transform.
IV. Vision Modeling.
   A. Image representation in the visual cortex.
   B. Gabor functions and the RFs of cortical cells.
   C. Sampling in the human visual system.
V. Image Coding, Enhancement and Reconstruction.
   A. Image coding and compression.
   B. Image enhancement and reconstruction.
VI. Image Analysis and Machine Vision.
   A. Edge Detection.
   B. Texture.
   C. Motion Analysis.
   D. Stereo.
VII. Conclusion.
Acknowledgements.
References.
I. INTRODUCTION

In image analysis and processing, there is a classical choice between spatial and frequency domain representations. The former, consisting of a two-dimensional (2D) array of pixels, is the standard way to represent discrete images. This is the typical format used for acquisition and display, but it is also common for storage and processing. Space representations appear in a natural way, and they are important for shape analysis, object localization and description (either photometric or morphologic) of the scene. There is much processing that can be done in the space domain: histogram modification, pixel and neighbor operations, and many others. On the other hand, there are many tasks that we can perform more naturally in the Fourier (spatial frequency) domain, such as filtering, correlations, etc. These two representations have very important complementary advantages, which we often have to combine when developing practical applications. An interesting example is our own visual system, which has to perform a variety of complex tasks in real time and in parallel, processing non-stationary signals. Figure 1 (redrawn from Bartelt et al., 1980) illustrates this problem with a simple non-stationary 1D temporal signal. (It is straightforward to extend the following discussion to 2D images or even 3D signals, such as image sequences.) The four panels show different ways of representing a signal corresponding to two consecutive musical notes. The upper-left panel shows the signal as we would see it when displayed on an oscilloscope. Here we can appreciate the temporal evolution, namely, the periodic oscillations of the wave and the transition from one note to the next. Though this representation is complete, it is hard to say much, at a single glance, about the exact frequencies of the notes.
The Fourier spectrum (upper right) provides an accurate global description of the frequency content of the signal, but it does not tell us much about the timing and order of the notes. Despite the fact that either of these two descriptions may be very useful for sound engineers, the music performer would prefer a stave (bottom left) of a musical score, that is, a conjoint representation in time (t axis) and frequency (log ν axis). The Wigner distribution function (Wigner, 1932), bottom right, provides a complete mathematical description of the joint time-frequency domain (Jacobson and Wechsler, 1988), but at the cost of a very high redundancy (doubling the dimension of the signal). A regular sampling of a signal with N elements in the spatial (or frequency) domain will require N² samples in the conjoint domain defined by the Wigner distribution to be stored and analyzed. Although this high degree of redundancy may be necessary in some especially difficult problems (Cristobal et al., 1991), such an expensive redundancy cannot be afforded in general, particularly in vision and image processing tasks (2D or 3D signals). The musician will prefer a conjoint but compact (and meaningful) code, like the stave: only two samples (notes) are required to represent the signal in the example of Fig. 1. Such conjoint but compact codes are more likely to be found in biology, combining usefulness with maximum economy. ####### Insert Fig. 1. About here #######
A possible approach to building a representation with these advantages is to optimally sample the conjoint domain, trying to diminish redundancy without losing information. The uncertainty principle tells us that there exists a limit for joint (space-frequency) localization (Gabor, 1946; Daugman, 1985); i.e. if we apply a fine sampling in the space (or time) domain, then we must apply a coarse frequency sampling, and vice versa. The uncertainty product limits the minimum area for sampling the conjoint domain. Gabor, in his "Theory of Communication" (1946), observed that Gaussian wave packets (Gabor wavelets or Gabor functions) minimize such conjoint uncertainty, being optimal sampling units, or logons, of the conjoint domain. The left panel of Fig. 2 shows the "classical" way of homogeneously sampling this 2D space-frequency conjoint domain. The right panel represents a smarter sampling, like the one used in wavelet or multiscale pyramid representations, and presumably by our own visual system. Here, the sampling area is kept constant, but the aspect ratio of the sampling units changes from one frequency level to the next. This sampling is smarter because it takes into account that low-frequency features tend to occupy a large temporal (or spatial) interval, requiring a rather coarse sampling, whereas high frequencies require fine temporal (or spatial) sampling. In both cases, the sampling density is very important. A critical (Nyquist) sampling will produce the minimum number of linearly independent elements (N) needed for a complete representation of the signal; a lower sampling density will cause aliasing artifacts, whereas oversampling will produce a redundant representation (this will be further discussed below). ####### Insert Fig.
2 About here ####### One of the most exciting features of wavelet and similar representations is that they appear to be useful for almost every signal processing application (acoustic, 1D; 2D images; or 3D sequences), including the modeling of biological systems. However, despite several early developments of the basic theory (Haar, 1910), it was only in the eighties that the first applications to image processing were published. Wigner (1932) introduced a complete joint representation of the phase space in Quantum Mechanics; Gabor (1946) proposed Gaussian wave packets, logons or information quanta, for optimally packing information. Cohen (1966) developed a generalized framework for phase-space distribution functions, showing that most of these conjoint image representations belong to a large class of bilinear distributions. Any given representation is obtained by choosing an appropriate kernel in the generalized distribution. Until recently, these theoretical developments were not accompanied by practical applications in signal processing. Apart from the availability of much cheaper and more powerful computers, several factors have accelerated this field in the 1980s and 1990s. On the one hand, Gabor functions were successfully applied to model the responses of simple cells in the brain's visual cortex, both in 1D (Marcelja, 1980) and 2D (Daugman, 1980). On the other hand, Bastiaans (1981) and Morlet et al. (1982) provided the theoretical basis for a practical implementation of the Gabor and other expansions. Further generalizations of the Gabor expansion (Daugman, 1988; Porat and Zeevi, 1988) and the development of wavelet theory
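The minimal uncertainty product mentioned above can be checked numerically. The following sketch (illustrative only; the bandwidth value a is an arbitrary choice) computes the time and frequency widths of a Gaussian window, using the convention f(t) = exp(−πa²t²) with ordinary frequency ν, for which the product of the widths equals 1/(4π) independently of a:

```python
import numpy as np

# Numerical check that a Gaussian window attains the lower bound of the
# joint uncertainty product.  Convention: f(t) = exp(-pi a^2 t^2), with
# ordinary frequency nu; then sigma_t * sigma_nu = 1/(4 pi) for any a.
a = 1.7                                   # arbitrary bandwidth parameter
t = np.linspace(-10.0, 10.0, 20001)
f = np.exp(-np.pi * a**2 * t**2)

# Width in time, from the normalized energy density |f(t)|^2.
p_t = np.abs(f)**2
p_t /= p_t.sum()
sigma_t = np.sqrt(np.sum(t**2 * p_t))

# Width in frequency, from the analytic transform (1/a) exp(-pi nu^2 / a^2).
nu = np.linspace(-10.0, 10.0, 20001)
F = (1.0 / a) * np.exp(-np.pi * nu**2 / a**2)
p_nu = np.abs(F)**2
p_nu /= p_nu.sum()
sigma_nu = np.sqrt(np.sum(nu**2 * p_nu))

product = sigma_t * sigma_nu              # ~= 1/(4 pi), independent of a
```

Repeating the computation with a different a leaves the product unchanged, which is precisely the sense in which the Gaussian trades time width for frequency width at the uncertainty bound.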
(Grossman and Morlet, 1984; Meyer, 1988; Mallat, 1989b; Daubechies, 1990) have opened a broad field of applications. In particular, wavelet theory has constituted a unifying framework, merging ideas from Mathematics, Physics and Engineering. One of the most important applications has been image coding and compression, due to its technological relevance. In fact, many conjoint schemes of representation, such as multiresolution pyramids (Burt and Adelson, 1983) or the Discrete Cosine Transform (Rao, 1990) used in the JPEG and MPEG image and video standards, were specifically directed at image compression. A Gabor function, or Gaussian wave packet, is a complex exponential with a Gaussian modulation or envelope. From now on, we will use the variable t (time) for the 1D case, and x, y for 2D. (Although this review is mainly focused on 2D images, it is simpler and more convenient to use a 1D formulation that can easily be generalized to the 2D case.) In one dimension, the mathematical expression of a Gabor function is:
g_{t0,ω0}(t) = a · exp[−πa²(t − t0)²] · exp[i(ω0 t + φ)]    (1)
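Eq. (1) can be evaluated directly. The short sketch below (illustrative only; all parameter values are arbitrary, and ordinary frequency ν is used in place of ω) also verifies numerically that the magnitude spectrum of a Gabor function is a Gaussian bump centred on the tuning frequency, anticipating Eq. (2):

```python
import numpy as np

# Direct evaluation of the 1D Gabor function of Eq. (1), written with
# ordinary frequency nu0 (so the carrier is exp(i 2 pi nu0 t)).
def gabor_1d(t, t0=0.0, nu0=8.0, a=2.0, phi=0.0):
    envelope = a * np.exp(-np.pi * a**2 * (t - t0)**2)
    return envelope * np.exp(1j * (2.0 * np.pi * nu0 * t + phi))

fs = 64.0                                  # sampling rate
t = np.arange(-4.0, 4.0, 1.0 / fs)
g = gabor_1d(t, t0=0.5, nu0=8.0, a=2.0)    # atom tuned to t0=0.5, nu0=8

# The spectrum is a Gaussian bump centred at the tuning frequency nu0.
spectrum = np.abs(np.fft.fft(g))
nu = np.fft.fftfreq(t.size, d=1.0 / fs)
nu_peak = nu[np.argmax(spectrum)]          # sits at nu0
```

The envelope peaks at t0 with height a, while the spectral peak sits at ν0, illustrating the joint localization that the two labels express.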
The two labels, t0, ω0, stand for the temporal and frequency localization or tuning. The parameter a determines the half-width of the Gaussian envelope, and φ is the phase offset of the complex exponential. The most characteristic property of Gabor functions is that they have the same mathematical expression in both domains. The Fourier transform of g_{t0,ω0}(t) is:

G_{t0,ω0}(ω) = (1/a) · exp[−(π/a²)(ω − ω0)²] · exp[−i(ωt0 − φ′)]    (2)
where φ′ = ω0 t0 + φ. This property, which allows fast implementations in either the space or frequency domain, along with their optimal joint localization (Gabor, 1946), has led to a series of interesting applications. Moreover, by changing a single parameter, the bandwidth a, we can continuously shift the time-frequency (or, in 2D, the space/spatial-frequency) localization from one domain to the other. For instance, visual models (as well as most applications) use a fine spatial sampling (high localization) and a coarse sampling of the spatial-frequency domain (see Section IV). In addition to the two possible computer implementations, in the space (or time) and Fourier domains (Navarro and Tabernero, 1991), Bastiaans (1982) also proposed a parallel optical generation of the Gabor expansion. Subsequently, several authors (Freysz et al., 1990; Li and Zhang, 1992; Sheng et al., 1992) have reported optical implementations. In the two-dimensional case, it is common to use Cartesian spatial coordinates, but polar coordinates for the spatial-frequency domain:
g_{x0,y0,f0,θ0}(x, y) = exp{i[2πf0(x cosθ0 + y sinθ0) + φ]} · g_gauss(x − x0, y − y0)    (3a)

where the Gaussian envelope has the form:

g_gauss(x, y) = a · exp{−πa²[(x cosθ0 + y sinθ0)² + γ²(x sinθ0 − y cosθ0)²]}    (3b)
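Eqs. (3a)-(3b) translate directly into code. The following sketch (with arbitrary, purely illustrative parameter values) evaluates a 2D Gabor function on a small grid:

```python
import numpy as np

# 2D Gabor function of Eqs. (3a)-(3b): a complex plane wave of frequency
# f0 and orientation theta0, under an oriented Gaussian envelope centred
# at (x0, y0).  All parameter defaults here are illustrative choices.
def gabor_2d(x, y, x0=0.0, y0=0.0, f0=0.25, theta0=0.0,
             a=0.5, gamma=1.0, phi=0.0):
    xr = (x - x0) * np.cos(theta0) + (y - y0) * np.sin(theta0)
    yr = (x - x0) * np.sin(theta0) - (y - y0) * np.cos(theta0)
    envelope = a * np.exp(-np.pi * a**2 * (xr**2 + gamma**2 * yr**2))
    carrier = np.exp(1j * (2.0 * np.pi * f0 *
                           (x * np.cos(theta0) + y * np.sin(theta0)) + phi))
    return envelope * carrier

x, y = np.meshgrid(np.arange(-16, 17), np.arange(-16, 17))
g = gabor_2d(x, y, f0=0.125, theta0=np.pi / 4, a=0.1)
```

The magnitude peaks at (x0, y0) with value a; rotating θ0 rotates both the envelope axes and the carrier, as the restriction stated below requires.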
The four labels x0, y0, f0, θ0 stand for the spatial and frequency localization. The parameters a and γ define the bandwidth and aspect ratio of the Gaussian envelope, respectively (we have restricted the Gaussian to have its principal axis along the θ0 direction); φ is again the phase offset. Apart from the interesting properties mentioned above, Gaussian wave packets (or wavelets), GW, also have some drawbacks. Their lack of orthogonality makes the computation of the expansion coefficients difficult. A possible solution is to find a biorthogonal companion basis that facilitates the computation of the coefficients for the exact reconstruction of the signal (Bastiaans, 1981). This solution is computationally expensive, and the interpolating biorthogonal functions can have a rather complicated shape. Several practical solutions for finding the expansion coefficients have been proposed, for instance the use of a relaxation network (Daugman, 1988). By oversampling the signal to some degree, we can obtain dual functions more similar in shape to the Gabor basis (Daubechies, 1990). The redundancy inherent in oversampling is of course a bad property for coding and compression applications. However, for control systems, redundancy and lack of orthogonality are desirable properties that are necessary for robustness. Biological vision (and sensory systems in general) lacks orthogonality, producing a redundancy that is highly expensive, this being the price of robustness. The use of redundant sampling permits the design of quasicomplete Gabor representations (Navarro and Tabernero, 1991) that are simple, robust and fast to implement, providing reconstructions with a high signal-to-noise ratio (SNR) and high visual quality. A minor drawback is that Gabor functions are not pure band-pass, which is a basic requirement for an admissible wavelet (but their DC response is very small anyway: less than 0.002 for a 2D, one-octave bandwidth Gabor function).
These drawbacks have motivated the search for other basis functions, orthogonal when possible. This, along with the wide (and still increasing) range of applications, and the merging of ideas from different fields, has produced many different schemes of image representation in the literature (we will review the most representative schemes in Section II, before focusing on GW in Section III). Almost every author seems to have a favorite scheme and basis function, depending on his or her area of interest, personal background, etc. In our case, there are several reasons why GW (Gabor functions) constitute our favorite basis for image representation. Apart from optimal joint localization (as pointed out by Gabor), the good behavior of Gaussians, and robustness, perhaps the most interesting property is that they probably have the broadest field of application. For a given application (for example coding, edge detection, motion analysis, etc.) one can find and implement an optimal basis function. For instance, Canny (1986) has shown that Gaussian derivatives are optimal for edge detection in noisy environments. Gabor functions are probably not optimal for most applications, but they perform well in almost all cases, and in many of them nearly optimally. This can be intuitively explained in terms of the central limit theorem (Papoulis, 1989); i.e., the cumulative convolution of many different kernels will result in a Gaussian convolution. The following is not a rigorous but only an intuitive
discussion: The good fit obtained with GW to the responses of cortical neurons could be, roughly speaking, a consequence of the central limit theorem, in the sense that from the retina to the primary visual cortex there is a series of successive neural networks. In a rough linear approach, we can model each neural layer as a discrete convolution. Thus, the global effect would be approximately equivalent to a single Gaussian channel. Although this idea is far from having a rigorous demonstration, it has indeed been applied to the implementation of multiscale Gabor filtering (Rao and Ben-Arie, 1993). On the other hand, with the central limit theorem in mind, one could expect that when trying to optimize a basis function (a filter) for many different tasks simultaneously, the resulting filter would tend to show a Gaussian envelope. The field of application of GW and similar schemes of image representation is huge, and continuously increasing. They are highly useful in almost every problem of image processing, coding, enhancement and analysis, and low- and mid-level vision (including the modeling of biological vision). Moreover, multiscale and wavelet representations have provided important breakthroughs in image understanding and analysis. Furthermore, Gabor functions are a widely used tool for visual testing in psychophysical and physiological studies. Gaussian envelopes are very common in grating stimuli used to measure contrast sensitivity, and to study shape, texture or motion perception, etc. (Caelli and Moraglia, 1985; Sagi, 1990; Geri et al., 1995; Watson and Turano, 1995). Although these applications are beyond the scope of this review, we want to mention them due to their increasing relevance. All these facts suggest that GW are especially suitable for building general purpose environments for image processing, analysis and artificial vision systems.
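The central-limit intuition above can be illustrated numerically: cascading a few convolutions with a decidedly non-Gaussian (box) kernel already yields a near-Gaussian profile. The following fragment is only an illustration of that intuition, not a proof:

```python
import numpy as np

# Cascade of identical box filters: by the central limit theorem the
# resulting equivalent kernel rapidly approaches a Gaussian shape.
box = np.ones(9) / 9.0
kernel = box.copy()
for _ in range(5):                         # six box filters in total
    kernel = np.convolve(kernel, box)

# Fit a Gaussian with the same mean and variance as the cascaded kernel.
n = np.arange(kernel.size)
mu = np.sum(n * kernel)
var = np.sum((n - mu)**2 * kernel)
gauss = np.exp(-(n - mu)**2 / (2.0 * var))
gauss /= gauss.sum()

# Normalized (cosine) similarity between the two profiles: close to 1.
similarity = (np.sum(kernel * gauss) /
              np.sqrt(np.sum(kernel**2) * np.sum(gauss**2)))
```

Even with only six stages the similarity is already very close to one, which is the sense in which a cascade of neural layers could behave like a single Gaussian channel.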
Here, we have classified the most relevant applications into three groups: modeling of early processing in the human visual system (Section IV); applications to image coding, enhancement and reconstruction (Section V); and applications to image analysis and machine vision (Section VI). Prior to these applications, we first review the main conjoint image representations in Section II; Section III then deals specifically with Gabor representations.
II. JOINT SPACE-FREQUENCY REPRESENTATIONS AND WAVELETS
A. Joint representations. Wigner distribution, spectrogram and block transforms.
B. Wavelets.
C. Multiresolution pyramids.
D. Vision oriented models.

A. JOINT REPRESENTATIONS. WIGNER DISTRIBUTION, SPECTROGRAM AND BLOCK TRANSFORMS

Stationary signals or processes are statistically invariant over space or time (e.g. white noise or sinusoids), and thus we can apply a global description or analysis to them (e.g. the Fourier transform). As in the example of Fig. 1, an image composed of several differently textured objects will be non-stationary. Images can also be affected by non-stationary processes. For instance, optical defocus will produce a spatially invariant blur in the case of a flat object that is perpendicular to the optical axis of the camera. However, in the 3D world, defocus will vary with the distance from the object to the camera, and hence it will be non-stationary in general. The result is a spatially variant blur that we cannot describe as a conventional convolution. Spatially variant signals and processes can be better characterized by conjoint time-frequency, or space/spatial-frequency, representations.
Wigner distribution function

Wigner (1932) introduced a bilinear distribution as a conjoint representation of the phase space in Quantum Mechanics. Later, Ville (1948) derived the same (Wigner or Wigner-Ville) distribution in the field of signal processing. As we have mentioned before, we will be using the variable t for the 1D case (equivalent expressions can be derived for the 2D spatial domain, or higher dimensions). For a continuous and integrable signal f(t), the symmetric Wigner distribution is given by (Claasen and Mecklenbrauker, 1980):

W_f(t, ω) = ∫_{−∞}^{∞} f(t + s/2) f*(t − s/2) e^{−iωs} ds    (4)
where s is the integration variable, ω is the frequency variable and f* stands for the complex conjugate of f. The WD belongs to the Cohen class of bilinear distributions (Cohen, 1966), in which each member is obtained by introducing a particular kernel, φ(ξ, α), in the generalized distribution (Jacobson and Wechsler, 1988). These bilinear distributions C(t, ω) can be expressed as the 2D Fourier transform of weighted versions of the ambiguity function:

C(t, ω) = (1/2π) ∫_{−∞}^{∞} ∫_{−∞}^{∞} A(ξ, α) φ(ξ, α) e^{−i(ξt + αω)} dξ dα    (5)
where A(ξ, α) is the ambiguity function:

A(ξ, α) = ∫_{−∞}^{∞} f(t + ξ/2) f*(t − ξ/2) e^{−iαt} dt.    (6)
The Wigner distribution, due to its bilinear definition, contains cross terms, complicating its interpretation, especially in pattern recognition applications.
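For a discrete signal, Eq. (4) can be approximated directly. The sketch below is a deliberately simplified discrete version (the rigorous discrete-time WD requires extra care with half-sample lags; see Claasen and Mecklenbrauker, 1980) that checks the marginal property: summing the distribution over frequency recovers the instantaneous energy |f(n)|²:

```python
import numpy as np

# Simplified discrete approximation to the Wigner distribution of
# Eq. (4): build the lag product r(n, m) = f(n+m) f*(n-m) with zero
# padding at the borders, then Fourier transform along the lag axis.
def wigner(f):
    N = f.size
    W = np.zeros((N, N), dtype=complex)
    for n in range(N):
        r = np.zeros(N, dtype=complex)
        for m in range(-(N // 2), N // 2):
            if 0 <= n + m < N and 0 <= n - m < N:
                r[m % N] = f[n + m] * np.conj(f[n - m])
        W[n] = np.fft.fft(r)               # distribution along frequency
    return W

n = np.arange(64)
f = np.exp(2j * np.pi * 8 * n / 64)        # pure tone test signal
W = wigner(f)
# Marginal property: sum over frequency gives N |f(n)|^2 at each n.
```

For the pure tone, the frequency marginal is flat, as expected for a signal of constant instantaneous energy; the bilinear structure of the lag product is also what creates the cross terms mentioned above when two components are present.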
Complex spectrogram

Another way to obtain a conjoint representation is through the complex spectrogram, which can be expressed as a windowed Fourier transform:

F(t, ω) = ∫_{−∞}^{∞} w(s − t) f(s) e^{−iωs} ds    (7)
where w(s) is the window that introduces localization in time (or space). The signal can be recovered from the complex spectrogram by the inversion formula (Helstrom, 1966):

f(s) = (1/2π) ∫_{−∞}^{∞} dω ∫_{−∞}^{∞} F(t, ω) w(s − t) e^{iωs} dt.    (8)
The Wigner-Ville distribution can be considered as a particular case of the complex spectrogram, where the shifting window is the signal itself (complex conjugated). Both the spectrogram and the Wigner-Ville distribution belong to the Cohen class (with kernels φ = W_w(t, ω) and φ = 1, respectively); both are conjoint, complete and invertible representations, but at the cost of a high redundancy. When the window w is a Gaussian, we can make a simple change of notation, calling:

g_{t,ω}(s) = w(s − t) e^{iωs}.    (9)
Then g_{t,ω}(s) is a Gabor function, and Eq. (7) becomes:

F(t, ω) = ∫_{−∞}^{∞} f(s) g*_{t,ω}(s) ds = ⟨f(s), g_{t,ω}(s)⟩.    (10)
Therefore, we can obtain the "Gaussian" complex spectrogram at any given point (t, ω) as the inner product between the signal f and a localized Gabor function. The decomposition of a signal into its projections on a set of displaced and modulated versions of a kernel function appears in Quantum Optics and other areas of Physics. The elements of the set {g_{t,ω}(s)} are the coherent states associated with the Weyl-Heisenberg group, which sample the phase space (t, ω). The spectrogram of Eq. (10) provides information about the energy content of the signal at (t, ω), because the inner product captures similarities between the signal f and the "probe" function g_{t,ω}, which is localized in the joint domain. To recover the signal in the continuous case, we rewrite Eq. (8) as:

f(s) = (1/2π) ∫_{−∞}^{∞} dω ∫_{−∞}^{∞} ⟨f(s), g_{t,ω}(s)⟩ g_{t,ω}(s) dt.    (11)
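The inner products of Eq. (10) can be computed pointwise by brute force. The following sketch (illustrative only; the two-tone signal, window width and sampling grid are all arbitrary choices echoing the example of Fig. 1) shows that the Gaussian-windowed spectrogram localizes each frequency in time:

```python
import numpy as np

# Gaussian-windowed spectrogram of Eq. (10), computed as inner products
# of the signal with Gabor atoms g_{t,nu} (ordinary frequency nu).
fs = 64.0
s = np.arange(0.0, 4.0, 1.0 / fs)
# Two consecutive "notes": 5 Hz for t < 2 s, then 12 Hz.
signal = np.where(s < 2.0, np.cos(2 * np.pi * 5 * s),
                  np.cos(2 * np.pi * 12 * s))

def spectrogram_point(f, s, t, nu, a=2.0):
    """F(t, nu) = <f, g_{t,nu}> for a Gaussian window of bandwidth a."""
    g = np.exp(-np.pi * a**2 * (s - t)**2) * np.exp(2j * np.pi * nu * s)
    return np.sum(f * np.conj(g)) / fs

nus = np.arange(0.0, 32.0, 0.5)
early = np.abs([spectrogram_point(signal, s, 0.9, nu) for nu in nus])
late = np.abs([spectrogram_point(signal, s, 3.0, nu) for nu in nus])
```

The magnitude taken early in the signal peaks at 5 Hz and the one taken late peaks at 12 Hz: the conjoint representation separates the two notes that the global Fourier spectrum mixes.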
The window function does not need to be Gaussian in general. However, as we said in the introduction, Gabor functions have the advantage of maximum joint localization; i.e. they achieve the lower bound of the joint uncertainty. This has also been demonstrated in the 2D case for separable Gabor functions (Daugman, 1985). Signal uncertainty is commonly defined in terms of the variances of the marginal energy distributions associated with the signal and its Fourier transform. An alternative definition of informational uncertainty (Leipnik, 1959) has been introduced in terms of the entropy of the joint density function. Interestingly, Leipnik (1960) found that Gabor functions (among others) are entropy minimizing signals. (See Stork and Wilson (1990) for a more recent discussion of alternative metrics or measures of joint localization.)

Block Transforms

Both the WD and the complex spectrogram involve a high redundancy and permit exact recovery of the signal in the continuous case. In practical signal processing applications, we have to work with a discrete number of samples. In the Fourier transform, the complex exponentials constitute the basis functions, in both the continuous and discrete cases. In the latter case, signal recovery is guaranteed for band-limited signals with a sampling frequency greater than or equal to the Nyquist frequency. The WD also permits signal recovery in the discrete case (Claasen and Mecklenbrauker, 1980). In the case of the discrete spectrogram, with a discrete number of windows, image reconstruction is guaranteed only under certain conditions (this will be discussed in Section III). When looking for a complete but compact discrete joint image representation, one can think of dividing the signal into non-overlapping blocks and processing each block independently (in contrast to the case of overlapping, continuously shifted windows). Each block is a localized (in space, time, etc.) portion of the signal.
Then, if we apply an invertible transform to each block, we will be able to recover the signal whenever the set of blocks is complete. This is the origin of a series of block transforms, of which the discrete cosine transform (DCT) is the most representative example (Rao, 1990). Current standards for image and video compression are based on the DCT. However, the sharp discontinuities between image blocks may produce ringing and other artifacts after quantization, especially at low bit-rate transmission, that are visually annoying. We can eliminate these artifacts by duplicating the number of blocks, in what is called the lapped orthogonal transform, LOT (Malvar, 1989). This is a typical example of oversampling, which generates a linear dependence (redundancy) that improves robustness (this is further discussed in Section III). We will see later that if we apply a block-like decomposition in the Fourier domain, we can obtain a multiscale or multiresolution transform. In block transforms orthogonality is guaranteed, but there is no good joint localization.
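As an illustration of an orthogonal block transform, the following sketch builds the 8-point DCT-II matrix and verifies that independent processing of non-overlapping blocks is exactly invertible (the artifacts discussed above appear only once the coefficients are quantized):

```python
import numpy as np

# Orthonormal 8-point DCT-II matrix: C[k, n] = sqrt(2/N) cos(pi (2n+1) k / 2N),
# with the DC row scaled by 1/sqrt(2) so that C C^T = I.
N = 8
k, n = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
C = np.sqrt(2.0 / N) * np.cos(np.pi * (2 * n + 1) * k / (2 * N))
C[0] /= np.sqrt(2.0)

signal = np.random.default_rng(0).standard_normal(64)
blocks = signal.reshape(-1, N)             # non-overlapping blocks
coeffs = blocks @ C.T                      # forward transform, per block
recon = (coeffs @ C).reshape(-1)           # inverse: C is orthogonal
```

Because each block is transformed with an orthogonal matrix, reconstruction is exact; it is the independent quantization of neighboring blocks, not the transform itself, that creates the visible block boundaries.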
B. WAVELETS

In wavelet theory, the signal is represented with a set of basis functions that sample the conjoint time-frequency (or space/spatial-frequency) domain, providing a local frequency representation with a resolution matched to each scale, so that:

f(t) = Σ_i c_i Ψ_i(t)    (12)
where Ψ_i(t) are the basis functions and c_i are the coefficients that constitute the representation in that basis. The key idea of a wavelet transform is that the basis functions are obtained by translations and dilations of a unique wavelet. A wavelet transform can be viewed as a decomposition into a set of frequency channels having the same bandwidth on a logarithmic scale. The application of wavelets to signal and image processing is very recent (Mallat, 1989b; Daubechies, 1990), but their mathematical origins date back to 1910, with the Haar (1910) orthogonal basis functions. After Gabor's seminal theory of communication (1946), wavelets and similar ideas were used in solving differential equations, harmonic analysis, theory of coherent states, computer graphics, engineering applications, etc. (see, for instance, Chui, 1992a; Chui, 1992b; Daubechies, 1992; Meyer, 1993; Fournier, 1994, for reviews on wavelets). Grossman and Morlet (1984) introduced the name wavelet (continuous case) in the context of Geophysics. Then, the idea of multiresolution analysis was incorporated along with a systematic theoretical background (Meyer, 1988; Mallat, 1989b; Meyer, 1993). In the continuous 1D case, the general expression of a wavelet basis function is:

Ψ_{b,a}(t) = (1/√|a|) Ψ((t − b)/a),  a, b ∈ ℝ, a ≠ 0    (13)
where the translation and dilation coefficients (b and a, respectively) of the basic function vary continuously. In Electrical Engineering, this is called a constant-Q resonant analysis. The continuous wavelet transform W of a function f ∈ L²(ℝ), i.e. square integrable, is:

W(a, b) = (1/√|a|) ∫_{−∞}^{∞} f(t) Ψ*((t − b)/a) dt = ⟨f, Ψ_{b,a}⟩.    (14)
The basis function Ψ must satisfy the admissibility condition of finite energy (Mallat, 1989a). This implies that its Fourier transform is pure band-pass, having a zero DC response, Ψ(0) = 0. Thus, the function Ψ must oscillate above and below zero as a wave packet, which is the origin of the name wavelet. The wavelet transform, WT, has a series of important properties; we only list a few of them. The WT is an isometry, up to a proportionality coefficient, L²(ℝ) → L²(ℝ⁺ × ℝ) (Grossman and Morlet, 1984). It can be discretized by sampling both the scale (frequency) and position (space or time) parameters, as shown in Fig. 2b. Another property is that wavelets easily characterize local regularity, which is interesting in texture analysis. In the discrete case, more interesting in signal processing, there exist necessary and sufficient conditions that the basis functions have to meet so that the WT has an inverse (Daubechies, 1992). An especially interesting class of discrete basis functions is that of orthogonal wavelets. A large class of orthogonal wavelets can be related to quadrature mirror filters (Mallat, 1989b). There are important desirable properties of wavelets that are not fully compatible with orthogonality, namely, small (or at least finite) spatial support, linear phase (symmetry) and smoothness. This last property is very important in signal representation to avoid annoying artifacts, such as ringing and aliasing. The mathematical description of smoothness has been made in terms of the number of vanishing moments (Meyer, 1993), which determines the convergence rate of the wavelet approximation to a smooth function. A finite impulse response (small support) is necessary for spatial localization. Among these desirable features, orthogonality is a very restrictive condition that may be relaxed to meet other important properties, such as better joint localization. In particular, the use of linearly dependent (redundant) bi-orthogonal basis functions (Daubechies, 1990) permits meeting the smoothness, symmetry and localization requirements while keeping most of the interesting properties derived from orthogonality.
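Eq. (14) can be evaluated by direct numerical integration. The sketch below uses the Mexican-hat (Ricker) wavelet, a standard admissible choice not specific to this chapter, and checks both the zero-DC admissibility condition and the localization of a known feature:

```python
import numpy as np

# Mexican-hat (Ricker) wavelet: zero mean, pure band-pass, admissible.
def ricker(t):
    return (1.0 - t**2) * np.exp(-t**2 / 2.0)

def cwt_point(f, t, a, b):
    """W(a, b) of Eq. (14), by direct numerical integration."""
    dt = t[1] - t[0]
    return np.sum(f * ricker((t - b) / a)) * dt / np.sqrt(a)

t = np.arange(-8.0, 8.0, 0.01)
f = ricker((t - 2.0) / 0.5)                # wavelet-shaped bump at b = 2

bs = np.arange(-4.0, 6.0, 0.1)
response = [cwt_point(f, t, 0.5, b) for b in bs]
b_peak = bs[np.argmax(response)]           # localizes the bump near b = 2
```

The response at the matching scale peaks at the position of the bump, which is the local-regularity and localization behavior described above; the zero mean of the wavelet is what makes it blind to the DC level of the signal.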
C. MULTIRESOLUTION PYRAMIDS

Multiresolution pyramids are a different approach to joint representations (Burt and Adelson, 1983). The basic idea is similar to that of the block transforms, but applied to the frequency domain. Let {W_i(ω)} be a set of windows that completely cover the Fourier domain, i.e. Σ_i W_i(ω) = 1. Then, we can decompose the Fourier transform F of the signal into a series of bands, so that:

f(t) = Σ_i f̃_i(t) = (1/2π) Σ_i ∫_{−∞}^{∞} F(ω)[W_i(ω) e^{iωt}] dω.    (15)
Here we have represented the signal as the sum of filtered versions, f̃_i(t), one for each window W_i (band). This produces a representation that is localized in space (or time) and frequency (depending on the width of the window). The product within the brackets is a sort of Fourier (complex) wavelet that forms a complete basis. The set of windows {W_i(ω)} can be implemented as a bank of filters. Mallat (1989a) has shown that there exists a one-to-one correspondence between the coefficients of a wavelet expansion and those of multiresolution pyramid representations, as illustrated in Fig. 3. This is done through a mother wavelet Ψ and a scaling function φ. Fig. 4 shows an example of a scaling function, in both the spatial and Fourier domains, as well as its associated wavelet function, also in both
domains. The basic idea is to split the signal into its lower and higher frequency components. ###### Insert Figure 3 about here ###### ###### Insert Figure 4 about here ###### One of the main applications of multiresolution representations is coding and compression, in which each frequency band is sampled to achieve a maximum rate of compression. The origin of the name pyramid comes from the fact that the sampling rate depends on the bandwidth of each particular subband (Tanimoto and Pavlidis, 1975). Therefore, if we put the samples of each band on top of the previous one, we obtain a pyramid. There are basically two different strategies for sampling. Critical sampling is used to eliminate redundancy, so that the conjoint representation has no more samples than the original signal. Although we can obtain higher rates of compression with critical sampling, it has an important cost. Namely, we end up with a representation that is not robust (losing a single sample will cause very disturbing effects) and that is not translation invariant (i.e. a small displacement of the signal will produce a completely different representation), which precludes its application to vision (Simoncelli et al., 1992). In some applications, it is possible to solve the translation dependence by a circular shift of the data (Coifman and Donoho, 1995). However, a much more robust representation is obtained by a Nyquist sampling of each band, i.e. taking samples at twice the maximum frequency present in the band. The result is a shiftable and robust multiscale transform, at the cost of some redundancy. One practical problem is to design filters with a finite impulse response that simultaneously have a good frequency resolution. One solution is to use Quadrature Mirror Filters, consisting of pairs of low-pass and high-pass filters that are in phase quadrature (Esteban and Galand, 1977).
This constitutes an orthogonal basis that permits good localization in both domains, avoiding aliasing artifacts, and an exact reconstruction of the signal. ######## Insert Figure 5 about here ######## The extension to 2D (for application to image processing) of most of the analysis made above in 1D is straightforward. Figure 5 shows an example of a multiscale wavelet transform (b) of a lady portrait (a), including the application to compression: (d) the coefficients after thresholding, as explained in Section V.A, and (c) the image recovered from (d).
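The band decomposition of Eq. (15) can be sketched with ideal windows that sum to one over the Fourier domain; perfect reconstruction then follows immediately. (The window shapes below are arbitrary illustrations, not the QMF designs discussed above.)

```python
import numpy as np

# Split a signal into frequency bands with windows W_i that sum to one
# over the Fourier domain (Eq. 15), then verify that the band-filtered
# versions add back up to the original signal.
N = 256
f = np.random.default_rng(1).standard_normal(N)
F = np.fft.fft(f)

nu = np.abs(np.fft.fftfreq(N))             # normalized |frequency|, 0..0.5
low = (nu < 0.1).astype(float)             # ideal low-pass window
mid = ((nu >= 0.1) & (nu < 0.3)).astype(float)
high = (nu >= 0.3).astype(float)           # the three windows sum to 1

bands = [np.fft.ifft(F * W).real for W in (low, mid, high)]
recon = sum(bands)                         # equals f, since sum(W_i) = 1
```

Each band is a localized (in frequency) version of the signal; subsampling each band according to its bandwidth is what turns this filter bank into a pyramid.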
D. VISION-ORIENTED MODELS One striking fact about joint multiscale representations and wavelets is that a similar representation has been found in the human visual system (see Section IV). Marr (1982) and co-workers established the basis for the modern theory of computational vision by defining the primal sketch. It consisted of detecting edges (abrupt changes in the gray levels of images) by applying a Laplacian of a Gaussian operator and then extracting the zero-crossings. This is done at different scales (resolutions). Using
scaled versions of this operator, Burt and Adelson (1983) constructed the Laplacian pyramid. Each layer is constructed by doubling the size of the Laplacian operator, so that both the peak frequency and the bandwidth are divided by two. In their particular pyramid implementation, they first obtained low-pass filtered versions of the image using Gaussian filters, then subtracted the result from the previous version. Then, they subsampled the low-pass filtered version and repeated the process several times. Consequently, Nyquist sampling of each low-pass filtered version of the image gives (1/2)2 as many samples, producing the pyramid scheme. This yields an overcomplete representation with 4/3 as many coefficients as the original image. One important experimental finding in human vision is orientation selectivity, which is not captured by the Laplacian pyramid. Consequently, Daugman (1980) used 2D Gabor functions, GF, to fit experimental data, and Watson (1983) implemented a computational model of visual image representation with GF. By sampling the frequency domain in a lossless and polar-separable way, Watson (1987a) introduced an oriented pyramid called the Cortex Transform that permitted a complete representation of the image. The filters, 4 orientations by 4 frequencies plus low-pass and high-pass residuals, are constructed in the Fourier domain as the product of a circularly symmetric dom filter with an orientation-selective fan filter (see Fig. 6a). The impulse response of the cortex filter (Fig. 6b) roughly resembles a 2D Gabor function with ringing artifacts. ######## Insert Figure 6 about here ######## Marr (1982), Young (1985; 1987) and others have proposed Gaussian derivatives, GD, as an alternative to Gabor functions for modeling the receptive fields of simple cortical cells.
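The Laplacian pyramid construction described above can be sketched in a few lines. The following toy NumPy version (our names and parameters; a binomial 1-4-6-4-1 kernel approximates the Gaussian) stores, at each level, the difference between the image and the upsampled next-coarser level, so reconstruction is exact by construction:

```python
import numpy as np

def blur_down(img):
    """Blur with a separable binomial (1,4,6,4,1)/16 kernel, then subsample by 2."""
    k = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0
    for axis in (0, 1):
        img = np.apply_along_axis(
            lambda r: np.convolve(np.pad(r, 2, mode='reflect'), k, 'valid'),
            axis, img)
    return img[::2, ::2]

def upsample(img, shape):
    """Nearest-neighbour expansion back to `shape` (crude, but reconstruction
    stays exact because each layer stores the true difference)."""
    big = np.repeat(np.repeat(img, 2, axis=0), 2, axis=1)
    return big[:shape[0], :shape[1]]

def laplacian_pyramid(img, levels=3):
    layers = []
    for _ in range(levels):
        small = blur_down(img)
        layers.append(img - upsample(small, img.shape))  # band-pass layer
        img = small
    layers.append(img)                                   # low-pass residual
    return layers

def reconstruct(layers):
    img = layers[-1]
    for lap in reversed(layers[:-1]):
        img = lap + upsample(img, lap.shape)
    return img

rng = np.random.default_rng(0)
img = rng.standard_normal((32, 32))
layers = laplacian_pyramid(img, levels=3)
assert [l.shape for l in layers] == [(32, 32), (16, 16), (8, 8), (4, 4)]
assert np.allclose(reconstruct(layers), img)
```

The layer shapes make the 4/3 overhead visible: 1 + 1/4 + 1/16 + ... of the original sample count.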
Figure 7 shows the first 4 derivatives in 1D, and their frequency responses: G0, G1, G2 and G3 respectively correspond to the Gaussian and its first, second and third derivatives. Cauchy filters (Klein and Levi, 1985) or even Hermite polynomials with a Gaussian envelope (Martens, 1990a; Martens, 1990b) have also been used, but to a much lesser extent. Gabor functions turn out to be a limiting case of Hermite polynomials with a Gaussian envelope when the degree of the polynomial tends to infinity. GD are commonly used in the literature as an alternative to Gabor functions, having very similar properties, with the additional advantage of being pure bandpass (i.e. meeting the admissibility condition of wavelets), but at the cost of lower flexibility (fixed orientations, etc.). (GD are orthogonal only when centered on a fixed origin of coordinates; under translation they lose their orthogonality.) To solve this problem, steerable filters can be synthesized in any arbitrary orientation as a linear combination of a set of basis filters (Freeman and Adelson, 1991). Figure 8 shows examples of steerable filters constructed from the second derivatives of a Gaussian, G2, and their quadrature pairs, H2. Figure 9 illustrates the design of steerable filters in the Fourier domain. Based on steerable filters, Simoncelli et al. (1992) have proposed a shiftable multiscale transform. Recently, Perona (1995) has developed a method to generate deformable kernels to model early vision. ######## Insert Figure 7 about here ######## ######## Insert Figure 8 about here ########
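The steering property of G2 admits a compact numerical check. The sketch below (our notation, with an unnormalized Gaussian envelope exp(-r^2); the quadrature pair H2 is omitted) synthesizes the second directional derivative at any orientation from three basis kernels, in the spirit of Freeman and Adelson (1991):

```python
import numpy as np

# Sampling grid and an (unnormalized) isotropic Gaussian envelope
x, y = np.meshgrid(np.linspace(-3, 3, 61), np.linspace(-3, 3, 61))
g = np.exp(-(x**2 + y**2))

# Basis kernels: the three second partial derivatives of the Gaussian
g_xx = (4 * x**2 - 2) * g
g_xy = 4 * x * y * g
g_yy = (4 * y**2 - 2) * g

def steer_g2(theta):
    """G2 at orientation theta as a linear combination of the three basis
    kernels; the weights are the trigonometric interpolation functions."""
    c, s = np.cos(theta), np.sin(theta)
    return c**2 * g_xx + 2 * c * s * g_xy + s**2 * g_yy

# Check: steering reproduces the directly rotated kernel at any angle
theta = 0.7
u = x * np.cos(theta) + y * np.sin(theta)   # rotated coordinate
assert np.allclose(steer_g2(theta), (4 * u**2 - 2) * g)
```

Only three stored basis filters thus suffice to synthesize the oriented filter response at a continuum of orientations, which is the key economy of the steerable approach.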
######## Insert Figure 9 about here ######## Trying to improve the biological plausibility of spatial sampling, Watson and Ahumada (1989) proposed a hexagonal-oriented quadrature pyramid, with basis functions that are orthogonal, self-similar and localized in space and spatial frequency. However, this scheme presents some unrealistic features, such as multiple orientation selectivity. ######## Insert Figure 10 about here ######## In summary, there is a large variety of schemes of image representation that have appeared in different fields of application, including vision modeling. In particular, Fig. 10 shows 1D profiles and frequency responses for Gabor functions with different frequency tuning. We have mentioned Gabor functions only briefly in this Section; a detailed analysis follows next. For a thorough comparative evaluation and optimal filter design for several of the most widely used decomposition techniques, see Akansu and Haddad (1992).
III. GABOR SCHEMES OF REPRESENTATION
A. Exact Gabor expansion for a continuous function.
B. Gabor expansion of discrete signals.
C. Quasicomplete Gabor transform.

To introduce Gabor schemes of representation, let us consider the question of reconstructing a signal from a sampled version of the complex spectrogram (Section II.A). It was shown (Eq. 10) that a sample of the spectrogram at time t and frequency ω can be seen as the projection of the signal onto a modulated and displaced version of the window, g_{t,ω}(s). Instead of a continuum, now we have only a discrete set of functions:
\{ g_{nm}(s) \} = \{ g_{nT,mW}(s) \} = \{ w(s - nT)\, e^{imWs} \}, \quad n, m \ \text{integers}, \qquad (16)
that sample the joint domain at points (nT, mW). Recovering the signal from the sampled spectrogram is equivalent to reconstructing f(s) from its projections on that set, that is, with a summation over the indexes (n, m) instead of a double integral in t and ω. Another related problem would be to express f(s) as a linear combination of the set of functions {g_{nm}(s)}:

f(s) = \sum_n \sum_m a_{nm}\, g_{nm}(s) \qquad (17)
In the continuous case, Eq. 11 provides the answer to both questions, since it uses both the projections \langle f, g_{t,ω} \rangle and the functions g_{t,ω} to recover the signal. One could say that in that case the same set of functions is used for the analysis (obtaining the projections) and the synthesis (regenerating the signal). As we will see, that is no longer true, in general, when one has only a discrete number of projections. In that case, expressing f(s) as an expansion over a set of functions may constitute a problem different from using the projections of f(s) onto that set to recover it. These two problems are closely related, as we will see in Section III.A. The Gabor expansion arises when the basis functions g_{nm}(s) in Eq. 17 are obtained by displacements and modulations of a Gaussian window function (w(s) in Eq. 16). Gabor (1946) based his choice of the window on the fact that the Gaussian has minimal support in the joint domain. Later on, Daugman (1985) showed that this is also the case in 2D. There are many possibilities when designing a Gabor expansion. Apart from choosing the width of the Gaussian envelope (which determines the resolution in both domains) and the phase of the complex exponential, the key issue is to decide the sampling intervals T (time or space) and W (frequency), which govern the degree of overlap of the "sampling" functions g_{nm}(s). Intuitively, it seems clear that a sampling lattice that is too sparse (T, W large) will not allow the exact
reconstruction of the signal. The original choice of Gabor (1946) was TW = 2π, which corresponds to the Nyquist density. This is the minimum required to preserve all the information and is therefore called the critical sampling case. Schemes with TW < 2π oversample the joint domain and produce redundant representations.
Frames
To analyze whether a discrete set of functions {Ψ_i} permits a stable reconstruction of the signal, one requires the energy of the projections to be bounded below and above by the energy of the signal:

A \|f\|^2 \le \sum_i |\langle f, \Psi_i \rangle|^2 \le B \|f\|^2, \quad \text{with } A > 0,\ B < \infty, \qquad (20)
so that if \|f - g\| \to 0, the sum of the squared differences of the projections also tends to zero. The above condition can be expressed using operator notation as

A\,I \le T \le B\,I, \qquad (21)
with I the identity matrix. A set of functions that generates an operator T complying with the above conditions is said to form a frame (Duffin and Schaeffer, 1952). The constants A and B are called frame bounds, and they determine some important properties. A frame can be seen as a generalization of the concept of a linear basis in a Hilbert space: it generates the space, but it can have, in general, "too many" vectors. An irreducible frame is a basis with linearly independent elements; otherwise, the frame is redundant, with elements that are not linearly independent. There are two advantages in using redundant frames. First, redundant frames are not orthogonal, and as mentioned before (Balian-Low theorem), relaxing the orthogonality condition permits elements with better localization. Secondly, the linear dependence of the elements of a redundant frame implies robustness, in the sense that a combination of elements can "do the work" of another element that is lost or destroyed. Orthogonal bases are a particular case of non-redundant, linearly independent frames; in the Gabor case, their functions present bad localization properties. Using T, we can construct a dual set of functions, which also constitutes a frame:
\hat{\Psi}_i(s) = T^{-1} \Psi_i(s). \qquad (22)
The dual frame is very useful because, if we denote the dual operator by \hat{T}, then:

\hat{T} f = \{ \langle f, \hat{\Psi}_i \rangle \}, \qquad (23)
and T^* \hat{T} = \hat{T}^* T = I. In practice, this means that if we generate a sequence from a function using T or \hat{T}, by applying the corresponding adjoint operator in the inverse operation (as in Eq. 19) we recover the signal:

f(s) = T^*\big(\hat{T} f(s)\big) = T^*\big(\{\langle f, \hat{\Psi}_i \rangle\}\big) = \sum_i \langle f, \hat{\Psi}_i \rangle\, \Psi_i(s), \qquad (24a)

f(s) = \hat{T}^*\big(T f(s)\big) = \hat{T}^*\big(\{\langle f, \Psi_i \rangle\}\big) = \sum_i \langle f, \Psi_i \rangle\, \hat{\Psi}_i(s). \qquad (24b)
Eq. 24a corresponds to the Gabor expansion when {Ψ_i} is a set of Gabor functions. It shows that the coefficients of the expansion are to be computed as projections not on the original set, but on the dual one. On the other hand, Eq. 24b tells us how to recover the signal from samples of the spectrogram, \langle f, \Psi_i \rangle. Therefore, both equations provide the answer to the two problems stated at the beginning of this Section, showing that they are closely related through the concept of the dual set. If we use a particular synthesis window, we must use the corresponding dual function for the analysis, and vice versa. It is important to note that if the frame is redundant, other sets of functions can be found that play the role of the dual set and lead to the reconstruction formulas of Eq. 24; the expansion is then no longer unique. The Gabor expansion can then be easily computed if we can generate the dual set. This can be found from Eq. 22, using the following series expansion for T^{-1}:

\hat{\Psi}_i = T^{-1} \Psi_i = \frac{2}{A+B} \sum_{k=0}^{\infty} \left( I - \frac{2}{A+B}\, T \right)^{k} \Psi_i \qquad (25)
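Eq. 25 can be tested numerically on a small redundant frame. The sketch below (NumPy; a generic random frame in R^4 rather than Gabor functions, and all names are ours) computes the dual set by truncating the series and verifies the reconstruction formula of Eq. 24a:

```python
import numpy as np

def dual_by_series(Psi, n_terms=2000):
    """Dual frame via the truncated Neumann-series expansion of T^{-1}
    (Eq. 25).  Rows of Psi are the frame vectors; convergence is governed
    by the frame bounds A, B (fast when B/A is close to 1)."""
    S = Psi.T @ Psi                          # frame operator
    A, B = np.linalg.eigvalsh(S)[[0, -1]]    # frame bounds (extreme eigenvalues)
    M = np.eye(S.shape[0]) - (2.0 / (A + B)) * S
    term, acc = Psi.copy(), np.zeros_like(Psi)
    for _ in range(n_terms):
        acc += term
        term = term @ M                      # M is symmetric, so rows map by M
    return (2.0 / (A + B)) * acc

rng = np.random.default_rng(2)
Psi = rng.standard_normal((8, 4))            # 8 frame vectors in R^4 (redundant)
dual = dual_by_series(Psi)

# Matches the exact dual set, and Eq. 24a holds: f = sum_i <f, dual_i> Psi_i
assert np.allclose(dual, Psi @ np.linalg.inv(Psi.T @ Psi))
f = rng.standard_normal(4)
assert np.allclose(Psi.T @ (dual @ f), f)
```

The number of terms needed in practice depends directly on B/A, which motivates the discussion of "snug" frames below.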
The convergence of this series depends on the frame bounds A and B. One can distinguish three cases:
• A = B = 1: from Eq. 21 we see that T = I, and consequently \hat{\Psi}_i = \Psi_i. The synthesis and analysis windows are the same, and the frame constitutes an orthonormal basis.
• A = B: "tight" frames; in this case T = AI, and the dual set is the same except for a constant factor.
• A ≠ B: general case. If A ≈ B, the frame is said to be "snug".
In the first two cases, it is trivial to recover the signal from its projections. The case of "snug" frames is important because when B/A ≈ 1 the series in Eq. 25 converges quickly and the dual set is not too different from the original. Consequently, "snug" frames provide a good direct recovery of the signal, to a first approximation, from its projections on the frame functions, i.e. taking only the k = 0 term in Eq. 25:
f(s) \propto \sum_i \langle f, \Psi_i \rangle\, \Psi_i. \qquad (26)
An adequate choice of the seed function from which the set is generated allows us to build a tight frame with all its convenient properties (Daubechies et al., 1986). However, for the Gabor expansion the seed function is a Gaussian. Thus, we have to study under which conditions it can generate a frame, and try to make that frame tight or at least "snug". The parameters that we can vary in the Gabor expansion are the sampling rates in time (T) and frequency (W). If TW > 2π, the sampling lattice is too loose, and the signal cannot be completely recovered (Bargmann et al., 1971). The choice in many schemes (Gabor, 1946; Bastiaans, 1981, etc.) was then TW = 2π, i.e. the critical sampling corresponding to the Nyquist density, where the coefficients of the expansion can be associated with the degrees of freedom of the signal. However, as Gaussians are smooth, well-localized functions, the Balian-Low theorem states that they cannot constitute a frame at this critical sampling. In practice this means that with critical sampling the reconstruction, although possible in theory, lacks stability, which reduces its interest in many practical applications. Finally, when TW < 2π (i.e. oversampling the signal) we can obtain a frame (although not a tight one). However, by increasing the amount of oversampling (TW = π, TW = π/2, and so on) we can get "snug" enough frames (with B/A ≈ 1). Fortunately, oversampling by a factor of 2 or 4 (over the Nyquist density) is enough in practice, although stability improves as oversampling increases. For a given value of TW < 2π, any choice of T, W will generate a frame, but intuitively, large values of T produce a poor spatial sampling, and the same holds for large W in frequency. In general, the best results (that is, the "snuggest" frames) are obtained by adjusting T to the width of the Gaussian. If we wish to increase the temporal resolution, we should reduce both T and the Gaussian width.
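These observations about sampling density can be checked numerically. The sketch below (our construction: a discrete, circularly shifted Gabor system of length 64, with the window width matched to the lattice) estimates the frame bounds as the extreme squared singular values of the matrix of atoms, and shows that oversampling makes the frame far "snugger" than critical sampling:

```python
import numpy as np

def gabor_frame_bounds(N, a, b, sigma):
    """Frame bounds (A, B) of a discrete periodic Gabor system: a circular
    Gaussian window shifted by multiples of `a` samples and modulated in
    steps of `b` frequency bins.  a*b = N is critical sampling."""
    k = np.arange(N)
    w = np.exp(-np.pi * (((k + N // 2) % N - N // 2) / sigma) ** 2)
    atoms = [np.roll(w, n * a) * np.exp(2j * np.pi * b * m * k / N)
             for n in range(N // a) for m in range(N // b)]
    s = np.linalg.svd(np.array(atoms).T, compute_uv=False)
    return s[-1] ** 2, s[0] ** 2     # singular values come sorted descending

N = 64
A1, B1 = gabor_frame_bounds(N, a=8, b=8, sigma=8)   # critical: a*b = N
A2, B2 = gabor_frame_bounds(N, a=4, b=4, sigma=8)   # 4x oversampled lattice
assert B2 / A2 < B1 / A1    # oversampling yields a much "snugger" frame
assert B2 / A2 < 10         # close to tight for the matched window width
```

The critically sampled Gaussian system has a tiny lower bound A (the discrete counterpart of the Balian-Low instability), while the fourfold-oversampled lattice gives B/A close to 1.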
The evolution of the dual functions for different degrees of oversampling (TW = 2πλ) was studied by Daubechies (1990) for the continuous case and is shown in Fig. 11. Similar studies for the discrete case can be found in Wexler and Raz (1990), Qian et al. (1992) and Qian and Chen (1993). A moderate oversampling (by a factor of four; λ = 0.25) is enough to yield dual functions that differ very little from the original Gaussian envelope (except for scaling). With this modest redundancy it is possible to obtain an approximate direct reconstruction of the signal, following Eq. 26. As we move closer to critical sampling (λ → 1), the dual functions depart from the Gaussian shape, and consequently the above approximation no longer holds. For critical sampling, we obtain a dual function that is not square-integrable (lower-right panel in Fig. 11), which reflects the unstable nature of the reconstruction. ######## Insert Figure 11 about here ######## The convergence of the dual functions (the analysis window) to the original Gaussians (the synthesis window) as oversampling increases admits an intuitive
explanation. We can consider oversampling as an intermediate step between the reconstruction of the signal from the continuous spectrogram (Eq. 11, corresponding to an infinitely dense sampling lattice) and the critical sampling (using the sparsest possible lattice in Eq. 17). While in the latter case the analysis and synthesis windows can be very different, in the former they are proportional, by a factor 2π (Eq. 11). Therefore, by increasing the oversampling we obtain a more robust representation, with dual functions closer to Gaussians.
Biorthogonal functions
Bastiaans (1981) introduced the idea of using functions that are biorthogonal to the set of Gabor functions, a concept that Daubechies (1990) later generalized in the context of dual frames. Indeed, these biorthogonal functions turn out to be the dual set, since Eq. 22 implies biorthogonality between the dual sets:

\langle \hat{\Psi}_i, \Psi_j \rangle = \delta_{ij} \qquad (27)
Bastiaans (1981) was the first to use the above relationship to compute the analytical expression of the biorthogonal functions in the Gabor expansion (for the critical sampling TW = 2π). From a Gaussian synthesis window w(s) of width T,

w(s) = \left( \frac{\sqrt{2}}{T} \right)^{1/2} e^{-\pi (s/T)^2}, \qquad (28)

he obtained the corresponding biorthogonal analysis window \hat{w}(s) of the dual set:

\hat{w}(s) = \left( \frac{1}{2T} \right)^{1/2} \left( \frac{K_0}{\pi} \right)^{-3/2} \sum_{n + 1/2 \,\ge\, s/T} (-1)^n\, e^{-\pi (n + 1/2)^2}, \qquad (29)

where K_0 = 1.8540746. He also showed that the biorthogonal (dual) set can be generated by translations and modulations of \hat{w}(s). Both functions, w(s) and \hat{w}(s), are shown in Figure 12. The dual function is badly localized in time (Balian-Low theorem). Consequently, the coefficients a_{nm} will capture information from the signal far from the time nT. Moreover, its odd behavior, with sharp spikes, will cause stability problems in practice (some considerations on the effect of quantization in the Gabor expansion can be found in Porat and Zeevi, 1988). In spite of these problems, the use of biorthogonal functions is widespread in the computation of the Gabor expansion (Bastiaans, 1981, 1985 in 1D; Porat and Zeevi, 1988 in 2D). Some authors (Kritikos and Farnum, 1987) have introduced approximations that facilitate the computation of the biorthogonal function for more general types of window. ######## Insert Figure 12 about here ########
The Zak transform
In this section we introduce the Zak transform (ZT), first proposed in the context of solid-state physics and the reciprocal lattice (Zak, 1967). This transform, another joint representation of a signal, makes it possible to reformulate the problem without the explicit use of biorthogonal or dual functions. The Zak transform of f(t) is the Fourier transform of the sequence {f(t + mT)}:

\tilde{f}(t, \omega) = \sum_m f(t + mT)\, e^{-im\omega T}. \qquad (30)
\tilde{f}(t, \omega) is periodic in ω with period Ω = 2π/T, and quasiperiodic in t, so that:

\tilde{f}(t + kT, \omega + m\Omega) = \tilde{f}(t, \omega)\, e^{ik\omega T}. \qquad (31)
The Zak transform maps the information contained in f(t) into a square of area T×Ω = 2π in the joint domain. This again is related to the Nyquist sampling density. Bastiaans (1985) implicitly used the Zak transform, but Einziger (1988; Einziger et al., 1989) was the first to apply it explicitly to the Gabor expansion. Later work has addressed implementing the ZT in the discrete case (Zeevi and Gertner, 1992; Bastiaans, 1994). The ZT is of interest here because it translates the biorthogonality relationship of Eq. 27 into a product:

\langle f_i(t), g_j(t) \rangle = \delta_{ij} \;\Rightarrow\; T\, \tilde{f}(t, \omega)\, \tilde{g}^*(t, \omega) = 1 \qquad (32)
We can then invert the ZT of the Gaussian window (provided that it has no zeros) to compute the ZT of the desired biorthogonal function; a final inverse ZT yields the dual function itself. However, we do not need to obtain these dual functions explicitly. Taking Zak transforms on both sides of the Gabor expansion (Eq. 17) we obtain:

\tilde{f}(t, \omega) = \tilde{w}(t, \omega) \sum_n \sum_m a_{nm}\, e^{-in\omega T}\, e^{im\Omega t}. \qquad (33)

Thus, the Fourier transform a(t, \omega) of the sequence of coefficients a_{nm} is given by:

a(t, \omega) = \frac{\tilde{f}(t, \omega)}{\tilde{w}(t, \omega)}, \qquad (34)
and from it the coefficients a_{nm} can be obtained via the inverse Fourier transform of a sequence. If the Zak transform \tilde{w}(t, \omega) of the window function has zeros, then non-zero functions \tilde{h}(t, \omega) can be built such that \tilde{w}(t, \omega)\tilde{h}(t, \omega) = 0. These \tilde{h}(t, \omega) are homogeneous solutions of the biorthogonality equation (Eq. 32), and consequently adding them to the biorthogonal window generates new permissible reconstruction functions, causing a multiplicity of solutions. As in the discussion in terms of frames, one can take advantage of this multiplicity of solutions when oversampling to generate well-behaved dual functions using the ZT (Zibulski and Zeevi, 1993).

B. GABOR EXPANSION OF DISCRETE SIGNALS

In the above analysis we have considered continuous signals to be expanded through a set of continuous functions. In digital signal processing, an A/D converter samples the input function, which is then known only at a finite number of points (N). Therefore, in practice we will mostly be interested in the Gabor expansion of discrete signals, which we will distinguish using bracket notation. As is usual when working with finite discrete sequences, we assume that we are dealing with a periodic sequence of period N. In the discrete case the Gabor expansion becomes a system of linear equations. Let us suppose that M Gabor functions are being used for the expansion. Again, in what follows, we will use a single index to denote the different Gabor functions, although two (for 1D signals) or four (for 2D signals) should be used to indicate the sampling of the joint domain (nT, mW). As before, we want to find a set of coefficients {c_i} so that the Gabor expansion is as close as possible to the original signal f[k] (at the N sampled points) according to some criterion:

f[k] \approx \sum_{i=1}^{M} c_i\, g_i[k], \qquad k = 0, \ldots, N-1, \qquad (35a)
that we can rewrite as a vector-matrix product:

F \approx G\,C, \qquad (35b)
where F is a vector whose N components are the input samples, and C is the vector formed by the M coefficients of the expansion (the unknowns). G is an N×M matrix whose M columns correspond to the M Gabor functions sampled at the N points. A common criterion of similarity is the least-squares error, where the goal is to minimize the norm of the error vector:

(F - GC)^{+} (F - GC), \qquad (36)
where the superscript + denotes the conjugate transpose. Again, with an orthogonal basis the solution C could easily be built through the projections on the basis vectors (now using a discrete version of the inner product). However, since Gabor functions are not orthogonal, one has to solve the well-known set of normal equations:

G^{+}G\,C = G^{+}F, \qquad (37)
where now G^{+}G is an M×M square matrix. The main advantage of this least-squares approach is that the result is rather general: it does not depend on the particular type of expansion or sampling used. The solution obtained is the closest to the original signal in the least-squares sense. We only need to solve a (large) linear system with N equations (the number of samples) and M unknowns
(the number of Gabor functions used to reconstruct the signal). Next we review the main approaches to solving Eq. 37 in practice. Iterative methods: Daugman's neural network. The first method for calculating the Gabor expansion by solving a linear system of equations was developed by Daugman (1988) and applied to 2D signals (images). He pointed out the difficulties of directly solving the system: for a typical 256×256 pixel image, we would need to solve a system of at least 65536 equations. However, he also noted that the joint localization of Gabor functions leads to very sparse matrices, thus opening the door to special techniques. Daugman used a neural network implementing a steepest-descent method to minimize the cost function of Eq. 36. From an initial guess of the coefficients, he reconstructs the image and computes the resulting error. Then, each coefficient is updated by an amount proportional to the inner product of its corresponding Gabor function and the reconstruction error. As there is only one minimum, the network cannot be trapped in local minima, and convergence towards the desired coefficients is ensured. As this procedure is based on iterative methods for solving a linear system, its convergence can be improved using standard techniques of numerical analysis (Braithwaite and Beddoes, 1992). Fig. 13a shows an image of 256×256 pixels, while Fig. 13b presents the 4D coefficient set {a_nmrs} of the Gabor expansion computed by Daugman. The sampling intervals in the spatial domain were 16 pixels in both directions, so that all the coefficients corresponding to different frequencies are grouped in the figure around the spatial sampling positions (16n, 16m). ######## Insert Figure 13 about here ########
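The steepest-descent update just described can be sketched in a few lines. The toy below is a 1D discrete Gabor set of our own construction (not Daugman's 2D implementation); each coefficient is nudged by an amount proportional to the inner product of its Gabor function with the current reconstruction error:

```python
import numpy as np

N = 64
k = np.arange(N)
w = np.exp(-np.pi * (((k + N // 2) % N - N // 2) / 8.0) ** 2)  # circular Gaussian

# 2x-oversampled discrete Gabor set: shifts of 8 samples, 16 modulation frequencies
atoms = [np.roll(w, 8 * n) * np.exp(2j * np.pi * 4 * m * k / N)
         for n in range(8) for m in range(16)]
G = np.array(atoms).T                        # the N x M matrix of Eq. 35b

rng = np.random.default_rng(1)
f = rng.standard_normal(N)                   # signal to encode

c = np.zeros(G.shape[1], dtype=complex)
eta = 1.0 / np.linalg.norm(G, 2) ** 2        # stable step size (< 2/||G||^2)
for _ in range(5000):
    err = f - G @ c                          # current reconstruction error
    c += eta * (G.conj().T @ err)            # update ∝ <error, Gabor function i>

assert np.linalg.norm(f - G @ c) < 1e-6 * np.linalg.norm(f)
```

Because the cost of Eq. 36 is a convex quadratic, the iteration converges to a least-squares solution; its speed depends on the conditioning of the Gabor set, which is why the snugness of the frame matters in practice.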
Direct methods for the inversion of G+G Despite the large size of the resulting linear system, trying to solve it directly is not as inefficient as one might think. Once the particular set of Gabor functions that will be used to sample the joint domain has been chosen, the matrix G that defines the system is fixed, independent of the input. Then the bulk of the process, the inversion of the matrix, has to be done only once. Most approaches consist of factoring the matrix of Eq. 37 using, for instance, QR decomposition (Lau et al., 1993), singular value decomposition (Ebrahimi and Kunt, 1991), or Toeplitz matrices (Yao, 1993). Once this very time-consuming task (on the order of N3 operations, with N the number of coefficients) is accomplished, computing the Gabor expansion for a particular signal is a much faster process (requiring N2 operations), since it only involves a matrix multiplication. In the critical sampling case N = M, and the coefficients can be obtained as C = G-1F, although the ill-conditioning of G may force the use of singular value decomposition techniques to invert it (Ebrahimi and Kunt, 1991). On the other hand, undersampling leads to an overdetermined system (M
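The "factor once, reuse per signal" strategy can be illustrated with an SVD-based pseudoinverse standing in for the factorizations cited above (our toy 1D Gabor dictionary; names are ours):

```python
import numpy as np

N = 64
k = np.arange(N)
w = np.exp(-np.pi * (((k + N // 2) % N - N // 2) / 8.0) ** 2)
atoms = [np.roll(w, 8 * n) * np.exp(2j * np.pi * 4 * m * k / N)
         for n in range(8) for m in range(16)]
G = np.array(atoms).T                 # fixed once the Gabor set is chosen

# One-time, expensive step: SVD-based pseudoinverse (robust to ill-conditioning)
G_pinv = np.linalg.pinv(G)

# Per-signal cost is a single matrix-vector product
rng = np.random.default_rng(2)
f = rng.standard_normal(N)
c = G_pinv @ f                        # least-squares solution of Eq. 37
assert np.allclose(G @ c, f)          # the redundant set reconstructs f exactly
```

Among the coefficient vectors that reproduce f, the pseudoinverse selects the minimum-norm one, which is one common way to resolve the multiplicity of expansions of a redundant set.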